Oral History of Museum Computing: Ruth Cuadra
This oral history of museum computing is provided by Ruth Cuadra, and was recorded on the 22nd of March, 2021, by Paul Marty and Kathy Jones. It is shared under a Creative Commons Attribution 4.0 International license (CC-BY), which allows for unrestricted reuse provided that appropriate credit is given to the original source. For the recording of this oral history, please see https://youtu.be/hD5z2kGVzfY.
My name is Ruth Cuadra, and I’ve been involved with computing and databases, and then with museums, since the late 1970s. When I graduated from UCLA in 1976, I had a bachelor’s degree in mathematics, which at that time had a kind of limited career path. It was probably mostly cryptography. There was a lot of work being done by government, and there was academia, neither of which I was particularly interested in, but I was fortunate to have a boyfriend — now my husband of 40-some years — whose father was involved with early database systems. His name is Carlos Cuadra, who is quite well known in the information science field, and he was running a service called Orbit Search Service at System Development Corporation in Santa Monica. I live in Los Angeles. System Development Corporation was an offshoot of RAND, and SDC was mostly doing contract work for the Department of Defense.
Now, Orbit was one of the first online searching systems that was available through public dialup connections. This was long before the Internet, so you had a telephone that went into the acoustic coupler — there’s a word I haven’t thought of in a long time! — and you waited for the tone and it buzzed, and you typed your login. It was a keyboard-based system that was operated through a Telenet connection. The source of databases for Orbit was publishers of indexing and abstracting (I&A) journals, so SDC contracted with those publishers; the American Psychological Association was one of the first ones. The other big one at the beginning was Chemical Abstracts, which was a product of the American Chemical Society. So, what the Orbit staff did was they got magnetic tapes, big, heavy, expensive-to-ship tapes, from the I&A publishers. Those tapes were the same ones that the publishers sent to their typesetting services, to their printers, so they were encoded for printing, and it was the job of the staff in Orbit to pick apart those formats, figure out where the data actually was, and write the specifications for programmers to take those tapes and build files that could be loaded into the Orbit system. And that’s the part that really intrigued me at the beginning.
My husband, my boyfriend at the time, was a programmer. He was also working for his father, and so I got involved in writing these kinds of specifications: looking at the I&A typesetting tapes that came from publishers, figuring out where the data was, and how a programmer might approach a conversion program to convert those typesetting tapes to something that could be loaded into Orbit.
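To give a rough, illustrative sense of the kind of conversion she is describing (the control codes, field markers, and record layout below are invented for the example, not the actual tape formats), a conversion program had to strip the typesetting codes and map print-oriented runs of text to database fields:

```python
import re

# Hypothetical typesetting markers: the real tapes used publisher-specific
# control sequences; these are invented for illustration only.
FIELD_MARKERS = {
    "@T": "title",     # typeset as a bold heading in the printed journal
    "@A": "authors",   # typeset as an italic byline
    "@B": "abstract",  # typeset as body text
}

def convert_record(raw: str) -> dict:
    """Turn one print-encoded record into a fielded record."""
    record = {}
    for marker, field in FIELD_MARKERS.items():
        # Grab text from this marker up to the next marker (or end of record).
        match = re.search(re.escape(marker) + r"(.*?)(?=@[A-Z]|$)", raw, re.S)
        if match:
            # Strip invented inline font-shift codes like ^b / ^i / ^r.
            record[field] = re.sub(r"\^[bir]", "", match.group(1)).strip()
    return record

raw = "@T^bVisual memory in rats^r @ASmith, J.; Lee, K. @BA study of..."
print(convert_record(raw))
# {'title': 'Visual memory in rats', 'authors': 'Smith, J.; Lee, K.',
#  'abstract': 'A study of...'}
```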
I had taken some programming courses during my time at UCLA. I wasn’t a programmer; I didn’t have quite the right mindset for it. But I think I had good enough communication skills and an understanding of what programs would need so that my specifications were effective, and where there were problems, I enjoyed the back-and-forth with the programmers, figuring out the most effective and efficient way to build these conversion programs. One of the things I remember most from those early days of conversion programs was that I went into this thinking computers are super-fast, right? Computers did things for us that we couldn’t do manually, so if we were to give a computer, say, 100,000 records of chemical abstracts to load, it would be a snap: you would turn around in a minute, and you would have those in the computer. That wasn’t really the case. It sometimes took days to get that many records through this conversion process and the subsequent processes to load the databases into Orbit. And that was a real eye-opener for me. The work that I was doing to supervise some of these projects through, from the analysis, through the conversion programming, and then the loading process, was sometimes months to get it all figured out and then, depending on the size of the database, weeks and weeks to finish the load, to actually let the computer run through the reformatting and processing that we had developed.
So I keep that in mind when people, even today, say, well, why is it taking so long? I say: because you’ve got a million records, and we’re doing 10,000 things to each piece of information that comes through. And yes, computers are fast, but not that fast. That’s a hard thing to convey to people when you’re talking about how to develop a project and what the timeline should be. Yes, it’s fast, but a lot of stuff is going on in the background.
So, the goal… let me go back. I worked at SDC until 1978, when Dr. Cuadra, my boyfriend’s father, left to start his own consulting company called Cuadra Associates, Inc., which we abbreviated as CA. The goal of CA was to build software like Orbit, which ran on an IBM mainframe at SDC, for the microcomputers that were just coming onto the market in the late ‘70s. Our product was called STAR, which would allow libraries and information centers to build their own databases and run their own retrieval system, and control everything about their database, without depending on any sort of centralized or corporate computing facility and services. And that was a big deal at the time.
We’ve since swung the pendulum back and forth a few times about who controls computer systems in libraries and corporations and government agencies and so forth, whether it’s some sort of central computing, or whether departments can have control over their own information development projects. At the time, having a system like Orbit on a small computer that a library or information center could afford, could keep in the corner of someone’s office, and could use to develop their own databases was an up-and-coming thing, and STAR was one of the first to do it on a large, sustainable scale. Having STAR freed libraries from their reliance on [centralized] computers, and it let them control how their databases looked, how the data was entered or edited, how the searching would work, how reports or research results could be displayed on the terminal or printed out (this was happening even a little bit before email), and how results could be distributed to scientists or researchers or academicians in whatever organization had the system.
So, while STAR was being developed — this is kind of a side note — Cuadra Associates needed some other income, so we did all kinds of consulting projects, mostly around this idea of online databases and searching: who could make money from such a thing, what kinds of organizations would be interested, and so forth. We also developed the Directory of Online Databases, which was a subscription service published quarterly, because the industry was growing so fast and the number of databases of all different kinds coming online was increasing. I served as the Chief Editor of that product. It was eventually sold off to Gale Research some years later. We published for about 12 years, so it was a lot. And that was additional background for me about the kinds of databases that are out there. We were used to dealing with bibliographic databases when we worked at SDC, and it became clear that that was just a small part of what was happening in the online database world. Lots of numeric data from government agencies (labor data, census data, economic data of various kinds) was being released through systems with specialized interfaces a lot of the time, and we were trying to bring that information to our subscribers, who were primarily libraries and information centers, to say: hey, you know the things that Orbit and other online services were doing commercially at that time, but there’s actually a much wider world out there. So, I think we made a significant contribution in widening the view that people had about what an online database was, and what you might get from using them, or even knowing that they exist.
So, let’s see. When STAR came onto the market, it was about 1983-84, so it took a few years to get a system that we thought we could really bring to market. The Getty in Los Angeles, where I work now, was one of our first clients. They had two STAR systems, one for the Provenance Index and one for their Photo Study Collection, which is now called the Photo Archive. So, in addition to my duties with the directory for Cuadra Associates, I did customer support for STAR, and in that process I learned a lot about what the Getty was doing and how their database applications were being developed. And I can talk more about that, if you think that’s of interest. Okay.
The Provenance Index — which covers the history of ownership of art from the 15th Century to, at that time, about the end of the 19th Century, primarily Western art — was originally designed around printing books of the index. So, there’s a book that covers Dutch sales from 1800 to 1805, and it’s a book that’s two and a half inches thick, with little teeny type listing all of the auction sales of art that happened in the Netherlands between that range of years. And then there was another book for 1806 to 1810, and so on. And there were French sales, and there were British sales, and all of this information was being keyboarded by hand into STAR by a team of editors who would go page by page through auction sale catalogs and transcribe information. So those databases were developed with publication in mind, and a lot of the fields that you find in there are, “This is the way we want it to appear in the index. This is the way we want it to appear as the main heading on page 87. This is the way we’re going to standardize the name of artists…”, which is a very important thing. “Is this Rembrandt the same as the one with the J in the front?”, that kind of thing. Catalogs contain all kinds of variations; they look different, they have different kinds of information; so it was a huge standardization job to develop the Provenance Index, even just as far as making it publishable in print.
Once it was in print, all of this information was already in electronic databases, and the web was starting to appear, so the idea was to make the data available online. STAR, during that period, had developed a web interface, because of course we too could see that that was the future of online searching. It was going to go beyond just what you could search within your own organization, which is what the original STAR implementation was: a library would have a STAR system, and people in that library or in that organization could connect to it. Now we were going to branch out: we were going to offer that database via the web to the world community.
So Cuadra Associates developed a companion set of software called STAR Web, and the Getty had STAR Web, and so there was a project to develop an interface that would allow people to search the Provenance Index over the internet, where the databases were stored in STAR. That first went online about 1995-96, somewhere in there. And it worked really well, and people had to be taught, of course, what it meant to search online and what it was that they were getting, but there was a lot of work in the background to figure out which of the fields that had been entered for the purposes of publication made sense for the online interpretation of the data. That was a point at which I was helping from the client services side, helping the Getty figure out how to do this: what the interface should look like, how the searching was going to work, and the back and forth between “Now I’m going to search” and “Now I’m looking at my results. How do I go back?” All of that stuff we take for granted now had not yet been thought through when we were first building those interfaces. So, I kind of knew where they were going, where they had come from, what their data looked like, and what their online environment was going to be. And that went on for a while.
Similarly with the Photo Archive, then called the Photo Study Collection: they had metadata records for about 300,000 photographs in the Getty’s archive, which altogether is about — now I’m trying to get this number right — 3 million photographs. 300,000 of them had been catalogued in STAR, and the goal originally was to catalog them all, but of course the labor for that quickly became too expensive, and we’ve since gone on to other methods of accessing that collection. But in the meantime, we had this database of around 300,000 records, and it too needed an interface, but it was being handled by a different department. The Provenance Department and the Photo Study Department didn’t really talk to each other that much, so the Photo Study databases were designed very differently. They had different people working on different parts of the photo collection: people working on photographs of medieval art, people working on photographs of European paintings and drawings, and each one went slightly differently. That collection was never printed as the Provenance Index was, so when they wanted the Photo Archive to be online, they had to figure out a way to essentially create a union catalog of what were 17 different source databases that we had developed for different components of the photo collection. I think you can imagine what the back end of that might have looked like, figuring out how to put these pieces together so that what appeared on the web looked like a single database. And what you see now online is the result of that work over the years. It’s currently kind of in stasis, because for both the Provenance Index and the Photo Archive, the Getty is currently working on exporting those databases from STAR and transforming them to a linked open data environment. STAR is very powerful: it has great features for data entry and editing and cleaning and refining and formatting and outputting, but it doesn’t handle the graph databases that linked data requires. So, we’re at a turning point in the use of STAR at the Getty as far as those two big data collections go, and eventually, over the next couple of years, those will appear online as linked data in a different environment. So my work continues.
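As an illustration of the union-catalog idea she describes (the field names below are invented stand-ins for the real source layouts), each source database’s fields would be crosswalked onto one shared schema before the records were merged for the web:

```python
# Minimal sketch of a union-catalog merge over differently designed
# source databases. Field names are invented for illustration; the real
# databases had many more fields and 17 sources, not two.

CROSSWALKS = {
    "medieval": {"obj_title": "title", "artist_nm": "artist", "img_no": "photo_id"},
    "euro_ptg": {"work": "title", "attributed_to": "artist", "neg_number": "photo_id"},
}

def to_union(source: str, record: dict) -> dict:
    """Map one source record onto the shared schema."""
    crosswalk = CROSSWALKS[source]
    unified = {crosswalk[k]: v for k, v in record.items() if k in crosswalk}
    unified["source_db"] = source  # keep track of where the record came from
    return unified

catalog = [
    to_union("medieval", {"obj_title": "Reliquary casket", "artist_nm": "Unknown", "img_no": "M-0412"}),
    to_union("euro_ptg", {"work": "Landscape with a cow", "attributed_to": "Cuyp", "neg_number": "E-7781"}),
]
for rec in catalog:
    print(rec)  # what appears on the web looks like a single database
```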
Go ahead. Paul, you have a question?
[Marty]: A quick question before I forget. When that early digitization work was going on, who were the intended audiences of those digital databases?
That’s a good question. The intended audiences were primarily the users of the Getty Research Institute: academics, researchers, Ph.D. students, professors building coursework, art historians, who would come and use the library and want the attendant information that was in either the Provenance Index, for the art historians, or in the Photo Archive, for people building exhibitions, for curators interested in researching a particular collection that overlapped the content of those databases. So, they weren’t really general public databases. We’ve tried to make them more so as time has gone on. But really, the main audiences for those databases are academic and research oriented.
Where do we go from here?
[Marty]: I’m sorry, I guess I derailed the conversation, but I really was curious to know, right, that perspective, right?
[Jones]: Go back to where you were. Where it’s a turning point for the Getty systems now, and what that might mean for future access, and so on of those records.
Yeah, so the goal for linked data is to encourage access in terms of making this data linkable to other data collections that are out in the world: museum collection information, both from the Getty and museums worldwide, and other kinds of cultural, historical information that may be out there. So the linked data world is big and small at the same time, right at the moment. It’s a huge undertaking, much bigger than we imagined when we first started our Provenance Index remodel project, which is now in its fifth year of what was originally imagined to be a three-year project. It probably has at least two more years to get to its original goals. We hope to have a beta release of our linked data system at the end of June of this year, but it will be very limited in what it covers, compared to what’s actually in the full Provenance Index.
So, while that’s all going on, we’ve had this dilemma of what to do with the existing databases. They’re still in STAR, they’re still available on the web, and we’re continuing to clean data with an eye toward making it more consistent for the linked data transformation, and we continue to publish that data on the web. So, as we work on the linked data project, we are continuing to improve the quality of what’s available on the web as it is now.
It’s kind of a two-pronged thing. The goals of the linked data project are broader and, honestly, more difficult to understand at this point, because not that many organizations have actually done it. The Provenance Index project is kind of pioneering. We are developing a data model in conjunction with work done by a consortium of 14 libraries on something called linked.art, which is a data model that’s been developed for museums, so it applies to several different collections to start with.
The goal being that everybody who uses linked.art will have systems that can talk to each other. If I say, “I want to know what this museum has that was done by Rembrandt, and what did they say about Rembrandt, and how does it compare to what we say about Rembrandt?”, we can pull data together from multiple systems and massage it locally and do all kinds of visualizations and things that we can’t really do with a just text-based system, which is what STAR is. So, the ability to link between resources that may occur in collections in different locations is really the key to the future of research using cultural heritage materials like this.
[Marty]: So, I’d be interested to hear your thoughts on… at that point in time, in the 1980s, when you were building the original database systems, how did you position that information, thinking about how it was going to be used in the future? Were there a lot of people discussing that at the time?
No, I think the original goals were focused on how to get data into the databases. There was no OCR, or at least not very much, so the input of data was really manual, and one of the main early focuses of STAR was to make that data entry as efficient as possible. It was a text-based system. You had a screen, a dumb terminal at that time, and you would call up a screen to enter a record in the database, and there would be one line for each field in the database, or maybe two lines, or maybe 10 lines, depending on how big the field was expected to be, and you would just begin typing. And there were shortcuts: a “slash slash” would give you the current date. If records were being entered on June 30, 1982, and you didn’t want to type that over and over again on that date, you could type “slash slash” and get the date. But it also could check your date: it knew that you couldn’t have the 35th of February, and you couldn’t have a year that began with zero, so there was a lot of built-in quality checking as you went along. Part of the field definition could also offer keyboarding shortcuts. If a field had a small number of values, like a status field with “entered,” “review,” “okay for publication,” a single letter could be defined to produce any of those values, so that when you got to that field, you just typed your “O” for “this is okay” and the value would come up; you’d only have to put in a single character. So the focus was very much on fast input and quality input, and a lot of work went into building rules into the system that would facilitate that work, because all of it was being done by hand.

And then, of course, there had to be ways of proofing. Someone who was proofing might look at only a subset of fields: this person is in charge of looking at the way the artist name was entered, and the authority that was chosen, and the nationality of that person, and the birth and death dates. So, you would have a view of your main database that let the reviewer focus on just the 10 fields, out of maybe 100, that this reviewer was in charge of. That person could focus without being distracted, or having to go down many screens to find the 10 fields that they needed to review. So we were able to branch out workflows from the main database, depending on the work that a particular individual needed to do.
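The data entry rules she describes are easy to picture in code. This sketch mirrors her examples (the “slash slash” date shortcut, single-letter status values, date checking); the function and field names are invented:

```python
from datetime import date

# Single-letter shortcuts for a status field, as in her example.
STATUS_SHORTCUTS = {"e": "entered", "r": "review", "o": "okay for publication"}

def expand(field: str, keystroke: str) -> str:
    """Expand data-entry shortcuts: '//' gives the current date;
    for a status field, a defined single letter gives the full value."""
    if keystroke == "//":
        return date.today().isoformat()
    if field == "status":
        return STATUS_SHORTCUTS.get(keystroke.lower(), keystroke)
    return keystroke

def valid_date(value: str) -> bool:
    """Built-in quality checks: no 35th of February,
    no year beginning with zero."""
    try:
        y, m, d = (int(part) for part in value.split("-"))
        date(y, m, d)                       # rejects Feb 35, month 13, etc.
        return not value.startswith("0")    # rejects years like 0982
    except ValueError:
        return False

print(expand("date_entered", "//"))   # today's date, without retyping it
print(expand("status", "o"))          # 'okay for publication'
print(valid_date("1982-02-35"))       # False
```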
And as the internet grew, one of the first changes that came to that data entry process was the development of a client. STAR then became the server, and we had a client interface, where we could send software to people in remote locations, and they would see, in a window, a version of the screen we had been using in the text-based interface. It was the first use of windows: they would have a menu, and they would call up, “I’m going to enter records into the French auction sales database,” and up would come a screen that had text boxes. Oh, this was something very new! You put your cursor in that text box and typed your value (the label of the field was in French), and then you tabbed over to the next box, and there was a drop-down. Oh, my goodness: no more typing of those individual letter codes representing the values, or even needing to know what the values were in advance; you opened the drop-down, you saw what the choices were, and you could pick one without typing anything! And you’d go up to the next one, and here’s a controlled vocabulary. This is a subject terms field, and there are maybe several hundred choices of subject; you type in something, and you get a list to pick from. So, you can go up and down the list and see which ones are alphabetically close to what you thought you wanted to use, or, in a hierarchy, what’s related, and pick your term without having to type it, and have it pop into the field. So there again, you’re emphasizing control, and you’re emphasizing consistency, because the keyboarder, the operator, the editor doesn’t have to actually be typing the values. They are selecting from things that help keep the data consistent.
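The vocabulary lookup she describes amounts to an alphabetic prefix search against a controlled list. Something like this sketch, with an invented subject list:

```python
import bisect

# An invented, alphabetized controlled vocabulary of subject terms.
SUBJECTS = sorted(["altarpieces", "drawings", "engravings", "landscapes",
                   "miniatures", "portraits", "still lifes", "tapestries"])

def nearby_terms(typed: str, window: int = 2) -> list[str]:
    """Show the editor terms alphabetically close to what they typed,
    so they pick from the vocabulary instead of keyboarding a value."""
    i = bisect.bisect_left(SUBJECTS, typed.lower())
    return SUBJECTS[max(0, i - window): i + window]

print(nearby_terms("land"))
# ['drawings', 'engravings', 'landscapes', 'miniatures']
```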
And the Getty used that for a long time. We had people in Europe, for example, who were entering Belgian auction sales information, French auction sales information, British, and so on, and that went on for quite a long time. We actually still have one user in Belgium who uses the STAR client to enter Belgian sales catalog information, so it was very useful. It was the first time that the Getty was able to expand the staff, the workforce for creating the databases, because before that, they were limited to people who were physically at the Getty. There was no way to expand the data entry part. While there was STAR Web for creating an interface and making the data searchable on the web, there was not yet any way to do the data entry part until the STAR client came along. That was a huge development, and it really helped grow the databases a great deal beyond what the in-person staff could do.
[Marty]: Kathy, it kind of reminds me of some of the stuff that I think we heard David Bridge say at the Smithsonian, about realizing that you’re building a network community that really is worldwide to solve these problems.
[Jones]: Exactly, yeah.
And, of course, I wasn’t involved in it, but there were individual contracts with each of these groups, or individuals, or consortiums that were doing the data entry, and a whole administration grew up around that. Now that we are not doing it that way anymore, that has all faded away. The primary way that data entry is done now, and has been done for the past six or seven or more years, is very different.
We undertook a project to input the auction sales catalogs from Germany, Austria, and Switzerland from 1930-1945, the war years. There’s a tremendous interest, as you know, in restitution of Nazi-looted art, and in that context, we had a joint contract with the University of Heidelberg — Volkswagen was one of the supporters of the early part of that work — and what we developed was a way of scanning the catalogs, which is what happened at Heidelberg. They did the digitization, taking the catalogs and creating PDFs with OCR attached. We got those OCR’ed catalogs and developed software that studied the formatting of those catalogs, picked out the data that we knew we wanted for the Provenance Index, and put it in a spreadsheet.
So instead of keyboarding from the original catalogs, the editors got spreadsheets and reviewed them against the catalogs. In a spreadsheet, you know, maybe there are 200 or 300 items in an auction; it’s very easy to go down the sheet and see whether the data had been picked out correctly by the transformation program and make corrections or additions. We did a lot of enhancements in the spreadsheet format. Then we took those spreadsheets and converted them very easily to STAR and loaded them in.
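The format-studying software would have been tuned to each catalog’s actual layout. As a toy version (the lot-line pattern, sample text, and column names are invented), the idea is to pick lot entries out of the OCR text and emit spreadsheet rows for editors to review:

```python
import csv
import re
import sys

# Invented pattern for an auction-catalog lot line, e.g.:
#   "127. REMBRANDT, Bildnis eines Mannes 4500"
LOT_LINE = re.compile(r"^(\d+)\.\s+([A-ZÄÖÜ][^,]+),\s+(.*?)\s+(\d+)\s*$")

def catalog_to_rows(ocr_text: str):
    """Yield (lot, artist, title, price) rows for editors to review."""
    for line in ocr_text.splitlines():
        m = LOT_LINE.match(line.strip())
        if m:
            yield m.groups()

sample = """127. REMBRANDT, Bildnis eines Mannes 4500
128. CUYP, Landschaft mit Kuh 1200"""

# Write the extracted rows as a spreadsheet (CSV) for editorial review.
writer = csv.writer(sys.stdout)
writer.writerow(["lot", "artist", "title", "price"])
for row in catalog_to_rows(sample):
    writer.writerow(row)
```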
We would never have been able to accomplish that project had we stayed with a single editor entering one catalog at a time. We entered about a million records into the STAR database based on this method of digitizing the catalogs, transforming them by studying the formats, creating a spreadsheet, having the editors review the spreadsheet, and then, as the very last step, entering that data into STAR. It was hugely successful, and we’ve used it multiple times since then. We’re using it right now for a project that’s going to create data on American sales catalogs, which we’ve never done before. So, branching out from that centralized Western European focus, we’re going to have a batch of catalogs having to do with American sales, primarily 19th and 20th Century, funded through a different project, enter the Provenance Index. We also did a second German sales project, drawing on the first one, where we did 1900-1930, so now we’ve got the whole first half of the 20th Century of German sales catalogs online. And that’s been a huge boon to provenance researchers dealing with Nazi-looted art. We get a lot of referrals, and a lot of questions from people looking for particular kinds of information or guidance about how to search for what they’re looking for. And I think linked data will go a long way toward improving their access to the information, which is, of course, language-bound (it’s primarily all in German) and sometimes difficult for us to answer questions about without a German expert to interpret the language. So the internationality of the data becomes more of an issue for projects like this.
[Marty]: I’m waiting to see if Kathy had a question. I could jump in with another one, right, because there’s a lot… it’s really interesting to think about how we’re breaking down the barriers between different institutions through linked open data. Thinking about the time that you’ve been involved with the Getty, and now working there for the past 15 years, are you seeing these projects helping to break down silos within the institution, like within the museum itself? Are people coming together to work on these shared initiatives?
We sure have worked hard on that. We’re very aware of the fact that we tend to be siloed. There’s still a lot of siloing, I think, at the Getty, between the museum and the databases that are at the Research Institute, which is partly an artifact of the way the Getty is organized. But the GRI has a new director, and we’ve just had a reorg, announced maybe a month or two ago, that’s intended to do more to bring us together: to have departments within the Getty talk to each other more, and be more prepared to interact, both with the museum and with the larger community outside the Getty. I think it’s happening, slowly, but that kind of change is really hard to undertake. The more data that we have online, the more data that we have in linked data form, the better that will go. But the linked data stuff, in my experience, even with what I knew about how databases are organized and what it takes to make them searchable, is a huge undertaking. It’s amazing: every time we think we’ve sorted it out, another layer appears. And we’ve had some setbacks having to do with changing staff that have delayed the project, which is nothing new to anybody’s work, I don’t think. But in the linked data project, it’s even more important to keep everybody on track, and even figuring out what the track is has been hard. How do we do this? What do we need? It’s been really, really difficult.
[Marty]: Well, certainly, I think one of the themes that we’ve seen in a lot of these interviews is people not even understanding what’s happening behind the scenes. It’s like, “Why are you collecting all this? Why are you building these connections? Who’s going to use this?” Right? Do you think that’s improved over the years, the understanding of the purpose of these projects?
My experience is limited to the Provenance project, for the most part, and I had originally done some work to imagine what the data model would be, way before linked.art came along. And I had a tool that let me put things together and draw nodes, and I put all of our database fields in one node or another, and I had a nice picture, and it all made sense, and I could explain it, and it wasn’t anything close to what we have now.
The semantics is the thing that has really surprised me. The degree to which that has to be fleshed out is mind-blowing; that’s the only word I can think of to describe it. For example, we have in the Provenance Index the Knoedler stock books, which record the history of the Knoedler Gallery that operated in New York and other locations from the 19th Century all the way up to 1970, and we have their payment history. So, as a work came into the gallery, perhaps bought by Knoedler from some collector somewhere, Knoedler paid a price to that owner, and that’s recorded in the stock book; and then Knoedler sold that work to somebody out the other end, and that price is recorded. Well, what actually happened to the artwork during that process? Did Knoedler have custody of it when it was bought from the collector? Did it transfer to the gallery, or did it wait until the gallery sold it and go straight to the eventual buyer? All of that is very important for provenance, but what does it mean? How do we interpret the records to say: What happened to the payment? Who paid what, for what, for whom? What actually happened to the artwork? Where did it go? Who had custody? Was the payment a transfer of ownership? Did it include a transfer of custody or not? And so all of these little details come out of, “Oh, we know what Knoedler paid this person for that artwork.”
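To make that semantic unpacking concrete: in a linked.art-style model, a single stock-book row becomes separate title-transfer, payment, and custody events. This is a loose, illustrative sketch following linked.art’s published provenance pattern; the identifiers and values are invented, not real Getty data:

```python
import json

# One stock-book row ("Knoedler paid this collector X for that work"),
# unpacked into linked.art-style events. Identifiers and amounts are
# invented; property names follow linked.art's provenance pattern.
provenance_entry = {
    "type": "Activity",
    "_label": "Knoedler purchase of 'Landscape with a Cow' (stock no. 1234)",
    "part": [
        {   # Who came to own the work?
            "type": "Acquisition",
            "transferred_title_of": {"id": "object/1234", "type": "HumanMadeObject"},
            "transferred_title_from": {"id": "person/collector-x", "type": "Person"},
            "transferred_title_to": {"id": "group/knoedler", "type": "Group"},
        },
        {   # Who paid whom, and how much?
            "type": "Payment",
            "paid_from": {"id": "group/knoedler", "type": "Group"},
            "paid_to": {"id": "person/collector-x", "type": "Person"},
            "paid_amount": {"type": "MonetaryAmount", "value": 4500,
                            "currency": {"_label": "US Dollars"}},
        },
        {   # Did the object itself change hands at the same time?
            "type": "TransferOfCustody",
            "transferred_custody_of": {"id": "object/1234", "type": "HumanMadeObject"},
            "transferred_custody_to": {"id": "group/knoedler", "type": "Group"},
        },
    ],
}

print(json.dumps(provenance_entry, indent=2))
```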
It turns into this mass of webby, linked data connections. And that’s the part that no one sees on the front end. We will eventually have something that lets you see what Knoedler paid this collector for this collection of works, and where they went, and when and how and how much, but what went into figuring out, from the row in the stock book, what those values meant is huge. And it’s taken a lot of discussion. There are nine of us who meet weekly to talk about each little piece of the model: what does it mean, where’s the data coming from, how do we link it to the next thing? And it just goes on and on.
Eventually, we will come to the end of that. We’re not even talking about interface yet. We’re still just talking about how does the data get into the model, and what does it mean? And does the model need to be changed because now we know something about what this data means that we didn’t know before because we talked about it.
[Marty]: Well, it’s a great example of the complexities of managing these data models for linked open data. We talk about this with our LIS students in the library science program, right. It’s amazing the number of students who come into a degree like this and have no idea about the complexity behind the scenes like you’re describing.
Exactly. I mean, I don’t know that students are learning about linked data at this point in time. There are a few live examples, and they look very simple if you just look at the search interfaces. And they’re fun to navigate; some really creative designs have come along that let you see how things are connected, and drag and have the map shift, and all of that. But what it took to get there is, to me, the more interesting part. It’s about the data science, really: what it means to find it. Eventually, what we want to do with the Provenance Index is to give people tools that allow them to visualize what’s happening, so it’s not going to be a textual search, for the most part, in the future. You’ll start with a search, and then you’ll boil your set down to something manageable, depending on what you’re looking for, and then begin: Show me the pie chart of the nationality distribution of artists in this network. Show me the network graph. Show me… There are lots of opportunities to build visualizations into the interface, but I think the bigger thing going forward will be people taking data sets away and manipulating them locally in conjunction with their own research. And so what we have to think about is: do we want the Provenance Index to be credited in someone’s work? How would that happen? How will we know what people are doing with our data, as a way of developing it further? We’d like to be able to get feedback on what was useful or what was problematic about working with our data once it’s been downloaded.
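The take-the-data-away analysis she anticipates might look something like this: a researcher downloads a Provenance Index extract (the file name and column name here are hypothetical) and charts the nationality distribution of artists locally:

```python
import csv
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical downloaded extract with an 'artist_nationality' column.
with open("provenance_extract.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["artist_nationality"] for row in csv.DictReader(f))

# The pie chart she imagines: nationality distribution in a result set.
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
plt.title("Artists by nationality in this result set")
plt.show()
```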
So, thinking through those kinds of communications going forward is a little bit beyond where we are now, but we know that those are going to be important issues. And the feedback loop with somebody who’s really into a particular artist and knows the history: those people find errors in our data that come from transcriptions, or mistakes in the original auction catalogs, and so forth. They will send us messages and say, “Oh, this record isn’t right. We really know this happened in 1875, not 1885.” And do we accept that? Do we change the database on this person’s word? We need an audit history, some method of notification that data has been changed, and a record of the provenance of the data itself; that is going to be really important going forward. Here’s where we start: this is what the Getty says, you know, “We stand by this as much as we can.” But once it’s out there in a wider way, and in a different way, I think we’ll begin to see more of this: we want the users to be involved with the data, we want people to tell us when they know something that would improve the data, but we have to be able to say “here’s where the change came from,” so that the next person who comes along can vet it or not, depending on their particular project.
And those are all things that were way beyond what we imagined for STAR in the beginning. And STAR will live on at the Getty, because there are about 500 databases on that system. Some are the Provenance Index, some are the Photo Archive, but there are other databases in STAR that are on the web, and other projects that were once imagined for STAR, or big data sets that nothing ever happened with. So I’m working in the background there too, to figure out what we have of value. Is there something that should be archived? Is there something we can make a product out of that had been abandoned because of staffing issues earlier on? So STAR will continue to survive at the Getty, even after the Provenance Index and the Photo Archive are in linked data.
[Marty]: That’s a great example of something that I have talked about a lot with our students… that a lot of these projects have left these little isolated islands of digital collections behind them over the years, right? So how do you go back and identify those, and clean those up, and bring them back to life?
One of the nice things about STAR is that those collections tend to be grouped together, so I can say, “Oh, here’s… somebody put together this group of databases,” and I can see the numbers of records, I can see the definitions, I can see the names of the people who created those databases, and try to figure out, from my knowledge of the GRI, whether anything relevant or related still exists. So, for example, we had a database called Art on Screen years ago, in what must have been the ‘80s. There was a contract between the Getty and a thing called the Program for Art on Film, and they had a database of films that discussed art or artists. It was a funded project, and they did data entry in the old-fashioned way, a record at a time. It was very clean and tidy, and then the program shut down. It lost its funding. It was hosted at Columbia for a while. And it ended up that the only copy of the database lived on an old Alpha Micro, which was the original microcomputer that STAR was written for, in the apartment of the former director of the program in New York.
So, when I first came to the Getty, one of the first jobs I had was to get that Alpha Micro to Los Angeles and figure out how to get that database off that old system. And it turned out the only way to do it was screen scraping. So I set up a workflow where, a thousand records at a time, I would start a screen scrape: the records would display on the monitor and be passed to a file. And because it was fielded, once I had the screen scraping done, it was fairly straightforward to get it into a format where I could create a new database on the current STAR system at the Getty with the same definition of the fields, because I could see the definition on the Alpha Micro, and I just typed it in again on the Getty system when we loaded the data. We ran that database for years, but there was no publicity about it, the usage kind of shrank, and it became our first candidate for figuring out how to do a proper archive of a database. So, I worked with Institutional Archives. We exported the data in various formats, and we have all the documentation, and it’s now in the GRI’s institutional repository. And we pulled it off the online system, so the only place that database currently exists is in the Institutional Archive. You can search it in the library catalog, and if you were interested, you could get the database in a spreadsheet. You could get the documentation. So, it’s actually still there, it’s just not online. We don’t have to maintain it on the system.
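Mechanically, a screen scrape like that comes down to capturing the terminal display to a file and re-parsing the fielded output. The “TAG: value” layout below is an invented stand-in for whatever the Alpha Micro screens actually showed:

```python
# Re-parse captured screen output into records. The 'TAG: value' layout
# and tag names are invented stand-ins; the real display format came
# from the Alpha Micro's record screens.
def parse_scrape(path: str) -> list[dict]:
    records, current = [], {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip()
            if not line:                  # blank line separates records
                if current:
                    records.append(current)
                    current = {}
            elif ": " in line:            # start of a new field
                tag, value = line.split(": ", 1)
                current[tag.strip()] = value.strip()
            elif current:                 # wrapped continuation line
                last = next(reversed(current))
                current[last] += " " + line.strip()
    if current:
        records.append(current)
    return records

# e.g. parse_scrape("scrape_batch_001.txt")
# -> [{'TI': 'Art on film...', 'DIR': '...'}, ...] ready to reload into STAR
```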
That was a very useful model. It took a while to figure out how that was all going to work, and how it was going to be loaded into the Preservation System: working with the Preservation department, with Institutional Archives, and with Getty leadership to sign off on the archiving. It was an interesting side project. And I plan to use that model again with some others that have been selected for archiving.
[Jones]: I have a question for you, Ruth. Did you work at all with the Getty Information Institute? In any of their database work? or the Art History Information Program?
Those were both earlier incarnations of the GRI, before I was at the Getty, so the staff that I worked with were in those programs under their earlier names, but I wasn’t really involved in how those programs operated differently from the GRI. I know that they existed. At the beginning, I think our original contacts with the Getty were under AHIP, the Art History Information Program. And then, when it switched to the GII, the Information Institute, that was kind of invisible to the work we were doing at Cuadra Associates with the Getty. By the time I got to the Getty, it had been the GRI for a while.
[Jones]: Great, thank you.
Mmm hmm.
[Marty]: I love the example that you just shared, Ruth, about getting the lost data off of the old Alpha Micro, right. I wonder how many collections like that have just been lost, and all that work, yeah.
We also know that there are databases around the GRI that are connected to projects that went one way or the other, that are sitting in FileMaker Pro, that are sitting in text files. We used to get people who wanted to give us data in Word documents, and after one or two of those we quickly said, “Oh, no, we’re not doing this.” That’s way too much. So, I mean, I’ve talked to scholars who are setting up database projects, and they have their preliminary notes or their bibliography or whatever in Word, and the first thing I say is, “You’ve got to find a different way to do this, because it won’t be usable for database purposes without being tagged or fielded or something.”
And even now, with people coming from grad school and early in their careers, it’s much less of a problem than it was maybe even five years ago, because people are coming out of school with much more know-how about what’s needed to create databases out of their research. But it’s still an issue that we deal with when we talk to new researchers about their projects.
[Marty]: It’s certainly an issue that I know we deal with here at Florida State University, trying to get researchers to share data, to organize data. It’s something that our Office of Research here is very much interested in… there’s very little incentive, I think, for a lot of separate faculty researchers to put their data in some sort of shared… well, shoot, we can’t even get the faculty to put their research papers into a shared institutional repository, right, let alone the research data.
Same problem, and I think it’s going to require a shift in how faculty are evaluated. The sciences have been doing it for a long time, sharing their research to move things forward, but art historians typically hold their research very close to the chest, and say, “This is mine until I publish it, and no one’s going to know about it until that time.” I’m sure you deal with this all the time. We see it too, with the residential scholars who come to the Getty, and so in our new reorganization, one of the goals of putting the database services people in the same research division as the scholars is to try to break down that silo and say, “Let us help you.” One of the things the Getty has not done a lot of in the past is capture the research that scholars do on the Getty’s dime, so I’m trying to figure out a way to not only help the scholars create databases that are useful for their projects, but then capture that data in some sort of an institutional repository so that it can be referred to, so that it can be linked to other things that are happening. And that’s a big challenge unto itself, separate from the linked data database project.
[Marty]: That’s a really good idea.
I’m sure everybody’s doing essentially the same thing.
[Marty]: But it’s hard to do, right? People come in to use your data, use your resources, and then leave, and they tend not to give back, for lack of a better word, the information.
Which is okay; I mean, that’s always been the purpose of a library, to support research. But in these funded relationships between the Getty and its researchers, it seems like there should be a more cooperative agreement. The problem of not wanting to reveal research until it’s published still stands, though.
[Marty]: Right, exactly. And I wouldn’t limit that to art history either, right. I mean, in the hard sciences, people hold things very close to the chest too. You know, I’ve seen a lot of these data sets that the physicists, for example, will release, but it’s all meaningless data unless you’re part of the team and you know what it means; nobody else could use it to get a scoop on anybody.
Exactly. We also put our data on GitHub, and it’s old out there now, but the intention was that as we were cleaning the data and making decisions about how it was going to be transformed, we would release the data to people who wanted to do visualizations or numerical analysis or whatever with it. That idea came from some preliminary conversations we had with people while we were developing the proposal for the Provenance Index project: people wanted the data then and now. They said, “Oh, this thing you’re doing is fine. Just give us the data and we’ll clean it up, and we’ll work on it.” And so we did some of that, but, again, there was no sense of feedback. We have some statistics from GitHub about how many people have picked up the data sets, but we never really pushed it. And now there’s some talk again about updating those data sets in advance of the release of the beta of our interface, because it’s taken so long. I don’t know where that’s going to go exactly, but it would require efforts to publicize that it was there, and to update documentation, and I’m not sure there’s the strength to do that while we’re deep in the semantic conversation right now.
But I think we’ll see more of that, too. That’s another way to break down the barriers: just say, “Here’s the data. Here’s the documentation. Go.” It’s a matter of publicizing what’s available in that situation.
[Marty]: That’s another important philosophical shift that we’ve seen over the past few decades, right: the willingness of museums to release data for general widespread use without really knowing who’s going to use it for what. I know a lot of Registrars from, say, the 1990s who would have been very reluctant to do that.
I mean, even the open access image programs that have been going on — we’ve done a lot, the Frick just announced a huge upload of free images, the British Museum has done it — all of that stuff has a lot going on in the background, having to do with decisions about the size of what someone can download, and with copyright. Copyright is a big part of that, and people don’t realize it. “Oh, they have all these images! They’re letting us use them.” All of those images were vetted as being available under their particular copyright restrictions. Nobody has any idea that all of that’s going on in the background.
[Marty]: A tremendous amount of work in the background, for sure, but it also does make me feel better to see that over the past few decades we’ve had that philosophical shift toward accepting this… providing access to information in this way. You mentioned open access to images. Thinking about some of the conversations I had with Ken Hamma, speaking of people at the Getty, in the early 2000s, and the fights that Ken Hamma was always having to get people to… right?
Exactly.
[Marty]: We’ve come a long way.
So, there’s a connection, to take this thread we’re talking about, between images and the Provenance Index. One of the goals of the Provenance Index from way back at the beginning was eventually to connect those provenance records — this is the sale of a painting by Rembrandt — to an image of that painting. It’s the obvious thing. People come to the Provenance Index, and they think they’re going to see pictures of the art that’s being described. That turns out to be nearly impossible, except for a small handful of Old Masters, because we have a lot of records that say “landscape with a cow.” And we know the artist, but which landscape? And which cow? And which version? And who owns it? And there are 10 copies of that image of that painting. Which one does the Provenance Index think is the one to be shown? You can’t really do all of that, and that’s where linked data will be very valuable when we get to it. Because out there, there are 10 museums that have an image of the landscape with the cow, and we can link to them. We don’t have to say “We think this one is the authority” or “this is necessarily the right one to go with this provenance record”; we can say “here are some possibilities,” and we link to them by the artist’s name, by the title, by the owning organization. So it’s a way of expanding the reach of what’s in the Provenance Index to a much larger universe and potential for research.