Networked Information: Finding What's Out There

Clifford A. Lynch Interview

By Educom Review Staff

Sequence: Volume 32, Number 6

Clifford A. Lynch is executive director for the Coalition for Networked Information. Previously he was the director of library automation at the University of California, Office of the President. He is internationally known for his development of MELVYL, an information system that serves all University of California campuses.

ER: What part of your own intellectual background led you to an interest in the problems that you're dealing with now?

LYNCH: Well, I started off studying mathematics in college, and at some point decided that every mathematician should have some vague idea of what a computer is, and got bitten by that bug.

ER: What college?

LYNCH: Columbia. So I suddenly got very interested in computers and rapidly found myself gravitating to a set of questions about how to really do information retrieval and organize large textual databases and things like that. While I was in college I picked up a part-time job at New York University, working in the library systems office there, and also hooked up with a professor named Ted Hines, who was in the library school at Columbia teaching text processing and things like that. I finished up a Masters at Columbia, by then having switched from math to computer science, and then went to work full-time at New York University, where I spent about four or five years, working first in the Library Systems Office and then later for the academic computer center there.

I was working for a guy at NYU named Ed Brownrigg, and he took a job as director of library automation for the University of California system and came back with this planning document to New York and said, "You should really read this, because these people are really serious about building a full-scale union catalog for the nine campuses of the University of California and are prepared to put up the resources to really get it done." Now keep in mind: this is 1979 and the notion of online catalogs for end-users was virtually unheard of then. I had always been fascinated with the idea of building computer systems that really made a difference to large numbers of people who were interested in getting access to information, as opposed to simply using them as a computational tool. This was a very new idea at the time.

So I took a look at the University of California plan and decided this was just too good an opportunity to miss. So off I went to California and worked with Ed Brownrigg and the people there, building the MELVYL system. I picked up a Ph.D. in Computer Science at Berkeley in my spare time while I was there.

ER: What was your Ph.D. work about?

LYNCH: Actually it was one of those wonderful things where some of my work and some of the research I was doing complemented each other very well. My doctoral research was in understanding how relational database systems failed to handle information retrieval applications and what could be done to fix that. I was fortunate to have Michael Stonebraker as my thesis advisor, who had a wealth of practical experience implementing database systems and using them in large-scale, real-world settings.

ER: How do they fail?

LYNCH: Well, it turns out that they fail in a lot of different ways. It's not that there's a single killer failure, but really a whole series of them. And I looked at about four or five of them. One set of issues has to do with extensibility and flexibility. Back when I was doing this work, in the mid-1980s, there was a lot of interest in extending relational databases beyond mainstream business applications to handle, for instance, CAD/CAM data, or geospatial data, or textual data.

So there was a lot of looking at how you could build user-extensible systems, doing things like allowing people to add indexing methods into the systems. Well, it turned out that information retrieval systems really were a good case study of this. And, in fact, I believe that some of the ideas I worked out in that dissertation actually have found their way in variant forms into some of the commercial database systems out there now.

Another place where relational databases back then used to mess up is - you know, one of the claims to fame for relational databases is that they are supposed to be doing sophisticated query optimization. But they assume that all values for keys in indexes are sort of equi-probable. So if you look at the kind of thing you run into with a text corpus or a bibliographic database, where some keywords are terribly frequent and most of the rest are very infrequent, it turns out that it's just a disastrous strategy. You can demonstrate, for example, that a query optimizer like that will screw up at least a third of the time.
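The effect of the equi-probable assumption can be illustrated with a toy keyword index. The sketch below is only an illustration: the posting counts are invented, the function names are hypothetical, and it does not reproduce the analysis in the dissertation or the behavior of any particular database product. It compares the row count a uniform assumption would predict with the true count for each keyword, and shows the ordering an optimizer could choose for an AND query if it had real frequency statistics.

```python
from collections import Counter

# Hypothetical posting counts for a tiny keyword index (illustrative numbers only).
POSTINGS = Counter({
    "the": 9000, "library": 600, "catalog": 300,
    "melvyl": 60, "z39.50": 30, "gresham": 10,
})

def uniform_estimate(postings):
    """Rows per key if the optimizer assumes all key values are equally likely."""
    return sum(postings.values()) / len(postings)

def estimate_ratio(postings):
    """Ratio of the uniform estimate to the true count, for every key."""
    guess = uniform_estimate(postings)
    return {key: guess / actual for key, actual in postings.items()}

def cheapest_first(terms, postings):
    """Order the terms of an AND query so the rarest term is looked up first."""
    return sorted(terms, key=lambda term: postings[term])

if __name__ == "__main__":
    for key, ratio in estimate_ratio(POSTINGS).items():
        print(f"{key:10s} actual={POSTINGS[key]:5d} estimate/actual={ratio:8.2f}")
    print(cheapest_first(["the", "library", "gresham"], POSTINGS))
```

With skewed data like this, the uniform guess is far too low for the most frequent keyword and far too high for the rare ones, so a plan chosen from that guess can easily start with the wrong index.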

ER: What were some of the other interesting issues that came up building the MELVYL system?

LYNCH: Because the MELVYL system that we built was really the first university-wide application that required a lot of network access, we ended up building back in about 1981 or 1982, I guess, a TCP/IP network for the University of California, which was kind of a bizarre thing to do back then. I remember having roaring fights with people who were suggesting that this was really over the edge and that X.25 was the standard and we should be using that. But we wound up building this TCP/IP network that hosted MELVYL, so in that sense I think we were probably the first major catalog on the Internet, and certainly one of the first systems really designed for Internet delivery rather than just handling Internet access as an afterthought. To build the network we also got involved with private leased satellite facilities for a while, since we couldn't get the lines we needed from the phone companies. Later, we did a lot of the pioneering work in implementing Z39.50 to link the MELVYL system to other major information providers.

ER: What is the status of MELVYL now?

LYNCH: It has grown quite a bit. It now encompasses not just the Union Catalog but about 30-odd abstracting and indexing databases. It has about 1,000 journals in ASCII full-text and several million pages of bit-mapped images of engineering material through a joint project that UC has been doing with the IEEE and the IEE. It's very heavily used. During busy weeks MELVYL runs about 900,000 queries a week, which I just find mind-boggling. I can remember back when we first crossed the 100,000-query-a-week barrier, and I suspect the system will hit the million-a-week barrier some time in the next year or two.

One of the final things that came to culmination shortly before I left is, we also brought up MELVYL on the Web, so it is now accessible through a fairly sophisticated Web interface, as opposed to just slapping a search form in front of a line-mode system.

I think it is important to understand that the Division of Library Automation at the Office of the President did not all by itself design and build MELVYL. MELVYL has really been a cooperative effort with librarians throughout the whole University of California system who've been integrally involved in the design and the testing, the validation, refinement, the development of training programs for the system. And I think the places where MELVYL has really done the best are those places where it has worked really closely with the library community - for example, in the development of this Web interface. And, of course, at the Division of Library Automation there have been many talented people who made important contributions to the MELVYL system over the years.

Back when the system first deployed, in the very early '80s, I think for many, many people, not just students but also faculty, this was their first encounter with a computer, other than perhaps an automatic teller machine. And it played a very important role in broadening their perspective on what computers were about and what they could do for them. Now, of course, most of the incoming students have already used MELVYL, because at least the catalog part of it is already available worldwide, and they regard these kinds of systems as routine. At the same time, I can't tell you the satisfaction I get from running into people occasionally who say, "You know, that system really made a difference to the way I do research" or "I've gone to graduate school some other place and I've really wished they had a system such as I had as an undergraduate at UC," or similar kinds of statements from graduate students when they go on to faculty positions elsewhere.

Really, the issues we get into now are not so much with the system but with content, where it's very hard sometimes for people to understand why content, particularly things like primary scholarly journals, isn't available. And most of that has come down to issues of economics and publisher readiness. People also have a hard time understanding the coverage and limitations of abstracting and indexing databases, and when to use which database.

ER: Do you see the increasing sophistication of library systems and Web search systems as fundamentally changing - and to use an inflammatory word - diminishing the role of librarian 10 years from now?

LYNCH: There are a couple of different pieces to that question that I think we need to pull apart. Really, the Web indexing systems have grown up almost orthogonal to the traditional role of libraries. And certainly because they were so crude, many librarians were pretty unenthused with them, not recognizing that, crude as they were, they were infinitely better than nothing at all and thus underestimating, on the one hand, how quickly they would be accepted and, on the other, how seductive it is to have an information retrieval system where you always can go directly to primary content.

One of the phenomena we saw as automation moved forward at the University of California was that there's almost a kind of Gresham's Law with electronic information; it drives out print. So for instance, if you look at your typical abstracting and indexing database, it started probably somewhere between 1970 and 1980. And there's a sense in which the journal articles prior to the inception of that electronic abstracting and indexing database may as well not exist, because they are so difficult to find. Now that we are starting to see, in libraries, full-text showing up online, I think we are very shortly going to cross a sort of a critical mass boundary where those publications that are not instantly available in full-text will become kind of second-rate in a sense, not because their quality is low, but just because people will prefer the accessibility of things they can get right away. They will become much less visible to the reader community.

And I think there is a very powerful aspect of that in what we've seen about the acceptance of the Web. I think we are rapidly also going to come up against the limitations of Web indexers. If you look at what's going on on the Web at the moment, there's a sort of static Web that the Web indexers know about - the world of HTML documents sitting there to be indexed. There's also this enormous subterranean web of databases which are observable only as dynamically written HTML screens that are built in response to queries. Web indexers can't see those, so you have phenomena like this: you have MELVYL sitting there with, I don't know, 20 or 30 million records inside it and all that full-text. There are only about 20 or 30 static pages, most of them sort of "Help" and introductory screens that a Web indexer would find for that.

ER: Give an example of what a Web indexer wouldn't find.

LYNCH: It wouldn't find any of the full text. It wouldn't find any of the bibliographic records. It wouldn't find any of the bit-mapped images. All it would find is the Welcome screen and the next few Help screens. But that's all that would be there for a static indexer. So I think we are going to need to see some much more sophisticated indexing approaches on the Web that recognize closed communities, intellectual property rights, and things that can be described without necessarily being made available to the indexers. All that is going to bring, I think, more traditional information organization and description practices back into play alongside the Web indexers.

To kind of pull apart your question in a couple of other directions - I think that you may see the role of libraries changing a bit and perhaps diminishing a little bit vis-a-vis electronic communication for some communities of practice. I think you may see the role of librarians, although they may be hiding under some other job titles, like information specialist and things like that, vastly increase, and I think you are going to see people with information expertise increasingly scattered through all kinds of organizational settings and all kinds of novel roles. There's already a lot of evidence, when you look, for example, at what happens to graduates of places like the School of Information Management and Systems, which used to be the School of Library and Information Studies, at Berkeley. A very large percentage of those people aren't ending up in traditional library work. They're ending up as entrepreneurs; they're ending up in government; they're ending up in major corporations. And not necessarily running libraries, but doing information policies, or research, or product design, or being CEOs of startups. I think you will see a lot more of that kind of thing.

To go off on one other facet of your question, I think we are starting to see some other things emerge which present some really profound challenges for both libraries trying to plan their role and also for designers of information retrieval systems. If you look at the whole way that librarians and indexers have tended to approach organizing literature, it tends to be for exhaustive retrieval. To just give you a pragmatic example of that, if I want to find out about - pick your subject - I could get a system like MELVYL to pump out an exhaustive list of everything the libraries of the University of California hold on that; and it's probably thousands of items.

What tends to happen more and more in the real world, with an abundance of information and a scarcity of attention and time, is that you've got five hours to get smart about this subject and the question is where can you most usefully spend five hours based on what you know about it already and where you're coming from. And the kind of systems - and by this I don't mean just information retrieval systems but also intellectual systems of classification and organization - that have historically been constructed generally don't get at this at all; yet this is what I increasingly hear that many users need, especially when you go outside the sort of traditional academic world of exhaustive scholarly research. Somebody is going to fill the need and build the systems to do that and get into that kind of evaluative aspect. We have some very interesting technologies starting to surface that promise to give us at least some leverage on this, things like community and collaborative filtering, the sort of stuff that people like Firefly are doing.

ER: Explain that a little bit.

LYNCH: You've probably seen some of these. The basic idea is that you have a community of users on these systems that are interested in some class of things - you know, sound recordings, or movies, or books, or something like that; and the idea is that you tell it either explicitly or as a byproduct of other things you are doing, like purchasing things, here are some things that I really like. And what it does is it goes and matches your profile with other people's profiles within the system and if it finds a close match it tells you about the differences. It says, basically, I know about other people who also like a lot of the things you like but they also like this, which, if you haven't heard of it, you might want to check out.
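The profile matching described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Firefly or any other service actually works: profiles are plain sets of liked items, similarity is measured with the Jaccard index (a choice made for this sketch), and the sample data, the `recommend` function, and the `min_similarity` threshold are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes, min_similarity=0.5):
    """Suggest items liked by similar users that `user` has not listed.

    Each candidate item is scored by the summed similarity of the users who like it.
    """
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity >= min_similarity:
            # "They also like this": only the differences are reported.
            for item in theirs - mine:
                scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical profiles of liked sound recordings.
LIKES = {
    "ann": {"Abbey Road", "Pet Sounds", "Revolver"},
    "bob": {"Abbey Road", "Pet Sounds", "Revolver", "Aja"},
    "cat": {"Abbey Road", "Pet Sounds", "Forever Changes"},
    "dan": {"Kind of Blue", "Blue Train"},
}

if __name__ == "__main__":
    print(recommend("ann", LIKES))
```

In this sample, "dan" shares nothing with "ann" and contributes no suggestions, while the two closer profiles each contribute the one item "ann" has not listed.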

ER: How advanced have they gotten with systems like that? The last time I looked at one of those, the idea seemed better than the execution.

LYNCH: Yes, there are a lot of problems with the practical deployment of these systems. One is getting users to put enough information in. Another is sort of the level of precision of the content set it's dealing with. And I think there are still a lot of things we don't understand about it. On the other hand, I believe that some of the online marketers like amazon.com are starting to use these now. Another issue is that to be most effective, evaluative systems need to know a lot about you - and that raises a whole host of social as well as technical issues.

ER: You remember, I am sure, Licklider's Libraries of the Future? What's your library of the future?

LYNCH: I don't think I have a single vision of a library of the future. I think that I see a much greater diversification among libraries as we go into the next century. Clearly there will be some "libraries" that are primarily electronic information systems - systems for retrieval, collaboration and authoring which support communities of scholarship or more broadly of interests of any kind. I think we will see some institutions that are very recognizable as the libraries of today, just with a bit more electronic information complementing their paper collection, very oriented toward doing the kind of services they do today, perhaps with a greater emphasis on reference work and teaching. I think teaching is really going to be a growth area in many libraries, particularly - well, I was going to say particularly in academic areas, but I find myself questioning that more and more.

ER: Why?

LYNCH: Because I think issues of information literacy and understanding information resources really are going to be at least as important in the public library setting or the corporate library setting. It may be that the instruction is a little less formal, a little more one-on-one, maybe that it's less explicitly instructional and more hidden in reference interactions with librarians. But I think teaching is going to be a really big role for a lot of librarians and a lot of libraries as we go into the next decade.

I think you are going to see some libraries that are going to hold little or no paper and be mostly network presences, although maybe they will be able to deliver paper on demand, or electronic versions of paper, if you need it. I think one of the real interesting things that we are going to see is how much specialization is going to make sense. There is a phenomenon now that we see in this World Wide Web growing up where you can support levels of specialization when you amortize them over a national or global population that you just can't support geographically.

ER: For example?

LYNCH: Just think of all of the special interests in historical things, in specific areas of art collecting, animal husbandry, recreational mathematics - you name it - all of the intellectual and recreational interests you can imagine. In a real big city you might find some kind of club for a given interest group. You might find a national magazine that could gather up enough subscribers to hit some of these niches. Libraries, though, historically have tended to be sort of face-to-face operations, so they don't scale well to serve national-size or international-size communities. On the Net that changes, so in theory we could see the emergence of some really specialized libraries. I don't know if that will happen. I think it's going to be fascinating to watch. We may also see some of them intertwined in funny ways with publications and markets. For example, we can take something like people who are interested in Colonial coinage. One can readily see somebody building a reference library that maybe you've got access to for a small fee per month, publishing a magazine, and also operating an auction-type brokerage for dealers - all kind of hung around the same service. That would be, I suspect, very attractive to that community.

ER: Let me ask you to comment, or at least shrug, on the subject of those persons who are fearful of and/or particularly hostile to the whole idea of digital libraries - for example, Clifford Stoll?

LYNCH: Clifford Stoll is very irritating. I actually had a little e-mail exchange with him prior to the publication of that second book of his, Silicon Snake Oil. And I was a bit stunned to find myself portrayed as this sort of mad digitizer in that book, because I don't think that's a very accurate description of me. Certainly, personally, I'm an appalling accumulator of printed books. I guess I find Clifford Stoll particularly difficult to deal with as much on a rhetoric basis as anything else - because when you read his book you find that he intersperses some things that I think are very valid points (such as 'don't necessarily assume that just because you can get it online that's the most effective way to get it or the place with the most high-quality material') - mixed up with things that are basically just emotional appeals that aren't subject to much rational debate (such as his fond recollections of libraries in his childhood as a place to go sit or that they remind him of chocolate chip cookies, and how sad it is that the world isn't like that any more).

So I don't quite know how to react to some of where he's coming from other than to say that I sort of regret the way he wrote that book, intermixing these emotional things with some issues that I think reasonable people who care a lot about these issues should be thinking hard about. Obviously, he's entitled to be sad about how the world has changed, but this confuses many of the other issues - which he is hardly the first to have raised.

There are other people, I think, who worry legitimately - and this is a position you hear sometimes from faculty at universities - that digital libraries are dangerous in the sense that in a time when libraries are very starved for funding for collections you are diverting a lot of resources away from building collections. That's a legitimate concern and one that I think really needs to be addressed as a part of a discussion about priorities and strategies for each institution, and also kind of frames this question about: Do digital libraries make a difference, does electronic information make a difference, is it worth the investment? I think those are questions we can't shrug off. Unfortunately they are very, very tough questions to get a handle on in some cases. It is really clear, for example, that there is some advantage to be had at least from opening up libraries so they are accessible 24 hours a day. How you measure that incremental value against other priorities is a really tough thing to do, and I know organizations like the Association of Research Libraries are really struggling intellectually with this whole question of what are the right quality measures for an environment that's increasingly including a lot of electronic content. The Coalition for Networked Information (CNI) has a project that's underway now on helping institutions in assessing the networked information environment which again tries to get, at least at some level, at some of these issues.

ER: What is your vision of the role now of CNI?

LYNCH: Well, I think if you go back to the roots of CNI it really was to establish a constructive dialog between information technologists and content specialists to ensure that we had a rich and vibrant assortment of content available on the Net to support scholarship and learning and teaching. I think that to some extent CNI achieved that or at least took some important first steps on it under Paul Peters' leadership. I think that there are a couple of directions CNI needs to go in now as next steps. One, which I think Paul Peters already had made a very important start on before he died, was the notion of bringing other communities into the same dialog that it has fostered between information technologists and librarians - authors, publishers, scholarly societies, the humanities, arts and cultural heritage communities, for example. I think this is a key vision to continue to work toward.

I guess I'm also of the opinion at this point that we really need to get serious about building some infrastructural components to facilitate resource-sharing and network commerce - things like authentication and authorization systems which really can be used on an inter-organizational basis, but at the same time are sensitive to issues of privacy that are very important to the library community and have been championed for decades by that community.

I think that there's lots more work to do in the areas of metadata and network information discovery and retrieval. I'm hopeful that we can move CNI into a role where it can try and help its community with technology assessment. I think it's very difficult now, as institutions try and develop information technology strategies, to get good data on what's likely to be real and what's not. I'm hoping that CNI can help here. So I see CNI both as continuing its role of fostering organizational and cross-professional dialog towards getting things done, but also complementing that with a renewed set of more technically and infrastructure-oriented initiatives. If you look at the things that the sponsor organizations of CNI have been active in, I'm struck that particularly in the Internet 2 area, the CNI community really needs to get much more engaged in talking about applications and what the new Internet 2 services will enable as far as applications. There's been some of that but I think there needs to be a lot more. Similarly if you look at the National Learning Infrastructure Initiative, I think that we really need to think hard about what kind of content and information access and delivery services can complement and support that kind of learning infrastructure. And I'm eager to try and facilitate some conversations with that community.

There's a tremendous amount to do building on the wonderful start that Paul gave to CNI. Perhaps the greatest challenge in the next few months will be to prioritize among initiatives and opportunities - which is a very exciting position to be in.
