Access Technology for Network Information Resources

Copyright 1990 CAUSE. From _CAUSE/EFFECT_ Volume 13, Number 2, Summer 1990. Permission to copy or disseminate all or part of this material is granted provided that the copies are not made or distributed for commercial advantage, the CAUSE copyright and its date appear, and notice is given that copying is by permission of CAUSE, the association for managing and using information resources in higher education. To disseminate otherwise, or to republish, requires written permission. For further information, contact CAUSE, 4840 Pearl East Circle, Suite 302E, Boulder, CO 80301, 303-449-4430, e-mail info@CAUSE.colorado.edu

ACCESS TECHNOLOGY FOR NETWORK INFORMATION RESOURCES

by Clifford A. Lynch

************************************************************************
Clifford A. Lynch is Director of the Division of Library Automation at the University of California Office of the President. He has worked for the University for ten years. Dr. Lynch is responsible for the UC MELVYL(R) online catalog, which also offers access to the National Library of Medicine MEDLINE(R) database, and is currently putting ISI's Current Contents into production.
************************************************************************

ABSTRACT: This article examines some of the technical access barriers that inhibit the effective use of the information resources available today on higher education networks; the emerging information server technology that promises to alleviate many of these access problems; and the growing popularity of the Z39.50 protocol, including some questions that remain about this information retrieval standard. A sample list of Z39.50-based projects is offered.

As networks serving the higher education and research community evolve, emphasis is moving from the networks themselves to the resources available through the networks. These network-accessible resources include databases containing library holdings, journal article citations, images, electronic text, and scientific and technical data. Recent developments indicative of this shift in focus range from the announcement in March 1990 of the formation of the Coalition for Networked Information by the Association of Research Libraries, CAUSE, and EDUCOM to the revised language in the version of Senate Bill 1067 which cleared the Subcommittee on Science, Research, and Technology in early April 1990. This bill, introduced by Senator Albert Gore, establishes the National Research and Education Network (NREN) and calls for the National Science Foundation to encourage the development of digital libraries and other information services (including commercial services) that are accessible through the NREN.

At least twenty-eight online library catalogs at major academic institutions, along with a range of more specialized databases and information services, are currently available for public access through the existing Internet. However, substantial barriers inhibit the effective use of the information resources available today and threaten the higher education and research community with complete overload as such resources continue to proliferate. Active research and prototyping are under way at a number of institutions across the country to explore and develop the new technologies necessary to change the way information resources are accessed across networks.
This article provides an overview of some of the current problems in network access to information resources and a relatively nontechnical summary of the developing technology to address them.

Access Barriers

Information resources such as library catalogs are accessed across the Internet by remote log-in: the user appears to the remote system as a terminal. Technically, a user with an actual terminal is connected to a computer functioning as a terminal server and running the TELNET protocol across the network. Users with workstations, in most cases, run terminal emulation software and TELNET directly on the workstation. Users with intermediate-class hardware such as PCs or Macintoshes may either run both terminal emulation and TELNET directly on their personal computer if it has a network connection, or may simply emulate a terminal and connect (through a modem and the switched telephone network, or a hardwired cable) to a terminal server device.

Anyone who has tried to use a range of the currently available information resources can attest that this mode of access has serious drawbacks. Each system on the network has a different user interface, and it is necessary to learn new search strategies and commands for each system. Even though some systems are relatively easy to use and offer a good deal of guidance to the new or occasional user, the learning process can be quite tedious. Individual systems have their own "character" -- it may be far from obvious, just by reading the help screens, how to perform on one system a type of search that one routinely performs on another.

Terminal incompatibilities are another major source of problems. Some systems use cursor addressing and are quite sensitive to the brand of terminal used (or emulated) to access the system. A number of systems will not work at all unless the user has a specific type of terminal (usually a DEC VT100). A few are sufficiently antisocial to assume that the accessing user has a VT100 and will fill the screen of any other type of terminal with VT100 cursor-addressing sequences. A number of systems on the Internet were designed for use with IBM 3270-type terminals, which have special function keys that are mapped to keyboard sequences on ASCII terminals emulating 3270s. The user may receive a "helpful" message such as "Press PF3 for help" and have to guess what key sequence sends the equivalent of programmed function key 3 to the remote system when there is no such key on his or her keyboard.

To some extent these problems are not an intrinsic shortcoming of accessing information resources on the network through remote terminal log-in, and are readily correctable by redesign and reprogramming of existing information retrieval systems. But for many commercial vendors of library automation systems, a solution would involve a great deal of work (in which the vendors seem to have little interest). Further, the entire remote log-in approach is based on line-by-line, character-oriented terminals; it does not exploit the sophisticated capabilities of today's high-resolution bitmapped workstations. Exploiting those capabilities within the remote log-in model would require that the various information systems on the network implement X Windows, which again does not seem to be on the near-term development agenda for most of the major information retrieval system vendors.
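What remote log-in access amounts to for a program -- and why it is so awkward to build upon -- can be suggested with a short sketch. The Python fragment below opens a raw TELNET-style connection to a hypothetical online catalog, sends one command, and collects whatever screen text comes back; the host name and command syntax are invented for illustration, and every real system differs in exactly these respects.

    # A minimal sketch (not any particular catalog's actual interface) of
    # remote log-in access as a program sees it: open a TELNET-style
    # connection, send a command, and scrape raw screen text. The host and
    # the command syntax below are hypothetical.
    import socket

    HOST = "catalog.example.edu"   # hypothetical online catalog host
    PORT = 23                      # standard TELNET port

    def scrape(command: str) -> str:
        """Send one command and return whatever raw text the remote system emits."""
        with socket.create_connection((HOST, PORT), timeout=30) as conn:
            conn.sendall(command.encode("ascii") + b"\r\n")
            conn.settimeout(5)
            chunks = []
            try:
                while True:
                    data = conn.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
            except socket.timeout:
                pass   # no further output; assume the "screen" is complete
        # The caller gets undifferentiated screen text -- possibly laced with
        # cursor-addressing escape sequences -- and must interpret it by guesswork.
        return b"".join(chunks).decode("ascii", errors="replace")

    print(scrape("FIND SUBJECT information retrieval"))   # hypothetical command syntax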
A more basic set of problems appears when users try to employ multiple information resources on the network to solve problems. The user typically wants to:

* search a series of databases that may be located anywhere on the network (preferably without having to reformulate the query for each system);
* move the results to some convenient local workstation or timesharing system;
* consolidate the results and eliminate duplicates based on some precedence scheme (for example, when searching for books, if the same book is found in multiple catalogs, keep the citation from the library most easily accessible to the user); and
* store the consolidated search result, print it, incorporate it in the bibliography of a paper, or place it in a personal database.

This is a part of the vision behind the "scholar's workstation" concept that is driving the evolution of academic computing on many university campuses. The reality today falls hopelessly short of the vision. At best, the user can save transcripts of sessions with each of the relevant remote systems in which search results are displayed, and then laboriously edit each transcript and reconcile the differing display formats of the various remote systems.

An increasingly visible problem, as resources multiply, is the selection of appropriate databases to satisfy a given query. Currently, there are several resource directories available; these are simply collections of one or two pages describing each available system, the mechanics of signing on to it, and its database coverage. These directories are intended for human reading and are not (at present) indexed. As we move toward a future with hundreds or even thousands of information resources available, these directories will have to become databases rather than printed resources, and will have to be organized for access by programs that help information seekers identify appropriate resources to search.

The use of terminal emulation and remote log-in to access network resources makes the proliferation of such resources more of a problem than it should be. Using this means of access, end users must personally search several resources sequentially, then transfer and consolidate the results. If the searching, transfer, and consolidation activities could be turned over to a program, then growth in the number of systems on the network would not increase the workload of users, but simply increase the amount of time they must wait while their system conducts a comprehensive search.

Information Server Technology

The solution to the problems described above is to provide access to existing network information resources through information servers -- computers attached to the network that provide services to "clients" (see Figure 1). In this environment, the client is a program -- running on a personal computer, workstation, or timeshared system -- that is accessed through a terminal, workstation, or personal computer. The client operates on behalf of the human end user to insulate the user from database access protocols. All interaction with information resources is through the client, which presents the user with a uniform, consistent interface. The user presents the client with a query, and the client selects appropriate information servers and sends the query to each of them. All results found on the remote information servers are then transferred back to the client for consolidation, presentation to the user, or other processing.

[FIGURE NOT AVAILABLE IN ASCII TEXT VERSION]

Such is the grand design, but the details of implementation are much more complicated.
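In outline, the client's role can be caricatured in a few lines of code. The sketch below fans a single query out to several servers in a fixed precedence order, pools the results, and drops duplicates; the server names, the search_server() helper, and the record fields are all hypothetical stand-ins for what a real client and a real retrieval protocol would provide.

    # A caricature of the client's role in the information server model: fan
    # one query out to several servers, pool the results, and drop duplicates
    # according to a precedence order. The server names, the search_server()
    # helper, and the record format are hypothetical; a real client would
    # speak a retrieval protocol such as Z39.50 to each server.

    # Servers listed in precedence order: when the same item turns up in more
    # than one result set, keep the copy from the earliest (most convenient) server.
    SERVERS = ["local-campus-catalog", "regional-union-catalog", "national-database"]

    def search_server(server: str, query: str) -> list[dict]:
        """Placeholder for a protocol exchange with one remote information server."""
        raise NotImplementedError("stand-in for a real search across the network")

    def federated_search(query: str) -> list[dict]:
        consolidated = {}
        for server in SERVERS:                       # precedence order
            for record in search_server(server, query):
                key = (record.get("title"), record.get("author"), record.get("year"))
                # The first server to supply a given item wins; later duplicates are dropped.
                consolidated.setdefault(key, record)
        return list(consolidated.values())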
The basis of the work accomplished to date is American National Standard Z39.50, an application-layer protocol for computer-to-computer information retrieval, which was standardized in 1988 under the auspices of the National Information Standards Organization (NISO), the American National Standards Institute (ANSI) accredited standards-writing body for the library, publishing, and information industries. International standards closely related to Z39.50 -- ISO 10162 (Search and Retrieval Service) and ISO 10163 (Search and Retrieval Protocol) -- recently achieved draft international standard (DIS) status.

Z39.50 explained

The Z39.50 protocol is designed to function as an application-layer (layer 7) protocol within the Open Systems Interconnection (OSI) protocol suite, but for current applications it is being mapped on top of the Transmission Control Protocol/Internet Protocol (TCP/IP) suite in use in the research and education communities in the United States.

Z39.50 provides several facilities. It allows a client machine to submit a search to a server; to manage the search process (for example, the server can inform the client that a search will take a long time to execute and request confirmation that the search should be executed); and to learn the results of the search (the number of records matching the search criteria, or the incidence of various error conditions). After a search, the search result is retained on the server; the client can request that records from this result be transferred from server to client. These functions, along with the ability to delete result sets being held on the server and an initialization process that permits client and server to agree on various parameters for their interaction, form the "machinery" of Z39.50.

The other major part of Z39.50 (technically, not entirely part of the U.S. standard but more fully included in ISO 10162/10163) is the specification of a canonical search format through which searches can be transmitted from client to server. This consists essentially of a series of predicates linked by Boolean operators such as AND and OR; the predicates are composed of field names, relational attributes, and values (for example, SUBJECT-containing-keyword value, or AUTHOR-lastname-equals value). Both the field names and the relational attributes are selected from an attribute set that forms part of the "context" of a connection between a server and a client. There is a working set of attributes used for bibliographic retrieval, and it seems likely that new attribute sets will be defined to support other types of databases (such as full-text databases). It is planned that the Library of Congress will maintain a registry of attribute sets and other data related to Z39.50. Various other "codes" are also required for a Z39.50 session, such as an error message set to allow the server to communicate problems with searches back to the client; these too are managed through a registry process.

To create a useful interaction between client and server, both must have some common understanding of the semantics of the data -- for example, the client must know that it is searching a bibliographic database and must understand a search attribute set that the server also knows in order to construct queries.
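To make the canonical search format a little more concrete, the sketch below models such a query as a small tree: predicates built from an attribute, a relation, and a value, combined with Boolean operators. The attribute and relation names are symbolic stand-ins for the registered attribute-set values a real client and server would agree on, and the standard defines its own wire representation for such queries rather than anything like these Python objects.

    # An illustrative model (not the standard's actual encoding) of the
    # canonical search format: predicates made of an attribute, a relation,
    # and a value, linked by Boolean operators. Attribute and relation names
    # here are symbolic stand-ins for values drawn from a registered attribute set.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Predicate:
        attribute: str    # e.g., "SUBJECT" or "AUTHOR-lastname", from a shared attribute set
        relation: str     # e.g., "containing-keyword" or "equals"
        value: str

    @dataclass
    class BooleanNode:
        operator: str                            # "AND" or "OR"
        left: Union["Predicate", "BooleanNode"]
        right: Union["Predicate", "BooleanNode"]

    def render(node) -> str:
        """Render the query tree as readable text, for illustration only."""
        if isinstance(node, Predicate):
            return f"{node.attribute} {node.relation} '{node.value}'"
        return f"({render(node.left)} {node.operator} {render(node.right)})"

    # SUBJECT containing keyword "networks" AND AUTHOR lastname equals "lynch"
    query = BooleanNode("AND",
                        Predicate("SUBJECT", "containing-keyword", "networks"),
                        Predicate("AUTHOR-lastname", "equals", "lynch"))
    print(render(query))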
At present, there is no way within the protocol for the server to tell the client what it can do -- for example, which attributes within an attribute set are supported for searching a given database (although such a capability is a protocol extension currently under discussion). In addition, server and client must both understand a common transfer format for data moving from server to client; this is not part of the protocol but rather is assumed to be defined by separate standards relevant to the specific type of data in the database being searched. For bibliographic data, for example, the MARC (machine-readable cataloging) standards are commonly used; they define data elements and a transfer format for bibliographic description. Unfortunately, no standards are in common use for many other types of data that one would like to search using Z39.50, such as journal abstracting and indexing, electronic journal articles, and images (though in some cases there are a number of emerging candidate standards).

A number of protocol extensions are under discussion, including facilities that would:

* allow clients to obtain descriptive information from servers, to reduce the amount of "prior agreement" -- in effect, manual configuration -- required for a client to communicate usefully with a server;
* permit browsing of index values;
* support other types of searching, such as relevance feedback searching; and
* support sorting of result sets.

Image databases in particular are likely to require fairly extensive protocol extensions because of the transformations that one wants the server, rather than the client, to perform on images before sending them over the network, in the interest of conserving network bandwidth.

Because the standards process moves slowly and a number of active prototyping projects are under way (see sidebar) to implement and explore the use of Z39.50 in the U.S., a working group of Z39.50 implementors is now meeting on an informal basis to deal with extensions, attribute sets, and similar matters. As implementations stabilize and the existing protocol and proposed extensions are validated in operation, the intent is that the results be fed back into the standards pipeline for a revised Z39.50 U.S. standard. Some changes to Z39.50 will also be needed to harmonize it with the work going on in the international arena (ISO 10162 and 10163) as that work reaches stability.

Questions about Z39.50

There are many open questions concerning Z39.50 and the extent to which clients interacting with information servers can replace the best of the existing integrated information retrieval systems, which have user interfaces tightly linked to retrieval software and the underlying databases. For example, major problems in bibliographic databases arise when users obtain very large results and need assistance in reducing them to a manageable size, and when users retrieve zero results (due, for example, to spelling or typing errors or to problems with indexing vocabulary). Existing systems can use information derived from analysis of the database to help users who encounter these difficulties. In a zero-result search, the system could explain that no records contained a given keyword of a multiple-keyword search, to aid recovery from the zero result. In the large-result case, the system might suggest (based on knowledge of database statistics) that the user limit the results to English-language material published in the last ten years.
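A client can approximate some of this help on its own, though only crudely. The sketch below, which reuses the hypothetical search_server() helper from the earlier fan-out example, re-issues each term of a failed multiple-keyword search separately so it can tell the user which terms match nothing in the database; a server with access to its own index statistics could provide the same guidance far more cheaply and accurately.

    # A crude client-side approximation of zero-result diagnosis: when a
    # multiple-keyword search finds nothing, re-issue each keyword on its own
    # and report which terms retrieve no records. Reuses the hypothetical
    # search_server() helper sketched earlier.

    def diagnose_zero_result(server: str, keywords: list[str]) -> list[str]:
        """Return the keywords that individually retrieve nothing from the server."""
        missing = []
        for term in keywords:
            if not search_server(server, term):   # hypothetical single-term search
                missing.append(term)
        return missing

    # For a failed search on, say, "ytterbium superconductivity thin films",
    # the client might then report: no records contain the keyword "ytterbium".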
Means for the information servers to pass this kind of information to clients, and for clients to exploit it in interacting with the users for whom they mediate, have not been well explored within the Z39.50 protocol framework. In some cases, it is not even clear whether responsibility for a given function belongs with the client or with the server.

It is unlikely that Z39.50-based clients will completely replace remote log-in access soon; there will be situations, particularly with sophisticated information retrieval systems that have elaborate user interfaces and incorporate advanced searching techniques, where the framework of the Z39.50 protocol and the generality of a Z39.50 client simply cannot match the quality of service offered by the information retrieval system on the server. But it is reasonable to expect that at least a good proportion of the relatively casual searching that users will want to perform on network information resources can be accommodated through Z39.50.

Another important concern is the extent to which the protocol can be extended so that clients can become self-configuring. In the prototypes to date, clients have dealt with a very small number of reasonably homogeneous servers, and they require a good deal of manual adaptation when a new server with a new database is added. As we move to an environment with many different sets of both server and client software and many types of information resources, this process must become more automated, with real-time negotiation between server and client. To accomplish this effectively we must face the issues of describing database contents, structures, and access points in a much more standardized fashion.

Finally, it is essential to recognize that Z39.50 interfaces can only be used in conjunction with other standards (or at least working agreements) on data elements and transfer formats. As information server technology comes to be applied to more and more types of data, it will be necessary to reach rapid, parallel working agreements on data elements and transfer formats for the types of data in question.

The sample list of current Z39.50-based projects in the U.S. indicates that a great deal of work is under way to validate and advance the development of information server technology. Many of the projects have started only in the last year, suggesting that adoption of the Z39.50 approach may be reaching a critical mass. Notably absent from the current activities, however, seem to be most of the traditional commercial information utilities (such as DIALOG(R)).

Conclusions

Looking a little farther into the future, the development of information servers is a major step toward more advanced uses of the network. Throughout this article the discussion has focused on direct use of information resources by people -- today by direct remote log-in, and in the future through an interface running on a client interacting with information servers. The development of information servers also permits the use of increasingly autonomous computer programs (such as the knowledge robots, or "knowbots," proposed by Kahn and Cerf at the Corporation for National Research Initiatives) that can move through the network extracting, correlating, and refining information. Information servers will ultimately form an essential part of a network knowledge and information base that will serve many uses.
It will not be limited to the relatively direct end-user searching that characterizes virtually all of today's use of information resources on the network.

The analogy is often drawn between the evolving National Research and Education Network and the highway system: the networks are described as information highways and are predicted to have an impact as large as that of other major transportation systems in enhancing commerce and research and in creating social change. In this context, Tom West of the California State University System has spoken of the need for "information plazas" along these highways. The technology for access to information resources on the network described here is a vital step in making these information plazas recognizable and usable by those who will travel the networks.

************************************************************************

For further reading:

The National Information Standards Organization. American National Standard Z39.50-1988, Information Retrieval Service Definition and Protocol Specifications for Library Applications. New Brunswick, N.J.: Transaction Publishers, 1988.

Fenly, Judith G., and Beacher Wiggins, eds. The Linked Systems Project: A Networking Tool for Libraries. Dublin, Ohio: OCLC Online Computer Library Center, 1988.

Lynch, Clifford A. "The Client-Server Model in Information Retrieval," in Proceedings, 1989 ASIS Mid-Year Meeting (to appear).

Lynch, Clifford A., and Cecilia M. Preston. "Internet Access to Information Resources," in Annual Review of Information Science and Technology (ARIST), Volume 25 (to appear).

MELVYL(R) is a registered trademark of The Regents of the University of California. MEDLINE(R) is a registered trademark of the National Library of Medicine. DIALOG(R) is a registered trademark of Dialog Information Services, Inc., Palo Alto, CA.

************************************************************************