Access Technology for Network Information Resources

Copyright 1990 CAUSE. From _CAUSE/EFFECT_ Volume 13, Number 2, Summer 1990. Permission to copy or disseminate all or part of this material is granted provided that the copies are not made or distributed for commercial advantage, the CAUSE copyright and its date appear, and notice is given that copying is by permission of CAUSE, the association for managing and using information resources in higher education. To disseminate otherwise, or to republish, requires written permission. For further information, contact CAUSE, 4840 Pearl East Circle, Suite 302E, Boulder, CO 80301, 303-449-4430, e-mail info@CAUSE.colorado.edu

ACCESS TECHNOLOGY FOR NETWORK INFORMATION RESOURCES

by Clifford A. Lynch

************************************************************************
Clifford A. Lynch is Director of the Division of Library Automation at the University of California Office of the President. He has worked for the University for ten years. Dr. Lynch is responsible for the UC MELVYL(R) online catalog, which also offers access to the National Library of Medicine MEDLINE(R) database, and is currently putting ISI's Current Contents into production.
************************************************************************

ABSTRACT: This article examines some of the technical access barriers that inhibit the effective use of the information resources available today on higher education networks; the emerging information server technology that promises to alleviate many of these access problems; and the growing popularity of the Z39.50 protocol, including some questions that remain about this information retrieval standard. A sample list of Z39.50-based projects is offered.

As networks serving the higher education and research community evolve, emphasis is moving from the networks themselves to the resources available through the networks. These network-accessible resources include databases containing library holdings, journal article citations, images, electronic text, and scientific and technical data. Recent developments indicative of this shift in focus range from the announcement in March 1990 of the formation of the Coalition for Networked Information by the Association of Research Libraries, CAUSE, and EDUCOM to the revised language in the version of Senate Bill 1067 which cleared the Subcommittee on Science, Research, and Technology in early April 1990. This bill, introduced by Senator Albert Gore, establishes the National Research and Education Network (NREN) and calls for the National Science Foundation to encourage the development of digital libraries and other information services (including commercial services) that are accessible through the NREN.

At least twenty-eight online library catalogs at major academic institutions, along with a range of more specialized databases and information services, are currently available for public access through the existing Internet. However, substantial barriers inhibit the effective use of the information resources available today and threaten the higher education and research community with complete overload as such resources continue to proliferate. Active research and prototyping are under way at a number of institutions across the country to explore and develop the new technologies necessary to change the way information resources are accessed across networks.
This article provides an overview of some of the current problems in network access to information resources and a relatively nontechnical summary of the developing technology to address them.

Access Barriers

Information resources such as library catalogs are accessed across the Internet by remote log-in: the user appears to the remote system as a terminal. Technically, a user with an actual terminal is connected to a computer functioning as a terminal server and running the TELNET protocol across the network. Users with workstations, in most cases, run terminal emulation software and TELNET directly on the workstation. Users with intermediate-class hardware such as PCs or Macintoshes may either run both terminal emulation and TELNET directly on their personal computer if it has a network connection, or may simply emulate a terminal and connect (through a modem and the switched telephone network, or a hardwired cable) to a terminal server device.

Anyone who has tried to use a range of the currently available information resources can attest that this mode of access has serious drawbacks. Each system on the network has a different user interface, and it is necessary to learn new search strategies and commands for each system. Even though some systems are relatively easy to use and offer a good deal of guidance to the new or occasional user, the learning process can be quite tedious. Individual systems have their own "character" -- it may be far from obvious, just by reading the help screens, how to perform on one system a type of search that one routinely performs on another.

Terminal incompatibilities are another major source of problems. Some systems use cursor addressing and are quite sensitive to the brand of terminal used (or emulated) to access the system. A number of systems will not work at all unless the user has a specific type of terminal (usually a DEC VT100). A few are sufficiently antisocial to assume that the accessing user has a VT100 and will fill the screen of any other type of terminal with VT100 cursor-addressing sequences. A number of systems on the Internet were designed for use with IBM 3270-type terminals, which have special function keys that are mapped to keyboard sequences on ASCII terminals emulating 3270s. The user may receive a "helpful" message such as "Press PF3 for help" and have to guess what key sequence sends the equivalent of programmed function key 3 to the remote system when there is no such key on his or her keyboard.

To some extent these problems are not an intrinsic shortcoming of accessing information resources on the network through remote terminal log-in, and are readily correctable by redesign and reprogramming of existing information retrieval systems. But for many commercial vendors of library automation systems, a solution would involve a great deal of work (in which the vendors seem to have little interest). Further, the entire remote log-in approach is based on line-by-line, character-oriented terminals; it does not exploit the sophisticated capabilities of today's high-resolution bitmapped workstations. Exploiting those capabilities within the remote log-in model would require that the various information systems on the network implement X Windows, which again does not seem to be on the near-term development agenda for most of the major information retrieval system vendors.
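What remote log-in access amounts to for a program -- and why it is so awkward to build upon -- can be suggested with a short sketch. The Python fragment below opens a raw TELNET-style connection to a hypothetical online catalog, sends one command, and collects whatever screen text comes back; the host name and command syntax are invented for illustration, and every real system differs in exactly these respects.

    # A minimal sketch (not any particular catalog's actual interface) of
    # remote log-in access as a program sees it: open a TELNET-style
    # connection, send a command, and scrape raw screen text. The host and
    # the command syntax below are hypothetical.
    import socket

    HOST = "catalog.example.edu"   # hypothetical online catalog host
    PORT = 23                      # standard TELNET port

    def scrape(command: str) -> str:
        """Send one command and return whatever raw text the remote system emits."""
        with socket.create_connection((HOST, PORT), timeout=30) as conn:
            conn.sendall(command.encode("ascii") + b"\r\n")
            conn.settimeout(5)
            chunks = []
            try:
                while True:
                    data = conn.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
            except socket.timeout:
                pass   # no further output; assume the "screen" is complete
        # The caller gets undifferentiated screen text -- possibly laced with
        # cursor-addressing escape sequences -- and must interpret it by guesswork.
        return b"".join(chunks).decode("ascii", errors="replace")

    print(scrape("FIND SUBJECT information retrieval"))   # hypothetical command syntax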
A more basic set of problems appears when users try to employ multiple information resources on the network to solve problems. The user typically wants to:

* search a series of databases that may be located anywhere on the network (preferably without having to reformulate the query for each system);
* move the results to some convenient local workstation or timesharing system;
* consolidate the results and eliminate duplicates based on some precedence scheme (for example, when searching for books, if the same book is found in multiple catalogs, keep the citation from the library most easily accessible to the user); and
* store the consolidated search result, print it, incorporate it in the bibliography of a paper, or place it in a personal database.

This is a part of the vision behind the "scholar's workstation" concept that is driving the evolution of academic computing on many university campuses. The reality today falls hopelessly short of the vision. At best, the user can save transcripts of sessions with each of the relevant remote systems in which search results are displayed, and then laboriously edit each transcript and reconcile the differing display formats of the various remote systems.

An increasingly visible problem, as resources multiply, is the selection of appropriate databases to satisfy a given query. Currently, there are several resource directories available; these are simply collections of one or two pages describing each available system, the mechanics of signing on to it, and its database coverage. These directories are intended for human reading and are not (at present) indexed. As we move toward a future with hundreds or even thousands of information resources available, these directories will have to become databases rather than printed resources, and will have to be organized for access by programs that help information seekers identify appropriate resources to search.

The use of terminal emulation and remote log-in to access network resources makes the proliferation of such resources more of a problem than it should be. Using this means of access, end users must personally search several resources sequentially, then transfer and consolidate the results. If the searching, transfer, and consolidation activities could be turned over to a program, then growth in the number of systems on the network would not increase the workload of users, but simply increase the amount of time they must wait while their system conducts a comprehensive search.

Information Server Technology

The solution to the problems described above is to provide access to existing network information resources through information servers -- computers attached to the network that provide services to "clients" (see Figure 1). In this environment, the client is a program -- running on a personal computer, workstation, or timeshared system -- that is accessed through a terminal, workstation, or personal computer. The client operates on behalf of the human end user to insulate the user from database access protocols. All interaction with information resources is through the client, which presents the user with a uniform, consistent interface. The user presents the client with a query, and the client selects appropriate information servers and sends the query to each of them. All results found on the remote information servers are then transferred back to the client for consolidation, presentation to the user, or other processing.

[FIGURE NOT AVAILABLE IN ASCII TEXT VERSION]

Such is the grand design, but the details of implementation are much more complicated.
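In outline, the client's role can be caricatured in a few lines of code. The sketch below fans a single query out to several servers in a fixed precedence order, pools the results, and drops duplicates; the server names, the search_server() helper, and the record fields are all hypothetical stand-ins for what a real client and a real retrieval protocol would provide.

    # A caricature of the client's role in the information server model: fan
    # one query out to several servers, pool the results, and drop duplicates
    # according to a precedence order. The server names, the search_server()
    # helper, and the record format are hypothetical; a real client would
    # speak a retrieval protocol such as Z39.50 to each server.

    # Servers listed in precedence order: when the same item turns up in more
    # than one result set, keep the copy from the earliest (most convenient) server.
    SERVERS = ["local-campus-catalog", "regional-union-catalog", "national-database"]

    def search_server(server: str, query: str) -> list[dict]:
        """Placeholder for a protocol exchange with one remote information server."""
        raise NotImplementedError("stand-in for a real search across the network")

    def federated_search(query: str) -> list[dict]:
        consolidated = {}
        for server in SERVERS:                       # precedence order
            for record in search_server(server, query):
                key = (record.get("title"), record.get("author"), record.get("year"))
                # The first server to supply a given item wins; later duplicates are dropped.
                consolidated.setdefault(key, record)
        return list(consolidated.values())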
The basis of the work accomplished to date is American National Standard Z39.50, an application-layer protocol for computer-to-computer information retrieval, which was standardized in 1988 under the auspices of the National Information Standards Organization (NISO), the American National Standards Institute (ANSI) accredited standards-writing body for the library, publishing, and information industries. International standards closely related to Z39.50 -- ISO 10162 (Search and Retrieval Service) and ISO 10163 (Search and Retrieval Protocol) -- recently achieved draft international standard (DIS) status.

Z39.50 explained

The Z39.50 protocol is designed to function as an application-layer (layer 7) protocol within the Open Systems Interconnection (OSI) protocol suite, but for current applications it is being mapped on top of the Transmission Control Protocol/Internet Protocol (TCP/IP) suite in use in the research and education communities in the United States.

Z39.50 provides several facilities. It allows a client machine to submit a search to a server; to manage the search process (for example, the server can inform the client that a search will take a long time to execute and request confirmation that the search should be executed); and to learn the results of the search (the number of records matching the search criteria, or the incidence of various error conditions). After a search, the search result is retained on the server; the client can request that records from this result be transferred from server to client. These functions, along with the ability to delete result sets being held on the server and an initialization process that permits client and server to agree on various parameters for their interaction, form the "machinery" of Z39.50.

The other major part of Z39.50 (technically, not entirely part of the U.S. standard but more fully included in ISO 10162/10163) is the specification of a canonical search format through which searches can be transmitted from client to server. This consists essentially of a series of predicates linked by Boolean operators such as AND and OR; the predicates are composed of field names, relational attributes, and values (for example, SUBJECT-containing-keyword value, or AUTHOR-lastname-equals value). Both the field names and the relational attributes are selected from an attribute set that forms part of the "context" of a connection between a server and a client. There is a working set of attributes used for bibliographic retrieval, and it seems likely that new attribute sets will be defined to support other types of databases (such as full-text databases). It is planned that the Library of Congress will maintain a registry of attribute sets and other data related to Z39.50. Various other "codes" are also required for a Z39.50 session, such as an error message set to allow the server to communicate problems with searches back to the client; these too are managed through a registry process.

To create a useful interaction between client and server, both must have some common understanding of the semantics of the data -- for example, the client must know that it is searching a bibliographic database and must understand a search attribute set that the server also knows in order to construct queries.
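To make the canonical search format a little more concrete, the sketch below models such a query as a small tree: predicates built from an attribute, a relation, and a value, combined with Boolean operators. The attribute and relation names are symbolic stand-ins for the registered attribute-set values a real client and server would agree on, and the standard defines its own wire representation for such queries rather than anything like these Python objects.

    # An illustrative model (not the standard's actual encoding) of the
    # canonical search format: predicates made of an attribute, a relation,
    # and a value, linked by Boolean operators. Attribute and relation names
    # here are symbolic stand-ins for values drawn from a registered attribute set.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Predicate:
        attribute: str    # e.g., "SUBJECT" or "AUTHOR-lastname", from a shared attribute set
        relation: str     # e.g., "containing-keyword" or "equals"
        value: str

    @dataclass
    class BooleanNode:
        operator: str                            # "AND" or "OR"
        left: Union["Predicate", "BooleanNode"]
        right: Union["Predicate", "BooleanNode"]

    def render(node) -> str:
        """Render the query tree as readable text, for illustration only."""
        if isinstance(node, Predicate):
            return f"{node.attribute} {node.relation} '{node.value}'"
        return f"({render(node.left)} {node.operator} {render(node.right)})"

    # SUBJECT containing keyword "networks" AND AUTHOR lastname equals "lynch"
    query = BooleanNode("AND",
                        Predicate("SUBJECT", "containing-keyword", "networks"),
                        Predicate("AUTHOR-lastname", "equals", "lynch"))
    print(render(query))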
At present, there is no way within the protocol for the server to tell the client what it can do -- for example, which attributes within an attribute set are supported for searching a given database (although such a capability is a protocol extension currently under discussion). In addition, server and client must both understand a common transfer format for data moving from server to client; this is not part of the protocol but rather is assumed to be defined by separate standards relevant to the specific type of data in the database being searched. For bibliographic data, for example, the MARC (machine-readable cataloging) standards are commonly used; they define data elements and a transfer format for bibliographic description. Unfortunately, no standards are in common use for many other types of data that one would like to search using Z39.50, such as journal abstracting and indexing, electronic journal articles, and images (though in some cases there are a number of emerging candidate standards).

A number of protocol extensions are under discussion, including facilities that would:

* allow clients to obtain descriptive information from servers, to reduce the amount of "prior agreement" -- in effect, manual configuration -- required for a client to communicate usefully with a server;
* permit browsing of index values;
* support other types of searching, such as relevance feedback searching; and
* support sorting of result sets.

Image databases in particular are likely to require fairly extensive protocol extensions because of the transformations that one wants the server, rather than the client, to perform on images before sending them over the network, in the interest of conserving network bandwidth.

Because the standards process moves slowly and a number of active prototyping projects are under way (see sidebar) to implement and explore the use of Z39.50 in the U.S., a working group of Z39.50 implementors is now meeting on an informal basis to deal with extensions, attribute sets, and similar matters. As implementations stabilize and the existing protocol and proposed extensions are validated in operation, the intent is that the results be fed back into the standards pipeline for a revised Z39.50 U.S. standard. Some changes to Z39.50 will also be needed to harmonize it with the work going on in the international arena (ISO 10162 and 10163) as that work reaches stability.

Questions about Z39.50

There are many open questions concerning Z39.50 and the extent to which clients interacting with information servers can replace the best of the existing integrated information retrieval systems, which have user interfaces tightly linked to retrieval software and the underlying databases. For example, major problems in bibliographic databases arise when users obtain very large results and need assistance in reducing them to a manageable size, and when users retrieve zero results (due, for example, to spelling or typing errors or to problems with indexing vocabulary). Existing systems can use information derived from analysis of the database to help users who encounter these difficulties. In a zero-result search, the system could explain that no records contained a given keyword of a multiple-keyword search, to aid recovery from the zero result. In the large-result case, the system might suggest (based on knowledge of database statistics) that the user limit the results to English-language material published in the last ten years.
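A client can approximate some of this help on its own, though only crudely. The sketch below, which reuses the hypothetical search_server() helper from the earlier fan-out example, re-issues each term of a failed multiple-keyword search separately so it can tell the user which terms match nothing in the database; a server with access to its own index statistics could provide the same guidance far more cheaply and accurately.

    # A crude client-side approximation of zero-result diagnosis: when a
    # multiple-keyword search finds nothing, re-issue each keyword on its own
    # and report which terms retrieve no records. Reuses the hypothetical
    # search_server() helper sketched earlier.

    def diagnose_zero_result(server: str, keywords: list[str]) -> list[str]:
        """Return the keywords that individually retrieve nothing from the server."""
        missing = []
        for term in keywords:
            if not search_server(server, term):   # hypothetical single-term search
                missing.append(term)
        return missing

    # For a failed search on, say, "ytterbium superconductivity thin films",
    # the client might then report: no records contain the keyword "ytterbium".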
Means for the information servers to pass this kind of information to clients, and for clients to exploit it in interacting with the users for whom they mediate, have not been well explored within the Z39.50 protocol framework. In some cases, it is not even clear whether responsibility for a given function belongs with the client or with the server.

It is unlikely that Z39.50-based clients will completely replace remote log-in access soon; there will be situations, particularly with sophisticated information retrieval systems that have elaborate user interfaces and incorporate advanced searching techniques, where the framework of the Z39.50 protocol and the generality of a Z39.50 client simply cannot match the quality of service offered by the information retrieval system on the server. But it is reasonable to expect that at least a good proportion of the relatively casual searching that users will want to perform on network information resources can be accommodated through Z39.50.

Another important concern is the extent to which the protocol can be extended so that clients can become self-configuring. In the prototypes to date, clients have dealt with a very small number of reasonably homogeneous servers, and they require a good deal of manual adaptation when a new server with a new database is added. As we move to an environment with many different sets of both server and client software and many types of information resources, this process must become more automated, with real-time negotiation between server and client. To accomplish this effectively we must face the issues of describing database contents, structures, and access points in a much more standardized fashion.

Finally, it is essential to recognize that Z39.50 interfaces can only be used in conjunction with other standards (or at least working agreements) on data elements and transfer formats. As information server technology comes to be applied to more and more types of data, it will be necessary to reach rapid, parallel working agreements on data elements and transfer formats for the types of data in question.

The sample list of current Z39.50-based projects in the U.S. indicates that a great deal of work is under way to validate and advance the development of information server technology. Many of the projects have started only in the last year, suggesting that adoption of the Z39.50 approach may be reaching a critical mass. Notably absent from the current activities, however, seem to be most of the traditional commercial information utilities (such as DIALOG(R)).

Conclusions

Looking a little farther into the future, the development of information servers is a major step toward more advanced uses of the network. Throughout this article the discussion has focused on direct use of information resources by people -- today by direct remote log-in, and in the future through an interface running on a client interacting with information servers. The development of information servers also permits the use of increasingly autonomous computer programs (such as the knowledge robots, or "knowbots," proposed by Kahn and Cerf at the Corporation for National Research Initiatives) that can move through the network extracting, correlating, and refining information. Information servers will ultimately form an essential part of a network knowledge and information base that will serve many uses.
It will not be limited to the relatively direct end-user searching that characterizes virtually all of today's use of information resources on the network.

The analogy is often drawn between the evolving National Research and Education Network and the highway system: the networks are described as information highways and are predicted to have an impact as large as that of other major transportation systems in enhancing commerce and research and in creating social change. In this context, Tom West of the California State University System has spoken of the need for "information plazas" along these highways. The technology for access to information resources on the network described here is a vital step in making these information plazas recognizable and usable by those who will travel the networks.

************************************************************************

For further reading:

The National Information Standards Organization. American National Standard Z39.50-1988, Information Retrieval Service Definition and Protocol Specifications for Library Applications. New Brunswick, N.J.: Transaction Publishers, 1988.

Fenly, Judith G., and Beacher Wiggins, eds. The Linked Systems Project: A Networking Tool for Libraries. Dublin, Ohio: OCLC Online Computer Library Center, 1988.

Lynch, Clifford A. "The Client-Server Model in Information Retrieval," in Proceedings, 1989 ASIS Mid-Year Meeting (to appear).

Lynch, Clifford A., and Cecilia M. Preston. "Internet Access to Information Resources," in Annual Review of Information Science and Technology (ARIST), Volume 25 (to appear).

MELVYL(R) is a registered trademark of The Regents of the University of California. MEDLINE(R) is a registered trademark of the National Library of Medicine. DIALOG(R) is a registered trademark of Dialog Information Services, Inc., Palo Alto, CA.

************************************************************************