Don Gourley
Washington Research Library Consortium
gourley@wrlc.org
Digital library technologies are evolving rapidly. A middleware architecture based on a standard framework and Web services can help institutions with limited resources manage the integration and maintenance of disparate software components and new technology. This paper describes the requirements of such an architecture and presents as an example the ALADIN platform developed by the Washington Research Library Consortium to support access to digital collections, subscription databases, and library catalogs.
In a broad sense, a digital library is simply an on-line system providing access to a wide variety of content and services. Content can include virtually any kind of electronic material, such as various kinds of electronic media (images, video, etc.), licensed databases of journals, articles and abstracts, and descriptions of physical collections. Digital library services are also varied, but typically serve the same roles that traditional collection development and access services have in physical libraries: selection, specialization and administration.1 Figure 1 shows some of the kinds of content and services that might be part of a real-world digital library.
In a digital library, collection selection means acquiring, describing, storing and delivering electronic resources. Metadata is used to describe the intellectual and technical attributes of the resource's objects. Storage is not only distributed across an institution but around the world through subscription or collaboration with remote partners. Many digital objects can be delivered directly over the Web, while some may require special viewing applications.
Specialization motivates the creation of a variety of access services for searching, browsing and discovering resources. Digital objects need to be organized and indexed for different purposes. Obvious structures for organizing and indexing digital objects include collection catalogs, finding aids and databases -- but virtual collections can also be built from multiple resources. For example, an instructor may want to combine references to objects in the library catalog, electronic journals and electronic reserves in an on-line course page. Mechanisms to find and access these objects must work in different on-line contexts. In addition, patrons should be able to perform their own specialization by accessing digital library content and services through personalized portals.
Administration is handled by core services that usually interact with other digital library services rather than directly with patrons. These services must often be recreated for each new environment and digital library system. How they are developed and maintained is critical to the long-term success of a digital library project, especially when technical resources are limited. For example, a naming service can provide a technology- and location-independent mechanism for identifying and retrieving digital objects; without it, access services must be modified when resources move or are ported to a new platform. Access control presents a growing challenge as digital libraries include an increasing variety of electronic resources from different sources. Different mechanisms (e.g. proxying, login names/passwords, certificates or other vouchers) are required to grant access depending on who and where the patron is and what resource is being accessed. Without a core service, each special access service must handle the task of choosing and performing the specific login or else pass the responsibility on to the patron.
The technology required to provide digital library services continues to change rapidly as researchers and commercial vendors expand the kinds of content and access services that might be included. As the technology changes and improves, an institution's vision and requirements for a digital library evolve accordingly. The software to support these services is complex and dynamic. How does the average academic library, with limited technical resources, manage the development and growth of a practical, real-world, production digital library that meets its unique requirements? There are several factors that limit an institution's ability to build a digital library, including significant issues and expenses related to the acquisition (digitization and licensing) of content. But the factor we will focus on here is the development and maintenance of software to support the delivery of digital library content and services.
Large academic libraries may be able to write their own software, perhaps in collaboration with university researchers and IT departments. But many libraries do not have the resources to build a complete solution, and a comprehensive commercial solution is often too expensive. Moreover, due to the various and expanding content and services, no single product can meet all the requirements of even a basic function of a digital library.2
In practice, real-world digital library systems are collections of loosely coupled services made to appear more or less integrated when accessed via library Web pages. The disparate systems that make up the digital library include commercial products, components constructed with specialized tool kits, open source applications, and homegrown programs. Some of these components are local applications and many others are distributed across the Internet and provided by numerous independent sources. Constructing a comprehensive production digital library has become a complex systems integration job.3
If done in an ad-hoc manner, solving each integration problem individually, even integrating off-the-shelf components can require resources out of the reach of many institutions. Multiple interfaces to components make a distributed application fragile and difficult to maintain. Core administrative services like naming and access control can simplify integration by reducing and standardizing the interfaces between components, but can themselves be difficult to engineer for maximum adaptability and flexibility.
A middleware architecture based on a standard framework can help manage the digital library integration problem with minimal resources. A software framework supports the common tasks and functions that are needed for distributed Web-based applications; the richer the framework, the less custom programming is required. Some of the functions supported by a framework for a digital library might include:
There are many Web application frameworks that provide these tools. A simple (and free) way to build such a framework is to use the Apache Web server with an application server module for a language with strong support for Web applications, such as Java, Perl or PHP. There are also several commercial application server products that provide most of the necessary functions.
When technical resources are limited, it is important to select a framework with which staff have some level of experience and comfort. Consider not only current staff but how easy it will be to find personnel in the future who can work on that platform. Almost always a standard framework that is supported on multiple operating systems and hardware is better in this regard than a proprietary framework tied to one vendor's systems. Also consider the community that has developed around the platform; an active community helps ensure that support for new relevant technology will be quickly added to the platform. This is especially true of platforms based on open source software, where user-programmers contribute and enhance new tools.4
Middleware is an application layer that provides uniform interfaces to distributed components of the digital library.5 It helps tie together the storage, delivery, searching, and browsing of electronic resources. By wrapping the digital library's core services inside a middleware layer, existing and new resources can be more easily integrated into the digital library. This scalability is achieved through a middleware architecture that "brokers" communication between components.
The broker model has been used to consolidate and generalize access management in digital library systems.6 Components providing access to patrons who might be authorized in various directories can communicate (via one interface) with a broker that handles the complexity of multiple directory interfaces. When a directory is added or changed, only the broker is affected rather than every component requiring patron authentication. Similarly, a broker can handle authentication into various remote resources so that access service components (and patrons) don't have to know how to log in to all the resources comprising the library.
We have found this model also helps solve other kinds of digital library integration problems. For example, a component can query a broker when a static name is needed for a resource, rather than getting the name from different services depending on what kind of resource is being named. Similarly, components that aren't designed to look up shared information can be synchronized through an attribute broker when the information they need changes.
To keep the interfaces simple, brokers should not require special protocols for communication. Programming inter-process communication, especially in a distributed environment, can be difficult and error-prone unless your framework takes care of the messy details of sending request/response messages between programs. HTTP is a common request/response transport protocol that most frameworks support, and it is especially useful when Web pages need to link to the broker service. For example, a broker service can take a static name embedded in a URL and redirect the requester (which could be a patron's Web browser or another digital library service) to the correct resource.
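The redirect behaviour described above can be sketched in a few lines. This is an illustration only, written in Python rather than the Java used by ALADIN. The `/resolve/` URL prefix, the name table and the example.org locations are all assumptions made for the sketch, not details of the actual naming service.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical name table: static resource name -> current location.
RESOURCE_LOCATIONS = {
    "catalog": "https://catalog.example.org/",
    "ejournals": "https://journals.example.org/search",
}

PREFIX = "/resolve/"  # assumed URL layout for this sketch


def resolve(path):
    """Map a request path such as /resolve/catalog to (status, location)."""
    if not path.startswith(PREFIX):
        return 404, None
    location = RESOURCE_LOCATIONS.get(path[len(PREFIX):])
    if location is None:
        return 404, None
    return 302, location


class NameBrokerHandler(BaseHTTPRequestHandler):
    """Answers GET requests with a redirect to the resource's current home."""

    def do_GET(self):
        status, location = resolve(self.path)
        self.send_response(status)
        if location is not None:
            self.send_header("Location", location)
        self.end_headers()
```

When a resource moves, only the entry in the name table changes; pages that link to `/resolve/catalog` keep working unmodified.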
The idea of using HTTP to carry messages between software components is fundamental to the emerging concept of Web services.7 The Simple Object Access Protocol (SOAP) has been developed to send XML requests and responses over standard protocols like HTTP between Web services requester and provider components.8 Using SOAP to communicate between digital library components and brokers allows them to be integrated more rapidly and less expensively because you don't need to be concerned with network protocols or message formats. In addition, SOAP is proving to be a truly cross-platform and cross-language remote procedure call protocol, with tool kits available for many languages including Java, C++, Perl, PHP, and Visual Basic and other Microsoft languages.
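To make the message format concrete, the following sketch builds and reads a SOAP 1.1 request envelope with Python's standard library. In practice a SOAP tool kit generates and parses these messages for you. The method name `lookupPatron`, the `id` parameter and the `urn:example:patron-broker` namespace are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
BROKER_NS = "urn:example:patron-broker"                 # hypothetical service namespace


def build_request(method, params):
    """Wrap a method call and its parameters in a SOAP envelope (as text)."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{BROKER_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_call(xml_text):
    """Extract (method name, parameters) from a SOAP request envelope."""
    root = ET.fromstring(xml_text)
    call = root.find(f"{{{SOAP_ENV}}}Body")[0]
    method = call.tag.split("}", 1)[1]  # drop the namespace part of the tag
    return method, {child.tag: child.text for child in call}


request_xml = build_request("lookupPatron", {"id": "192.0.2.10"})
```

The resulting text would be sent as the body of an HTTP POST. Requester and provider only need to agree on the method name and its parameters.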
The Washington Research Library Consortium (WRLC) digital library system is called ALADIN (Access to Library And Database Information Network). It provides content and services for seven medium-sized academic research libraries, including over 500 subscription databases, digital collections (images, audio and metadata), and library catalogs. The full-time equivalent of one and a half systems programmers is dedicated to developing and maintaining ALADIN. To manage the digital library integration problem with this relatively small technical staff, we have developed a middleware architecture employing a standard framework and several broker services.
The Apache Web and Java application servers provide basic framework tools for ALADIN.9 Additional Java class libraries provide tools for HTML templates, XML processing, and connectivity to Oracle and MySQL databases. Electronic resources are defined in a configuration database which is used to build menus of resources, resolve resource names, and authenticate patrons to remote resources. Integrating a new resource can often be done by simply defining in the configuration database which menus it should appear on, where on the Internet it is located, and how to log in to it.
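As a rough illustration of configuration-driven integration, the sketch below uses an in-memory SQLite database in place of the Oracle or MySQL configuration database. The table names, columns, sample rows and `login_method` values are invented for the example and do not reflect the actual ALADIN schema.

```python
import sqlite3


def open_config():
    """Create a toy configuration database (hypothetical schema and data)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE resource (name TEXT PRIMARY KEY, title TEXT,
                               url TEXT, login_method TEXT);
        CREATE TABLE menu_entry (menu TEXT,
                                 resource_name TEXT REFERENCES resource(name));
    """)
    db.executemany("INSERT INTO resource VALUES (?, ?, ?, ?)", [
        ("catalog", "Library Catalog", "https://catalog.example.org/", "none"),
        ("abstracts", "Example Abstracts", "https://abstracts.example.com/", "proxy"),
    ])
    db.executemany("INSERT INTO menu_entry VALUES (?, ?)", [
        ("main", "catalog"), ("main", "abstracts"), ("science", "abstracts"),
    ])
    return db


def menu(db, menu_name):
    """Return (title, url) pairs for one menu, sorted by title."""
    return db.execute(
        "SELECT r.title, r.url FROM menu_entry m "
        "JOIN resource r ON r.name = m.resource_name "
        "WHERE m.menu = ? ORDER BY r.title", (menu_name,)).fetchall()


def add_resource(db, name, title, url, login_method, menus):
    """Integrate a new resource by adding configuration rows only."""
    db.execute("INSERT INTO resource VALUES (?, ?, ?, ?)",
               (name, title, url, login_method))
    db.executemany("INSERT INTO menu_entry VALUES (?, ?)",
                   [(m, name) for m in menus])
```

Adding a resource here requires no code change: one call to `add_resource` makes it appear on the chosen menus.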
Patron access to the resource involves the following steps:
The naming and access control services provide seamless access for these kinds of resources by using standard HTTP get, post and redirection messages. ALADIN services that combine resources in different ways or provide specialized functions require more complex integration solutions. Some examples show how the broker model, Web services and a standard framework have helped integrate these components:
To support a consortium of universities, ALADIN must have a flexible authentication service capable of searching multiple directories. Currently, patron information from member institutions is loaded into a union catalog, but we are exploring the possibility of accessing patron information as needed from campus directories. Also, patrons must be able to log in explicitly with different IDs (library barcode, institution ID, SSN) or implicitly based on their on-campus IP address. Several ALADIN components require authentication information including the main menu system and a personal portal (both written in Java), and a telnet-based menu for ASCII terminals (written in Perl).
To isolate ALADIN components from authentication changes such as new campus IP networks or student directories, a broker service was created to handle patron lookups. The broker interface has been exposed as an Apache SOAP service so it can be accessed by Java servlets and Perl scripts. Lookup requests consist of an identification string which can be an IP address or any kind of patron ID, and the broker takes care of choosing which directories to use and handling the various interfaces (database or file searching).
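A minimal sketch of such a lookup broker follows, assuming in-memory dictionaries in place of real directories. The network range, patron records and function names are hypothetical. The point is only that callers pass a single identification string and never learn which directory answered.

```python
import ipaddress

# Hypothetical directories; a real broker would query databases or files.
CAMPUS_NETWORKS = {"Example University": ipaddress.ip_network("192.0.2.0/24")}
BARCODE_DIRECTORY = {
    "21234000123456": {"name": "A. Patron", "institution": "Example University"},
}
INSTITUTION_ID_DIRECTORY = {
    "E1234567": {"name": "B. Reader", "institution": "Example University"},
}


def lookup_by_ip(ident):
    """Implicit login: match an on-campus IP address to its institution."""
    try:
        address = ipaddress.ip_address(ident)
    except ValueError:
        return None  # not an IP address, so let the other directories try
    for institution, network in CAMPUS_NETWORKS.items():
        if address in network:
            return {"name": None, "institution": institution}
    return None


def lookup_patron(ident):
    """Single broker entry point: try each directory in turn."""
    directories = (lookup_by_ip, BARCODE_DIRECTORY.get, INSTITUTION_ID_DIRECTORY.get)
    for directory in directories:
        record = directory(ident)
        if record is not None:
            return record
    return None
```

Adding a campus network or a new directory changes only the broker's tables and its `directories` tuple, not the components that call `lookup_patron`.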
Our audio collections use a free version of RealServer which does not support any access control. To restrict access to authenticated patrons (while providing them access from anywhere in the digital library), we created a RealServer authorization broker that logs in the patron and returns a metadata file to stream the requested audio file from the collection via a temporary file system link to the content. The file system link is destroyed when the patron's session is closed or expires.
This broker takes advantage of the ALADIN scripted login mechanism described above, so no new interface was required. When an authenticated patron requests an audio file, ALADIN retrieves the login variables from the configuration database and redirects the patron's browser to post them to the authorization broker via HTTP.
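The temporary-link technique can be sketched as follows, assuming a POSIX file system with symbolic links. The class name, the random link names and the `rtsp://media.example.org` base URL are illustrative assumptions. Authentication and session expiry timers are left out.

```python
import os
import tempfile
import uuid


class AudioLinkBroker:
    """Creates per-session symbolic links to audio files and removes them later."""

    def __init__(self, link_dir, stream_base_url):
        self.link_dir = link_dir                # directory the media server streams from
        self.stream_base_url = stream_base_url  # hypothetical public URL of that directory
        self.links = {}                         # session id -> list of link paths

    def authorize(self, session_id, content_path):
        """Link the content under an unguessable name and return metadata text."""
        extension = os.path.splitext(content_path)[1]
        link_name = uuid.uuid4().hex + extension
        link_path = os.path.join(self.link_dir, link_name)
        os.symlink(content_path, link_path)
        self.links.setdefault(session_id, []).append(link_path)
        # The metadata file simply points the player at the temporary link.
        return f"{self.stream_base_url}/{link_name}\n"

    def close_session(self, session_id):
        """Remove every link created for a session that closed or expired."""
        for link_path in self.links.pop(session_id, []):
            if os.path.islink(link_path):
                os.remove(link_path)
```

Because only the link is removed, the collection files themselves are never exposed under a stable, shareable URL and are untouched when a session ends.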
A special ALADIN service allows patrons to request articles from journals in the catalog. Prospero is used to deliver scanned articles via the Web. Prospero is designed as a stand-alone delivery system with its own user file and login ID and password. It uses the patron's email address internally to associate patrons and documents. In order to allow patrons to request and access the documents through ALADIN, an attribute broker was created to synchronize user information between ALADIN and Prospero.
The attribute broker passes data both ways: Patron email updates are written to Prospero's user and manifest files, and information about available documents is summarized on patrons' portal home pages. In this case the broker is not managing multiple resource interfaces, but is translating the location-dependent file system interface to a SOAP service that can be accessed by distributed components. ALADIN components that manage patron information have a framework-supported interface for synchronizing that information with Prospero and are isolated from changes or upgrades to the Web document delivery system.
The Electronic Journal Title Finder (EJTF) is an ALADIN service that lets patrons search for specific journal titles in order to find electronic or physical copies among the digital library collections. The information comes from the union catalog, a half dozen vendors of aggregated databases and other sources. Not only are there different interfaces to each data source, but some of the interfaces are too slow to query in real time for every search. So, instead of using a broker to perform a federated search over the data sources, a harvester is run every week to collect the information and consolidate duplicate titles.
An interface to the harvested metadata is required. For performance, an XML file is created which the discovery service can load at startup, so it doesn't need to perform a network request or file system operation for each search. XML support is supplied by the framework via the JDOM class library for reading and searching the data.
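The harvest-then-load approach might look like the sketch below, which uses Python's ElementTree in place of JDOM. The XML element names, the title normalization rule and the sample records are assumptions for illustration. The real harvester's matching rules are not described here.

```python
import xml.etree.ElementTree as ET


def consolidate(records):
    """Merge (title, source) pairs whose titles differ only in case or spacing."""
    merged = {}
    for title, source in records:
        key = " ".join(title.lower().split())  # assumed normalization rule
        entry = merged.setdefault(key, {"title": title.strip(), "sources": []})
        if source not in entry["sources"]:
            entry["sources"].append(source)
    return list(merged.values())


def to_xml(journals):
    """Serialize the consolidated list; the harvester would write this to a file."""
    root = ET.Element("journals")
    for journal in journals:
        element = ET.SubElement(root, "journal", title=journal["title"])
        for source in journal["sources"]:
            ET.SubElement(element, "source").text = source
    return ET.tostring(root, encoding="unicode")


class TitleFinder:
    """Parses the harvested XML once, then answers searches from memory."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def search(self, term):
        term = term.lower()
        return [(j.get("title"), [s.text for s in j.findall("source")])
                for j in self.root.iter("journal")
                if term in j.get("title").lower()]
```

Each search is a scan of the in-memory tree, so no data source is contacted at query time. The trade-off is that results can be up to a week old.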
As institutions develop campus portals to consolidate access to services, the digital library may be asked to provide content. ALADIN's framework includes the Freemarker template engine. With this support we were able to provide patron information to the campus portal at American University by simply creating a new template that wrapped the data in XML tags instead of HTML. The American portal then formats an HTML page that conforms to their display standards.
When myALADIN receives a request for patron information, it gathers it from various ALADIN services and instantiates a Template class object to format the response. If the request came from my.american.edu, the Template object is constructed from an XML template which uses ColdFusion's Web Distributed Data eXchange (WDDX) tags; otherwise the response will be formatted with HTML markup for a Web browser.
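The template switch can be illustrated with Python's `string.Template` standing in for the Freemarker engine. The patron fields, the HTML wording and the exact WDDX packet layout shown are assumptions made for this sketch, not the templates used by myALADIN.

```python
from html import escape
from string import Template

# Hypothetical templates: the same data, two output formats.
HTML_TEMPLATE = Template("<p>$name has $loans items on loan.</p>")
WDDX_TEMPLATE = Template(
    "<wddxPacket version='1.0'><header/><data><struct>"
    "<var name='name'><string>$name</string></var>"
    "<var name='loans'><number>$loans</number></var>"
    "</struct></data></wddxPacket>"
)


def render(patron, requester_host):
    """Pick the template by requester and fill it with patron data."""
    if requester_host == "my.american.edu":
        template = WDDX_TEMPLATE  # campus portal wants data, not markup
    else:
        template = HTML_TEMPLATE  # ordinary Web browser
    return template.substitute(name=escape(patron["name"]),
                               loans=int(patron["loans"]))
```

The code that gathers patron information is identical in both cases. Supporting another portal would mean adding a template and one more branch, not a new service.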
The WRLC has found that a middleware architecture based on a Java framework and the "broker model" of component interfaces reduces the effort required to integrate disparate digital library components. The framework allows us to leverage existing tools and minimize the custom code required to program component interfaces. Borrowing technology from the arena of Web services, we have constructed broker services that present a single, uniform interface for multiple components to access a variety of content and services. We have been able to integrate many new and updated resources and services in the digital library with relatively minor changes that are limited to the configuration database or a middleware broker service. The core services we have built have been relatively stable and unchanged since they were first developed using this architecture 1½ years ago.