Data Architecture in an Open Systems Environment

Copyright 1993 CAUSE. From _CAUSE/EFFECT_ Volume 16, Number 4, Winter 1993. Permission to copy or disseminate all or part of this material is granted provided that the copies are not made or distributed for commercial advantage, the CAUSE copyright and its date appear, and notice is given that copying is by permission of CAUSE, the association for managing and using information resources in higher education. To disseminate otherwise, or to republish, requires written permission. For further information, contact Julia Rudy at CAUSE, 4840 Pearl East Circle, Suite 302E, Boulder, CO 80301 USA; 303-939-0308; e-mail: jrudy@CAUSE.colorado.edu

DATA ARCHITECTURE IN AN OPEN SYSTEMS ENVIRONMENT

by Gerald Bernbom and Dennis Cromwell

ABSTRACT: This article describes the conceptual basis for a structured data architecture, and its integration with the deployment of open systems technology at Indiana University. The key strategic initiatives which brought these efforts together are discussed: commitment to improved data access, migration to relational database technology, deployment of a high-speed, multi-protocol network, and orientation to workstation-centered computing.

Three major goals in Indiana University's strategic planning for University Computing Services (UCS) are: to extend the network to every user's desktop; to make each user's workstation a window to the entire array of the University's computing and information resources; and to improve access to information for students, faculty, and staff.

Several projects and initiatives have been developed from these strategic planning goals. The goal to improve access to information has led to new investments in relational database technology, the introduction of data-oriented development methodologies, and the focus of organizational resources on data administration and information access. The development of a University-wide data architecture is part of the response to the goal of widespread information access.

The goals of network-based and workstation-centered computing have led to initiatives in network infrastructure installation, designation and support of standard workstation and local area network products, innovative user-support programs based on a distributed support model, and the focus of organizational resources on network and workstation technologies. Development of a model for an open systems environment is part of the response to these goals, and is a recognition of the technological diversity that is inherent in a networked and workstation-based computing environment on a college or university campus.

Taken together, our work in data architecture and open systems is intended to apply the best technology available to meeting individual users' needs for access to information.

DATA ARCHITECTURE

The purpose of a data architecture is to organize and present the institution's data in the way best suited to the varying needs of its many users. Data architecture is a design and analysis technique based on the macro-level analysis of data within an enterprise. The result is a general-purpose model for the structure and deployment of data on an enterprise-wide basis. A data architecture model describes variations in the general nature of data within the enterprise: detail data vs. summarized data, real-time data vs. point-in-time ("snapshot") data, and so forth.
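These distinctions can be made concrete in a short sketch. The Python fragment below is our own minimal illustration, not part of the original architecture documents; the store names are hypothetical, not drawn from Indiana University's systems. It simply tags example data stores with the categories just described:

    from dataclasses import dataclass
    from enum import Enum

    class Nature(Enum):
        """General nature of data, per the data architecture model."""
        DETAIL = "detail"
        SUMMARIZED = "summarized"
        REAL_TIME = "real-time"
        SNAPSHOT = "point-in-time (snapshot)"

    @dataclass(frozen=True)
    class DataStore:
        name: str
        nature: frozenset  # one or more Nature values

    # Hypothetical examples: a live registration file is real-time detail;
    # a census file is a summarized, point-in-time snapshot.
    registrations = DataStore("course_registrations",
                              frozenset({Nature.DETAIL, Nature.REAL_TIME}))
    census = DataStore("student_census",
                       frozenset({Nature.SUMMARIZED, Nature.SNAPSHOT}))

The intended-use dimension, discussed next, could be cataloged the same way.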
The data architecture model also takes note of the variations in intended uses of data: transaction processing, customer service, management reporting, decision support, institutional research, external reporting, etc. In addition, the data architecture specifies standards for information synthesis: how summarized data are derived from detail data, how standard points-in-time are determined for creating "snapshot" databases, and so on. The resulting model helps to establish authoritative data sources, to define stable relationships among multiple instances of the same data, and to assure predictable transformations of data throughout the steps of information synthesis.

Finally, the data architecture model assists in technology selection and evaluation. By knowing the general form of data, its intended use, and some of the associated performance requirements, the data architecture can help specify the standard or recommended technologies for deployment of different categories of data.

Several threads of discussion in the information technology industry have influenced our thinking about this set of data architecture issues. The first to come to our attention was the work of Bill Inmon on data architecture and, more recently, the data warehouse concept.[1] Subsequent to much of Inmon's pioneering work, IBM unveiled its "Information Warehouse" framework.[2] And, in partnership with IBM, Information Builders, Inc. developed its product offering of Enterprise Data Access/SQL (EDA/SQL). The final industry thread that we have followed, and one that looks especially promising in terms of where we want to take the data architecture of Indiana University, is Apple Computer's VITAL (Virtually Integrated Technical Architecture Lifecycle), a conceptual model of information systems design and information access.[3]

An approach to data architecture design

In some ways, the clearest way to introduce the concept of data architecture is to look at one. Figure 1 is the basic data architecture model developed for institutional data at Indiana University.

Figure 1: Data Architecture Model
[FIGURE NOT AVAILABLE IN ASCII TEXT VERSION]

The first key point to note is that the model divides data into three strata: operational, information detail, and what are called "collections." At the lowest level (operational) are those data used in the day-to-day operations of the enterprise: business transaction processing, data capture/data entry, real-time customer service, and so forth. At the second level (information detail) are the idealized representations of an enterprise's data, minus the redundancies and anomalies that are a natural part of any set of operational information systems, but still having a nearly one-to-one correspondence with the business objects represented by operational data and still maintained at the finest level of granularity. At the highest level (collections) are found synthesized information products, derived from lower-order and more detailed data sources. These collections may be summarizations of detail data, selected extracts, aggregations from multiple detail sources, etc.

The metaphor applied to this three-level model is:

  Collections        -- Retail
  Information detail -- Wholesale
  Operational        -- Manufacturing

Three other points about the model are notable. First, data move in one direction within the model; the process is one of refinement and synthesis, and there is no downward flow of data in this model and no back-loading of data from higher-order forms to lower.
Second, all business transaction (update) processing occurs at the lowest level in the model, the operational; information detail and standard collections are read-only data and are never updated directly. Finally, the model recognizes the need to integrate local data with enterprise data, both as a point-of-origin (in local operational data) and as a final destination (in collections of information for local analysis and decision support).

Data architecture and Indiana University

The development of the data architecture concepts and their implementation is a long-term process for Indiana University. From our current vantage point we see the process as having four steps: (1) physically separate operational data from information detail, as necessary; (2) refine the information detail; (3) develop standard collections; and (4) integrate enterprise data with local data.

Step 1: Separate operational data from information detail

In this step, separate physical storage locations and formats are created for operational data and information detail data. Separate storage is often justified, and sometimes required, because of the way these two classes of data differ from each other in terms of typical use and performance requirements.

Operational data are typically retrieved one record at a time; retrieval is of a specific record or set of records, identifiable by key; and records are frequently inserted or retrieved for update processing. Also, operational data are typically retrieved interactively--for online transaction processing or real-time information display--and so require response time in seconds or faster.

Information detail data are typically retrieved in sets of many records; defined sets of records are typically retrieved based on secondary (non-key) characteristics; and records are virtually never retrieved for update processing--all access is read-only. For most information detail purposes--analysis, reporting, and decision support--response time in minutes, if not hours, is sufficient.

Given the different uses and performance requirements, separate physical implementations can make use of data design techniques appropriate to the needs of each class of data. For example, operational data can be stored in highly navigable databases tuned for online performance and single-record retrieval, while information detail might be stored in relational databases having multiple secondary indexes for searching and set selection.

One more reason for separate physical implementations can be the need for information detail to have a stable set of values over some period of time. Operational databases are in a constant state of flux; they are the place where the business transactions of the enterprise are recorded in real time. Reporting, analysis, and decision support require some stability of data so that comparative analyses can be performed, iterative analysis can pursue a line of inquiry on a stable set of data values, and the same results can be replicated by multiple users of the same data. So for some applications, a requirement of the information detail may be that it is a time-variant replication of its operational data counterpart, having stable data values from a defined point in time.

At Indiana University, a two-part environment of transaction-oriented, operational data and fixed-value, replicated information detail data was first implemented in the early 1980s. (A minimal sketch of such a replication step appears below.)
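Here is that sketch: a short Python fragment, our illustration only, using the standard library's sqlite3 and hypothetical table names rather than Indiana University's actual implementation. It performs a one-way, point-in-time refresh from an operational table into a stable, read-only information detail table:

    import sqlite3
    from datetime import date

    def replicate_detail(conn: sqlite3.Connection, as_of: date) -> None:
        """One-way refresh: operational -> information detail.

        Data flow upward only: nothing is ever written back from the
        detail replica to the operational table, and between refreshes
        users treat the replica as read-only, stable data.
        """
        stamp = as_of.isoformat()
        # Rebuild the time-variant replica for the given point in time.
        conn.execute("DELETE FROM detail_enrollment WHERE as_of = ?", (stamp,))
        conn.execute(
            """INSERT INTO detail_enrollment
                   (student_id, course_id, credit_hours, as_of)
               SELECT student_id, course_id, credit_hours, ?
               FROM op_enrollment""",
            (stamp,),
        )
        conn.commit()

    # Toy setup, then a snapshot dated to a hypothetical census day.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE op_enrollment (student_id, course_id, credit_hours);
        CREATE TABLE detail_enrollment (student_id, course_id, credit_hours, as_of);
        INSERT INTO op_enrollment VALUES ('S1', 'MATH-M118', 3), ('S1', 'ENG-W131', 3);
    """)
    replicate_detail(conn, date(1993, 9, 1))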
This initial investment in data design has had great value over the past ten years, providing a widely used end-user reporting environment. It also forms the foundation on which further stages of the data architecture model can be built.

Step 2: Refine the information detail

The second step, at least for Indiana University, is to refine the information detail data environment. In an organization just beginning to implement a data architecture model, many of these refining tasks might be accomplished in the initial design of information detail data. Our chief tasks of data refinement are development of a University-wide, conceptual data model; migration of the information detail data environment from indexed and sequential storage formats to relational database formats; and resolution of anomalies and redundancies in the data structures by determining authoritative sources for each data element.

A project to develop a University-wide conceptual data model, in four component pieces, was begun in 1991. The first two components, University-wide conceptual models for student data and for University financial data, have been completed. Also begun in 1991 was the migration of data from older storage formats to relational database format. Our first efforts were focused on creating a relational design and implementation for faculty and staff data, and associated payroll data, in an employee database. Current efforts are focused on implementing a relational design for general ledger, chart of accounts, and financial transaction processing data, and on resolving anomalies that exist between University accounting and budget data. Relational designs have also been implemented for other, smaller information systems, including travel, contracts and grants, capital assets, and others.

Step 3: Develop standard collections

Using the three-level model described earlier, this step in a data architecture development is intended to organize and package data in more readily usable forms. The process is one of synthesizing information from the information detail. The chief tools of synthesis are summarization, aggregation, selection, and point-in-time representation. All these are transformations of data that one typically finds in user-designed reports, queries, or data selection/extract programs.

The term "collection" refers to any selected or synthesized set of data derived from the underlying information detail. Collections may be implemented as physical entities, where the results of selection, summarization, and so on are saved and stored as a physical database. Or collections may be virtual entities--data views--where the same rules of selection, summarization, etc. are applied to a relational database at query time using stored SQL.

The chief intent behind "standard" collections is to establish shared definitions for certain transformations of data: the calculation of grade-point average based on hours and honor points, the set definition of a full-time student or part-time employee, the accepted intersection of faculty data and course data for use in measurement of teaching load, and so on. Shared definitions, and their implementation in a data architecture, benefit the data user by providing commonly needed information, already synthesized and ready to use; they benefit the institution by helping assure that all reports and analyses have access to the same interpretation of the underlying detail data.
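To illustrate a standard collection implemented as a virtual entity, the following sketch (ours, continuing the toy sqlite3 example above with hypothetical names) stores the grade-point-average rule once, as a data view, so that every report selecting from the view inherits the same shared definition:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE detail_grades (student_id, course_id, credit_hours, honor_points);
        INSERT INTO detail_grades VALUES
            ('S1', 'MATH-M118', 3, 12.0),  -- an A: 4.0 x 3 hours
            ('S1', 'ENG-W131',  3,  9.0);  -- a B: 3.0 x 3 hours

        -- The "standard collection" as a virtual entity: the GPA
        -- calculation (honor points over hours) is defined once, here,
        -- rather than in every user's report or extract program.
        CREATE VIEW std_gpa AS
            SELECT student_id,
                   SUM(credit_hours) AS hours,
                   SUM(honor_points) AS honor_points,
                   SUM(honor_points) / SUM(credit_hours) AS gpa
            FROM detail_grades
            GROUP BY student_id;
    """)
    for row in conn.execute("SELECT * FROM std_gpa"):
        print(row)  # ('S1', 6, 21.0, 3.5)

A physical collection would simply materialize the same SELECT into a stored table; the defining rule is identical either way.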
Moreover, a standard collection, implemented as a physical database, is desirable when some set of summarizations, selections, or aggregations is repeated so often that significant computing resources are being spent many times over by many different users executing the same program code. This is especially true of complex and often-used set selections like: find all students enrolled in the current semester, find all accounts with outstanding balances greater than thirty days, etc. Finally, implementation of a collection as a physical database in the data architecture is simply a practical requirement when a collection is meant to reflect a "snapshot" point in time.

Our progress toward formalizing this step has come slowly. Within selected data subject areas Indiana University has had a long history of using standard collections, in particular for point-in-time snapshots (e.g., student census reporting files). The introduction of relational database technology created the opportunity to use logical data views as a means of implementing standard collections. In our current work with designing financial data in relational format we are evaluating the costs and efficiencies of storing collections as physical files vs. the creation of logical data views, in order to refine our decision criteria for how best to implement these standard collections.

Step 4: Integrate with local data

The data architecture, as a representation of University-wide data, needs to be seen as an open architecture with respect to local stores of data that exist throughout the organization. These local data can be a primary source for the University's operational data; they feed the operational information systems of the enterprise. They can also be the final destination for synthesized information as used in decision support and analysis at the local level.

The enterprise cannot pretend to manage data centrally that are really managed locally. The purpose for placing local data within the architecture is to show the points of interface and to identify the need for these to be managed. As a source for enterprise-wide operational data, the key requirement for interface to local operational data is that local data must be passed through steps of validation and authentication as a prerequisite for their acceptance as enterprise-wide data. As a final destination for data, the key requirements are to encourage use of standard collections as the basis for building local collections (i.e., encourage use of shared definitions, where such exist) and, if this cannot meet the local need, to assure that information detail (and not operational data) is used to populate local collections.

OPEN SYSTEMS

The mission of IU's University Computing Services and its initiatives for information access, network expansion, and workstation-based computing were factors that moved us to choose an "open systems" approach as the best direction for our information technology environment. Based on a belief that individual institutions must create their own working definition for open systems, Indiana University has identified five key points to define the open systems environment we are developing.

First, the technology must be standards-based, whether these are national or international standards, such as those sanctioned by ANSI and ISO, or de facto standards based on industry acceptance and widespread use.
Second, the technology must support a multi-vendor environment, providing interoperability between platforms and, in some cases, portability of applications across platforms. Third, the open systems environment must make use of available technologies, those that are on the market and generally available. Fourth, the technologies chosen should show signs of being sustainable over time within the industry, as evidenced by such factors as market share, degree of penetration, vendor viability, and vendor commitment. Finally, the technologies should be oriented to the workstation, enabling the integration of information at the desktop. Indiana has committed to support a heterogeneous (multi-vendor) desktop environment, and selections of open systems technologies must support this with a user interface that is native to each of the supported desktop devices.

To provide some context for the "open systems" criteria we are using, here are two definitions from industry sources, each of which provided some basis for the definition in use at Indiana:

From Hewlett-Packard: "[Open systems are] software environments consisting of products and technologies which are designed and implemented in accordance with standards--established and de facto--that are vendor independent and commonly available."[4]

From Gartner Group: "[An open system is] an information processing environment based upon application programming interfaces and communication protocol standards which transcend any one vendor and have significant enough vendor product investment to assure both multiple platform availability and industry sustainability."[5]

Strategy for open systems: technology coexistence

The strategy that Indiana University is pursuing for implementing its open systems environment is one best described as "technology coexistence," and is based on a framework of five guiding principles.

Chief among these is that the network is the key to deploying open systems. It is the integrating and communicating force in the environment, providing access to multiple services and carrying several protocols in support of these services.

A second principle is concentration on the user view of the technology: each workstation performs multiple functions in our view of an open systems environment; the functions are performed with an interface that is native to the user's desktop device; and the desktop represents the integration point for services to the user.

Migration of data to relational database format is a third principle. In order for a coexistence strategy to work, and to enable network access to data across multiple technology platforms, a common relational format for the University's data is a requirement.

And a related principle is the deployment of gateway technology that links proprietary systems with the open systems environment. Especially as it pertains to IBM's role in an open systems picture, gateways are the path to more open communication with MVS/ESA (the operating system of our largest central server mainframe); they also permit us to take advantage of DB2 as a database engine and its outstanding ability to manage and serve large volumes of data.

A final principle, which encapsulates much of the rest, is that a mix of proprietary and non-proprietary technologies is required for a successful open systems environment. This is the heart of the coexistence strategy.
It builds on current strengths, it leverages past investments without being bound by them, it makes use of available or obtainable skills, and it tries not to induce trauma while introducing change to the organization. In particular, it tries to protect user investment in technology (the purchase of desktop devices, design and implementation of traditional information systems, training in tools of information access and analysis, etc.), while trying not to shackle the computing organization to unlimited and open-ended support of all conceivable technologies.

Other open systems strategies

Technology coexistence is not the only path to an open systems solution. Three other strategies for transition to open systems can be defined and might be considered by any institution planning a move to open systems.

"Freeze and Rebuild" is a fairly flexible transition strategy, at least in terms of the pace at which it is implemented. All information systems and applications based on "old" technologies are frozen; no new investment is made in their maintenance or enhancement. All new investment is made in applications based entirely on open systems technologies. This strategy works best in situations where there is a full suite of well-defined open standards and a clear direction toward their implementation. The strategy does carry with it a relatively high cost, both in terms of lost opportunities, as legacy systems fail to change with the changing needs of their users, and in terms of resource outlay for redevelopment of a completely new set of enterprise applications.

The second transition strategy, "Clearcut and Reseed," is the most extreme. The plug is pulled on the old system and the new technology is rolled in. One day there are legacy systems, and the next day nothing but open systems. This strategy can work at the level of an individual application or a closely related set of applications, or in a small and relatively simple organization. But it is very expensive, and is probably impractical as a way to revolutionize technology on an enterprise-wide basis.

A third strategy is to simply "Wait and See." If open systems are not well enough defined (if the problem with standards is that you have too many to pick from), just wait until the picture clarifies itself. In truth, this strategy may not be as safe as it sounds. One of the benefits of an open systems environment is the leverage it provides in negotiation with hardware and software vendors; even with some proprietary components, there are choices and alternate paths to almost any implementation. This leverage is lost to an enterprise that is not pursuing some type of multi-vendor approach to open systems. Moreover, the organization that waits too long may end up being forced into open systems anyway, at a time and pace not of its own choosing, and will ultimately be forced to play catch-up.

Technology suite at Indiana University

In establishing its open systems strategy of coexistence, Indiana University has selected and deployed products and technologies at four levels: the workstation, the network, the central server, and the database management system. Figure 2 shows the mix of these technologies and their interrelationships.

Figure 2: Technology Suite at Indiana University
[FIGURE NOT AVAILABLE IN ASCII TEXT VERSION]

At the workstation level, Indiana is committed to supporting a heterogeneous mix of desktop computers. Machines running DOS or Windows 3.x are the most prevalent systems on our campuses.
There is also a strong Macintosh presence on more than one of our campuses, and the last two years have seen a growing base of UNIX workstations deployed as individual desktop devices.

Indiana University operates a state-wide data network among its eight campuses and has promoted the deployment on each campus of a high-speed intra-campus network. The network carries multiple protocols, most significant among them TCP/IP, IPX, and AppleTalk.

At the level of central servers, Indiana operates a large-scale MVS/ESA processor which, although proprietary in nature, does support TCP/IP access using Interlink Access MVS. The University sees an increasing role for UNIX systems as host processors for enterprise data and application servers. We have never seen UNIX as an absolute prerequisite for open systems, but its impressive price-for-performance characteristic makes it very attractive. We have most recently begun using HP-UX systems from Hewlett-Packard, based on their ability to support a secure and production-quality enterprise computing environment. Finally, Novell is a critical component today as a network file server, and as we watch developments in the arena of Novell/SQL database technology, we remain aware of its possibilities as a database server.

Regarding central server database technology, our basic investments today are in Sybase as the relational database on UNIX platforms, and in DB2 as the relational database on the MVS platform, with continued support for VSAM as a data storage and access technology where needed.

This technology suite represents products chosen by Indiana University using our definition of open systems. Not every technology can score high on every factor, but each technology is among the best in its class for meeting the criteria of an open systems environment. For instance, DB2 and Novell are products that we consider best in their respective classes, although they are not always considered within the context of open systems. These products are very robust; they have tremendous market share, which makes them available and sustainable technologies; they offer support of de facto industry standards; and they provide interoperability, often based on third-party vendor products.

MAPPING TECHNOLOGIES TO THE DATA ARCHITECTURE

In its evaluation and selection of technologies for its open systems future, and especially as regards the use of technology to support a data architecture, Indiana University faces questions at three levels of specificity:

1. What will be the suite of technologies we support, and how will we choose them?

2. How will we select one (or more) of these technologies to deploy for one of the functions within the data architecture? [Examples: What database(s) will be supported for enterprise-wide operational information systems? What technology might be recommended for local data collections?]

3. How will we select from among the chosen and supported technologies (assuming more than one is supported for any general function) the correct technology for a specific application?

We have made good progress on questions at the first two levels of specificity. A suite of technologies has been chosen for most key technology families: networks, central server databases, etc. As each technology is chosen (Question 1) it is also evaluated and slotted for deployment in one or more of the general data architecture functions (Question 2).
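In code form, the result of this slotting can be pictured as a simple lookup table. The Python sketch below is a hypothetical reconstruction, consistent with the technologies named in the text but not a transcription of Figure 3, which remains the authoritative mapping:

    # Hypothetical slotting of DBMS technologies into data architecture
    # components (illustrative only; see Figure 3 for the actual mapping).
    ARCHITECTURE_MAP = {
        "operational":        {"DB2", "VSAM"},           # central server update processing
        "information detail": {"DB2", "Sybase"},         # relational, read-only
        "collections":        {"DB2", "Sybase"},         # physical tables or stored views
        "local data":         {"user-selected tools"},   # local operational and analytic stores
    }

    def candidates(component: str) -> set:
        """Question 2: which supported technologies are slotted for a component?"""
        return ARCHITECTURE_MAP[component]

    # Question 3 -- choosing among candidates for a specific application --
    # still requires the decision rules discussed under "Next Critical Events."
    print(sorted(candidates("information detail")))  # ['DB2', 'Sybase']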
The criteria we have used to select and assign a technology include the basic criteria for any technology in an open systems environment: (1) standards based; (2) workstation oriented; (3) supports a multi-vendor environment; (4) based on available technologies; and (5) based on sustainable technologies. In addition, we have applied criteria related to resources and constraints: (1) user needs and functionality; (2) performance and cost; (3) established expertise within the organization, or its availability; (4) degree of integration with existing technologies required, and the degree actually possible; and (5) user preference.

One of the next critical steps, as mentioned below, is our need to develop better decision rules for the selection from among available technologies, where more than one technology is deployed in support of a single data architecture function.

In the area of database technology, Figure 3 shows how DBMS products have been mapped to each component of the data architecture.

Figure 3: Mapping Data Technologies to the Architecture
[FIGURE NOT AVAILABLE IN ASCII TEXT VERSION]

Next Critical Events

The data architecture and the open systems environment continue to evolve in their implementations at Indiana University. Some next critical events in the evolution of data architecture are summarized as follows.

* We need to continue to migrate information detail data from sequential and hierarchical file formats to relational format.

* We need a methodology for specification and creation of standard collections--how they are defined, by whom, and at what stage of an information systems life cycle.

* Decision rules are needed to select from among available technologies, when more than one exists within any component of the data architecture.

* User access to data needs to move from a mode of terminal login to central servers to a mode of desktop delivery and integration.

* The preferred source of data needs to move from the detail level to the collections, where significant synthesis will have already occurred and where shared definitions are accessible.

There are similar critical events in the evolution of an open systems environment. Developments in Novell/SQL database technology need to be tracked and evaluated; even if no central support for a single technology is envisioned, integration with user-selected technologies in this field will almost certainly be a requirement of our environment. Technologies for gateway access to and from the MVS world will continue to be evaluated, both Remote Procedure Call (RPC) gateways for predictable and static interactions between platforms, and dynamic SQL gateways for data interaction and interchange. A key goal for the next year or so will be analysis and resolution of issues related to security in a client/server implementation. We will need to monitor and seek opportunities for the integration of "standard" Distributed Computing Environment (DCE) functions in our open systems environment: time servers, name servers, authentication servers, and so on.

Most important, we need to keep our focus on why data architecture and open systems technology matter in a campus computing environment. Neither is an end in itself. Data architecture should lead to improved user access to information through such outcomes as better data quality, shared definitions of data, shared standards for data interpretation, and a better fit between user needs and the performance of data-related technologies.
Open systems should lead to better service for all users through such outcomes as greater flexibility to make vendor-independent selection of the best technology for a specific need, better connectivity to central computing resources, and improved access to central resources from local computing environments and individual users' desktop workstations.

========================================================================

Footnotes:

1 W. H. Inmon, Data Architecture: The Information Paradigm (Wellesley, Mass.: QED Information Sciences, Inc., 1989); and W. H. Inmon, What Is A Data Warehouse? (Sunnyvale, Calif.: Prism Solutions, Inc., 1992).

2 An Introduction to IBM's Information Warehouse Framework. Material from an IBM corporate presentation.

3 Introduction to VITAL: Designing Information Systems for the 1990s (Cupertino, Calif.: Apple Computer, Inc., 1992).

4 Open Systems Concepts and Capabilities Seminar: Student Workbook (Indianapolis, Ind.: Hewlett-Packard Corporation, 1992).

5 Gartner Group, Open Systems: Cutting Through the Confusion. Material from a 1991 Gartner Conference.

========================================================================

Why Develop a Data Architecture Model?

Improve data access ...
* describe variations in the general nature of data in the enterprise
* identify variations in intended use of data
* organize data in the ways best suited to the varying needs of users

Improve data quality ...
* establish authoritative data sources
* define stable relationships among multiple copies of the same data
* offer shared definitions for standard transformations of data: calculations, summarizations, selections

Improve database performance ...
* identify performance needs for data use, and select the best technology

*********************************************************************

Principles of Open Systems "Technology Coexistence" Strategy

* The network--the integrating force in an open systems strategy
* The users' view--the desktop as integration point for service delivery
* Relational data--the common language for a shared data environment
* Gateway technology--the bridge between the "old" and the "new"
* Proprietary and non-proprietary mix--leverage past investments while introducing change

************************************************************************

Gerald Bernbom is Assistant Director, Data Administration and Access, at Indiana University. As part of University Computing Services, his unit is responsible for data administration, database administration, security administration, data dictionary management, campus-wide information systems, and the information center.

Dennis Cromwell is Manager, Information Technology and Standards, at Indiana University. As part of the University computing organization, his team is responsible for technology planning, technical architecture, information system standards, and development methodology.

************************************************************************