A Warehouse Design; Keeping it Simple

This paper is the intellectual property of the author(s). It was presented at EDUCAUSE '99, an EDUCAUSE conference, and is part of that conference's online proceedings. See http://www.educause.edu/copyright.html for additional copyright information.

A Warehouse Design; Keeping it Simple

Scott Thorne
MIT
Cambridge, MA

MIT has implemented a warehouse, which brings several elements of function and design together. MIT’s warehouse supports reporting, and acts as a hub for data distribution between systems. We’ve tried to satisfy both needs with an open flexible design. We’ve restructured our information, which comes nightly from source systems such as SAP, for easier reporting. The warehouse has row level access control which is driven from an enterprise authorization system (Roles). Other design features are; a metadata driven data conversion and load process, distributed report creation, encrypted transport, and a system of automated integrity checks and controls. This paper will cover the major design points and some of our experiences implementing and maintaining a warehouse.

Introduction

MIT’s data warehouse has been in existence for about four years, but only in the past year as we’ve added our SAP financial information has it reached maturity. Currently we have a thousand registered users who have about 3,000 sessions a month. We have Personnel, Financial, Purchasing, Telephone, Student, and Award information in the warehouse currently.

Vision

The basic vision of the Data Warehouse is to make information at MIT more accessible and easier to use. If people can get data easily, they will spend less time gathering information and more time analyzing it. Having information from several different sources allows people to create reports easily. Once similar reports would have taken days to construct. This allows time for information to be used in new and creative ways. In the past, even if individuals invested the time and energy to put together information for themselves, it was not easily shared. With the technologies available today, once a user figures out a good way to look at the information (by generating a report definition in BrioQuery, for example), they can easily share it with others.

By making information available in this way, users should be able to combine warehouse information with local information. This is a powerful concept, because it frees the users to keep track of the information unique to them, and not recreate information available in the warehouse. In this way the warehouse and local systems compliment each other. Making data more accessible also serves several other goals. First, it will improve data quality over time. As people use the data, errors can be corrected as they are found. Additionally, using the information for more purposes will help us improve the design of our systems in the future, so that we will have the information that we need on hand.

Definition & Functions

The Data Warehouse is a database with information coming from many areas, which is structured in a way to make it easy to use the information.

Many Sources; integrates data from various administrative systems and stores them in one location.
Read Only; is a read only database. Information represented in the warehouse is maintained in other systems, called "systems of record".
Restructured Information; presents data in a simple form so that reports are easy to construct and the warehouse is easy to use.

The main purpose of the warehouse is a reporting and data distribution environment for Departments, Labs and Centers at MIT. In addition to this, the warehouse will support some of the reporting needs of Central departments. Also, the warehouse acts as a hub, to facilitate the exchange of information between systems. Taking all of these functions together, the warehouse serves as the enterprise information infrastructure.

The warehouse is not the place to solve problems like "Did this invoice get paid?" This type of question should still be directed to the transactional systems such as SAP. A warehouse system is not meant to take the place of the transactional system, but to complement it. It will take some training for users to understand when they should be using the warehouse and when it would be better to query the transactional system.

The goal is to put all administrative information in the Warehouse. The two basic requirements we have for determining which information goes into the warehouse are:

That the information is of use to more than one group of people.
That the information has a source system of record.

Since the community has diverse needs, the warehouse is being designed to support three separate access mechanisms:

1. End user query and reporting tools: MIT has obtained a site license for BrioQuery, one of a large number of commercial products in this area. Ad Hoc reports are very easily created. Standard reports can be defined and shared easily through the Web or an email attachment.

2. Creation of file extracts: Data files can be created and transferred using FTP'd from the warehouse to local systems. In addition database links, snapshots, and other database to database transfers can be done.

3. Custom programs: Programs can be written to access the warehouse using SQL. These can range from simple Perl scripts which just extract data, to full applications written in Powerbuilder, C, Java, etc.

User Engagement

We have a diverse set of users at MIT. Some people need to be able to just push a button to get the standard report they need. Others need to analyze information in a much more dynamic way; asking a series of questions (ad hoc queries) to investigate something. We tried to give users an easy way to get started, but put no limits on what they could do.

We don't expect all users to learn how to build reports themselves. What we hope to do is get users comfortable with running prebuilt queries. Then when the need arises, they can learn more and modify the reports to suit themselves. As users use and understand information they will start creating reports themselves, hopefully sharing this work with others.

A warehouse should ultimately change the way people use information. It takes time for this change to happen, and people change at different rates, so a warehouse needs to evolve over time. At the outset people want to be able to look at the information in as close to the same way as they’ve been seeing it. Allowing them to do this makes them become familiar with the warehouse, and have confidence that it is providing the same information they used to get. Later they can begin at their own pace looking at information in new ways, such as in new layouts, at different levels of aggregation, or in exception reports.

Pitfalls

Not restructuring data is one of the most common mistakes made in warehouse implementations. If you just move the data to the warehouse without changing the structure all you really achieve is off loading some reporting load from the transactional system. Some people make the argument that if you don’t keep it the same how will you know that you’ve got an accurate representation of the source system? There are other ways to ensure integrity, and the benefit of doing this restructuring work initially will save enormous amounts of report development time and execution time, as well as allow mere mortals to construct queries.

Managing expectations is a pitfall in every IT development project, but a warehouse project can be worse in that it ends up as all things to all people. The variety of subjects, users and uses makes expectations hard to manage.

Scale and Operating tasks -- You need to plan for success. This cliché has real meaning in a warehouse project. Many people advice to get something out fast to get buy in. However, once you’ve demonstrated the value, users will quickly want everything in the warehouse. If you have not worked out scaleable processes for handling all the new data and users you won’t be able to meet those expectations. Unfortunately this means taking more time up front to design processes that will scale. Another method which we ended up using, was to put something up initially to test our ideas, and then go off and figure out how to do it right. This leaves a gap in the delivery of new information, which needs to be explained to people before you start.

Design Points

The design of the Warehouse has many aspects. As in most designs the considerations need to be balanced. For example in the case of simplicity vs. flexibility and functionality, every time we provide alternatives we make the system a bit more complicated. A warehouse can get complicated quickly, so at all times you must strive for simplicity or things can get quickly out of hand. In each area of design we tried to keep things as simple as possible.

Easy to Use

The success of the warehouse depends on our ability to present the data in as simple a form as possible and to make interactions with the data warehouse as simple and straightforward as possible. To generate common reports, the end users will have access to data that are in an easy-to-understand and easy-to-use structure. Unlike a traditional transactional system, which minimizes the storage locations of data to make updates more efficient (normalizing), the data warehouse duplicates data where appropriate, so that reports can be generated more quickly and more easily. Although this strategy uses more disk space, it makes reporting access much easier and faster.

Star Schemas

In particular, we’ve organized the information in star schemas. Star schemas consist of a central fact table joined with several dimension tables. The fact table is extremely large and has all the transactional data and numbers, the dimension tables are smaller and have descriptive information. An example of a fact table is our financial_detail table. Examples of dimension tables are time_month, gl_account, cost_collector. The dimensions allow a user to sort, group and limit the fact records easily. Since the dimension tables are smaller, the users can display all possible values of a field when deciding how to limit their request. Users use many of the same dimension tables with different fact tables, so although it initially can seem overwhelming most users can become familiar with one set of tables quickly. This type schema is both flexible and efficient.

Star schemas have other benefits. Since all the amounts and details are in the fact table, this is the only one that usually needs row level access control. The dimensions can be left open and viewable to anyone, since they have only descriptive information like cost collector names and numbers.

Another feature made possible be star schemas is a flexible way of viewing historical information. For example, when viewing old financial information do you want to see the numbers with the account information the way it was when it happened, or recast in the way your organized now. Since the descriptive information is in the dimension table it is possible to create snapshot versions of dimension records periodically. Then by joining the dimension table to the fact table using different key fields, you can get either view from the same fact table.

Quality of Information

The implementation of a subject in the warehouse goes through a progression of stages. These stages can be years long. The quality and usefulness of data improve through each successive stage of implementation.

Getting accurate detailed data within a subject area.
Integrating accurately detailed information among subject areas, e.g., combining Personnel and Payroll information.
Creating useful summary, aggregate, and history information.

The quality of the information in the warehouse must be high for users to perform their reporting needs. There may be problems with data quality, because some of the data being delivered has not been accessible to the community before, and it has not been reviewed and corrected previously. Getting the data published and having a well-documented procedure for making corrections to data should go a long way towards making the warehouse information accurate.

Another problem with data is that we are attempting to combine information from several different sources. These source systems rarely had a need in the past to make sure their information was prepared to combine with data from another system. Therefore, problems such as having unique identifiers on records, or having several similar but different data elements will cause some problems. The solution is rethinking and altering of some of the source systems along with the warehouse, which will take some time to do correctly.

Because the warehouse is read only and not updated during the day, consistent reports can be generated from a stable data set. Users generating reports can be assured that they are obtaining information from stable data.

Integrity Checks

Knowing the information in the warehouse is a truly accurate reflection of the source system is extremely important, is people are going to rely on the warehouse for reporting. The warehouse implements several methods to try to assure this.

Users need accurate clear definitions of all the data presented in the warehouse if they are to make use of it. Beyond the definitions, they also need easy access to information concerning, for example, when the data was last loaded, where it came from, how to report errors, or how to get changes made to a particular field and record.

Open

Direct SQL access for users gives the Data Warehouse the openness (ability to use a variety of tools) and flexibility (putting information together in new ways. Many warehouses are designed with a front-end (such as the web), and no direct SQL access. Users will ultimately be limited in how they use the warehouse. Viewing information is fine, but many users actually need the data to manipulate further on their own or combine with local information. Using SQL doesn't preclude us from presenting the information via the web or some other front-end application in the future.

Metadata

Metadata or data about data is a critical component in a Warehouse environment. Metadata has many uses. Users need to know what data is available, how it’s organized and what it means. Data transformation and loading tools or programs need to know where the data comes from and how it needs to change. There is an overlap here, both users and conversion programs or tools need to know the structure the data is going into. Since we were building our own tools, we also designed our own structure for maintaining metadata. In this way we do all the maintenance in one place and avoid problems of different representations of metadata which are out of synch. It also allows us to design programs for the conversion and loading of the warehouse, which are generic and get the specifics from the metadata. This allows for a lot of code reuse and minimal code changes as we make changes to conversions and loads. Most of the time we don’t actually write new code to implement a new subject, it is only a matter of creating the metadata correctly in order to automate the loads with our existing generic programs. If an error occurs it most likely caused by either a data or metadata problem. The "metadata" makes it easy to change the way data is being loaded without having to recode. The same software gets reused in many different places making it unlikely that a problem would remain undetected in this area.

Access Control

Institute data must be handled with the proper security and access control. Information policies need to be worked out in advance and simplified as much as possible. Our warehouse design maintains security at the database level. All transmissions of data across the network is encrypted. Additionally, for users to view only the information they are allowed to see, such as their department's information, the warehouse will present most data through "views." With this scheme, users will see what looks like a table, i.e., "employees." However, in actuality, this is a view that shows each user a different set of data depending on the access control that has been granted.

This can be done by creating access control tables, which are joined with the base data tables in the view. The access control table would have the user’s username as one field and some identifier in the other. The identifier would also be present in the base data table. The view would only select rows from the base table where the identifier matched the identifier in the access control table and the username matched the person connected. This way a single view definition could yield all Department B employees for Sue, while giving Tom only Department C employees.

Not only is the maintenance reduced, but this approach allows the users to share reporting templates and queries easily, since the objects they’re using are the same. Because the access control table structure, and content is so simple these tables can be created easily or imported from another source. Using a view, which joins two tables, will impact performance, but this can work well on fact tables with millions of rows, and access tables with hundreds of thousands of entries, as long as the information is structured correctly.

Roles

Maintaining consistent access control mechanisms is a hard problem for most organizations. The variety of different systems sometimes containing similar information make implementing rational consistent access control hard. When each system has separate mechanisms for establishing access it takes a great deal of work to keep them in synch. In order to simplify this area at MIT, we designed a central authorization system called Roles. This system allows for the maintenance of authorizations in a single system. Then this system can drive the authorizations in both the system of record and the warehouse. In our case the financial reporting authorizations are maintained in Roles and fed to both SAP and the Warehouse always keeping them in synch. Furthermore, the interface for maintaining them can be distributed out to the departments, so that the people who know what the authorizations should be are maintaining them.

An enterprise access control infrastructure has the advantage of:

Achieving consistent access control over multiple systems
Distributing the maintenance of access controls to a person close enough to know and care whether the authorizations are correct.
Having a simple way to maintain and view authorizations across many systems.
Granting authorizations at the highest possible level, but enforcing them at a detailed level, which minimizes the changes to authorizations due to account or organizational restructuring.

State of the Warehouse

The warehouse is now enjoying some success. The usage has been steadily climbing this past year. More and more people are incorporating the warehouse in their normal work.

The central offices who are the information providers, are finding many benefits as well. They don’t have to write and maintain as many feeds now, since people can get information directly from the warehouse. Also, having other ways to analyze their own data helps uncover problems sooner. Some of their common reports can be run against the warehouse in far less time, because of the more efficient reporting structures.

We’re feeding many systems currently using a variety of methods, such as database links, file extracts, and Perl programs. Most of the maintenance of this is done by the person getting the feed and so as they need changes they can just do it themselves, since their program define the format, sort etc.

Next Steps

Most major subjects exist in some form with the exception of payroll. However, most subject areas cannot be easily joined to others. As noted earlier this was anticipated, and we will now turn our attention to this aspect of data administration. We also will be building better historical information, and creating some higher level summaries of existing information.