Daniel J. Oberst
Charles A. Augustine
Princeton University
Princeton, New Jersey
Despite the acknowledged need,
few universities deploy comprehensive enterprise systems management, largely
because of software and staff costs. Working through vendor partnerships,
Princeton University has been able to successfully build an enterprise systems
management infrastructure to augment or replace point solutions and homegrown tools,
providing monitoring, job scheduling, and output management.
Enterprise Systems Management: What Do You Watch When There Is No Mainframe?
By Daniel J. Oberst & Charles A. Augustine
This paper focuses
on Princeton’s experience in implementing the Tivoli monitoring products. We are
currently monitoring all of our production Unix and Windows NT/2000 servers,
which is a total of about 130 systems. We are also monitoring the status of eleven production applications, with plans to monitor all of our applications, focusing first on those with the largest user bases.
Early
administrative and enterprise applications grew up in mainframe environments,
where integrated management and monitoring tools provided status information as
well as performance tracking and tuning. Answers to questions such as “How come
I can’t get to the Human Resources System?” or “Why is the Student System slow?”
could be ascertained by looking at the mainframe, its processes, or the network
between the user and the central computer. But in today’s distributed systems,
these kinds of applications depend on multiple NT and UNIX file servers,
back-end database servers, front- and back-end Web servers, distributed output
devices, and sometimes application servers or transaction-processing monitors,
along with all the network path dependencies among these elements. Answering the
same types of questions today is a much more difficult task because of the complexity of the systems and the variety of management tools available.
Each platform in a system (Windows NT, UNIX, network hubs and routers, etc.) might have a different tool for monitoring its operation and performance. Above the operating
system, each of the services and applications might itself have a separate
reporting and monitoring tool. Even though the groups responsible for the
operation of each of these systems, services, and applications might be
well-versed in the use of their particular tool, the information gathered may
not be understood or even accessible outside of the group. For managers,
tracking down the source of problems involves repeated calls and queries to each
of the systems in an attempt to pinpoint the problem, with no easy way to
consolidate and aggregate the information from the underlying
layers.
Enterprise Systems
Management (ESM) vendors attempt to solve this dilemma with a coordinated set of
monitoring tools that work all the way up the protocol stack, from network
connectivity and operating systems to complex enterprise resource planning (ERP)
suites, and that provide a common repository for operation and performance
monitoring at all levels. At the highest layer, ESM tools attempt to model and
monitor business systems and practices so that the overall health of a system or
application (e.g., Human Resources, Student Records, Finance) can be determined.
When problems occur, drill-down tools permit recursive querying of the
supporting components to identify underlying causes.
Tivoli is a brand
under which IBM markets a suite of more than 60 products, each addressing some aspect of systems management. The core of the Tivoli suite is a set of products that share
a common infrastructure and provide basic system management services such as
monitoring, software inventory and distribution, and remote control. An
additional set of products is much more loosely integrated into the Tivoli
infrastructure. Examples are the Tivoli Storage Manager (TSM), formerly ADSM,
and the Tivoli Workload Scheduler (TWS).
As part of a broad
implementation of new administrative systems, Princeton University began to
deploy Tivoli Systems’ ESM framework and tools in 1998. Few colleges or
universities had experience with these tools, largely because of the high cost,
which large businesses could justify in terms of revenue growth potential and exposure to lost income, and because of the overall complexity of these systems. Most
institutions have monitoring in place for their earlier mainframe solutions and
have developed or acquired point-solution tools for managing networks and UNIX
systems. Until the growth of new, complex systems, most have not felt the need
for broader ESM tools. With Tivoli, Princeton began a partnership that provided
an affordable path to determine the viability of these products in the campus
environment. In addition to the systems-monitoring tools, Princeton also
acquired network monitoring (NetView) as well as helpdesk (Tivoli Service Desk),
job scheduling (Maestro), and output management tools
(Destiny).
Our initial efforts
focused on designing and implementing the underlying framework for Tivoli and
establishing independent implementation of the other products. The last three
were especially time-critical, since they were replacing systems being phased
out, and efforts at integration took a backseat. For the initial deployment,
three staff positions were loaned to the rollout effort. A year later a
reorganization created a three-person ESM group to run systems management, job
scheduling, and output management. In many of the large-industry Tivoli
implementations, each of these efforts would have three to five people assigned
to it, so our implementation took longer than anticipated. In addition, the
rollout was slowed by staff turnover as we moved from pilot to production, by
the pressure to deploy Maestro for production control, and by the need to choose
an alternative vendor, Dazel, to remediate Y2K production printing (after an
unsuccessful attempt at implementing Destiny).
Nevertheless, we
are now monitoring 150 hosts and eleven applications with Tivoli’s distributed
management tools and framework and are using the Tivoli Event Console (TEC) and
underlying database to produce regular reports and drill-down Web pages for
event tracking. Princeton has installed Workload Scheduler, Storage Manager, and
a set of the monitoring products. TWS is a key part of Princeton’s
administrative production infrastructure. It schedules all of our administrative
production jobs on Unix and Windows NT servers. We are currently scheduling 3500
jobs per week on 28 Unix and NT hosts.
Production faxing
and administrative output needs are being managed on 70 queues for 20 printers
through Dazel. Another major Tivoli application at Princeton is TSM, which is used to back up more than 7000 systems, including all of our Unix and NT servers and all of our faculty and staff Windows desktops. We are managing hundreds of
terabytes of offline data.
In understanding
the issues involved in implementing Tivoli monitoring, it is helpful to
understand the architecture of the Tivoli monitoring product. A monitoring
engine installed on each host executes the monitoring probes and performs limited evaluation of the results. A central server hosts a
database that stores monitoring configuration for groups of client systems and a
user interface from which an administrator controls the configuration and
distribution of monitoring probes. The power of this architecture is in how
easily it scales to large numbers of systems: it is not much more difficult to manage 100 or even 1,000 systems than it is to manage 10.
The events
generated on the individual systems are sent back to a central event-processing
engine. Tivoli also provides adapters to take other event streams and convert
them for use by Tivoli: for example, messages in a system or application log
file, or events from a network management application like NetView. It is also
easy to add your own locally developed monitors.
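To give a sense of what a locally developed monitor can look like, here is a minimal sketch in Python. The check (whether an application port is accepting connections), the event class, and the server name are all hypothetical, and the postemsg options shown vary by Tivoli release; a real monitor would use the values from your own TEC configuration.

```python
"""Minimal sketch of a locally developed Tivoli monitor (illustrative only)."""
import socket
import subprocess

TEC_SERVER = "tec.example.edu"      # hypothetical event server
EVENT_CLASS = "App_Port_Down"       # hypothetical TEC event class
HOST, PORT = "payroll.example.edu", 1521

def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not port_is_open(HOST, PORT):
    # postemsg ships with TEC; the exact option names vary by release.
    subprocess.run([
        "postemsg", "-S", TEC_SERVER,
        "-r", "CRITICAL",
        "-m", f"No listener on {HOST}:{PORT}",
        f"hostname={HOST}",
        EVENT_CLASS, "local_monitor",
    ], check=False)
```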
The process of
implementing Tivoli monitoring is not very different from any other system
development task. First, you need to identify your users and their requirements.
Then you plan what you will monitor to meet those requirements. If the
requirements are fairly high-level, such as “Tell me whether the payroll
application is operating normally,” it may take a lot of work to turn that into
a specification for a set of particular monitors and their error thresholds.
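As an illustration of what such a decomposition might look like, the sketch below renders a hypothetical "payroll is operating normally" requirement as a set of concrete probes with warning and error thresholds. The probe names and numbers are invented for the example, not taken from our configuration.

```python
# Hypothetical decomposition of "tell me whether the payroll application
# is operating normally" into concrete monitors with error thresholds.
# Names and numbers are illustrative, not production values.
PAYROLL_MONITORS = [
    {"probe": "login_page_responds",   "warn": "latency > 5s", "error": "no response in 30s"},
    {"probe": "oracle_listener_up",    "warn": None,           "error": "port 1521 closed"},
    {"probe": "batch_queue_depth",     "warn": "> 50 jobs",    "error": "> 200 jobs"},
    {"probe": "filesystem_free_space", "warn": "< 15% free",   "error": "< 5% free"},
    {"probe": "app_log_error_rate",    "warn": "> 1/min",      "error": "> 10/min"},
]
```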
Having decided what
to monitor, you need to figure out how you are going to do it by selecting a set
of monitors from the ones supplied by Tivoli and creating custom monitors for
needs that are not covered by Tivoli’s collection.
If you are creating
events and sending them to the central event processing engine, you will need to
develop rules for the processing engine and a central notification process. The
ideal would be to have the notification process encapsulate the organizational
and technical knowledge of a trained operator in routing an event to the correct
person.
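TEC rules themselves are written in its Prolog-based language, but the routing logic they would encode can be illustrated in ordinary code. The following Python sketch restates the kind of operator knowledge we have in mind; the event classes, business hours, and contacts are invented for the example.

```python
# Illustrative routing table: the kind of organizational knowledge a
# notification process should encapsulate. Event classes, hours, and
# contacts are invented for the example.
ROUTES = {
    "Oracle_Listener_Down": "dba-oncall",
    "App_Port_Down":        "app-owner",
    "Disk_Space_Low":       "sysadmin-group",
}

def route_event(event_class: str, severity: str, hour: int) -> str:
    """Pick a contact the way a trained operator would."""
    contact = ROUTES.get(event_class, "operations-desk")
    # After business hours, escalate critical events straight to the pager.
    if severity == "CRITICAL" and not 8 <= hour < 18:
        contact = "oncall-pager"
    return contact
```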
After two years of
following this process, we have learned some lessons about what to avoid. The
first lesson is not to underestimate the time and staffing needs: this is not a project that can be done with part-time effort. We started with borrowed staff; we needed to create a dedicated staff.
Another lesson is
not to try to replace existing tools. If you have monitoring tools that are
working well, begin by trying to integrate them into your enterprise solution.
And look for those areas with the fewest existing tools. In our organization,
the NT system administrators have been more receptive to Tivoli, partly because
they have fewer monitoring tools than the Unix staff.
In general, you
will probably find that the application owners are the ones with the least
information and the fewest monitoring tools. This leads to the final lesson: start with application availability and work down. You will still need to deploy a set of lower-level monitors, but your choices and architecture will be guided by the need to explain application failures.
Why
did we underestimate the effort involved? There are several reasons. First, we
didn’t understand the extent to which Tivoli is a toolkit, not a turnkey
application. It provides a monitoring infrastructure including the agent on the
individual systems and a mechanism for managing centrally what monitors are run,
how often, what error thresholds are defined, and what actions are taken when an
error is detected. A basic set of operating system-level monitors is provided,
along with a handful of application monitoring packages and tools for creating
custom monitors. As we have switched to an application focus, we have found that
we usually have to write several custom monitors to watch for the error
conditions we care about.
The
most useful action that a monitor can take when an error is detected is to send
an event to the central event-processing engine. This is a rules engine that uses a version of the Prolog programming language to express its rules. Only the most basic rules are provided, and you will certainly need to develop additional rules to get any kind of sophisticated event processing. Because the rules language is different from a traditional procedural language, you need to plan for significant time to develop expertise in using this capability, or you will need to hire consultants to do the rules development for you. Some combination of the two may be most effective, if you can arrange for knowledge transfer from the consultants to your staff.
An
interesting alternative implementation process is the Tivoli Rapid Deployment
Program, which has been used at Duke University. This is a value-added service
offered by IBM that provides a pre-configured Tivoli system that will get you up
and running more quickly if your configuration is simple enough to fit into
their model.
Another
reason that we underestimated the effort involved in implementing Tivoli is that
we did not realize how hard it would be to change existing monitoring practices
and tools. Our experience was that we needed to look for users with relatively
few tools, and that we needed to seek ways to integrate existing tools into the
Tivoli infrastructure rather than replacing them.
Finally,
we found that after the initial rollout, continuing operation competes with
development for staff time. This is, of course, a common experience in
developing a software system.
During
our implementation, we made a major shift in our development strategy.
Initially, we had been following a “bottom-up” strategy: we planned to begin by monitoring at the network and operating system levels and to move up through the service layers to application status. This approach did not produce value to the organization quickly, for two principal reasons. First, it depends on replacing existing tools at the lower levels, which turns out to be quite difficult; second, it assumes that it is possible to synthesize application status from lower-level status information. Unfortunately, it is not in general possible to know, a priori, all the possible causes of application failure, let alone monitor for them all. So even if this strategy could be implemented, it would not provide comprehensive monitoring of application status.
Our
current strategy is “top-down”, starting with the application. We decide how we can probe the application from the user’s perspective to determine whether it is operating correctly. Then we ask what information an expert troubleshooter would need to diagnose a failure, and we seek to provide that information at the lower levels. Where we know of common failure modes, we monitor for them specifically. The users of this level of information are the application owners and maintainers. Since adopting this strategy, we have successfully monitored our Time Collection, Blackboard, and DataMall applications.
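For a Web-based application, a user-perspective probe can be as simple as fetching a known page and checking for expected content, since a server can respond and still serve an error page. The sketch below, using only the Python standard library, shows the idea; the URL and marker text are placeholders rather than our actual probe.

```python
"""Sketch of a user-perspective application probe (illustrative only)."""
import time
import urllib.request

URL = "https://blackboard.example.edu/login"   # hypothetical application URL
MARKER = "Welcome"                             # text expected on a healthy page

def probe(url: str, marker: str, timeout: float = 15.0):
    """Return (ok, elapsed_seconds) for one user-style request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # A 200 alone is not enough: check for the expected content too.
            ok = resp.status == 200 and marker in body
    except OSError:
        ok = False
    return ok, time.monotonic() - start
```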
In
addition to extending top-level application monitoring to additional
applications, we have a number of other projects underway. We have installed the
Tivoli Manager for Oracle, and we will be extending it to all of our production
databases. We will also be feeding events to our help desk system, which will
provide a richer notification capability, and we want to make use of the two-way
capability of our current paging hardware. Finally, as we send more events to
the help desk and to application owners, we need to find better ways to filter
out transient conditions and notify only on legitimate
errors.
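One simple filtering approach, sketched below, is to notify only after the same condition has been observed on several consecutive probe cycles. The threshold of three cycles is an arbitrary illustration, not a recommendation from our deployment.

```python
from collections import defaultdict

# Suppress transient conditions: notify only after the same event has
# fired on several consecutive probe cycles. Three is an arbitrary choice.
CONSECUTIVE_REQUIRED = 3
_streaks = defaultdict(int)

def should_notify(event_key: str, failing: bool) -> bool:
    """Track consecutive failures and decide whether to notify anyone."""
    if failing:
        _streaks[event_key] += 1
    else:
        _streaks[event_key] = 0   # condition cleared; reset the streak
    # Fire exactly once, when the streak first reaches the threshold.
    return _streaks[event_key] == CONSECUTIVE_REQUIRED
```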
Overall,
our Tivoli implementation has moved more slowly than we anticipated, for various reasons. But having adjusted our approach and gained more mastery of the product, we are now beginning to provide value to application owners in the ways for which Tivoli was purchased, and we expect to see significant additional monitoring capabilities provided for application owners, system administrators, and help desk staff over the next 6-12 months.
ESM software
products are complex and require a large staff investment and lead-time. Small
organizations may have difficulty justifying the overhead and expense. And even
though upper management may be convinced of the need for these products, selling
the systems to line staff can be difficult. Many system administrators have
their own set of “point solution” monitoring tools, which they are reluctant to
abandon. Successful implementation of an ESM thus may be held up in an
underlying catch-22: system administrators see little value added, and yet the
value comes only when all the individual systems can be monitored and an overall
aggregated view presented. A hybrid approach—wrapping the point solutions to
incorporate them into the ESM framework—can leverage existing monitoring and get
buy-in once the benefits of aggregation can be
demonstrated.
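In practice, wrapping a point solution can be as simple as running the existing check unchanged and translating its exit status into an ESM event. In the hedged Python sketch below, the wrapped command, the event class, and the forwarding step (shown as a postemsg call) are all placeholders for whatever mechanism a given framework provides.

```python
import subprocess

# Wrap an existing point-solution check unchanged and translate its exit
# status into an ESM event. Command, event class, and forwarding
# mechanism are placeholders for a real installation.
result = subprocess.run(
    ["/usr/local/bin/check_oracle.sh"],   # the sysadmins' existing tool
    capture_output=True, text=True,
)
if result.returncode != 0:
    # Stand-in for a TEC adapter or postemsg invocation.
    subprocess.run([
        "postemsg", "-S", "tec.example.edu",
        "-r", "WARNING",
        "-m", result.stdout.strip()[:200] or "check failed",
        "Point_Solution_Alert", "wrapper",
    ], check=False)
```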
Systems like Tivoli
provide rich functionality and robustness at the cost of complexity: a
distributed object structure allows Tivoli agents to function independently and
at a layer of abstraction that allows for complex aggregation and analysis. The
TEC, which monitors log activity, includes a rich correlation engine for higher-level analysis. And although graphical user interface tools and
rule-builders simplify much of the interaction with these components, working
with the underlying object structure, creating programmatic command-line
interactions with the system, and digging into the Prolog code of the TEC’s
rules engine are almost essential to a successful
implementation.
Expect to dedicate
staff to your ESM efforts: the systems are complex and training is essential.
Outside consultants can speed up architecture and installation issues, since few
of these systems operate “out of the box,” and an experienced engineer can help
steer you around many potential pitfalls.
A large percentage
of ESM efforts fail, in large part because of the complexity of these systems.
Establishing targeted six-month goals and concentrating effort on particular phases of the rollout, or on particular systems to monitor, can increase the chances of success.
These phases can include overall systems architecture, framework deployment,
UNIX distributed monitoring, job scheduling, output management, NT distributed
monitoring, database monitoring, Web and e-mail monitoring, and finally,
administrative systems monitoring.
Is it worth it?
Several years into the ESM deployment at Princeton, we can now detect anomalies
and problems that our local monitors don’t track, and application owners are
coming to us for help in monitoring their systems. We’re starting to tackle our
databases and higher-level systems that none of the in-house tools can track, and to integrate the information in a way that point solutions cannot. The up-front effort to implement the system was high relative to the benefits over existing monitoring, and much momentum was lost in the time needed to implement the necessary but peripheral job scheduling and output management systems. But with
two major distributed applications being rolled out (Human Resources and Student Systems) in
addition to the current Finance, Accounts Receivables, and Alumni Systems,
project managers are now beginning to ask for the type of monitoring and
management that can be obtained only within an ESM framework.
Lastly,
organizations considering ESM should be sure that they understand what goals
they want to achieve. Getting buy-in and early cooperation across the
organization will help steer the many decision processes along the way and
increase chances for successful implementation.