Daniel J. Oberst
Charles A. Augustine
Princeton University
Princeton, New Jersey
Despite the acknowledged need,
few universities deploy comprehensive enterprise systems management, largely
because of software and staff costs. Working through vendor partnerships,
Princeton University has been able to successfully build an enterprise systems
management infrastructure to augment or replace point solutions and homegrown tools,
providing monitoring, job scheduling, and output management.
Enterprise Systems Management: What Do You Watch When There Is No Mainframe?
By Daniel J. Oberst & Charles A. Augustine
This paper focuses
on Princeton’s experience in implementing the Tivoli monitoring products. We are
currently monitoring all of our production Unix and Windows NT/2000 servers,
which is a total of about 130 systems. We are also monitoring the status of eleven production applications, with plans to monitor all of our applications, focusing first on those with the largest user bases.
Early
administrative and enterprise applications grew up in mainframe environments,
where integrated management and monitoring tools provided status information as
well as performance tracking and tuning. Answers to questions such as “How come
I can’t get to the Human Resources System?” or “Why is the Student System slow?”
could be ascertained by looking at the mainframe, its processes, or the network
between the user and the central computer. But in today’s distributed systems,
these kinds of applications depend on multiple NT and UNIX file servers,
back-end database servers, front- and back-end Web servers, distributed output
devices, and sometimes application servers or transaction-processing monitors,
along with all the network path dependencies among these elements. Answering the
same types of questions today is a much more difficult task because of the complexity of the systems and the variety of management tools available.
Each platform in a system (Windows NT, UNIX, network hubs and routers, etc.) might have a different tool for monitoring its operation and performance. Above the operating
system, each of the services and applications might itself have a separate
reporting and monitoring tool. Even though the groups responsible for the
operation of each of these systems, services, and applications might be
well-versed in the use of their particular tool, the information gathered may
not be understood or even accessible outside of the group. For managers,
tracking down the source of problems involves repeated calls and queries to each
of the systems in an attempt to pinpoint the problem, with no easy way to
consolidate and aggregate the information from the underlying
layers.
Enterprise Systems
Management (ESM) vendors attempt to solve this dilemma with a coordinated set of
monitoring tools that work all the way up the protocol stack, from network
connectivity and operating systems to complex enterprise resource planning (ERP)
suites, and that provide a common repository for operation and performance
monitoring at all levels. At the highest layer, ESM tools attempt to model and
monitor business systems and practices so that the overall health of a system or
application (e.g., Human Resources, Student Records, Finance) can be determined.
When problems occur, drill-down tools permit recursive querying of the
supporting components to identify underlying causes.
Tivoli is a brand
under which IBM markets a suite of more than 60 products, each addressing some aspect of systems management. The core of the Tivoli suite is a set of products that share
a common infrastructure and provide basic system management services such as
monitoring, software inventory and distribution, and remote control. An
additional set of products is much more loosely integrated into the Tivoli
infrastructure. Examples are the Tivoli Storage Manager (TSM), formerly ADSM,
and the Tivoli Workload Scheduler (TWS).
As part of a broad
implementation of new administrative systems, Princeton University began to
deploy Tivoli Systems’ ESM framework and tools in 1998. Few colleges or
universities had experience with these tools, largely because of the high cost,
which large businesses could justify in terms of revenue growth potential and exposure to lost income, and because of the overall complexity of these systems. Most
institutions have monitoring in place for their earlier mainframe solutions and
have developed or acquired point-solution tools for managing networks and UNIX
systems. Until the growth of new, complex systems, most have not felt the need
for broader ESM tools. With Tivoli, Princeton began a partnership that provided
an affordable path to determine the viability of these products in the campus
environment. In addition to the systems-monitoring tools, Princeton also
acquired network monitoring (NetView) as well as helpdesk (Tivoli Service Desk),
job scheduling (Maestro), and output management tools
(Destiny).
Our initial efforts
focused on designing and implementing the underlying framework for Tivoli and
establishing independent implementation of the other products. The last three
were especially time-critical, since they were replacing systems being phased
out, and efforts at integration took a backseat. For the initial deployment,
three staff positions were loaned to the rollout effort. A year later a
reorganization created a three-person ESM group to run systems management, job
scheduling, and output management. In many of the large-industry Tivoli
implementations, each of these efforts would have three to five people assigned
to it, so our implementation took longer than anticipated. In addition, the
rollout was slowed by staff turnover as we moved from pilot to production, by
the pressure to deploy Maestro for production control, and by the need to choose
an alternative vendor, Dazel, to remediate Y2K production printing (after an
unsuccessful attempt at implementing Destiny).
Nevertheless, we
are now monitoring 150 hosts and eleven applications with Tivoli’s distributed
management tools and framework and are using the Tivoli Event Console (TEC) and
underlying database to produce regular reports and drill-down Web pages for
event tracking. Princeton has installed Workload Scheduler, Storage Manager, and
a set of the monitoring products. TWS is a key part of Princeton’s
administrative production infrastructure. It schedules all of our administrative
production jobs on Unix and Windows NT servers. We are currently scheduling 3500
jobs per week on 28 Unix and NT hosts.
Production faxing
and administrative output needs are being managed on 70 queues for 20 printers
through Dazel. Another major Tivoli application at Princeton is TSM, which is used to back up more than 7000 systems, including all of our Unix and NT servers and all of our faculty and staff Windows desktops. We are managing hundreds of
terabytes of offline data.
In understanding
the issues involved in implementing Tivoli monitoring, it is helpful to
understand the architecture of the Tivoli monitoring product. A monitoring
engine installed on each host executes the monitoring probes and performs limited evaluation of the results. A central server hosts a
database that stores monitoring configuration for groups of client systems and a
user interface from which an administrator controls the configuration and
distribution of monitoring probes. The power of this architecture is in how
easily it scales to large numbers of systems: it is not much more difficult to manage 100 or even 1,000 systems than it is to manage 10.
The events
generated on the individual systems are sent back to a central event-processing
engine. Tivoli also provides adapters to take other event streams and convert
them for use by Tivoli: for example, messages in a system or application log
file, or events from a network management application like NetView. It is also
easy to add your own locally developed monitors.
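To give a sense of what a locally developed monitor can look like, here is a minimal sketch in Python. The check (whether an application port is accepting connections), the event class, and the server name are all hypothetical, and the postemsg options shown vary by Tivoli release; a real monitor would use the values from your own TEC configuration.

```python
"""Minimal sketch of a locally developed Tivoli monitor (illustrative only)."""
import socket
import subprocess

TEC_SERVER = "tec.example.edu"      # hypothetical event server
EVENT_CLASS = "App_Port_Down"       # hypothetical TEC event class
HOST, PORT = "payroll.example.edu", 1521

def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not port_is_open(HOST, PORT):
    # postemsg ships with TEC; the exact option names vary by release.
    subprocess.run([
        "postemsg", "-S", TEC_SERVER,
        "-r", "CRITICAL",
        "-m", f"No listener on {HOST}:{PORT}",
        f"hostname={HOST}",
        EVENT_CLASS, "local_monitor",
    ], check=False)
```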
The process of
implementing Tivoli monitoring is not very different from any other system
development task. First, you need to identify your users and their requirements.
Then you plan what you will monitor to meet those requirements. If the
requirements are fairly high-level, such as “Tell me whether the payroll
application is operating normally,” it may take a lot of work to turn that into
a specification for a set of particular monitors and their error thresholds.
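As an illustration of what such a decomposition might look like, the sketch below renders a hypothetical "payroll is operating normally" requirement as a set of concrete probes with warning and error thresholds. The probe names and numbers are invented for the example, not taken from our configuration.

```python
# Hypothetical decomposition of "tell me whether the payroll application
# is operating normally" into concrete monitors with error thresholds.
# Names and numbers are illustrative, not production values.
PAYROLL_MONITORS = [
    {"probe": "login_page_responds",   "warn": "latency > 5s", "error": "no response in 30s"},
    {"probe": "oracle_listener_up",    "warn": None,           "error": "port 1521 closed"},
    {"probe": "batch_queue_depth",     "warn": "> 50 jobs",    "error": "> 200 jobs"},
    {"probe": "filesystem_free_space", "warn": "< 15% free",   "error": "< 5% free"},
    {"probe": "app_log_error_rate",    "warn": "> 1/min",      "error": "> 10/min"},
]
```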
Having decided what
to monitor, you need to figure out how you are going to do it by selecting a set
of monitors from the ones supplied by Tivoli and creating custom monitors for
needs that are not covered by Tivoli’s collection.
If you are creating
events and sending them to the central event processing engine, you will need to
develop rules for the processing engine and a central notification process. The
ideal would be to have the notification process encapsulate the organizational
and technical knowledge of a trained operator in routing an event to the correct
person.
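TEC rules themselves are written in its Prolog-based language, but the routing logic they would encode can be illustrated in ordinary code. The following Python sketch restates the kind of operator knowledge we have in mind; the event classes, business hours, and contacts are invented for the example.

```python
# Illustrative routing table: the kind of organizational knowledge a
# notification process should encapsulate. Event classes, hours, and
# contacts are invented for the example.
ROUTES = {
    "Oracle_Listener_Down": "dba-oncall",
    "App_Port_Down":        "app-owner",
    "Disk_Space_Low":       "sysadmin-group",
}

def route_event(event_class: str, severity: str, hour: int) -> str:
    """Pick a contact the way a trained operator would."""
    contact = ROUTES.get(event_class, "operations-desk")
    # After business hours, escalate critical events straight to the pager.
    if severity == "CRITICAL" and not 8 <= hour < 18:
        contact = "oncall-pager"
    return contact
```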
After two years of
following this process, we have learned some lessons about what to avoid. The
first lesson is not to underestimate the time and staffing needs: this is not a project that can be done with part-time effort. We started with borrowed staff; we needed to create a dedicated staff.
Another lesson is
not to try to replace existing tools. If you have monitoring tools that are
working well, begin by trying to integrate them into your enterprise solution.
And look for those areas with the fewest existing tools. In our organization,
the NT system administrators have been more receptive to Tivoli, partly because
they have fewer monitoring tools than the Unix staff.
In general, you
will probably find that the application owners are the ones with the least
information and the fewest monitoring tools. This leads to the final lesson: start with application availability and work down. You will still need to deploy a set of lower-level monitors, but your choices and architecture will be guided by the need to explain application failures.
Why
did we underestimate the effort involved? There are several reasons. First, we
didn’t understand the extent to which Tivoli is a toolkit, not a turnkey
application. It provides a monitoring infrastructure including the agent on the
individual systems and a mechanism for managing centrally what monitors are run,
how often, what error thresholds are defined, and what actions are taken when an
error is detected. A basic set of operating system-level monitors is provided,
along with a handful of application monitoring packages and tools for creating
custom monitors. As we have switched to an application focus, we have found that
we usually have to write several custom monitors to watch for the error
conditions we care about.
The
most useful action that a monitor can take when an error is detected is to send
an event to the central event-processing engine. This is a rules engine that uses a version of the Prolog programming language to express its rules. Only the most basic rules are provided, and you will certainly need to develop additional rules to get any kind of sophisticated event processing. Because the rules language is different from a traditional procedural language, you need to plan for significant time to develop expertise in using this capability, or you will need to hire consultants to do the rules development for you. Some combination of the two may be most effective, if you can arrange for knowledge transfer from the consultants to your staff.
An
interesting alternative implementation process is the Tivoli Rapid Deployment
Program, which has been used at Duke University. This is a value-added service
offered by IBM that provides a pre-configured Tivoli system that will get you up
and running more quickly if your configuration is simple enough to fit into
their model.
Another
reason that we underestimated the effort involved in implementing Tivoli is that
we did not realize how hard it would be to change existing monitoring practices
and tools. Our experience was that we needed to look for users with relatively
few tools, and that we needed to seek ways to integrate existing tools into the
Tivoli infrastructure rather than replacing them.
Finally,
we found that after the initial rollout, continuing operation competes with
development for staff time. This is, of course, a common experience in
developing a software system.
During
our implementation, we made a major shift in our development strategy.
Initially, we had been following a “bottom-up” strategy: we planned to begin by monitoring at the network and operating system levels and to move up through the service layers to application status. This approach did not produce value to the organization quickly, for two principal reasons. First, it depends on replacing existing tools at the lower levels, which turns out to be quite difficult; second, it assumes that it is possible to synthesize application status from lower-level status information. Unfortunately, it is not in general possible to know, a priori, all the possible causes of application failure, let alone monitor for them all. So even if this strategy could be implemented, it would not provide comprehensive monitoring of application status.
Our
current strategy is “top-down”, starting with the application. We decide how we can probe the application from the user’s perspective to determine whether it is operating correctly. Then we ask what information an expert troubleshooter would need to diagnose a failure, and we seek to provide that information at the lower levels. Where we know of common failure modes, we monitor for them specifically. The users of this level of information are the application owners and maintainers. Since adopting this strategy, we have successfully monitored our Time Collection, Blackboard, and DataMall applications.
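For a Web-based application, a user-perspective probe can be as simple as fetching a known page and checking for expected content, since a server can respond and still serve an error page. The sketch below, using only the Python standard library, shows the idea; the URL and marker text are placeholders rather than our actual probe.

```python
"""Sketch of a user-perspective application probe (illustrative only)."""
import time
import urllib.request

URL = "https://blackboard.example.edu/login"   # hypothetical application URL
MARKER = "Welcome"                             # text expected on a healthy page

def probe(url: str, marker: str, timeout: float = 15.0):
    """Return (ok, elapsed_seconds) for one user-style request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # A 200 alone is not enough: check for the expected content too.
            ok = resp.status == 200 and marker in body
    except OSError:
        ok = False
    return ok, time.monotonic() - start
```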
In
addition to extending top-level application monitoring to additional
applications, we have a number of other projects underway. We have installed the
Tivoli Manager for Oracle, and we will be extending it to all of our production
databases. We will also be feeding events to our help desk system, which will
provide a richer notification capability, and we want to make use of the two-way
capability of our current paging hardware. Finally, as we send more events to
the help desk and to application owners, we need to find better ways to filter
out transient conditions and notify only on legitimate
errors.
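One simple filtering approach, sketched below, is to notify only after the same condition has been observed on several consecutive probe cycles. The threshold of three cycles is an arbitrary illustration, not a recommendation from our deployment.

```python
from collections import defaultdict

# Suppress transient conditions: notify only after the same event has
# fired on several consecutive probe cycles. Three is an arbitrary choice.
CONSECUTIVE_REQUIRED = 3
_streaks = defaultdict(int)

def should_notify(event_key: str, failing: bool) -> bool:
    """Track consecutive failures and decide whether to notify anyone."""
    if failing:
        _streaks[event_key] += 1
    else:
        _streaks[event_key] = 0   # condition cleared; reset the streak
    # Fire exactly once, when the streak first reaches the threshold.
    return _streaks[event_key] == CONSECUTIVE_REQUIRED
```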
Overall,
our Tivoli implementation has moved more slowly than we anticipated, for various reasons. But having adjusted our approach and gained more mastery of the product, we are now beginning to provide value to application owners in the ways for which Tivoli was purchased, and we expect to see significant additional monitoring capabilities provided for application owners, system administrators, and help desk staff over the next 6-12 months.
ESM software
products are complex and require a large staff investment and lead-time. Small
organizations may have difficulty justifying the overhead and expense. And even
though upper management may be convinced of the need for these products, selling
the systems to line staff can be difficult. Many system administrators have
their own set of “point solution” monitoring tools, which they are reluctant to
abandon. Successful implementation of an ESM thus may be held up in an
underlying catch-22: system administrators see little value added, and yet the
value comes only when all the individual systems can be monitored and an overall
aggregated view presented. A hybrid approach—wrapping the point solutions to
incorporate them into the ESM framework—can leverage existing monitoring and get
buy-in once the benefits of aggregation can be
demonstrated.
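In practice, wrapping a point solution can be as simple as running the existing check unchanged and translating its exit status into an ESM event. In the hedged Python sketch below, the wrapped command, the event class, and the forwarding step (shown as a postemsg call) are all placeholders for whatever mechanism a given framework provides.

```python
import subprocess

# Wrap an existing point-solution check unchanged and translate its exit
# status into an ESM event. Command, event class, and forwarding
# mechanism are placeholders for a real installation.
result = subprocess.run(
    ["/usr/local/bin/check_oracle.sh"],   # the sysadmins' existing tool
    capture_output=True, text=True,
)
if result.returncode != 0:
    # Stand-in for a TEC adapter or postemsg invocation.
    subprocess.run([
        "postemsg", "-S", "tec.example.edu",
        "-r", "WARNING",
        "-m", result.stdout.strip()[:200] or "check failed",
        "Point_Solution_Alert", "wrapper",
    ], check=False)
```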
Systems like Tivoli
provide rich functionality and robustness at the cost of complexity: a
distributed object structure allows Tivoli agents to function independently and
at a layer of abstraction that allows for complex aggregation and analysis. The
TEC, which monitors log activity, includes a rich correlation engine for higher-level analysis. And although graphical user interface tools and
rule-builders simplify much of the interaction with these components, working
with the underlying object structure, creating programmatic command-line
interactions with the system, and digging into the Prolog code of the TEC’s
rules engine are almost essential to a successful
implementation.
Expect to dedicate
staff to your ESM efforts: the systems are complex and training is essential.
Outside consultants can speed up architecture and installation issues, since few
of these systems operate “out of the box,” and an experienced engineer can help
steer you around many potential pitfalls.
A large percentage
of ESM efforts fail, in large part because of the complexity of these systems.
Establishing targeted six-month goals and concentrating effort on particular phases of the rollout, or on particular systems to monitor, can increase the chances of success.
These phases can include overall systems architecture, framework deployment,
UNIX distributed monitoring, job scheduling, output management, NT distributed
monitoring, database monitoring, Web and e-mail monitoring, and finally,
administrative systems monitoring.
Is it worth it?
Several years into the ESM deployment at Princeton, we can now detect anomalies
and problems that our local monitors don’t track, and application owners are
coming to us for help in monitoring their systems. We’re starting to tackle our
databases and higher-level systems that none of the in-house tools can track, and to integrate the information in a way that point solutions cannot. The up-front effort to implement the system was high relative to the benefits over existing monitoring, and much momentum was lost in the time needed to implement the necessary but peripheral job scheduling and output management systems. But with
two major distributed applications being rolled out (Human Resources and Student Systems) in
addition to the current Finance, Accounts Receivables, and Alumni Systems,
project managers are now beginning to ask for the type of monitoring and
management that can be obtained only within an ESM framework.
Lastly,
organizations considering ESM should be sure that they understand what goals
they want to achieve. Getting buy-in and early cooperation across the
organization will help steer the many decision processes along the way and
increase chances for successful implementation.