The Home of TCL Services

Consulting Services specializing in Strategic Management of Larger Networks of un*x Systems.

A systems administration manager must keep sight of the fact that computers are purchased to perform tasks that should increase the net worth of the company, not just to provide systems for us to manage!

Strategic systems management is a philosophy. It describes the approach taken to manage the daily system administration tasks required to keep the systems doing the tasks for which they were purchased. The systems must be installed and managed in adherence to a corporate-wide set of standards for this philosophy to succeed. The corporate-wide standards should primarily address consistency in networked systems management. These standards are intended to enhance the usefulness and effectiveness of the systems and of their management, not to exclude vendors or systems, or to restrict new thoughts or processes.

Generally, goals are provided by upper management along with the management task. They almost always include the phrase "do more, better and faster, with fewer people"!

Often, it is possible to improve system administration to meet some of these goals. It is possible to measure the performance of the system administration tasks. Improvement can be measured as the reduction of the:

Strategic system management will be viewed first from deployment, and secondarily from operations. There is a tendency to concentrate on operations because performance measurements can be made easily there. However, a greater change in administrative performance can often be made by re-engineering initial deployment than by re-engineering operations. The more forethought given to system configuration at install time, the easier the management of the systems will be.

Deployment and Installation

Consistency is the key to increasing system up time. Administrators and operators will need to learn only a few configurations if the systems are divided into a small number of "classes". Each member of a class is then installed and configured exactly the same as the other members of its class. This will increase the number of systems the administrators and operators are "familiar" with and reduce the research necessary each time a problem occurs. It also allows a problem to be detected and diagnosed once, on the first system to demonstrate the symptom. The resolution to the problem can then be applied to the entire class of systems, because they are configured the same. It is not necessary to wait for that problem to eventually surface on each of the other systems in the class.
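
One way to check that the members of a class really are configured the same is to fingerprint each member's configuration and flag the odd ones out. The following is only a minimal sketch: the function names, the host names and the idea of representing a configuration as a dictionary of settings are all assumptions made for illustration, not part of any particular tool.

```python
import hashlib
import json
from collections import defaultdict


def config_fingerprint(settings):
    """Hash a settings dict so identical configurations give identical digests."""
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(class_members):
    """Return the hosts whose configuration differs from the majority of the class.

    class_members maps a (hypothetical) host name to its settings dict.
    """
    groups = defaultdict(list)
    for host, settings in class_members.items():
        groups[config_fingerprint(settings)].append(host)
    majority = max(groups.values(), key=len)
    return sorted(
        host
        for hosts in groups.values()
        if hosts is not majority
        for host in hosts
    )


if __name__ == "__main__":
    web_class = {
        "web1": {"ntp": "on", "syslog": "central"},
        "web2": {"ntp": "on", "syslog": "central"},
        "web3": {"ntp": "off", "syslog": "central"},
    }
    print(find_drift(web_class))
```

A host reported here is a candidate for reinstallation from the class packages rather than for one-off repair, which keeps the class consistent.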

The tools available on standard un*x allow software packages to be constructed. These packages can include files and, optionally, scripts to execute during the installation of the package. The configuration of a class of systems can be controlled by the installation of a set of software packages and the execution of their associated configuration scripts. (This also makes possible the beginning elements of a configuration management system for verifying continued adherence to the configuration initially installed.)
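
The verification idea in the parenthetical above can be sketched as a checksum manifest recorded at install time and compared against the system later. This is a hedged illustration only: `build_manifest` and `verify_manifest` are hypothetical names, and real packaging tools keep their own databases in their own formats.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 digest of one file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Compare the files under root with a manifest recorded earlier."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "added": sorted(set(current) - set(manifest)),
        "modified": sorted(
            name
            for name in manifest
            if name in current and manifest[name] != current[name]
        ),
    }
```

Recording the manifest as the last step of a package's installation script, and running the verification on a schedule, would report any file that no longer matches what was installed.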

Packages should be developed which address the entire system. The packages should include:

  • the vendor's OS and the OS configuration
  • your corporate standard tools for all systems
  • your corporate standard documentation
  • network configuration
  • application installation
  • application configuration
  • application documentation
  • class specific tools, configuration and documentation

    Separate packaging tools or techniques may be desired or required for various types of packages. The system vendor's OS installation tools will probably be required for the OS and possibly the subsystems that are purchased from that vendor.

    Package tools range from tar or cpio archives to rdist or ninstall to UI's pkgtool and OSF's software distributor. The extreme could be considered Tivoli's Net Courier or HP's OpenView Software Distributor components of their network distributed system management environments. The lowest level of a package can be a tar archive, sometimes known as a "tar ball", but a bare archive should not be considered the end result. The higher level packaging tools have the capability to install files on the target system and execute some form of configuration commands to complete the installation. The flexibility and ease of developing these configuration or installation time scripts should be the guide for selecting the packaging environment.

    The package tools may also be combined in order to obtain a more consistent and manageable environment. For example, one package tool may be used to schedule or execute the commands from another.

    The packaging environment must be operational over the network as well as locally from media. It needs to be able to install or reinstall packages from a central site to remote sites.

    Operations

    Managing larger networks of heterogeneous un*x systems boils down to just diagnosing problems and implementing changes. Operations is the hot seat of problem management.

    The basic philosophy is to automate all of the system changes possible, and then to process the remainder of the system administrative actions via change control. The more the logging or collection, correlation and recognition of these changes is automated, the easier it is to diagnose the root cause of problems. The easier the diagnosis, the quicker the problems are resolved and the sooner the network environment is productive again. Once the root cause of a problem is understood, there is a better chance of reducing the probability of having that problem again. It should also reduce the down time due to problem diagnosis, and possibly reduce the severity or impact of the problem on production.

    First, problem diagnosis and then change management.

    Normal problem diagnostic flow is similar to:

    	It broke.
    	What broke?
    	I tried "this" and it didn't do it.
    	When was the last time it did?
    	Yesterday.
    	What has changed since yesterday?
    	Nothing.
    	later
    	You didn't tell me you changed ...
    	Oh! Well. Err, ahh, ... I didn't think that it ...

    Step One

    There are three possible answers to what changed since it worked last.
    1. the system had a hardware failure
    2. the operating system or application hit a previously undetected defect, i.e., it didn't work before either!
    3. a change was made to the system or the network environment

    This sounds easy in print, you say!
    However, it can be simplified in real life as well.

    First, we are dealing with sophisticated systems, capable of testing themselves and telling us if hardware fails. Most systems can even let us know that minor hardware problems are occurring on subsystems and should be examined before actual failures occur. Automate these checks! Don't be afraid to ask if it is still plugged in. Have one system check another and hold up its hand if it needs help.
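
Having one system check another can be as simple as testing whether a peer still answers on a known TCP port. The sketch below assumes that a successful TCP connection is an acceptable stand-in for "still plugged in"; the function names and the peer table are hypothetical, and a real site would add hardware and subsystem checks of its own.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unroutable, or name lookup failed.
        return False


def check_peers(peers):
    """Return the names of peers that did not answer.

    peers maps a name to a (host, port) pair.
    """
    return sorted(
        name
        for name, (host, port) in peers.items()
        if not is_reachable(host, port)
    )
```

Run from a scheduler on each system against its neighbours, a non-empty result is the "raised hand" that should reach an operator.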

    Next, new operating system or application errors are not as easy to resolve but, thank goodness, they rarely occur. Many problems are blamed on them, but root cause analysis will generally show that a hardware failure or a change was the real root cause.

    This leaves divining "The Change" as the goal for the day.

    The Change

    Changes "since it worked last" fall into two categories: local changes to the system and remote changes to "the network environment".

    An effective change log will reveal local changes. This change log should be simple and concise, with minimal entries from automatic processes (which were debugged and well documented) and complete entries from manual changes. The change log management process must be unrestrictive and simple enough that operators are willing to use and maintain it. Most interaction with the change log should be automated.
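
A change log of this kind can be very small. The following sketch assumes one JSON record per line in an append-only file; the field names (`when`, `who`, `what`, `automated`) and the function names are illustrative choices, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def log_change(path, who, what, automated=False, when=None):
    """Append one change record to the log file and return it."""
    entry = {
        "when": (when or datetime.now(timezone.utc)).isoformat(),
        "who": who,
        "what": what,
        "automated": automated,
    }
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def changes_since(path, since):
    """Return the records made at or after `since` (a timezone-aware datetime)."""
    with open(path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return [e for e in entries if datetime.fromisoformat(e["when"]) >= since]
```

With this in place, "What has changed since yesterday?" becomes a call to `changes_since` rather than a question that depends on someone's memory. Automated jobs would call `log_change` themselves, so that operators only need to record their manual changes.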

    Remote changes to the network are more troublesome to diagnose. They also fall into two categories: changes to remote systems and changes to the network environment. Diagnosis of changes to remote systems should proceed by beginning at "Step One" for that system.

    Changes to the network environment are either changes to the equipment or changes in the environment. For the equipment, start again at "Step One" for that component.

    Many subtle changes are possible for the network environment; that's what makes this job pay so well! The reason that the changes are subtle is that few operations log the changes in a manner that would enable or even suggest a correlation to problem diagnosis. To illustrate changes in this category, here are some examples.

    These interactions or capacities may have been previously unknown or at least not previously correlated.

    Automation of change detection and logging is the only hope to correlate the above types of change.
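
As a rough illustration of what such correlation might look like, the sketch below merges change records from several sources into one timeline and keeps only those that fall between the last time the system worked and the time it broke. The data layout and the name `suspects` are assumptions made for this example.

```python
from datetime import datetime


def suspects(change_logs, last_worked, broke_at):
    """List the changes made between last_worked and broke_at, newest first.

    change_logs maps a source name (a system or a piece of network
    equipment) to a list of (timestamp, description) pairs.
    """
    timeline = [
        (when, source, what)
        for source, entries in change_logs.items()
        for when, what in entries
        if last_worked <= when <= broke_at
    ]
    return sorted(timeline, reverse=True)


if __name__ == "__main__":
    logs = {
        "hostA": [(datetime(1995, 12, 26, 9, 0), "installed patch")],
        "router1": [(datetime(1995, 12, 27, 8, 0), "changed route table")],
    }
    window = (datetime(1995, 12, 26, 12, 0), datetime(1995, 12, 27, 15, 0))
    for when, source, what in suspects(logs, *window):
        print(when, source, what)
```

Listing the most recent change first reflects the usual order of suspicion; the local system's own log and the logs of remote systems and equipment all go into the same window.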

    Once the problem is successfully diagnosed, the next step is to implement some change to resolve the problem.

    Implementing Change

    The method used to implement changes monumentally affects the time, effort, repeatability and manageability of the environment.

    The systems must be installed and managed in adherence to a corporate-wide set of standards for this philosophy to succeed. The corporate-wide standards should primarily address consistency in networked systems management. These standards are intended to enhance the usefulness and effectiveness of the systems and of their management, not to exclude vendors or systems, or to restrict new thoughts or processes.

    Stay tuned. More to come as I have time to play. (Hopefully it is obvious that this is under construction.)

    I must NOT have had much time to play; this was 8 years ago!

    The change log needs to be accessible even if the system is "down hard".


    You can reach me by e-mail at: larryl at tcls.com

    This page was current as of Wed Dec 27 15:09:21 PST 1995
    but I took the liberty of updating the e-mail address and a few typos on Tue Apr 15 23:09:59 PDT 2003