DRPDR003: Disaster Preparation
[University of Arkansas][Computing Services]

Disaster Recovery Plan
Disaster Preparation
(DRPDR003)

Last update: Tuesday, 21-Mar-2000 10:32:07 CST

In order to facilitate recovery from a disaster which destroys all or part of the machine room in the Administrative Services Building, certain preparations have been made in advance. This document describes what has been done to lay the way for a quick and orderly restoration of the facilities that Computing Services operates.

The following topics are presented in this document:

Disaster Recovery Planning
Recovery Facility
Replacement Equipment
Backups
Disaster Lock Boxes

Disaster Recovery Planning

The first and most obvious thing to do is to have a plan. This document is part of the overall plan that Computing Services will use in response to a disaster. How effective that plan can be, however, depends on the disaster recovery plans of other departments and units within the University.

For instance, if the Administration Building were to be involved in the same disaster as the Administrative Services Building, the functions of the Business Manager's Office or, more particularly, the Purchasing Office could be severely affected. Without access to the appropriate procedures, documents, vendor lists, and approval processes, the Computing Services recovery process could be hampered by delays while Purchasing recovers.

Every other business unit within the University should develop a plan for how it will conduct business, both in the event of a disaster in its own building and in the event of a disaster at Computing Services that removes its access to data for a period of time. Those business units need a means to function while the computers and networks are down, plus a plan to synchronize the data that is restored on the central computers with the current state of affairs. For example, if the Payroll Office is able to produce a payroll while the central computers are down, that payroll data will have to be re-entered into the central computers when they return to service. Having a means of tracking all expenditures such as payroll while the central computers are down is extremely important.


Recovery Facility

If a central facility operated by Computing Services is destroyed in a disaster, repair or rebuilding of that facility may take an extended period of time. In the interim it will be necessary to restore computer and network services at an alternate site.

The University has a number of options for alternate sites, each having a varying degree of up-front costs.

Hot Site
This is probably the most expensive option for being prepared for a disaster, and is typically most appropriate for very large organizations. A separate computer facility, possibly even located in a different city, can be built, complete with computers and other facilities ready to cut in on a moment's notice in the event the primary facility goes offline. The two facilities must be joined by high speed communications lines so that users at the primary campus can continue to access the computers from their offices and classrooms.

Disaster Recovery Company
A number of companies provide disaster recovery services on a subscription basis. For an annual fee (usually quite steep) you have the right to a variety of computer and other recovery services on extremely short notice in the event of a disaster. These services may reside at a centralized hot site or sites that the company operates, but it is necessary for you to pack up your backup tapes and physically relocate personnel to restore operations at the company's site. Some companies have mobile services which move the equipment to your site in specially prepared vans. These vans usually contain all of the necessary computer and networking gear already installed, with motor generators for power, ready to go into service almost immediately after arrival at your site. (Note: Most disaster recovery companies that provide these types of subscription services contractually obligate themselves to their customers not to provide the services to any organization that has not subscribed, so looking to one of these companies for assistance after a disaster strikes will likely be a waste of time.)

Disaster Partnerships
Some organizations will team up with others in a partnership with reciprocal agreements to aid each other in the event of a disaster. These agreements can cover anything from simple manpower sharing all the way up to full use of a computer facility. Often, however, since the assisting partner has to continue its day-to-day operations on its systems, the agreements are limited to providing access for a few key, critical applications that the disabled partner must run to stay afloat while its facilities are restored. The primary drawback to these kinds of partnerships is that it takes continual vigilance on the part of both parties to communicate the inevitable changes that occur in computer and network systems so that the critical applications can be changed in advance to remain operational. Learning that you can't run a payroll, for instance, at your partner's site because they no longer use the same computer hardware or operating system that you need is a bitter pill that no one should have to swallow.

One of the most critical issues involved in the recovery process is the availability of qualified staff to oversee and carry out the tasks involved. This is often where disaster partnerships can have their greatest benefit. Through cooperative agreement, if one partner loses key personnel in the disaster, the other partner can provide skilled workers to carry out recovery and restoration tasks until the disabled partner can hire replacements for its staff. Of course, to be completely fair to all parties involved, the disabled partner should fully compensate the assisting partners for use of their workers unless there has been prior agreement not to do so.

Northwest Arkansas has some fairly large mainframe installations that would likely help if needed. (Potential organizations to contact are WalMart, Tyson Foods, JBHunt, and IBM.) Also, the University has a reciprocal agreement with the University of Arkansas for Medical Sciences (UAMS) in Little Rock under which each party provides assistance to the other in the event of a disaster. UAMS may be able to provide computing facilities for short-term, critical applications. The University can also seek assistance from the State Department of Computer Services in Little Rock.

The use of reciprocal disaster agreements of this nature may work well as a low-cost alternative to hiring a disaster recovery company or building a hot site. And they can be used in conjunction with other arrangements, such as the use of a cold recovery site described below. The primary drawback to these agreements is that they usually have no provision for providing computer and network access for anything other than predefined critical applications. So users will be without facilities for a period of time until systems can be returned to operation.

Cold Site
A cold recovery site is an area physically separate from the primary site where space has been identified for use as the temporary home for the computer and network systems while the primary site is being repaired. There are varying degrees of "coldness", ranging from an unfinished basement all the way to space where the necessary raised flooring, electrical hookups, and cooling capacity have already been installed, just waiting for the computers to arrive.

The University of Arkansas has chosen to use the cold site approach for this disaster recovery plan. The necessary agreements are in place for Computing Services to utilize space in the Bell Engineering Building (BELL 108 suite) as its Cold Site. It has adequate space to house the hardware, with some office space available for operating and technical personnel. It has good connectivity to the campus fiber optic network. And a certain amount of preparation has been made for electrical and cooling capacity to support mainframes and network equipment.

More detail on the preparations that have already been made, plus the actual work needed to renovate the space so that it is ready to receive the computer equipment, is available in Section DRPCS001: Recovery at the Cold Site.


Replacement Equipment

This plan contains a complete inventory of the components of each of the computer and network systems and their software that must be restored after a disaster. The inevitable changes that occur in the systems over time require that the plan be periodically updated to reflect the most current configuration. Where possible, agreements have been made with vendors to supply replacements on an emergency basis. To avoid problems and delays in the recovery, every attempt should be made to replicate the current system configuration. However, there will likely be cases where components are not available or the delivery timeframe is unacceptably long. The Recovery Management Team will have the expertise and resources to work through these problems as they are recognized. Although some changes to the procedures documented in the plan may then be required, using different models of equipment or equipment from a different vendor may be a suitable way to expedite the recovery process.


Backups

New hardware can be purchased. New buildings can be built. New employees can be hired. But the data that was stored on the old equipment cannot be bought at any price. It must be restored from a copy that was not affected by the disaster. There are a number of options available to help ensure that such a copy of the data survives a disaster at the primary facility.

Remote Dual Copy
This option calls for a disk subsystem located at a site away from the primary computer facility and fiber optic cabling coupling the remote disk to the disk subsystem at the primary site. Data written to disk at the primary site are automatically transmitted to the remote site and written to disk there as well. This guarantees that you have the most up-to-the-second updates for the databases at the primary site in case it is destroyed. You can simplify the recovery process by locating the remote disk subsystem at the disaster recovery site. This option is somewhat expensive, but not prohibitively so. It does not require that an entire computer system be built at a hot site, just the disk subsystem. This option is typically limited to mainframe disk systems only.

Automated Off-Site Tape Backup
This option calls for a robotic tape subsystem located at a site away from the primary computer facility and fiber optic cabling (the campus backbone network would be suitable) coupling the subsystem to the primary computer facility. Copies of operating system data, application and user programs, and databases can be transmitted to the remote tape subsystem, where they are stored on magnetic tape (writable optical disk media can also be used, but may be more expensive).

While this option does not guarantee the up-to-the-second updates available with the remote dual copy disk option, it does provide a means for conveniently taking backups and storing them off-site at any time of the day or night. Another huge advantage is that backups can be made from mainframes, file servers, distributed (Unix-based) systems, and personal computers. Although such a system is expensive, it is not prohibitively so.

Off-Site Tape Backup Storage
This option calls for the transportation of backup tapes made at the primary computer facility to an off-site location. Choice of the location is important. You want to ensure survivability of the backups in a disaster, but you also need quick availability of the backups.

This option has some drawbacks. First, there is a period of exposure from the time that a backup is made to the time it can be physically moved off-site. A disaster striking at the wrong time may result in the loss of all data changes that have occurred since the last backup that reached the off-site location. There is also the time, expense, and energy of having to transport the tapes. And there is the risk that tapes can be physically damaged or lost while they are being transported.
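
The period of exposure can be made concrete with a small calculation. The following Python sketch is illustrative only: the function name `data_loss_window` and all of the times shown are hypothetical and are not taken from the actual backup schedule. It assumes that only backups that had already arrived off-site when the disaster struck survive.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Each backup is (time the backup was taken, time it arrived off-site).
Backup = Tuple[datetime, datetime]


def data_loss_window(backups: List[Backup], disaster: datetime) -> Optional[timedelta]:
    """Return how much recent work is lost if the primary site is destroyed.

    Only backups that were already off-site at the time of the disaster
    survive. Returns None when no backup had reached the off-site location.
    """
    surviving = [taken for taken, offsite in backups if offsite <= disaster]
    if not surviving:
        return None
    return disaster - max(surviving)


# Hypothetical schedule: backups taken at 02:00, delivered off-site at 08:00.
backups = [
    (datetime(2000, 3, 20, 2, 0), datetime(2000, 3, 20, 8, 0)),
    (datetime(2000, 3, 21, 2, 0), datetime(2000, 3, 21, 8, 0)),
]

# A disaster at 06:00 on the second day destroys the newest backup,
# which was still waiting at the primary site for the delivery run.
lost = data_loss_window(backups, datetime(2000, 3, 21, 6, 0))
print(lost)
```

In this hypothetical case the newest backup never left the primary site, so the loss is measured from the previous day's backup rather than from the most recent one.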

Some organizations contract with disaster recovery companies to store their backup tapes in hardened storage facilities. These can be in old salt mines or deep within a mountain cavern. While this certainly provides for more secure data storage, considerable expense is undertaken for regular transportation of the data to the storage facility. Quick access to the data can also be an issue if the storage facility is a long distance away from your recovery facility.

The University has opted to take periodic backups of its primary mainframe systems, databases, file servers, and Unix systems and to store those backups in two locations elsewhere on campus. The primary storage location is in Bell Engineering Room 108M, which is adjacent to the Cold Site recovery suite. The second location is in the Business Administration Building Room 107. The tape vaults at the Administrative Services Building are the final storage location where the oldest generation of system and application backup tapes is kept.

In general, backups for each subsystem are cycled through the three sites. Backups are initially taken to BELL 108M in the Computing Services morning delivery run. These are the first generation backups. Existing tapes at BELL are relocated to BADM 107. These are the second generation backups. Existing tapes at BADM are relocated back to the Administrative Services Building for storage in the tape vaults in the machine room. These are the third generation tapes. They are retained until the next set of backups is made, and then released to scratch status. Then the cycle starts all over again.
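
The three-generation cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Computing Services procedure: the site names come from the text, while the function names `rotate` and `run_cycles` and the tape set labels are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Site order follows the cycle in the text: newest backups go to BELL 108M,
# then move to BADM 107, then return to the tape vault at ADSB.
SITES = ["BELL 108M", "BADM 107", "ADSB vault"]


def rotate(sites: Dict[str, Optional[str]], new_set: str) -> Optional[str]:
    """Advance every tape set one generation and store new_set at BELL 108M.

    Returns the tape set released to scratch status, or None if the
    third-generation slot was still empty.
    """
    released = sites["ADSB vault"]            # oldest generation is scratched
    sites["ADSB vault"] = sites["BADM 107"]   # second generation becomes third
    sites["BADM 107"] = sites["BELL 108M"]    # first generation becomes second
    sites["BELL 108M"] = new_set              # new backups are first generation
    return released


def run_cycles(labels: List[str]) -> Tuple[Dict[str, Optional[str]], List[Optional[str]]]:
    """Apply one rotation per label, starting with all three sites empty."""
    sites: Dict[str, Optional[str]] = {name: None for name in SITES}
    released = [rotate(sites, label) for label in labels]
    return sites, released


# Hypothetical tape set labels, one per backup cycle.
final_sites, scratched = run_cycles(["set1", "set2", "set3", "set4"])
print(final_sites)
print(scratched)
```

After the fourth cycle, the first tape set is the only one that has been released to scratch status; the three most recent sets are spread across the three sites.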

The actual backup and cycling procedures vary somewhat depending on the computer platform. Details of these procedures are contained in the following document:

DRPBK001: Backup Procedures


Disaster Lock Boxes

To ensure that an up-to-date copy of this plan is available when a disaster occurs, procedures have been established to store a copy of the plan with other important recovery information at the Cold Site backup tape storage area. Two Lock Boxes have been purchased to hold these materials. The contents of both lock boxes are identical. One resides at BELL 108M; the other resides in the tape vault just off the machine room in the Administrative Services Building.

When changes to the contents of the lock boxes are necessary, the box at the Administrative Services Building is updated first, then it is taken over to BELL and swapped with the box stored there. That box is returned to ADSB, updated, and placed in the tape vault. This ensures that at least one copy of the plan is available at the recovery site at all times.
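
The reason for this ordering can be checked with a short Python sketch that walks through the steps and records, after each one, whether a lock box is present at BELL. It is an illustration only: the `LockBox` class, the box labels, and the `plan_version` counter are hypothetical and are not part of the actual procedure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LockBox:
    label: str         # hypothetical label, for illustration only
    location: str      # "ADSB" or "BELL 108M"
    plan_version: int  # hypothetical revision counter for the contents


def box_at_bell(boxes: List[LockBox]) -> bool:
    """Return True if at least one lock box is at the recovery site."""
    return any(box.location == "BELL 108M" for box in boxes)


def update_lock_boxes(adsb_box: LockBox, bell_box: LockBox,
                      new_version: int) -> List[Tuple[str, bool]]:
    """Apply the update procedure and log recovery-site coverage per step."""
    boxes = [adsb_box, bell_box]
    log = []

    # Step 1: update the box kept at the Administrative Services Building.
    adsb_box.plan_version = new_version
    log.append(("updated ADSB box", box_at_bell(boxes)))

    # Step 2: carry it to BELL and swap it with the box stored there.
    adsb_box.location = "BELL 108M"
    bell_box.location = "ADSB"
    log.append(("swapped at BELL", box_at_bell(boxes)))

    # Step 3: update the returned box and place it in the ADSB tape vault.
    bell_box.plan_version = new_version
    log.append(("updated returned box", box_at_bell(boxes)))
    return log


box_a = LockBox("A", "ADSB", plan_version=1)
box_b = LockBox("B", "BELL 108M", plan_version=1)
steps = update_lock_boxes(box_a, box_b, new_version=2)
print(steps)
```

Because the swap takes place at BELL itself, the recovery site holds a box at every step, and it holds the updated copy from the moment of the swap onward.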

The lock boxes are to remain locked at all times. Keys to the boxes are kept by several key people within the department, including

In a disaster situation when entry into a lock box is needed but the key is not available, you can physically break the lock with bolt cutters.

The contents of the lock boxes are described in the following document:

DRPDR016: Disaster Lock Boxes Contents

Copyright © 1997 University of Arkansas
All rights reserved