Introduction
Over the past several years Business Communications Systems or
Telecom Systems have evolved into highly computerized operational environments.
This includes the use of microcomputers in offices as well as LAN
infrastructures and servers that provide much of the operational support for
telecommunication systems. In addition, a WAN ties these various systems
together and provides communications to other computer networks and to the
diagnostic facilities of the telecom vendors involved, including providers of
local and long distance telephone services and cable TV.
The reliability of computers and computer-based
systems has increased dramatically in the past few years. Computer and telecom
failures that do occur can normally be diagnosed automatically and repaired
promptly using both local and remote diagnostic facilities. Many computer
systems contain redundant parts, which improve their reliability and provide
continual operation when some failures occur.
In the past, most computer operations were
predominantly batch-oriented. Disaster plans consisted primarily of
reciprocal agreements made between users of similar systems for job processing
(usually at night and/or on weekends). This has become less feasible with the very
complicated on-line and diverse network systems most institutions now have
installed. Although institutions may have similar equipment and operating
systems, they generally do not have the capacity to add a large number of users
from another on-line environment to their systems even if the technical problems
could be solved.
A trend is evolving to provide alternate sites
near the local systems where any additional equipment needed can be shipped in
rapidly, and critical on-line operations for the organization can be resumed in
a reasonable time. Redundancy in the communications network and a tie-in to the
alternate site, or the ability to rapidly tie-in, is an important part of the
disaster plan. This type of site is called a cold backup site, as opposed to a
hot backup site which contains all equipment necessary to start immediate
operations.
For the most part, the major problems that can
cause a telecom system to be inoperable for a length of time result from
environmental problems related to the communications systems. The various
situations or incidents that can disable, partially or completely, or impair
support of your business facilities should be identified. A working plan for how
to deal with each situation should be provided.
Contact KTS NETWORK SOLUTIONS for a
customized Disaster Recovery Plan for your Telecom System that includes
alternate routing and equipment or technology kits.
SAMPLE DISASTER RECOVERY PLAN
Objectives/Constraints
A major objective of the recovery plan document
should be to define procedures for a contingency plan for recovery from
disruption of computer, telecom systems and/or network services. This disruption
may come from total destruction of the central site or from minor disruptive
incidents. There is a great deal of similarity in the procedures to deal with
the different types of incidents affecting an organization's different technology
areas. However, special attention and emphasis should be given to an orderly
recovery and resumption of those operations that concern the critical business
of running the organization.
Assumptions
The plan should contain some general assumptions,
but may not include all special situations that can occur. Any special decisions
needed at the time of an incident for situations not covered in the plan can be
made by senior technology staff or other members on site who have been placed
'in charge.'
The plan is typically
invoked upon the occurrence of an incident. The senior
staff member on site at the time of the incident or the first one on site
following an incident will contact the appropriate levels of senior management
and/or officers.
The senior technology staff member on site at the
time of the incident will assume immediate responsibility. The first
responsibility will be to see that people are evacuated as needed. If injuries
have occurred as a result of the incident, immediate attention will be given to
those persons injured. If the situation allows, attention will be focused
on shutting down systems, turning off power, etc.,
but evacuation should be the highest priority.
Once an incident which is covered by the plan has
been declared, the plan, duties, and responsibilities will remain in effect
until the incident is resolved and appropriate authorities are notified.
Invoking this plan implies that a recovery
operation has begun and will continue with top priority until workable computer
and/or telephone support has been re-established.
Incidents Requiring Action
This disaster recovery plan for the organization
should be automatically or statutorily invoked under named circumstances, for
example:
An incident which has disabled or will disable,
partially or completely, the central communications facilities, and/or the
communications network for a period of 24 hours.
Contingencies
General situations that can destroy or interrupt
computer and telephone services usually occur under the following major
categories:
- Power/Air Conditioning Interruption
- Telecommunications
- Fire
- Water
- Weather and Natural Phenomena
- Sabotage and Interdiction
There are different levels of severity of these contingencies
necessitating different strategies and different types and levels of recovery.
This plan covers strategies for:
- Partial recovery - operating at an alternate site
on [the property] and/or other client areas on [the property].
- Full recovery - operating at the current central
site and client areas, possibly with a degraded level of service for a period of
time.
Physical Safeguards
Telecommunications Equipment Room
This room typically houses the telephone switch,
voice mail system, cable television equipment, and data communications
equipment. It is the hub for each of these organization-wide data, voice, and
video networks. There is no protection against water damage.
The telephone equipment is connected to a
_____________________ UPS system. This will maintain the telephone switch for
_______ hours. Other equipment in this room is connected to individual or
clustered UPS. This equipment room is protected by a fire protection system
using FM 200.
Types of Communication Service Disruptions
This document includes hardware and software
information, emergency information, and personnel information that will assist
in faster recovery from most types and levels of disruptive incidents that may
involve other computing facilities. Additional information that may be needed is
provided in the appendices of this document. Supporting documents should contain
additional hardware, software and vendor information.
Normal system problems
[Identify]
- System and Component Type
- Vendor Name, Contacts & Phone Numbers
- Spares List; Location of Spares
- Describe response and recovery goals.
Major computer and communications system problems
[Identify]
- System and Component Type
- Vendor Name, Contacts & Phone Numbers
- Spares List; Location of Spares
- Describe response and recovery goals.
Environmental problems (air conditioning, electrical, fire)
Air Conditioning Outage
[Identify]
- System and Component Type
- Vendor Name, Contacts & Phone Numbers
- Spares List; Location of Spares
- Describe response and recovery goals.
Electrical
[Identify]
- System and Component Type
- Vendor Name, Contacts & Phone Numbers
- Spares List; Location of Spares
- Describe response and recovery goals.
Plan Example: In the event of an electrical
outage all servers and other critical equipment are protected from damage by
Uninterruptible Power Supplies (UPSs). These units will maintain electrical
service to our servers long enough for them to be shut down gracefully. Once
electrical power is restored the servers will remain "powered down" until the
UPSs are recharged a sufficient amount to ensure the servers could
be gracefully shut down in the event of a second power failure.
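As a rough illustration of the power-up rule in the plan example above, the following Python sketch checks whether every UPS has recharged enough to carry a second graceful shutdown. The UpsStatus fields, the safe_to_power_up name, and the 1.5 safety margin are assumptions made for illustration; the plan gives no figures, and real readings would come from the UPS vendor's monitoring tools.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Snapshot of one UPS (hypothetical fields, not a vendor API)."""
    name: str
    runtime_minutes: float  # estimated battery runtime at current load


def safe_to_power_up(units, shutdown_minutes, margin=1.5):
    """Return True when every UPS could carry a second graceful shutdown.

    shutdown_minutes: time the servers need to shut down gracefully.
    margin: assumed safety factor; the plan does not specify one.
    """
    units = list(units)
    if not units:
        # No readings available: keep the servers powered down.
        return False
    required = shutdown_minutes * margin
    return all(u.runtime_minutes >= required for u in units)


if __name__ == "__main__":
    readings = [UpsStatus("rack-a", 12.0), UpsStatus("rack-b", 20.0)]
    # rack-a has 12 minutes but 15 are required, so this prints False.
    print(safe_to_power_up(readings, shutdown_minutes=10))
```

Requiring every unit to pass, rather than an average, reflects the plan's wording that the servers stay down until a graceful shutdown could be guaranteed again.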
Fire
[Identify]
- System and Component Type
- Vendor Name, Contacts & Phone Numbers
- Spares List; Location of Spares
- Describe response and recovery goals.
Plan Example: Room ### (the Server Room) is
equipped with a halon fire protection system, which will adequately protect the
equipment from fires starting in the server room itself. If a fire starts, the
halon system should limit damage to the affected piece of equipment and
possibly minor damage to equipment in the immediate vicinity. In the event of
a catastrophic fire involving the entire building, we would most likely have to
replace all our hardware.
List Other Applicable
Insurance Considerations
[Identify]
- Equipment Covered; Make, Model and Serial
Numbers.
- Underwriter Name; Broker-Agent; Claims contacts
and phone numbers.
Plan Example: All computers which are covered
under a maintenance contract also have a Recover-All Insurance Policy.
The remainder of the equipment (including personal computers) is not covered
under that policy. All major hardware is covered under our standard
property and casualty insurance policy.
Recovery Team
In case of a disaster, the team will use the emergency call
list. General duties of the disaster recovery coordinator are discussed.
Recovery team leaders have been assigned in each major area and general duties
given. Assignment of personnel in the major areas to specific tasks during the
recovery stage will be made by the team leader over that area.
Organization of the Disaster/Recovery Team
[Identify]
- Member's name, title/rank, home phone, cell phone
- Assignments for each member
Disaster/Recovery Team Headquarters
[Identify]
- HQ shall be Main Building, 1st floor, Room #
- If Main Building, 1st floor, Room # is not usable,
the recovery team will meet in [next location].
- If [next location], Room # is not usable, the recovery
team will meet in [next location].
- If none of the above locations are usable, it is
presumed that the disaster is of such proportions that recovery of computer
support will take a lesser priority. The Disaster Recovery Coordinator will make
appropriate arrangements.
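The headquarters fallback order above amounts to "use the first location that is still usable." A minimal Python sketch of that rule follows; the function name, the location labels, and the is_usable callback are hypothetical and stand in for whatever on-site assessment the team actually performs.

```python
def choose_headquarters(locations, is_usable):
    """Return the first usable location, or None if none is usable.

    locations: meeting places in order of preference.
    is_usable: callable that reports whether a location can be used
               (hypothetical; in practice a human judgment on site).
    """
    for location in locations:
        if is_usable(location):
            return location
    # None signals the last case in the list: the Disaster Recovery
    # Coordinator makes other arrangements.
    return None


if __name__ == "__main__":
    order = ["Main Building", "next location 1", "next location 2"]
    damaged = {"Main Building"}
    print(choose_headquarters(order, lambda loc: loc not in damaged))
```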
Disaster Recovery Coordinator
Plan Sample: The Executive Director of
Information Technology will serve as Disaster Recovery Coordinator. The major
responsibilities include:
Determining the extent and seriousness of the
disaster, notifying the CIO and Executive Vice President immediately and keeping
them informed of the activities and recovery progress. The Executive Vice
President will in turn keep the President, the other Vice Presidents and
Managers informed.
Invoking the Disaster Recovery Plan after
approval of the Executive Vice President.
Supervising the recovery activities.
Coordinating with the Executive Vice President on
priorities for clients while going from partial to full recovery.
Naming replacements, when needed, to fill in for
any disabled or absent disaster recovery members. Any members who are out of
town and are needed will be notified to return.
The Director, Technology Support will keep
clients informed of the recovery activities.
Administrative Systems/Operations Recovery Team Leader Responsibilities
The Senior Systems Analyst will serve as
Administrative Systems/Operations Recovery Team Leader.
Responsibilities include:
Coordinating hardware and software replacement
with the administrative hardware and software vendors.
Supervising retrieval of backup media and
materials from the off-site storage location and using these for recovery when
needed.
Coordinating recovery with client departments.
Coordinating appropriate computer and
communications recovery with the Network Communications Recovery Team Leader.
Coordinating recovery of administrative software
with client departments.
Coordinating schedules for administrative
programming, production services, and computer job processing.
Keeping the Disaster Recovery Coordinator
informed of the extent of damage and recovery procedures being implemented.
Network Communications Recovery Team Leader Responsibilities
The Senior Telecom Analyst will serve as the
Network Communications Recovery Team Leader.
Responsibilities include:
Coordinating hardware and software replacement
with the communications hardware and software vendors.
Supervising recovery of the computer
communications, telephone system and/or cable TV.
Assigning personnel, from telecom analysts
to project leaders, to disaster recovery tasks as needed.
Coordinating activities of computer and
communications recovery with the other Recovery Team Leaders.
Keeping the Disaster Recovery Coordinator
informed of the extent of damage and recovery procedures being implemented.
Preparing for a Disaster
This section contains the minimum steps necessary to prepare
for a possible disaster and for implementing the recovery
procedures. An important part of these procedures is ensuring that the
off-site storage facility contains adequate and timely computer backup tapes
and documentation for applications systems, operating systems, support
packages, and operating procedures.
General Procedures
Responsibilities have been assigned for ensuring
that each of the following actions has been taken and that any updating needed is
continued.
Maintaining and updating the disaster recovery
plan.
Ensuring that all Organization technology area
personnel are aware of their responsibilities in case of a disaster.
Ensuring that periodic scheduled rotation of
backup media is being followed for the off-site storage facilities.
Maintaining and periodically updating disaster
recovery materials, specifically documentation and systems information, stored
in the off-site areas.
Maintaining a current status of equipment in the
main equipment rooms in ____________________.
Informing all technology personnel of the
appropriate emergency and evacuation procedures from [enter location name].
Ensuring that all security warning systems and
emergency lighting systems are functioning properly and are periodically checked
by operations personnel.
Ensuring that fire protection systems are
functioning properly and that they are checked periodically.
Ensuring that UPS systems are functioning
properly and that they are being checked periodically.
Ensuring that the client community is aware of
appropriate disaster recovery procedures and any potential problems and
consequences that could affect their operations.
Ensuring that the operations procedure manual is
kept current.
Ensuring that proper temperatures are maintained
in equipment areas.
Software Safeguards
Plan Sample: Server software and data are secured
by full backups each week and differential backups each weekday evening. The
full backups are transported each Monday morning to the lower level of the
Library. The first backup of each month is retained for one year. Nightly
differential backups are retained in Systems & Operations until the next full
backup. A copy of the full backups is also stored in a safe deposit box at [bank
branch]. Backups are stored on 4mm DAT tapes and other compact media.
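The server backup schedule in the plan sample above can be sketched in Python as follows. The plan does not name the day of the weekly full backup, so Saturday is an assumption here, as are the function names; the rule that the first full backup of each month is kept for one year comes from the text.

```python
from datetime import date

# Assumption: the weekly full backup runs on Saturday (Monday=0 .. Sunday=6).
FULL_BACKUP_WEEKDAY = 5


def backup_type(day):
    """Return "full", "differential", or None for the given date."""
    if day.weekday() == FULL_BACKUP_WEEKDAY:
        return "full"
    if day.weekday() < 5:  # Monday through Friday evenings
        return "differential"
    return None  # no backup scheduled on the remaining weekend day


def keep_for_one_year(day):
    """True when the date holds the first full backup of its month.

    The first occurrence of any weekday in a month always falls on
    day 1 through 7, so a day-of-month check is sufficient.
    """
    return backup_type(day) == "full" and day.day <= 7


if __name__ == "__main__":
    for d in (date(2024, 6, 1), date(2024, 6, 3), date(2024, 6, 8)):
        print(d.isoformat(), backup_type(d), keep_for_one_year(d))
```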
Long Distance software and data are secured by
full backups that run at 3:00 AM Tuesday through Saturday. The Saturday full
backups are transported each Monday morning to the lower level of the
________________. A copy of the full backups is also stored in a safe deposit
box at [bank branch]. The first full backup of each month is retained for one
year. Backups are stored on 4mm DAT tapes and other compact media.
A special backup is done immediately before each
monthly billing cycle. These backups are overwritten before the next monthly
billing cycle. Call records from six months previous are archived to 4mm DAT
tapes at the end of every billing cycle. Disposal dates for the save sets are
not currently implemented. Call records are routed through a solid state
recorder. This captures call records while the long distance computer is
unavailable. The recorder will capture thousands of calls, which
is around two and one-half to three days of calls during the busy time of the
month.
Telephone switch software and data are secured by
a full backup each night to diskette. The diskette left in the telephone switch
is overwritten each night. Each Monday morning, the diskette is removed and
transported to the vault by [name courier]. A copy of the full backups is also
stored in a safe deposit box at [Bank branch]. There are three diskettes in
rotation for the full backups.
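A three-diskette rotation with a swap each Monday can be tracked with simple date arithmetic. The sketch below is illustrative only: the rotation start date and the diskette labels are hypothetical, and it assumes the swap happens every Monday without exception.

```python
from datetime import date

# Hypothetical values: a Monday on which diskette "A" went into the switch.
ROTATION_START = date(2024, 1, 1)
DISKETTES = ("A", "B", "C")


def diskette_in_switch(day):
    """Return the label of the diskette expected in the switch on `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return DISKETTES[weeks_elapsed % len(DISKETTES)]


if __name__ == "__main__":
    print(diskette_in_switch(date(2024, 1, 10)))  # second week of the rotation
```

Counting whole weeks from a fixed Monday, rather than using calendar week numbers, keeps the rotation continuous across year boundaries.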
VoiceMail software and data are secured by a full
backup to diskette. Each Monday morning, this diskette is transported to the
vault by [name courier]. A copy of the full backups is also stored in a safe
deposit box at [Bank branch].
Recovery Procedures
Central Facilities Recovery Plan
Plan Sample: An incident at the central computing/networking facilities in
[enter location name] may place this plan into action. An incident may be of the
magnitude that the facilities are not usable and alternate site plans are
required. In this case, the alternate site portions of this plan must be
implemented. It is obvious that all major support sections in [enter org name]
technology areas will need to function together in a disaster, although a
specific plan of action is written for each section.
Other systems being used in production include
various Intel-based file servers. There is currently a backup system in place
and equipment could be configured and shipped by the vendor in a short period of
time.
Systems & Operations
This portion of the disaster/recovery plan will
be set into motion for computing services when an incident has occurred that
requires use of the alternate site, or the damage is such that operations can be
restored, but only in a degraded mode at the central site in a reasonable time.
It is assumed a disaster has occurred and the
administrative recovery plan is to be put in effect. This decision will be made
by the Executive Vice President upon advice from the Executive Director of
Information Technology.
In case of either a move to an alternate site, or
a plan to continue operations at the main site, the following general steps must
be taken:
Determine the extent of the damage and if
additional equipment and supplies are needed.
Obtain approval for expenditure of funds to bring
in any needed equipment and supplies.
Notify local vendor marketing and/or service
representatives if there is a need of immediate delivery of components to bring
the computer systems to an operational level even in a degraded mode.
If it is judged advisable, check with third-party
vendors to see if a faster delivery schedule can be obtained.
Notify vendor hardware support personnel that a
priority should be placed on assistance to add and/or replace any additional
components.
Notify vendor systems support personnel that help
is needed immediately to begin procedures to restore systems software at [enter
org name].
Order any additional electrical cables needed
from suppliers.
Rush order any supplies, forms, or media that may
be needed.
In addition to the general steps listed at the
beginning of this section, the following additional major tasks must be followed
in use of the alternate site:
Notify officials that an alternate site will be
needed.
Coordinate moving of equipment and support
personnel into the alternate site with appropriate personnel.
Bring the recovery materials from the off-site
storage to the alternate site.
As soon as the hardware is up to specifications
to run the operating system, load software and run necessary tests.
Determine the priorities of the client software
that needs to be available and load these packages in order. These priorities
often depend on the time of the month and semester when the disaster
occurs.
Prepare backup materials and return these to the
off-site storage area.
Set up operations in the alternate site.
Coordinate client activities to ensure the most
critical jobs are being supported as needed.
As production begins, ensure that periodic backup
procedures are being followed and materials are being placed in off-site storage
periodically.
Work out plans to ensure all critical support
will be phased in.
Keep administration and clients informed of the
status, progress, and problems.
Coordinate the longer range plans with the
administration, the alternate site officials, and staff for time of continuing
support and ultimately restoring the Systems & Operations section.
Degraded Operations at Central Site
In this event, it is assumed that an incident has
occurred but that degraded operations can be set up at [enter location name]. In
addition to the general steps that are followed in either case, special steps
need to be taken.
Evaluate the extent of the damage, and if only
degraded service can be obtained, determine how long it will be before full
service can be restored.
Replace hardware as needed to restore service to
at least a degraded service.
Perform system installation as needed to restore
service. If backup files are needed and are not available from the on-site
backup files, they will be transferred from the off-site storage.
Work with the various vendors, as needed, to
ensure support in restoring full service.
Keep the administration and clients informed of
the status, progress and problems.
Use of Alternate Sites
If the central site is destroyed, support of
critical academic computing activities will be given from the alternate sites.
Additional computer systems will be brought in as needed.
Some steps necessary in this process are listed.
Determine the priorities of client needs and
upgrade computers at the academic labs.
Set up for operations support.
Coordinate installing additional equipment and
moving support personnel.
When additional, needed equipment is available,
move backup materials from the off-site storage area.
Coordinate restoring any network communications
with Computer & Network Services.
Coordinate client computing support with clients.
As production begins, ensure that backup
procedures are followed and periodic backups are stored off site.
Work with the Director of the Center for Teaching
Excellence, the Provost, and clients in coordinating long-range plans for
restoring full support by the academic computing resources.
Network Communications
Redundancy is being built into the computer
communications systems. We do not have complete redundancy, but most systems
have backup equipment and/or cards.
This plan does not, at this time, address the
problem of a need for redundancy in the telephone switch system. Considerable
funds will be needed for an alternate plan in this area in case of a major
disaster in the [organization name] telephone switch. Providing adequate air
conditioning and fire protection is the highest priority.
Since most of the telephone and computer
communications lines are buried and in conduits across [the property],
connecting lines to alternate sites and to critical areas cannot be done
rapidly. For example, it is estimated that if [enter org name] technology areas
had to move, it would take 72 hours to restore critical data and voice
communications lines.
Some general steps that must be taken in case of
a network communications disaster at the central site and/or other parts of the
communications network are given.
Assessment of the damage and an evaluation of
steps needed to restore services.
Assignment of personnel to disaster crews and
assignment of tasks. The priority of repairs will be set by the Disaster
Recovery Coordinator after an evaluation of the critical needs of the
[organization name] following the disaster.
If present supplies and equipment on hand are not
adequate to restore service as needed, obtain approval for funds needed and
contact vendors for priority shipment.
Coordinate repairs of data communications
disasters affecting specific areas of technology support with the recovery team
leader of that area.
Keep the Disaster Recovery Coordinator and team
leaders of support areas informed of the extent of the communications damage and
recovery procedures being implemented.
A chart of the communications network at [enter
org name] is being developed. When it is completed, a copy of this chart will be
placed in the off-site storage area and periodically updated.
Microcomputer Recovery Plan
Individual clients should plan backups as
follows:
Daily - This procedure is used to back up all
files created or modified each day. This procedure copies those files to a USB
type drive or local tape for backup storage. It can be performed at the end of the
day or when a client is through using the computer for the day. These backup
drives or tapes need to be placed in a locked file cabinet. A methodology for
providing network-based client backups has been devised for [the property].
Customers can contact Technology Support for more information.
Weekly - This procedure is used to back up all
files. This procedure will also copy all files to a USB type drive or local
tape for backup storage. This procedure can be performed on any weekday,
but should be done consistently once a week on the particular day chosen.
NOTE: It is recommended that each
microcomputer workstation retain one set of daily backups. It is also
recommended that two sets of weekly backups be kept.
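The difference between the daily and weekly procedures is only in which files are selected. The Python sketch below shows one way to make that selection; the function name is hypothetical, and treating "created or modified each day" as "modified within the last 24 hours" is an approximation. Copying the selected files to the backup medium is left out.

```python
import time
from pathlib import Path


def files_to_back_up(root, mode, now=None):
    """List files under `root` for a "daily" or "weekly" backup.

    daily:  files modified within the last 24 hours.
    weekly: every file.
    now:    reference time in seconds since the epoch (defaults to the
            current time; exposed so the cutoff can be controlled).
    """
    if mode not in ("daily", "weekly"):
        raise ValueError("mode must be 'daily' or 'weekly'")
    now = time.time() if now is None else now
    cutoff = now - 24 * 60 * 60
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    if mode == "weekly":
        return sorted(files)
    return sorted(p for p in files if p.stat().st_mtime >= cutoff)


if __name__ == "__main__":
    for path in files_to_back_up(".", "daily"):
        print(path)
```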
Provide a protective environment for all disks.
Weekly backup disks should be placed in a
protective area away from the office. This area needs to be fireproof.
Note: the "Three Year and Out" computer replacement will
implement backups for every computer in faculty and staff offices. This will
be fully implemented by ________.
Computer Lab Recovery Plan
In case of an event affecting only a lab, this
section of the disaster plan will be executed. For recovery purposes, labs by
definition will mean a computer area supporting a number of clients as
contrasted to an area containing only a few microcomputers. An event can occur
in an area not defined as a lab; however, it is assumed recovery of services in
this situation can be carried out in a routine manner. An area may be considered
a lab even if it is in an administrative service area and there are a large
number of microcomputers involved.
A disaster will be declared in a lab when a large
portion of the units in the lab are affected to the extent that recovery in that
area in a reasonable time with normal procedures is not possible.
General steps that will be followed in recovery
of a lab are listed. The team leader of the computer area with support duties
over the lab affected will assume prime responsibility in the recovery process.
Determine the extent of the damage in the lab and
whether alternate lab services will be needed while recovery is taking place.
Obtain [organization name] approval for any funds
needed to replace equipment and supplies.
Determine whether adequate equipment is available
on [the property], either from the [storage name] or other areas, to restore
even partial services in the lab affected.
Coordinate recovery of the lab with Computer &
Network Services if communications lines are involved in the lab.
If alternate services are to be provided for
clients of the lab, coordinate activities between groups affected.
Keep the Disaster Recovery Coordinator informed of the
status of the lab and the recovery process.
Emergency Procedures
In case an incident has happened or is imminent that will
drastically disrupt operations, the following steps should be taken to reduce
the probability of personal injuries and/or limit the extent of the damage, if
there is not a risk to employees. Similar steps should be followed, where
appropriate, in incidents occurring in a satellite center.
An announcement should be made to evacuate the
building, if appropriate, or move to a safe location in the building. As a
preparation for a potential disaster, all [enter org name] technology area
personnel should be aware of the exits available.
If there are injured personnel, ensure their
evacuation and call emergency assistance as needed.
If the computers and air conditioning have not
automatically powered down, initiate procedures to shut down systems in an
orderly manner when possible.
When possible and if time is available, set up
damage-limiting measures.
Designate available personnel to initiate the lockup
procedures normally followed at the end of the last shift.
Alternate Computing Services Facility
Plan Example: In an
environment where the primary central site equipment is Microsoft Windows
Server 2003 servers, the use of an alternate computing
facility is not as vital. The type of equipment being used can run in an office
environment and requires very little space. Our intention would be to survey the
needs at the time of disaster and place the server equipment where we would have
the best access to telephone lines to establish modem access. Access would be
limited to critical applications at the times they are needed.
Off-site Storage
All central file backups are made on magnetic
tapes or other compact media using an appropriate backup strategy and stored in
a room in the lower level of the Main Building on [the property] at [enter org
name]. Computer & Network Services employees have access to keys both to the
exterior doors and to the room where tapes are stored. A copy of the full
backups is also stored in a safe deposit box at [Bank of Choice] located at
[cross streets]; phone number(s).
Other reference documents:
List as needed: