Disaster Tolerance and Recovery in a Serviceguard Cluster
Evaluating the Need for Disaster Tolerance
Chapter 114
Evaluating the Need for Disaster Tolerance
Disaster tolerance is the ability to restore applications and data within
a reasonable period of time after a disaster. Most people think of fire,
flood, and earthquake as disasters, but a disaster can be any event that
unexpectedly interrupts service or corrupts data in an entire data center:
the backhoe that digs too deep and severs a network connection, or an act
of sabotage.
Disaster tolerant architectures protect against unplanned down time due
to disasters by geographically distributing the nodes in a cluster so that
a disaster at one site does not disable the entire cluster. To evaluate your
need for a disaster tolerant solution, you need to weigh:
• Risk of disaster. Areas prone to tornadoes, floods, or earthquakes
may require a disaster recovery solution. Some industries need to
consider risks other than natural disasters or accidents, such as
terrorist activity or sabotage.
The type of disaster to which your business is prone, whether it is
due to geographical location or the nature of the business, will
determine the type of disaster recovery you choose. For example, if
you live in a region prone to big earthquakes, you are not likely to
put your alternate or backup nodes in the same city as your primary
nodes, because that sort of disaster affects a large area.
The frequency of the disaster also plays an important role in
determining whether to invest in a rapid disaster recovery solution.
For example, you would be more likely to protect from hurricanes
that occur seasonally, rather than protecting from a dormant
volcano.
• Vulnerability of the business. How long can your business afford to be
down? Some parts of a business may be able to endure a 1 or 2 day
recovery time, while others need to recover in a matter of minutes.
Some parts of a business only need local protection from single
outages, such a node failure. Other parts of a business may need both
local protection and protection in case of site failure.
It is important to consider the role applications play in your
business. For example, you may target the assembly line production
servers as most in need of quick recovery. But if the most likely
disaster in your area is an earthquake, it would render the assembly