Fault Tolerance and High Availability (Linktionary term)


Fault Tolerance and High Availability

Note: Many topics at this site are reduced versions of the text in "The Encyclopedia of Networking and Telecommunications." Search results will not be as extensive as a search of the book's CD-ROM.

Fault tolerance and high availability are about keeping systems up and running 24 hours a day, 7 days a week, or at least keeping them running with reasonable performance. Downed systems can cost an organization tens or even hundreds of thousands of dollars per hour, as outlined in the following table:

The Cost of Internet Commerce Downtime (Source: Forrester Research)
Web Site	Daily Internet Commerce Revenue as of 1/15/99 (U.S. $)	Lost Revenue per Hour of Downtime as of 1/15/99 (U.S. $)*
www.techdata.com	$1,000,000	$18,280
www.amazon.com	$2,700,000	$22,500
www.dell.com	$10,000,000	$91,320
www.cisco.com	$20,000,000	$182,640
www.intel.com	$33,000,000	$274,980

*Lost revenue assumes a U.S. $1-million-per-day site where 20 percent of transactions are lost during downtime.
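
As a rough check on the footnote's assumption, the following sketch spreads daily revenue evenly over 24 hours and applies the 20 percent loss rate. The function name and the even-spread assumption are illustrative and do not come from Forrester. Only some rows of the table (www.amazon.com, for example) match this simple formula exactly, so the published figures presumably rest on additional assumptions not stated here.

```python
def lost_revenue_per_hour(daily_revenue, lost_fraction=0.20, hours_per_day=24):
    """Estimate revenue lost per hour of downtime.

    Assumption (not from the source table): revenue is spread evenly
    across the day, and `lost_fraction` of it is lost while the site is down.
    """
    return daily_revenue / hours_per_day * lost_fraction


# Daily revenue for www.amazon.com as listed in the table above.
print(round(lost_revenue_per_hour(2_700_000)))
```

Changing `lost_fraction` shows how sensitive the estimate is to the share of transactions assumed lost rather than merely delayed.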

A fault-tolerant system is designed to keep running even after a fault has occurred. Fault-tolerant features in early network operating systems included mirrored disks, with both disks reading and writing the same information. If one disk failed, the other kept running in what is called "failover" mode. This fault tolerance was expanded to disk duplexing, in which the disks and disk controllers were duplicated. These redundant components not only provided fault tolerance, but also improved performance since disk reads could come from either disk (writes still had to be performed by both disks). Of course, fault-tolerant systems must provide more than just disk failover. Some other examples of redundant systems include the following:

RAID disk systems combine multiple hard drives into fault-protected arrays.

Redundant components (power supplies, I/O boards, and so on) keep a system running when a single part fails.

Multiple servers are clustered to minimize problems if any of the servers should fail.

Alternate pathing and load balancing improve throughput and provide redundant links.

Multiple data centers protect against local disasters.
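
The disk mirroring behavior described earlier (every write goes to both disks, reads may come from either, and the survivor carries on after a fault) can be sketched as follows. This is a minimal in-memory illustration, not how any particular network operating system implements it; the `Disk` and `MirroredPair` classes and their methods are hypothetical names invented for this example.

```python
class Disk:
    """Hypothetical in-memory stand-in for a physical disk."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_no, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[block_no] = data

    def read(self, block_no):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[block_no]


class MirroredPair:
    """Writes go to every healthy disk; reads are served by any healthy disk."""

    def __init__(self, primary, secondary):
        self.disks = [primary, secondary]
        self._next = 0  # round-robin counter used to spread reads

    def _healthy(self):
        disks = [d for d in self.disks if not d.failed]
        if not disks:
            raise IOError("both disks in the mirror have failed")
        return disks

    def write(self, block_no, data):
        # Every surviving disk must record the write to stay in sync.
        for disk in self._healthy():
            disk.write(block_no, data)

    def read(self, block_no):
        # Alternate between healthy disks; only the survivor is used after a fault.
        healthy = self._healthy()
        disk = healthy[self._next % len(healthy)]
        self._next += 1
        return disk.read(block_no)


mirror = MirroredPair(Disk("disk0"), Disk("disk1"))
mirror.write(7, b"payroll")
mirror.disks[0].failed = True   # simulate a fault on the first disk
print(mirror.read(7))           # the read is served by the surviving disk
```

The sketch deliberately omits resynchronization: a real mirror must copy missed writes back to a repaired or replaced disk before it can rejoin the pair.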

This topic continues in "The Encyclopedia of Networking and Telecommunications" with a discussion of the following:

High availability (resiliency) and ways of measuring it (mean time to failure and mean time to recover)
Classes of availability, including two nines, three nines, four nines, five nines, and six nines
Ways to achieve fault tolerance and high availability, including:

Disk-level protection

Transaction-monitoring systems

Redundant components

Uninterruptible power

Disk mirroring and duplexing

RAIDs (redundant arrays of inexpensive disks)

Mirrored servers

Clustering

Load balancing

Redundant communication links

Distributed computing

Duplicate data centers

Outsourcing and colocation
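
The availability measures and "nines" classes listed above can be related with a little arithmetic. The sketch below uses the commonly cited steady-state formula, availability = MTTF / (MTTF + MTTR), and converts each class of nines into a yearly downtime budget. The formula is standard practice rather than a quotation from the encyclopedia, and the function names and the 365-day year are assumptions made for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and mean time to recover."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(nines):
    """Yearly downtime allowed at `nines` nines of availability (3 means 99.9%)."""
    return HOURS_PER_YEAR * 10 ** -nines


# Downtime budget for each availability class from two nines to six nines.
for n in range(2, 7):
    minutes = downtime_hours_per_year(n) * 60
    print(f"{n} nines: {minutes:,.1f} minutes of downtime per year")

# A system that runs 999 hours between failures and takes 1 hour to recover.
print(f"{availability(mttf_hours=999, mttr_hours=1):.3%}")
```

Because each additional nine cuts the downtime budget by a factor of ten, shortening recovery time is often as important as lengthening the time between failures.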