Fault Tolerance

Definition

Fault tolerance is the ability of a system to keep functioning when some of its components fail.

Types of failures

From Tanenbaum and van Steen:

Type of failure	Description
Crash failure	A server halts, but is working correctly until it halts
Omission failure	A server fails to respond to incoming requests
Receive omission	A server fails to receive incoming messages
Send omission	A server fails to send messages
Timing failure	A server’s response lies outside a specified time interval
Response failure	A server’s response is incorrect
Value failure	The value of the response is wrong
State transition failure	The server deviates from the correct flow of control
Arbitrary failure	A server may produce arbitrary responses at any time

Failure masking

The best a system can do is try to hide its failures so that others never observe them.

One way to do this is redundancy: keep multiple copies of a component so that when one fails, another (or several others) can take over and the system keeps working. With this approach you also have to decide how the copies stay in sync, and “in sync” does not always have to mean “have the exact same data at all times”. That trade-off is a design decision for you to make.
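
A minimal sketch of masking a failure through redundancy, assuming each replica is exposed as a plain callable and that any raised exception means the replica failed. The names call_with_failover, primary and backup are hypothetical and exist only for this illustration; keeping the replicas in sync is deliberately left out.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # assumption: any exception means "replica failed"
            last_error = exc
    # Every copy failed, so the failure can no longer be masked.
    raise RuntimeError("all replicas failed") from last_error


def primary() -> str:
    # Hypothetical replica that is currently down.
    raise ConnectionError("primary is down")


def backup() -> str:
    # Hypothetical replica that is still healthy.
    return "value from backup"


# The caller never sees that the primary failed.
result = call_with_failover([primary, backup])
```

The caller only notices a problem once every copy has failed, which is the point where masking is no longer possible.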

Another way is to accept that a component has failed and design the system to function without it. Take Twitter, for example: the “Who to Follow” feature might be down for some reason, but that shouldn’t prevent you from accessing your timeline.
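
The idea of working without a failed component could look roughly like the sketch below. It assumes the page is built from two independent calls, one essential and one optional. The function and key names (render_home, fetch_timeline, fetch_suggestions, who_to_follow) are made up for this example and do not describe how Twitter actually works.

```python
from typing import Callable, Dict, List


def render_home(
    fetch_timeline: Callable[[], List[str]],
    fetch_suggestions: Callable[[], List[str]],
) -> Dict[str, List[str]]:
    """Build the home page, degrading gracefully if suggestions are unavailable."""
    # The timeline is essential: if it fails, the error propagates to the caller.
    page = {"timeline": fetch_timeline()}
    try:
        page["who_to_follow"] = fetch_suggestions()
    except Exception:
        # Optional feature is down: show nothing instead of failing the whole page.
        page["who_to_follow"] = []
    return page


def working_timeline() -> List[str]:
    return ["post 1", "post 2"]


def broken_suggestions() -> List[str]:
    raise TimeoutError("suggestion service is down")


page = render_home(working_timeline, broken_suggestions)
```

Here the user still gets a timeline, and the suggestions box is simply empty while that component is down.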

Crash-stop vs Crash-recovery

In the crash-stop model, a certain number of processes may stop executing steps at some point during the execution and never resume. The crash-recovery model generalizes it: every behavior of a faulty process in crash-stop is also possible in crash-recovery, but a crashed process may additionally recover. This leads to the following classes of processes:

Always up: processes that never crash.
Eventually up: processes that crash at least once but eventually recover and do not crash again.
Eventually down: processes that crash at least once and, after some crash, never recover.
Unstable: processes that crash and recover infinitely often.
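
The classes above are defined over whole (possibly infinite) executions, so a finite trace can only suggest a class that is consistent with what has been observed so far. The sketch below is a heuristic under that assumption: the event strings "crash" and "recover", the classify function and the unstable_after threshold are all invented for illustration, and "unstable" is only guessed from a high crash count.

```python
from enum import Enum
from typing import Sequence


class ProcessClass(Enum):
    ALWAYS_UP = "always up"
    EVENTUALLY_UP = "eventually up"
    EVENTUALLY_DOWN = "eventually down"
    UNSTABLE = "unstable"


def classify(events: Sequence[str], unstable_after: int = 3) -> ProcessClass:
    """Guess a process class from a finite trace of crash/recover events."""
    crashes = 0
    up = True  # assumption: every process starts in the "up" state
    for event in events:
        if event == "crash" and up:
            crashes += 1
            up = False
        elif event == "recover" and not up:
            up = True
        else:
            raise ValueError(f"unexpected event {event!r} while up={up}")
    if crashes == 0:
        return ProcessClass.ALWAYS_UP
    if crashes >= unstable_after:
        # Heuristic only: a finite trace cannot show "infinitely often".
        return ProcessClass.UNSTABLE
    return ProcessClass.EVENTUALLY_UP if up else ProcessClass.EVENTUALLY_DOWN


observed = classify(["crash", "recover"])
```

A later event can always change the answer, which is why the result is a guess about the trace seen so far rather than a property of the process.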

Failure detection is not easy. A common mechanism is for processes to send heartbeats to each other to signal that they are alive. This can work, but a missing heartbeat does not prove that a process has stopped executing. If there is a network partition, heartbeats will not reach processes on the other side of the partition, yet you cannot assume those processes are down.
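
A minimal sketch of a timeout-based heartbeat detector, assuming each process reports under a string identifier. The class name HeartbeatDetector, its methods and the timeout value are hypothetical. Note that it only reports processes as suspected, never as crashed, because a silent process may simply be slow or on the other side of a partition.

```python
import time
from typing import Dict, Optional, Set


class HeartbeatDetector:
    """Tracks the last heartbeat per process and reports suspects after a timeout."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, process_id: str, now: Optional[float] = None) -> None:
        # Record when we last heard from this process.
        self.last_seen[process_id] = time.monotonic() if now is None else now

    def suspected(self, now: Optional[float] = None) -> Set[str]:
        # Silent for longer than the timeout means "suspected", not "known to be down".
        current = time.monotonic() if now is None else now
        return {
            pid
            for pid, seen in self.last_seen.items()
            if current - seen > self.timeout
        }


# Explicit timestamps keep the example deterministic.
detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=4.0)
suspects = detector.suspected(now=5.0)  # "b" has been silent for 5 seconds

# A late heartbeat removes the suspicion: "b" was never actually down.
detector.heartbeat("b", now=6.0)
suspects_later = detector.suspected(now=6.5)
```

The second half of the example shows why a suspicion must be revocable: the heartbeat from "b" was only delayed, so treating the timeout as proof of a crash would have been wrong.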

References

Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms, Second Edition. Pearson Prentice Hall, 2007.