Fault Tolerance
Definition
Fault tolerance is the ability of a system to continue functioning even when some of its components fail.
Types of failures
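From Tanenbaum and van Steen (receive omission and send omission are subtypes of omission failure; value failure and state transition failure are subtypes of response failure):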
Type of failure | Description |
---|---|
Crash failure | A server halts, but is working correctly until it halts |
Omission failure | A server fails to respond to incoming requests |
Receive omission | A server fails to receive incoming messages |
Send omission | A server fails to send messages |
Timing failure | A server’s response lies outside a specified time interval |
Response failure | A server’s response is incorrect |
Value failure | The value of the response is wrong |
State transition failure | The server deviates from the correct flow of control |
Arbitrary failure | A server may produce arbitrary responses at any time |
Failure masking
The best a system can do is hide its failures from the processes and users that depend on it.
One way to do this is redundancy: keep redundant copies of a component so that when one fails, another can take over and the system keeps working. With this approach you also need to decide how the copies stay in sync, where “in sync” does not always mean “have the exact same data at all times”. These are design decisions for you to make.
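As a minimal sketch of masking through redundancy, the Python snippet below tries each copy in turn and returns the first answer it gets. The names `call_with_failover`, `AllReplicasFailed`, `broken_replica` and `healthy_replica` are made up for illustration, and the sketch deliberately ignores how the copies are kept in sync.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every redundant copy failed to answer (hypothetical name)."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # No copy answered, so the failure can no longer be masked.
    raise AllReplicasFailed(errors)


def broken_replica() -> str:
    # Stand-in for a copy that has failed.
    raise ConnectionError("replica is down")


def healthy_replica() -> str:
    # Stand-in for a copy that is still working.
    return "value from backup"


# The caller never sees the first replica's failure.
result = call_with_failover([broken_replica, healthy_replica])
```

The caller only observes a failure once every copy has failed, which is the point of masking: the more independent copies there are, the less likely that becomes.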
Another way is to accept that a component has failed and design the system to keep functioning without it. Take Twitter, for example: its “Who to Follow” feature might be down for some reason, but that shouldn’t prevent you from accessing your timeline.
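A rough Python sketch of this kind of graceful degradation follows. The function `load_home_page` and the two fetchers are hypothetical and say nothing about how Twitter is actually built; the idea is only that a failure in an optional feature is caught and replaced with an empty result, while the core feature is still served.

```python
from typing import Callable, Dict, List


def load_home_page(
    fetch_timeline: Callable[[], List[str]],
    fetch_suggestions: Callable[[], List[str]],
) -> Dict[str, List[str]]:
    """Build the page, treating suggestions as an optional component."""
    # Core feature: if this fails, the error propagates to the caller.
    page = {"timeline": fetch_timeline()}
    try:
        page["suggestions"] = fetch_suggestions()
    except Exception:
        # Optional feature: degrade to an empty list instead of failing the page.
        page["suggestions"] = []
    return page


def timeline() -> List[str]:
    return ["post 1", "post 2"]


def suggestions_down() -> List[str]:
    # Stand-in for an unavailable suggestion service.
    raise TimeoutError("suggestion service unavailable")


page = load_home_page(timeline, suggestions_down)
```

Which components count as optional is a design decision; here the timeline is assumed to be essential and the suggestions are not.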
Crash-stop vs Crash-recovery
In the crash-stop model, some number of processes may crash, that is, stop executing steps at some point during the execution and never resume. The crash-recovery model generalizes it: a faulty process may behave exactly as in crash-stop, but a crashed process may also recover and resume execution. This gives rise to the following classes of processes:
- Always up: processes that never crash
- Eventually up: processes that crash at least once but eventually recover and never crash again.
- Eventually down: processes that crash at least once and, after some crash, never recover.
- Unstable: processes that crash and recover infinitely often.
Failure detection is not easy. A common mechanism is for processes to send heartbeats to each other to check whether their peers are alive. This can work, but a missing heartbeat does not prove that a process has stopped executing. During a network partition, heartbeats do not reach the processes on the other side of the partition, yet those processes may still be running, so you can’t assume they are down.
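The heartbeat idea can be sketched in Python as below. `HeartbeatDetector` and its `timeout` parameter are assumptions made for this example, and timestamps are passed in explicitly to keep it deterministic. Note that the detector only reports processes as suspected: a process that was silent because of a partition is cleared again as soon as one of its heartbeats arrives.

```python
from typing import Dict, Set


class HeartbeatDetector:
    """Timeout-based failure detector (illustrative, not a standard API)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, process_id: str, now: float) -> None:
        """Record that a heartbeat from process_id arrived at time now."""
        self.last_seen[process_id] = now

    def suspected(self, now: float) -> Set[str]:
        """Processes silent for longer than the timeout. Suspected, not proven dead."""
        return {
            pid
            for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        }


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)

# "b" has been silent for 4 time units, so it is suspected.
suspects_during_silence = detector.suspected(now=4.0)

# "b" was only unreachable, not crashed: its heartbeats resume.
detector.heartbeat("a", now=4.5)
detector.heartbeat("b", now=5.0)
suspects_after_heal = detector.suspected(now=5.5)
```

Choosing the timeout is a trade-off: a short one detects real crashes sooner but wrongly suspects slow or partitioned processes more often.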
References
- Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms, Second Edition. Pearson Prentice Hall, 2007.