Byzantine fault tolerance
From Wikipedia, the free encyclopedia
Byzantine fault tolerance is a sub-field of fault tolerance research inspired by the Byzantine Generals' Problem, which is a generalized version of the Two Generals' Problem.
The objective of Byzantine fault tolerance is to defend against Byzantine failures, in which a component of a system not only behaves erroneously but also behaves inconsistently when interacting with multiple other components. The correctly functioning components of a Byzantine fault tolerant system can still reach the same group decisions, provided there are not too many Byzantine faulty components.
Byzantine failures
A Byzantine fault is an arbitrary fault that occurs during the execution of an algorithm by a distributed system. It encompasses the faults commonly referred to as "crash failures" and "send and receive omission failures". When a Byzantine failure occurs, the system may respond in an unpredictable way unless it is designed to have Byzantine fault tolerance.
These arbitrary failures may be loosely categorized as follows:
- a failure to take another step in the algorithm, also known as a crash failure;
- a failure to correctly execute a step of the algorithm; and
- arbitrary execution of a step other than the one indicated by the algorithm.
The effect of a single fault can also grow as it propagates. For example, if the output of one function is the input of another, then small round-off errors in the first function can produce much larger errors in the second. If the output of the second function were fed into a third, the problem could grow larger still, until the values produced are worthless. Another example is compiling source code: one minor syntax error early in the code can produce a large number of perceived errors later, as the compiler gets out of phase with the lexical and syntactic information in the source program.
Steps are taken by processes, the logical abstractions that represent the execution path of the algorithms. A faulty process is one that at some point exhibits one of the above failures. A process that is not faulty is correct.
The Byzantine failure assumption models real-world environments in which computers and networks may behave in unexpected ways due to hardware failures, network congestion and disconnection, as well as malicious attacks. Byzantine failure-tolerant algorithms must cope with such failures and still satisfy the specifications of the problems they are designed to solve. Such algorithms are commonly characterized by their resilience t, the number of faulty processes with which an algorithm can cope.
Many classic agreement problems, such as the Byzantine Generals' Problem, have no solution unless t < n / 3, where n is the number of processes in the system.
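The bound t < n / 3 can be turned into a small calculation. The sketch below is illustrative only; the function names max_faulty and min_processes are hypothetical helpers, not part of any library.

```python
def max_faulty(n: int) -> int:
    """Largest t that satisfies t < n / 3 (equivalently n >= 3t + 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (n - 1) // 3


def min_processes(t: int) -> int:
    """Smallest n for which t faulty processes can be tolerated."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(f"n={n}: tolerates up to t={max_faulty(n)} faulty processes")
```

Under this bound, three processes cannot tolerate even one Byzantine fault, and four processes are the minimum needed to tolerate one.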
The Two Generals' Problem is a specific case which assumes that processes are reliable but communication between processes is not reliable.
Origin
Byzantine refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether to attack some enemy army. The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals. These traitors can act arbitrarily in order to achieve the following aims: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or confuse some generals to the point that they are unable to make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.
Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals reach unanimous agreement on their strategy. Note that if the source general is correct, all loyal generals must agree upon the value it sent. Otherwise, the choice of strategy agreed upon is irrelevant, as long as it is the same for all loyal generals.
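The two requirements above can be written as a small checker. This is a minimal sketch under assumed conventions: the function name agreement_holds, the string labels for generals, and the dictionary layout are all hypothetical choices made for illustration.

```python
from typing import Dict, Hashable, Set


def agreement_holds(
    decisions: Dict[str, Hashable],
    loyal: Set[str],
    source: str,
    source_value: Hashable,
) -> bool:
    """Check the agreement requirements on the loyal generals' decisions.

    `decisions` maps each non-source general to the value it decided on.
    """
    loyal_decisions = {decisions[g] for g in loyal - {source}}
    # Requirement 1: all loyal generals settle on one common value.
    if len(loyal_decisions) > 1:
        return False
    # Requirement 2: if the source is loyal, that value is the one it sent.
    if source in loyal and loyal_decisions and loyal_decisions != {source_value}:
        return False
    return True


if __name__ == "__main__":
    # Loyal source A, everyone follows its order.
    print(agreement_holds({"B": "attack", "C": "attack"}, {"A", "B", "C"}, "A", "attack"))  # True
    # Loyal generals B and C disagree.
    print(agreement_holds({"B": "attack", "C": "retreat"}, {"B", "C"}, "A", "attack"))  # False
    # Traitorous source: any common value is acceptable.
    print(agreement_holds({"B": "retreat", "C": "retreat"}, {"B", "C"}, "A", "attack"))  # True
```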
Solutions
Several solutions were originally described by Lamport, Shostak, and Pease in 1982. They began by noting that the Generals' Problem can be reduced to solving a "Commander and Lieutenants" problem, in which the loyal Lieutenants must all act in unison and their action must correspond to what the Commander ordered whenever the Commander is loyal. Roughly speaking, the Generals vote by treating each other's orders as votes.
- One solution considers scenarios in which messages may be forged, and is Byzantine-fault-tolerant as long as the number of traitorous generals is less than one third of the total number of generals. The impossibility of dealing with one third or more traitors ultimately reduces to proving that the problem with one Commander and two Lieutenants cannot be solved if the Commander is traitorous. The reason is as follows. Suppose there are three generals, A, B, and C, where A is the Commander and the traitor. When A tells B to attack and C to retreat, and B and C then forward A's message to each other, neither B nor C can figure out who the traitor is, since it is not necessarily A: the other Lieutenant could have forged the message purportedly from A. It can be shown that if n is the total number of generals and t is the number of traitors among them, then the problem has solutions only when n ≥ 3t + 1.
- A second solution requires unforgeable signatures (in modern computer systems, this may be achieved in practice using public-key cryptography), but maintains Byzantine fault tolerance in the presence of an arbitrary number of traitorous generals.
- Also presented is a variation on the first two solutions allowing Byzantine-fault-tolerant behavior in some situations where not all generals can communicate directly with each other.
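The voting idea behind the first solution can be sketched as a recursive simulation in the spirit of the oral-messages algorithm, usually written OM(m), from the 1982 paper. This is an illustrative sketch under stated assumptions, not a faithful reproduction of the paper: generals are numbered with integers, the names send, majority, and om are hypothetical, "retreat" is assumed as the default order, and the traitor's behaviour (telling even-numbered receivers to attack and odd-numbered receivers to retreat) is just one arbitrary adversary chosen to keep the run deterministic.

```python
from collections import Counter
from typing import Dict, List, Set

DEFAULT = "retreat"  # assumed fallback when no strict majority exists


def send(sender: int, receiver: int, value: str, traitors: Set[int]) -> str:
    """Value that `receiver` gets when `sender` is supposed to relay `value`."""
    if sender in traitors:
        # Hypothetical adversary: send conflicting orders depending on receiver.
        return "attack" if receiver % 2 == 0 else "retreat"
    return value


def majority(votes: List[str]) -> str:
    """Strict majority of the votes, or the default order if there is none."""
    value, count = Counter(votes).most_common(1)[0]
    return value if 2 * count > len(votes) else DEFAULT


def om(m: int, commander: int, lieutenants: List[int], value: str,
       traitors: Set[int]) -> Dict[int, str]:
    """Simulate m rounds of relaying; return each lieutenant's decision."""
    # Step 1: the commander sends its order to every lieutenant.
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j relays what it received to the others,
    # acting as commander in a run with one fewer round.
    relayed = {}
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Step 3: each lieutenant takes the majority of its own value and the relays.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions


if __name__ == "__main__":
    # Four generals, loyal commander 0, traitorous lieutenant 3.
    print(om(1, 0, [1, 2, 3], "attack", traitors={3}))
    # Four generals, traitorous commander 0.
    print(om(1, 0, [1, 2, 3], "attack", traitors={0}))
    # Three generals, loyal commander 0, traitorous lieutenant 2.
    print(om(1, 0, [1, 2], "attack", traitors={2}))
```

With this particular adversary, the two four-general runs leave the loyal lieutenants in agreement (and following the loyal commander when there is one), while in the three-general run the single loyal lieutenant sees one vote each way and falls back to the default instead of the loyal commander's order. A traitor's own entry in the returned dictionary is meaningless and should be ignored.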
References
- L. Lamport, R. Shostak, and M. Pease (July 1982). "The Byzantine Generals Problem". ACM Trans. Programming Languages and Systems 4 (3): 382–401. http://research.microsoft.com/users/lamport/pubs/byz.pdf.
- Castro, Miguel and Barbara Liskov (1999). "Practical Byzantine Fault Tolerance". Operating Systems Design and Implementation. http://www.pmg.lcs.mit.edu/~castro/osdi99_html/osdi99.html.
External links
- Ocean Store replicates data with a Byzantine fault tolerant commit protocol.
- Byzantine Quorum Systems Quorum systems for Byzantine-fault tolerant replication.
- Practical Byzantine Fault Tolerance
- Byzantine Fault Tolerance in the RKBExplorer