Self-stabilization

From Wikipedia, the free encyclopedia

Jump to: navigation, search

Self-stabilization is a concept of fault-tolerance in distributed computing. Distributed computing systems are challenging to debug and analyze. As a result, strong properties (properties that hold under a variety of circumstances) of such systems are especially important to simplify systems analysis and to prove system correctness. Self-stabilization is considered a highly desirable property. A distributed system that is self-stabilizing will end up in a correct state no matter what state it is initialized with, and no matter what execution steps it will take. This property guarantees that the system will end in a correct state after a finite number of execution steps. This is in contrast to typical fault-tolerance algorithms that guarantee that under all state transitions, the system will never deviate from a correct state. E.W. Dijkstra in 1974 presented the first self-stabilizing algorithm, prompting further research in this area.[1]

The ability to recover without external intervention is very desirable in modern computer and telecommunications networks, since it would enable them to repair errors and return to normal operations on their own. Computers and networks can thus be made fault-tolerant. Hence, many years after the seminal paper of Dijkstra, this concept is gaining in importance as it presents an important foundation for self-managing computer systems and fault-tolerant systems. As a result, Dijkstra's paper received the 2002 ACM PODC Influential-Paper Award - one of the highest achievements in the distributed computing community.[2]

Contents

[edit] Overview

A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legitimate state and remain in a legitimate set of states thereafter. A state is legitimate if starting from this state the algorithm satisfies its specification. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its nature. Moreover, a self-stabilizing algorithm does not have to be initialized as it eventually starts to behave correctly.

Dijkstra's paper, which introduces the concept of self-stabilization, presents an example in the context of a "token ring" — a network of computers ordered in a circle, such that exactly one of them is supposed to "hold a token" at any given time. Not holding a token is a correct state for each computer in this network, since the token can be held by another computer. However, if every computer is in the state of "not holding a token" then the network as a whole is not in a correct state. Similarly, if more than one computer "has a token" then this is not a correct state for the network, although it cannot be observed to be incorrect by viewing any computer individually. Since every computer can "see" only the states of two other computers, it is hard for the computers to decide whether the network as a whole is in a correct state.

The time complexity of a self-stabilizing algorithm is measured in (asynchronous) rounds or cycles. A round is a shortest execution trace in which each processor executes at least one step. Similarly, a cycle is a shortest execution trace in which each processor executes at least one complete iteration of its repeatedly executed list of commands. It is also interesting to measure the output stabilization time. For that, a subset of the state variables is defined to be externally visible (the output). Certain states of outputs are defined to be correct (legitimate). The set of the outputs of all the components of the system is said to have stabilized at the time that it starts to be correct, provided it stays correct indefinitely, unless additional faults occur. The output stabilization time is the time (the number of (asynchronous) rounds) until the output stabilized.[3]

The first self-stabilizing algorithms did not detect errors explicitly in order to subsequently repair them. Instead, they constantly pushed the system towards a legitimate state, even without explicitly detecting error states. Since traditional methods for detecting an error (e.g.[4]) were often very difficult and time-consuming, such a behaviour was considered desirable.

New methods for light-weight error detection for self-stabilizing systems were suggested in,[5][3] under the names of local detection[5] and local checking.[3] The term local refers to a part of a computer network. When local detection is used, a computer in a network is not required to communicate with the entire network in order to detect an error — the error can be detected by having each computer communicate only with its nearest neighbors. These local detection methods simplified the task of designing self-stabilizing algorithms considerably. This is because the error detection mechanism and the recovery mechanism can be designed separately. Newer algorithms based on these detection methods turned out to be also much more efficient.

Additional efficiency was introduced with the notion of time-adaptive protocols.[6] The idea behind these is that when only a small number of errors occurs, the recovery time should (and can) be made short. The original algorithms of Dijkstra do not have this property.

A useful property of self-stabilizing algorithms is that they can be composed by layering if they do not exhibit any circular dependencies. The stabilization time of the composition is then bounded by the sum of the individual stabilization times of each layer.

[edit] Definition

A system is self-stabilizing if and only if:

  1. Starting from any state, it is guaranteed that the system will eventually reach a correct state (convergence).
  2. Given that the system is in a correct state, it is guaranteed to stay in a correct state, provided that no fault happens (closure).

A system is said to be randomized self-stabilizing if and only if it is self-stabilizing and the expected number of rounds needed to reach a correct state is bounded by some constant k.[7]

A self-stabilizing algorithm is silent if and only if it converges to a global state where the values of communication registers used by the algorithm remain fixed.[8]

[edit] Related work

An extension of the concept of self-stabilization is that of superstabilization.[9] The intent here is to cope with dynamic distributed systems that undergo topological changes. In classical self-stabilization theory, arbitrary changes are viewed as errors where no guarantees are given until the system has stabilized again. With superstabilizing systems, there is a passage predicate that is always satisfied, while the system's topology is reconfigured.

[edit] References

  1. ^ E.W. Dijkstra: Self-stabilizing systems in spite of distributed control. Commun. ACM 17 (1974), 11: 643-644.
  2. ^ Edsger W. Dijkstra Prize in Distributed Computing
  3. ^ a b c Baruch Awerbuch, Boaz Patt-Shamir, George Varghese. Self-Stabilization By Local Checking and Correction (Extended Abstract) FOCS 1991: 268-277.
  4. ^ Shmuel Katz, Kenneth J. Perry. Self-Stabilizing Extensions for Message-Passing Systems. Distributed Computing 7(1): 17-26 (1993).
  5. ^ a b Yehuda Afek, Shay Kutten, Moti Yung. The Local Detection Paradigm and Its Application to Self-Stabilization. Theor. Comput. Sci. 186(1-2): 199-229 (1997).
  6. ^ Shay Kutten, Boaz Patt-Shamir: Stabilizing Time-Adaptive Protocols. Theor. Comput. Sci. 220(1): 93-111 (1999).
  7. ^ Self-Stabilization. Shlomi Dolev, MIT Press, 2000.
  8. ^ Shlomi Dolev, Mohamed G. Gouda, and Marco Schneider. Memory requirements for silent stabilization. In PODC '96: Proceedings of the fifteenth annual ACM Symposium on Principles of Distributed Computing, pages 27--34, New York, NY, USA, 1996. ACM Press. Online extended abstract.
  9. ^ Shlomi Dolev and Ted Herman. Superstabilizing protocols for dynamic distributed systems. Chicago Journal of Theoretical Computer Science, 4, December 1997. Special Issue on Self-Stabilization.
Personal tools
Languages