High availability
High availability is a system design protocol and associated implementation that ensures a certain degree of operational continuity during a given measurement period.
Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
Planned and unplanned downtime
A distinction needs to be made between planned downtime and unplanned downtime. Typically, planned downtime is the result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Planned downtime events might include patches to system software that require a reboot, or system configuration changes that only take effect upon a reboot; in general, planned downtime is the result of some logical, management-initiated event. Unplanned downtime events typically arise from some physical event, such as a hardware or software failure or an environmental anomaly. Examples of unplanned downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature-related shutdown, logically or physically severed network connections, catastrophic security breaches, and various application, middleware, and operating system failures.
Many computing sites exclude planned downtime from availability calculations, assuming, correctly or incorrectly, that planned downtime has little or no impact upon the computing user community. By excluding planned downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and they have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements.[citation needed]
Percentage calculation
Availability is usually expressed as a percentage of uptime in a given year. In a given year, the number of minutes of unplanned downtime is tallied for a system; the aggregate unplanned downtime is divided by the total number of minutes in a year (approximately 525,600), producing a percentage of downtime; the complement is the percentage of uptime, which is what is typically referred to as the availability of the system.
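As a concrete illustration of the calculation just described, the following Python sketch computes availability from tallied unplanned downtime (the function name and sample figure are illustrative, not from the article):

```python
# Availability as the complement of unplanned downtime over a year,
# per the description above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def availability_percent(unplanned_downtime_minutes: float) -> float:
    """Return availability as a percentage of uptime in a non-leap year."""
    downtime_fraction = unplanned_downtime_minutes / MINUTES_PER_YEAR
    return (1 - downtime_fraction) * 100

# Example: 500 minutes of unplanned downtime tallied over a year
print(availability_percent(500))  # roughly 99.905
```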
The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously; a short sketch that reproduces these figures follows the table. Service level agreements often refer to monthly downtime in order to calculate service credits to match monthly billing cycles.
Availability % | Downtime per year | Downtime per month* | Downtime per week |
---|---|---|---|
90% | 36.5 days | 72 hours | 16.8 hours |
95% | 18.25 days | 36 hours | 8.4 hours |
98% | 7.30 days | 14.4 hours | 3.36 hours |
99% | 3.65 days | 7.20 hours | 1.68 hours |
99.5% | 1.83 days | 3.60 hours | 50.4 min |
99.8% | 17.52 hours | 86.4 min | 20.16 min |
99.9% ("three nines") | 8.76 hours | 43.2 min | 10.1 min |
99.95% | 4.38 hours | 21.6 min | 5.04 min |
99.99% ("four nines") | 52.6 min | 4.32 min | 1.01 min |
99.999% ("five nines") | 5.26 min | 25.9 s | 6.05 s |
99.9999% ("six nines") | 31.5 s | 2.59 s | 0.605 s |
* For monthly calculations, a 30-day month is used.
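The figures above follow directly from the stated assumptions (a 365-day year, a 30-day month, and a 7-day week). A short sketch that reproduces them, with an illustrative helper name:

```python
# Downtime allowed per period at a given availability percentage,
# using the same period lengths as the table above.
HOURS_PER_PERIOD = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}

def downtime_allowed_hours(availability_percent: float) -> dict:
    """Hours of downtime permitted per period at the given availability."""
    unavailability = 1 - availability_percent / 100
    return {p: h * unavailability for p, h in HOURS_PER_PERIOD.items()}

# Example: "three nines"
print(downtime_allowed_hours(99.9))
# roughly {'year': 8.76, 'month': 0.72, 'week': 0.168}
# i.e. 8.76 hours, 43.2 minutes, and 10.08 minutes respectively
```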
Uptime and availability are not synonymous: a system can be up but not available, as in the case of a network outage.
In general, the number of nines is not often used by engineers when modeling and measuring availability, because it is hard to work with in formulas. More often, unavailability is expressed as a probability (such as 0.00001), or as downtime per year. Availability specified as a number of nines is often seen in marketing documents.[citation needed]
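For readers who want to move between the two notations, the conversion is a base-10 logarithm; a minimal sketch (the helper is illustrative, not a standard function):

```python
import math

def nines(availability_percent: float) -> float:
    """Number of nines corresponding to an availability percentage."""
    unavailability = 1 - availability_percent / 100
    return -math.log10(unavailability)

print(nines(99.999))  # roughly 5.0, i.e. unavailability of 0.00001
```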
Measurement and interpretation
Clearly, how availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might nevertheless have suffered a network failure that lasted 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% "uptime." However, given the true definition of availability, the system will be approximately 99.897% available (8751 hours of available time out of 8760 hours in a non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, while administrators might have a different (and probably incorrect, certainly in the business sense) perception. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.
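The arithmetic behind the 99.897% figure can be checked directly:

```python
# 9 hours of downtime against the 8,760 hours of a non-leap year.
hours_per_year = 365 * 24              # 8760
available_hours = hours_per_year - 9   # 8751
print(available_hours / hours_per_year * 100)  # roughly 99.897
```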
Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. In the absence of instrumentation, systems supporting high-volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems that experience periodic lulls in demand.
Closely related concepts
Recovery time is closely related to availability: it is the total time required for a planned outage, or the time required to fully recover from an unplanned outage. Recovery time can be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster-recovery data center.
Another related concept is data availability: the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.
A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.
System design for high availability
Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. The most highly available systems hew to a simple design pattern: a single, high-quality, multi-purpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second, identical system at a separate physical location. This classic design pattern is common among financial institutions, for example.[citation needed]

The communications and computing industry has established the Service Availability Forum to foster the creation of high-availability network infrastructure products, systems, and services. The same basic design principle applies beyond computing in such diverse fields as nuclear power, aeronautics, and medical care.
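A common first-order way to see why pairing a system with an identical one at a second site helps is to model the two as failing independently, so the pair is unavailable only when both systems are down at once. The independence assumption is this sketch's own, not a claim from the article, and it is exactly what common-mode failures (shared power, shared software defects) violate:

```python
def paired_availability(single: float) -> float:
    """Availability of two independent, redundant systems (as a fraction).

    Assumes independent failures; real deployments must also guard against
    common-mode failures, which this simple model ignores.
    """
    return 1 - (1 - single) ** 2

print(paired_availability(0.999))  # 0.999999: two "three nines" systems
                                   # approach "six nines" under this model
```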