El Reg Article – Fault Tolerance

This is the fifth contribution I was asked to make, as a “reader expert,” to an article on The Register. We were asked to submit a response to the question “Server virtualisation has a number of benefits when it comes to fault tolerance but it also suffers from the ‘eggs-in-one-basket’ syndrome should a server go down. How can fault tolerance be built into the virtualised environment such that availability can be ensured?”

The original article can be found here.

Fault tolerance and high availability were once distinct concepts, but they have since blurred into a single goal. Two decades ago high availability was a pipe dream for even the largest corporations. Basic fault tolerance was itself something that required massive planning. Resumption of service or data availability was limited by the speed at which a system could be rebuilt or a backup restored.

As the ability to back up systems and data became commonplace, the pressing demand became the ability to recover from failure quickly. RAID allowed systems (with varying degrees of success) to survive disk failures without service interruption. ECC and chipkill were born to help us survive RAM glitches. Technologies to make networks robust proliferated enough to require their own article. Eventually some bright spark lit upon the idea of SANs, rendering entire servers nothing but disposable nodes and finally turning the entire network into one big computer.

Virtualisation, when handled properly, can lead to some pretty amazing uptimes. Let us mentally design a very small and basic high availability system using VMs. Take two SANs, each kept in sync with the other and visible to the rest of the network as “a single SAN.” The SANs themselves are attached to the network via redundant network interfaces to redundant switches. Three server nodes (A, B and C) are also attached to these switches with their own redundant interfaces. If SAN A has an accident, SAN B takes up the slack. If a switch goes down, or the boss trips over a network cable, the network deals with it.
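
To make the “single SAN” idea concrete, here is a deliberately tiny Python sketch of the abstraction the mirrored storage presents to everything above it: every write lands on both mirrors, and a read is served by whichever mirror is still answering. The class and method names are invented purely for illustration; real SANs do this in controllers and drivers, not in application code.

    class MirrorUnavailable(Exception):
        """Raised when neither side of the mirror can service a request."""

    class SAN:
        """One physical SAN: a block store plus an up/down flag."""
        def __init__(self):
            self.blocks = {}
            self.alive = True

    class MirroredSAN:
        """Toy model of two SANs presented to the rest of the network as one.

        Every write goes to both mirrors so they stay in sync; a read is
        served by whichever mirror is still answering.
        """
        def __init__(self, san_a, san_b):
            self.mirrors = [san_a, san_b]

        def write(self, block, data):
            live = [san for san in self.mirrors if san.alive]
            if not live:
                raise MirrorUnavailable("both mirrors are down")
            for san in live:
                san.blocks[block] = data

        def read(self, block):
            for san in self.mirrors:
                if san.alive:
                    return san.blocks[block]
            raise MirrorUnavailable("both mirrors are down")

    # SAN A has an accident; SAN B quietly takes up the slack.
    storage = MirroredSAN(SAN(), SAN())
    storage.write(0, b"important data")
    storage.mirrors[0].alive = False
    assert storage.read(0) == b"important data"

The point is only that the failure of one mirror is invisible to the consumers of the storage.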

Populate Servers A and B with virtual machines configured in a cluster, and leave Server C as an idle spare. Go kill Server A with an axe. The virtual machine on Server B senses the loss of its clustered partner, and continues on with its duties. Remap the virtual machine that used to be on Server A over to Server C and bring it online. The cluster members sense each other, sync up, and are fault tolerant once more.
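
The failover decision in that walkthrough can be sketched in a few lines. What follows is only an illustration, assuming a heartbeat-style liveness check; the node structures and the restart_vm_on helper are hypothetical, and real cluster managers (Heartbeat/Pacemaker, vendor HA products and the like) add fencing and split-brain protection that this deliberately ignores.

    import time

    HEARTBEAT_INTERVAL = 5      # seconds between liveness checks
    MISSED_BEATS_ALLOWED = 3    # misses tolerated before a node is declared dead

    def node_is_alive(node):
        """Placeholder liveness probe. In practice this is a heartbeat sent
        over the redundant network paths, not a single ping."""
        return node.get("alive", True)

    def restart_vm_on(vm, node):
        """Hypothetical helper: attach the VM's storage (already on the shared
        SAN) to the spare node and power the guest on there."""
        print("bringing %s online on %s" % (vm, node["name"]))

    def monitor(active_nodes, spare_node):
        """Watch the active nodes; shift the VMs of any dead node to the spare."""
        missed = {node["name"]: 0 for node in active_nodes}
        while True:
            for node in active_nodes:
                if node_is_alive(node):
                    missed[node["name"]] = 0
                    continue
                missed[node["name"]] += 1
                if missed[node["name"]] >= MISSED_BEATS_ALLOWED:
                    # Server A has met the axe: remap its VMs to Server C.
                    for vm in node["vms"]:
                        restart_vm_on(vm, spare_node)
                    node["vms"] = []
            time.sleep(HEARTBEAT_INTERVAL)

The remap is cheap precisely because the VM's disks live on the shared SAN: the spare node only has to attach the same storage and power the guest back on.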

This basic example can be scaled as appropriate, and is something that hardware vendors are eagerly waiting to sell you right now. You can add even more layers of redundancy, and there are entire symposiums dedicated to exactly that. Add neat virtualisation tricks like the ability to “VMotion” a live VM from one node to another, and even routine maintenance doesn’t have to affect availability.
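
VMotion itself is VMware’s proprietary live-migration feature, so as a neutral illustration of what live migration looks like from the administrator’s side, here is a sketch using the open-source libvirt Python bindings, which offer an equivalent operation. The hostnames and the guest name are placeholders.

    import libvirt

    # Connect to the source and destination hypervisors (placeholder hostnames).
    src = libvirt.open("qemu+ssh://server-a/system")
    dst = libvirt.open("qemu+ssh://server-c/system")

    dom = src.lookupByName("clustered-vm")   # placeholder guest name

    # Live-migrate the running guest. Its disks stay put on the shared SAN,
    # so only memory and device state move across the wire.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

    src.close()
    dst.close()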

Those happy thoughts aside, in the real world of high availability, cost rises asymptotically as you approach perfect uptime: the closer you try to get, the more resources you will be forced to expend. Though we now have tools that can bring you within a fraction of a percent of perfect uptime, not even a company with Google’s resources has yet achieved it.
