Failure Rates in Google Data Centers

Google Fellow Jeff Dean offered an overview of Google's data center operations at this week's Google I/O conference for developers. News.com has a summary of Dean's presentation, most of which revisited general information that is already public and is summarized in our Google Data Center FAQ. But there were a couple of areas where Dean offered data that provided additional insight into Google's operations. In particular, he discussed failure rates within the clusters of 1,800 servers that Google uses as the building block for its infrastructure:

In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.

A 50 percent chance that the cluster will overheat? This suggests that Google's approach, which packs 40 servers into each rack, is running pretty close to the edge in terms of thermal management. Or perhaps it suggests that Google has trouble anticipating when an area of its data center may develop cooling challenges.

Stuff fails in data centers, and always has. The art of data center uptime is in architecting systems to survive these failures. Richard Sawyer of EYP Mission Critical Facilities (now part of HP) has some interesting presentations on mean time between failures (MTBF) for data center equipment and how it can help anticipate failure.
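As a rough illustration of how MTBF arithmetic works, the sketch below plugs in the figures Dean quoted (1,800 servers per cluster, about 1,000 machine failures in the first year). It assumes failures are spread evenly over the year and that failed machines are repaired and returned to service, which is a simplification; the function names are our own and do not come from Sawyer's presentations or from Google.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def implied_mtbf_hours(machines: int, failures_per_year: float) -> float:
    """Per-machine MTBF implied by a fleet-wide annual failure count.

    Assumes a constant failure rate and that failed machines are repaired
    and put back in service (a simplifying assumption).
    """
    return machines * HOURS_PER_YEAR / failures_per_year


def hours_between_fleet_failures(failures_per_year: float) -> float:
    """Average gap between failures anywhere in the fleet."""
    return HOURS_PER_YEAR / failures_per_year


def expected_failures(machines: int, mtbf_hours: float, window_hours: float) -> float:
    """Expected number of machine failures in a time window."""
    return machines * window_hours / mtbf_hours


# Figures from Dean's talk: 1,800 servers, roughly 1,000 failures in year one.
mtbf = implied_mtbf_hours(1800, 1000)
gap = hours_between_fleet_failures(1000)
per_day = expected_failures(1800, mtbf, 24)

print(f"Implied per-machine MTBF: {mtbf:,.0f} hours")
print(f"Average time between failures in the cluster: {gap:.2f} hours")
print(f"Expected machine failures per day: {per_day:.2f}")
```

Under those assumptions, each machine fails about once every 15,768 hours (roughly 1.8 years), yet the cluster as a whole sees a machine failure about every 8.76 hours, or between two and three per day.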

Google's data centers operate on a large enough scale that it can shift redundancy from hardware to software. There's so much hardware that it can use software to identify failures and route around them, even at the cluster level.
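To make the idea of software-level redundancy concrete, here is a minimal sketch of a client that tries replicas in turn and skips any that are unreachable. This is not Google's code; the function, exception, and host names are hypothetical, and a real system would add health checks, timeouts, and load balancing.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica in the list was unreachable."""


def call_with_failover(replicas: Iterable[str], request: Callable[[str], T]) -> T:
    """Send the request to each replica in order until one succeeds."""
    errors: List[Tuple[str, Exception]] = []
    for replica in replicas:
        try:
            return request(replica)
        except ConnectionError as exc:
            # Treat the machine as failed and route around it.
            errors.append((replica, exc))
    raise AllReplicasFailed(errors)


# Hypothetical stand-in for a network call; two of three hosts are down.
DOWN = {"rack1-host7", "rack2-host3"}


def fake_fetch(replica: str) -> str:
    if replica in DOWN:
        raise ConnectionError(f"{replica} unreachable")
    return f"served by {replica}"


result = call_with_failover(["rack1-host7", "rack2-host3", "rack3-host1"], fake_fetch)
print(result)
```

The point of the sketch is that the caller never needs to know which machines failed; as long as one replica on a healthy rack answers, the request succeeds.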

Dean also talked about Google's use of multi-core machines and parallel processing. The News.com article includes an image of a Google rack with one of its custom enclosures, but the image is not new; it was published online during last year's developer conference. For more, check out Stephen Shankland's story at News.com.