Posted by Rich Miller on May 30, 2008 @ 7:48 am
In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.

A 50 percent chance that the cluster will overheat? This suggests that Google’s approach, which packs 40 servers into each rack, is running pretty close to the edge in terms of thermal management. Or perhaps that Google has trouble anticipating when an area of its data center may develop cooling challenges.
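To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers Dean cites; the function and variable names are made up for illustration. It also assumes the overheating risk can be treated as a single event with probability 0.5, which is a simplification and not something the source states.

```python
# Figures quoted from Dean for a cluster's first year; ranges are (low, high).
RACK_FAILURES = 20
MACHINES_PER_RACK_FAILURE = (40, 80)
PDU_FAILURES = 1
MACHINES_PER_PDU_FAILURE = (500, 1000)
PDU_OUTAGE_HOURS = 6             # "about 6 hours"
OVERHEAT_PROBABILITY = 0.5       # "about a 50 percent chance"
OVERHEAT_RECOVERY_DAYS = (1, 2)


def scale(rng, factor):
    """Multiply both ends of a (low, high) range by a factor."""
    low, high = rng
    return (low * factor, high * factor)


def first_year_summary():
    """Derive aggregate first-year figures from the quoted numbers."""
    return {
        # Total times a machine vanishes from the network due to rack failures.
        "rack_machine_outages": scale(MACHINES_PER_RACK_FAILURE, RACK_FAILURES),
        # Machine-hours lost to the single power distribution unit failure.
        "pdu_machine_hours": scale(
            MACHINES_PER_PDU_FAILURE, PDU_FAILURES * PDU_OUTAGE_HOURS
        ),
        # Expected recovery days from overheating (assumption: at most one
        # event per year, occurring with the quoted probability).
        "expected_overheat_recovery_days": scale(
            OVERHEAT_RECOVERY_DAYS, OVERHEAT_PROBABILITY
        ),
    }


if __name__ == "__main__":
    for name, (low, high) in first_year_summary().items():
        print(f"{name}: {low} to {high}")
```

By this rough count, rack failures alone account for 800 to 1,600 machine disappearances in the first year, and the single power distribution unit failure costs 3,000 to 6,000 machine-hours.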
Article printed from Data Center Knowledge: http://www.datacenterknowledge.com
URL to article: http://www.datacenterknowledge.com/archives/2008/05/30/failure-rates-in-google-data-centers/
Copyright © 2011 Data Center Knowledge. All rights reserved.