Study: Server Failures Don’t Rise Along With the Heat
Servers don’t sweat the heat as much as you might think. That’s the takeaway from a new study by researchers at the University of Toronto, who analyzed data on equipment failures at data centers operated by Google, Los Alamos National Labs, and Canada’s SciNet HPC consortium. The study provides some of the most comprehensive real-world data to date on the impact and implications of raising the temperature in data centers.
The study, Temperature Management in Data Centers: Why Some (Might) Like it Hot (link via James Hamilton), looks at several types of equipment problems in a variety of environmental conditions in these high-performance data centers.
“Based on our study of data spanning more than a dozen data centers at three different organizations, and covering a broad range of reliability issues, we find that the effect of high data center temperatures on system reliability are smaller than often assumed,” the authors write. “For some of the reliability issues we study, namely DRAM failures and node outages, we do not find any evidence for a correlation with higher temperatures (within the range of temperatures in our datasets).
“For those error conditions that show a correlation (latent sector errors in disks and disk failures), the correlation is much weaker than expected,” they continue. “For (device internal) temperatures below 50C, errors tend to grow linearly with temperature, rather than exponentially, as existing models suggest. … We see our results as strong evidence that most organizations could run their data centers hotter than they currently are without making significant sacrifices in system reliability.”
The findings have implications for data center operators who’d like to save energy by cutting back on cooling, and could broaden the use of free cooling (the use of fresh air instead of air conditioners to cool servers). Most data centers operate in a temperature range between 68 and 72 degrees Fahrenheit, and some are as cold as 55 degrees. Raising the baseline temperature inside the data center, known as the set point, can save money by reducing the amount of energy used for air conditioning. It’s been estimated that data center managers can save 4 percent in energy costs for every degree of upward change in the set point.
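As a rough illustration of that rule of thumb, the sketch below estimates savings for a given set point increase. The 4 percent figure is the estimate quoted above; the estimate does not say whether savings add up linearly or compound with each degree, so both readings are shown as assumptions. The function names and the example energy bill are hypothetical.

```python
SAVINGS_PER_DEGREE = 0.04  # rule-of-thumb estimate quoted in the article


def linear_savings(baseline_cost, degrees_raised, rate=SAVINGS_PER_DEGREE):
    """Savings if each degree removes a fixed share of the original cost.

    The share is capped at 100 percent so large increases cannot
    produce savings larger than the bill itself.
    """
    fraction = min(rate * degrees_raised, 1.0)
    return baseline_cost * fraction


def compounded_savings(baseline_cost, degrees_raised, rate=SAVINGS_PER_DEGREE):
    """Savings if each degree removes a share of the remaining cost."""
    return baseline_cost * (1.0 - (1.0 - rate) ** degrees_raised)


if __name__ == "__main__":
    cost = 100_000.0  # hypothetical annual energy bill, in dollars
    for degrees in (1, 5, 10):
        print(
            degrees,
            round(linear_savings(cost, degrees)),
            round(compounded_savings(cost, degrees)),
        )
```

The two readings agree for a single degree and drift apart as the increase grows, which is one reason a rule of thumb like this is better treated as a starting estimate than as a forecast.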
As a result, large Internet companies like Google, Microsoft and Intel have been aggressive in operating their data centers at temperatures above 80 degrees. Despite those potential gains, user surveys show few enterprise data centers are following suit.
Strong Management Key to Capturing Benefits
There are several reasons for this caution. Nudging the thermostat higher is only appropriate for companies with a strong understanding of the cooling conditions in their facility. Warmer set points may allow less time to recover from a cooling failure.
The other major issue, which was reinforced by the Toronto study, is the challenge of managing fan activity. Server fans tend to kick on as the temperature rises, nullifying gains from turning down the cooling. Microsoft, Facebook and Yahoo have each tackled this problem, in some cases by tweaking algorithms that manage fan activity, a step that might be easier for large organizations.
The Toronto study said that temperature fluctuation may matter more than heat itself as a cause of hardware failures. “Even failure conditions, such as node outages, that did not show a correlation with temperature, did show a clear correlation with the variability in temperature,” the authors wrote. “Efforts in controlling such factors might be more important in keeping hardware failure rates low, than keeping temperatures low.”
James Hamilton, a researcher at Amazon Web Services, says the new data is valuable in updating the industry’s understanding of the relationship between temperature and hardware.
“An often quoted study reports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK 217F),” Hamilton writes on his Perspectives blog. “This data point is incredibly widely used by the military, NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large data center environment, and the rule of thumb that has emerged from is a linear model of failure to heat.”
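To make the contrast concrete, the sketch below compares the doubling-per-10C rule of thumb Hamilton cites with a linear trend of the kind the Toronto authors report for device temperatures below 50C. The linear slope is a made-up illustrative value, not a number from the study, and both function names are hypothetical.

```python
def doubling_rule_multiplier(delta_c, doubling_step_c=10.0):
    """Relative failure rate if failures double every `doubling_step_c` degrees C."""
    return 2.0 ** (delta_c / doubling_step_c)


def linear_multiplier(delta_c, slope_per_c=0.05):
    """Relative failure rate under a linear trend.

    slope_per_c is an illustrative assumption, not a measured value.
    """
    return 1.0 + slope_per_c * delta_c


if __name__ == "__main__":
    for delta in (0, 10, 20, 30):
        print(
            f"+{delta}C: doubling rule x{doubling_rule_multiplier(delta):.1f}, "
            f"linear x{linear_multiplier(delta):.1f}"
        )
```

Under the doubling rule a 30C rise multiplies the failure rate eightfold, while any linear trend grows by the same fixed amount per degree, so the gap between the two widens quickly as temperatures climb.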
For more, check out Temperature Management in Data Centers: Why Some (Might) Like it Hot.
There is such a thing as too cold as well. I learned this a couple of years ago when my SAN, which had Seagate SATA hard disks in it (around 300 or so), started reporting that some disks were performing poorly. After a lengthy investigation and testing, the vendor determined that once the disks dropped below an ambient temperature of 20C, the drive heads slowed down to protect the disk (the specs of the drive showed an operating temperature range that included temps much lower than 20C). The colder it got, the slower the heads moved. This behavior was intentional in the firmware of the disks. I assume other disks are similar. So the drives did operate below 20C, they just operated very poorly.
As for operating too hot: for a while now most equipment has been rated for higher operating temperatures. 104 degrees F is a common stat for ambient operating temperature, though few of course run things that hot. So in those devices equipment failures should not occur due to heat up to those levels. If they do, well, the manufacturer didn’t do good enough testing and/or misinformed customers about the abilities of the system.
Server add-on cards seem to be more problematic for heat vs the integrated stuff for some reason.
AJ Posted May 30th, 2012
I don’t know about having cooling at 80 degrees. Your systems would then be running at full fan speed and running hot, thus requiring more power.