Study: Server Failures Don't Rise Along With the Heat

Servers don't sweat the heat as much as you might think. That's the takeaway from a new study by researchers at the University of Toronto, who analyzed data on equipment failures at data centers operated by Google, Los Alamos National Labs, and Canada's SciNet HPC consortium. The study provides some of the most comprehensive real-world data to date on the impact and implications of raising the temperature in data centers.

The study, Temperature Management in Data Centers: Why Some (Might) Like it Hot (link via James Hamilton), looks at several types of equipment problems in a variety of environmental conditions in these high-performance data centers.

"Based on our study of data spanning more than a dozen data centers at three different organizations, and covering a broad range of reliability issues, we find that the effect of high data center temperatures on system reliability are smaller than often assumed," the authors write. "For some of the reliability issues we study, namely DRAM failures and node outages, we do not find any evidence for a correlation with higher temperatures (within the range of temperatures in our datasets).

"For those error conditions that show a correlation (latent sector errors in disks and disk failures), the correlation is much weaker than expected," they continue. "For (device internal) temperatures below 50C, errors tend to grow linearly with temperature, rather than exponentially, as existing models suggest. ... We see our results as strong evidence that most organizations could run their data centers hotter than they currently are without making significant sacrifices in system reliability."

The findings have implications for data center operators who want to save energy by reducing the amount of cooling they use, and could broaden the use of free cooling (the use of fresh air instead of air conditioners to cool servers). Most data centers operate in a temperature range between 68 and 72 degrees Fahrenheit, and some are as cold as 55 degrees. Raising the baseline temperature inside the data center, known as the set point, can save money by reducing the amount of energy used for air conditioning. It's been estimated that data center managers can save 4 percent in energy costs for every degree of upward change in the set point.
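To put the 4 percent per degree estimate in concrete terms, here is a minimal Python sketch. The function name and the `compounding` option are illustrative assumptions; the estimate as quoted does not say whether the savings add up or compound from one degree to the next, so the sketch shows both readings.

```python
def estimated_savings(degrees_raised, rate_per_degree=0.04, compounding=False):
    """Estimate the fraction of energy cost saved by raising the set point.

    rate_per_degree is the 4 percent rule of thumb cited in the article.
    Whether the rule adds or compounds is an assumption, so both are offered.
    """
    if degrees_raised < 0:
        raise ValueError("degrees_raised must be non-negative")
    if compounding:
        # Each degree saves 4 percent of whatever cost remains.
        return 1.0 - (1.0 - rate_per_degree) ** degrees_raised
    # Simple additive reading, capped so savings never exceed 100 percent.
    return min(1.0, rate_per_degree * degrees_raised)


if __name__ == "__main__":
    # Hypothetical example: moving the set point from 72 to 80 degrees Fahrenheit.
    raised = 80 - 72
    print(f"Additive estimate:    {estimated_savings(raised):.1%}")
    print(f"Compounding estimate: {estimated_savings(raised, compounding=True):.1%}")
```

Either way, this is a rule of thumb, not a forecast. Actual savings depend on the facility, and, as discussed below, rising server fan activity can offset part of the gain.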

As a result, large Internet companies like Google, Microsoft and Intel have been aggressive in operating their data centers at temperatures above 80 degrees. Despite those potential gains, user surveys show few enterprise data centers are following suit.

Strong Management Key to Capturing Benefits

There are several reasons for this caution. Nudging the thermostat higher is only appropriate for companies with a strong understanding of the cooling conditions in their facility. Warmer set points may allow less time to recover from a cooling failure.

The other major issue, which was reinforced by the Toronto study, is the challenge of managing fan activity. Server fans tend to kick on as the temperature rises, nullifying gains from turning down the cooling. Microsoft, Facebook and Yahoo have each tackled this problem, in some cases by tweaking algorithms that manage fan activity, a step that might be easier for large organizations.

The Toronto study said that temperature fluctuation may matter more than heat itself in driving hardware failures. "Even failure conditions, such as node outages, that did not show a correlation with temperature, did show a clear correlation with the variability in temperature," the authors wrote. "Efforts in controlling such factors might be more important in keeping hardware failure rates low, than keeping temperatures low."

James Hamilton, a researcher at Amazon Web Services, says the new data is valuable in updating the industry's understanding of the relationship between temperature and hardware.

"An often quoted study reports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK 217F)," Hamilton writes on his Perspectives blog. "This data point is incredibly widely used by the military, NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large data center environment, and the rule of thumb that has emerged from is a linear model of failure to heat."
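The difference between the doubling rule and the linear growth the Toronto researchers describe is easiest to see with numbers. The sketch below is illustrative only: the function names are hypothetical, and the linear slope of 0.1 per degree is an assumption chosen so that both models agree at a 10C increase. It is not a figure from the study.

```python
def doubling_rule_factor(delta_c, doubling_interval_c=10.0):
    """Failure-rate multiplier if failures double with every 10C increase."""
    return 2.0 ** (delta_c / doubling_interval_c)


def linear_factor(delta_c, slope_per_c=0.1):
    """Failure-rate multiplier if failures grow linearly with temperature.

    slope_per_c is a hypothetical value, picked only so that this model
    matches the doubling rule at a 10C increase.
    """
    return 1.0 + slope_per_c * delta_c


if __name__ == "__main__":
    # Compare the two models over a range of temperature increases.
    for delta in (0, 5, 10, 20, 30):
        print(
            f"+{delta:>2}C  doubling rule: {doubling_rule_factor(delta):5.2f}x"
            f"  linear: {linear_factor(delta):5.2f}x"
        )
```

Under these assumptions the two models match at 10C and diverge beyond it, with the doubling rule predicting much faster growth in failures at larger temperature increases.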

For more, check out Temperature Management in Data Centers: Why Some (Might) Like it Hot.
