Maintenance, Downtime and Outages
Is the pressure to be “always on” leaving data center power systems more vulnerable to failure? And are historic tensions between IT and facilities contributing to the problem? Ken Brill of The Uptime Institute raises these issues this week in his Forbes column, Avoiding Data Center Disasters. Brill suggests that resistance to off-hours maintenance downtime increases the likelihood of a prime-time outage.
“The No. 1 reason for catastrophic facility failure is lack of electrical maintenance,” Brill writes. “Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.”
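As a rough illustration of the intervals Brill cites (a hot-spot check every year, physical tightening at least every three years), a site could track overdue work with something like the sketch below. The function names and the example dates are hypothetical; only the two intervals come from the quote.

```python
from datetime import date

# Intervals taken from Brill's quote; everything else here is hypothetical.
SCAN_INTERVAL_YEARS = 1      # check connections for hot spots annually
TIGHTEN_INTERVAL_YEARS = 3   # physically tighten at least every three years


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def overdue_tasks(last_scan, last_tighten, today):
    """List the maintenance tasks whose interval has already elapsed."""
    tasks = []
    if today > add_years(last_scan, SCAN_INTERVAL_YEARS):
        tasks.append("hot-spot scan")
    if today > add_years(last_tighten, TIGHTEN_INTERVAL_YEARS):
        tasks.append("tighten connections")
    return tasks


# Hypothetical site: last scanned January 2008, last tightened June 2007.
print(overdue_tasks(date(2008, 1, 15), date(2007, 6, 1), date(2009, 7, 30)))
```

With these made-up dates the scan is overdue while the tightening is not yet due.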
Electrical systems featured prominently in the unusual series of outages in early July at major data centers and carrier hotels. While it’s hard to know whether maintenance schedules contributed to these incidents, the heightened awareness of the risks of downtime can be a moment of opportunity as well as accountability.
It’s a good time to be mindful of the advice from Richard Sawyer of HP, who has made a number of conference presentations on Failure As A Learning Experience. While the period after an outage can be stressful, the data center suddenly has the attention of key decision-makers. “Did you ever notice that in the data center world there is no money for anything, but the day after a failure, you have all the money in the world?” Sawyer said. “Let everybody know what happened, why it happened, and what you’re doing about it.”
Jeff Posted July 30th, 2009
It’s ultimately an issue of understanding the levels of redundancy available, and then justifying the cost. So many colo spaces (not to name names) provide little more than an “uptime guarantee” on paper without actually using the numerous standards available for designing a proper redundancy path for power and communication. Customers should know to ask things like “Do you provide dual bus power at the rack? Is it fed from independent switchgear? Independent generators? Independent utility feeds and transformers?” The answers would instantly clarify a colo provider’s ability to deliver uptime in the case of a mechanical failure. Too many companies, in an effort to provide the lowest cost (or highest profit), simply ignore redundancy, but Murphy’s law catches up with them sooner or later. If the colo space you are in is not offering full redundancy, a failure WILL happen and it WILL take down your gear for a non-negligible amount of time.
Even with good maintenance, *things go wrong*. It should come as no surprise that companies with no redundancy plan WILL have outages, painful outages, and all too often.
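To put rough numbers on why independent paths matter, here is a back-of-the-envelope sketch. The 99.9% figure is made up for illustration, and the math assumes the two paths fail independently, which is exactly what the questions about separate switchgear, generators and utility feeds are trying to establish.

```python
HOURS_PER_YEAR = 8760


def path_availability(component_availabilities):
    """Availability of one power path: every component in series must be up."""
    availability = 1.0
    for a in component_availabilities:
        availability *= a
    return availability


def redundant_availability(path_availabilities):
    """Availability of parallel paths, ASSUMING they fail independently:
    power is lost only if every path is down at the same time."""
    p_all_down = 1.0
    for a in path_availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def expected_downtime_hours(availability):
    """Average hours of downtime per year implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figure: each complete path is up 99.9% of the time.
single = 0.999
dual = redundant_availability([single, single])

print(f"single path: {expected_downtime_hours(single):.2f} h/year")
print(f"dual path:   {expected_downtime_hours(dual):.4f} h/year")
```

If both "paths" actually share a transformer or a piece of switchgear, the independence assumption breaks and the real number lands much closer to the single-path case.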
Why do we rely on such centralized power in the datacenter anyway? I think Google has a very good idea in providing a small battery for each separate machine.
Mike Posted August 3rd, 2009
The important thing is to understand your equipment: some things run better if left alone, while others DO require maintenance, but it isn’t as cut and dried as some people would have you believe. Most systems work under greater stress when they are being started up than under normal load, so too much maintenance can cause systems to fail prematurely.