Maintenance, Downtime and Outages

Is the pressure to be "always on" leaving data center power systems more vulnerable to failure? And are historic tensions between IT and facilities contributing to the problem? Ken Brill of The Uptime Institute raises these issues this week in his Forbes column, Avoiding Data Center Disasters. Brill suggests that resistance to off-hours maintenance downtime increases the likelihood of a prime-time outage.

"The No. 1 reason for catastrophic facility failure is lack of electrical maintenance," Brill writes. "Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT's need for uptime and the facility department's need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins."

Electrical systems featured prominently in the unusual series of outages in early July at major data centers and carrier hotels. While it's hard to know whether maintenance schedules contributed to these incidents, the heightened awareness of downtime risk creates a moment of opportunity as well as accountability.

It's a good time to be mindful of the advice from Richard Sawyer of HP, who has given a number of conference presentations on Failure As A Learning Experience. While the period after an outage can be stressful, the data center suddenly has the attention of key decision-makers. "Did you ever notice that in the data center world there is no money for anything, but the day after a failure, you have all the money in the world?" Sawyer said. "Let everybody know what happened, why it happened, and what you're doing about it."
