Most Data Center Outages Aren't Caused by Tech Failure

Many critical fields, including nuclear energy, commercial and military aviation, and even drivers' education, invest significant time and resources in developing processes. The data center industry ... not so much.

That can be problematic, considering that two-thirds of data center outages are related to processes, not infrastructure systems, says David Boston, director of facility operations solutions for TiePoint-bkm Engineering.

“Most are quite aware that processes cause most of the downtime, but few have taken the initiative to comprehensively address them. This is somewhat unique to our industry.”

Boston is scheduled to speak about strategies to prevent data center outages at the Data Center World Local conference at the Art Institute of Chicago on July 12.

He suggests that management is constantly compelled to replace aging infrastructure systems and components, or systems that have caused repeated problems, and is accustomed to adding system capacity to accommodate load growth. On the infrastructure side, mechanical problems in cooling systems generate the most failures, but electrical system failures cause far more downtime events because they leave so little time to react.

“Each of these efforts involve outside engineering support, so the time required of management is often only that of defining the project and overseeing it.”

While developing processes associated with the most common causes of data center outages may be more time-consuming for management, it’s time well spent. Here are the top three offenses and best practices that Boston recommends following:

Failure to match a facility’s staff size and shift coverage with objectives for critical operations uptime.

Best practice: Quantify uptime objectives with senior IT management and ensure staffing matches them. If maximum uptime is desired, Boston suggests keeping two individuals on every shift, with additional personnel responsible for the training and procedures programs. Use single-shift coverage only if an occasional downtime event is acceptable.

No site-specific training program, including dedicated practice time before the facility begins operation.

Best practice: Assign a single team member as the training program owner, with time to coordinate monthly emergency response training for all team members. Rotate each team member through hands-on practice in isolating an infrastructure system before a maintenance activity and restoring it to service afterward, as such activities come up on the preventive maintenance calendar.

Inadequate site-specific procedures.

Best practice: Assign a single team member as the procedures program owner, with time to develop (or work with a consultant to develop) the 100 to 200 critical procedures needed for virtually every critical facility. Have each one confirmed for technical accuracy, and verify that all of them are clearly understood by the least knowledgeable person on the team.

“I have long suspected that there is a reluctance to devote the initial time required to implement the programs described above,” comments Boston.

These processes should absolutely be implemented with respect to critical operations—those that would negatively impact an organization’s revenue or credibility if they fail. However, for non-critical operations, he suggests focusing on methods for quick restoration.

