Data Center Failure As A Learning Experience

Acknowledging the likelihood of failure is important in developing effective mitigation strategies.

Richard Sawyer's message to a roomful of data center professionals, whose jobs focus on 100 percent uptime, may have seemed slightly heretical. "You have to plan for failure," Sawyer told attendees at AFCOM's Data Center World in September. "Failure is predictable. Failure is manageable."

Sawyer, a vice president at EYP Mission Critical Facilities, was making a point: data center professionals don't spend enough time thinking about the unthinkable. Being honest about the likelihood of failure is important in developing effective mitigation strategies, according to Sawyer. With the right approach, he said, you can "fail small" and limit the liability of any failure. Planning is also critical to diagnosing and recovering from failures.

Data center failure has been in the headlines this year, from the power outages at 365 Main and ServerBeach to data center migration snafus at ValueWeb and Alabanza. They're not alone. A 2006 survey of AFCOM members found that 81 percent of respondents had experienced a failure in the past five years, and 20 percent had been hit with at least five failures.

Sawyer noted that data center equipment failures are typically concentrated in two periods: shortly after the equipment is put in service and as it approaches the end of its life. The probability of failure is therefore higher in these "wear-in" and "wear-out" phases. Strategies to address these tendencies include comprehensive testing before a piece of equipment is put in service, and closer monitoring as it ages. "Be aware when (equipment) failures are becoming more frequent or more likely," said Sawyer. "Keep failures localized as much as possible. Design around a failure as best you can."
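The "wear-in" and "wear-out" pattern is often drawn as a "bathtub curve" of failure rate against equipment age. The sketch below is one rough way to picture it, not a model Sawyer presented: it adds a falling Weibull hazard (early failures), a flat baseline, and a rising Weibull hazard (aging). The function names, the choice of Weibull terms, the time unit of years, and every parameter value are assumptions made for illustration only.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at age t (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, early=(0.5, 2.0), late=(5.0, 12.0), baseline=0.01):
    """Illustrative bathtub-shaped failure rate at age t, in years.

    early:    (shape, scale) with shape < 1, so the rate falls with age (wear-in)
    late:     (shape, scale) with shape > 1, so the rate rises with age (wear-out)
    baseline: constant rate of random failures during normal service life
    All default values are hypothetical, not measured data.
    """
    return weibull_hazard(t, *early) + baseline + weibull_hazard(t, *late)


if __name__ == "__main__":
    # Print the assumed failure rate at a few ages to show the shape.
    for age in (0.1, 1, 5, 10, 15):
        print(f"age {age:>4} yr: rate {bathtub_hazard(age):.3f}")
```

With these assumed parameters the rate is high for new equipment, drops through mid-life, and climbs again late in life, which is why pre-service testing and closer monitoring of aging gear target the two ends of the curve.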

While the period after a failure can be stressful, Sawyer also says it can be a moment of opportunity, if only because the data center suddenly has the attention of key decision-makers. "Did you ever notice that in the data center world there is no money for anything, but the day after a failure, you have all the money in the world?" Sawyer said. "Let everybody know what happened, why it happened, and what you're doing about it."

The approach to an investigation will determine the usefulness of the post-mortem, Sawyer said. An organized inquiry should determine the root cause of the failure, develop an action plan to prevent its recurrence, and publish the findings across the organization. It's also essential to make the process positive and not punitive. "Keep it impersonal and keep it professional," said Sawyer. "If you do that, people will talk to you."