NATIONAL HARBOR, Md.–Being the imperfect organisms that they are, humans are often the ones to blame for their problems.
And that’s true for the biggest problem of the data center industry: downtime. It’s often cited that more than 80 percent of data center outages can be attributed to human error.
While that may be true, there’s a degree of subtlety to that estimate. “It really comes down to the definition of what human error is,” Steven Shapiro, mission critical practice lead at Morrison Hershfield, said. Morrison Hershfield is a major US engineering firm with a substantial data center practice.
The only way the 80-plus-percent estimate is true is if you take into account errors in things like system design, commissioning, and training, not just errors made during operation, Shapiro said on the sidelines of our sister company AFCOM’s Data Center World conference taking place here this week. According to his company’s numbers, actual operator error, where somebody flipped the wrong breaker or shut the wrong valve after losing utility power and brought the facility down as a result, is responsible for more like 18 percent of data center outages.
There is a way to bring the possibility of that kind of human error down, because it almost always “comes down to not following the procedure,” Shapiro said.
The main problem isn’t failure to follow procedure, however, but lack of documented procedure, which is an industry-wide problem that people who work in the data center industry are generally reluctant to talk about, he said. “If the training is there, and the procedures are there, we find a facility that has that, there’s almost no human error associated with failure.”
Most data center facilities teams today don’t have proper procedures in place, relying instead on the knowledge of staff who have a lot of experience in their specific facility. “The guy that built the facility is still there, and he feels that he knows everything that there is to know about it,” Shapiro said, offering an example. “And now there’s four guys that work for him, and he hasn’t told them everything they need to know, but he’s still around.”
Sometimes, the team knows they need procedures written down, but they don’t have the budget to do it. “It’s a well-known issue, but nobody wants to talk about it. If the funding was there, it would get done.”
The reasons funding for such projects doesn’t materialize vary. One common scenario is where the IT team controls the data center budget, and the facilities team doesn’t want to tell IT that they don’t have proper training and procedures in place. In other cases, the facilities team has the budget but always finds something more important to spend the money on.
In either scenario, documenting procedures gets put on the back burner because the team doesn’t need that documentation during day-to-day operations. They only need it when there’s a failure or during maintenance, so it’s a problem that’s easy to ignore as long as things run smoothly.