Human error is the root cause of most data center outages. It is a data center industry maxim, backed by data collected and published by the companies that study outages.
At Google data centers, however, it simply doesn’t apply. Why? Because Google data centers are operated by the one percent.
“On the infrastructure side, the industry norm is that human error still accounts for the overwhelming majority of incidents,” Joe Kava, Google’s top data center operations exec, said. “Because of our designs and our highly qualified people, only a very small fraction of our incidents were related to human error, and none of them caused downtime.”
A one-percenter at Google is different from a one-percenter in the sense Bernie Sanders or Occupy Wall Street mean it, however.
Very few Google employees are allowed to visit the company’s data centers. “In fact, less than 1 percent of all Googlers ever set foot in a data center at Google,” Kava said, while speaking at Google’s big cloud event in San Francisco Thursday.
The only way to get in is to have a specific business reason to be there. The Googlers who work in these facilities are some of the brightest and most experienced people, with diverse backgrounds in engineering and mission-critical operations, and they all share a common trait.
“They are systems thinkers,” he said. “They understand how systems interact and how they work together.”
On average, about 70 percent of data center reliability incidents are caused by human error, Kava said, citing data from the Uptime Institute, an industry organization owned by the 451 Group. Only 15.4 percent of incidents at Google data centers were caused by human error over the past two years, he said.
One of the biggest reasons Google is so far ahead of the industry average is that it doesn’t outsource data center operations.
“You see, the norm in the industry is for the design-build contractor to hand over a set of drawings and a set of owner’s manuals and the keys to the front door and wish the data center operator good luck,” Kava said. “And all too often, frankly, those operations teams, they’re not employed by the owner; they’re outsourced to the lowest-cost bidder.”
The result is that the data center’s actual user not only has no control over the quality of the professionals running the facility but also can’t be sure those people will go above and beyond when things do go wrong.
“If there’s one certainty in data center operations, it’s that problems and faults are going to happen in the middle of the night, typically on Sundays, when no one else is available to help out,” he said.
Googlers responsible for data center operations work side by side with Googlers who design and build the facilities. There is a constant feedback loop between these teams, and every data center that gets built is better than the previous one.
“This gives us an unparalleled level of ownership, end-to-end, of our infrastructure,” Kava said.