PHOENIX – Is “preventive maintenance” not really that preventive after all? In the data center, where human error is a leading cause of downtime, a vigorous maintenance schedule can actually make a facility less reliable, according to several speakers at this week’s 7×24 Exchange Fall Conference.
“There’s this mantra that more maintenance equals more reliability,” said Steve Fairfax, the President of MTechnology. “We get the perception that lots of testing improves component reliability. It does not. The most common threat to reliability is excessive maintenance.”
Fairfax, whose firm has conducted in-depth analyses of failure rates in data centers, says too much maintenance can be disruptive to optimal configurations for reliable operations.
“The purpose (of maintenance) should be to find defects and remove them,” said Fairfax. “But maintenance can introduce new defects. And whenever a piece of equipment is undergoing maintenance, your data center is less reliable.”
One problematic scenario involves poorly documented maintenance programs interacting with automated systems. An example: the Three Mile Island nuclear meltdown in 1979, in which a secondary feed water pump was disabled by manual valves closed during preventive maintenance.
The challenge? Other voices emphasize the value of comprehensive maintenance programs.
“Maintenance is a very lucrative business,” said Fairfax, who said guidance from equipment vendors sometimes slips into FUD (fear, uncertainty and doubt) rather than sound methodology. “They want to keep selling their maintenance plans. To overcome this preventive maintenance threat, we must attack false learning. More is not always better.”
Uptime Institute Executive Director Pitt Turner said he was in “violent agreement” with Fairfax on the risks of excessive maintenance. “Implement OEM maintenance routines at your own risk,” said Turner. “Think about what PMs (preventive maintenance programs) are appropriate. Do them and do them well using best practices to lower your risk.”
Fairfax differentiated between failures affecting individual components and those impacting systems. “People respond to component failures, even if a system was not threatened,” he said, adding that “if you buy a 2N data center, you’ll have twice as many component failures as a 1N data center. But you’ll be more reliable.”
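The arithmetic behind the 2N remark can be sketched in a few lines of Python. This is a simplified illustration and not MTechnology's model: the 5 percent annual failure probability is a made-up figure, each power path is treated as a single component, and the paths are assumed to fail independently. That last assumption ignores common-cause failures and maintenance-induced errors, which are exactly the risks Fairfax warns about.

```python
def expected_component_failures(p: float, paths: int) -> float:
    """Expected number of component failures per year across all paths."""
    return paths * p


def system_failure_probability(p: float, paths: int) -> float:
    """Probability that every redundant path fails in the same year.

    Assumes the paths fail independently (a hypothetical simplification).
    """
    return p ** paths


# Hypothetical annual failure probability for one path; not a measured value.
P_FAIL = 0.05

for name, paths in (("1N", 1), ("2N", 2)):
    components = expected_component_failures(P_FAIL, paths)
    system = system_failure_probability(P_FAIL, paths)
    print(f"{name}: expected component failures = {components:.2f}, "
          f"system failure probability = {system:.4f}")
```

Under these assumptions the 2N design sees twice as many component failures, yet the chance of losing the whole system drops from p to p squared.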
MTechnology specializes in quantitative risk assessment for mission-critical industries, with a focus on failures in complex systems.
“This is not me stroking my beard and telling you about how much experience I have,” said Fairfax. “This is about math and science. When human judgment and experience run up against probabilities and math, math wins. Winners are rare in Las Vegas. But (casinos) make a lot of money in Las Vegas.”
He also praised the 7×24 Exchange as a forum where discussions of complex reliability challenges are welcomed, even if they challenge popular industry beliefs and practices. “This is a very unique forum,” said Fairfax. “I don’t know very many places where I’d be able to say things like this.”