Is Maintenance Making Your Facility Less Reliable?

Steve Fairfax, the president of MTechnology, during his keynote presentation Tuesday at the 7×24 Exchange Fall Conference in Phoenix.

PHOENIX – Is “preventive maintenance” not really that preventive after all? In the data center, where human error is a leading cause of downtime, a vigorous maintenance schedule can actually make a facility less reliable, according to several speakers at this week’s 7×24 Exchange Fall Conference.

“There’s this mantra that more maintenance equals more reliability,” said Steve Fairfax, the President of MTechnology. “We get the perception that lots of testing improves component reliability. It does not. The most common threat to reliability is excessive maintenance.”

Fairfax, whose firm has conducted in-depth analyses of failure rates in data centers, says too much maintenance can disrupt configurations that have been optimized for reliable operation.

“The purpose (of maintenance) should be to find defects and remove them,” said Fairfax. “But maintenance can introduce new defects. And whenever a piece of equipment is undergoing maintenance, your data center is less reliable.”

One problematic scenario involves poorly documented maintenance programs interacting with automated systems. An example: the Three Mile Island nuclear meltdown in 1979, in which a secondary feed water pump was disabled by manual valves closed during preventive maintenance.

The challenge? There are other voices emphasizing the value of comprehensive maintenance programs.

“Maintenance is a very lucrative business,” said Fairfax, who said guidance from equipment vendors sometimes slips into FUD (fear, uncertainty and doubt) rather than sound methodology. “They want to keep selling their maintenance plans. To overcome this preventive maintenance threat, we must attack false learning. More is not always better.”

Uptime Institute Executive Director Pitt Turner said he was in “violent agreement” with Fairfax on the risks of excessive maintenance. “Implement OEM maintenance routines at your own risk,” said Turner. “Think about what PMs (preventive maintenance programs) are appropriate. Do them and do them well using best practices to lower your risk.”

Fairfax differentiated between failures affecting individual components and those impacting systems. “People respond to component failures, even if a system was not threatened,” he said, adding that “if you buy a 2N data center, you’ll have twice as many component failures as a 1N data center. But you’ll be more reliable.”
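The arithmetic behind the 2N remark can be sketched in a few lines of Python. This is an illustration only, not MTechnology's method: it treats a 1N facility as a single component and a 2N facility as two identical, independent copies, assumes a constant failure rate with no repair during the period and no common-cause failures, and uses a made-up failure rate of 0.1 per year. The function names are hypothetical.

```python
from math import exp


def component_failure_probability(failure_rate_per_year: float, years: float = 1.0) -> float:
    """Chance that one component fails within `years`, assuming a constant failure rate."""
    return 1.0 - exp(-failure_rate_per_year * years)


def compare_1n_and_2n(failure_rate_per_year: float, years: float = 1.0) -> dict:
    """Compare a single component (1N) with two independent copies (2N).

    Assumptions (illustrative, not from the article): failures are independent,
    nothing is repaired during the period, and the 2N system is only lost
    when both copies have failed.
    """
    p = component_failure_probability(failure_rate_per_year, years)
    return {
        # Expected number of components that fail during the period.
        "expected_failed_components_1n": p,
        "expected_failed_components_2n": 2 * p,
        # Probability that the facility as a whole loses its load.
        "system_failure_probability_1n": p,
        "system_failure_probability_2n": p * p,
    }


if __name__ == "__main__":
    # 0.1 failures per year is an assumed figure chosen only for illustration.
    for name, value in compare_1n_and_2n(failure_rate_per_year=0.1).items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions the 2N design sees twice as many failed components, yet its chance of a system failure over the year falls from roughly 9.5 percent to roughly 0.9 percent. Real facilities have repair, maintenance windows and shared failure modes that this sketch ignores.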

MTechnology specializes in quantitative risk assessment for mission-critical industries, with a focus on failures in complex systems.

“This is not me stroking my beard and telling you about how much experience I have,” said Fairfax. “This is about math and science. When human judgment and experience run up against probabilities and math, math wins. Winners are rare in Las Vegas. But (casinos) make a lot of money in Las Vegas.”

He also praised the 7×24 Exchange as a forum where discussion of complex reliability challenges is welcomed, even if it challenges popular industry beliefs and practices. “This is a very unique forum,” said Fairfax. “I don’t know very many places where I’d be able to say things like this.”

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

Add Your Comments

  1. Emerson Network Power has long promoted the need for a comprehensive Preventive Maintenance (PM) program, the single most important activity to maximize the availability of Uninterruptible Power Supply (UPS) systems and the battery systems upon which they depend. To confirm the importance of PM and provide insight into the impact of human error on reliability, Emerson Network Power analyzed data collected by its service organization and wrote this white paper outlining the findings. The data covered 185 million operating hours for more than 5,000 three-phase UPS units, and more than 450 million operating hours for more than 24,000 strings of batteries. The UPS analysis looked at the impact of both electrical failure and service-related human error, and the battery analysis allowed the impact of UPS system downtime due to bad batteries to be factored into the calculations. This research indicated that the UPS Mean Time Between Failures (MTBF) for units that received two PM service events a year is 23 times higher than that of a unit with no PM service events per year, but this is ONLY the case if the service is performed by a factory-trained service engineer. We gathered the number of outages due to human error, and for our field technicians we saw an error rate of one error per 5,000 service events. This metric is something we are extremely proud of and shows the professionalism and expertise of our field organization. We cannot gauge the error rate per service event for an untrained service technician from a third-party service provider, but we can state that if you use the OEM service from Emerson Network Power your reliability and availability will substantially increase, not decrease. We also gathered data on battery-related outages that occurred and then projected the impact of added monitoring services to the units. The analysis found that to date, there have been zero system outages due to bad batteries on systems where the batteries have been professionally maintained and remotely monitored by system experts. The conclusion of this analysis re-affirmed that proactive maintenance and remote monitoring services increase system reliability.

  2. Mr. Miller: Quick nitpick: "human error is a leading cause of downtime." Human error is not a cause. It is an effect. The field of Resilience Engineering illuminates the problems with this statement.

  3. I can only assume these comments are taken out of context, or are worded in such a way as to provoke a response. An argument against PM planning and execution is irresponsible absent context. Poor execution does lead to failure, in maintenance as well as operations. The assumption in the article is that there is a skewed emphasis on "more" without consideration of what the effort entails. I'd be willing to match my experiences with the author and prove that this is an edge case, a low probability scenario. The majority of maintenance groups today are struggling to merely get to existing PM schedules.