Dennis Cronin is CEO of the Data Center Incident Reporting Network (DCIRN).
Come on, folks. Give OVH a break. If you had a data center with thousands of customers, hosting, as reported, “millions of websites,” would you be any better prepared? Would your response be any better?
On Wednesday, 10 March 2021, a fire broke out in a room at the SBG-2 OVHcloud data center in Strasbourg, France. The fire reportedly destroyed SBG-2 and damaged four of 12 rooms in the adjacent SBG-1 data center. Two adjacent data centers, SBG-3 and SBG-4, were not damaged but were shut down during the event, requiring a massive, time-consuming reboot of all their systems.
Now, nearly 14 days later, OVH has restored much of its services in the remaining operational Strasbourg data centers or moved clients to its other sites. OVH still has a lot to do, but their herculean effort to get this far, and as quickly as they did, should be recognized and commended.
Are they 100 percent? No. And if you are one of those remaining without service, you are naturally unhappy. But when the founder quickly puts out a video explaining what they know and what they are doing, and then follows up in the succeeding days with more detailed status updates, it is difficult to challenge the company’s transparency.
Tragedies like this require good leadership and a massive effort before things get back to some semblance of normalcy. Will it take another two weeks, a month, or longer? Only time will tell.
Now comes the hard part: rebuilding reputation and trust.
It would seem that OVH’s transparency to date has them off to a good start, but it will take months, or even years, of effort to get to the bottom of what caused this disaster, what factors contributed to the extensive damage, and what to do to prevent a repeat in another facility.
Clients, investors, insurers, and the data center industry that takes pride in high availability and reliability of data center facilities will all want to know the details. An event like this cannot be whitewashed in any way. It is far too public, and questions will linger for years.
At a minimum, the forensic analysis will need to address:
- What happened?
- What went wrong to lead to this disaster?
- What needs to be changed in facility design and operations to avoid a repeat?
There are typically many contributing factors in a disaster like this. Often, just eliminating a factor or two can prevent the disaster altogether. However, that does not resolve the underlying factors that remain. In the end, a comprehensive evaluation covering the forensics, the design, the operating procedures, and staffing must be done.
The following is the start of a road map for OVH and others who may find themselves in similar situations. The investigating team will have many paths to follow, and each must be checked for its contribution to the disaster.
1. Where did the fire start?
It was first reported that the fire started in a UPS system that was serviced the previous day. There was also a rumor that lithium-ion batteries were the source of the fire.
While the circumstances make these systems suspect, unless there is some specific evidence, such statements are still speculative. As of 23 March, the investigation into the root cause is ongoing. Further, such systems have multiple safeties, so, if they are the source, why didn’t the safeties work?
2. Why did the fire spread so fast?
Good data center design separates the power systems from the IT rooms. However, there are many less costly designs that have the electrical and battery equipment within the data center space. We find this more prevalent in containerized and modular designs, and even some brick-and-mortar facilities, where the UPS systems are made up of modules of 200 kVA or less.
In studying internet videos of the building burning, it appears that SBG-2 may be some form of stacked modular design. One has to evaluate what elements of the structural design contributed to this incident.
Whatever the source of the fire, it should have been contained within the room, which raises more questions:
- Were fireproof construction materials used?
- Were there cable openings and other openings between these spaces that were not properly sealed and allowed the fire to rapidly spread?
3. Fire alarms
The first reports did not mention fire alarms going off, but later reports did. So, how engulfed was the data center before the fire alarms were sounded?
There is a lot of air velocity in a data center that makes it difficult for standard smoke detectors to work properly. Were the appropriate detectors in use?
Were all the fire zones active, or were some still in a service mode due to the earlier UPS maintenance work?
Most modern data centers also use supplemental smoke detection, referred to as an Early Warning System, or an Incipient Fire Detection System, that is quick to detect particles of combustion in an enclosed space. Given that this data center is circa 2011, it may not have had such a system. Hopefully, all newer sites do and older sites will be retrofitted with one of these supplemental systems.
4. Fire suppression
Was there a fire suppression system in place, and if so, why did it fail to protect the site?
In most building designs, codes require a basic sprinkler system to be installed. The theory is that if the fire can melt a sprinkler head, then the IT equipment is gone anyway, so let’s at least save the building, which is the item that takes longest to replace. Obviously, that did not happen here.
Gaseous fire suppression? At least one report suggested that there was a “Gaseous Fire Suppression System” in place. These systems use a breathable gas like Halon (the original gas) or today’s more environmentally friendly gases like FM-200 or Novec 1230. The advantage of these systems is that they do not destroy electronics or data. The disadvantage is that for the gas to work, the protected areas must reach a specific gas concentration. Open doors and holes in floors, walls, and ceilings can rapidly dissipate the gas to a concentration at which it becomes ineffective.
Was such a system in place? If so, why did it not work?
A special note to those who think a gaseous suppression system should have been in place: In certain countries, any employee who enters a gaseous fire suppression zone has the right to turn the system off while they occupy the zone. That means someone must remember to turn it back on after the last person leaves. Even in well-trained operations, the chance of the gaseous fire suppression system being forgotten is high. While these systems are effective, one must take into account all the operational criteria required for them to be useful.
5. Security systems
It was reported that OVH had numerous cameras throughout the facility. Were these cameras monitored or just on recorders? If they were monitored, why didn’t anyone notice the smoke or flames?
6. DCIM, BMS, BAS, SCADA, etc.
Surely, a facility of this size has some form of monitoring system or systems. Assume for a moment that the cause was the UPS system. If this were the case, the system would rapidly fail, sending alarms to the appropriate personnel to immediately investigate.
Did this happen? If not, why? If so, when?
7. Cyber issues
While currently there isn’t even a hint of this being a cyber event, to be thorough, this needs to be evaluated.
In February, the water treatment system in the City of Oldsmar, Florida, was hacked by someone who remotely accessed the control system Human Machine Interface (HMI) and modified the set point for sodium hydroxide (lye) to a level that would be toxic to humans. If it weren’t for an alert employee, the results could have been devastating.
It is unfortunate that we live in a time where these things are done by bad actors, so any major disaster must be evaluated to eliminate the possibility of a cybercrime. The positive of such an evaluation is that it will undoubtedly uncover some cyber weaknesses that need to be rectified.
8. Staffing
Staff, especially third-party staff and vendor technicians, always get a bad rap when disaster strikes. Sometimes it is deserved, but often the instructions and procedures given to the staff are lacking, as there is an underlying expectation that they know everything about a facility the moment they walk in the door. These expectations make it difficult to get to the truth about an event, as the finger-pointing and denials start long before the facts are known. The blame game needs to be quashed early, as it is unproductive and could even lead to blocked information that’s key to understanding the event.
There was once a UPS room with 52 electricians working in it during the business day, preparing planned modifications, when one of the four UPS systems tripped offline and went to bypass. The facility manager came running in, yelling and screaming, telling the electricians they were all fired (guilt by association).
Once the project manager responsible for the electrical crew evaluated the situation, it was readily determined that the UPS systems had not been serviced in the two years prior to the event. Apparently, when the facility manager came onboard, he wanted to save money, so he declined to sign any UPS service contracts. The UPS subsequently had too many internal redundant components fail, causing it to go to bypass. The electricians just happened to be in the wrong place at the wrong time.
The moral of the story is that things are not always as they appear. Follow the facts!
9. Sequence of events
One of the most important factors in determining the root cause of this disaster and its extensive collateral damage is going to be creating an accurate timeline of events from multiple sources of information. To do this accurately, the time stamps of the various systems must first be reconciled against a master clock. Otherwise, false conclusions may be drawn about the sequence of events.
Once an accurate timeline of events is established, one can then start to evaluate the speed with which events occurred, leading to additional insights, answers about everything that went wrong, and the numerous lessons to be learned.
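To make the clock-reconciliation step concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the source names (`ups_monitor`, `fire_panel`, `cctv`), the clock offsets, and the time stamps are invented and have nothing to do with the actual OVH systems or logs. It also assumes each system's drift from the master clock has already been measured.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str           # hypothetical name of the system that logged the event
    local_time: datetime  # time stamp as recorded by that system's own clock
    message: str


def reconcile(events, offsets):
    """Return (master_time, event) pairs sorted by master-clock time.

    offsets[source] is how far that source's clock runs AHEAD of the
    master clock (negative if it runs behind). These values are assumed
    to have been measured beforehand.
    """
    corrected = []
    for ev in events:
        if ev.source not in offsets:
            raise KeyError(f"no measured clock offset for source {ev.source!r}")
        corrected.append((ev.local_time - offsets[ev.source], ev))
    corrected.sort(key=lambda pair: pair[0])
    return corrected


# Invented example data, not taken from any real incident.
events = [
    Event("cctv", datetime(2000, 1, 1, 0, 38, 0), "smoke visible on camera"),
    Event("fire_panel", datetime(2000, 1, 1, 0, 39, 0), "smoke detector alarm"),
    Event("ups_monitor", datetime(2000, 1, 1, 0, 40, 10), "UPS fault alarm"),
]
offsets = {
    "cctv": timedelta(seconds=-90),         # recorder clock runs 90 s slow
    "fire_panel": timedelta(0),             # already synced to the master clock
    "ups_monitor": timedelta(seconds=120),  # monitor clock runs 2 min fast
}

timeline = reconcile(events, offsets)
for master_time, ev in timeline:
    print(master_time.isoformat(), ev.source, ev.message)
```

In this made-up case, the raw time stamps suggest the camera saw smoke first and the UPS alarm came last. After correction, the UPS alarm comes first and the camera last, which is exactly the kind of reversal that could send an investigation down the wrong path.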
10. Unprepared clients
Every client has the responsibility for a Disaster Recovery Plan (DRP). If done properly, cloud environments tend to make this easier as data and workloads are readily moved to alternate sites. Just the same, these plans must be tested and validated.
Far too often this is not done, as staff is overwhelmed just with their day-to-day workload. Each company’s management must deal with the consequences of a poor or non-existent DRP. The DRP can be outsourced, but the responsibility for a working DRP remains with the management of the affected company, and they need to take charge of their own destinies.
We have outlined 10 key areas with which to start. Many more will be added as the investigation gets into full swing.
We started out with a statement that OVH, its clients, investors, insurers, and the data center industry all want to know the details of how this disaster happened. As the facts are established, the list of “lessons learned” will be extensive. The lessons will come not just from the initiating incident but also from all the incidents that followed.
Another tragedy in the wings?
What will happen to this extensive knowledge database that is about to be created? Will it be buried behind some NDA, limiting those who know the facts? Will it be used internally to evaluate and take preventive actions at OVH’s other, similar sites?
This event should not be assumed unique to OVH. Everyone in the global data center industry has something, and probably many things, to learn from this disaster. This includes clients, support vendors, designers, contractors, and staff. It will be a tragedy if the facts and the lessons learned are restricted to a small exclusive group and not shared with the data center industry.
This is a teaching moment that most have yet to experience. While the disaster will always be known, the current and future generations of data center designers, builders, and operators will need access to the pertinent facts of this incident, so they, too, can learn and prevent a repeat.
Through its website, the Data Center Incident Reporting Network encourages anonymous sharing of data center incidents from around the globe to create a global database of all incidents from which industry trends can be established and from which the next generation can learn.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.