Can a Data Center Outage Database Prevent Data Center Outages?

Suppose a service outage at a data center were an unsolved crime in a detective novel — a good one. You’d expect the book to contain all the clues you’d need to piece together the chain of events and solve the case.

No database of data center event logs is as interesting as a crime novel, but it should contain all the data pertaining to the cause of an incident. However, the sheer volume of data often makes the cause difficult to identify. Still, the more data you have, the more likely it is that an analytics function or even an artificial intelligence (AI) algorithm will spot the culprit.

Here’s the challenge: How much data must be pooled from all the facilities before we can determine not only a pattern but a formula for incidents, one that can be used to diagnose them before they occur? The answer may lie in determining first just how many similarities there are from one data center to another – an amount that appears to be shrinking.

Patterns and Motives

“Like any maturing industry, data center is becoming rapidly commoditized,” says Peter Gross, member of the Executive Committee of the Data Center Incident Reporting Network (DCIRN) and former VP of mission critical systems at Bloom Energy. “The major components, whether they’re cooling systems, power systems, fire protection, EPO [emergency power off], architectural components, are the same types. The architecture (the configuration and topology of the data center, the electrical, mechanical) might be slightly different, but you see commonality in the configurations.”

Hyperscale facilities follow design patterns, Gross tells Data Center Knowledge, and to that end, a hyperscale service provider’s facilities will have greater commonalities with one another. Enterprise data centers are a different story. Although they share essentially identical components, they are implemented differently from site to site, in each case for a different set of reasons.

Gross and his colleagues’ mission with DCIRN is to implement a kind of dragnet for clues. If the organization can collect enough data about a broad enough selection of monitored conditions over time, correlations that would otherwise be missed with respect to a single incident could eventually reveal themselves, he says.

Perhaps your enterprise already reports telemetry about the condition of your server and client operating systems to the OS manufacturer. Usually such a process takes place anonymously. What’s more, an OS installation on one computer bears strong resemblance to other installations, not just in the same building but worldwide.

DCIRN is an effort to pull off a telemetry and reporting system for the buildings, resources, and infrastructure that servers and their workloads rely on. The organization says it takes steps to anonymize the data it collects. With a respectable plurality of reports on hand, it aims to inform members about the correlations it finds between data center operating conditions and classes of similar power-outage and data-loss incidents.

Eventually, Gross tells us, DCIRN will be able to apply real-time analytics to the data, enabling remote algorithms to aid in diagnosis and remedy for potentially serious events before things ever get really bad.

“DCIRN is very young,” Gross admits. “It just started. Before we develop some analytics tools, we need to build up the database population. We don’t have nearly enough incidents recorded to start analyzing the data, organizing it, and developing tools that would [reveal] meaningful information about a design or architecture. Right now, it’s strictly a collection of incidents.”

When Is an Incident an Incident?

An “incident,” as Gross and DCIRN perceive it, is an anomalous event, something out of the ordinary that has a measurable impact on service. Earlier this year the Uptime Institute introduced a classification for such incidents: the five-tier Outage Severity Rating (OSR). The first tier in OSR is for events so minimal in scope and impact that they don’t warrant mentioning in the press, even collectively. Uptime CTO Chris Brown told us at the time that every event, even one with negligible impact at any level, should at least be recorded and classified, especially if it could play a role in diagnosing a more impactful incident down the road.

“This [OSR] is not just a tool for after the fact, [but] for after you’ve had a fire in the closet and the fire’s put out, using this tool to assess how badly it hurt us,” Brown said. “This is a tool to look at every possible thing that could happen in your data center, your networks and your IT systems, ranking what that impact would be to the company, so that you can focus on those things that are going to have the Category 3, 4, 5 impacts before you focus on the things that are only going to have a Category 1 or 2 impact.”
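The prioritization Brown describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Risk` class, the `prioritize` function, and the default threshold of 3 are hypothetical names and choices made for this example, not part of Uptime’s OSR, and the category is treated simply as an integer from 1 (least severe) to 5 (most severe).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """A possible event and its assumed impact category (hypothetical structure)."""
    description: str
    category: int  # assumed scale: 1 (least severe) to 5 (most severe)


def prioritize(risks, threshold=3):
    """Split risks into (urgent, deferred) lists, each sorted most severe first.

    Risks at or above `threshold` are urgent; the rest are deferred.
    The threshold of 3 mirrors the "Category 3, 4, 5" grouping in the quote.
    """
    for risk in risks:
        if not 1 <= risk.category <= 5:
            raise ValueError(f"category out of range: {risk.category}")
    by_severity = sorted(risks, key=lambda r: r.category, reverse=True)
    urgent = [r for r in by_severity if r.category >= threshold]
    deferred = [r for r in by_severity if r.category < threshold]
    return urgent, deferred


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    sample = [
        Risk("single CRAC unit alarm", 2),
        Risk("loss of utility feed with failed transfer", 5),
        Risk("UPS battery string degraded", 3),
    ]
    urgent, deferred = prioritize(sample)
    for risk in urgent:
        print(risk.category, risk.description)
```

The sample entries and the categories assigned to them are invented; in practice each organization would rank its own list of possible events.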

Uptime is a DCIRN partner, and Gross has been keeping up with its progress. There’s a chance OSR will play a role in classifying the events in DCIRN’s database. However, whereas one of Uptime’s stated goals is to make incident reports easier for executives and non-IT personnel to comprehend, Gross says he doesn’t see C-suite members becoming DCIRN consumers.

“This is not for an executive to evaluate the performance or behavior of his data center,” Gross says. “I understand perfectly well why Uptime developed this five-level scale of incidents. Most people who do not know enough about the data center business have a difficult time understanding its performance qualities. Such a simple scale is very useful for managers and executives.”

He notes that “DCIRN’s objective is not necessarily to educate the executives. It’s really for the designers, operators, facilities people, architects, testing and commissioning people — the rank and file of the industry. With the reporting structure we have in place right now, there is enough detail that describes not only what happened, but also the consequences of the incidents — whether or not they dropped a load, what kind of conditions they created, whether a piece of equipment had to be repaired or replaced. Once we have enough information there and we can start organizing events, we’ll create a system that will enable people to be more precise [and] more focused on the types of situation they are looking for.”
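To make that reporting structure concrete, here is a minimal Python sketch of what one incident record and a simple search over a collection of records might look like. The article does not describe DCIRN’s actual schema, so every field and function name below (`IncidentReport`, `EquipmentOutcome`, `find_incidents`, and so on) is an assumption for illustration; the fields simply mirror the details Gross lists: what happened, whether a load was dropped, what conditions were created, and whether equipment was repaired or replaced.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class EquipmentOutcome(Enum):
    """What happened to the affected equipment (hypothetical categories)."""
    NONE = "none"
    REPAIRED = "repaired"
    REPLACED = "replaced"


@dataclass(frozen=True)
class IncidentReport:
    """One anonymized incident record (hypothetical schema, not DCIRN's)."""
    summary: str                               # what happened
    component: str                             # e.g. "cooling", "power", "EPO"
    dropped_load: bool                         # did the incident drop a load?
    conditions_created: Tuple[str, ...] = ()   # resulting conditions
    equipment_outcome: EquipmentOutcome = EquipmentOutcome.NONE


def find_incidents(reports, component: Optional[str] = None,
                   dropped_load: Optional[bool] = None):
    """Return reports matching every filter that is not None."""
    matches = []
    for report in reports:
        if component is not None and report.component != component:
            continue
        if dropped_load is not None and report.dropped_load != dropped_load:
            continue
        matches.append(report)
    return matches
```

With records in this shape, a facilities engineer could ask, for example, for every power incident that dropped a load by calling `find_incidents(reports, component="power", dropped_load=True)`.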

Uptime’s OSR is a non-holistic mechanism for classifying incident severity. Still, it could give a broader-scale data center management system five colors of poster paint, enabling someone to paint a reasonably holistic mural depicting, say, data loss at the time of a power incident. The potential does exist for DCIRN’s database and Uptime’s broad brushstrokes to be paired up.

As different as facilities tend to be from one another, you can argue that the only way to ascertain whether there’s any science to incident prediction at all is to gather as much observational data from these facilities as you can and use analytics to determine whether any correlations truly exist. Gross tells us that DCIRN is in the first stage of that mission; the second stage may come much later. Yet perhaps there is some science we can apply to the problem even in these early days. Couldn’t machine learning (ML) shed some light on the general direction of the answer, in hopes that we may get a jump on figuring out what the question should be?
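As a rough illustration of what “determining whether correlations exist” can mean in practice, the Python sketch below computes the phi coefficient, a standard measure of association between two yes/no variables, over pooled observations. It is not a method the article attributes to DCIRN. The function name and the input format (pairs of booleans meaning “operating condition present” and “incident occurred”) are assumptions for this example.

```python
import math


def phi_coefficient(observations):
    """Phi coefficient for (condition_present, incident_occurred) boolean pairs.

    Returns a value from -1.0 to 1.0, or None when either variable never
    varies (the coefficient is undefined in that case).
    """
    both = cond_only = inc_only = neither = 0
    for condition_present, incident_occurred in observations:
        if condition_present and incident_occurred:
            both += 1
        elif condition_present:
            cond_only += 1
        elif incident_occurred:
            inc_only += 1
        else:
            neither += 1

    denominator = math.sqrt(
        (both + cond_only) * (inc_only + neither)
        * (both + inc_only) * (cond_only + neither)
    )
    if denominator == 0:
        return None
    return (both * neither - cond_only * inc_only) / denominator
```

A value near zero suggests no association in the pooled data, while values near 1 or -1 suggest a strong one. With only a handful of records the result means little, which is consistent with Gross’s point that the database must grow before analysis is worthwhile. Correlation alone also does not establish cause.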

“Absolutely,” Gross says. “That’s where this whole industry is going. I don’t see a better tool for machine learning than having such a database. This is perfect for machine learning and this is where the entire data center maintenance operation is moving to.”
