Enlisting Machine Learning to Fight Data Center Outages

How companies old and new are integrating machine learning into data center management tools to improve data center availability.

Data center operators have many tools in their arsenal to maintain uptime, from redundant power and cooling infrastructure and automated failover to network and system monitoring software to entirely redundant secondary sites. Now, yet another option is emerging: artificial intelligence and machine learning.

Much as self-driving cars promise to upend the auto industry, the use of AI and machine learning, which is a subset of AI, has the potential to transform data center management. One of its key promises is improving data center availability and resiliency.

Established companies in the sector and startups alike are incorporating machine learning into their software and services, allowing data center operators to predict and prevent outages and equipment failure.

Schneider Electric, Eaton, and Nlyte Software are among the first to leverage machine learning in cloud-based data center management products, and more companies are expected to enter the space in 2019, Rhonda Ascierto, research VP at Uptime Institute, said in an interview with Data Center Knowledge.

There are also on-premises software options. Montreal-based Maya Heat Transfer Technologies (HTT), for example, has added machine learning capabilities to its data center infrastructure management (DCIM) software. Meanwhile, Santa Clara, California-based software startup AdeptDC plans to make its software available both on-premises and in the cloud.

While the use of machine learning in data center management is not a new development, growth in the space has accelerated recently – a result of growth in data science, an increasing number of data scientists, and availability of affordable compute resources in the cloud, Ascierto said.

How Machine Learning Improves Resiliency

Machine learning can help improve data center reliability through anomaly detection; failure-rate prediction modeling, which estimates the statistical likelihood that equipment will fail after a certain time period based on historical data; and incident analysis, which helps determine the cause of an outage or other performance issues, Ascierto said.
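To make the failure-rate idea concrete, here is a minimal sketch that estimates, from historical records, the share of units that had failed by a given age. The function name and the failure ages are hypothetical, and the approach deliberately ignores units still in service, which a production model would have to account for.

```python
def failure_probability_by(age_months, failure_ages):
    """Empirical share of past units that had failed by age_months.

    failure_ages: ages (in months) at which previously retired units failed.
    """
    if not failure_ages:
        raise ValueError("need at least one historical record")
    failed = sum(1 for age in failure_ages if age <= age_months)
    return failed / len(failure_ages)


# Hypothetical ages at which past UPS batteries failed (illustrative only).
history = [30, 36, 41, 44, 48, 52, 55, 60, 63, 70]

# Share of past batteries that had failed by month 48.
print(failure_probability_by(48, history))
```

With these made-up records, half of the batteries had failed by month 48, which is the kind of figure an operator could use to decide when to start ordering replacements.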

These techniques can improve uptime in several scenarios. Machine learning algorithms can correlate data from power, cooling, and IT infrastructure to identify anomalies, pinpoint the exact location and cause of an issue, and send alerts and recommendations to IT staff, AdeptDC CEO Rajat Ghosh told us.

“It can be predictive: that this could be a problem, but it hasn’t happened yet. But if there’s an existing issue or something goes down, it can help you resolve it in the most optimal fashion,” Ghosh said.

For example, Maya HTT’s DCIM software uses algorithms that learn the variables that led to equipment failure in the past, and if the pattern starts to recur, the software will notify the IT staff, Remi Duquette, VP of applied AI and Data Center Clarity LC at the company, explained.

The goal is to provide data center operators enough lead time to take corrective action, Enzo Greco, chief strategy officer at San Mateo, California-based Nlyte, said. If a server is about to crash, an hour is plenty for IT staff to move virtual machines to another server. Pending failure of a UPS battery, however, may require a notification window of a couple of weeks, since it often takes that long to get one replaced, he said.
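The lead-time point can be expressed as a simple check: a prediction is only useful if it arrives earlier than the time the fix takes. The sketch below uses the one-hour and two-week figures Greco mentions; the table and function names are hypothetical and do not reflect how Nlyte's product is built.

```python
from datetime import timedelta

# Time needed for corrective action, per equipment type (assumed values
# based on the examples in the article).
REQUIRED_LEAD_TIME = {
    "server": timedelta(hours=1),       # migrate VMs to another server
    "ups_battery": timedelta(weeks=2),  # order and install a replacement
}


def alert_is_actionable(equipment_type, predicted_time_to_failure):
    """Return True if the warning leaves enough time to act."""
    return predicted_time_to_failure >= REQUIRED_LEAD_TIME[equipment_type]


# A three-hour warning is enough for a server, but five days is too
# short for a UPS battery replacement.
print(alert_is_actionable("server", timedelta(hours=3)))
print(alert_is_actionable("ups_battery", timedelta(days=5)))
```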

Machine learning can also help data center operators move beyond preventative maintenance – done at regular intervals, say, every three months – to predictive maintenance, Ascierto said.

In predictive-maintenance scenarios, software continually monitors equipment conditions and feeds operational data into a machine-learning algorithm to determine when maintenance is actually needed. As a result, maintenance costs can go down – because you’re doing less maintenance – and risk can be reduced, because there are fewer opportunities for human error. “If a human doesn’t have to touch equipment, it lowers risk,” Joe Reele, VP of data center solution architects at Rueil-Malmaison, France-based Schneider, said.
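One simple way to picture condition-based scheduling is to fit a trend to a monitored metric and estimate when it will cross a service threshold. The sketch below does this with an ordinary least-squares line over daily readings. The metric, the threshold, and the function name are all assumptions for illustration, and real products likely use far more sophisticated models.

```python
def days_until_threshold(readings, threshold):
    """Fit a straight line to daily readings and extrapolate to threshold.

    Returns the estimated number of days after the latest reading at which
    the trend line reaches the threshold, or None if the metric is not
    trending upward.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, readings))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or improving: no maintenance predicted
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Day index of the latest reading is n - 1; never report negative time.
    return max(0.0, crossing_day - (n - 1))


# Hypothetical daily readings of a degradation metric, rising one unit a day.
print(days_until_threshold([10, 11, 12, 13, 14], threshold=20))
```

Instead of servicing the unit every three months regardless of condition, an operator would schedule work only when the estimate drops below the time needed to arrange it.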

Using machine learning for cybersecurity and risk mitigation can also bolster data center uptime, Ascierto said. “Before you put in new technology or change the configuration or layout, you can model it in a sandbox to understand what it does to the resiliency of data center operations,” she explained.

Maya HTT, which collects data from 400 communications protocols and monitors servers, storage, switches, and power management equipment, uses machine learning to protect customers from cyberattacks.

“It’s a challenge because the attacks keep changing and hackers are getting better at hiding their patterns,” Duquette said. “But the AI is quite rapid at learning new patterns and understanding what is outside the normal ranges of network traffic in switches.”

The Cloud’s Role

Global, distributed cloud infrastructure plays an important role in further development of machine learning-enabled data center management. The accuracy of machine learning algorithms depends to a great extent on the amount of data that’s available for training them, and the cloud can provide access to a vast amount of data.

Schneider’s first-generation AI-powered services are focused on reducing risk and improving efficiency, Reele said. The company provides real-time data center monitoring through its data center management as a service (DMaaS) offering called EcoStruxure IT. Launched in 2017, it collects data from its customers worldwide, anonymizes it, and pools it into large data lakes in the cloud.

EcoStruxure then takes each customer’s data center performance, benchmarks it against the global data, and sends the results to individual customers. The DMaaS offering currently runs high-level analyses on some critical equipment, such as UPS and cooling systems, to find anomalies and predict whether equipment will fail, Reele said. When it finds issues, the service alerts customers.

If two of a client’s 30 UPS units are outliers within Schneider’s worldwide database, the client gets a report that says they’re likely to fail, along with an explanation why. “We are giving them actionable intelligence. We are not saying they will fail tomorrow at 2 p.m., but they will probably fail at some point,” Reele said.
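The outlier idea can be sketched with a robust statistic: compare each unit's reading with the median of the population and flag those that sit unusually far from it. The example below uses the median absolute deviation with a common rule-of-thumb cutoff of 3.5. The metric, the unit names, and the data are invented, and this is not a description of how EcoStruxure IT works internally.

```python
from statistics import median


def fleet_outliers(metric_by_unit, cutoff=3.5):
    """Flag units whose metric is far from the population median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(metric_by_unit.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the population, nothing to compare against
    return sorted(
        unit
        for unit, v in metric_by_unit.items()
        if 0.6745 * abs(v - med) / mad > cutoff
    )


# Hypothetical fleet of 30 UPS units: 28 read between 40.0 and 42.0 on
# some health metric, and two read much higher.
fleet = {f"ups-{i:02d}": 40.0 + (i % 5) * 0.5 for i in range(1, 29)}
fleet["ups-29"] = 55.0
fleet["ups-30"] = 58.0

print(fleet_outliers(fleet))  # ['ups-29', 'ups-30']
```

A median-based measure is used here because a handful of failing units would otherwise inflate the average and standard deviation and mask themselves.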

DCIM software vendor Nlyte launched a DMaaS this May using IBM’s Watson IoT as its machine learning engine. The company currently offers power management and predictive thermal services to improve efficiency, but it’s also working on two new uptime-focused services: predictive failure and predictive workflow maintenance, which is the ability to manage workloads and move them based on the expected future state of the data center, Greco said.

For example, if a power failure is expected on a particular piece of equipment, the algorithms will determine where to move virtual machines that depend on it. The service will be able to move the VMs elsewhere in the same data center, to a different facility, or to a cloud service. Some customers want to receive an alert and move the workloads themselves, while others are comfortable letting VMware’s vMotion move the VMs automatically.
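A toy version of that placement decision might look like the following greedy planner, which tries targets in a fixed order of preference and places the largest VMs first. The preference order, the capacity units, and all names are assumptions made for illustration; the article does not describe Nlyte's actual algorithm, and the sketch only produces a plan rather than calling any migration tool.

```python
# Assumed preference: stay local if possible, then another site, then cloud.
PREFERENCE = {"same_data_center": 0, "other_facility": 1, "cloud": 2}


def plan_moves(vms, targets):
    """Greedy first-fit placement of VMs onto the most-preferred target.

    vms: dict of VM name -> required capacity units
    targets: list of dicts with "name", "location", and "free" capacity
    Returns a dict of VM name -> target name (None if nothing fits).
    """
    free = {t["name"]: t["free"] for t in targets}
    ordered = sorted(targets, key=lambda t: PREFERENCE[t["location"]])
    plan = {}
    # Place the largest VMs first so they are not crowded out.
    for vm, need in sorted(vms.items(), key=lambda kv: -kv[1]):
        plan[vm] = None
        for target in ordered:
            if free[target["name"]] >= need:
                free[target["name"]] -= need
                plan[vm] = target["name"]
                break
    return plan


# Hypothetical workloads on equipment that is predicted to lose power.
vms = {"db": 8, "web": 4, "cache": 2}
targets = [
    {"name": "cloud-pool", "location": "cloud", "free": 100},
    {"name": "rack-b", "location": "same_data_center", "free": 10},
    {"name": "site-2", "location": "other_facility", "free": 6},
]

print(plan_moves(vms, targets))
```

A customer who prefers to stay in control could review the returned plan after an alert, while one who trusts the system could hand it straight to an automated migration step.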

“Customers that have more confidence in their AI systems will cede control,” Greco said. “Other customers are not as convinced yet.” Nlyte expects to offer predictive failure and predictive workflow maintenance in the first half of 2019.

Meanwhile, AdeptDC has expanded from its cooling optimization roots and has built software that helps data center operators troubleshoot and prevent equipment failures. Instead of using software agents, the company uses containerized software, which pulls data such as temperature, voltage, and power system status from data center equipment.

With AdeptDC’s software, some basic problems can be fixed automatically, Ghosh said. If the issues are more complex, the software gives recommendations to operators on how to fix them.

The End Game: Self-Driving Data Center

The ultimate goal here is to create fully autonomous data centers that are as efficient and resilient as possible. The ability of infrastructure to manage and heal itself will be important, Jennifer Cooke, research director of IDC’s Cloud to Edge Datacenter Trends service, said.

“Because we are embracing a distributed, multiple-cloud hybrid IT, it’s a whole new way of delivering IT service, and it’s very complex, so the more IT infrastructure that can be managed without people, the better the outcomes will be,” she said.

But we’re not there yet.

Getting there requires vast amounts of data from the components of data center equipment, because the more granular the data is, the more accurate AI becomes, Reele said.

Most data center operators implementing machine learning-based tools are still preparing their data, Ascierto said. They are instrumenting their equipment with sensors or meters to collect more data, and they are prepping and cleaning up their data to make sure it’s accurate.

Building an autonomous data center requires data scientists to continually improve and train their machine learning models, and data center operators have to be willing to give up control, vendors and analysts say.

Maya HTT’s team of data scientists sometimes fine-tunes or retrains its algorithms on an hourly or daily basis, depending on how much new data may influence behavior, Duquette said.

Progress is being made, however. Google revealed to us earlier this year that its infrastructure team had started using machine learning-powered software to fully automate cooling plants in its data centers. A year ago, Maya HTT could not do something similar, but because of its own advances, it’s prototyping and piloting a similar feature with some clients today.

“It shows you how fast things change in the AI space,” Duquette said.
