Data center cooling system, digital rendering

AdeptDC Wants to Use Machine Learning to Prevent Data Center Outages

The startup says its upcoming software will do much more than manage the cooling system.

As Google has demonstrated, using machine learning to understand heat patterns and fine-tune a data center's cooling system for maximum efficiency is a sound use case for the technology. But AdeptDC, a software startup applying machine learning to data center management, thinks the approach can be even more effective if it takes into account more than just cooling, or even power.

The company, which in its early years focused on cooling optimization, is expanding the scope of its capabilities. It promises a system that collects data from power, cooling, and hardware and correlates it to holistically optimize for efficiency, troubleshoot, issue incident alerts, and prevent equipment failures by identifying anomalies.

AdeptDC expects to launch its AI assistant for data center operators as soon as next month, CEO Rajat Ghosh told Data Center Knowledge in an interview. The assistant uses the same machine-learning technology the company has applied to cooling optimization, and the same relatively easy installation approach – via Docker containers – which doesn’t require hardware sensors.

Pilots with prospective customers taught the company that it would have to tackle more than cooling.

“We’ve been running pilots with several data centers in the US and overseas, and what we’ve learned is that reducing cooling costs and improving relative efficiency is nice to have but not the main thing [operators] care about,” Ghosh said.

Operators worry mostly about avoiding failures, which often happen because of problems with cooling and related hardware issues. (The disastrous Microsoft Azure outage last month was only the most recent high-profile example.) Applying its technology to help solve problems of that nature is AdeptDC’s new goal. “We’re using the same machine learning technology, but instead of just power and cooling optimization we’re using it to make sure the hardware is running healthy and predicting performance issues,” he said.

That means collecting operational data from server power supplies and fans, whose failure, according to him, is a primary concern in data center operations. “The CPU is already taken care of in the hardware architecture, but the power systems and server fans fail all the time.”

AdeptDC’s angle here is correlating hardware data with data on the state of the facility cooling system.

“Companies like Google use environmental data as a proxy for the overall health of the data center ecosystem and performance,” Ghosh said. Environmental data (temperature and humidity) is part of overall system health, but voltage monitoring is also critical, he suggested. “Voltage is a primary indicator of overall data center health; if voltage is behaving weirdly, then there could be all sorts of problems.”

It can take about a week after installation to gather enough data to get a baseline and start generating accurate correlations.
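As a rough illustration of what a baseline correlation between a hardware metric and a facility metric could look like, here is a minimal sketch that computes a Pearson coefficient over two series. The article does not say which correlation method AdeptDC uses; the function name, the metric names, and the readings are all invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical readings, not real data: server inlet temperature
# and server fan speed sampled at the same moments.
inlet_temp_c = [22.0, 22.5, 23.1, 24.0, 24.8, 25.5]
fan_rpm = [4000, 4100, 4250, 4400, 4600, 4750]

r = pearson(inlet_temp_c, fan_rpm)
print(f"correlation between inlet temperature and fan speed: {r:.3f}")
```

With only a handful of samples the coefficient is noisy, which is one plausible reason a system like this would need about a week of data before its correlations become reliable.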

The correlations are useful in generating fix recommendations when there are incidents and fine-tuning the cooling system, but most importantly, they are useful for detecting anomalies during normal operations. Once AdeptDC flags an anomaly, its dashboard shows which logical layer it’s in: IT, network, or power and cooling.
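One simple way to picture anomaly flagging against a baseline is a z-score check: a reading is flagged when it sits too many standard deviations from the baseline mean. This is only a sketch of the general idea, not AdeptDC's method, which the article does not describe. The metric names, the mapping of metrics to layers, and the threshold of 3.0 are assumptions made for the example.

```python
from statistics import mean, stdev

# Hypothetical mapping from metric name to the logical layers the
# article mentions (IT, network, power and cooling).
LAYER_OF = {
    "psu_voltage": "power and cooling",
    "inlet_temp": "power and cooling",
    "fan_rpm": "IT",
    "link_errors": "network",
}


def flag_anomaly(metric, baseline, reading, threshold=3.0):
    """Return the layer of `metric` if `reading` is anomalous, else None.

    A reading counts as anomalous when it lies more than `threshold`
    sample standard deviations from the baseline mean.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return LAYER_OF[metric] if reading != mu else None
    z = abs(reading - mu) / sigma
    return LAYER_OF[metric] if z > threshold else None


# Invented 12 V rail readings gathered during the baseline period.
baseline_voltage = [12.0, 12.1, 11.9, 12.0, 12.05, 11.95, 12.0]

print(flag_anomaly("psu_voltage", baseline_voltage, 12.05))
print(flag_anomaly("psu_voltage", baseline_voltage, 11.2))
```

A fixed threshold is the crudest possible choice; a real system would likely tune it per metric to balance early warning against false alarms.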

“We want to capture the symptoms that act as an early warning,” Ghosh said.

Correlations also help with troubleshooting. The system includes checklists for triaging incidents to help staff, who may be panicking during an outage or looking for problems in the wrong places. “When there are data center failures, most of the team runs to the server room, but server problems may be related to cooling issues,” Ghosh said.

There are multiple troubleshooting levels:

Level one is for simple things. If server lights aren’t on, for example, there may be a problem in the power or cooling system. The next level is slightly more complex, such as voltage issues inside a device. Even more complex levels deal with things like airflow data.

If the system goes through the lower levels and fails to identify the problem, the machine learning functionality kicks in to look for correlations among the various sources that could be causing it.
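The escalation described above can be sketched as an ordered list of rule-based checks with a fallback. The sketch assumes a snapshot dictionary with hypothetical field names and thresholds, and the `fallback` argument stands in for the machine learning step, whose workings the article does not detail.

```python
def check_lights(s):
    """Level one: a dark server points at power or cooling."""
    if not s["server_lights_on"]:
        return "check the power or cooling system"
    return None


def check_voltage(s):
    """Level two: voltage inside the device is out of tolerance."""
    if abs(s["device_voltage"] - s["nominal_voltage"]) > s["voltage_tolerance"]:
        return "inspect voltage inside the device"
    return None


def check_airflow(s):
    """Higher level: airflow below the expected minimum."""
    if s["airflow_cfm"] < s["min_airflow_cfm"]:
        return "inspect airflow"
    return None


# Checks ordered from simplest to most complex.
LEVELS = [("level 1", check_lights), ("level 2", check_voltage), ("level 3", check_airflow)]


def triage(snapshot, fallback):
    """Return (stage, finding); use `fallback` only if no rule matches."""
    for name, check in LEVELS:
        finding = check(snapshot)
        if finding is not None:
            return name, finding
    return "ml", fallback(snapshot)


# Invented snapshot in which only the voltage check fails.
snapshot = {
    "server_lights_on": True,
    "device_voltage": 11.2,
    "nominal_voltage": 12.0,
    "voltage_tolerance": 0.5,
    "airflow_cfm": 300,
    "min_airflow_cfm": 250,
}

print(triage(snapshot, fallback=lambda s: "run correlation analysis"))
```

Keeping the cheap checks first means the more expensive analysis runs only when the simple explanations have been ruled out.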

Using machine learning to handle incidents could help make up for the dwindling supply of skilled data center workers. “There is a huge talent shortage, and there are no university courses in data center operations management, so this is going to be a big problem going forward,” Ghosh noted. “Some of this job can be done by AI in a more systematic way, and I'm very hopeful that next-generation AI can help to bridge that gap between supply and demand.”
