Google TPU Pod
Racks of servers powered by Tensor Processing Units (TPUs), Google’s custom processors for machine learning (Photo: Alphabet)

Machine Learning Tools are Coming to the Data Center

New breed of tools use machine learning techniques to add human knowledge to sensor data

Back at the dawn of the internet, data centers could be small and simple. A large ecommerce service could do with a couple of 19-inch racks with all the necessary servers, storage, and networking. Today’s hyper-scale data centers cover acres, with tens of thousands of hardware boxes sitting in thousands of racks. Along with the design changes, these mega-server farms have been built in new, remote locations, trading proximity to large population centers for cheap power.

As they automate data center operations, public clouds like Amazon Web Services or Microsoft Azure hire fewer and fewer highly skilled data center engineers, who are usually outnumbered by security staff and relatively low-skilled workers who do manual labor, such as handling hardware deliveries. Fewer staff managing more servers means monitoring the power and cooling infrastructure requires greater reliance on sensors, which we might now call Internet of Things hardware. They help identify issues to an extent, but there are many cases in which the experience of a seasoned facilities engineer is hard to replace with sensors. These are things like recognizing a sound that indicates a fan is about to fail or locating a leak by hearing the sound of water drops.

You need more than sensors to monitor modern data center infrastructure, and a new generation of applications aims to fill the gap by applying machine learning to IoT sensor networks. The idea is to capture operator knowledge and turn it into rules to help interpret sounds and video, for example, adding a new layer of automated management for increasingly empty data centers. The services promise “to predict and prevent data center infrastructure incidents and failures,” Rhonda Ascierto of 451 Research told Data Center Knowledge. “Faster mean time to recovery and more effective capacity provisioning could also reduce risk.”

See also: Deep Learning Driving Up Data Center Power Density

Predictive Analytics and Wider Data Variety

The first step in this direction is predictive analytics in data center infrastructure management, or DCIM, software. One example is software by a company called Vigilent, based in Oakland, California. Its “control system is based on machine-learning software that determines the relationships between variables such as rack temperature, cooling unit settings, cooling capacity, cooling redundancy, power use, and risk of failure. It controls cooling units, including variable frequency drives (VFDs), by turning units on and off, adjusting VFDs up or down, and adjusting units' temperature setpoints,” Ascierto said. The system uses wireless temperature sensors and predicts what would happen if an operator took a certain action – such as shutting off a cooling unit or increasing set-point temperature.

A different example is Oneserve Infinite, which mixes sensors with a wider variety of data points, pulling in, for example, usage and weather conditions to deliver what the Exeter, England-based company calls “Predictive Field Service Management.” The aim here is to predict maintenance requirements, avoid failures, and keep downtime to a minimum. Chris Proctor, Oneserve’s CEO, told us that by applying these techniques, it should also be possible to handle strategic planning and procurement. “A data center would be able to manage their assets and their resources much more accurately and effectively,” he said. (To our knowledge, this kind of functionality isn’t yet live in any data center.)

Oneserve focuses on wider maintenance issues, but the approach maps well onto how data centers operate, working with in-house operations and third-party contractors. One useful aspect of its tooling is a dashboard that tracks issues with past maintenance, allowing users to detail where access may be difficult, or where problems have occurred multiple times. Today that’s a very manual approach, but you’ll need this kind of data to train a machine learning system in the future.

Tapping Human Knowledge

One example of a company that combines sensor data with operator knowledge is San Jose-based LitBit. According to Scott Noteboom, its founder and CEO, who in the past led data center strategy for Yahoo and later Apple, LitBit’s data center AI, or DAC, allows operators to build, train, and tune their own “co-workers” using machine-learning techniques. These could respond to events across a data center, alerting operators or – eventually – automating actions. The key to LitBit’s approach is a form of assisted learning, where the system alerts the operators when it detects a new abnormal event, and the operators then create a set of rules for reacting to such events in the future. To collect data, LitBit has a mobile app that takes videos, which it can then turn into thousands of images for training.
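To make the assisted-learning idea concrete, here is a minimal Python sketch of such a loop. It is not LitBit’s implementation: the names `Rule` and `AssistedMonitor`, the z-score anomaly test, and the `ask_operator` callback are all assumptions made for illustration. Readings that look normal extend the baseline; the first time an abnormal reading has no matching rule, the operator is asked to define one; later readings that match the rule trigger its action automatically.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical operator-defined reaction to a range of abnormal readings."""
    sensor: str
    low: float
    high: float
    action: str

    def matches(self, sensor: str, value: float) -> bool:
        return sensor == self.sensor and self.low <= value <= self.high


@dataclass
class AssistedMonitor:
    """Illustrative assisted-learning loop (an assumption, not a vendor API)."""
    ask_operator: Callable[[str, float], Rule]  # called on new abnormal events
    threshold: float = 3.0                      # z-score treated as abnormal
    history: dict = field(default_factory=dict)
    rules: list = field(default_factory=list)

    def is_abnormal(self, sensor: str, value: float) -> bool:
        past = self.history.get(sensor, [])
        if len(past) < 5:  # too little data to judge, so treat as normal
            return False
        sigma = stdev(past)
        if sigma == 0:
            return value != past[0]
        return abs(value - mean(past)) / sigma > self.threshold

    def observe(self, sensor: str, value: float) -> Optional[str]:
        if not self.is_abnormal(sensor, value):
            # Normal readings extend the baseline for this sensor.
            self.history.setdefault(sensor, []).append(value)
            return None
        for rule in self.rules:
            if rule.matches(sensor, value):
                return rule.action  # known event: react automatically
        # New kind of event: alert the operator and store the rule they create.
        self.rules.append(self.ask_operator(sensor, value))
        return "alert_operator"
```

In use, `ask_operator` would be wired to whatever interface the operators work in. A real system would classify sounds and images rather than a single numeric reading, but the division of labor between the detector and the operator is the same.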

The startup provides a managed cloud service, which will allow it to take advantage of many users’ anonymized data to build more complex and more accurate models; while some customers will choose to keep their trained models secret, others might sell theirs as an additional source of revenue. As Ascierto pointed out to us, “the value of data center management data multiplies when it is aggregated and analyzed at scale. By applying algorithms to large datasets aggregated from many customers – with diverse types of data centers and in different locations – … suppliers can, for example, predict when equipment will fail and when cooling thresholds will be breached.”

More on LitBit: This IoT Startup Wants to Break Down Data Center Silos

Don’t Go Seeking a Career Coach Just Yet

There’s a lot of implicit knowledge in operations, and surfacing it as rules can help identify problems and react faster, especially when the human operator with the knowledge isn’t around. Even if you don’t operate large geographically isolated data centers, you still want to be able to respond effectively during off-hours or during staff illness. A data center AI probably won’t completely replace your operations staff, but it could become a tool that enhances their existing skills and helps transfer them to other team members.

This area isn’t mature, but it’s developing fast. Machine learning applications using sensor data are improving rapidly and being used across a wide range of industries. Microsoft Research, for example, has been working with Sierra Systems to develop machine learning-based audio analysis for oil and water pipeline defects, using its Cognitive Toolkit to help classify anomalies. At the other end of the scale, machine learning models and tooling built for hyper-scale clouds are being scaled down, with compressed neural networks using quantized weights running on low-capacity devices like the Raspberry Pi.
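For readers unfamiliar with weight quantization, the sketch below shows one common form of it, symmetric 8-bit quantization, in plain Python. It is a generic illustration, not the scheme used by any tool named in this article, and the function names `quantize` and `dequantize` are our own. Each floating-point weight is stored as a small integer plus one shared scale factor, which shrinks the model at the cost of a bounded rounding error.

```python
def quantize(weights, bits=8):
    """Map float weights to signed integers that share a single scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]
```

Each recovered weight differs from the original by at most half a quantization step (`scale / 2`), which is why networks often tolerate the compression well.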

Don’t expect implementing an AI-based data center management service to give you instant results; the technology is new, the services are still in development, and they will need a lot of training. You may well need more sensors than you already have for your DCIM software, Ascierto points out. “If you wanted to exploit AI for end-to-end chiller-to-rack decisions, then acoustic and vibration sensors would be required for some equipment, as well as environmental sensors and power meters. If the goal is to optimize and automate setpoint temperatures for cooling units, then multiple environmental sensors per rack (top, middle, bottom) may be required.”

The underlying data models may be there, but they will also have to be tuned for your specific equipment, your specific workload, and, most importantly, your site’s idiosyncrasies. Training an AI support system will take time, just like bringing a new human operator on board, but in time, machine learning tools similar to those already running in production in the cloud will help run your data center.
