How Amazon's New Machine Learning Service Can Help in Data Center Management
(Photo by Michael Bocchieri/Getty Images)

General-purpose machine-learning service can be used to predict hardware failures, temperature fluctuations, and more

SAN FRANCISCO – Last June Google said it was using a neural network, a machine-learning system, to help it manage its data centers more efficiently. The company claimed the system helped it squeeze more efficiency out of its infrastructure beyond what it could get just by following design and operation best practices.

But that’s a proprietary system Google designed for its own needs. Multiple big vendors, however, are building general-purpose services that put machine-learning capabilities adaptable to any situation, including data center management, into anybody’s hands.

Amazon Web Services rolled out one such service Thursday at its conference here. Called Amazon Machine Learning, it is a fully managed service hosted in Amazon data centers. Originally designed as a tool for the company’s own developers, the service aims to ingest as many different types of data as possible, build a custom model based on that data, and make predictions in real time.

“There’s no bounds on this thing,” Matt Wood, general manager for data science at AWS, said in an interview. “It’s designed to be a general-purpose system.”

In data center management, you can feed the system data from temperature and humidity sensors on the IT floor, server CPU utilization, server power draw, outside weather data, and any of the myriad other metrics that can be tracked in a data center, then build a model, train it, and use it to make predictions.
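
As a rough sketch of what that pipeline could look like in code, the snippet below uses the boto3 "machinelearning" client to register a CSV of historical telemetry stored in S3 and train a regression model on it. The bucket, file names, IDs, and column names are hypothetical, and this is only one possible way to drive the service, not a prescribed workflow.

```python
# Hypothetical sketch: wiring data center telemetry into Amazon Machine Learning
# with the boto3 "machinelearning" client. Bucket, IDs, and column names are made up.
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Register a CSV of historical telemetry (temperatures, humidity, CPU utilization,
# power draw, weather readings) as an Amazon ML data source. ComputeStatistics=True
# asks the service to run the summary statistics it uses to profile each column.
ml.create_data_source_from_s3(
    DataSourceId="dc-telemetry-ds-001",
    DataSourceName="DC telemetry, last 90 days",
    DataSpec={
        "DataLocationS3": "s3://example-dc-metrics/telemetry-90d.csv",
        "DataSchemaLocationS3": "s3://example-dc-metrics/telemetry.schema.json",
    },
    ComputeStatistics=True,
)

# Train a regression model that predicts next-day inlet temperature
# (the target attribute named in the schema) from the remaining columns.
ml.create_ml_model(
    MLModelId="dc-inlet-temp-model-001",
    MLModelName="Inlet temperature predictor",
    MLModelType="REGRESSION",
    TrainingDataSourceId="dc-telemetry-ds-001",
)
```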

“Can you predict when a particular component needs to be replaced?” Wood asked as an example of the type of questions a user can ask the system to answer. “Can you predict the very, very, infinitesimally small changes in performance that indicate when a hard drive needs to be swapped out, or when an entire power supply is going to fail?”

Not for Rocket Scientists

As far as the user is concerned, the interface is designed to be extremely simple to use through wizards and APIs. It’s for people with zero machine-learning experience.

The service starts by automatically pulling in a lot of the user’s data from its sources and running summary statistics over it to guess the best possible format for that data and identify the model’s features. “And you can dive into that and fine-tune it if you want, but it makes a pretty good, accurate first guess about how to handle the data,” Wood explained. “Then you can train the model.” Training the model means testing how it does and making adjustments as necessary to increase prediction accuracy.
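
To make that concrete, the sketch below shows the kind of data schema Amazon ML works with and lets you fine-tune, plus an evaluation run against a held-out data source to test how the model does. The attribute names and IDs are hypothetical; the schema keys follow the service’s JSON schema format.

```python
# Hypothetical sketch: a fine-tunable data schema and a model evaluation.
import json

import boto3

# Each column gets a type (NUMERIC, CATEGORICAL, BINARY, or TEXT) and one
# column is marked as the prediction target.
schema = {
    "version": "1.0",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "targetAttributeName": "inlet_temp_next_day",
    "attributes": [
        {"attributeName": "rack_id", "attributeType": "CATEGORICAL"},
        {"attributeName": "inlet_temp", "attributeType": "NUMERIC"},
        {"attributeName": "humidity", "attributeType": "NUMERIC"},
        {"attributeName": "cpu_utilization", "attributeType": "NUMERIC"},
        {"attributeName": "power_draw_kw", "attributeType": "NUMERIC"},
        {"attributeName": "outside_temp", "attributeType": "NUMERIC"},
        {"attributeName": "inlet_temp_next_day", "attributeType": "NUMERIC"},
    ],
}
print(json.dumps(schema, indent=2))  # stored in S3 alongside the CSV

# Evaluating the model against a held-out data source is the "testing how it
# does" step: once the evaluation completes, its performance metrics (e.g.
# RMSE for a regression model) tell you whether to adjust the data or features.
ml = boto3.client("machinelearning", region_name="us-east-1")
ml.create_evaluation(
    EvaluationId="dc-inlet-temp-eval-001",
    EvaluationName="Inlet temperature model evaluation",
    MLModelId="dc-inlet-temp-model-001",
    EvaluationDataSourceId="dc-telemetry-holdout-ds-001",
)
```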

Once the model is in place, the user can simply ask questions they need answered. In the context of data center management, the questions can be as basic as what the temperature in the data center is going to be tomorrow based on historical temperature data and weather data, or when a particular server is likely to fail.
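
In code, asking one of those questions boils down to a real-time prediction call against the trained model. The sketch below, again with hypothetical IDs and field values, creates a real-time endpoint and asks for a next-day temperature prediction for a single rack.

```python
# Hypothetical sketch: asking the trained model a question in real time.
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Create a real-time endpoint for the model; it can take a few minutes
# to become ready after this call.
endpoint = ml.create_realtime_endpoint(MLModelId="dc-inlet-temp-model-001")
endpoint_url = endpoint["RealtimeEndpointInfo"]["EndpointUrl"]

# "What will the inlet temperature at this rack be tomorrow?"
# Record values are passed as strings keyed by the schema's attribute names.
response = ml.predict(
    MLModelId="dc-inlet-temp-model-001",
    Record={
        "rack_id": "R42",
        "inlet_temp": "24.5",
        "humidity": "41.0",
        "cpu_utilization": "63.0",
        "power_draw_kw": "5.8",
        "outside_temp": "18.0",
    },
    PredictEndpoint=endpoint_url,
)
print(response["Prediction"]["predictedValue"])
```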

Amazon charges per hour for the time it takes its infrastructure to crunch through the user’s data and build the model. Then the user pays based on the number of predictions the system makes. The current cost is $1 per 1 million predictions, so running a query about 1,000 servers would mean 1,000 predictions.
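
At that per-prediction rate (and setting aside the separate hourly charge for model building), the prediction side of the bill stays small even at data center scale. A back-of-the-envelope calculation, assuming one prediction per server per hour:

```python
# Back-of-the-envelope prediction cost at the rate quoted above
# ($1 per 1 million predictions); model-building compute is billed separately.
servers = 1_000
predictions_per_day = servers * 24              # one prediction per server per hour
predictions_per_month = predictions_per_day * 30
cost = predictions_per_month / 1_000_000 * 1.00
print(predictions_per_month, f"${cost:.2f}")    # 720000 predictions, $0.72
```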

Machine-Learning Tools on the Rise

Other examples of big vendors rolling out general-purpose machine-learning services include Microsoft, which launched its machine-learning service on Azure into general availability in February, and IBM, which has been steadily adding features to its Watson cloud services.

There is also a handful of vendors that sell solutions using sophisticated modeling and predictive analytics built specifically for data center owners and operators. They include Romonet, Future Facilities, and Vigilent, among others.

Data center models created with general-purpose machine-learning services are unlikely to be as sophisticated as those generated by data center-specific tools, but services like Amazon’s and Microsoft’s have an advantage in elasticity and scale. They are pay-as-you-go services delivered from the providers’ global data center infrastructure.

Knowing how to build and operate a globally distributed computing system is a lot to bring to the table. “The truth is that we’ve done this before,” Wood said. “We did it with Redshift. You might think of it as a traditional data warehouse, but it’s really a compute-heavy, massively parallel processing system that has custom hardware under the hood. We’ve done it for Redshift, and we’ve done it for Elastic MapReduce, and we’ve done it for EC2 for so long, that we consider it a core competency of ours to be able to manage that capacity on behalf of customers.”
