A cooling plant inside Google's Hamina, Finland, data center (Credit: Alphabet/Google)

Google Is Switching to a Self-Driving Data Center Management System

• Google’s use of AI to optimize data center efficiency has entered a new phase
• Machine-learning algorithms are now adjusting cooling-plant settings automatically, in real time, on a continuous basis
• The system builds on the work Google revealed before – an AI-based recommendation engine
• The new system fine-tunes cooling autonomously

Most data center operators don’t think your typical tornado-watch period is the best time to start tweaking cooling-system settings for marginal energy savings. It’s usually time to batten down the hatches and hope the power stays on. Humans have their priorities.

But an artificial intelligence algorithm designed to look for every opportunity to shave off a kilowatt-hour will take that opportunity if it sees it, regardless of weather.

Well, maybe not entirely regardless of weather.

Under a recent tornado watch, the AI system managing the cooling plant at one of Google’s data centers in the Midwest changed the plant’s settings in a way that the facility’s human operators found counterintuitive. On closer scrutiny, however, the change turned out to be exactly what was needed to save energy under those specific circumstances.

Weather conditions that make a severe thunderstorm likely to form include a big drop in atmospheric pressure and dramatic temperature and humidity changes. Weather plays a big role in the way some of the more sophisticated data center cooling systems are tuned, and the software running Google’s cooling system recalibrated it to take advantage of the changes – no matter how small the advantage.

That wasn’t quite the same system as the one Joe Kava, Google’s VP of data centers, described in 2014, when he first revealed that the company was using AI to improve data center energy efficiency. That system, developed by Jim Gao, then a Google data center engineer, was implemented as a recommendation engine.

“We had a standalone model that we would run, and it would spit out recommendations, and then the engineers and the operators in the facility would go and change the setpoints on the chillers, and the heat exchangers, and the pumps, and all that to match what the AI system said,” Kava said in a recent interview with Data Center Knowledge. “That was manual.”

The use of AI to manage energy efficiency at Google data centers recently entered a new phase. The company is now aggressively rolling out what Kava referred to as a “tier-two automated control system.” Instead of simply making recommendations, this tier-two system makes all the cooling-plant tweaks on its own, continuously, in real time.

The first system, developed by Gao as a “20-percent project” and later with involvement from Google’s DeepMind AI team, could shave up to 40 percent off a facility cooling system’s total energy use. The new fully automated version is saving about 30 percent annually, and the company expects further improvements.

The automated control system builds on Gao and DeepMind’s original work. (Gao has since joined the DeepMind team, according to Kava.) It’s looking at the same input variables: outside air temperature, barometric pressure, wet-bulb temperature, dry-bulb temperature, dew point, power load in the data center, air pressure in the back of the servers where hot air comes out, and so on – 21 variables total.

“It crunches all that data, and, based on the weather conditions and the load in the data center, it optimizes PUE (Power Usage Effectiveness),” Kava said.

Lots of Minor Tweaks

The tornado-watch example is a good illustration of the way Google’s machine-learning algorithms for data center management can save energy beyond what human operators can do. The overall benefit is the sum of marginal savings from minor tweaks done continuously. “It’s making more fine-tuned adjustments than you would normally make as a human,” Kava said.

If, for example, outside temperature went from 72 degrees Fahrenheit in the morning to 76 degrees in the afternoon, with wet-bulb temperature staying about the same, a human operator wouldn’t go and change settings on the cooling plant to adjust for that minor temperature change. Even if they knew what changes to make to reduce energy use, “they would probably just say it won’t make that big of a difference,” Kava explained.

The system does especially well when the company launches new data centers (which it’s been doing a lot lately, as it expands the scale of its cloud services business). Typically, a newly launched data center runs at its least efficient, because it’s not utilizing most of the underlying infrastructure’s capacity.

Google may deploy some server clusters in a new building on day one. Regardless of how many rows are populated, however, the network fabric that stretches across the entire data center needs electricity. “You have to have power across all the rows, even though they’re not full,” Kava said. “Machine learning has really helped to get us much more efficient, even under those light-load conditions.”

Lightly loaded, a newly launched Google data center’s typical PUE is between 1.2 and 1.3, he said. With the cooling system controlled by AI, it can go down to 1.1 or 1.09. “And even though it doesn’t sound like much … it’s a tremendous amount of energy savings at our scale.”
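To see why a PUE change in the second decimal place matters, recall that PUE is total facility power divided by IT power, so everything above 1.0 is overhead (cooling, power distribution and so on). The Python sketch below works through that arithmetic. The 10 MW IT load and the assumption that the load stays constant all year are hypothetical, chosen only for illustration. They are not Google figures, and the function names are invented for this example.

```python
HOURS_PER_YEAR = 8760


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT power (cooling, distribution losses, etc.) implied by a PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * (pue - 1.0)


def annual_savings_mwh(it_load_kw: float, pue_before: float, pue_after: float) -> float:
    """Energy saved per year, assuming a constant IT load (an illustrative assumption)."""
    saved_kw = overhead_kw(it_load_kw, pue_before) - overhead_kw(it_load_kw, pue_after)
    return saved_kw * HOURS_PER_YEAR / 1000.0


# Hypothetical facility: 10 MW of IT load, PUE improving from 1.25 to 1.10.
it_load_kw = 10_000.0
print(round(overhead_kw(it_load_kw, 1.25)), "kW overhead before")
print(round(overhead_kw(it_load_kw, 1.10)), "kW overhead after")
print(round(annual_savings_mwh(it_load_kw, 1.25, 1.10)), "MWh saved per year")
```

Under these assumed inputs, overhead falls from about 2,500 kW to about 1,000 kW. That is a 60 percent cut in non-IT power, even though the PUE figure itself moved by only 0.15.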

Letting Go of the Steering Wheel… Slowly

Giving a machine-learning algorithm control of some of your most mission-critical infrastructure takes some working up to.

The more runtime you accumulate and the more data you collect, the better your machine-learning algorithm gets, and the more comfortable you get with the idea of giving it more control. “And you start to put in the guardrails to make sure that bad things can’t happen, and then you start to launch fully automated systems instead of semiautomated systems,” Kava said. “And if the fully automated systems actually start running better, then you start deploying more of those.”

The guardrails are important. “If you were to tell a machine to optimize for PUE, the machine might tell you to shut down all of the servers,” he half-joked. (PUE is the ratio between total power consumed by a data center and the amount of power consumed just by the IT equipment inside.)
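The idea of a guardrail can be sketched in a few lines of Python. This is not Google’s control system. The setpoint names, the limits and the helper names (`Guardrail`, `apply_guardrails`) are all hypothetical, made up to illustrate the general pattern of clamping an optimizer’s output to operator-approved ranges before anything reaches the plant.

```python
from dataclasses import dataclass


def pue(total_facility_kw: float, it_kw: float) -> float:
    """PUE = total facility power / IT equipment power."""
    if it_kw <= 0:
        # With no IT load the ratio is undefined, so reject the input.
        raise ValueError("IT load must be positive")
    if total_facility_kw < it_kw:
        raise ValueError("total facility power cannot be less than IT power")
    return total_facility_kw / it_kw


@dataclass(frozen=True)
class Guardrail:
    """Operator-approved range for one setpoint (hypothetical)."""
    low: float
    high: float

    def clamp(self, value: float) -> float:
        return min(max(value, self.low), self.high)


# Hypothetical setpoints and limits, for illustration only.
GUARDRAILS = {
    "chilled_water_supply_c": Guardrail(low=7.0, high=18.0),
    "cooling_tower_fan_pct": Guardrail(low=20.0, high=100.0),
}


def apply_guardrails(recommended: dict[str, float]) -> dict[str, float]:
    """Clamp each recommended setpoint to its approved range.

    Setpoints without a guardrail are refused rather than passed through.
    """
    safe = {}
    for name, value in recommended.items():
        if name not in GUARDRAILS:
            raise KeyError(f"no guardrail defined for setpoint {name!r}")
        safe[name] = GUARDRAILS[name].clamp(value)
    return safe
```

For example, `apply_guardrails({"chilled_water_supply_c": 25.0})` returns the value clamped to 18.0. The optimizer can keep pursuing a lower PUE, but only inside limits that human operators have set in advance.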

Even with the extreme uniformity Google applies to building out its infrastructure – uniformity is the only way to operate at such scale – each of its data centers is different enough from the others that the AI-based automated control system cannot be rolled out across them all at once.

Each site’s cooling system is architected in an optimal way for its specific location, and Google’s data center engineers constantly look for new ways to reduce energy use, so at least some changes to the designs are made about once every 18 months.

That means a machine-learning model has to be trained for each site. “You have to train your model for that specific architecture,” Kava said. “So, it takes time, but we definitely believe in it, and we continue to see the benefit, and we’re being as aggressive as possible in that regard.”

Not Just for Google and Not Just for Data Centers

Jim Gao is at DeepMind now, and one of the organization’s many projects is continuing the work he started while on Google’s data center engineering team. The work’s scope now stretches well beyond data centers.

Much of the model Gao developed applies to “any type of an industrial plant that has cooling and heat loads,” Kava said. It could be a chemical plant, for instance, or a petroleum refinery. The idea is that eventually, the model could serve as the basis for a solution Google provides to industrial clients, so they too could use AI to make their plants more efficient.

What About the Jobs?

With more and more of the company’s data centers shifting to automated infrastructure control, and with the real possibility that the same will eventually start happening outside of Google, the inevitable question of jobs arises. Are Google’s data center engineers engineering themselves and their colleagues out of work?

So far, Kava hasn’t seen evidence of that happening.

“We still have people there, because they still have to do all the maintenance,” he said. “So, you’re not getting rid of the people, you’re augmenting” the existing team’s capabilities. “Instead of trying to tune the system themselves, they can focus more of their time on preventative maintenance and corrective repairs.”

Besides, AI still does poorly in situations “outside of the envelope of its training,” he said. People are very good at making observations in what Kava likes to call “corner cases” and coming up with a course of action on the spot. AI isn’t.

In other words, it’s a good idea to have AI fine-tune a cooling system to improve efficiency in pre-tornado conditions, but you’d better have some human engineers around in case a tornado forms.

Correction: August 13, 2018
A previous version of the article said the new, fully automated version of the machine learning-based cooling management system saved an additional 15 percent of cooling energy use, on top of the savings achieved with the first version. A Google spokesperson said the fully automated version saves 30 percent of energy annually, with more savings expected in the future.