Experts Talk Liquid Cooling Strategies to Manage AI Heat Wave
Powerful new chips and AI applications are generating so much heat that air cooling won't cut it anymore. Experts at DCW 2024 discussed liquid cooling implementation strategies and challenges.
April 26, 2024
Scientists predict that global temperatures are likely to rise steadily over the next couple of decades. Data center managers won’t have to wait that long. A heat wave is heading their way courtesy of the latest GPUs, CPUs and AI applications.
“As AI requirements grow, data center operators must adapt their infrastructure to accommodate high-power-density server clusters,” emphasized Bill Kleyman, author of AFCOM’s State of the Data Center Report.
Fortunately, Data Center World 2024 brought together some of the brightest minds in chip making and liquid cooling to address exactly how much heat to expect, how next-gen chips and AI are driving disruption in data center infrastructure, and how deploying new liquid cooling solutions in tandem with the right power strategies can bring respite from the intense heat.
Greg Stover, global director of high-tech development at Vertiv, served as emcee of a panel featuring speakers from Intel, Nvidia and Vertiv.
“Disruption is here,” he said. “We can’t beat the heat with air alone. The majority of data centers will go through a transition from 100% air cooling to an air/liquid cooling hybrid in the next few years.”
Mohammad Tradat, Ph.D., manager of data center mechanical engineering at Nvidia, showed a graph projecting the growth of thermal design power (TDP) for microchips. The number of watts per processor is in the early stages of a surge from a couple of hundred to more than 1,000 watts. He pointed to a new chip from his company that pushes a single rack to 138 kW. Such a rack density won’t stay cool with air alone.
“TDP has been spiking since 2020,” said Tradat. “We need to rethink the cooling roadmap by incorporating liquid.”
He considers single-phase technologies limited. Two-phase refrigerants, on the other hand, can handle 200 kW per rack or more, he added.
“The transition from single-phase to two-phase liquid cooling will happen much sooner than air to single-phase liquid cooling,” said Tradat.
Retrofitting Existing Data Centers to Handle the Heat
Data center designers are in a position to plan new builds that start operations with a complete liquid cooling infrastructure. Most existing data centers don’t have that luxury. Tradat recommended that operators introduce whatever liquid they can based on the limitations of existing designs and space.
This might entail introducing liquid-to-air (L2A) coolant distribution units (CDUs), which bring the benefits of liquid cooling without the need for full-scale implementation of facility water. CDUs provide localized liquid cooling where it is needed most and leverage existing air-cooling systems to dissipate heat from the rack or row.
“This technology can be deployed rapidly with minimal disruption in most data centers,” said Tradat. “But once rack density rises, data center managers need to start thinking about liquid-to-liquid CDUs.”
A 4U CDU, he added, can provide 100 kW of cooling. But the liquid cooling industry needs standards for refrigerants and two-phase technologies if they are to enter the mainstream smoothly.
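For a rough sense of what those figures imply, here is a simple sizing sketch. The 100 kW CDU capacity and the rack densities below come from numbers cited at the show; the N+1 redundancy policy and the lack of any capacity derating are hypothetical assumptions for illustration only, not vendor guidance.

```python
import math

# Illustrative sizing sketch only. The 100 kW-per-4U-CDU figure and the rack
# densities come from the DCW panel quotes; the N+1 redundancy policy and the
# absence of derating are hypothetical assumptions.
cdu_capacity_kw = 100                    # cooling capacity quoted for one 4U CDU
rack_densities_kw = [70, 138, 200, 300]  # rack densities discussed at the show

for rack_kw in rack_densities_kw:
    # Assume one spare CDU beyond the minimum required (N+1).
    cdus_needed = math.ceil(rack_kw / cdu_capacity_kw) + 1
    print(f"{rack_kw:>4} kW rack -> {cdus_needed} CDUs (N+1, no derating)")
```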
Which Liquid Cooling Approaches Should You Invest In?
Dev Kulkarni, Ph.D., senior principal engineer and thermal architect at Intel, laid out the four major approaches to liquid cooling, along with his quick take on each:
Single-phase direct-to-chip cooling – the most mature liquid technology with an abundance of vendor options
Two-phase direct-to-chip cooling – more cooling potential but with fewer vendors and less maturity
Single-phase immersion cooling – material compatibility issues have yet to be overcome, but many vendors are working on this
Two-phase immersion cooling – serious fluid, corrosion and safety concerns remain
“You have to implement these different cooling solutions based on what you are trying to do,” said Kulkarni. “But it is important to think two or three generations ahead. If you go all out on single-phase only, you might find you need to switch some infrastructure to two-phase technologies within a short period.”
His advice was to pay attention to silicon and AI hardware roadmaps and align your company’s and your customers’ needs with them. At the same time, pay attention to environmental, social and governance (ESG) goals and how you can scale your deployments rapidly.
But don’t wait to deploy AI, he added. He suggested finding a way to introduce it right away while figuring out a larger-scale deployment. Finally, he said to find partners that can work with you on AI, cooling, scalability and sustainability.
One Second Away From Disaster
Steve Madara, vice president of thermal and data centers at Vertiv, briefed attendees on some of the realities of liquid cooling technologies.
“If direct to chip fluid stops flowing for more than one second, a high-powered server goes down,” he said. “Reliability needs to be ultra-mission critical.”
He recommended that cooling loops going to the chip be put on an uninterruptible power supply (UPS) system so they never lose power – even if the grid goes down. Madara gave an example: If power is lost and the data center takes 15 seconds to transfer to generator power, it might take a minute for the chiller to start running again and provide the desired level of cooling. In the interim, the water temperature feeding the latest generation of servers could surge by up to 20°F.
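A back-of-the-envelope energy balance shows why the surge is that large. The 138 kW rack figure, the 15-second generator transfer and the roughly one-minute chiller restart come from the speakers; the loop water volume is a hypothetical assumption, so treat the result as an order-of-magnitude sketch rather than a design figure.

```python
# Rough estimate of coolant temperature rise while chilling is interrupted.
# The 138 kW rack load, 15 s generator transfer and ~60 s chiller restart come
# from the article; the 200-liter loop volume is a hypothetical assumption.

rack_heat_w = 138_000      # heat load rejected into the loop (W)
loop_volume_l = 200        # assumed water volume in the cooling loop (liters)
gap_s = 15 + 60            # transfer to generator + chiller restart (seconds)

water_density = 1.0        # kg per liter
water_cp = 4186            # specific heat of water, J/(kg*K)

mass_kg = loop_volume_l * water_density
energy_j = rack_heat_w * gap_s                  # heat dumped into the loop with no chilling
delta_t_c = energy_j / (mass_kg * water_cp)     # resulting temperature rise
delta_t_f = delta_t_c * 9 / 5

print(f"Temperature rise over {gap_s} s: {delta_t_c:.1f} C ({delta_t_f:.1f} F)")
# Roughly 12 C (about 22 F) under these assumptions -- the same order as the
# ~20 F surge Madara described.
```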
“There is a whole reliability play evolving in the liquid cooling arena,” said Madara.
He put forward L2A CDUs as the simplest liquid cooling technology to deploy. Those, he said, can go into legacy data centers right now.
Forecast: More Heat and More Liquid
The data center weather forecast for some time to come, then, is a lot more heat and far denser racks. That means more liquid cooling, too.
“Most of our inquiries these days are for liquid-to-air for legacy sites,” said Stover. “But getting the heat out of the chip is one side. You still need to get the heat out of the building.”
That requires a coordinated thrust to add new cooling technologies, squeeze more efficiency out of existing cooling and power solutions, and achieve a higher level of sustainability.
“Data center providers need to facilitate density ranges beyond the normal 10–20 kW/rack to 70 kW/rack and 200–300 kW/rack,” said Courtney Munroe, an analyst at International Data Corp. “This will necessitate innovative cooling, heat dissipation, and the use of sustainable and renewable power sources.”