
The Futility of Extending Chilled-Air Cooling

Chilled-air systems have been a loyal companion for most data-center operators for years, but the inefficiencies of those systems are now becoming exposed.

A pair of recent announcements from two of the world’s biggest companies sent the clearest signal so far that the age of chilled-air systems is coming to an end.

According to a report in The Information, Amin Vahdat, Google's general manager of machine learning, systems and cloud AI, told the Hot Chips conference that the company has had to make major changes to its data centers to accommodate its growing fleet of AI chips, chiefly by switching from air cooling to liquid cooling, a move that brought noticeable improvements in performance and reliability.

And Jensen Huang, CEO of NVIDIA, perhaps the poster child of the generative AI era, predicted that $1 trillion will be spent over four years upgrading data centers for AI, including on the growing need to keep those AI chips cool.

Chilled-air systems have been a loyal companion for most data-center operators for years, but the inefficiencies of those systems are becoming exposed as their electricity consumption and water usage continue to skyrocket. To continue down that path is to use a blunt instrument on an increasingly complex problem, consigning companies to a life of mounting operational pain.

A look at the numbers reveals just how blunt chilled-air systems are. The primary heat generator in any data center is the CPU within each server. A typical 10,000-square-foot data center might contain 400 server racks, and at roughly 40 CPUs per rack, that means some 16,000 CPUs in need of cooling.

But each CPU is only about 1,600 square millimeters, so the total heat-producing area across all 16,000 of them is roughly 275 square feet. Using air cooling, data centers employ AC systems designed for the entire building, cooling the full 10,000-square-foot space rather than the 275 square feet that actually generate the heat. The cooled area is roughly 36 times larger than it needs to be.
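That back-of-envelope math is easy to reproduce. The sketch below assumes roughly 40 CPUs per rack (for example, 20 dual-socket servers per rack), a figure implied by the round numbers above rather than taken from any particular facility:

# Rough sanity check of the figures above. Assumed inputs: 400 racks,
# ~40 CPUs per rack, ~1,600 mm^2 of silicon per CPU, 10,000 sq ft of floor.
racks = 400
cpus_per_rack = 40                      # e.g., 20 dual-socket servers per rack
cpus = racks * cpus_per_rack            # 16,000 CPUs

cpu_area_mm2 = 1_600                    # heat-producing area per CPU
total_cpu_area_m2 = cpus * cpu_area_mm2 / 1_000_000   # mm^2 -> m^2, ~25.6 m^2
total_cpu_area_sqft = total_cpu_area_m2 * 10.7639     # m^2 -> sq ft, ~275 sq ft

floor_area_sqft = 10_000
print(f"Heat-producing area: ~{total_cpu_area_sqft:.0f} sq ft")
print(f"Cooled area vs. heat source: ~{floor_area_sqft / total_cpu_area_sqft:.0f}x")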

The Moving Target of Operational Efficiency

Given the blunt nature of chilled-air systems, data center operators need to put in considerable design and operational effort to wring out reasonable efficiency. But that turns out to be an ongoing project, because server and rack distributions change over time, whether it's chips getting hotter with successive server models or pressure to manage budgets, space or ESG initiatives.

One of the trickiest problems occurs because the heat isn’t evenly distributed at any level — at the floor space level, or within a rack, or even within a particular server. Because the thermal distribution is uneven, chilled air coming from the raised floor gets warmed by the lower servers and may be too warm to effectively cool the upper servers in the rack. And because most data centers don’t do thermal imaging of their racks, the only symptoms might be that the upper servers are running throttled or have shorter life spans than the lower ones.

It would be tempting to compensate by increasing the flow of the chilled air or by making the air colder, but both are expensive options. And an unintended consequence of using increasingly colder air is that the bottom servers may actually be so cold that there is a danger of condensation, which has the potential to cause electrical shorts. It’s a no-win situation.

Managing Higher Rack Thermal Densities

Over the last decade, average rack thermal densities have quadrupled as server CPUs have become hotter (roughly doubling in thermal power from 100 W to 200 W). The data center HVAC industry has responded with an increasing number of products designed to deliver some combination of more air, colder air, or targeted air.
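To see how quickly those factors compound, consider a purely illustrative calculation: if CPU thermal power doubles and racks are also packed with roughly twice as many servers (an assumed figure, used only to show the multiplication), the CPU heat load per rack quadruples.

# Illustrative only: hotter CPUs (100 W -> 200 W) combined with denser racks
# (assumed 10 -> 20 dual-socket servers per rack) yield a ~4x rise in rack heat load.
sockets_per_server = 2

cpu_watts_then, servers_per_rack_then = 100, 10
cpu_watts_now, servers_per_rack_now = 200, 20

rack_cpu_kw_then = servers_per_rack_then * sockets_per_server * cpu_watts_then / 1000  # 2.0 kW
rack_cpu_kw_now = servers_per_rack_now * sockets_per_server * cpu_watts_now / 1000     # 8.0 kW

print(rack_cpu_kw_then, rack_cpu_kw_now, rack_cpu_kw_now / rack_cpu_kw_then)  # 2.0 8.0 4.0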

Most of these products, from bigger, beefier AC systems to air handlers and cooling towers, require sophisticated data-center-level thermal CFD modeling to get right, and they carry their own costs and complexities. For example, implementing hot-aisle and cold-aisle containment can interfere with fire suppression systems.

It may seem possible, on paper, for chilled-air systems to manage higher rack thermal densities, but the reality proves otherwise. As rack thermal densities accelerate into the 50 kW range over the next few years, using air will simply become untenable.

One Solution: Liquid Cooling

It’s clear that at some point, kicking the can down the road with chilled-air cooling leads to diminishing returns, exponentially greater costs, and a dead end. There is some hope ahead, however.

Liquid cooling is the future for data centers, enabling greater server density per rack and higher compute performance while improving sustainability through reduced electricity use, water use and cost. One particular approach, direct-to-chip liquid cooling, is the most targeted and efficient way to cool the hot chips in the data center. Direct-to-chip solves the blunt-instrument problem by cooling only the roughly 275 square feet of CPUs rather than a 10,000-square-foot building, and it provides operators with a consistent, low-risk way of maintaining operational efficiency and managing rack thermal densities.

At a time when energy use in data centers is soaring and on track to only increase, operators must say goodbye to the traditional chilled-air approach and look for a more efficient, reliable method. Google and NVIDIA have seen the future, and that future is liquid cooling.
