Demand for artificial intelligence, the vast amounts of data needed to build AI services, and the growing volume of data generated by other sources are making sustainable, scalable data storage solutions ever more urgent. But expanding data center capacity to meet this need also drives up energy consumption, and that rising demand is testing legacy thermal technologies, often to their limits.
Data centers are complex systems in which multiple technologies and pieces of hardware interact to maintain safe and continuous operation of servers. With so many systems requiring power, the electrical energy used generates thermal energy. As the center operates, this heat builds and, unless removed, can cause equipment failures, system shutdowns, and physical damage to components.
Much of this increased heat can be attributed to CPUs and GPUs. Each new generation of processor seems to offer greater speed, functionality, and storage, and chips are being asked to carry more of the load.
An increasingly urgent challenge is to find a new approach to cooling data centers that reaches beyond legacy thermal technologies — one that is both energy-efficient and scalable — with the ultimate goal of enabling greater data storage in an energy-efficient context.
One organization stepping up to the challenge is the U.S. Department of Energy, which recently launched the Advanced Research Projects Agency-Energy (ARPA-E) Cooling Operations Optimized for Leaps in Energy, Reliability, and Carbon Hyperefficiency for Information Processing Systems — COOLERCHIPS — initiative, awarding $40 million in grants to 15 enterprise and academic projects aimed at improving data center cooling technology. These projects represent thought leadership that is reinventing the way we think about data, energy, and the environment.
Each of the technologies developed is expected at minimum to meet Tier III reliability levels of 99.982% uptime. The grants will support research, groundbreaking prototypes, and scalable solutions geared to reshaping the landscape of data centers so they meet a sustainable standard.
One recipient of a COOLERCHIPS grant is the University of Florida at Gainesville, which is using its funding to develop a solution for cooling CPUs and GPUs.
Why CPUs and GPUs Are Heating Up
Before delving deeper into the University of Florida's COOLERCHIPS project, it's important to understand why CPUs and GPUs are heating up.
Effective operation of any processor depends on temperatures remaining within designated thresholds. The more power a CPU or GPU uses, the hotter it becomes.
When a component approaches its maximum temperature, a device may attempt to cool the processor by lowering its frequency, or throttling it. While effective in the short term, repeated throttling has downsides, such as shortening the life of the component. Ideally, CPUs and GPUs would draw less power in the first place and thus run cooler.
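The throttling behavior described above can be sketched as a simple control rule. This is an illustrative sketch only; the thresholds and frequency ladder are hypothetical, and real firmware is far more nuanced:

```python
# Illustrative sketch of thermal throttling logic (hypothetical values).
# When the die temperature nears its limit, the clock steps down;
# once it cools off, the clock is restored.

MAX_TEMP_C = 95.0       # assumed maximum junction temperature
THROTTLE_MARGIN_C = 5.0
FREQ_STEPS_GHZ = [3.5, 3.0, 2.5, 2.0]  # assumed frequency ladder

def next_frequency(temp_c: float, current_ghz: float) -> float:
    """Pick the next clock frequency based on die temperature."""
    idx = FREQ_STEPS_GHZ.index(current_ghz)
    if temp_c >= MAX_TEMP_C - THROTTLE_MARGIN_C and idx < len(FREQ_STEPS_GHZ) - 1:
        return FREQ_STEPS_GHZ[idx + 1]   # too hot: step down
    if temp_c < MAX_TEMP_C - 2 * THROTTLE_MARGIN_C and idx > 0:
        return FREQ_STEPS_GHZ[idx - 1]   # cooled off: step back up
    return current_ghz

print(next_frequency(92.0, 3.5))  # hot: steps down to 3.0
print(next_frequency(70.0, 2.5))  # cool: steps back up to 3.0
```

The short-term/long-term trade-off in the text falls out of this loop: stepping down protects the silicon immediately, but a processor that keeps landing in the "too hot" branch spends its life cycling between states.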
Sometimes a game-changing technology forces us to re-evaluate our legacy systems. The growth and sophistication of AI have spurred chip designers to create larger and more powerful chips to manage the demands of the large language model training runs required by AI developers.
For example, Nvidia's A100 and AMD's Instinct MI100 represent a new generation of "monster chips." The Nvidia A100 contains 54 billion transistors on an 826 mm² die and delivers roughly 20 times the AI performance of Nvidia's previous-generation Volta chip; an eight-GPU DGX A100 system reaches 5 petaflops. The AMD Instinct MI100 packs 25.6 billion transistors onto a 750 mm² die, and its architecture is the first GPU to break the 10 TFLOPS barrier, offering up to 11.5 TFLOPS of peak FP64 throughput. Cooling these chips presents new thermal challenges for legacy cooling technologies.
Today's computer chips use fin field-effect transistors (FinFETs). The internal resistance of a FinFET power stage is low, at approximately 12 milliohms, but drive 80 amps of current through it and the dissipated power climbs toward 90 watts. Multiply this by the number of such devices housed in the CPUs and GPUs within a data center, and the thermal management challenge becomes clear.
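As a rough check on those figures, the simple conduction-loss formula P = I²R applied to the numbers above gives a value in the same ballpark (a higher quoted figure would fold in switching and other losses, which this sketch ignores):

```python
# Back-of-envelope power dissipation in a resistive element: P = I^2 * R
resistance_ohms = 0.012   # ~12 milliohms, per the figure above
current_amps = 80.0       # load current quoted above

power_watts = current_amps ** 2 * resistance_ohms
print(f"Dissipated power: {power_watts:.1f} W")  # ~76.8 W
```

Even this simplified estimate lands near 77 W for a single power stage, which is why multiplying across a whole data center's worth of silicon produces such a formidable heat load.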
Legacy Cooling Technologies: Two Approaches
Traditionally, there have been two types of cooling technologies employed in data centers — and neither has focused on improving the chips themselves but rather on managing the environment in which they operate. The first approach is localized within the server infrastructure and works by moving heat away from crowded components to a place where it can be dissipated safely. The second type of cooling technology is located below the center's floor and serves to maintain the ambient temperature, using air circulation and convection to reduce heat stress on all the equipment within the facility.
Cooling Today: The Big Four
Cooling technologies depend largely on four elements: conduction, convection, layout, and automation. Each successive advance has represented a step forward in efficiency, but with today's increasing data demands, these four elements can often be found working in concert within a single facility. Let's look at each in greater detail.
Used to keep the earliest servers from overheating, conduction relies on direct surface contact to move heat from hot spots to cooler areas where it can be safely dissipated. Heat spreaders allowed thermal energy to be moved away from sensitive components, but the technology's capabilities were limited. Spreaders were quickly replaced by heat sinks, which remain an industry standard.
Typically, a heat sink is mounted directly to the heat-producing surface by means of a face plate, and designs have evolved to maximize surface area and boost efficiency. To simplify manufacturing, face plates are generally made from die-cast aluminum; adding a copper center to the base plate increases conductive efficiency, as copper conducts heat roughly 70% better than aluminum.
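The copper-versus-aluminum trade-off can be quantified with Fourier's law for one-dimensional conduction, Q = kAΔT/L. The conductivity values below are standard handbook figures, and the plate geometry is purely illustrative, not taken from any particular heat sink:

```python
# Steady 1-D conduction through a flat heat-sink base plate: Q = k * A * dT / L
# Conductivities are handbook values (W/m-K); geometry is illustrative.

def conduction_watts(k_w_per_mk: float, area_m2: float,
                     delta_t_c: float, thickness_m: float) -> float:
    """Fourier's law for steady one-dimensional conduction through a plate."""
    return k_w_per_mk * area_m2 * delta_t_c / thickness_m

AREA = 0.0016      # 40 mm x 40 mm contact patch (assumed)
DT = 10.0          # 10 C drop across the plate (assumed)
THICKNESS = 0.005  # 5 mm plate (assumed)

copper = conduction_watts(401.0, AREA, DT, THICKNESS)
aluminum = conduction_watts(237.0, AREA, DT, THICKNESS)
print(f"Copper:   {copper:.0f} W")
print(f"Aluminum: {aluminum:.0f} W")
print(f"Ratio:    {copper / aluminum:.2f}x")
```

For the same geometry and temperature drop, the copper path moves about 1.7 times the heat, which is why hybrid copper-core aluminum plates are such a common compromise between conductivity and manufacturability.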
When conduction methods could no longer support the increasing power demands, a second heat-removal method was required. Built into the architecture of the data center, convection methods are more efficient than conduction. Using the directed flow of air or liquid to provide the desired cooling, convection systems are able companions to the conduction systems already in place. Conduction features such as heat sinks gather heat from electrical components, while convection moves that heat away from the servers.
Advances in convection technology have led to changes in fan design, including innovations in fan depth, blade architecture, and building materials to better control airflow and maximize cooling capacity. Variable-flow fans have also proven successful in adjusting airflow during heat surges from increased demand.
Heat pipes, another feature of convection systems, have also received upgrades to enhance efficiency. Common heat pipes feature a copper enclosure, sintered copper wick, and a cooling fluid. Incorporated within the heat sink base, these pipes directly contact the CPU, directing heat toward the exhaust fins in the heat sink.
Placement matters, especially when you're dealing with heat-generating electrical components. As power demand exceeded the abilities of both conduction and convection technologies, engineers were tasked with a new challenge — to use the layout of the center itself to facilitate cooling.
Successful approaches include removal of obstructions to airflow, design adaptations to enhance and control airflow, and the use of symmetrical configurations to balance airflow within the facility.
More recently, automation has entered the frame, allowing a finer level of temperature control and introducing the possibility of fully autonomous data centers, where temperatures are continually self-monitored and regulated. Automation also allows servers to rest components that are less in demand, and to use that energy to power components in higher-demand areas of the facility. Automated systems make use of heat sensors and cooling fans to direct and control airflow where and when it's needed. Power capping technologies have also allowed for less energy waste without compromising performance.
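The sensor-driven control described above can be sketched as a proportional fan controller paired with a power cap. All setpoints and gains here are illustrative assumptions, not figures from any real facility:

```python
# Sketch of sensor-driven cooling automation (all setpoints illustrative).

SETPOINT_C = 24.0               # target inlet temperature
FAN_MIN, FAN_MAX = 20.0, 100.0  # fan duty-cycle bounds, percent
GAIN = 8.0                      # % duty per degree above setpoint

def fan_duty(inlet_temp_c: float) -> float:
    """Proportional fan control: ramp airflow as inlet temperature rises."""
    duty = FAN_MIN + GAIN * max(0.0, inlet_temp_c - SETPOINT_C)
    return min(FAN_MAX, duty)

def capped_power(requested_w: float, cap_w: float) -> float:
    """Power capping: never let a server exceed its power budget."""
    return min(requested_w, cap_w)

print(fan_duty(23.0))              # at/below setpoint -> minimum duty, 20.0
print(fan_duty(28.0))              # 4 C over -> 20 + 8*4 = 52.0
print(capped_power(450.0, 400.0))  # request capped to 400.0
```

Real building-management systems layer many such loops (per-rack, per-aisle, facility-wide) and coordinate them, but the principle is the same: spend fan and compressor energy only where the sensors say it is needed.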
Limitations of Legacy Cooling Technology
While each of the aforementioned strategies generated increased cooling capacity, each too had limits. Use of these technologies in concert led to changes in the way data centers were conceived, designed, and built. Features such as raised floors, hot and cold aisles, and containment became common.
Prior to raised floors, the computer room air conditioning (CRAC) system simply blasted large volumes of chilled air into the space. Air distribution, however, was less than ideal. To address this challenge, raised floors were introduced to provide sub-floor cooling. Solid tiles were swapped out for perforated replacements, further improving air exchange and supporting more even cooling.
Server configurations also changed to a hot aisle/cold aisle system, in which server rows are arranged in parallel so that cold-air intakes face each other across a shared cold aisle and hot-air exhausts face each other across a shared hot aisle. Such configurations promote airflow and increase cooling efficiency. But even this, combined with raised floors, remained insufficient to meet demand.
A new approach emerged: containment cooling, sequestering hot air from cold and creating a system to strictly manage airflow streams. Containment successfully improved cooling efficiency, reduced cooling costs, and offered data center designers greater flexibility and more layout options. Containment, when used with these other systems, remains the efficiency standard in data center cooling.
But all that is about to change.
Looking Ahead: COOLERCHIPS' University of Florida Project
According to the ASHRAE Equipment Thermal Guidelines for Data Processing Environments, temperatures within data facilities should stay between 18°C and 27°C (the recommended envelope), with wider allowable limits depending on equipment class. Staying within these parameters is challenging, especially as demand grows for large language model training. Meeting the cooling demands of today's hotter chips while remaining sensitive to the global environment remains both a technological and an environmental hurdle.
What is being done to address the challenge? The University of Florida at Gainesville is using its $3.2 million grant from the COOLERCHIPS program to develop a disruptive thermal management solution for cooling future CPU and GPU chips at unprecedented heat flux and power levels in data center server racks.
The new technology allows for significant future growth in processor power, rejects heat directly to the ambient air external to the data center, and would facilitate adoption within existing data center infrastructure with a primary liquid cooling loop.
The challenge, according to Saeed Moghaddam, William Powers Professor of Mechanical and Aerospace Engineering at the University of Florida and the project lead of the university's COOLERCHIPS program, is that significant energy is used to cool data center servers, accounting for up to 40% of the IT power consumption. "This energy is mostly used in air handling units, chillers, pumps, and cooling towers that are all elements of a typical cooling system," Moghaddam told Data Center Knowledge.
The energy used, he explained, produces a flow of cool air running through the racks to cool the servers. "Each rack at UF HiPerGator [supercomputer] dissipates ~40kW. So the heat intensity is very high. Hence, new technologies are needed to reduce energy associated with cooling the data centers."
Moghaddam is keenly aware of the impact that data centers have on the environment. "Data centers, being pivotal hubs of digital infrastructure, have a critical role to play in reducing greenhouse gas emissions and promoting energy efficiency," he said.
So what makes the University of Florida initiative a game-changer? "Because chipsets' temperature is ~80°C and, in principle, their heat can be released to the ambient environment at temperatures as high as 40°C to completely eliminate the need for power-hungry chillers and their associated systems," Moghaddam said. "But, because heat goes through so many interfaces to reach the ambient air, a temperature difference of higher than 80-40°C is required."
In the University of Florida's model, the chips are cooled directly using a heat sink in which liquid is boiled. The boiled liquid is then pumped outside at 70°C and can be cooled using an air-cooled rooftop heat exchanger with a fan that uses less than 2.5% of the IT load.
"When we add all power consumption associated with our cooling system, we come down to 4% of IT load compared to the current 40% of the IT load," Moghaddam said. "Our technology will greatly reduce the CAPEX and OPEX."
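Using the figures quoted in this article (a ~40 kW rack, cooling overhead falling from 40% to 4% of IT load), the potential per-rack savings reduce to simple arithmetic. This is a back-of-envelope sketch, assuming continuous operation:

```python
# Back-of-envelope cooling-energy comparison using the article's figures.
rack_it_load_kw = 40.0   # per-rack dissipation quoted for UF HiPerGator

legacy_cooling_kw = rack_it_load_kw * 0.40   # ~40% of IT load (today)
new_cooling_kw = rack_it_load_kw * 0.04      # ~4% of IT load (UF target)

savings_kw = legacy_cooling_kw - new_cooling_kw
annual_savings_kwh = savings_kw * 24 * 365   # assumes 24/7 operation

print(f"Legacy cooling: {legacy_cooling_kw:.1f} kW per rack")
print(f"New cooling:    {new_cooling_kw:.1f} kW per rack")
print(f"Saved per year: {annual_savings_kwh:,.0f} kWh per rack")
```

On these assumptions, each rack sheds roughly 14 kW of cooling overhead, on the order of 126,000 kWh per rack per year, before even counting the eliminated chiller capital costs.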
We spoke with Panoply Group consultant and former Informa editor Lisa Sparks. She finds the COOLERCHIPS initiative intriguing, particularly because grant recipients include both enterprise and academic entities. She cautions that an open conversation between cooling system researchers and chip manufacturers is essential in determining which technologies are going to be most compatible.
The global chip market is changing, with China vastly increasing its purchase of high-power chips to more quickly advance its AI initiatives, she said. With the Biden administration considering limiting tech sales to China, it will be interesting to see how this all plays out.
Some important questions must be tackled, according to Sparks. Purchasers of any new technology are likely to ask whether the system can be retrofitted to enterprise, colocation, and hybrid data centers; how the new chips will affect the decommissioning of old servers; and whether hybrid processing power is needed to achieve these aims.
In a data-hungry world, legacy cooling technologies are stretched to the limit. The DOE's COOLERCHIPS projects represent a step toward both energy efficiency and environmental sustainability in the world of data storage.
The University of Florida at Gainesville is using its grant to develop an efficient system that directly cools chips in a heat sink using a liquid coolant, then moves that heat to where it can be safely dispersed. The system can be both integrated into existing data centers and incorporated within future designs.
The hope is that through advanced technologies like those being developed through the COOLERCHIPS program, we will be able to meet growing global data demand in an environmentally sustainable way.