Most people slow down as the weather gets hotter. Purdue University has applied that logic to its data center, developing software that can slow server activity as temperature rises. This technique recently allowed Purdue’s data center to continue operating during several cooling outages, throttling down server activity to prevent the room from overheating.
Patrick Finnegan, a Unix systems administrator at Purdue, came up with the idea of using the built-in capabilities of server hardware and the Linux operating system to slow the machines in an emergency. Finnegan developed a program to put large clusters of servers into a power-saving mode in which they draw less power and generate less heat.
The two supercomputers in the Purdue data center run around the clock, meaning there was no way to test the program until an actual cooling emergency occurred.
Worked Through Two Cooling Events
The system has been used twice this summer, in June and again in July, and has kept nearly 1,900 processors online throughout both thermal events. Purdue says it believes this is the first time this strategy has been used to manage a data center cooling failure.
“This is the first I’ve heard of it being used at this scale,” said Mike Shuey, high-performance computing systems manager at Purdue. “The cluster runs a bit slower when we put it in power saving mode. But we can keep it running, and we don’t lose any work.”
That’s an important priority in research computing centers, where jobs can run for weeks or months. “Whenever you shut down a cluster, you kill a process that’s running,” said Shuey. “A full shutdown of one of our systems can mean the loss of 2 million to 3 million CPU hours. I needed a middle lever between opening the doors and turning on fans, and shutting stuff down.”
Cooling Two Large Clusters
The system was used in a 7,500-square-foot server room housing two large clusters totaling nearly 2,000 nodes. One cluster is cooled by a chilled-water system and conventional computer room air conditioner (CRAC) units around the perimeter of the floor, while the second cluster uses Coolcentric rear-door heat exchanger units.
The center taps the university’s central chiller plant, which typically delivers water at about 50 degrees F, and the data center normally operates at about 70 degrees F.
In June the cooling plant sent an alert warning of an outage. The data center began to heat up, threatening a “thermal runaway” in which temperatures soar out of control and equipment must be turned off. Instead, the Purdue staff was able to invoke Finnegan’s program, reducing the heat generated by the servers and stabilizing the temperature.
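The core operational decision in such an event is when to throttle and when to restore. The article does not describe Purdue's actual trigger points, but a minimal sketch of that decision, using two hypothetical temperature thresholds with hysteresis so the cluster doesn't flap between modes as the room hovers near a limit, might look like this:

```python
# Hypothetical thresholds for illustration only; the actual trigger
# points used at Purdue are not described in the article.
THROTTLE_ABOVE_F = 85.0   # enter power-saving mode above this room temperature
RESTORE_BELOW_F = 75.0    # leave it only once the room has clearly recovered

def next_mode(temp_f: float, throttled: bool) -> bool:
    """Decide whether the cluster should be in power-saving mode.

    Uses two thresholds (hysteresis) rather than one, so a room
    temperature oscillating around a single limit does not cause
    the cluster to repeatedly throttle and un-throttle.
    """
    if not throttled and temp_f >= THROTTLE_ABOVE_F:
        return True      # room is overheating: throttle
    if throttled and temp_f <= RESTORE_BELOW_F:
        return False     # room has recovered: restore full speed
    return throttled     # otherwise keep the current mode
```

A staff member (or a monitoring daemon) would feed this a room-temperature reading and the current state, then apply the returned mode cluster-wide.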
In July a second cooling loss occurred near midnight, and the power-saving system was invoked remotely to ride out the event.
Purdue says its power-saving mode can reduce power usage by 10 to 35 percent, with the largest savings on AMD systems. The software employs the Linux kernel’s CPU frequency scaling (cpufreq) driver, and was written for Red Hat Enterprise Linux 5.0.
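The cpufreq driver exposes per-core controls through sysfs, and switching a machine into a low-power state amounts to writing a governor name (such as "powersave") into each core's scaling_governor file. Finnegan's actual program isn't published in the article; the following is only a minimal sketch of that mechanism, with a sysfs_root parameter added here purely so the logic can be exercised against a fake directory tree:

```python
import glob
import os

def set_governor(governor: str, sysfs_root: str = "/sys/devices/system/cpu") -> int:
    """Switch every CPU core under sysfs_root to the given cpufreq governor.

    Writes the governor name to each core's scaling_governor file,
    the standard Linux cpufreq sysfs interface. Returns the number
    of cores updated. Requires root on a real system.
    """
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    updated = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(governor + "\n")
        updated += 1
    return updated
```

On a real cluster, a management tool would run something like `set_governor("powersave")` on every node at the start of a cooling event, and `set_governor("ondemand")` once the room temperature stabilized.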
Purdue is now making the procedure available to other institutions and corporations on the Folio Direct website for $250.
“The program includes all of the notes on the implementation, so a data center manager can see the implementation from someone who has brought the process into production,” Shuey says. “We’re offering it to other organizations so they don’t have to do all of the discovery in-house. Everybody faces the same sorts of concerns, especially in older facilities and in the not-for-profit sector.”