More Power, Scotty!
As Oak Ridge continues to expand its technical computing operations, it will need additional space and power for both its supercomputers and its in-house computing needs. An upgrade is in the works that will provide Oak Ridge with an additional 20 megawatts of power for IT loads and 6 megawatts of chiller power capacity.
Jaguar and the other supercomputers at Oak Ridge provide researchers with the ability to tackle computational problems that would be impossible on other systems. Scientists are using these machines for breakthrough research in astrophysics, quantum mechanics, nuclear physics, climate science and alternative energy.
While the powerful systems housed at Oak Ridge place unusual demands on power and cooling, the nature of their workloads allows a simpler approach to infrastructure. “Because of our financial and footprint constraints, we have to be really focused on keeping things simple,” said Griffin. “We don’t need to keep these things on at any cost, so we don’t need a Tier IV system. HPC used for research can recover from power outages. The biggest problem with the power going off is restarting stuff and hardware problems (from a hard stop).”
Operational Focus on Reliability
Even though Oak Ridge may not have the same uptime requirements as a major bank or stock exchange, reliability still matters. At a recent meeting of the Tennessee chapter of AFCOM, Griffin and Scott Milliken, Computer Facility Manager at Oak Ridge, discussed some of the operational strategies the lab employs to maintain high reliability.
The ORNL team works to rigorously commission, test, inspect and maintain electrical and mechanical equipment. That includes infrared and acoustical scans of electrical and mechanical rooms, power testing using load banks, simulations of power losses, predictive and preventive maintenance, and maintaining an inventory of spare parts on-site for critical components.
Griffin said Oak Ridge also uses detailed power quality monitoring to guard against equipment problems caused by “dirty power,” and specifies equipment that can ride through a range of power quality events in the electrical system. “Nowadays, power supplies can handle a lot of things on power quality events,” said Griffin.
On the user side, no single computing job can run more than 24 hours, so any loss of data from a power outage would be limited.
Focusing Redundancy on Most Critical Systems
The Oak Ridge data center focuses its redundant infrastructure on key systems that manage a graceful shutdown of power and a quick restart of cooling systems. A 1,000 kVA uninterruptible power supply (UPS) system backs up the disk storage systems and some chillers, allowing the lab to maintain cooling in critical areas of the facility. ORNL also has worked with vendors on a quick-start system on its chillers, which allows it to produce chilled water within five minutes of a restart – a key consideration when seeking to limit the time period in which cabinets go without cooling.
On the efficiency front, the lab has implemented cold aisle containment for its storage gear and in some areas of the supercomputing installations. It will also raise the temperature in cold aisles to 65 degrees Fahrenheit – still chilly by most standards, but up from the original 55 degrees.
The lab is nearing completion of an additional 20,000-square-foot data hall, which will be dedicated to its enterprise computing needs. As Oak Ridge’s in-house workloads migrate to the new hall, the move will free up space for future supercomputers.
“We envision two systems beyond Titan to achieve exascale performance by about 2018,” wrote Jeff Nichols, Associate Laboratory Director for Computing and Computational Sciences. “The first will be an order of magnitude more powerful than Titan, in the range of 200 petaflops. This system will be an exascale prototype, incorporating many of the hardware approaches that will be incorporated at the exascale. We hope to scale this solution up to the exascale.”