Oak Ridge: The Frontier of Supercomputing


Oak Ridge’s three supercomputers – Gaea, Kraken and Jaguar – are all currently ranked among the top 33 supercomputers in the world. (Photo: Rich Miller)

More Power, Scotty!

As Oak Ridge continues to expand its technical computing operations, it will need additional space and power for both its supercomputers and its in-house computing needs. An upgrade is in the works that will provide Oak Ridge with an additional 20 megawatts of power for IT loads and 6 megawatts of chiller power capacity.

Jaguar and the other supercomputers at Oak Ridge provide researchers with the ability to tackle computational problems that would be impossible on other systems. Scientists are using these machines for breakthrough research in astrophysics, quantum mechanics, nuclear physics, climate science and alternative energy.

While the powerful systems housed at Oak Ridge place heavy demands on power and cooling, the nature of their workloads allows for a simpler approach to infrastructure. “Because of our financial and footprint constraints, we have to be really focused on keeping things simple,” said Griffin. “We don’t need to keep these things on at any cost, so we don’t need a Tier IV system. HPC used for research can recover from power outages. The biggest problem with the power going off is restarting stuff and hardware problems (from a hard stop).”

Operational Focus on Reliability

Even though Oak Ridge may not have the same uptime requirements as a major bank or stock exchange, reliability still matters. At a recent meeting of the Tennessee chapter of AFCOM, Griffin and Scott Milliken, Computer Facility Manager at Oak Ridge, discussed some of the operational strategies the lab employs to maintain high reliability.

The ORNL team works to rigorously commission, test, inspect and maintain electrical and mechanical equipment. That includes infrared and acoustical scans of electrical and mechanical rooms, power testing using load banks, simulations of power losses, predictive and preventive maintenance, and maintaining an inventory of spare parts on-site for critical components.

Griffin said Oak Ridge also has detailed power quality monitoring to guard against equipment challenges related to “dirty power,” and specs equipment to be able to ride through a range of power quality events in the electrical system. “Nowadays, power supplies can handle a lot of things on power quality events,” said Griffin.

On the user side, no single computing job can run more than 24 hours, so any loss of data from a power outage would be limited.
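To illustrate why a 24-hour job limit bounds the damage from an outage, here is a minimal sketch – not ORNL’s actual tooling – of a long-running job that checkpoints its state periodically, so a hard power loss costs at most one checkpoint interval of work. The file name, checkpoint interval and the “work” itself are illustrative assumptions.

```python
# Minimal sketch (assumed workflow, not ORNL's tooling): checkpoint
# periodically so a hard stop loses at most one interval of work.
import pickle
import time
from pathlib import Path

CHECKPOINT = Path("state.pkl")        # hypothetical checkpoint file
INTERVAL_S = 30 * 60                  # checkpoint every 30 minutes (assumed)
WALLTIME_S = 24 * 60 * 60             # 24-hour job limit noted in the article
TOTAL_STEPS = 1_000_000               # stand-in for the real workload

def load_state():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if CHECKPOINT.exists():
        return pickle.loads(CHECKPOINT.read_bytes())
    return {"step": 0, "result": 0.0}

def save_state(state):
    """Write atomically so a power loss never leaves a torn checkpoint."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_bytes(pickle.dumps(state))
    tmp.replace(CHECKPOINT)

def main():
    state = load_state()
    start = last_ckpt = time.time()
    while state["step"] < TOTAL_STEPS and time.time() - start < WALLTIME_S:
        state["result"] += 1e-6        # stand-in for one unit of real work
        state["step"] += 1
        if time.time() - last_ckpt >= INTERVAL_S:
            save_state(state)
            last_ckpt = time.time()
    save_state(state)                  # final checkpoint before exiting

if __name__ == "__main__":
    main()
```

After an outage, the same script simply resumes from the most recent checkpoint rather than rerunning the full 24-hour window.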

Focusing Redundancy on Most Critical Systems

The Oak Ridge data center focuses its redundant infrastructure on key systems that manage a graceful shutdown of power and a quick restart of cooling systems. A 1,000 kVA uninterruptible power supply (UPS) system backs up the disk storage systems and some chillers, allowing the lab to maintain cooling in critical areas of the facility. ORNL also has worked with vendors on a quick-start system on its chillers, which allows it to produce chilled water within five minutes of a restart – a key consideration when seeking to limit the time period in which cabinets go without cooling.
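The sequencing priorities described above can be sketched in pseudocode-style Python – a hypothetical illustration, not ORNL’s control system, with every class and threshold here a stand-in: ride through on UPS for storage, let compute hard-stop, then confirm chilled water is back (the five-minute quick-start target) before restarting compute.

```python
# Hypothetical sketch of the outage/recovery sequencing described above.
# All classes and timings are illustrative stubs, not ORNL's systems.
import time

CHILLER_QUICK_START_BUDGET_S = 5 * 60   # five-minute target from the article

class Chillers:
    def __init__(self):
        self._started_at = None
    def quick_start(self):
        self._started_at = time.time()
    def chilled_water_available(self):
        # Stub: pretend chilled water is ready a few seconds after restart.
        return self._started_at is not None and time.time() - self._started_at > 3

def handle_outage_and_recovery(chillers: Chillers):
    # Outage: disk storage rides on the 1,000 kVA UPS; compute is allowed
    # to hard-stop, since research jobs can tolerate the interruption.
    print("UPS carrying disk storage; compute allowed to stop")

    # Recovery: restart cooling first, then compute.
    chillers.quick_start()
    deadline = time.time() + CHILLER_QUICK_START_BUDGET_S
    while not chillers.chilled_water_available():
        if time.time() > deadline:
            raise RuntimeError("chilled water not restored within 5 minutes")
        time.sleep(1)
    print("chilled water confirmed; restarting compute")

if __name__ == "__main__":
    handle_outage_and_recovery(Chillers())
```

The design point is the ordering: redundancy dollars go to the shutdown and cooling-restart path, not to keeping the compute load itself online through an outage.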

On the efficiency front, the lab has implemented cold aisle containment for its storage gear and some areas of the supercomputing installations. It will also raise the temperature in cold aisles to 65 degrees – still chilly by most standards, but up from the original 55 degrees.

The lab is nearing completion on an additional 20,000-square-foot data hall that will be dedicated to its enterprise computing needs. Migrating Oak Ridge’s in-house workloads to the new hall will free up more room for future supercomputers.

“We envision two systems beyond Titan to achieve exascale performance by about 2018,” wrote Jeff Nichols, Associate Laboratory Director for Computing and Computational Sciences. “The first will be an order of magnitude more powerful than Titan, in the range of 200 petaflops. This system will be an exascale prototype, incorporating many of the hardware approaches that will be incorporated at the exascale. We hope to scale this solution up to the exascale.”
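For scale, a back-of-the-envelope reading of that roadmap (the unit conversion is standard; only the framing here is ours):

$$
1~\text{exaflops} = 10^{18}~\text{FLOP/s},
\qquad
200~\text{petaflops} = 2 \times 10^{17}~\text{FLOP/s} = 0.2~\text{exaflops}.
$$

So the proposed 200-petaflop prototype would sit at roughly one-fifth of exascale, and roughly ten times Titan’s roughly 20-petaflop class – consistent with the “order of magnitude” phrasing in the quote.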


About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

11 Comments

  1. An exascale supercomputer – able to deliver 1 million trillion calculations each second. Now that's something.

  2. Bill Hopper

    In the "Cooling 54 kilowatts per Cabinet" paragraph, "it interacts with a chilled water loop and is converted from liquid back to gas." should be "...gas back into liquid."

  3. twodogs

    Typo: Last line should end with "gas back to liquid." "Cool air flows vertically through the cabinet from bottom to top. As it reaches the top of the cabinet, the server waste heat boils the R-134a, absorbing the heat through a change of phase from a liquid to a gas. It is then returned to the heat exchanger inside a Liebert XDP pumping unit, where it interacts with a chilled water loop and is converted from liquid back to gas."

  4. Thanks for noticing. Yes, we meant "gas back to liquid" and have corrected this.

  5. Chris

    Typo: Jeff Nichols (not Nicheols)

  6. Thanks, Chris. I've corrected this.

  7. Jeffrey Plum

    Has anyone done research on recovering energy from data center and general business cooling? Mining waste heat might power associated businesses, or even living space. Heat mining, like data mining, could be a new area of extracting value from overlooked assets of an operation.