Nvidia Makes Its Mark in Top 500 Supercomputers List for Power Efficiency

The resurgence of the most storied name in high performance computing is now complete. Cray supercomputers have captured 12 of the top 25 spots in the University of Mannheim’s venerable Top 500 Supercomputers list, the latest edition of which was released Wednesday.

A top-of-the-line Cray XC50 chassis with 206,720 total cores, dubbed Piz Daint and built for the Swiss National Supercomputing Centre, holds onto the #8 slot, with an R_max score of 9,779,000 gigaflops, or just under 10 petaflops (quadrillions of floating point operations per second). By comparison, the #7 slot, held by the venerable K computer at Japan’s RIKEN research lab (which debuted six years ago at #1, and was built on a Fujitsu SPARC64 chassis), performed at about 10.5 petaflops on the Linpack benchmark.
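The arithmetic behind "just under 10 petaflops" can be sketched as a simple unit conversion. This is a minimal illustration that assumes the 9,779,000 figure is expressed in gigaflops, which is the only reading consistent with the petaflops figure quoted above; the function name is made up for this example.

```python
# Assumption: the article's R_max figure (9,779,000) is in gigaflops.
GIGA = 10**9    # floating point operations per second in one gigaflops
PETA = 10**15   # floating point operations per second in one petaflops


def gigaflops_to_petaflops(gflops: float) -> float:
    """Convert a gigaflops figure to petaflops (hypothetical helper)."""
    return gflops * GIGA / PETA


piz_daint_pflops = gigaflops_to_petaflops(9_779_000)
print(f"Piz Daint: {piz_daint_pflops:.3f} petaflops")
```

A petaflops is a million gigaflops, so the conversion only shifts the decimal point six places.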

Piz Daint is accelerated by some 170,240 GPU pipelines, provided by way of Nvidia Tesla P100 accelerators. And while those GPU accelerators may be responsible for Piz Daint’s rise to success, they may also be a key contributor to the second largest power efficiency ratio on the November 2016 list: 7453.51 megaflops/watt. Of the 25 most power-efficient supercomputers undergoing the Linpack battery of tests, 16 are supplemented by Nvidia Tesla accelerators, with the top efficiency scorer — the #28 DGX SATURNV, built by Nvidia itself on its own DGX-1 deep learning system chassis — scoring a colossal 9462.09 megaflops/watt. SATURNV posted an R_max score of about 3.3 petaflops.

What do I mean by R_max? It’s an assessment of maximal sustained performance in a battery of tests based on the industry standard Linpack benchmark.

The Top 500 scores are assessed twice annually by the University of Mannheim, working in cooperation with Lawrence Berkeley National Laboratory and the University of Tennessee, Knoxville. Testers look for how well supercomputer systems perform in this battery over a long stretch of time. The R_max score refers to “maximal achieved performance.” Each system also has a theoretical peak performance, called the R_peak; the ratio of achieved performance to theoretical peak produces an interesting derived figure called the yield.
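The yield described above is simply R_max divided by R_peak, expressed as a percentage. Here is a minimal sketch of that calculation. The function name and the sample figures are hypothetical, chosen only for illustration, and are not taken from the list.

```python
def yield_percent(r_max: float, r_peak: float) -> float:
    """Return achieved performance as a percentage of theoretical peak.

    Both arguments must be in the same unit (for example, teraflops).
    """
    if r_peak <= 0:
        raise ValueError("r_peak must be positive")
    return 100.0 * r_max / r_peak


# Made-up figures for illustration only; not a real system on the list.
example_r_max = 850.0     # teraflops sustained on Linpack (hypothetical)
example_r_peak = 1000.0   # teraflops theoretical peak (hypothetical)

example_yield = yield_percent(example_r_max, example_r_peak)
print(f"Yield: {example_yield:.2f}%")
```

Because the units cancel, the yield lets systems of very different sizes be compared on how closely they approach their own theoretical ceiling.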

The top performer on the list overall is no surprise: Sunway TaihuLight, built for Wuxi, China’s National Supercomputing Center. It burst onto the scene on the previous list in June with performance that could leave the dust in the dust, so to speak: just over 93 petaflops, with a decent megaflops/watt rating of 6051.3. “Sunway” is an anglicization of Shen Wei. Its CPUs are astoundingly simple in design, without memory caches, but with its 10.6 million-plus processor cores divided into clusters of 65.

In other words, the Shen Wei design is purpose-built for supercomputing. It is neither an extrapolation of x86 architecture nor a commercial, off-the-shelf (COTS) design. Needless to say, it doesn’t use (or need) acceleration from GPUs.

Last year’s supercomputing leader, China’s Tianhe-2, held onto second place with an R_max score of about 33.9 petaflops. When a country devotes a good chunk of resources to the exclusive design and production of an unmatched supercomputer — and no other country can — it has a high chance of success.

So from the standpoint of an actual race (and let’s be honest, all performance tests are really races), the real objective for commercial processor-based design is to demonstrate just how much power can be achieved by designs that are not exclusive to supercomputing, and to learn what we can from their achievements. Viewed in that light, Oak Ridge National Laboratory’s Titan, an old 2012-model Cray XK7 which held the #1 spot four years ago, truly proves its mettle by scoring just under 17.6 petaflops.

Titan’s power plant is composed of 18,688 16-core, 2011-model AMD Opteron 6274 processors, several generations removed from today’s mainstream. (Titan is one of only 7 AMD-based systems on the November 2016 list, by the way.) But it is assisted by a battalion of 261,632 GPU pipeline cores, provided by 18,688 Nvidia Tesla K20X accelerators, one paired with each processor.

The next best-performing Opteron-based model in this list is the #87-ranked Garnet, run by the U.S. Dept. of Defense’s Supercomputing Resource Center, which posted an R_max score of 1.17 petaflops. Garnet does not use GPU acceleration. But Titan scored 2142.77 megaflops/watt in power efficiency, while Garnet scored just 209.55.

The message here is that GPU accelerators clearly improve power efficiency in high-performance settings.
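To put the Titan and Garnet figures side by side, the sketch below computes the ratio of their efficiency ratings and the power draw each rating implies. It assumes the megaflops/watt rating is R_max divided by power consumed during the Linpack run, and it uses the rounded petaflops figures quoted above, so the wattages are approximations. The helper name is hypothetical.

```python
MEGAFLOPS_PER_PETAFLOPS = 10**9  # one petaflops is a billion megaflops


def implied_power_watts(r_max_pflops: float, mflops_per_watt: float) -> float:
    """Estimate power draw from R_max and efficiency (hypothetical helper).

    Assumes efficiency = R_max / power, so power = R_max / efficiency.
    """
    return r_max_pflops * MEGAFLOPS_PER_PETAFLOPS / mflops_per_watt


# Figures quoted in the article (R_max values are rounded).
titan_efficiency = 2142.77    # megaflops/watt
garnet_efficiency = 209.55    # megaflops/watt

efficiency_ratio = titan_efficiency / garnet_efficiency
titan_watts = implied_power_watts(17.6, titan_efficiency)
garnet_watts = implied_power_watts(1.17, garnet_efficiency)

print(f"Titan is about {efficiency_ratio:.1f}x as efficient as Garnet")
print(f"Implied power: Titan {titan_watts / 1e6:.1f} MW, "
      f"Garnet {garnet_watts / 1e6:.1f} MW")
```

Under these assumptions, Titan delivers roughly fifteen times Garnet's performance while drawing only about one and a half times the power, which is where the tenfold efficiency gap comes from.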

However, accelerators don’t necessarily make HPC designs perform more closely to their theoretical peak. When systems on the list are sorted according to yield (that interesting ratio of observed maximal performance to theoretical peak performance), the highest-yielding system that uses GPU acceleration ranks only 94th: the #69 JURECA system, built for Forschungszentrum Jülich (FZJ), the European research institute in far western Germany near the Netherlands border. JURECA’s yield is about 84.13%.

The system with the highest yield this time is #295 NEMO bwForCluster, built for Freiburg University in Germany by Switzerland’s Dalco AG. Its yield is a stellar 98.78%.

Intel’s 14 nm “Broadwell” series Xeon E5 processors account for 69 of the Top 500, including the #11 system, a 241,920-core Cray XC40 based on Xeon E5-2695v4 processors. It scored 6.77 petaflops, though its megaflops/watt score has not been posted. But the older 22 nm “Haswell” series led the way, with 224 of the Top 500, the best performer of which is the #8 Piz Daint.
