HPE and DOE Partner to Build Largest ARM-Based Supercomputer

By the end of the year, the U.S. is going to have two record-breaking supercomputers. We learned of the first a few weeks back, with the announcement of IBM's speed demon, Summit, with a peak performance of 200 petaflops, making it the planet's fastest supercomputer and leaving the former number one, China's Sunway TaihuLight at 93 petaflops, in the dust. Then early this week Hewlett Packard Enterprise (HPE) and the Department of Energy (DOE) announced they'll also have a supercomputer, called Astra, up and running, possibly as early as the end of this summer but definitely by the end of the year.

This system will be used by the National Nuclear Security Administration to run advanced modeling and simulation workloads to address issues such as national security, energy and science.

At 2.3 theoretical peak petaflops, this one's not set to top the Top 500 list, but by one important metric it will be the world's fastest. When up and running, it will be the largest and fastest supercomputer running ARM silicon. And if 2.3 petaflops doesn't sound like much when compared to Summit or Sunway TaihuLight, it'll still land somewhere around the midpoint of the top 100 on that list.

"It's not the largest supercomputer in the world, but it's by far the largest ARM-based computer," Mike Vildibill, VP of HPE's advanced technologies group told Data Center Knowledge. "It's still in the top 100, which is just a phenomenal milestone. To my knowledge there's no ARM-based systems on the Top 500 today, to kind of show you how aggressive the Department of Energy is being in taking this new architecture all the way into their production environments."

According to HPE, Astra represents something of a test bed on the path to developing an exascale-class system, meaning a system that can achieve 1,000 petaflops, or five times the 200-petaflop peak of Summit.
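To put the exascale target in perspective, here is a minimal Python sketch that compares 1,000 petaflops against the figures quoted in this article. The names are illustrative, and the quoted numbers may mix theoretical peak and measured values, so treat the ratios as rough.

```python
# Petaflops figures as quoted in the article (not independently verified).
QUOTED_PETAFLOPS = {
    "Summit": 200.0,
    "Sunway TaihuLight": 93.0,
    "Astra": 2.3,
}
EXASCALE_PETAFLOPS = 1000.0  # 1 exaflop = 1,000 petaflops


def exascale_ratio(name):
    """Return how many times larger the exascale target is than the named system."""
    return EXASCALE_PETAFLOPS / QUOTED_PETAFLOPS[name]


for name in QUOTED_PETAFLOPS:
    print(f"{name}: exascale target is {exascale_ratio(name):.1f}x its quoted figure")
```

By these figures, an exascale system would be about five times Summit and several hundred times Astra, which is why Astra is described as a test bed and not a step on the performance ladder.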

The system is based on HPE's Apollo 70 System, a 2U enclosure (twice the height of a standard rack-mounted server) that holds four servers, each with two Cavium ThunderX2 systems-on-a-chip. In total, the system will deploy 2,592 servers with 5,184 CPUs, all tied together using InfiniBand, a high-bandwidth interconnect.
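As a quick check of those numbers, the following Python sketch derives the CPU and enclosure counts from the quoted server count. The constant and function names are illustrative, and the enclosure count is derived here, not a figure stated by HPE.

```python
# Figures quoted in the article.
SERVERS = 2592
CPUS_PER_SERVER = 2        # two ThunderX2 SoCs per server
SERVERS_PER_ENCLOSURE = 4  # four servers per 2U Apollo 70 enclosure


def astra_counts(servers=SERVERS):
    """Return (total CPUs, total 2U enclosures) for a given server count."""
    cpus = servers * CPUS_PER_SERVER
    enclosures = servers // SERVERS_PER_ENCLOSURE
    return cpus, enclosures


cpus, enclosures = astra_counts()
print(f"{cpus} CPUs in {enclosures} enclosures")
```

Two CPUs per server across 2,592 servers gives the 5,184 CPUs quoted above, and four servers per enclosure implies 648 enclosures.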

The ThunderX2 processor is relatively new, announced only a couple of months ago, and was chosen partly because of its memory performance. HPE claims that the system will offer 33 percent better memory performance than traditional systems, along with greater system density. Memory performance is important because it determines how quickly the system can feed data to its processors when running supercomputer workloads.

"The idea with these HPC systems is the customer plans to run a single application across the entire computer system at the same time," Vildibill said. "So all of the CPUs are running on the same application and they're sharing data back and forth at very high rates. So they need this very high bandwidth InfiniBand interconnect, but they also need the high memory bandwidth that the ThunderX2 provides. The ThunderX2 comes with eight memory controllers built into the CPU, which is more memory controllers than are in traditional x86 systems today."

Astra will utilize the Lustre file system, a parallel file system that grants high-performance access through simultaneous, coordinated input/output (I/O) operations. For storage, Astra will deploy 20 all-flash HPE Apollo 4520 units connected to run as a single file system with a capacity of over 400 terabytes.

"The parallel file system allows for all 5,000 of these CPUs to read and write to the one common file all at the same time," he explained. "The parallel file system is like the traffic cop and manages all of the I/Os so that the complete supercomputer can run one application very fast, and it can do all of its I/O to one file or one file system, all simultaneous."

The 1.2 MW system will be liquid cooled using HPE's MCS 300, a liquid cooling solution that houses the Apollo 70 racks.

"We remove 99 percent of the heat from the racks with this cooling solution, which brings a lot of efficiency to the data center and saves the customer a lot of money over time," he said. "It's very expensive to try to blow hot air long distances, and if you can convert it to liquid to extract it you're much more efficient.

"Furthermore, with this solution the distance that the hot air has to travel is actually minimized compared to even other liquid cooling solutions. Quite interestingly, at the end of the day when you're calculating the efficiency of the cooling environment, calculating the distance that you're blowing with fans, all this hot air actually becomes an interesting bellwether to how your overall efficiency is going to look."

Vildibill said that the decision to use ARM processors was made by the Department of Energy before they began seeking a partner to design and build the system.
