This week at GTC 2022, Nvidia launched a broad range of data center products based on the new Hopper architecture.
At the center of the range is the H100 – a hardware accelerator featuring 80 billion transistors and two types of cores, built using the industry-leading 4 nanometer manufacturing process.
The H100 is much more than an incremental upgrade: it promises up to 9x faster AI training and up to 30x faster AI inference in popular machine learning models over the previous generation’s A100, released just two years ago. This represents an IT foundation that could help make artificial intelligence applications truly mainstream.
Much of the speedup is down to the new Transformer Engine, designed to accelerate specific machine learning tech originally developed at Google, and now used to create the world’s largest and most complex ML models.
“Of course, we are going to get more transistors, and it's going to be faster. But when we look at function-specific optimizations, we asked, what would really make a difference to our customers and to where the industry was going? And that's where the Tensor core support for Transformer really comes to life,” Charlie Boyle, VP and GM of DGX Systems at Nvidia, told Data Center Knowledge.
Meanwhile, Nvidia’s advancements on the server side enable any business (with deep enough pockets) to achieve supercomputer-class performance with off-the-shelf parts, using the new DGX H100 servers.
Yet another exciting announcement is the expansion of the proprietary NVLink interconnect, which is now able to link together individual server nodes, enabling the creation of what the company describes as a “data center-sized GPU.”
The age of the accelerator
The Hopper architecture is named after Rear Admiral Grace Hopper, who worked on some of the first general-purpose computers and was responsible for major advances in programming languages.
The most basic building block of Nvidia’s Hopper ecosystem is the H100 – the ninth generation of Nvidia’s data center GPU.
The device is equipped with more Tensor and CUDA cores than the A100, running at higher clock speeds. There’s 50MB of Level 2 cache and 80GB of familiar HBM3 memory, but at twice the bandwidth of the predecessor – reaching 3 TB/sec.
Then, there are the new DPX instructions that most customers won’t care about, but which can be used to considerably speed up dynamic programming algorithms in fields like healthcare, robotics, quantum computing, and data science.
By tradition, Nvidia still calls the H100 a graphics processing unit, but the term is clearly on its last legs: just two out of the 50+ texture processing clusters (TPCs) in the device are actually capable of running the vertex, geometry, and pixel shader maths required to render 3D graphics.
In terms of GPU virtualization, every H100 can be carved into up to seven isolated instances with a ‘shared nothing’ approach, making this the first multi-instance GPU with native support for Confidential Computing – i.e. able to protect data in use, and not just during storage or transfer.
The DGX H100 server
The newly announced DGX H100 is Nvidia’s fourth-generation AI-focused server system. The 4U box packs eight H100 GPUs connected through NVLink (more on that below), along with two CPUs and two Nvidia BlueField DPUs – essentially SmartNICs equipped with specialized processing capacity.
If you combine nine DGX H100 systems together in a single rack, along with some third-party storage and networking equipment, you’ll get a DGX POD: a reference design that can be used to deliver AI-based services at scale.
The DGX SuperPOD takes this a step further: it links together 32 DGX systems and 256 H100 GPUs to deliver one exaflops of AI performance at FP8 precision – a number that was reserved for the world’s fastest machines just a few years ago.
Nvidia itself will combine 18 SuperPODs, or 4,608 H100 GPUs, to build Eos, which is expected to become the world’s fastest AI supercomputer when it is switched on later this year.
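The scaling figures above follow from simple multiplication, and a short sketch makes the chain explicit. The constants below are taken from this article; the function names are hypothetical, and the per-GPU throughput is only what the quoted SuperPOD figure implies when divided evenly, not an official specification.

```python
# Figures quoted in the article.
GPUS_PER_DGX = 8             # eight H100 GPUs per DGX H100
DGX_PER_SUPERPOD = 32        # 32 DGX systems per DGX SuperPOD
SUPERPODS_IN_EOS = 18        # Eos combines 18 SuperPODs
SUPERPOD_FP8_FLOPS = 1e18    # "one exaflops" of FP8 AI performance


def gpus_per_superpod() -> int:
    """Number of H100 GPUs in one DGX SuperPOD."""
    return GPUS_PER_DGX * DGX_PER_SUPERPOD


def gpus_in_eos() -> int:
    """Number of H100 GPUs in the planned Eos system."""
    return gpus_per_superpod() * SUPERPODS_IN_EOS


def implied_fp8_pflops_per_gpu() -> float:
    """FP8 petaflops per GPU implied by dividing the SuperPOD figure evenly."""
    return SUPERPOD_FP8_FLOPS / gpus_per_superpod() / 1e15


if __name__ == "__main__":
    print(f"GPUs per SuperPOD: {gpus_per_superpod()}")
    print(f"GPUs in Eos: {gpus_in_eos()}")
    print(f"Implied FP8 PFLOPS per GPU: {implied_fp8_pflops_per_gpu():.2f}")
```

Dividing the headline figure evenly across 256 GPUs puts each H100 at just under four petaflops of FP8 throughput, which is the scale the one-exaflops claim rests on.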
The importance of NVLink for these systems cannot be overstated: originally designed to connect several GPUs together while bypassing the relatively slow and bulky PCIe slot, the interface has evolved into a versatile tool for chip-to-chip connectivity.
A single H100 supports up to 18 fourth-generation NVLink connections (cards based on the Pascal and Volta architectures had six) for a total bandwidth of 900 GB/s – about seven times more than what is possible with PCIe Gen 5.
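The bandwidth comparison can be checked with a few lines of arithmetic. The NVLink figures come from this article; the PCIe Gen 5 number is an assumption, using the commonly cited nominal 128 GB/s of bidirectional bandwidth for a x16 slot, and the function names are hypothetical.

```python
# From the article: total NVLink bandwidth and link count per H100.
NVLINK_TOTAL_GBPS = 900.0
NVLINK_LINKS = 18

# Assumption (not from the article): nominal bidirectional bandwidth
# of a PCIe Gen 5 x16 slot, in GB/s.
PCIE_GEN5_X16_GBPS = 128.0


def bandwidth_per_link() -> float:
    """GB/s carried by each NVLink connection if the total is split evenly."""
    return NVLINK_TOTAL_GBPS / NVLINK_LINKS


def nvlink_vs_pcie() -> float:
    """Ratio of total NVLink bandwidth to the assumed PCIe Gen 5 x16 figure."""
    return NVLINK_TOTAL_GBPS / PCIE_GEN5_X16_GBPS


if __name__ == "__main__":
    print(f"Per-link bandwidth: {bandwidth_per_link():.0f} GB/s")
    print(f"NVLink vs PCIe Gen 5 x16: {nvlink_vs_pcie():.1f}x")
```

Under that assumption the ratio comes out at roughly seven, in line with the comparison quoted above.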
The most recent iteration of NVLink can deliver this level of bandwidth between up to 256 GPUs, or 32 DGX H100 systems, thanks to the addition of a dedicated chip called NVSwitch.
“This effort has been going on for a number of years, to optimally work NVLink inside the box, as well as outside the box,” Boyle told Data Center Knowledge.
“When you think about it, that’s 32 servers, and they're all running their own OS, they're all running their own stack. And you have to coordinate between all of them to run a single job. And that's a fairly big computer science problem. How do I ensure the back-end communication is working, and synchronized correctly, when the actual system may be doing something else?”
At GTC, the company announced that NVLink would now be considered the default for all future Nvidia chips – including CPUs, GPUs, DPUs and SoCs. And yes, Nvidia is set to release a Hopper family CPU called Grace next year, based on Arm cores.
The company also said it would share NVLink specifications with the customer and partner ecosystem, enabling them to build compatible silicon products.
It is important to note that Nvidia is not trying to replace the popular InfiniBand interconnect, widely used in HPC – the Hopper architecture leverages both, and there are eight InfiniBand networking adapters in every DGX H100.
“There are some operations that are best done on the InfiniBand fabric that all the programmers know how to use. And then there are certain high-speed operations that you want to reserve for the NVLink part,” Boyle said.
Hot and bothered
For server manufacturing partners, the H100 comes with new and unusual design constraints. The board consumes much more power and produces more heat than its predecessors: it might have higher performance per watt, but the H100 is rated at a sweltering 700W TDP, while the A100 maxed out at 400-450W.
On the upside, data centers physically able to power and cool these things are easy to identify thanks to the ‘DGX-Ready’ badge introduced in 2019.
“The power is going up, but that's not necessarily a bad thing,” Boyle said. “Because what we try to optimize for is total work done.
“And yes, I deal with customers who are power-limited in places, even with current-generation [hardware], which is why we have a great program called ‘DGX-ready’ where we pre-qualify colocation partners around the world.
“For the most part, if customers are buying a couple of systems, they can always find a place to put them in their data center. Yes, they take up a little more power than the average server, but you can always find a place. But when you're talking about building an AI center of excellence, you're talking about putting in 40 to a few hundred of these things, almost nobody – including us at Nvidia – just has that data center lying around.
“And that's why we've got a great partner network around the world, so no matter where you are, we've got at least one, if not multiple, data center partners in your area, so that you can quickly get online.
“And we do that with a lot of our customers who are looking at our larger configurations; we'll work with them on data center airflow analysis, the cooling analysis.”
The Nvidia H100 GPU and DGX H100 servers are expected to appear on sale in the third quarter of 2022.