Supercomputers aren’t supposed to matter much to the enterprise data center. When someone talks about “high performance” in the context of the Linpack benchmark or solving differential equations, the topic police often show up to separate the academics from the practitioners, the scientists from the accountants.
But in the 2000s, something extraordinary happened. High-performance computing (HPC) reinvented itself by adopting a completely different architectural principle. Setting aside decades of bespoke designs, the HPC field as a whole invested in cheaper, commercially available x86 processors, deploying them in bulk. As a result, the first great advances in parallel processing made by Intel and AMD were warmly greeted by both the academic and commercial HPC communities. Suddenly, supercomputing offered a preview of the benefits enterprises could expect to share in, four or five years down the road.
In the 27 years since Jack Dongarra co-developed the High-Performance Linpack (HPL) algorithm – with which the Top 500 supercomputers are scored every six months – supercomputer architectures have metamorphosed at least four times. He has seen the rise of the first commercial off-the-shelf (COTS) era of HPC. In an exclusive interview with Data Center Knowledge, Professor Dongarra tells us he is witnessing another transition in progress, one that is splitting architectural evolution in two directions. One is a return to the notion of bespoke designs, which would lead to a colossus that forges its own path to exascale: the capability to process a billion billion (10¹⁸) floating-point operations per second.
The other is a kind of second COTS era, where instead of single piles of CPUs there are now hundreds of GPUs — vector-processing accelerators rooted in the graphics cards of the x86 PC’s heyday. But neither path, Dongarra believes, will make it easier for developers and software engineers such as himself to produce the types of optimized work processes that inspire more efficient, better performing software for everyone else down the road.
He has an idea for solving this, and the solution he suggests would rewrite the guiding principle of the computer industry since 1965. But there may be no way forward other than to rewrite that principle.
Following is our interview with Professor Dongarra, lightly edited to improve readability.
Jack Dongarra: GPUs are more complicated to use, and because they’re more complicated to use, it’s harder to get at the peak performance for the machines. So, we have devices that are quite powerful on paper, but when you try to use them, applications can’t quite get to their peak performance.
As scientists, there are other ways that we exploit the computer to get better performance. In the old days, hardware was the main driver. Hardware was the only thing that pushed performance to its extreme. Today, hardware is still there, but not as great, and we have to resort to other things. Being clever people, we know it’s not just the hardware, but it’s also the algorithms, the software that we use. We have to make improvements in the software to compensate for the slowdown of Moore’s Law.
How can we get back to the point where maybe Moore’s Law is still in effect, or some law is still in effect? Let’s not call it “Moore’s Law” anymore; let’s call it a law that takes into account not only the hardware, but the algorithms and the software, those things combined — we’ll call it “Jack’s Rule.” We can get to that same point by using better algorithms and software with the applications.
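To make the combined accounting Dongarra describes concrete: if hardware, algorithms, and software each contribute a multiplicative speedup, the overall gain is their product. A minimal sketch of the arithmetic, with improvement factors that are purely hypothetical:

```python
# Hypothetical annual improvement factors; illustrative only,
# not measurements from any real system.
hardware_gain  = 1.10  # slowing transistor-driven gains
algorithm_gain = 1.30  # e.g., a better-conditioned solver
software_gain  = 1.15  # e.g., better blocking and communication hiding

# The gains compound multiplicatively, not additively.
combined = hardware_gain * algorithm_gain * software_gain
print(f"combined yearly gain: {combined:.2f}x")
```

Under these invented figures, modest hardware gains compound with algorithmic and software gains to a combined factor larger than any one contribution alone — the heart of the “Jack’s Rule” framing.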
Scott Fulton, Data Center Knowledge: Well, if we were to create Jack’s Rule right now… When Gordon Moore wrote “Cramming More Components Onto Integrated Circuits” [PDF] back in 1965, his observations had to do with the production of processors, and the performance we could expect from processors over time. So, the “Law” that was attached to this observation really applied to the processor itself. But as you just told me, systems have a whole bunch of performance components that kind of interact with one another: You have accelerators, co-processors, performance factors presented by the network. And in the future, when we get quantum computers, we’ll have quantum interconnectedness between these individual points. You might be able to pair processors together, stack qubits on top of each other over a network, and combine processing power that way.
So, if we decide here and now that system performance is a sum of performance of the system’s parts, maybe we can come up with a rule that says, based on what we understand about accelerators, what we understand about algorithms, the combined network performance and latency minimization, and the optimizations that we’re running in the background, we can predict for the next year a Gordon Moore Commemorative Performance Factor of “1.5” or “1.3” — some number that can give you a forecast. But everybody will know it’s a variable. It’s susceptible to changes. Maybe we need a supercomputer to help us figure out what those changes are.
Dongarra: A lot of what you say, I’m agreeing with. The point I guess I want to make here is, the performance of a computer is not just based on the hardware. It’s not just because of the transistors that we have. It’s how those transistors are used in some algorithms, some methods, and some applications. I’m all for benchmarking things; I’m all for looking at what the real performance is, of something... I want to be able to have a measurement which is more reflective of what reality is, and not just based on peak performance. Looking at the transistors on a chip is one thing, but when we really get down to it, what I’m interested in — what I think the users are interested in — is the actual performance — or, better said, user productivity.
Prof. Dongarra shared this slide that depicts performance levels of the highest performing several dozen machines in the November 2020 Top 500 list, with the #1 Fugaku system on the far left. The green line here represents the observed HPL maximum performance levels Rmax for these machines, and the black line represents their theoretical peak performance levels Rpeak. Now, note the performance scores for the same systems on the High-Performance Conjugate Gradients (HPCG) test, which the professor believes is more reflective of the real-world work supercomputers are tasked with today. Note that the Y-axis, which measures performance in teraflops, is scaled in powers of 10; the HPCG score, though it may look close on the chart, averages only about 2 percent of the HPL score.
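The gap the slide depicts can be expressed as two simple ratios: HPL score over theoretical peak (Rmax over Rpeak), and HPCG score over HPL score. A minimal sketch of the calculation, using invented benchmark figures rather than actual Top 500 results:

```python
# Hypothetical benchmark figures in teraflops; illustrative only,
# not actual Top 500 entries.
systems = {
    "System A": {"rpeak": 500_000, "hpl": 440_000, "hpcg": 14_000},
    "System B": {"rpeak": 200_000, "hpl": 148_000, "hpcg": 2_900},
}

for name, s in systems.items():
    hpl_eff = s["hpl"] / s["rpeak"]     # HPL as a fraction of theoretical peak
    hpcg_vs_hpl = s["hpcg"] / s["hpl"]  # HPCG as a fraction of the HPL score
    print(f"{name}: HPL/Rpeak = {hpl_eff:.0%}, HPCG/HPL = {hpcg_vs_hpl:.1%}")
```

Even with made-up numbers, the pattern Dongarra points to is visible: machines that sustain a large fraction of peak on dense linear algebra still sustain only a few percent of it on the sparse, memory-bound work HPCG represents.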
Dongarra: We get a lot less performance because of a number of things, but mainly because how the hardware interacts with the software is complicated. We can change things by making improvements in the algorithms and software we’re using to solve the application we are investigating.
I’m a scientist who does work in the software area. I develop algorithms. I can’t change the hardware. In some sense, somebody throws a piece of hardware over the fence, and I have to use it, and I can’t change that hardware. The only thing I can do is change how that hardware is used by my software. Gordon Moore threw that device over the fence, and now “Jack’s Rule” has to come in and say, “We need to make improvements in the way in which we solve the problem, to get back up to a point where we feel good about how that machine is performing.”
DCK: Knowing algorithms as well as you do, are you able to create a corollary of Jack’s Rule (or maybe that’s the rule itself) that gives you an understanding of the margin of error? So that when they throw the machine over the wall at you, you know the first raw HPCG number is going to under-predict a few things, and knowing what you know about the system, you could look into a couple of factors and say, “I know I can optimize that so it doubles the performance numbers I’m seeing on that first score”?
Dongarra: Yes, that’s what I do for a living. You’ve just encapsulated what I’ve been doing for the last 40 years.
DCK: The first talk at the recent Supercomputing 2020 conference was about the coming need to change materials. Shekhar Borkar, a former Intel fellow currently with Qualcomm, was saying CMOS is probably going to come to an end. Now, we can make some efficiency changes for maybe five years, maybe ten. But at that point, we’ll probably need to move to a different concept — one which has a very strong likelihood of continuing the Moore’s Law comfort level that we had in the past. But we might have some area of discomfort getting into how wild these new materials and new concepts may be.
In order for those big leaps to take place, I would think you’d have to have a lot of investors. You’d have to have a lot of interest from multiple governments and multiple manufacturers. When supercomputers started becoming a thing again in the ‘90s, they took COTS technology, Intel processors (big Pentiums), clustered them together, and did a lot with a thousand processors. So, you were leveraging a lot of the investment that was already there... Can we do what’s predicted here with the supercomputer industry alone?
Dongarra: You’ve put your finger on an issue here that’s at the heart of the problem. In the old days, we had companies that made supercomputers only. That was all they did. We can point to companies like Cray and CDC, to some extent, and others that made just supercomputers — they made the whole thing, the hardware, everything in that machine. Then microprocessors became more powerful. We leveraged the microprocessor. And those specialized computers were too expensive. The market was not there to sustain production of those machines over time. So today’s machines were not designed for scientific computing. We sort of cobble together microprocessors and accelerators now and use them to solve scientific problems. Scientific problems have different characteristics in what they do, and the hardware really hasn’t addressed those characteristics enough to make the machines very efficient. If we had an investment by industry in developing hardware that was specific to scientific computing, we would see much better machines — machines that would be easier to use. The problem is, the market can’t sustain that.
But there’s a counterpoint. The Japanese researchers at the RIKEN Center for Computational Science said, “We’re going to build this machine, it’s going to be for scientific computing, and we’re going to try to address some of the weaknesses in the hardware and make them better.” They invested half-a-billion dollars [in Fugaku]. That’s a lot of money for one machine that doesn’t even get to exascale, but if you take a look at the performance numbers for it, it’s very good. The critical thing today is not floating-point operations; it’s data movement. It does a very good job of moving data around. That’s where it really shines, and you can see it in the benchmarks. You can see it in the ease with which you can get performance on that machine. It’s a dream for applications users to use that machine.
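Dongarra’s point about data movement is often framed with the roofline model: when an algorithm performs few floating-point operations per byte moved (low arithmetic intensity), attainable performance is capped by memory bandwidth rather than by peak flops. A minimal sketch of that bound, with machine parameters that are hypothetical rather than taken from any real system:

```python
# Roofline-style upper bound: attainable flop/s is the lesser of the
# machine's peak and (memory bandwidth x arithmetic intensity).
# Both machine parameters below are hypothetical, for illustration.
peak_flops = 400e12  # 400 Tflop/s theoretical peak
bandwidth = 1.0e12   # 1 TB/s memory bandwidth

def attainable(ai_flops_per_byte):
    """Upper bound on sustained flop/s at a given arithmetic intensity."""
    return min(peak_flops, bandwidth * ai_flops_per_byte)

# Sparse kernels like HPCG's move many bytes per flop; dense kernels
# like HPL's perform hundreds of flops per byte loaded.
for label, ai in [("sparse (HPCG-like)", 0.25), ("dense (HPL-like)", 400.0)]:
    bound = attainable(ai)
    print(f"{label}: at most {bound / peak_flops:.2%} of peak")
```

This is why a bandwidth-rich design shines on HPCG-style workloads: the sparse kernel’s ceiling is set almost entirely by how fast the machine can move data, not by how many floating-point units it has.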
Whereas, on our machine at Oak Ridge — which is Summit, based on IBM Power processors and Nvidia GPUs (Fugaku’s predecessor at the top of the Top500 list) — it’s a struggle to get performance. Applications people have to work very hard to contort their algorithms and their applications into a form that can effectively use that architecture. I won’t say it can’t be done, but it takes a lot of work to do it. In some sense, my time is more valuable than the computer’s time, so I want to optimize the user’s time. That’s what I would strive to do.
Now, we’re going towards exascale. And the new exascale machines are also expensive. The US Department of Energy is investing $1.8 billion in three computers: one at Oak Ridge called Frontier, another at Argonne called Aurora, and the third at Lawrence Livermore called El Capitan. Those three machines will be exascale, so they’ll reach 10¹⁸ floating-point operations per second. The machine at Oak Ridge is going to be based on AMD processors, plus an accelerator that AMD has. The machine at Argonne is going to be based on Intel processors and an accelerator that Intel has. They’re going to be hard to use. I can predict it today. We’re going to have to work very hard as scientists to implement our algorithms. We can do it, it’ll be possible, but we’ll have to work hard at it, and that’s because of the architectural decisions that were made — decisions based on commodity parts, to a large extent. They will challenge us in extracting that exascale performance.
Jack Dongarra is the Distinguished Professor of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville, where he directs the Innovative Computing Laboratory and the Center for Information Technology Research. He is also a member of the Distinguished Research Staff at Oak Ridge National Laboratory.