Can CPU architecture have any direct impact on the design of the data center? The makers of ARM processors would say so, but usually only insofar as offloading the simplest of tasks can drive down power consumption. What about the most power-intensive tasks you can run, like artificial intelligence and machine learning?
"What if there's a new kind of workload, and I want to aggregate much more memory than could ever even fit in a box?" asked IBM Distinguished Engineer Bill Starke during a recent press conference. "It's a whole new use. What if I want one computer to talk to a petabyte of memory? Well, nobody knows how to build that today."
With a concept called distributed memory disaggregation being integrated into his company's latest Power10 processors, unveiled this week, Starke said, "you could actually get to huge, huge amounts of memory, and enable whole new kinds of workloads to run on the computers that run in my cloud."
In an exclusive interview with Data Center Knowledge, two IBM executives overseeing the development of Power10 shared new details about the technique IBM is calling "memory inception." One aspect of this feature, they told us, enables a Power10 core in one server in a cluster to directly address physical memory attached to another server in that cluster — not like a procedure call to a database, but like a real pointer.
Yes, memory inception has made headlines recently, probably as much for its name as anything else, but even some of the most learned analysis of this feature may have missed the point: Multiple cores sharing a single, vast memory map can be made to execute shared tasks in parallel — tasks that would otherwise require asynchronous accelerators such as GPUs, ASICs, and FPGAs, a great many more threads, and a separate orchestration scheme.
Today, accelerators are attached to individual servers. They perform highly repetitive, parallelizable tasks much faster than a typical CPU, whose threads of parallelism are limited to the number of available cores. While IBM has had a hand in crafting accelerator technology, most recently by backing the OpenCAPI architecture for accelerator interfaces, it doesn't have the same stakes in this market as Intel (after its 2015 purchase of Altera), Nvidia, and Microsoft's new favorite, Xilinx.
With memory inception, IBM told us, a great many classes of memory- and power-intensive tasks would not require asynchronous accelerators — or their often sweat-inducing thermal envelopes, especially with GPUs.
"This AI math for Power10, we have tried to make part of the SMP complex," remarked Satya Sharma, IBM's CTO for cognitive systems. By SMP, Sharma is referring to symmetric multiprocessing — the classic parallelism utilized by the general-purpose CPU. A typical program that utilizes asynchronous processing would queue up a job for the accelerator, launch the job, go about its other business, and receive the signal from the accelerator that the job is completed, whenever that may be.
"The Power10 core itself has this AI math capability," Sharma continued. "Therefore, we can write these programs as part of the SMP complex. There are instructions in Power10 to make use of this AI math. I have to tell you, from the programmer friendliness point of view, if you are part of the SMP complex, then you are better off compared to specialized hardware."
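The contrast Sharma draws — queuing a job for an accelerator and waiting for its completion signal, versus issuing the math inline in the SMP program flow — can be sketched in a few lines. This is purely illustrative: the `matmul` kernel and the thread-pool offload stand in for an accelerator's job queue; none of this is IBM's Power10 API.

```python
# Illustrative only: contrasts the asynchronous accelerator-offload pattern
# with inline ("SMP-style") execution. The names here are hypothetical stand-ins,
# not Power10 instructions or an IBM interface.
from concurrent.futures import ThreadPoolExecutor

def matmul(a, b):
    """Plain matrix multiply standing in for an 'AI math' kernel."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Inline, SMP-style: the result is available the moment the call returns,
# in the same thread of control. No queue, no completion signal.
inline_result = matmul(A, B)

# Accelerator-style: queue up the job, launch it, go about other business,
# then block on the completion signal to collect the result.
with ThreadPoolExecutor() as pool:
    job = pool.submit(matmul, A, B)   # queue and launch the job
    # ... the program goes about its other business here ...
    offload_result = job.result()     # wait for the completion signal

assert inline_result == offload_result  # same math, different control flow
```

The math is identical either way; what changes is the programming model — the inline version needs no separate orchestration of job submission and completion, which is the "programmer friendliness" Sharma is pointing at.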
A decade and a half ago, IBM tried to start its own revolution with the development of what was called the Cell Broadband Engine (CBE). Though it didn't really have much to do with broadband communications, and it didn't end up a permanent part of the data center, it did find its way into Sony's PlayStation 3 game console. CBE advanced the concept of not just multiple cores but multiple processors operating collectively, and in parallel, on the same memory maps.
What differentiates Power10's methodology from CBE's, remarked Sharma, is the way AI tasks are integrated into the general flow of the program. While there may still be huge tasks that require an accelerator's gift for dividing and conquering, Power10, IBM believes, may be able to re-absorb many classes of tasks in the AI and database space, back into the sequential, symmetrical, synchronized CPU workflow.
"If you look at your core data center people, who are looking at running SAP-type applications or large Oracle databases," remarked Steve Sibley, IBM's vice president of Power Systems offering management, "running core business applications in a cloud-like way. That's Power10's sweet spot where our clients are. They're extending into where we've taken supercomputers with Power and AI, and we're bringing back some of that inference capability into the core."
Sibley added that IBM intends to make full use of its Red Hat division, using OpenShift, its commercial Kubernetes platform, as a workload deployment mechanism. What this means is that a Power10 cluster will be able to orchestrate highly parallelized workloads — perhaps not extremely parallelized, though still somewhat complex — into a single group of regular tasks, manageable through Kubernetes rather than some separate, external, asynchronous engine.
This could actually have an impact upon overall resource consumption. Even though, Sharma admitted, integrating AI math into Power10's libraries (an automatic process, requiring no re-compilation of existing programs) may not yield an immediate, near-term power consumption dip, it's the way these workloads are managed and orchestrated that could change. And that could positively influence the data center's overall power draw.
Samsung was chosen in 2018 to fabricate Power processors, and will produce this chip using its 7 nm lithography process, which it has had in production use since April 2019. IBM is not as forthcoming as usual with a production timetable for Power10, probably due to the pandemic.