It’s been a huge week for Arm, and not only because Apple announced the beginning of the switch from Intel x86 processors to its own Arm-based chips in its Mac computers. The non-profit high-performance computing ranking organization Top500 also announced this week that for the first time in history, the world’s fastest supercomputer is powered by Arm chips.
Add two more big developments for Arm in the data center announced this week. As you probably know by now, the race to produce cooler, more efficient servers has taken a turn away from Intel co-founder Gordon Moore’s Law and toward Sophie Wilson’s dream. Santa Clara-based Ampere Computing, the Arm server chip startup formed by ex-Intel president Renee James that just in March launched its 80-core Altra Arm CPU, announced that sampling of a 128-core Altra Max processor will start in the fourth quarter. Focused squarely on hyperscale cloud providers, Ampere designed the part to compete with Intel’s Xeon Platinum 8160 ($9,899 suggested list) and AMD’s Epyc 7742 ($6,950) on performance.
Also this week, the Cambridge, UK-based firm known up until last year as Kaleao and restructured last December as Bamboo Systems, said it would release its first 1U Arm server, the B1000N Series, in the third quarter. The server is designed for low-power environments, such as edge data centers.
Suddenly the Performance Leader
The “emergence” of Arm processor architecture in data centers is, by our own count, already well into its fifth year. So much about the architecture and construction of modern data centers derives, directly or indirectly, from the x86 processor architecture originally created for PCs. Against that backdrop, the evolution of Arm in this space has been slow but steady.
The performance of Ampere’s new 128-core Altra Max is aiming for supercomputer territory. There, Intel’s Xeon Platinum 8160 powers nine systems in the latest Top500.
“At a high level, we’re delivering so much more performance than Intel is on a per-CPU basis,” Jeff Wittich, Ampere’s senior VP for products, said. The raw numbers aren’t out yet, but Wittich is claiming a 2.2x performance gain over similar x86 processors, and the Intel chip Ampere has chosen as its point of comparison is the 8160.
Altra Max will be socket-compatible with the 80-core Altra, which claimed the highest Arm core count. Wittich asserted that the 128-core processor will maintain linear scalability, meaning performance-per-core won’t drop off gradually as core count increases.
If his claim holds true, it would mark welcome progress since May 2019. In a study published that month [PDF], a University of Bristol team compared the performance of what was then the first Arm-based supercomputer, a Cray XC50 Scout system dubbed Isambard and powered by Marvell ThunderX2 Arm processors, against Cray machines with similar specifications, including one built on 28-core Xeon Platinum 8176 processors.
The Bristol team found that the Marvell chip suffered from scaling efficiency drop-off, especially after a node count of 16. At 64 server nodes, scaling efficiency for the Arm-based processor dropped below 80 percent, while the Intel chips all stayed above 100 percent.
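To make the metric concrete, here is a minimal sketch of how strong-scaling efficiency is commonly computed: the measured speedup over a baseline run, divided by the ideal speedup implied by the increase in node count. The Bristol study's exact definition is not reproduced here, and the timings below are hypothetical, chosen only to illustrate the arithmetic.

```python
def scaling_efficiency(base_nodes, base_time, nodes, time):
    """Strong-scaling efficiency relative to a baseline run.

    Assumed (common) definition: measured speedup divided by ideal speedup.
    """
    speedup = base_time / time          # how much faster the larger run was
    ideal_speedup = nodes / base_nodes  # perfect linear scaling
    return speedup / ideal_speedup


# Hypothetical wall-clock times in seconds, NOT figures from the study.
timings = {1: 1000.0, 16: 66.0, 64: 20.0}

for nodes, seconds in timings.items():
    efficiency = scaling_efficiency(1, timings[1], nodes, seconds)
    print(f"{nodes:>3} nodes: {efficiency:.0%} scaling efficiency")
```

Under this definition, a result above 100 percent is possible. Such superlinear scaling can appear when, for example, each node's share of the problem shrinks enough to fit in cache.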
Ampere’s tests, in contrast, are based on core and thread counts, not node counts. Still, if what Wittich said proves true, it would mean that Arm processor engineers have overcome a serious limitation that could have rendered their products non-competitive in the HPC space.
“Our focus has been cloud, so we’re optimizing everything for our cloud environment,” Wittich told DCK. “But a lot of the things that we’re doing there would be equally applicable to a highly scalable supercomputer. So we’ll see interest there for sure. There’s nothing that precludes it.”
Ampere aims to bring Altra Max into general volume production by mid-2021.
“A lot of companies out there are already putting Arm processors into their servers. HPE, Supermicro, Lenovo have all got one,” Tony Craythorne, CEO of Bamboo Systems, remarked. “But all they’ve done is literally plugged an Arm chip into the x86 architecture. That can give you some of the benefits of Arm — it will reduce the power and cooling — but it won’t give you any of the benefits of processing, I/O, and throughput capability, where Arm does have a massive advantage.”
His point was that Bamboo was introducing not just a server with an Intel or AMD processor substituted with Arm, but a completely new architecture built around this style of processor. Showing some of the effort with which its name was crafted, he calls it Parallel Arm Node Designed Architecture, or PANDA.
“Our product today can save a customer up to 50 percent of their acquisition costs at a minimum (and it could go even higher), 75 percent of their energy consumption, and about 80 percent of their rack space due to the density that we can get into a very small form factor,” Craythorne told DCK.
Although the Bamboo architecture is being designed for what he called “mini-supercomputer” scalability, at least at this early stage scaling starts from the low end up. Each Bamboo server may contain one or two blades, with each blade containing four complete processing units. A 1U box with two blades thus contains eight Linux servers, each with dedicated memory and storage. Bamboo plans to produce a 4U product later this year.
“Part of the reason we’re launching it as a 1U [is] we understand this technology is new,” Craythorne said. “Everybody has an Intel legacy system. Nobody’s just going to throw that out and go spend $150-200,000 on a 4U system. They may want to just try it. They want something that’s easy to buy, easy to sell, low-cost to try out, so they can then see if it’s going to work for them.”
By “low cost” Craythorne meant $9,995. While a typical 1U low-power x86 server can sell for under $1,500, each “node” may only contain a single quad-core CPU. The Bamboo CEO told us his team used AWS’s Total Cost of Ownership calculator to estimate the three-year cost of operating a rack of eight 2U Dell PowerEdge R740XD servers totaling 16kW of capacity. AWS’s three-year TCO estimate was approximately $560,000.
Although Bamboo has yet to sustain a real three-year trial run, the company claims a similarly performing rack of B1008N servers would incur about $200,000 over the same period.
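For readers who want the two rack-level figures side by side, the following sketch computes the implied savings. It uses only the approximate numbers quoted above: the Bamboo figure is a company projection, not a measured result, and the function and variable names are illustrative.

```python
def tco_savings(baseline_cost, alternative_cost):
    """Fraction of the baseline cost saved by the alternative."""
    return (baseline_cost - alternative_cost) / baseline_cost


# Approximate three-year figures quoted in the article.
x86_rack_tco = 560_000     # AWS TCO calculator estimate for the Dell rack
bamboo_rack_tco = 200_000  # Bamboo's own claim, not yet independently tested

savings = tco_savings(x86_rack_tco, bamboo_rack_tco)
print(f"Projected three-year savings: {savings:.0%}")
```

On those inputs, Bamboo’s projection works out to a three-year cost roughly 64 percent lower than the x86 rack’s.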
There are few TCO studies for Arm servers with which to compare Bamboo’s projections. A 2014 analysis of Hewlett-Packard’s (now HPE) first 64-bit ARMv8 server cartridge, the ProLiant M400, by analyst Patrick Moorhead [PDF] may have set at least some precedent. Although the M400 was a “cartridge” rather than a 1U, when used in a Web server scenario, Moorhead projected that the three-year TCO of the M400 would be 35 percent lower than TCO of a similarly performing 1U x86 server. Moorhead’s research included input from Sandia National Labs.
Craythorne asserted that a B1008N could save customers up to 50 percent in acquisition costs, at least 75 percent in energy consumption, and 80 percent in rack space on account of higher server density. He said his company had conducted internal testing and produced graphs indicating that those tests involved publicly known benchmarks. Bamboo has yet to release hard numbers, but Craythorne said it would do so in the near future.
He also admitted that part of Bamboo’s TCO could be spent on recompiling some applications originally designed for x86 to run on Arm.
Every Arm processor is an implementation of a processing architecture that contains intellectual property licensed from Softbank Group-owned Arm Holdings, and it is usually fabricated by a third-party manufacturer. As a result, almost every Arm processor can be said to have its own architecture, at least insofar as the non-licensed part is concerned. Bamboo calls its own version PANDA. Naturally, not having been a PC in its distant past, it omits the often-requisite expansion ports, leaving behind two pairs of QSFP Ethernet ports (one for each blade).
“This is the key part that a lot of people struggle to understand about our product,” admitted Craythorne. In PANDA, the CPU is limited to managing and executing the application, with access to both DRAM and NVMe non-volatile storage. But networking and storage tasks are handled exclusively by a co-processor, and the built-in network switch replaces a top-of-rack switch.
“We’ve got a non-blocking L3 switch inside every single blade with a chunk of the networking inside the blade,” noted Siobhan Ellis, Bamboo’s director of product management. “So to a certain extent we don’t need to send network traffic outside the blade.” Optionally, both QSFP ports on a blade may be connected to a switch, or one port may be connected to a switch and the other to the blade next door. “That cuts down on the number of external switches that you need in the rack.”
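A rough sketch of the cabling arithmetic behind that last remark follows. It assumes two QSFP ports per blade, as described above, and compares the two wiring options Ellis mentions. The function name and the simple pairing model are illustrative assumptions, not Bamboo’s documented topology.

```python
def external_switch_ports(blades, chain_neighbors):
    """Count external switch ports consumed by a group of blades.

    Assumes two QSFP ports per blade. When chain_neighbors is True, each
    blade sends one port to the external switch and uses the other to
    link to the blade next door (hypothetical pairing model).
    """
    ports_per_blade = 2
    if chain_neighbors:
        return blades  # one uplink per blade; the other port goes to a neighbor
    return blades * ports_per_blade  # both ports go to the external switch


for blades in (2, 8, 16):
    direct = external_switch_ports(blades, chain_neighbors=False)
    chained = external_switch_ports(blades, chain_neighbors=True)
    print(f"{blades:>2} blades: {direct} switch ports direct, {chained} chained")
```

Under these assumptions, linking neighboring blades halves the number of external switch ports required, which is one way fewer external switches could serve the same rack.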