If it is to be war, if there is to be a competitive market in upper-tier, server-class CPUs once again, as though it were 1999, the battlefield this time will be “vector extensions.” These are the expansions that clever designers can make to a processor’s basic instruction set, enabling it to perform a single, often custom (or customizable) operation across a broad array of data rather than on the contents of a single register.
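As a sketch of the idea, GCC and Clang’s `vector_size` extension (a toolchain assumption here, not anything Arm- or Intel-specific) lets one C expression operate on four data lanes at once, where plain C would need a four-iteration loop:

```c
#include <string.h>

/* GCC/Clang vector extension: a 128-bit value holding four 32-bit ints.
   The single '+' below compiles to one SIMD add where the hardware
   supports it, rather than four scalar adds. */
typedef int v4si __attribute__((vector_size(16)));

/* Add two four-element arrays with one vector operation. */
void add4(const int *a, const int *b, int *out) {
    v4si va, vb, vc;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;               /* one operation, four lanes */
    memcpy(out, &vc, sizeof vc);
}
```

The same source builds down to Neon, SSE, or plain scalar code depending on the target, which hints at the portability problem a vector-length-agnostic scheme like SVE sets out to solve.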
In announcing the latest editions of its Neoverse server chips recently, Arm made it abundantly clear: Its addition of the Scalable Vector Extension (SVE) to the new Arm Neoverse V1 and Arm Neoverse N2 is mostly, if not entirely, about providing its partners – the companies that design, produce, and sell physical chips based on Arm’s IP – the leverage they need to go up against Intel’s third-generation Xeon.
“Neoverse V1 platform supports IP capabilities necessary to target markets like high-performance and exascale computing,” Chris Bergey, Arm’s senior VP and general manager for infrastructure, said in a briefing with reporters. “We give our partners the flexibility to incorporate on-die, specialized accelerators. They also have the freedom to right-size the I/O and leverage chiplet and multi-chip capabilities – to push core count and performance but by combining smaller dies that offer better yields and costs.”
In Arm terminology, a “chiplet” is one lithographically printed “page” of a processor component. A substrate may conceivably support two or more chiplets, depending on the requirements of the licensee/partner.
The Arm Neoverse V1, N2 Vector Strategy
Arm’s Neoverse portfolio is what has made it possible for chip makers like Ampere to go toe-to-toe with Intel and AMD in the data center market. Up to now, it has been made available to licensees in two forms: Neoverse E1 (“Helios”), which blends out-of-order instruction execution with simultaneous multithreading; and Neoverse N1 (“Ares”), which aims to balance raw performance against power consumption. Those were two niches of the infrastructure market that partners such as Ampere and Marvell could tackle comfortably, without too much risk.
Arm is now adding two more cores to the mix: N2, designed to improve on N1’s performance levels; and the confusingly named V1. No, the “V” does not stand for “version,” which would be confusing when you’re talking about the second generation of a platform; it stands for “vector.” As Bergey’s remarks made clear, it’s with this core that producers will have their best opportunity to date to build specialized processors – perhaps in low volumes – that could become more enticing to customers in very particular classes. Think of academic institutions running quantum computing simulators, or 5G service providers building localized cloud computing capability at the customer edge.
SVE, said Bergey, “enables an entirely new set of vector programming and data manipulation tools for Arm developers.” One example he offered involves Arm’s previous implementation of Single Instruction Multiple Data (SIMD, pronounced “sim · dee”) execution. This is an early form of vectorization, in which a single instruction is executed simultaneously across a broad vector of data elements. (Today’s GPUs are essentially multithreaded SIMD engines.) Historically, SIMD pipelines have suffered from a phenomenon called “control flow divergence,” in which parallel computation threads take divergent paths and have difficulty converging again. Imagine a movie chase scene in which the pursuing car takes what appears to be a shortcut down a winding mountain path, only to end up climbing the mountain again and finding itself the car being chased.
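A hypothetical illustration of the problem: in the first loop below, each element may take either branch, so parallel lanes diverge. A common SIMD remedy, shown in the second function, is if-conversion – compute both paths for every lane, then blend the results through a per-lane mask, paying the cost of the untaken path. (Both functions are toy sketches, not Arm or Intel code.)

```c
/* Scalar loop with data-dependent control flow: each element may take
   a different branch, which is what stalls naive SIMD execution. */
void step_scalar(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        if (x[i] > 0.0f)
            y[i] = x[i] * 2.0f;   /* one path            */
        else
            y[i] = x[i] - 1.0f;   /* the divergent path  */
    }
}

/* If-converted form, roughly as a vectorizer would emit it: both paths
   are computed for every lane, and a per-lane mask selects the result. */
void step_masked(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        float pos = x[i] * 2.0f;           /* executed unconditionally */
        float neg = x[i] - 1.0f;           /* executed unconditionally */
        y[i] = (x[i] > 0.0f) ? pos : neg;  /* per-lane select          */
    }
}
```

The two functions compute identical results; the masked version simply trades wasted work on the untaken path for branch-free, lane-parallel execution.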
SVE2, Arm’s implementation of SVE that effectively replaces its older SIMD scheme, applies what the company calls “auto-vectorization.” Although it’s unknown exactly which auto-vectorization approach Arm’s engineers chose (Arm may license its IP to others, but it’s not open source), one likely methodology has been circulating in academic circles. Called “whole-function vectorization,” it’s essentially the idea that the shortcut paths around the mountain can be mapped in advance, ensuring that all of them converge at the right spot. All the time processors spend reconciling their execution paths can then be reclaimed.
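Whatever auto-vectorization scheme Arm settled on internally, SVE’s published programming model is vector-length agnostic: a loop is written once, and a “while less-than” predicate masks off lanes past the end of the data, so no scalar cleanup loop is needed. Below is a rough scalar model of that predicated loop structure – `VL` stands in for the hardware’s lane count (unknown at compile time on real SVE silicon), and the names are illustrative, not SVE intrinsics:

```c
#define VL 8  /* stand-in for the hardware vector length */

/* saxpy (y = a*x + y) structured the way an SVE loop is: strips of VL
   lanes, with a whilelt-style predicate deactivating lanes beyond n
   instead of falling back to a scalar tail loop. */
void saxpy_predicated(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i += VL) {
        for (int lane = 0; lane < VL; lane++) {
            int active = (i + lane) < n;  /* predicate: whilelt(i+lane, n) */
            if (active)
                y[i + lane] = a * x[i + lane] + y[i + lane];
        }
    }
}
```

Because the predicate, not the loop bounds, handles the ragged tail, the same binary runs correctly whether the hardware’s vectors are 128 or 2,048 bits wide – one reason compilers can auto-vectorize more loops for SVE than for fixed-width Neon.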
Bergey claims this reclamation of lost divergence time is largely responsible for what Arm asserts to be a 3.5x performance gain on SIMD tests, compared to the “Neon” SIMD technology the company used for Neoverse N1.
Stepping on Ice Lake’s Achilles’ Heel
Earlier in April, Intel unveiled its highly anticipated third-generation Xeon processor series (“Ice Lake”). During that unveiling, Intel engineers noted that vector extensions remain a key feature of the Xeon architecture. For example, Intel has significantly extended the processors’ encryption and decryption capabilities, enabling object code to be delivered to the chip in encrypted form.
“We’ve introduced vector instructions that actually improve the crypto performance by multiple-fold,” Sailesh Kottapalli, Intel’s chief architect, explained in a briefing with reporters. “We derive multi-fold improvements through some of these vector instructions we’ve introduced.”
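Intel did not detail those instructions in the briefing, but the parallelism they exploit is easy to model: in a counter-style cipher, every block is encrypted independently, so a loop over blocks has no cross-iteration dependence and vectorizes cleanly. The round function below is a toy mixer for illustration only – not real cryptography, and not Intel’s implementation:

```c
#include <stdint.h>

/* Toy block mixer (NOT a real cipher): XOR with the key, rotate, multiply. */
static uint32_t toy_round(uint32_t block, uint32_t key) {
    block ^= key;
    block = (block << 7) | (block >> 25);  /* 32-bit rotate left by 7 */
    return block * 0x9E3779B9u;
}

/* Each block is independent of every other, so a vectorizing compiler
   can process several blocks per instruction -- the general shape of
   the "multi-fold" gains vector crypto instructions deliver. */
void encrypt_blocks(uint32_t *blocks, int n, uint32_t key) {
    for (int i = 0; i < n; i++)
        blocks[i] = toy_round(blocks[i], key);
}
```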
But beyond a vague reference by Intel corporate VP Lisa Spelman to her company delivering a reference design that can be built upon and optimized, there was no message that extensibility through vector extensions is something third parties could accomplish themselves. By contrast, Arm’s Bergey emphasized customized acceleration in conjunction with SVE.
The addition of SVE to Neoverse V1 (codenamed “Zeus”), he remarked, “provides some big performance uplifts over [Neoverse] N1: 1.8 times over a range of vector workloads, double the floating-point throughput with SVE, and up to 4x the machine learning throughput from the other new instructions and improvements. Of course, SVE is delivering a new, high-performance, developer-friendly programming capability to HPC.”
We asked Bergey to clarify Arm’s role in driving server architecture at this point. Is it moving toward disaggregation, and if so, should we no longer assume that performance is mainly a function of the CPU or SoC?
“We continue to be working very hard on optimizing the architecture and applying new techniques,” he replied. “But I think one of the clear directions of the industry is tight coupling with accelerators… We’re going to continue to be able to achieve performance gains at the system level, because of tight coupling with accelerators, in addition to continuing to make the processors more performant as well.”
Though he declined to make any direct reference, Bergey’s explanation appeared intentionally compatible with Nvidia’s recent announcement that it will produce an Arm-based CPU, code-named “Grace,” intended to address the needs of AI in the data center, with shipments projected for early 2023. Nvidia already produces its own AI-oriented GPU accelerators, which have become so successful that investors and analysts now perceive graphics as its side business.
In its own announcement, Nvidia left out any direct references to Neoverse. It’s also conceivable that Nvidia’s strategy does not require it to have acquired Arm, should the UK government finally block the takeover. There certainly hasn’t been any “tight coupling” on this score yet.
But Arm appears to be intentionally painting a picture of a future disaggregated system architecture with Neoverse at its center and an ecosystem of vector extension architects orbiting it. If Arm can build this into more of a firm product feature than a hazy, indeterminate vision, it could successfully exploit one aspect of Intel design that probably cannot adapt to changing times: its sole dependency upon itself for core processing extensibility.