With the virtuous cycle of Moore’s Law having very likely run its course, Intel is looking for ways to boost its processors’ performance other than, as its co-founder once put it, “cramming” more transistors onto chips. We’ve seen how the company uses microcode to fast-track the execution of certain classes of code: for example, recognizing when code calls the open source OpenSSL library and pointing it to faster hardware-based microcode instead. Now, the CPU leader is looking to make performance inroads any way it can, even if it means one market segment at a time.
Yet with Intel’s DL Boost architecture for its Xeon server processors, launched in April, the company is attempting a curious new twist on this approach. On the surface, DL Boost is being presented as a feature utilized by certain 2nd Generation Intel Xeons (formerly code-named “Cascade Lake”) to fast-track the processing of instructions used in deep learning operations.
“Intel is trying to counter the perception that GPUs are required for deep learning inference by adding extensions to its AVX-512 vector processing module that accelerate DL calculations,” Marko Insights Principal Analyst Kurt Marko says.
Intel is stacking up its Xeon CPUs directly against Nvidia’s GPUs, which in recent years have seized the artificial intelligence (AI) hardware market and transformed the brand formerly associated with gaming and PC enthusiasts into a key player in high-performance servers in the data center.
“Inference, I don’t think, was ever really an all-GPU game,” Ian Steiner, senior principal engineer with Intel, tells Data Center Knowledge. “GPUs got a lot of attention for the things that were done in [machine learning model] training. Intel CPUs have been heavily used in inference for a number of years, so I think it’s more a perception thing than a reality thing.”
The Latest Perception Thing
The game of processor performance has always been to some extent “a perception thing.” During the heady days of Intel-vs-AMD, one manufacturer would claim a significant lead in a benchmark that folks respected and the other would counter-argue that what really matters is how performance feels to the end user.
But those performance gains were typically achieved (especially by Intel) through sophisticated techniques such as hyperthreading and on-board resource monitoring. With deep learning, Intel has grabbed an opportunity to claim a performance advantage by boosting a surprisingly unsophisticated feature of its CPUs, dating back to 1979 and the 8088 processor: 8-bit, fixed-point integer math.
“If the end user is okay with negligible accuracy loss, then they can take advantage of the performance boost that VNNI brings,” says Andres Rodriguez, a senior principal engineer with Intel’s Data Center Group.
VNNI stands for Vector Neural Network Instructions, Intel’s extension to AVX-512, the instruction set introduced in 2017 that provides low-level chip support for vector instructions. In processing, a vector instruction performs the same function upon a large array of data in parallel. Pipelining was first introduced as a method of performing vector instructions in GPUs, originally for shading polygons in 3D objects. That process was leveraged by sophisticated libraries such as Nvidia’s CUDA, converting “graphics processing” instructions into “general purpose” instructions (without even having to change the GPU abbreviation).
Adding VNNI to Xeon, Intel’s DL Boost opens up an avenue for AI developers to leverage Intel CPUs instead of Nvidia’s GPUs for one of the key functions that distinguishes deep learning from ordinary machine learning: quantization.
“We take a model that’s generally trained using floating-point (FP32) and we do some work to turn that model into INT8,” Steiner says.
The Quantization Threshold
“Machine learning” and “deep learning” are phrases so often used in conjunction with one another that many folks have come to perceive them as synonymous. What often goes unaddressed is the question, “Deep for whom?”
There’s a point in time for every computer — even a supercomputer — when the size and scope of the machine learning training model grows linearly larger and becomes exponentially more difficult to process. This is the point where the process requires quantization — a reassessment of how many bits are required for the machine learning algorithm to infer a likely result or pattern from a training model.
In a sense, quantization takes a model where something has been deeply learned and converts it into a system where that something is more shallowly remembered. The 32-bit or even 64-bit floating-point values are substituted with 8-bit fixed-point (INT8). Think of a high-resolution photograph from a digital camera being converted into a much smaller GIF for sharing over social media.
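As a rough illustration of that substitution, here is a minimal Python sketch of one common approach, symmetric scaling, in which the largest weight in a group is mapped to 127 and everything else is rounded to the nearest step. The function names and the sample weights are made up for this example, and real conversion tools are more involved than this.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    # Assumption: symmetric scaling, so the largest magnitude maps to 127.
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.52, -1.27, 0.003, 0.9]   # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding step is where the accuracy loss comes from: each value can be
# off by up to half of one quantization step.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this sketch the tiny weight 0.003 rounds all the way to zero, which is the GIF-style loss of detail in miniature.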
For quantization, Steiner says, “INT8 is the predominant method used in the industry.”
“There are tools that people are working to develop to try to convert models from FP32 down to INT8,” he says. “But today, there’s a human element involved, so part of the cost of going from FP32 to INT8 [can be measured in] human time.”
Because machine learning models consume a large amount of data, and deep learning models much more so, deep learning is often characterized as a “Big Data” problem. It’s not. Big Data, with the capital letters, refers to the class of tasks that bogged down relational databases and which today are handled with analytics tools. By contrast, deep learning is one huge math problem broken down into smaller chunks (in the case of INT8, much smaller ones).
A visual example of matrix multiplication (Source: Wikimedia Commons)
The training of deep learning models involves a process you may have learned about in a high-school math class: matrix multiplication. It’s the mass multiplication of rows from one matrix with columns from another. When a learned pattern is stored in memory as a matrix and successive images of that pattern are imprinted on top of it, matrix multiplication ensures that the common factors of those patterns are “remembered.” It’s the process that machine learning uses in simulated learning to substitute for whatever it is that makes neurons learn and recall information stored in the brain.
“Imagine you’ve got a neuron inside a deep learning model,” Intel’s Steiner says. “It’s got a whole bunch of different inputs and it’s trying to understand an output based on those inputs. So for each of the different inputs, there’s a weight associated with it. What the matrix multiplication is doing is, for one of those multiplications, we’re looking at all of those inputs, we’re multiplying it by all the different weights and we’re accumulating up the final answer and trying to figure out what the output value for that specific neuron is going to be.”
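To make the arithmetic concrete, here is a short Python sketch of what Steiner describes: a single neuron’s output as a multiply-and-accumulate over its inputs, and a matrix multiplication as that same step repeated for every row and column. The names and numbers are invented for illustration, and the sketch leaves out the bias terms and activation functions that a real model would normally apply.

```python
def neuron_output(inputs, weights):
    """Multiply each input by its weight and accumulate the sum."""
    return sum(x * w for x, w in zip(inputs, weights))


def matmul(a, b):
    """Multiply matrix a by matrix b, both given as lists of rows."""
    columns = list(zip(*b))  # transpose b so each column is easy to walk
    return [[neuron_output(row, col) for col in columns] for row in a]


inputs = [[1, 2, 3]]                 # one sample with three inputs
weights = [[4, 7], [5, 8], [6, 9]]   # three inputs feeding two neurons
outputs = matmul(inputs, weights)    # one output value per neuron
```

Each entry of the result pairs one row of inputs with one column of weights, so the first neuron here receives 1*4 + 2*5 + 3*6.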
In the real-world neuron model first posited by Santiago Ramon y Cajal in 1906, inputs are electrical impulses that at some point trigger the neuron to fire. That output correlates to a reproduced memory, or at least a part of one in a concerted pattern. In the deep learning computer model, the grid of “weights” represents the probability that a neuron (in this case, an object in memory) will be triggered to fire. The grid comprising these weights is the product of the deep learning operation, using matrix multiplication.
Those weights are “trained” using floating-point arithmetic, usually the 32-bit variety. What DL Boost ends up boosting is the arithmetic performed on that grid once the values of its weights have been converted into INT8. It does this by fusing a sequence of three AVX-512 instructions (an 8-bit multiplication, a 16-bit accumulation, and a 32-bit write) into a single instruction. In Intel mnemonics, that instruction is VPDPBUSD.
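The combined multiply-and-accumulate step can be imitated in plain Python. The sketch below is an emulation of the arithmetic as Intel documents it for VPDPBUSD, not the instruction itself: in each 32-bit lane, four unsigned 8-bit values are multiplied by four signed 8-bit values, and the sum of the products is added to a 32-bit accumulator. The function names are hypothetical, and treating the unsigned operand as activations and the signed operand as weights is an assumption made for the example.

```python
def to_int32(value):
    """Wrap a Python integer to the signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def dot_accumulate_lane(acc, unsigned_bytes, signed_bytes):
    """One 32-bit lane: acc plus four (unsigned 8-bit * signed 8-bit) products."""
    assert len(unsigned_bytes) == len(signed_bytes) == 4
    assert all(0 <= u <= 255 for u in unsigned_bytes)
    assert all(-128 <= s <= 127 for s in signed_bytes)
    products = [u * s for u, s in zip(unsigned_bytes, signed_bytes)]
    return to_int32(acc + sum(products))


def dot_accumulate(acc_lanes, a_bytes, b_bytes):
    """Apply the lane operation across a register (16 lanes for 512 bits)."""
    return [
        dot_accumulate_lane(acc, a_bytes[4 * i:4 * i + 4], b_bytes[4 * i:4 * i + 4])
        for i, acc in enumerate(acc_lanes)
    ]


activations = [1, 2, 3, 4] * 16      # 64 unsigned bytes (made-up data)
weights = [10, -10, 5, -5] * 16      # 64 signed bytes (made-up data)
accumulators = dot_accumulate([100] * 16, activations, weights)
```

A 512-bit register holds 64 such bytes, so a single pass covers 16 lanes of four products each. In hardware all of the lanes are processed together, which the Python loop only imitates one lane at a time.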
Implementing this single instruction requires rewriting source code, so it does take human effort. It’s not automatic and there is a loss of accuracy reflected in the final confidence factor, which Intel’s Rodriguez tells us may range from 0.1 percent to 0.6 percent.
In exchange, users get a performance boost, he says, but performance boosts typically need to be measured before people believe they can feel them.
Measuring the Narrowing Gap
In May, Intel published a blog post boasting a performance advantage of 0.4 percent (perhaps statistically negligible) for a server based on a 56-core, hyperthreaded Xeon Platinum 9282 [“Scalable” form factor pictured above, in the middle] over Nvidia’s published results for its server with a Tesla V100 GPU on board, using the public domain ResNet-50 image recognition benchmark. Nvidia claims the V100 is capable of 100 TFlops (10^14 floating-point operations per second).
“DL Boost provides a tangible improvement in Cascade Lake DL performance that should be ‘good enough’ in many cases,” analyst Kurt Marko says. “However, [deep learning]-optimized chips like the [Nvidia] T4, Google TPU (and Edge TPU), and FPGAs [field-programmable gate arrays] will provide better performance-per-watt, and even overall performance depending on the DL model.”
Intel’s specifications show its Xeon 9282 posts a thermal design power (TDP) of 400W, a rating of the heat the cooling system must be able to dissipate for the chip to maintain a nominal operating temperature under sustained load. Nvidia claims a V100 GPU for NVLink interfaces operates within a 300W power envelope. Granted, the two power metrics don’t mean the same thing. (Rarely have competitive chip manufacturers adopted the same metric willingly.) Also, Nvidia’s test did not take into account the power consumption of the CPU in the system with the GPU.
But Intel’s numbers suggest that a server without a GPU could perform at least as well – and perhaps be generally as efficient – as a server with one, at least for certain tasks. If that’s true for deep learning applications, suddenly the possibility arises for other vector operations being expedited by on-chip arithmetic using the most ordinary of data types. (Graphics, anyone?)
In a company blog post, Paresh Kharya, director of product marketing for Nvidia, launched a counter-offensive, claiming Intel’s test actually proved the value of Nvidia’s price/performance argument. Citing an unconfirmed report that a single Xeon 9282 could retail for as high as $100,000 (Intel does not make its high-end MSRPs public), Kharya wrote, “Intel’s performance comparison also highlighted the clear advantage of NVIDIA T4 GPUs, which are built for inference. When compared to a single highest-end CPU, they’re not only faster but also 7x more energy-efficient and an order of magnitude more cost-efficient.”
Kharya’s claims are backed up by a September 2018 study published by the Azure team at Microsoft that shows a wide gap in overall performance for servers with GPUs over servers without — a gap that DL Boost alone couldn’t possibly bridge. With three-node GPU clusters besting 35-pod CPU clusters in performance with benchmarks including ResNet-50 by as much as 415 percent, the Azure team wrote, “the results suggest that the throughput from GPU clusters is always better than CPU throughput for all models and frameworks, proving that GPU is the economical choice for inference of deep learning models.”
But even then, the Azure team added, “for standard machine learning models where number of parameters are not as high as deep learning models, CPUs should still be considered as more effective and cost efficient.” This means that machine learning models that don’t require quantization may perform well enough on CPUs alone. It also suggests that by specifically addressing the quantization case, the threshold leading from machine learning to deep learning, Intel may yet narrow that gap measurably.
“I think [Intel] will have to cherry-pick examples and configurations to have any sort of credible data,” Marko says. “But I suspect, when you pair a smaller, cheaper CPU with a T4 add-on versus a larger, more expensive standalone CPU, the price/performance advantage will always be with the GPU. Indeed, it’s why Apple and Qualcomm have added neural net accelerator modules to their mobile systems-on-a-chip (SoC) rather than using the chip space for a more sophisticated (deeper pipeline, larger vector unit) CPU core.”
“One of the challenges that Nvidia has with its GPUs – and even Intel has – is we all have to face Amdahl’s Law,” Intel’s Rodriguez says. “GPUs usually have a lot more cores — weaker cores, but more of them — than CPUs. You can use these cores not just for deep learning but also for classical machine learning algorithms and all your regular workloads. What this provides is one hardware platform that can be used across the spectrum.”
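For readers unfamiliar with it, Amdahl’s Law says the speedup from adding cores is capped by whatever portion of a job cannot be run in parallel. A minimal Python sketch of the standard formula follows. The 90 percent parallel fraction and the core counts are hypothetical figures chosen for illustration, not measurements from Intel or Nvidia, and the formula ignores how fast each individual core is.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only part of a workload can be spread across workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workload in which 90 percent of the work can run in parallel.
fewer_cores = amdahl_speedup(0.9, 56)
many_cores = amdahl_speedup(0.9, 5000)
```

Under those assumptions, neither configuration can exceed a tenfold speedup, no matter how many cores are added, because the serial 10 percent still has to run on a single core.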
If Intel ends up holding the high cards in this battle, this could be why. A server with a high-end GPU will always be delegated the highest-capacity workloads in the data center. By contrast, a server with a CPU designed to be leveraged for certain high-capacity workloads may yet be co-opted for others, but on a wider scale, so it won’t be relegated to one corner of the facility, reserved but mostly untouched. Intel’s discovery of a speed boost by tapping into the least sophisticated arithmetic of its processor portends the possibility that other pockets of long-buried, recently untapped power may yet be uncovered.
Perhaps Nvidia may yet come to regret the “GP” coming to mean “general purpose” after all.
We acknowledge the invaluable contribution of Marko Insights’ Principal Analyst Kurt Marko in helping to gather and explain the pertinent data for this report.