As Moore’s Law runs up against the perhaps impenetrable barrier of physics, where transistors simply cannot be miniaturized any further, semiconductor makers are looking to off-die accelerators to sustain the perception of steady performance increases. The two giants in the space — Altera (now part of Intel) and Xilinx — are dueling over the FPGA, an easily reconfigurable matrix of logic blocks and interconnects that can substitute for oft-invoked software libraries, boosting throughput by orders of magnitude.
In an effort to shift the fulcrum in this see-saw battle away from Intel’s natural advantages, today Xilinx made a move to realign its FPGA products around a software-like ecosystem, invoking that supremely successful metaphor from the open source networking and infrastructure realm: the stack. Xilinx’ new Reconfigurable Acceleration Stack is not a new product but a new framework, one that could shift the focus away from repairing the cracks and dents in Moore’s Law toward writing a few new laws to take its place.
“You want to see an order of magnitude of acceleration, or at least efficiency, per watt over a CPU,” said Andy Walsh, Xilinx’ director of strategic market development, in an interview with Data Center Knowledge.
“Two factors are making acceleration more and more attractive: One is, of course, Moore’s Law and the lessening of the goodness coming out of every new [product] cycle of the CPU,” continued Walsh. “Couple that with the fact that there are a bunch more of workloads that are very compute-intensive. Data analytics is one of them. Machine learning, inference, and AI are others. Transcoding is another one. These are things that are driving the need to get more throughput and efficiency.”
Specific Cases of Common Workloads
Walsh listed three categories of targeted workloads — machine learning, video transcoding, and data analytics — that will be treated as first-order frameworks in Xilinx’ new, to borrow that phrase again, stack. For example, a deep neural network (DNN) filters observations through multiple layers of matrices in order to deduce underlying patterns, then feeds those learned patterns forward so they can be recognized sooner.
DNN is one of the key research fields addressed by Caffe, UC Berkeley’s deep learning framework, and it’s an example of the frameworks Xilinx’ new RAS will support. Walsh said Xilinx’ own DNN implementation is integrated with Caffe, so developers can start building DNNs essentially the way they’re currently doing, without making adjustments for the infrastructure. FFmpeg libraries will be supported for common video transcoding applications.
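That layered matrix filtering can be sketched in a few lines. The snippet below is a generic illustration, not Xilinx’ DNN library: a hypothetical three-layer forward pass in NumPy, with random weights standing in for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Nonlinearity applied between layers of matrix filtering
    return np.maximum(x, 0.0)

# Three layers of weight matrices: each layer filters the
# previous layer's output, as described above.
layers = [rng.standard_normal((16, 32)),
          rng.standard_normal((32, 8)),
          rng.standard_normal((8, 4))]

def forward(observation, layers):
    activation = observation
    for weights in layers:
        activation = relu(activation @ weights)
    return activation

obs = rng.standard_normal(16)       # one 16-feature observation
result = forward(obs, layers)       # 4 output values, e.g. class scores
```

In practice the weights come from training; an accelerator’s job is to execute exactly this chain of matrix multiplications at high throughput.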
“We start from the application developer’s perspective, with the framework they use today, as we create a stack that’s easy to turn into an 80-percent solution,” said Walsh, having explained that the remaining 20 percent is provided by customer customization.
RAS will also feature support for OpenStack, particularly for expediting its fundamental job functions: cloud service provisioning and management. Those benefits should become available to Xilinx’ hyperscale customers, the company said, with OpenStack’s upcoming Ocata release, currently scheduled for late February.
Fat Man vs. Little Boy
One key distinction between Xilinx’ and Altera’s FPGA architectures concerns data precision, and there are competing arguments in favor of either side. Altera touts the ability to achieve 32-bit “single precision” or 64-bit “double precision” floating point arithmetic. But by its own admission, setting up the arithmetic logic to work within either of those frameworks is complex and difficult.
“To implement floating point, large barrel shifters, which happen to consume tremendous amounts of the programmable routing (interconnect between the programmable logic elements), are required,” states a June 2014 Altera white paper dissecting the meaning behind FPGA vendors’ performance claims. “All FPGAs have a given amount of interconnect to support the logic, which is based on what a typical fixed-point FPGA design will use. Unfortunately, floating point does require a much higher degree of this interconnect than most fixed-point designs.”
Xilinx’ Walsh argues that his firm’s approach, which focuses on smaller, 8-bit integer blocks, provides higher performance in the end. Altera has hardened its 32-bit FP block, which he admitted gives Altera’s design advantages in certain areas, particularly in training neural net models. But specifically in machine learning inference — one of the hottest use cases for accelerators today, where it comes time for the trained model to yield results — Walsh argues that 32-bit FP works against efficiency.
“With machine learning inference, and how the model is trained, 8-bit integer is basically the target today,” he said, “based on the ability to efficiently operate the neural network.” He added that all the evidence on AI trends points towards 8-bit, or even 4-bit, values being used to train neural net models.
“Today, it’s fair to say that the game is best measured at the 8-bit integer, fixed point data type,” he said. “Comparing discrete SDKs from Altera and Xilinx, there is 2 to 6 times the computing-per-watt benefit in that type of workload.”
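The efficiency argument for 8-bit inference comes down to quantization: a trained model’s 32-bit floating point weights are mapped onto int8 values plus a scale factor, so each value takes a quarter of the storage and the multiply-accumulate hardware can be far smaller. The sketch below shows generic symmetric linear quantization; it illustrates the technique in general, not Xilinx’ tooling.

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric linear quantization: map the FP32 range onto [-127, 127]
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP32 approximation of the original weights
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)
# q occupies 1 byte per weight instead of 4, and inference can run
# on integer arithmetic -- the efficiency case made above.
```

The trade-off is a small rounding error per weight, which is why 8-bit (and even narrower) formats are attractive for inference while training has traditionally leaned on floating point.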
Making Acceleration Happen Faster
Xilinx hopes its new acceleration stack will act as an “accelerator” in the marketing sense too, bringing customers up to speed with FPGAs more readily.
“One of the pioneers in this space, four years ago, took maybe three years to develop and deploy not only the hardware but also the software infrastructure, and the provisioning and management infrastructure,” explained Steve Glaser, Xilinx’ senior vice president for corporate strategy and marketing. “It was quite a lot of work.”
The adoption time Xilinx customers experience today is 12 to 18 months, said Glaser. The company’s goal with RAS is to shave off one-third, or maybe one-half, of that development and deployment time. It plans to accomplish this by providing a platform that is 80 percent pre-configured with the libraries, and the tools for adopting those libraries, for the three major use cases upon which RAS is focused.
Software and services applicable to Xilinx’ Reconfigurable Acceleration Stack are now available for download.