About two years ago, Facebook’s infrastructure engineers, the team responsible for designing and running all the technology inside its data centers, realized that the platform was consuming computing resources so quickly that they wouldn’t be able to get away with just three huge data centers per region for much longer. In some of their regions – the company has now announced a total of 15 – they needed to get to six facilities.
That would be problematic. They had designed the 100Gbps-based network fabric interconnecting data centers within each of the regions to support three data centers maximum. They would have to redesign the fabric from scratch.
Last week, Facebook announced results of the work that started then: an entirely new fabric topology, a new hardware switch, a new “fabric aggregator,” and a rejiggered version of its open source network software FBOSS. The new design quadruples network capacity available for shuffling data from one data center to another in a single location and achieves it without switching to 400G links. The company also unveiled some of the latest hardware that’s been the culprit behind this latest network rethink.
New Data Center Video, AI Hardware
Lately, the biggest leaps in demand for computing capacity at Facebook have been driven primarily by two things: video and machine learning (the leading class of computing techniques for AI). As Vijay Rao, the company’s director of technology and strategy, laid out in a keynote at last week’s OCP Summit in San Jose, the vision for video is to continue pushing it in the direction of being a social, interactive experience.
That applies not only to live videos users stream to their friends (the ones where the thumbs-ups and the smileys bubble up the screen as they come in) but also to professionally produced content like The Real World, the 1990s hit reality TV show Facebook is resurrecting together with its original producer MTV for Watch, its most recent shot at original content. According to Rao, more than 400 million people a month spend at least one minute on Watch.
The neural networks Facebook builds for its many machine-learning applications keep getting bigger and bigger, needing more and more computing resources. Hundreds of thousands of bots and assistants work on the platform today, Rao said, making about 200 trillion predictions per day. “We train our [computer vision] models with more than 3.5 billion images,” he said. Facebook’s neural network-based translation systems now serve translations in more than 4,000 “language directions.” (Spanish to English is a direction.)
Vijay Rao, Facebook director of technology and strategy, holding up an OCP accelerator module during a keynote at the 2019 OCP Summit
Video and machine learning are not only demanding more computing hardware, they’re driving the need for new types of computing hardware – namely accelerators.
The rise of machine learning has made accelerators into a whole new class of hardware to join compute, memory, storage, and network as the basic building blocks of cloud computing infrastructure. Facebook has been investing heavily in developing purpose-built hardware for use with GPU accelerators to train machine learning models and with accelerator ASICs for inference and video transcoding, Rao said. It’s also defined a standard set of specifications for the accelerators themselves, which vendors can use to design accelerator hardware for Facebook and others who choose to use the specs, published through the Open Compute Project. (Microsoft is participating in the accelerator spec project.)
At the summit, Facebook unveiled Zion, a new machine-learning hardware system that consists of an eight-socket modular server working in tandem with a platform that packs eight GPU accelerators for training ML models.
Building blocks for Facebook's AI training data center hardware.
The company also showed off its new hardware for AI inference, consisting of Kings Canyon inference modules and a Glacier Point v2 “carrier card” for those modules, two of which slide into Facebook’s Twin Lake server blades, a pair of which then slides into its Yosemite v2 server chassis.
Building blocks for Facebook's data center hardware for AI inference
Ever thinking about modularization, Facebook hardware engineers designed their new Mount Shasta video transcode module to fit into the Glacier Point v2 card, so the transcoding can be done using the same Glacier Point/Twin Lake/Yosemite combination as the Kings Canyon inference system.
Building blocks for Facebook's data center hardware for video transcoding
The New Facebook Data Center Network Fabric
It’s all this new hardware that’s generating a volume of in-region network traffic the three-data center fabric is having trouble keeping up with, Omar Baldonado, director of software engineering on Facebook’s network infrastructure team, said while speaking at the summit.
The event where Facebook announced all this (an annual conference showcasing the latest data center technology developments within the Open Compute Project’s orbit) comes less than five years after the company launched, in Iowa, the first data center running its current-generation network fabric, which Baldonado described as “the classic Facebook fabric.” It has been less than three years since Facebook revealed its current-generation network hardware designs (the Backpack fabric switch and the Wedge 100 top-of-rack switch) and just one year since it announced the Fabric Aggregator, the massive rack-size network switch used to shuffle traffic between data centers in a region. This cadence illustrates just how quickly and radically the hyperscale platform’s infrastructure needs change.
According to Baldonado, F16, the new fabric, has four times the capacity of its predecessor. It is also more scalable and simpler to manage, and it sets his team up to keep expanding: “we are better equipped for the next few years,” he wrote in a blog post.
The current F4 fabric’s hardware building blocks, Wedge 100 and Backpack, are 100G switches. To increase bandwidth, Facebook could theoretically use the same fabric topology but upgrade to 400G switches, Baldonado said. But 400G optical components are not easy to source at Facebook’s scale. Plus, 400G ASICs and optics would need a lot more power, and power is a precious resource at any data center site. So, they built the F16 fabric out of sixteen 128-port 100G switches, achieving the same bandwidth as four 128-port 400G switches would.
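The equivalence between the two options is simple multiplication, and a minimal sketch can make it concrete. The function name below is hypothetical, and the calculation counts only raw port bandwidth (switches × ports × port speed); it ignores oversubscription, the split between uplinks and downlinks, and any other detail of Facebook’s actual design.

```python
def aggregate_capacity_gbps(switches: int, ports_per_switch: int, port_speed_gbps: int) -> int:
    """Raw aggregate port bandwidth of a set of identical switches, in Gbps.

    Illustrative helper only: it multiplies switch count, port count, and
    port speed, and models nothing else about the fabric.
    """
    return switches * ports_per_switch * port_speed_gbps


# Option the article says Facebook chose: sixteen 128-port 100G switches.
f16_style = aggregate_capacity_gbps(switches=16, ports_per_switch=128, port_speed_gbps=100)

# Option it passed on: four 128-port 400G switches.
four_400g = aggregate_capacity_gbps(switches=4, ports_per_switch=128, port_speed_gbps=400)

print(f"16 x 128 x 100G = {f16_style / 1000} Tbps")
print(f" 4 x 128 x 400G = {four_400g / 1000} Tbps")
print("Same raw bandwidth:", f16_style == four_400g)
```

Both configurations come to 204.8 Tbps of raw port bandwidth, which is why more, slower links can stand in for fewer, faster ones when 400G optics are hard to source and costly in power.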
Using the same Broadcom Tomahawk 3 ASIC that would’ve been used for a 400G fabric, F16 connects 16 single-chip planes with 100G link speeds. The top-of-rack switches are still Wedge 100, but Backpack is replaced with Minipack, Facebook’s new workhorse fabric switch. The single-ASIC Minipack design is half Backpack’s size and uses half the power, according to Baldonado. (More design details are on the Facebook Data Center Engineering blog.)
Facebook's 100G Minipack switch
HGRID, which replaces the Fabric Aggregator, the massive cluster of Wedge 100 switches used to direct traffic between multiple F4 fabrics in a region, is built out of the new Minipack switches. Six HGRIDs can combine to create a massive aggregation layer for an availability region with six data centers, each interconnected by a full F16 fabric.
The engineers also flattened the aggregation layer by doing away with “fabric edge pods,” used to connect F4 fabrics to Facebook’s network backbone and to other fabrics at a site. In F16, fabric spine switches connect directly to HGRIDs, which not only flattens the regional network for East-West traffic but also scales regional uplink bandwidth to “petabit levels per fabric.”
Here’s how F4 and F16 fabrics compare:
Facebook F16 and F4 fabric comparison
Two Switches, Two Vendors, One Purpose
As it’s done with its previous-generation switches, Facebook is contributing the Minipack design to OCP. It’s also contributing a similar switch, designed together with Arista Networks, its long-time network hardware vendor.
Minipack is produced by Edgecore Networks, but hyperscale cloud platforms such as Facebook prefer to have more than one source for as many pieces of their infrastructure as possible. That’s why Facebook collaborated with Arista on another switch that can do all the things Minipack can. This is a first in the relationship between the two companies. Facebook has traditionally only bought data center gear from Arista as an OEM.
This was the first time Arista had supplied Facebook with early prototypes of a product, and it supplied them at a higher volume than ever before, Anshul Sadana, Arista’s chief operating officer, said.
Both Edgecore and Arista have announced commercial versions of their Facebook switches.