Facebook engineers have designed a new way to interconnect the company’s enormous data centers within a single region.
As it builds data centers in more and more new locations around the world, Facebook is also building more data centers per location (or “availability region”) to meet its growing computing demands. This week, for example, the company said it would expand its Papillion, Nebraska, region from two to six data centers.
At Facebook, applications like search, AI, and machine learning are pushing even the highest-capacity networking switches on the market to their limits. They require shuffling vast amounts of data between servers within a region – much more than the amount of data that travels between Facebook’s global network and its users’ devices.
Each building in a region has several network fabrics, which talk to each other through a “fabric aggregation layer,” Sree Sankar, technical product manager at Facebook, explained. The engineers saw big scaling challenges on the horizon in the aggregation layer.
“We were already using the largest switch out there,” she said. “So we had to innovate.”
Sankar spoke about the new solution at the OCP Summit in San Jose, California. The annual conference is put on by the OCP Foundation, which oversees the Open Compute Project, the open source data center hardware design community launched by Facebook in 2011. The company is contributing the solution’s design to OCP.
To continue scaling the aggregation layer, Facebook would need three times more ports than it had at the time (about a year and a half ago), but simply scaling network capacity linearly wouldn’t be sustainable from a data center power perspective, Sankar said. Whatever the solution, it would have to be more energy efficient than what was in place.
Interconnecting Many Fabrics
They built a distributed, disaggregated network system, called Fabric Aggregator, which interconnects many of the Wedge100S switches Facebook designed earlier, and developed a cabling assembly unit to emulate the switch chassis backplane. The system runs on FBOSS (Facebook Open Switching System) software.
It took about five months to design, and the company has been rolling it out in its data centers over the past nine months.
The approach enables Facebook to scale aggregation-layer capacity in big chunks, and it also brought better network resiliency and higher energy efficiency. Because the system is made up of many copies of the same building block, the Wedge100S, any of the switches can go out of service, accidentally or on purpose, without affecting the network’s overall performance.
The system can be implemented in a number of different ways. One implementation described in a Facebook blog post is a Fabric Aggregator node (a node is simply a unit of bandwidth being added using the approach) that supports both regional (“east-west”) traffic and inter-regional (“north-south”) traffic.
The two layers – “downstream” for regional and “upstream” for inter-regional – are disaggregated, meaning each one can be scaled independently. The upstream layer’s purpose is to compress the interconnects with Facebook’s backbone network.
The downstream and upstream layers can contain a quasi-arbitrary number of downstream subswitches and upstream subswitches. Separating the solution into two distinct layers allows us to grow the east/west and north/south capacities independently by adding more subswitches as traffic demands change.
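The two-layer design can be sketched as a simple model. This is illustrative code, not Facebook's actual software; the class name, the assumption of 32 x 100G ports per Wedge 100S, and the capacity arithmetic are all hypothetical, but it shows the key property: the downstream (east-west) and upstream (north-south) pools grow independently.

```python
# Illustrative model of a Fabric Aggregator node (not Facebook's code):
# two independent pools of identical subswitches, each grown on its own.
WEDGE100S_PORTS = 32   # assumed 32 x 100G ports per Wedge 100S
PORT_SPEED_GBPS = 100

class FabricAggregatorNode:
    def __init__(self):
        self.downstream = []  # serves east-west (regional) traffic
        self.upstream = []    # serves north-south (backbone) traffic

    def add_downstream(self, n=1):
        # Growing east-west capacity touches only the downstream pool.
        self.downstream.extend("wedge100s" for _ in range(n))

    def add_upstream(self, n=1):
        # Growing backbone-facing capacity touches only the upstream pool.
        self.upstream.extend("wedge100s" for _ in range(n))

    def capacity_gbps(self, layer):
        return len(layer) * WEDGE100S_PORTS * PORT_SPEED_GBPS

node = FabricAggregatorNode()
node.add_downstream(4)   # grow east-west capacity...
node.add_upstream(2)     # ...independently of north-south
print(node.capacity_gbps(node.downstream))  # 12800
print(node.capacity_gbps(node.upstream))    # 6400
```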
Because the basic building blocks are identical across the system, taking one out of service for debugging is routine. As the blog post explains:
If we detect a misbehaving subswitch inside a particular node, we can take that specific subswitch out of service for debugging. If there is a need to take all downstream and upstream subswitches out of service in a node, our operational tools abstract all the underlying complexities inherent to multiple interactions across many individual subswitches.
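The graceful-drain behavior described above can be sketched as equal-cost path selection over the surviving subswitches. This is an assumed model, not FBOSS code: draining a subswitch shrinks the set of candidate next hops, flows are rehashed across the rest, and the node keeps forwarding at reduced capacity.

```python
# Sketch (assumed behavior, not FBOSS): draining one misbehaving
# subswitch removes it from the set of equal-cost paths; the rest
# of the node absorbs its traffic.
def active_paths(subswitches, drained):
    return [s for s in subswitches if s not in drained]

def pick_path(flow_id, subswitches, drained):
    live = active_paths(subswitches, drained)
    if not live:
        raise RuntimeError("node fully drained")
    # ECMP-style hash: each flow sticks to one live subswitch.
    return live[hash(flow_id) % len(live)]

node = [f"sub{i}" for i in range(8)]
drained = {"sub3"}                       # take sub3 out for debugging
path = pick_path("10.0.0.1->10.0.1.1", node, drained)
assert path != "sub3"                    # traffic avoids the drained unit
print(len(active_paths(node, drained)), "of", len(node), "subswitches active")
```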
The system is more efficient because it requires fewer ASICs per port as capacity grows. It provides higher port density and 60 percent higher power efficiency, Sankar said.
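One way to see why the ASIC count drops is a back-of-the-envelope comparison. The figures below are entirely hypothetical (the article reports only the 60 percent power-efficiency result, not these inputs): a modular chassis needs internal fabric ASICs on top of its line-card ASICs, while the Fabric Aggregator replaces that fabric stage with a passive cabling assembly, leaving only the subswitch ASICs.

```python
# Hypothetical arithmetic: ASICs needed to reach a given port count.
PORTS_PER_ASIC = 32  # assumed radix of a single-chip switch

def chassis_asics(ports):
    # A modular chassis needs line-card ASICs *plus* an internal
    # fabric stage (sizing here is invented for illustration).
    line_cards = -(-ports // PORTS_PER_ASIC)   # ceil division
    fabric = line_cards // 2 + 1               # hypothetical fabric ASICs
    return line_cards + fabric

def aggregator_asics(ports):
    # The cabling assembly emulates the backplane passively,
    # so only the subswitch ASICs remain.
    return -(-ports // PORTS_PER_ASIC)

for ports in (128, 512):
    print(ports, "ports:", chassis_asics(ports), "vs", aggregator_asics(ports))
```

Under these assumed inputs, the gap widens as port count grows, which is the direction of the efficiency claim, though the real savings depend on the actual hardware.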
The team designed a cabling assembly for the Fabric Aggregator. There are single-rack and multi-rack versions of the assembly, each with its own constraints and benefits, and four cabling configurations.
Next Challenge Around the Corner
For Sankar and her colleagues, the limits of what the Fabric Aggregator can do are already on the horizon. At 100G networking speeds, power budgets are already being stretched, which could spell major difficulty when 400G networking arrives. “It’s unsustainable for a 400-Gig data center,” she said.
Speaking at the summit, Andy Bechtolsheim, co-founder of Sun Microsystems and founder of Arista Networks, one of the largest suppliers of networking equipment to hyperscale data centers, cited a Dell’Oro Group forecast that the transition from 100G to 400G will start in 2019, ramp up in volume in 2020, and surpass 100G in total deployed bandwidth in 2022.
Facebook’s favored solution to the power problem is to include optics and ASICs in the same package, with modified I/O, instead of the current design, where optical modules are plugged into the switch. That would enable higher density and lower power, Omar Baldonado, director of software engineering on Facebook’s network infrastructure team, said. “Co-packaged optics is a solution that we strongly believe in,” he said.
Both Sankar and Baldonado called on networking vendors to accelerate development in this area.