Facebook today made two announcements about the software it uses to run its network. The first is that it's releasing Katran, the load balancer that keeps the social site from crashing and burning, as open source. In addition, it's offering details on the inner workings of the Zero Touch Provisioning tool it uses to help engineers automate much of the work required to build its backbone networks.
Although Facebook is known primarily for running the world's largest social network and as a major player in the advertising business, it's also a software company, because it has to be. Very few companies operate at Facebook's scale, and it faces unique challenges in designing for the out-of-the-ordinary traffic patterns of its social network. The company has little choice but to develop the software it uses in-house.
The good news is that Facebook's IT folks work and play well with others. Not only was the company the initial force behind the Open Compute Project, which shares the designs of data center products, but its software developers have also been willing to share the details of how their software works, as they've done today with the provisioning tool, and have released many of their most important efforts as open source. Just this month the company open sourced PyTorch, the software behind its machine learning and artificial intelligence projects.
While PyTorch seems to be a work in progress with some kinks yet to be worked out, Katran has been battle-tested and is ready to go off the shelf.
According to a blog post written by Facebook production engineer Nikita Shirokov and software engineer Ranjeeth Dasineni, Katran was designed to address shortcomings in the load balancing software the company had previously built, primarily from open source software, and had used for four years.
A load balancer for Facebook has to meet four criteria, they wrote. First, for performance and agility, it needs to be a software solution that runs on Linux. Second, it has to coexist with other services on the same server, removing the need for dedicated servers that are exclusive to the load balancer. Third, it must allow low-disruption maintenance, because maintenance and upgrades at Facebook "are a norm, not exceptions." Finally, it needs to offer easy instrumentation and debugging, to reduce the time spent troubleshooting issues.
Shirokov and Dasineni said that their first software-defined load balancer "fell short on the goal of coexistence with other services, specifically the backends."
To overcome that shortcoming, Facebook completely redesigned the forwarding plane for Katran by working some magic with two recent developments in the Linux kernel: XDP, which provides a high-performance, programmable network data path, and eBPF, the extended Berkeley Packet Filter.
"Katran is deployed today on backend servers in Facebook’s points of presence, and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies ... ," they wrote. "By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work."
The writers list a few "constraints" that were introduced to Katran for the sake of performance, and say, "we found these constraints to be fairly reasonable, and they did not block our deployment. We believe that most users of our library will find them easy to satisfy."
Katran is available for download under the GNU General Public License v2.0 on GitHub.
The details of Facebook's Zero Touch Provisioning tool came from a blog post written by James Quinn, an engineering manager for Facebook's Backbone Automation Tools, along with Facebook network engineers Joe Hrbek, Brandon Bennett, and David Swafford. The post points out that Facebook's networks span continents and include two parallel IP backbone networks. They are also constantly expanding to accommodate accelerating growth in internet-bound traffic and to meet even larger machine-to-machine demands.
In building networks, Facebook faced problems similar to those it had with load balancing. The provisioning systems it was using were not up to the scale and complexity they were required to handle.
"Ultimately, these challenges drove Facebook’s network engineers to develop a completely new approach for network deployment workflows," the writers said. "We called it Vending Machine, a name inspired by the machines that dispense candy and soft drinks. In the case of Facebook's Vending Machine, the input is a device role, location, and platform, and out pops a freshly provisioned network device, ready to deliver production traffic."
The blog says that the new framework has allowed Facebook engineers to move more quickly and to solve problems more creatively.
The Zero Touch tool is still a work in progress, but Facebook is already at work pursuing new goals that include orchestrating groups of Vending Machine device jobs to build or rebuild larger networks, and continuous automated rebuilds of its backbone network.
The blog doesn't indicate Zero Touch Provisioning's license, but the best guess is that it's currently proprietary. Don't be surprised, however, to find Facebook releasing it under an open source license in the future.