Opening a huge new data center provided Facebook with lots of new capacity for adding more servers and storage to support its growth. But it also presented some challenges, expanding the scope and geography of the company’s network. To address these issues, the Facebook data center team ran some large-scale simulations and used a new software tool to automate the deployment of new server capacity.
The new data center in Prineville, Oregon, forced Facebook to create a third region within its data center infrastructure, which had previously been concentrated in two groups of data centers in Silicon Valley and northern Virginia. Adding a region helps with latency, but creates challenges in synchronizing the Facebook application and data, as the engineering team noted in 2008 when Facebook opened its first Virginia facility. The opening of Prineville added another layer of complexity.
The effort to add the third region was dubbed Project Triforce, a name borrowed from the classic Nintendo game “The Legend of Zelda.”
“We needed to test the entire infrastructure in an environment that resembled the Oregon data center as closely as possible,” Facebook’s Sanjeev Kumar writes on the Facebook Engineering blog.
Staging via Cluster in Virginia
“The solution involved taking over an active production cluster of thousands of machines in Virginia and reconfiguring them to look like a third region,” Kumar writes. “Because of the geographical distance, the latency between Virginia and our master region in California was far larger than the latency expected between our new data center in Prineville and California. This was actually a good thing: it stressed our software stack more than we expected the Prineville data center to, and allowed us to quickly surface any potential latency problems that could arise when our Prineville data center came online.”
As with most challenges at Facebook, a key ingredient of the solution was custom software. The engineering team came up with a software program called Kobold to automate many of the testing and deployment processes.
“Kobold gives our cluster deployment team the ability to build up and tear down clusters quickly, conduct synthetic load and power tests without impacting user traffic, and audit our steps along the way,” Kumar writes. “Tens of thousands of servers were provisioned, imaged and brought online in less than 30 days.”
Facebook has published most of the designs of its new data center and servers as part of the Open Compute Project. It wasn’t immediately clear if the company had plans to open source the Kobold software tool.