As we have reported before, Facebook has one of the boldest approaches to testing its infrastructure resiliency. The Facebook data center team regularly shuts down entire sites to see how its application will behave and to learn what improvements can be made.
During his keynote this morning at the Facebook @Scale conference in San Jose, California, Jay Parikh, the company’s head of engineering and infrastructure, shared some of the big lessons his team has learned so far from these fire drills.
The idea to do these kinds of stress tests was born after Hurricane Sandy wreaked havoc on internet infrastructure on the East Coast in 2012, Parikh said. Many data centers went down and stayed down for prolonged periods of time.
Facebook data centers weathered the storm – its two East Coast sites in North Carolina and Virginia are far from the hardest hit areas – but the team started wondering what would happen if the social network lost an entire data center or availability region due to a disaster of similar proportions.
They created a SWAT team called Project Storm for this purpose, but the periodic “massive-scale storm drills” usually involve the entire engineering team and “a lot of the rest of the company,” Parikh said.
Needless to say, it’s not easy. A single Facebook data center processes tens of terabytes of traffic per second, draws tens of megawatts of power, and runs thousands of software services.
“This is a pretty hard problem, and it’s really hard to decompose this and figure out how are we going to solve this, how are we going to build a more resilient service,” he said. “Things didn’t go that well the first couple of times we did this.”
Facebook users didn’t notice that something went wrong, “but we learned a lot.”
One of the biggest things they learned was that traffic management and load balancing were “really, really hard” during an outage. Parikh displayed a graph of traffic during an early drill, and the patterns were chaotic.
If you’re an engineer and you see a graph like this, you conclude that you either have bad data, your control system is not working, or you have no idea what you’re doing, he said.
Once Parikh’s team got a better handle on the traffic management problem, they were able to get a more normal, boring graph.
Another big lesson was that it takes a long time to bring a data center back up after an outage. “When we take a data center or a region down, that actually happens much faster,” Parikh said.
It’s a similar lesson to the one you learn as a child, when you realize a toy is much easier to take apart than to put back together, he said. Only this is less like putting together a toy and more like putting together an aircraft carrier.
Facebook has developed an automated runbook that includes every step for turning a data center off or for turning it back on. The runbook includes both automated and manual tasks.
During each drill, the team times itself on every individual task to continue looking for improvement opportunities. “You really want to time this kind of like a pit stop at a race,” he said.
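Parikh did not describe how the runbook is implemented, so the following is only a minimal sketch of the general idea: an ordered list of steps, some automated and some manual, with every step timed so the slowest ones stand out. All names here (`Step`, `Runbook`, `slowest`, and the example step names) are hypothetical and do not come from Facebook's tooling.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # hypothetical: whatever performs or confirms the step
    manual: bool = False        # True if a human carries the step out


@dataclass
class StepResult:
    name: str
    manual: bool
    seconds: float


@dataclass
class Runbook:
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[StepResult]:
        """Run every step in order and record how long each one took."""
        results = []
        for step in self.steps:
            start = time.perf_counter()
            step.action()
            elapsed = time.perf_counter() - start
            results.append(StepResult(step.name, step.manual, elapsed))
        return results


def slowest(results: List[StepResult]) -> StepResult:
    """Return the step that took longest, the first candidate for improvement."""
    return max(results, key=lambda r: r.seconds)


if __name__ == "__main__":
    # Stand-in actions; a real drill would call orchestration tools here.
    drill = Runbook("bring-region-up", [
        Step("restore-power", lambda: time.sleep(0.01), manual=True),
        Step("start-services", lambda: time.sleep(0.02)),
    ])
    for r in drill.run():
        kind = "manual" if r.manual else "automated"
        print(f"{r.name} ({kind}): {r.seconds:.3f}s")
```

Comparing the per-step timings from one drill to the next is what makes the pit-stop analogy workable: each step has its own stopwatch, so improvements and regressions can be traced to a specific task.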
Proper tooling is one of Parikh’s three key tenets for building infrastructure at scale and planning around resiliency. The other two are commitment and embracing failure.
There has to be a set of leaders in the company who push their teams outside of their comfort zone so they can learn something new. This means the leadership team has to embrace failure.
Commitment is important in terms of going through with a drill regardless of what else is happening. Delaying a drill because of a product launch would be an example of breaking that commitment.
Product launches are going to happen, and you don’t know how your application will behave during a product launch if one of the data centers goes down. “There’s only one way to find out,” Parikh said.