A few months ago, Facebook added a whole new dimension to the idea of an infrastructure stress test. The company shut down one of its data centers in its entirety to see how the safeguards it had put in place for such incidents performed in action.
Jay Parikh, global head of engineering at Facebook, talked about the exercise in his keynote presentation at the company’s @Scale conference in San Francisco Monday.
“This is not a small thing,” he said. “This is tens of megawatts of power that basically we turned off for an entire day to test how our systems were going to actually respond.”
He didn’t specify which of Facebook’s data centers was shut down. It has its own facilities in Oregon, Iowa, North Carolina and Sweden, and leases wholesale data center space in California and Virginia.
The company ran some “fire drills” before the test to prepare. Some were skeptical that the team would actually pull the plug, but it was important that it did. “We turned the entire region off,” Parikh said.
And the prep work paid off. “It was actually pretty boring for us,” he said.
Not everything worked 100 percent, and the team put some improvements on the roadmap. But the overall system persevered, the applications stayed up, and Parikh’s team plans to continue running such stress tests.
An exercise like this falls under one of the key tenets of engineering at Facebook, which is embracing failure, Parikh said. Facebook encourages its engineers to take big risks, without being reckless, and doesn’t punish those who take them and fail.
“We don’t squash those,” Parikh said. Precautions are taken to minimize the consequences of failure, and the team spends a lot of energy on analyzing the causes of failure and on being able to recover quickly.