Jesse Robbins is a trained fireman. He has also managed some of the world’s largest Internet infrastructures. Robbins says the lessons of fire readiness can be applied to building reliable systems.
“You cannot learn the lessons of failure without experiencing it,” said Robbins, the co-founder and Chief Community Officer at Opscode. “That’s why we do fire drills.”
In a keynote at last week’s Cloud Connect conference, Robbins said that resiliency is a function of culture as well as engineering. That can be a difficult message for IT operations teams that view downtime as an enemy to be avoided at all costs – the Voldemort (“He Who Must Not Be Named!”) of the data center.
“Failure happens,” said Robbins. “You just have to design for it. Every organization must learn that this is going to occur.”
Robbins, who earned the moniker “The Master of Disaster” during his time on the infrastructure team at Amazon, points to incident reports as evidence of the prevailing sentiment about outages.
“If you search for ‘outage post-mortem’ on the Internet, what you find is people talking about how crazy and impossible the outage was,” he said. “It is always a perfect storm of impossible events. ‘We could never have known that there was this one latency defect.’ ”
The way to prevail is to assume that this type of hard-to-predict defect will eventually materialize, and to discover it in advance rather than being surprised when it reveals itself in a crisis. One way to accomplish this is “resilience engineering” featuring fault injection – introducing unexpected events to see how the system responds. The most famous example of this is the Chaos Monkey used by the Netflix engineering team.
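To make the idea of fault injection concrete, here is a minimal sketch in Python. It is not how Chaos Monkey or any Netflix or Amazon tooling works. Every name in it (InjectedFault, with_fault_injection, fetch_price, resilient_fetch, run_drill) is hypothetical and invented for illustration. It assumes a single in-process function standing in for a dependency, wraps it so that a chosen fraction of calls fail on purpose, and then counts how often the caller's retry-and-fallback logic copes.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose by the fault injector (hypothetical name)."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail deliberately."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Stand-in for a dependency that 'should never fail' (illustrative only)."""
    return {"widget": 10}[item]


def resilient_fetch(fetch, item, retries=3, fallback=None):
    """Retry when a fault is injected, then return a fallback instead of crashing."""
    for _ in range(retries):
        try:
            return fetch(item)
        except InjectedFault:
            continue
    return fallback


def run_drill(failure_rate, calls=1000, seed=0):
    """Run a small drill; return (successful calls, calls that used the fallback)."""
    rng = random.Random(seed)  # seeded so that a drill can be repeated exactly
    flaky = with_fault_injection(fetch_price, failure_rate, rng)
    results = [resilient_fetch(flaky, "widget", fallback=-1) for _ in range(calls)]
    ok = sum(1 for r in results if r == 10)
    degraded = sum(1 for r in results if r == -1)
    return ok, degraded


if __name__ == "__main__":
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_drill(rate))
```

Raising the failure rate shows how quickly the retry budget is exhausted and the fallback path takes over. That is the kind of behavior a drill is meant to surface before a real outage does.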
People: The Key Link in Resiliency
Robbins uses a similar approach, called GameDay, to help teams “create resiliency through destruction.”
“It’s a function of people first, then technology,” he said. “We depend on services that should never fail. These exercises cause people to prepare. You build confidence in your responsibility to respond to failure. This is the difference between companies that succeed at scale on the web and those that don’t.”
Automation is a critical component of engineering for resiliency, said Robbins. Opscode makes configuration management software that allows organizations to automate large portions of their infrastructure.
“We have a bunch of manual processes which we need to automate,” said Robbins. “This is really the future.”