Automation is an incredibly important tool in the data center. But it's difficult to anticipate every set of events and conditions, and complex systems with lots of automated infrastructure can sometimes experience unexpected results. That's been the case in some of the recent outages at Amazon Web Services, in which equipment failures have triggered ripples of server and network activity. A similar scenario emerged Saturday at the open source code repository GitHub, as failover sequences triggered high levels of activity for network switches and file servers, resulting in an outage that lasted more than five hours.
"We had a significant outage and we want to take the time to explain what happened," wrote GitHub's Mark Imbriaco in an incident report. "This was one of the worst outages in the history of GitHub, and it's not at all acceptable to us."
The details of the incident are complicated, and are laid out at length in GitHub's update. A summary: GitHub performed a software upgrade on network switches during a scheduled maintenance window, and things went badly. When the network vendor sought to troubleshoot the issues, an automated failover sequence didn't synchronize properly. It did what it was supposed to do, but "unlucky timing" created huge churn on the GitHub network that blocked traffic between access switches for about 90 seconds. This triggered failover measures for the file servers, which didn't complete correctly because of the network issues.
The value of a detailed incident report is that it identifies vulnerabilities and workarounds that may prove useful to other users with similar infrastructure. As more services attempt to "automate all the things," understanding complex failover sequences becomes more important, and GitHub's outage report should make interesting reading for the devops crowd.
Imbriaco got props from Hacker News readers for the thoroughness of the incident report, and shared some advice on the topic:
"The worst thing both during and after an outage is poor communication, so I do my best to explain as much as I can what is going on during an incident and what's happened after one is resolved. There's a very simple formula that I follow when writing a public post-mortem:
1. Apologize. You'd be surprised how many people don't do this, to their detriment. If you've harmed someone else because of downtime, the least you can do is apologize to them.
2. Demonstrate understanding of the events that took place.
3. Explain the remediation steps that you're going to take to help prevent further problems of the same type.
Just following those three very simple rules results in an incredibly effective public explanation."