Microsoft Azure team has published a post-mortem on the widespread outage the cloud went through Wednesday that affected about 20 services in most availability zones around the world.
A configuration change meant to make Blob storage (Azure’s cloud storage service for unstructured data) perform better unexpectedly sent Blob front ends “into an infinite loop,” Jason Zander, a vice president of the Microsoft Azure team, wrote in a blog post. The front ends could not take on more traffic and caused problems for other services that used Blob.
The last major Azure outage happened in August, when services were interrupted completely in multiple regions, and issues persisted for more than 12 hours. Multiple regions went down earlier that month as well.
Problems during configuration changes are often cited as reasons for cloud outages. While companies do extensive tests of updates before they roll them out globally, they sometimes start behaving unpredictably when deployed at scale, as illustrated by the Azure outage.
In September, Facebook went down briefly after its engineers applied an infrastructure configuration change.
As a result of Tuesday’s Blob bug, Azure services were interrupted in six availability zones in the U.S., two in Europe, and four in Asia. Affected services included everything from Azure Storage to virtual machines, backup, machine learning, and HDInsights, Microsoft’s cloud-based Hadoop distribution.
The Azure team’s standard protocol dictates that changes are applied in small batches, but this update was made quickly across most regions, which Zander described as an “operational error.”
The bug showed itself during an upgrade that was supposed to improve performance of the storage service. The team had been testing the upgrade on some production clusters in the weeks running up to the mass deployment, but the issue did not surface during testing, Zander wrote.
Once the issue was discovered, the team rolled the update back, but storage frontends needed to be restarted. Performance issues persisted for close to 11 hours on Wednesday.
Because the outage had such a wide-ranging impact on the Azure infrastructure, the team could not communicate with its users through the cloud’s Service Health Dashboard, resorting to Twitter and other social media instead.
Zander outlined four steps his team would take to prevent a repeat occurrence of such incidents:
- Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches is always followed.
- Improve the recovery methods to minimize the time to recovery.
- Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production.
- Improve Service Health Dashboard Infrastructure and protocols.