Recent Microsoft Azure Outage Came Down to Human Error: Report

This article originally appeared at The WHIR

Microsoft offered more details this week about the cause of the Microsoft Azure outage in November that caused downtime for thousands of sites, including Microsoft’s own msn.com and Windows Store.

The Microsoft Azure service interruption on Nov. 18 resulted in intermittent connectivity issues with the Azure Storage service in multiple regions.

In a lengthy, detailed post on the Microsoft Azure blog on Wednesday, Jason Zander, corporate vice president of Microsoft Azure, said that the issue arose after Microsoft deployed a software change intended to improve Azure Storage performance by reducing the CPU footprint of the Azure Storage Table Front-Ends.

During deployment, there were two operational errors, Zander said.

“The standard flighting deployment policy of incrementally deploying changes across small slices was not followed,” he said. Secondly, “although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends,” which exposed a bug that resulted in Blob storage Front-Ends entering an infinite loop.

Microsoft’s final Root Cause Analysis for the event determined that “there was a gap in the deployment tooling that relied on human decisions and protocol.”

“With the tooling updates the policy is now enforced by the deployment platform itself,” Zander said.
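To illustrate what it can mean for a deployment platform, rather than a human operator, to enforce such a policy, here is a minimal sketch in Python. It is not Microsoft's tooling: the `StagedRollout` class, the `PolicyViolation` exception, and the slice and front-end names are all hypothetical, chosen only to mirror the two errors Zander described (skipping incremental slices, and enabling a change on a front-end type it was not validated against).

```python
from dataclasses import dataclass, field
from typing import List, Set


class PolicyViolation(Exception):
    """Raised when a deployment request breaks the staged-rollout policy."""


@dataclass
class StagedRollout:
    """Hypothetical rollout tracker that rejects out-of-policy requests."""
    change_id: str
    validated_targets: Set[str]   # front-end types the change was tested against
    slices: List[str]             # deployment slices in required order, smallest first
    completed: List[str] = field(default_factory=list)

    def deploy(self, target: str, slice_name: str) -> None:
        # Rule 1: only deploy to front-end types that were validated.
        if target not in self.validated_targets:
            raise PolicyViolation(
                f"{self.change_id} was not validated against {target}"
            )
        # Rule 2: slices must be taken strictly in order, one at a time.
        if len(self.completed) == len(self.slices):
            raise PolicyViolation(f"{self.change_id} is already fully deployed")
        expected = self.slices[len(self.completed)]
        if slice_name != expected:
            raise PolicyViolation(
                f"next slice must be {expected!r}, not {slice_name!r}"
            )
        self.completed.append(slice_name)


# Example: a change validated only for (hypothetical) table front-ends.
rollout = StagedRollout(
    change_id="cpu-footprint-change",
    validated_targets={"table-front-end"},
    slices=["small-slice", "one-region", "all-regions"],
)
rollout.deploy("table-front-end", "small-slice")   # allowed: first slice

for target, slice_name in [
    ("table-front-end", "all-regions"),  # skips the intermediate slice
    ("blob-front-end", "one-region"),    # front-end type never validated
]:
    try:
        rollout.deploy(target, slice_name)
    except PolicyViolation as err:
        print(f"rejected: {err}")
```

In this sketch both out-of-policy requests are refused by the tool itself, so following the rollout order no longer depends on an operator remembering the protocol.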

Aside from the technical failure, Microsoft said that it failed to communicate adequately with affected customers during the incident. It fell short in posting status information to the Service Health Dashboard and had insufficient channels of communication (tweets, blogs, forums). It also admitted that the response from Microsoft support was slow.

“We sincerely apologize and recognize the significant impact this service interruption may have had on your applications and services,” Zander said. “We appreciate the trust our customers place in Microsoft Azure, and I want to personally thank everyone for the feedback which will help our business continually improve.”

Below is the full timeline of the Nov. 18 Microsoft Azure service interruption (times are in UTC, as in Microsoft's post):

11/19 00:50 AM – Detected Multi-Region Storage Service Interruption Event
11/19 00:51 AM – 05:50 AM – Primary Multi-Region Storage Impact. Vast majority of customers would have experienced impact and recovery during this timeframe
11/19 05:51 AM – 11:00 AM – Storage impact isolated to a small subset of customers
11/19 10:50 AM – Storage impact completely resolved, identified continued impact to small subset of Virtual Machines resulting from Primary Storage Service Interruption Event
11/19 11:00 AM – Azure Engineering ran continued platform automation to detect and repair any remaining impacted Virtual Machines
11/21 11:00 AM – Automated recovery completed across the Azure environment. The Azure Team remained available to any customers with follow-up questions or requests for assistance

This article originally appeared at: http://www.thewhir.com/web-hosting-news/recent-microsoft-azure-outage-came-human-error-report
