This article originally appeared at The WHIR
Microsoft offered more details this week about the cause of the Microsoft Azure outage in November that caused downtime for thousands of sites, including Microsoft’s own msn.com and Windows Store.
The Microsoft Azure service interruption on Nov. 18 resulted in intermittent connectivity issues with the Azure Storage service in multiple regions.
In a lengthy, detailed post on its Microsoft Azure blog on Wednesday, corporate vice president, Microsoft Azure, Jason Zander said that the issue arose after Microsoft deployed a software change to improve Azure Storage performance by reducing CPU footprint of the Azure Storage Table Front-Ends.
During deployment, there were two operational errors, Zander said.
“The standard flighting deployment policy of incrementally deploying changes across small slices was not followed,” he said. Secondly, “although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends,” which exposed a bug that resulted in Blob storage Front-Ends entering an infinite loop.
Microsoft’s final Root Cause Analysis for the event determined that “there was a gap in the deployment tooling that relied on human decisions and protocol.”
“With the tooling updates the policy is now enforced by the deployment platform itself,” he said.
Aside from technical failure, Microsoft said that it failed to in communicating with affected clients during the incident. It fell short in posting status information to the Service Health Dashboard, and had insufficient channels of communication (Tweets, Blogs, Forums). It also admitted the response from Microsoft support was slow.
“We sincerely apologize and recognize the significant impact this service interruption may have had on your applications and services,” Zander said. “We appreciate the trust our customers place in Microsoft Azure, and I want to personally thank everyone for the feedback which will help our business continually improve.”
Below is the full timeline of the Nov. 18 Microsoft Azure service interruption:
This article originally appeared at: http://www.thewhir.com/web-hosting-news/recent-microsoft-azure-outage-came-human-error-report