Microsoft has offered a service credit for customers of its Windows Azure cloud computing service in the wake of an extended outage on Feb. 29 due to a Leap Year date issue. Microsoft also apologized for problems with its status dashboard for Azure, which crashed during the outage, and promised to expand its communications efforts to make better use of social media.
“Microsoft recognizes that this outage had a significant impact on many of our customers,” the company said in an incident report. “We stand behind the quality of our service and our Service Level Agreement (SLA), and we remain committed to our customers. Due to the extraordinary nature of this event, we have decided to provide a 33% credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted.”
The root cause of the outage was a Leap Year bug in the system that generates security certificates for Azure. The incident recalled the days of the Y2K bug and the ability of an unanticipated date to trigger a system malfunction. The system recognized Feb. 29, 2012, but the certificate generation process set a one-year expiration date of Feb. 29, 2013, which is not a valid date, as 2013 is not a leap year.
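To illustrate the class of bug described here, the following is a minimal Python sketch, not Microsoft's actual certificate code. The function names are hypothetical, and falling back to Feb. 28 is only one possible policy for handling a Feb. 29 start date. The naive version bumps only the year field, which yields an invalid date when the start date is Feb. 29.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Hypothetical naive approach: add one to the year and keep month/day.

    Raises ValueError when start is Feb. 29, because the following
    year has no Feb. 29.
    """
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Hypothetical safer approach: fall back to Feb. 28 when needed.

    The fallback day is an assumption chosen for illustration; a real
    system might prefer Mar. 1 or a fixed 365-day validity window.
    """
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here, so clamp to Feb. 28 of the next year.
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    issued = date(2012, 2, 29)
    try:
        naive_one_year_later(issued)
    except ValueError as err:
        print("naive calculation failed:", err)
    print("safe expiration:", safe_one_year_later(issued))
```

In Python the invalid date surfaces as an exception, while other environments may behave differently when handed a nonexistent date.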
The impact on customers was exacerbated when the Windows Azure Dashboard was knocked offline, unable to handle the traffic from Azure customers seeking information about the outage. The dashboard is hosted on two internal infrastructures, Windows Azure and Microsoft.com, and is also geo-replicated, according to Microsoft. But that didn’t help as the outage persisted, which resulted in “exceptionally high volume and fail-over/load balancing” problems.
“The service dashboard experienced intermittent availability issues, didn’t provide a summary of the situation in its entirety, and didn’t provide the granularity of detail and transparency our customers need and expect,” the company said. “A significant number of customers are asking us to better use our blog, Facebook page, and Twitter handle to communicate with them in the event of an incident.”
Microsoft is about three years late in acknowledging the importance of Twitter as a primary communications tool during outages. But the frequency of updates wasn’t the only issue for customers, who were clearly unhappy with generic “something is still wrong, we’re working on it” updates.
“Although updates are posted on an hourly basis, the status updates were often generic or repeated the information provided in the last couple of hours,” Microsoft said. “Customers have asked that we provide more details and new information on the specific work taking place to resolve the issue.”