-
Errant ‘Safety Valve’ Caused Windows Azure Outage

A feature designed to prevent network outages caused last week’s downtime for the Windows Azure cloud computing platform, Microsoft said yesterday as it released its root cause analysis of the incident.
A “safety valve” designed to throttle connections during traffic spikes wasn’t properly configured to handle a capacity upgrade for the West Europe sub-region, resulting in a flood of network management messages that maxed out the Azure system. The result was a 2 hour, 24 minute outage for users in West Europe.
“Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our data center network hardware devices,” wrote Mike Neil, General Manager of Windows Azure, in a blog post. “Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity. Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages. The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100% CPU utilization impacting data traffic.”
RESOURCE LINKS:
Building A Cloud-Savvy Model for TCO and ROI
How Storage is Shaping The Cloud Data Center
Bringing Colo to the Customer: Modular Gets Local
Microsoft’s $1 Billion Data Center


August 3rd, 2012