Errant ‘Safety Valve’ Caused Windows Azure Outage

A feature designed to prevent network outages caused last week’s downtime for the Windows Azure cloud computing platform, Microsoft said yesterday as it released its root cause analysis of the incident.

A “safety valve” designed to throttle connections during traffic spikes wasn’t reconfigured after a capacity upgrade for the West Europe sub-region. When usage climbed past the outdated limit, the result was a flood of network management messages that drove some of the cluster’s network devices to 100 percent CPU utilization. Users in West Europe experienced a 2-hour, 24-minute outage.

“Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our data center network hardware devices,” wrote Mike Neil, General Manager of Windows Azure, in a blog post. “Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity. Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages. The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100% CPU utilization impacting data traffic.”
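The failure Neil describes, a device limit left unchanged after a capacity increase, is the kind of mismatch a validation check can catch. The Python sketch below is illustrative only and is not Microsoft's tooling: the `Device` class, the function names, the per-unit connection figure, and the 20 percent headroom are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """A network device with a hypothetical 'safety valve' connection limit."""
    name: str
    connection_limit: int


def expected_limit(capacity_units, connections_per_unit, headroom_percent=20):
    """Limit a device should carry for a given capacity (assumed formula).

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return capacity_units * connections_per_unit * (100 + headroom_percent) // 100


def stale_devices(devices, capacity_units, connections_per_unit):
    """Return names of devices whose limit is below what the capacity needs."""
    required = expected_limit(capacity_units, connections_per_unit)
    return [d.name for d in devices if d.connection_limit < required]


# Hypothetical cluster: limits were sized for 100 capacity units.
devices = [
    Device("switch-a", expected_limit(100, 50)),
    Device("switch-b", expected_limit(100, 50)),
]

# Before the upgrade, every limit matches the capacity.
before = stale_devices(devices, capacity_units=100, connections_per_unit=50)

# After growing to 150 units without touching the limits, both are stale.
after = stale_devices(devices, capacity_units=150, connections_per_unit=50)
```

Running a check like this as part of capacity validation would flag devices whose thresholds no longer match the cluster's size before production traffic reaches them.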

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.
