A feature designed to prevent network outages caused last week’s downtime for the Windows Azure cloud computing platform, Microsoft said yesterday as it released its root cause analysis of the incident.
A “safety valve” designed to throttle connections during traffic spikes wasn’t properly configured to handle a capacity upgrade for the West Europe sub-region, resulting in a flood of network management messages that maxed out the Azure system. The result was a 2 hour, 24 minute outage for users in West Europe.
“Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our data center network hardware devices,” wrote Mike Neil, General Manager of Windows Azure, in a blog post. “Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity. Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages. The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100% CPU utilization impacting data traffic.”