6. Windows Azure Cloud Outage, Feb. 29: Then there’s the “Leap Year Bug,” in which a date-related glitch with a security certificate was triggered by the onset of the Feb. 29 “Leap Day,” which occurs once every four years. The incident left Azure customers unable to manage their applications for about 8 hours and knocked Azure-based services offline for some North American users. “This issue appears to be due to a time calculation that was incorrect for the leap year,” said Microsoft’s Bill Laing. Microsoft later offered customers a service credit under its service level agreement (SLA).
7. Salesforce.com Outage, July 10: June and July tend to be the toughest months for uptime, and that held true for Salesforce.com, which had outages in both months. The more significant of the two incidents occurred July 10, and was caused by a brief power loss in a Silicon Valley data center operated by Equinix. As is often the case, the restoration of power to the data center was prompt, but was followed by a longer recovery period for customers with databases and apps. Equinix restored power within a minute, but Salesforce.com was affected for more than 9 hours.
8. Syrian Internet Blackout, Nov. 29: Downtime has a political component, as we have learned over the past two years from Internet “blackouts” in Egypt, Libya and most recently Syria. On Nov. 29, Internet monitoring services reported that all 84 of Syria’s IP address blocks had become unreachable, effectively removing the country from the Internet. CloudFlare’s monitoring suggested that government claims of terrorism and cable cuts lacked credibility. “The systematic way in which routes were withdrawn suggests that this was done through updates in router configurations, not through a physical failure or cable cut,” CloudFlare reported.
9. “Safety Valve” KOs Azure, July 28: Sometimes the systems set up to protect your network can inadvertently become the enemy. In a July 28 outage for the Windows Azure cloud computing platform, a “safety valve” feature designed to throttle connections during traffic spikes wasn’t properly configured to handle a capacity upgrade for the West Europe sub-region, resulting in a flood of network management messages that maxed out the Azure system. The result was a 2-hour, 24-minute outage for users in West Europe.
10. Hosting.com Outage, July 28: Human error is often cited as one of the leading factors in data center downtime. There was an example of that in a July incident that caused an outage for 1,100 customers of Hosting.com. The downtime occurred as the company was conducting preventive maintenance on a UPS system in the company’s data center in Newark, Del. “An incorrect breaker operation sequence executed by the servicing vendor caused a shutdown of the UPS plant resulting in loss of critical power to one data center suite within the facility,” said Hosting.com CEO Art Zeile. “This was not a failure of any critical power system or backup power system and is entirely a result of human error.”