In a business built on uptime, outages make headlines. The major downtime incidents of 2012 illustrate the range of causes: major disasters, equipment failures, misbehaving software, undetected Leap Year date issues, and human error. Each incident caused pain for customers and end users, but also offered the opportunity to learn lessons that will make data centers and applications more reliable.
A case in point: 2012 was the year of the cloud outage, as several leading cloud platforms experienced downtime, most notably Amazon Web Services. The incidents raised questions about the reliability of leading cloud providers, but also prompted greater focus on architecting cloud-based applications across multiple zones and locations for greater resilience. Meanwhile, the post-mortems on SuperStorm Sandy have just begun, and will continue at industry conferences in 2013. Here's a look at our list of the Top 10 outages of 2012:
1. SuperStorm Sandy, Oct. 29-30: Data centers throughout New York and New Jersey felt the effects of Sandy, with the impacts ranging from flooding and downtime for some facilities in Lower Manhattan, to days on generator power for data centers around the region. Sandy was an event that went beyond a single outage, and tested the resilience and determination of the data center industry on an unprecedented scale. One of the affected providers willing to share its story was Datagram, whose CEO, Alex Reppen, described the company's battle to rebound from "apocalyptic" flooding that shut down its diesel fuel pumps. Indeed, diesel became the lifeblood of the recovery effort, as backup power systems took over IT loads across the region, prompting extraordinary measures to keep generators fueled. With the immediate recovery effort behind us, the focus is shifting to longer-term discussions about location, engineering and disaster recovery, a conversation that will continue for months, if not years.
2. Go Daddy DNS Outage, Sept. 10: Domain giant Go Daddy is one of the most important providers of DNS service, as it hosts 5 million web sites and manages more than 50 million domain names. That's why a Sept. 10 outage was one of the most disruptive incidents of 2012. Tweet-driven speculation led some to believe that the six-hour incident was the result of a denial of service attack, but Go Daddy later said it was caused by corrupted data in router tables. “The service outage was not caused by external influences,” said Scott Wagner, Go Daddy’s Interim CEO. “It was not a ‘hack’ and it was not a denial of service attack (DDoS). We have determined the service outage was due to a series of internal network events that corrupted router data tables.”
3. Amazon Outage, June 29-30: Amazon's EC2 cloud computing service powers some of the web's most popular sites and services, including Netflix, Heroku, Pinterest, Quora, Hootsuite and Instagram. That success has a flip side: when an Amazon data center loses power, the outage ripples across the web. On June 29, a system of unusually strong thunderstorms, known as a derecho, rolled through northern Virginia. When an Amazon facility in the region lost utility power, the generators failed to operate properly, depleting the emergency power in the uninterruptible power supply (UPS) systems. Amazon said the data center outage affected a small percentage of its operations, but was exacerbated by problems with systems that allow customers to spread workloads across multiple data centers. The incident came just two weeks after an earlier outage in the same region, and Amazon experienced yet another cloud outage in late October.
4. Calgary Data Center Fire, July 11: A data center fire in a Shaw Communications facility in Calgary, Alberta, crippled city services and delayed hundreds of surgeries at local hospitals. The incident knocked out both the primary and backup systems that supported key public services, providing a wake-up call for government agencies to ensure that the data centers that manage emergency services have recovery and failover systems that can survive a series of adversities – the “perfect storm of impossible events” that combine to defeat disaster management plans.
5. Australian Airport Chaos, July 1: The “Leap Second Bug,” triggered when a single second was added to Coordinated Universal Time (UTC), the time standard kept by the world’s atomic clocks, made headlines on July 1. The change caused computer problems with the Amadeus airline reservation system, triggering long lines and traveler delays at airports across Australia, as the outage wreaked havoc with the check-in systems used by Qantas and Virgin Australia.
6. Windows Azure Cloud Outage, Feb. 29: Then there's the "Leap Year Bug," in which a date-related glitch with a security certificate was triggered by the onset of the Feb. 29 “Leap Day,” which occurs once every four years. The incident left Azure customers unable to manage their applications for about 8 hours and knocked Azure-based services offline for some North American users. “This issue appears to be due to a time calculation that was incorrect for the leap year,” said Microsoft’s Bill Laing. Microsoft later offered customers a service credit under its service level agreement (SLA).
7. Salesforce.com Outage, July 10: June and July tend to be the toughest months for uptime, and that held true for Salesforce.com, which had outages in both months. The more significant of the two incidents occurred July 10, and was caused by a brief power loss in a Silicon Valley data center operated by Equinix. As is often the case, the restoration of power to the data center was prompt, but was followed by a longer recovery period for customers with databases and apps. Equinix restored power within a minute, but Salesforce.com was affected for more than 9 hours.
8. Syrian Internet Blackout, Nov. 29: Downtime has a political component, as we have learned over the past two years from Internet "blackouts" in Egypt, Libya and most recently Syria. On Nov. 29, Internet monitoring services reported that all 84 of Syria’s IP address blocks had become unreachable, effectively removing the country from the Internet. CloudFlare's monitoring suggested that government claims of terrorism and cable cuts lacked credibility. “The systematic way in which routes were withdrawn suggests that this was done through updates in router configurations, not through a physical failure or cable cut,” CloudFlare reported.
9. "Safety Valve" KOs Azure, July 28: Sometimes the systems set up to protect your network can inadvertently become the enemy. In a July 28 outage for the Windows Azure cloud computing platform, a “safety valve” feature designed to throttle connections during traffic spikes wasn’t properly configured to handle a capacity upgrade for the West Europe sub-region, resulting in a flood of network management messages that maxed out the Azure system. The result was a 2-hour, 24-minute outage for users in West Europe.
10. Hosting.com Outage, July 28: Human error is often cited as one of the leading factors in data center downtime. There was an example of that in a July incident that caused an outage for 1,100 customers of Hosting.com. The downtime occurred as the company was conducting preventive maintenance on a UPS system in the company’s data center in Newark, Del. “An incorrect breaker operation sequence executed by the servicing vendor caused a shutdown of the UPS plant resulting in loss of critical power to one data center suite within the facility,” said Hosting.com CEO Art Zeile. “This was not a failure of any critical power system or backup power system and is entirely a result of human error.”
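The two calendar-related incidents on this list (items 5 and 6) belong to a class of bug that is easy to reproduce. The sketch below is only an illustration of that class, not a reconstruction of Microsoft's or Amadeus's code: the function names are hypothetical, and the assumption that an expiry date is computed by adding one to the year is ours. It shows how a naive "same day next year" calculation breaks on Feb. 29, one way to guard against it, and why a leap second is awkward for common date libraries.

```python
import calendar
from datetime import date, datetime, timezone


def naive_one_year_later(d: date) -> date:
    """Hypothetical buggy helper: keep month and day, bump the year."""
    # For Feb. 29 this asks for Feb. 29 of a non-leap year, which does not exist.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Hypothetical fix: clamp the day to the last valid day of the target month."""
    target_year = d.year + 1
    last_day = calendar.monthrange(target_year, d.month)[1]
    return date(target_year, d.month, min(d.day, last_day))


def leap_second_is_representable() -> bool:
    """Check whether Python's datetime accepts the 2012 leap second, 23:59:60 UTC."""
    try:
        datetime(2012, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
        return True
    except ValueError:
        # datetime only allows seconds in the range 0..59.
        return False


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as exc:
        print("naive calculation failed:", exc)
    print("safe calculation:", safe_one_year_later(leap_day))
    print("leap second representable:", leap_second_is_representable())
```

The clamping approach is one design choice among several; some systems may prefer to roll forward to March 1 instead. Either way, the point is to decide the Feb. 29 behavior explicitly and test for it, rather than discover it once every four years.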