In a business built on uptime, outages make headlines. The major downtime incidents of 2012 illustrate the range of causes – major disasters, equipment failures, misbehaving software, undetected Leap Year date issues, and human error. Each incident caused pain for customers and end users, but also offered an opportunity to learn lessons that will make data centers and applications more reliable.
A case in point: 2012 was the year of the cloud outage, as several leading cloud platforms experienced downtime, most notably Amazon Web Services. The incidents raised questions about the reliability of leading cloud providers, but also prompted greater focus on architecting cloud-based applications across multiple zones and locations for greater resilience. Meanwhile, the post-mortems on SuperStorm Sandy have just begun, and will continue at industry conferences in 2013. Here’s a look at our list of the Top 10 outages of 2012:
1. SuperStorm Sandy, Oct. 29-30: Data centers throughout New York and New Jersey felt the effects of Sandy, with the impacts ranging from flooding and downtime for some facilities in Lower Manhattan, to days on generator power for data centers around the region. Sandy was an event that went beyond a single outage, and tested the resilience and determination of the data center industry on an unprecedented scale. One of the affected providers willing to share their story was Datagram CEO Alex Reppen, who described the company’s battle to rebound from “apocalyptic” flooding that shut down its diesel fuel pumps. Indeed, diesel became the lifeblood of the recovery effort, as backup power systems took over IT loads across the region, prompting extraordinary measures to keep generators fueled. With the immediate recovery effort behind us, the focus is shifting to longer-term discussions about location, engineering and disaster recovery – a conversation that will continue for months, if not years.
2. Go Daddy DNS Outage, Sept. 10: Domain giant Go Daddy is one of the most important providers of DNS service, as it hosts 5 million web sites and manages more than 50 million domain names. That’s why a Sept. 10 outage was one of the most disruptive incidents of 2012. Tweet-driven speculation led some to believe that the six-hour incident was the result of a denial of service attack, but Go Daddy later said it was caused by corrupted data in router tables. “The service outage was not caused by external influences,” said Scott Wagner, Go Daddy’s Interim CEO. “It was not a ‘hack’ and it was not a denial of service attack (DDoS). We have determined the service outage was due to a series of internal network events that corrupted router data tables.”
3. Amazon Outage, June 29-30: Amazon’s EC2 cloud computing service powers some of the web’s most popular sites and services, including Netflix, Heroku, Pinterest, Quora, Hootsuite and Instagram. That success has a flip side: when an Amazon data center loses power, the outage ripples across the web. On June 29, a system of unusually strong thunderstorms, known as a derecho, rolled through northern Virginia. When an Amazon facility in the region lost utility power, the generators failed to operate properly, depleting the emergency power in the uninterruptible power supply (UPS) systems. Amazon said the data center outage affected a small percentage of its operations, but was exacerbated by problems with systems that allow customers to spread workloads across multiple data centers. The incident came just two weeks after an earlier outage in the same region, and Amazon experienced a further cloud outage in late October.
4. Calgary Data Center Fire, July 11: A data center fire in a Shaw Communications facility in Calgary, Alberta crippled city services and delayed hundreds of surgeries at local hospitals. The incident knocked out both the primary and backup systems that supported key public services, providing a wake-up call for government agencies to ensure that the data centers that manage emergency services have recovery and failover systems that can survive a series of adversities – the “perfect storm of impossible events” that combine to defeat disaster management plans.
5. Australian Airport Chaos, July 1: The “Leap Second Bug,” in which a single second was added to the world’s atomic clocks, made headlines on July 1. The change caused computer problems with the Amadeus airline reservation system, triggering long lines and traveler delays at airports across Australia, as the outage wreaked havoc with the check-in systems used by Qantas and Virgin Australia.