It’s hard to say whether data center outages were more frequent in 2009 than in the past. But they were certainly more visible, as round-the-clock consumption of blogs and social networks made downtime harder to hide, and Twitter amplified customer complaints. First, let’s look at the outages (there were some doozies, and sometimes they came in bunches), and then review how social media altered the status quo for data center downtime in 2009.
Here are the top 10 data center outages of 2009, in no particular order:
Lengthy Outage at Fisher Plaza: The early July outage at this Seattle data center hub was widely felt, affecting e-commerce around the globe as Authorize.net went offline. The outage, which also affected customers of Internap and AdHost as well as Microsoft’s Bing Travel site, was later blamed on an insulation failure in a bus duct.
Michael Jackson’s Death Slows The Web: On June 25 the Internet creaked under the weight of millions of users seeking news on the death of pop star Michael Jackson. As the news of Jackson’s death circulated, the traffic jam spread to many large news sites. An analysis by Keynote Systems later blamed some of the problems on slow-loading third-party content like ad networks and widgets.
The Sidekick Snafu: On Oct. 10 T-Mobile told all users of its popular Sidekick mobile that their data had been lost due to a server failure at Microsoft’s Danger subsidiary. Microsoft was eventually able to recover much of the endangered data, but not before a vigorous debate broke out about whether the Sidekick fiasco could be tied to the risks of cloud computing and online storage.
Total Data Loss for Ma.gnolia: Users who stored bookmarks online using the Ma.gnolia service were not as lucky. All of the site’s user data was irretrievably lost in the Jan. 30 database crash. The data disaster underscored the importance of sound backup practices, as well as the challenge of running a large service as a one-man operation.
Twitter Felled by DDoS: On Aug. 6 an electronic attack known as a distributed denial of service (DDoS) targeted sites on several major social networks. While Facebook and LiveJournal were slowed, Twitter crashed completely for about three hours before restoring service. The attacks continued for weeks as Twitter worked with its data center provider, NTT America, to strengthen its defenses.
Amazon Cloud Glitches: With cloud reliability receiving close scrutiny, even brief outages at Amazon Web Services were interpreted by some as a sign of cloud fragility. The Amazon cloud had a weather-related downtime in June, and also experienced brief outages in July and December. It bears mentioning that Amazon created multiple availability zones to address these types of outages.
Downtime at Rackspace: Persistent power problems at a Dallas data center caused several high-profile outages for Rackspace, as a June 29 event was followed by another outage on July 7. The incidents prompted a response from the top, as Rackspace CEO Lanham Napier taped a video outlining the company’s response.
Fire at 151 Front Street: The July 6 incident at Toronto’s major carrier hotel was overshadowed by the downtime at Fisher Plaza and Rackspace that same week. A number of tenants (including Peer 1) were knocked offline by the incident, caused by a fire in an eighth-floor electrical room.
Gmail is Down! Gmail is Down! Some of 2009’s most intense Tweetstorms were driven by brief outages for Gmail, Google’s enormously popular webmail program. A February outage was attributed to capacity problems in Google’s European data center network, while a September downtime incident was attributed to overloaded routers.
PayPal Down, and the Meter is Running: The cost of downtime is a popular topic in the data center business. Nowhere is this metric more relevant than in financial services. Performance problems on Aug. 3 left PayPal unable to process transactions, a scenario that some estimated cost the eBay business unit as much as $7 million an hour in missed transaction fees.
Air New Zealand Travel ‘Chaos’: Rarely does a data center provider get a public chastisement like Air New Zealand CEO Rob Fyfe’s tongue-lashing of IBM after a generator failure in an Auckland data center crashed the airline’s reservations system. Most data center systems were back online in an hour, but it reportedly took up to six hours for the situation to normalize at New Zealand’s airports.
A recurring theme in industry downtime was the role of Twitter, which became a real-time aggregator for complaints from customers, but also emerged as an important customer support channel. Here’s a look at how the use of Twitter evolved around outages:
Data Center Outages and Staff Scalability: In March we examined how the speed and volume of customer tweets during outages were overwhelming existing staffing levels for hosting providers’ corporate Twitter accounts. “The challenge: support staff are not like cloud computing clusters,” we noted. “You can’t auto-provision and scale the staff needed to manage your reputation in a crisis. You need dedicated staff.”
Weathering the Customer Service Tweetstorm: LA host Media Temple was among those that retooled their customer service operations to boost staffing of their Twitter channels, a move that paid dividends when the company suffered a lengthy outage in May.
Hosting Downtime and Competition: By July, hosting companies were adding outage-related hashtags to “adverTweets” marketing their services to customers affected by outages, prompting discussion of whether “rescue marketing” was a winning strategy or a shady tactic.