Lightning storms, floods, car crashes and errant transfer switches were the culprits in this year’s most significant data center outages. There were plenty of high-profile outages for social media sites like Twitter, Facebook and free hosting services (more on this in a moment). But 2010 also saw at least four major outages for e-commerce services, with some disrupting tens of millions of dollars in transactions. There were also incidents that knocked government services offline for days. Here’s a look at the top business-related outages of the year, followed by a roundup of social media downtime.
Jan. 19: Storms KO San Jose Data Center: A NaviSite data center lost power after severe storms knocked out the facility’s utility power from PG&E, causing a lengthy outage for merchants using the eBay ProStores service to operate e-commerce web sites. NaviSite later took steps to overhaul the surge suppression system at its San Jose data center, which didn’t adequately protect relay fuses within an automated transfer switch (ATS) from a power surge.
March 31: Flooded Exchange Disrupts BT Service: Major flooding at a BT telecom exchange point in the Paddington section of London disrupted telecom service and payment networks in parts of London, with some areas taking several days to recover.
May 6: Finance Sites Slowed by Market Whiplash: In the year’s most glaring traffic-storm outage, several major Internet finance portals experienced performance problems as major stock market indicators plunged precipitously in the May 6 “Flash Crash.” That included the web’s most popular financial information site, Yahoo Finance, which was inaccessible to many users as the Dow Jones average plunged 700 points in a matter of minutes between 2 and 3 p.m. Eastern time.
May 11: Car Crash Triggers Amazon Power Outage: Some customers of Amazon’s EC2 cloud computing service were knocked offline when a vehicle crashed into a utility pole near one of the company’s data centers, and a transfer switch failed to properly manage the shift from utility power to the facility’s generators. While the incident lasted only an hour, it was the fourth separate outage in a week for EC2 services.
June 1: Water Main Break Floods Dallas Data Center: IT systems in Dallas County were offline for more than three days after a 90-year-old water main ruptured, flooding the basement of the Dallas County Records Building, which houses the UPS systems and other electrical equipment supporting the data center on the fifth floor of the building. The county did not have a backup data center, despite warnings that it faced the risk of service disruption without one.
June 16: Power Failure KOs Intuit Sites for 24 Hours: A data center power failure during routine maintenance knocked Quickbooks.com and other Intuit web sites offline for more than a day. The outage was critical for Intuit’s small business hosting customers, most of whom were unable to access their web sites or process transactions.
June 29: Downtime for Amazon.com: The main Amazon retail store is rarely offline, which is a good thing, since it drives revenue of about $1.75 million an hour. But on June 29 Amazon.com was unavailable for about three hours due to “latencies.”
Aug. 27: Computer Outages Hobble Services in Virginia: Many critical services in the state of Virginia were crippled by failures in a state data center in Chesterfield. More than 220 servers were offline, leaving at least 24 state agencies without full IT support. Some agencies – most notably the division of motor vehicles – experienced problems for weeks afterward.
Sept. 13: Major Outage for Chase.com Web Site: The Chase web site crashed when a third party vendor’s database software corrupted the log-in process, disrupting site access and online bill payments for nearly three days. Chase said the outage delayed more than $132 million in transfers, and offered to compensate customers for late fees on bill payments.
Oct. 29: Hardware Failure Cited in PayPal Outage: A network hardware failure was the trigger for an outage at PayPal, which left millions of merchants unable to process online transactions. The hardware failure was compounded by problems shifting traffic to another data center, resulting in about 90 minutes of downtime for the payment processing service.
Nov. 4: Transfer Switch Glitch KOs iWeb Customers: About 3,000 servers at Montreal web host iWeb experienced an outage after a fire near one of its data centers prompted the company to shift the facility to generator power. All three generators started properly, but one of the transfer switches failed. Power came back in an hour, but some customers were offline for far longer.
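For a sense of scale, the June 29 Amazon.com figures above can be turned into a back-of-the-envelope number. This is only a rough sketch: it assumes revenue is spread evenly across the day and that none of the missed sales were made later, neither of which the reported figures confirm. The function name `revenue_at_risk` is made up for illustration.

```python
# Figures quoted in the June 29 entry; the even-revenue model is an assumption.
REVENUE_PER_HOUR_USD = 1_750_000  # "about $1.75 million an hour"
OUTAGE_HOURS = 3                  # "about three hours"


def revenue_at_risk(revenue_per_hour, hours):
    """Rough estimate: hourly revenue multiplied by hours of downtime."""
    return revenue_per_hour * hours


estimate = revenue_at_risk(REVENUE_PER_HOUR_USD, OUTAGE_HOURS)
print(f"Revenue at risk: ${estimate:,.0f}")  # prints "Revenue at risk: $5,250,000"
```

By this crude measure, roughly $5.25 million in sales was exposed during the outage, though some of those shoppers likely came back once the site recovered.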
SOCIAL MEDIA DOWNTIME:
While business downtime was often tied to equipment failure and acts of God, social media and hosting sites saw a number of outages tied to coding snafus and configuration challenges. Here’s an overview:
March 25: Wikipedia’s Data Center Overheats: The online encyclopedia Wikipedia was knocked offline when a cooling problem in its European data center led to overheating that caused a server shutdown. The initial problem affected European Wikipedia users, but the main English-language Wikipedia site was affected when the failover to Wikimedia’s Tampa data center didn’t go as planned.
June 11: Errant Code Change Crashes 10 Million Blogs: The blog hosting service WordPress.com suffered a major outage when a code change overwrote key options in the options table for its blogs. Most sites appear to have been offline for about an hour, but it was about six hours before the WordPress.com team posted that operations were “back to full speed.”
June 16: Record World Cup Traffic Slams Twitter: Traffic spikes during World Cup soccer matches overwhelmed Twitter’s internal network capacity, resulting in frequent downtime and performance problems. A month later, the company announced plans to build its own data center to better manage its growth.
Sept. 8: Digg’s Downtime Debacle: The rollout of the social news hub’s Version 4 led to significant availability problems for Digg.com, prompting a debate about whether the downtime was tied to challenges in deploying NoSQL databases like Cassandra or simply a case of a company launching a new site architecture before it was ready for prime time.
Sept. 23: Longest Facebook Outage in 4 Years: Facebook went down for more than two hours on Sept. 23, marking its longest outage in about four years. A configuration change created a feedback loop that overwhelmed a database cluster. The only way to fix the problem was to take the whole cluster offline – which meant downtime for the web site.
Dec. 5: Tumblr Offline for 24 hours: The microblogging service Tumblr was offline for more than a day after a database cluster failed and a network outage followed. The site has also had to cope with denial of service attacks. Shortly after the incident, Tumblr announced plans to add a new data center to better manage traffic that has reached 500 million page views a month.
The good news: No major squirrel-related outages this year.