Folks who've worked in the data center industry for a while tend to have their squirrel stories. Mike Christian, who runs business continuity for Yahoo, shared his recently during a keynote at the O'Reilly Velocity conference in a presentation titled "Frying Squirrels and Unspun Gyros," which examined the many ways that data centers can fail.
"A frying squirrel took out half of our Santa Clara data center two years back," Christian said, noting squirrels' propensity to interact with electrical equipment, with unfortunate results.
If you enter “squirrel outage” in either Google News or Google web search, you’ll find a lengthy record of both recent and historic incidents of squirrels causing local power outages.
Yahoo houses its servers in 29 different data centers, which explains Christian's familiarity with the many ways they can fail. These include:
- Inadvertent fire suppression: When an electrical problem triggered smoke detectors at a Texas data center hosting Yahoo Launch (Broadcast.com), staffers didn't realize they could override the next phase of the system - power shutdown and a "dump" of FM200 fire suppressant.
- HVAC Failure: A cooling system failure in an N+1 Yahoo facility in Reston, Virginia caused a temperature spike in part of the data center. The spike triggered the fire suppression system, which then shut down the remaining HVAC units, leading to a "thermal runaway" that pushed temperatures in the data center to 130 degrees F. Yahoo was able to shift the load and avoided any downtime. That's one reason Yahoo built its Lockport, N.Y. "chicken coop" data center to use fresh air instead of mechanical cooling. "That's one less failure point," said Christian.
- UPS Meltdowns: Yahoo had a small UPS setup in its Sunnyvale data center fail three times in five years. Christian cites a recent survey indicating that up to 29 percent of unplanned data center outages are caused by UPS failures. "Our UPS causes as many problems as it solves," said Christian. "Complexity is introduced by adding all these multiple systems. They actually introduce additional failure cases."
Christian also reviews the Rackshack fire of 2003, the explosion at The Planet in 2008, the Verio outage from Hurricane Wilma, the 365 Main outage of 2007, and the 2011 AWS cloud outage in Dublin.
How do you prepare for these kinds of events? Focus on storing data in more than one location, and routing around facility failures. How does Yahoo know this will work? It conducts full-scale failover testing with live loads, shifting millions of users between data centers with no visible impact. Here's a video of Christian's presentation, which runs about 15 minutes.
If you enjoy Christian's presentation, check out The 10 Most Bizarre and Annoying Causes of Fiber Cuts, in which Level 3's Fred Lawler looks at unusual outages in the company's network. You'll be shocked (SHOCKED!) to learn that 17 percent of Level 3's fiber outages are caused by squirrels.