Several generators at 365 Main’s San Francisco data center failed to start when the facility lost grid power Tuesday afternoon, causing an outage that knocked many of the web’s most popular destinations offline for several hours. The disruption, which began at 1:45 pm PST, occurred during a grid outage for Pacific Gas & Electric, which left significant portions of San Francisco in the dark. Parts of 365 Main’s data center lost power, causing downtime for customer sites including CraigsList, Technorati, LiveJournal, TypePad, AdBrite, the 1Up gaming network, Second Life and Yelp, among others.
Wild rumors circulated about why 365 Main’s backup systems failed to maintain power to key systems, including reports of employee sabotage or a possible triggering of the facility’s emergency power off (EPO) button, a frequent cause of outages at mission-critical facilities. While less sensational, the actual cause of the outage was the failure of backup diesel generators.
“An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building,” the company said in an incident report. “On-site facility engineers responded and manually started affected generators allowing stable power to be restored at approximately 2:34 pm across the entire facility.”
“As a result of the incident, continuous power was interrupted for up to 45 minutes for certain customers,” the report continued. “We’re certain 3 of the 8 colocation rooms were directly affected, and impact on other colocation rooms is still being investigated.”
The 365 Main data center is supported by 10 Hitec 2.1-megawatt generators, which are tested every month. The 277,000-square-foot 365 Main facility is partitioned into eight data center “pods,” some of which remained online while others went dark.
The facility’s backup systems use flywheel UPS systems, rather than batteries, to provide “ride-through” electricity that keeps servers online until the diesel generators can start up and begin powering the facility. A flywheel is a spinning cylinder that stores kinetic energy; it continues to spin when grid power is interrupted, and that stored energy is converted to electricity. In most data centers, the UPS (uninterruptible power supply) system instead draws power from a bank of large batteries. AboveNet, the original builder/owner of the 365 Main data center, was an early adopter of flywheel UPS systems, which have recently gained attention as a “greener” alternative to batteries.
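To give a rough sense of what “ride-through” means in practice, the sketch below estimates how long an idealized flywheel could carry a load while it slows down, using the standard kinetic energy formula E = ½·I·ω². It is only an illustration: the function names and every number in the example are hypothetical and are not specifications of the Hitec units at 365 Main, and the model ignores conversion losses.

```python
import math


def flywheel_energy_joules(inertia_kg_m2: float, rpm: float) -> float:
    """Kinetic energy of a rotating mass: E = 1/2 * I * omega^2."""
    omega = rpm * 2.0 * math.pi / 60.0  # revolutions per minute -> radians per second
    return 0.5 * inertia_kg_m2 * omega ** 2


def ride_through_seconds(inertia_kg_m2: float, full_rpm: float,
                         min_rpm: float, load_watts: float) -> float:
    """Seconds an ideal (lossless) flywheel can supply load_watts
    while slowing from full_rpm to the lowest usable speed, min_rpm."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    if min_rpm > full_rpm:
        raise ValueError("min_rpm cannot exceed full_rpm")
    usable = (flywheel_energy_joules(inertia_kg_m2, full_rpm)
              - flywheel_energy_joules(inertia_kg_m2, min_rpm))
    return usable / load_watts


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration, not real equipment data.
    seconds = ride_through_seconds(inertia_kg_m2=500.0, full_rpm=3000.0,
                                   min_rpm=1500.0, load_watts=1_000_000.0)
    print(f"Estimated ride-through: {seconds:.1f} s")
```

The point of the calculation is that ride-through time is finite and short: if the generators do not start before the flywheel slows below its usable speed, the load is dropped.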
Some customers speculated that the flywheel systems were at fault. Troubleshooting the exact reason for the generator failure will take some time, according to 365 Main. “Due to the complexity and specialization of data center electrical systems, we are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause of why certain generators did not start,” the company said in its incident report.
The downtime quickly became a public relations setback for 365 Main, as the blogosphere pounced on a failure that knocked many of its leading hosts and services offline. The outage was highlighted at O’Reilly Radar, Scobleizer and TechCrunch, among others.
Earlier in the day the company issued a press release noting two consecutive years of uptime for a customer at the San Francisco data center, RedEnvelope. The press release was noted on Slashdot and Techdirt and has since been removed from 365 Main’s web site.
Misinformation spread swiftly, propelled by the blogs and forums not affected by the outage. CNet, which hosts its servers at 365 Main, debunked reports from ValleyWag that a drunk employee had gone on a rampage and that a “mob of angry customers” assembled outside the 365 Main building. The “mob” was actually a line of customers who were forced to enter through the front door and have badges checked manually to get into the building because the parking garage gate was affected by the power outage, according to CNet. ValleyWag’s “drunk employee” post quickly became one of the most popular posts on the front page at Digg.
The problems began when parts of PG&E’s San Francisco area network began experiencing voltage fluctuations, which apparently caused a transformer to fail in a manhole under 560 Mission St. Witnesses told the San Francisco Chronicle they heard a blast shortly before 2 p.m. and then saw flames licking up through the manhole grate. PG&E could not confirm that an explosion had occurred, but said that 30,000 to 50,000 customers were affected.
The 365 Main data center was originally built by AboveNet, which spent $125 million to construct and “earthquake proof” the facility. After AboveNet filed for bankruptcy, 365 Main bought the property for $2.6 million in a court-approved deal. 365 Main has since expanded its network to seven data centers, including facilities in Oakland, Phoenix, Chantilly, Va. and two centers in Los Angeles (El Segundo and Vernon/Irvine).