Rackspace has shared additional details about the outages Monday at its Dallas data center, which knocked many popular Web sites offline for about three hours. There were actually two separate incidents causing customer downtime at the Dallas facility, according to Rackspace, which said it had been "a long and hard Monday." The second outage illustrates the challenges created by rising data center heat loads during a temporary loss of cooling. Rackspace indicated that affected customers will likely receive account credits for the downtime.
The first incident happened Sunday at 4 am Dallas time, when an unspecified mechanical failure knocked some customers offline. Rackspace said service was restored fairly quickly. The outage Monday night was more problematic, and knocked many sites offline. Here are the details:
"In the second incident at approximately 6:30 PM CST Monday, a vehicle struck and brought down the transformer feeding power to the DFW data center. It immediately disrupted power to the entire data center and our emergency generators kicked in and operated as intended. When we transferred power to our secondary utility power system, the data center's chilling units were cycled back up. At this time, however, the utility provider shut down power in order to allow emergency rescue teams safe access to the accident victim. This repeated cycling of the chillers resulted in increasing temperatures within the data center. As a precautionary measure we decided to take some customers' servers offline. These servers are now back up, as are the chillers."
While the first outage was linked to an equipment failure, it appears the second was a result of the temporary loss of cooling as the chillers restarted. It highlights the challenges posed by the increasing heat and power loads in modern data centers, which have dramatically shortened the time window in which cooling must be restored before servers overheat.
A recent study by Opengate Data Systems found that a typical data center running at 5 kilowatts per server cabinet will experience a thermal shutdown within three minutes during a power outage. Higher-density cabinets running at 10 kilowatts will shut down in less than a minute. "Thermal runaways can wreak havoc on a data center causing instant data loss and lost revenue," said Martin Olsen, director of Product Management and Development for Active Power, a maker of UPS flywheel systems and the sponsor of the study.
Rackspace has a service level agreement (SLA) that promises 100 percent uptime and outlines customer refunds in the event of downtime, as noted by Jesse Robbins at O'Reilly Radar. Under those terms, Rackspace will credit the customer 5 percent of its monthly fee for each 30 minutes of downtime.
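To put that credit in concrete terms, here is a rough sketch of the arithmetic in Python. It is not Rackspace's actual billing logic: the function name, the decision to count only completed 30-minute blocks, and the 100 percent cap are all assumptions made for illustration, since the SLA summary above does not spell out those details.

```python
def sla_credit_percent(downtime_minutes, percent_per_block=5.0,
                       block_minutes=30, cap_percent=100.0):
    """Estimate an SLA credit as a percentage of the monthly fee.

    Assumptions (not stated in the SLA summary): only completed
    30-minute blocks earn credit, and the credit is capped at
    100 percent of the monthly fee.
    """
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes must be non-negative")
    # Count whole 30-minute blocks of downtime.
    full_blocks = int(downtime_minutes // block_minutes)
    # 5 percent per block, never more than the assumed cap.
    return min(full_blocks * percent_per_block, cap_percent)


# A site that was down for the full three hours (180 minutes):
print(sla_credit_percent(180))  # 30.0
```

Under these assumptions, a customer who was offline for the roughly three hours reported for Monday's outage would be credited about 30 percent of its monthly fee.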
In a comment on the Radar post, Rackspace chairman Graham Weston indicated that Rackspace will live up to its SLA terms. "We pledge to pursue 100% uptime," Weston wrote. "We do not consider a minute of downtime acceptable. Today, we broke our promise to a lot of customers in our Dallas datacenter. We will make it right with or without this SLA."