Rackspace Expects Credits of $2.5 Million
July 2nd, 2009 By: Rich Miller
Rackspace Hosting (RAX) expects to issue customer service credits from Monday’s data center outage of between $2.5 million and $3.5 million, the company said in an SEC filing.
An unusual series of equipment failures contributed to Monday’s power outage at a Rackspace data center near Dallas, in which several parts of the facility lost power. The most significant failure involved a bank of generators that malfunctioned, leaving several computing clusters without backup power. Other failures involved equipment that connects the data center to its two utility feeds, a UPS system and a transfer switch. The outage affected two of the four phases of data center space in the facility in Grapevine, Texas.
“We sincerely apologize for this disruption and know that it impacted our customers’ businesses as well as the experience of many who use the web,” Rackspace CEO Lanham Napier said Tuesday in a summary on the Rackspace blog. “Although we have had some issues with this data center before, please know that we will do what it takes to improve its reliability and performance. We owe you an action plan to prevent this type of thing in the future, and we’ll get that to you as soon as it is ready.”
The DFW data center is the company’s largest facility, with 144,000 square feet of space. The facility in Grapevine figured into a 2007 power outage that interrupted service for many prominent web sites.
An analysis of the “root cause” of the outage has not yet been completed. But Rackspace made a preliminary incident report public after it was posted online. In an update Wednesday afternoon, Rackspace said it is “making progress in understanding what caused the interruption. We have our suppliers and external consultants onsite working with us on this process. We will continue to provide status updates as we learn more.”
Data centers are engineered to avoid a “single point of failure.” When a facility loses power, it is usually due to a combination of failures. That was true at Rackspace, which experienced problems at multiple points in its power infrastructure. The incident was triggered when the breaker on the data center’s primary utility feed tripped, and the facility switched to generator power. The generators then failed to hold load properly.
“What we saw yesterday was a situation where the generators started fighting with one another on the bus,” Rackspace said of the generator challenges. “The generators were unable to get properly synchronized. Eventually, they failed in a cascading manner and we lost all of the generators. Each generator failed on a loss of excitation – an inability to maintain the magnetic field. But it was really the inability to get synchronized that created that fault.”
In an update Thursday afternoon, Rackspace said it would conduct maintenance early Friday morning on its main utility breaker and generators, which will mean shifting some customers to generator power from midnight until about 6 a.m.
As a customer that spends nearly 5k per month with Rackspace, I expect a credit for the downtime. There is no excuse for what happened; we pay a premium for their service.
Jason Posted July 6th, 2009
Outages happen; nothing is ever foolproof. It seems that they have a pretty good idea of what went wrong and, from the sounds of it, you’d never be able to reproduce it during normal drills and tests. If your site is so critical that it can’t go down, host it in two locations.
Jason: Outages do happen.
Four outages have happened in 30 days, all due to their power “redundancy” plan failing.
That is beyond reasonable.
Yes, outages do happen.
But I have two questions:
1) When was the outage detected in the network management system (NMS) at Rackspace? An outage like this only shows that the organization has no NMS plan for managing its network. I would be interested to meet Mr. Napier and explain the technologies that could have been put in place to ensure service availability in such scenarios.
2) Where was the business continuity plan (BCP) for a power outage?