Rackspace: ‘We Have Work To Do’
November 4th, 2009 : Rich Miller
Rackspace has apologized to customers of its cloud computing service for an outage early Tuesday, and provided additional details about the incident. The company said it still has “high confidence” in the capabilities of its Dallas/Fort Worth data center, which has been plagued by a series of power events this year that have disrupted customer service.
“The experience you had last night is not acceptable to us,” wrote Emil Sayegh, General Manager of The Rackspace Cloud, in an e-mail to customers. “We have work to do to earn back your trust. We will not rest until we have.” As part of that process, Rackspace is reviewing its procedures for recovering from outages in its cloud computing operation.
Power Loss at PDUs
The downtime occurred during maintenance as Rackspace continued to work on refinements to the power infrastructure at the DFW data center. A short during testing of a Power Distribution Unit (PDU) caused a loss of power to all PDUs in Cluster G of the Dallas facility.
“The power disruption was approximately 5 minutes in duration,” Sayegh wrote. “Despite this short power disruption, many customers experienced downtime that was significantly longer. Since the power disruption hit the core of many of our cloud services, recovery of full operations required more effort than simple recovery of power.”
Hard Reboot, Unforeseen Failures
Sayegh said that although the power outage was brief, it forced a hard reboot on some of Rackspace’s cloud infrastructure. “As our engineers worked to bring hardware back online, we experienced several unforeseen hardware failures,” Sayegh added. “Further complicating our recovery effort, the incident also created internal DNS issues, which caused additional delays. With that said, the vast majority of cloud customers affected by this outage had service restored within one hour’s time … however, depending upon the service, a few customers experienced service interruptions for up to few hours.”
The outage comes after Rackspace vowed a thorough review of the Dallas facility, which experienced power-related outages on June 29 and July 7, as well as a significant outage in 2007. Rackspace CEO Lanham Napier said the company would “bring every resource to bear” on improving the reliability of the DFW facility.
Sayegh said Rackspace has “invested massively in the DFW facility to ensure it delivers at a level you expect from Rackspace – despite last night, we feel very good about our plan and have high confidence in the DFW facility – clearly we have to prove it.”
Wednesday morning data center bits … « The Server Room
Posted November 4th, 2009
[...] says that they have “work to do” on their DFW data center that recently experienced another in a string of power failures. I think [...]
Outage resolved « Blogging Bandits
Posted November 4th, 2009
[...] are some additional comments from Rackspace on this page, as well. [...]
Thank you Rich for writing this. Again, apology for letting our customers down. We are working hard to regain everyone’s trust.
Let me know if you or any of your readers would like to talk or have any questions.
Emil Sayegh (emil.sayegh@rackspace.com)
General Manager, The Rackspace Cloud
Joe
Posted November 5th, 2009
The Rackspace failures went beyond simple electrical, electronic and mechanical problems. Support would not give any information during those agonizing hours but just said “check the status page” even though my Cloud server was down for another hour after the status page said systems had been restored. And twice the link support gave me to the status page was the wrong one. During a similar outage at another host they gave constant, detailed updates on their status page and in their forums. “Fanatical” is a great PR phrase, but in my 11 years as an admin I’ve had better than “Fanatical.”
X
Posted November 5th, 2009
Yawn, must be a dupe…
Wait, what?!? RackSpace did it AGAIN?!? They are so screwed…
There’s a map of CERTIFIED commercial Tier III & IV data centers at the Uptime Institute here (http://professionalservices.uptimeinstitute.com/TierCertifiedServiceProvidersMap.htm) in case someone’s looking for a high-availability data center.
Tech Interrupted: Lessons From T-Mo and Rackspace – GigaOM
Posted November 20th, 2009
[...] followed the outage. On Monday night Rackspace, which provides managed hosting and cloud services, also experienced problems that took some customers out for hours. The interesting things here are the difference in how both [...]