Rackspace reports that parts of its Dallas data center lost power early today while power distribution units (PDUs) were being tested as part of scheduled maintenance. This resulted in downtime for sites hosted on SliceHost and The Rackspace Cloud, including the leading tech blog TechCrunch, which ensured that the outage was widely noted on blogs and Twitter.
The Dallas data center has experienced power problems before, including outages on June 29 and July 7 that prompted Rackspace CEO Lanham Napier to issue an apology to customers and provide a detailed explanation of the outage and the operations of the Dallas/Fort Worth facility.
This morning's problems started at about 12:30 a.m. central time. "We were testing phase rotation on a Power Distribution Unit (PDU) when a short occurred and caused us to lose the PDUs behind this Cluster," Rackspace reported on its blog. "All power has been restored and devices are being brought back online. The PDUs were down for a total of about 5 minutes. We have aborted the maintenance for the remainder of the evening and will reschedule this for another date."
Although the PDUs were offline for only 5 minutes, many customer sites were unavailable for a longer window. Most sites returned to service by 2 a.m., while several cloud servers continued to experience problems until after 5 a.m., according to a timeline on the Cloud Servers status blog.
Data center maintenance windows are typically scheduled for overnight hours or weekends, when traffic is lower.
The Rackspace DFW data center in Grapevine, Texas, is the company’s largest facility, with 144,000 square feet of space. The facility in Grapevine figured into a 2007 power outage that interrupted service for many prominent web sites. In that incident, a vehicle struck a power transformer, and public safety officials turned off both of the facility’s power feeds during their emergency rescue operations.
In August, Rackspace leased an additional data center in Chicago. It has also expanded its disaster recovery and DNS failover services (via Neustar) so that customers can configure sites to switch automatically to a secondary site in the event of a failure.