Human Error Cited in Hosting.com Outage
Hosting.com said human error was responsible for a data center power outage that left more than 1,100 customers without service. The downtime occurred while the company was conducting preventive maintenance on a UPS system in its data center in Newark, Del.
“An incorrect breaker operation sequence executed by the servicing vendor caused a shutdown of the UPS plant resulting in loss of critical power to one data center suite within the facility,” said Hosting.com CEO Art Zeile in a statement. “This was not a failure of any critical power system or backup power system and is entirely a result of human error.”
Power was restored within 11 minutes of the incident, Zeile said. But customer web sites were offline for between one and five hours as their equipment and databases required more time to recover from the sudden loss of power.
“This past night was not an easy one for our affected customers in Newark,” said Zeile. “We have shared our sincere apologies and have kept them continually informed on the situation as it unfolded. Our operations team has taken serious corrective action to minimize and/or eliminate the possibility of this kind of human error while carrying out routine operations.”
The risks involved in preventive maintenance were a hot topic at the 24×7 Exchange conference last fall, when reliability expert Steve Fairfax asserted that preventive maintenance can introduce more errors than it prevents.
Jason Posted July 28th, 2012
Ah okay, this explains why speedtest.net was down for about 4-5hrs yesterday
Jeff Posted July 30th, 2012
“The risks involved in preventive maintenance were a hot topic at the 24×7 Exchange conference last fall, when reliability expert Steve Fairfax asserted that preventive maintenance can introduce more errors than it prevents.”
I am surprised that the difference between factory-direct service and third-party service hasn’t even been mentioned in this debate over whether PMs cause more problems than they catch… The details of this case are proprietary to Hosting.com (since they want to avoid a lawsuit by saying something out of turn), but companies often turn to third-party service to cut costs, and that has a negative impact on the efficacy of the task in question.
Also, not all equipment is created equal. Kirk key interlock systems can drastically reduce this kind of error; were they in use? On top of that, why no dual bus? Service gets the short end of the stick if they show up and a problem happens, but there are many other design factors that affect this.
These sorts of stories pop up often, and usually the only detail that comes out is “operator error”, but unless it was “EPO misfire due to errant hand” there is usually a whole lot more to the story.
Frank Posted July 30th, 2012
You only hear about the service errors that result in an outage — when the redundancies prevent outage it’s just assumed that systems and processes are working like they’re supposed to.
Steve Posted October 1st, 2012
I couldn’t agree more with what you guys have posted. It’s often overlooked that the facilities with downtime are maintained by 3rd party vendors throughout, and it’s a known fact that one vendor/company cannot know everything. I’ve experienced these mishaps first hand and had the onsite UPS techs questioning each other about the operation of the system they were working on. To all out there that are responsible for maintaining their facility, don’t be cheap or you’ll be out of a job in no time!!!
Downtime… it happens to every hosting provider. The question is whether publishing the news on Twitter, Facebook, or your website that “we have an emergency downtime and we are working on it” makes sure that customers have received that message and are also taking the necessary steps on their end.
All the best