Joyent event schwag with CTO Bryan Cantrill’s name tag on display at the 2011 Node Knockout hackathon. (Source: Joyent’s Facebook profile)

Admin Error Brings Down Joyent’s Ashburn Data Center

Joyent, a San Francisco-based provider of high-performance cloud infrastructure services, saw one of its data centers go down Tuesday as a result of an error made by an administrator. The company had to reboot all servers in its US-East-1 data center, located in Ashburn, Virginia.

The provider has not released information on what exactly caused the outage, but is promising a “full postmortem.” In a forum post on Hacker News, Joyent CTO Bryan Cantrill wrote that the company would be providing the information “as soon as we reasonably can.”

Cloud outages sting more than others

Outages at service provider data centers do far more damage than enterprise data center outages because a provider hosts infrastructure for many companies rather than one. Cloud data center outages are especially painful because each physical server may host multiple customers’ virtual compute nodes.

Another service provider, Internap, which offers cloud hosting services, experienced three outages at its New York City data centers during the past two weeks. The company did not say how many customers the outages affected overall, but at least 20 companies were affected by one of the incidents.

Internap’s problems were caused by electrical equipment failure, a different kind of outage from Joyent’s. Internap’s outages happened at the facilities layer of the stack, while Joyent’s incident happened at the IT administration level.

‘Fat finger’ shouldn’t hurt so much

While human error was at fault, Joyent’s system ideally would have been built to withstand such errors. “While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a data center,” Cantrill wrote, adding that the company would be improving software and operational procedures to prevent such incidents from happening in the future.

Joyent does not plan to discipline the administrator who made the error, Cantrill told The Register, explaining that the company was more interested in learning from the incident than in punishing people.

Joyent provides public and private cloud infrastructure services for companies that need more computing horsepower than the mainstream Infrastructure-as-a-Service providers, such as Amazon Web Services, can offer.

In addition to the Ashburn data center, brought online in February 2012, its cloud infrastructure lives in data centers in San Francisco, Las Vegas and Amsterdam.

One Comment

  1. I've been out of the server admin business for a while now, but it's surprising to me that one mistake could cascade through the server farm so quickly. I suppose that's the flip side of efficiency. I always advise my clients to keep local backups of their sites / data - an admin mistake like this could be a lot worse than just requiring a reboot.