Data Center Outages and Staff Scalability

Over the weekend social media icon Robert Scoble confirmed that he will be joining Rackspace Hosting (RAX) to head a new online community known as Building 43. Scoble is probably the best-known social media professional to be hired by a hosting or data center company. But he won't be the last. I expect we'll see hosting providers hiring a bunch of social media specialists in coming months, particularly in customer support.

The growth of social media, and Twitter in particular, is remaking the nature of data center outages on the web. This presents big challenges for service providers, as reputations can be made or undone in a matter of hours.

Until recently, discussions of hosting outages were confined to customer support forums or industry boards such as Web Hosting Talk. That's changed with the rise of social media, as blogs provide quick updates on downtime. Twitter has emerged as perhaps the most important venue for discussion of hosting outages, powering fast-moving conversations in which unhappy customers share information and complaints.

Hosting providers ignore this development at their peril. The speed and volume of customer chatter on Twitter and other social media channels during outages are becoming unmanageable, even for early adopters of new media for customer feedback.

The recent outage at Media Temple is a good example. The company's problems led Pingdom to note "how hard it is for modern connected organizations to respond quickly enough to system outages."

Media Temple hosts many prominent blogs, and has in-house blogs and RSS feeds dedicated to system status reports, as well as several Twitter profiles for customer interaction. During last month's lengthy outage, these channels proved inadequate to the task, and Media Temple took heat from customers over its responsiveness - including its failure to keep pace with Twitter inquiries.

Media Temple's response reflects a tough reality for hosting providers: getting the systems back online will always take priority over customer communications.

“One very important point here is during an outage of this magnitude, our entire admin staff is dedicated 100% to working on the Grid, and at times, can not simply deliver the information we need to disclose to the public,” MT community relations director Jason McVearry wrote to one critical blogger. “This is a constant struggle for many service providers and again, we’re working on improving this constantly.”

We've reached the point where outage-driven chatter on forums and social media resembles Internet traffic spikes from Slashdot or Digg - it arrives quickly, and can easily overwhelm the available resources. The availability of real-time feedback has raised customer expectations, which has in turn raised the stakes for providers that experience outages.

Yes, customers want their site back online ASAP. But they also want to know what's happening during the downtime, and radio silence only ensures customer static.

The challenge: support staff are not like cloud computing clusters. You can't auto-provision and scale the staff needed to manage your reputation in a crisis. You need dedicated staff.

Isn't this advocating for excess staff overhead in times of typical demand - the equivalent of buying extra dedicated servers in the age of cloud computing? Sure. But until some innovator invents the robot customer communications expert, it will have to do.

In the meantime, there are plenty of ways social media specialists can boost your business during non-outage scenarios. But that's a topic for another post.

FailSnail graphic by Ronin691.
