Dealing With Downtime
The discussion following last Sunday’s New York Times story on web site outages may have made the industry more self-conscious about downtime, but that doesn’t seem to have translated into fewer problems. This week we’ve had outages for Apple’s MobileMe service, Facebook (and again today), Google Docs, 37signals and LiveSide.
When downtime strikes, what’s a data center operator or online service to do? Finding the right way to acknowledge and manage downtime can be crucial to maintaining the confidence of your users, according to companies that have weathered outages. A good first step is to get over “downtime denial” and accept that your communications strategy is often as important as your efforts to restore service.
“One of the critical areas is listening to your users,” said Sandy Jen, co-founder and VP of engineering at Meebo, whose instant messaging service has more than 35 million users. “It’s all about expectations. The more honest you are, the more forgiving your customers are going to be.”
“When you provide a compelling service to your user base, you become an essential part of a user’s life,” said Raj Patel, Vice President, Network Systems of Yahoo. “You have to develop trust. There’s really no other way.”
That trust gets tested when a site goes down. The basic framework of provider downtime messaging usually looks a lot like this:
1. Sorry. We’re offline at the moment.
2. We know this site/service is important to you.
3. We have our best people working really hard on fixing the problem.
4. We’ll keep you informed as best we can.
5. Once we’re back online, we’ll sort out what went wrong and how to prevent it from happening again.
The past year has had its share of outages and downtime, and some data center providers have responded to customer concerns by publishing unusually detailed information in the wake of incidents. The days of brief outage advisories on customer-only support forums seem to be numbered.
Paying attention to customer expectations is critical. But caring about your customers doesn’t mean you’ll satisfy them. And once they lose trust, things can get ugly. During the recent downtime for The Planet, wild rumors circulated advancing alternate explanations for the extended downtime, and were picked up by at least one well-read blog. This prompted threads on The Planet’s support forum in which customers sought photographic evidence of the damage to the data center.
Sound crazy? Not if you ask the folks at HostDime/Surpass Hosting, which after a May outage encountered customer skepticism about whether it really had the backup infrastructure it described. The company posted a video of its data center to address the criticism.
You can’t satisfy everyone. But being straight with your customers/users and acknowledging their pain is better than heavy spin, according to marketing expert Seth Godin, whose thoughts on lessons learned from a lengthy 2006 outage at DreamHost are worth repeating.
“Lesson one: when things get messed up, being clear, self-critical and apologetic is really the only way to deal with customers if you expect them to give you another chance,” Seth writes. “Lesson two: your story is all you’ve got. If you sell the ‘up-time’ story, better over-invest in whatever it takes to be sure your story is true.”
Having had to explain unplanned outages to a hundred thousand or so customers a bit more often than I’d have preferred, I agree that straight talk when it comes to problems, performance or outages is the only path to take. I’m surprised, though, at the number of executives/managers/directors/PR people who would rather ignore or deny the obvious.
Your customers already know that you’ve had a service-affecting outage. You might as well own up to it.
There is no doubt that the downfall of some online businesses will be poor handling of unplanned downtime. In my opinion, the more information provided about what went wrong and what is going to be done to solve the problem, the better.
Of course, the reason many of these businesses prefer denial to disclosure is that they never actually get to the root cause of recurring brownouts and outages in the first place. How can you confidently address your users if you don’t know what happened and how to avoid it in the future?
When you are managing performance and availability of your online service using static threshold-based monitoring and tribal knowledge, getting to root cause is a massive, labor-intensive effort. You sift through thousands of alerts from each IT silo trying to find out which are relevant and which are not. You do human correlation based on experience. If you don’t find the problem quickly, you end up rebooting servers and moving on to the next problem before a root cause is determined. And, of course, the problem keeps recurring and you keep rebooting… With limited resources, post-mortem problem analysis is an afterthought, and it seems better to say nothing than to admit to your user community that you have no idea what caused the outage and no plan for eliminating recurrences.
These organizations need to be looking at real-time analytics-based solutions if they want to have a prayer of getting ahead of these problems and fixing them for good. These solutions eliminate the need to set static monitoring thresholds by learning the normal behavior of the entire infrastructure supporting the online service. They also do automated correlation of abnormal behaviors to pinpoint the root cause of performance degradations and outages in real time, something that is not humanly possible when you are dealing with thousands of devices and hundreds of thousands of metrics. These solutions also proactively alert on problem behaviors before they escalate, so that brownouts and outages can actually be prevented. IT Operations is then out of its traditionally reactive mode.
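To make the contrast concrete, here is a minimal sketch of the difference between a static threshold and a learned baseline. It is not any vendor’s implementation: the function names, the 12-sample trailing window, the 3-standard-deviation limit and the latency figures are all assumptions chosen for illustration, and a real product would model far more than one metric.

```python
from statistics import mean, stdev


def static_alerts(samples, threshold):
    """Return indices where the metric exceeds a fixed, hand-set threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def baseline_alerts(samples, window=12, z_limit=3.0):
    """Return indices where a sample deviates from its trailing window.

    The 'normal' level is learned from the previous `window` samples;
    a sample is flagged when it lies more than `z_limit` standard
    deviations from that window's mean. (Illustrative assumption only.)
    """
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as abnormal.
            if samples[i] != mu:
                alerts.append(i)
            continue
        if abs(samples[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds; index 12 is a brownout.
latency_ms = [100, 102, 98, 101, 99, 103, 100, 97, 101, 102, 99, 100, 300, 101]

print(static_alerts(latency_ms, threshold=500))  # the fixed 500 ms limit never fires
print(baseline_alerts(latency_ms))               # the learned baseline flags the jump
```

In this made-up series the response time triples, yet it never crosses the hand-set 500 ms limit, so the static check stays silent while the baseline check flags the sample. That is the kind of brownout the paragraph above describes, though correlating such signals across thousands of devices is a much harder problem than this sketch suggests.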
In my opinion, until online businesses adopt these types of solutions we’ll be hearing a lot of silence when unplanned downtime occurs…