Web Outages and Damage Control 2.0
Web site outages have never been more public. When a site or service goes down, companies are using blogs to rapidly update their users about what happened and why. The inevitable finger-pointing takes place in real time, and within a matter of minutes, a server failure can generate a headline on TechCrunch or ValleyWag. This creates a challenge for web hosts and data center providers whose business is built upon a reputation for reliability. In this fast-moving environment, how do you stay accountable to customers while also limiting headline risk?
The most recent example involves Rackspace, which has earned a reputation for uptime and “fanatical” customer support but is taking heat from two customers in the Web 2.0 space, 37signals and Tumblr, who are blaming the San Antonio company for recent service outages.
37signals, which offers a suite of Web-based software for businesses, was offline for two hours on Friday (Jan. 18) in an outage that was widely noted around the blogosphere. The company blamed Rackspace: “All the 37signals properties were offline as our load balancer blew out and knocked out the network connection for all our servers. … Naturally, we’re going to have a long, serious talk with our service provider (Rackspace). They’re supposed to be the best in the business, but in this instance they failed us, so we in turn failed you.”
The free microblogging service Tumblr was offline for several hours Monday, and founder David Karp didn’t pull any punches in his explanation to customers, saying a firewall upgrade “was botched” by Rackspace. “Their accessibility and performance during this experience was completely unacceptable for the level of service we demand for our users,” Karp wrote. “In light of these events, we’ve raised our requirements and are preparing a move to a new hosting provider.”
Pretty tough stuff. Rackspace, which has historically been among the most reliable providers in Netcraft’s monthly survey of hosting company uptime, has had its defenders in the post-outage conversations. Both incidents involved equipment specific to a single customer, rather than the entire data center.
The past year has had its share of outages and downtime, and some data center providers have responded to customer concerns by publishing unusually detailed information in the wake of incidents. The days of brief outage advisories on customer-only support forums seem to be numbered.
Sometimes the details of an incident focus attention on the customer, as well as the host. 37signals came under criticism for its web architecture, which relied upon a single load balancing server. Rackspace offers a guarantee that it will replace any failed piece of hardware within one hour. It took Rackspace technicians two hours to get 37signals’ load balancer back online. But even if Rackspace had delivered on its guarantee, 37signals would already have been offline for an hour. A fair assignment of blame might attribute the first 60 minutes of the outage to 37signals’ system design, and the second hour to Rackspace’s delay in meeting its hardware replacement guarantee.
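That split can be written out as a few lines of arithmetic. The sketch below is only an illustration of the reasoning above, not anything either company uses: the function name and its parameters are hypothetical, and the 60-minute default simply stands in for the one-hour replacement guarantee.

```python
def split_outage_minutes(outage_minutes, guarantee_minutes=60):
    """Split an outage into time inside the replacement window and time past it.

    Hypothetical helper for illustration only. The first value is the downtime
    that would have occurred even if the guarantee had been met (a consequence
    of the single point of failure); the second is the overrun beyond it.
    """
    within_guarantee = min(outage_minutes, guarantee_minutes)
    past_guarantee = max(outage_minutes - guarantee_minutes, 0)
    return within_guarantee, past_guarantee


# The 37signals incident: a two-hour outage against a one-hour guarantee.
design_share, provider_share = split_outage_minutes(120)
print(design_share, provider_share)  # prints "60 60"
```

By the same logic, an outage resolved inside the guarantee window would fall entirely on the architecture side of the ledger, which is the argument for avoiding a single load balancer in the first place.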
Customer anger over an outage is understandable, and is to be expected. But there are also signs that the “shoot the provider” mentality doesn’t play well everywhere. Harper Reed, a software architect working with social networks, reflected on the way 37signals and Tumblr responded to their outages:
I don’t like how these companies are blaming their vendors when their sites are down. I understand how frustrating it is. I totally realize how you want a scapegoat … It seems to me that these companies are basically saying “this is the entity who is at fault. It isn’t us. In fact – we are so smart that we don’t need them. we are changing hosting.” But they should be saying “We were down. bummer. we are up now.”
What’s clear is that the awareness of outages is amplified in the Web 2.0 world, and management of post-outage communications needs to be part of every provider’s contingency planning.
It is certainly clear that outages such as those described can have a damaging effect on both the companies providing services via the web and their hosting providers. It is also clear that bandwidth-heavy Web 2.0 applications, as well as other new initiatives such as virtualization, have created a new level of IT complexity that current management approaches struggle to scale to.
IT staff need a new approach that allows them to take a more proactive stance, solving problems before costly downtime affects end users and the reputation of the company. There are now solutions available that use real-time problem modeling to warn IT staff of impending issues before they cause downtime. With solutions like this in place, companies can protect invaluable brand reputation, revenues, and other elements that are vital to business success.