Los Angeles web hosting company Media Temple will issue $100,000 in service credits to 3,000 customers affected by a major outage of its (gs) Grid-Service hosting service, the company said yesterday. The downtime on (gs) Cluster.2 was caused by multiple failures in a storage system that required Media Temple to check the integrity of 2.9 terabytes of customer data, totaling more than 80 million files. That led to some customers being offline for as long as 38 hours, the company said in an incident report.
The problems began at 12:30 pm Pacific time on Saturday, when one of the company's BlueArc storage file-systems became corrupted. "Moments later one of the redundant BlueArc controller heads crashed completely," Media Temple reports. "Within minutes of that event, the second controller head crashed resulting in complete unavailability of the storage system. These events made Cluster.2 completely unavailable. After approximately 2 hours, storage engineers were able to gain access to the corrupted file-system which allowed recovery efforts to begin."
As the outage persisted during the lengthy file check, customers expressed their growing unhappiness on blogs and Twitter. Breannan Novak started a new blog for Media Temple Customers to document the outage and the company's responses, collecting direct reports and customer reaction from other blogs, including an Open Letter to Media Temple from customer Steve Reynolds. The outage was also widely discussed on Twitter, where BlueArc and Media Temple each have accounts, although Media Temple's light usage of its Twitter account became an issue for some unhappy customers. Media Temple later communicated directly with Novak and Reynolds to address their concerns about the company's response.
The company cited the difficulty of managing both the recovery process and customer communications. "One very important point here is during an outage of this magnitude, our entire admin staff is dedicated 100% to working on the Grid, and at times, can not simply deliver the information we need to disclose to the public," MT community relations director Jason McVearry wrote to Reynolds. "This is a constant struggle for many service providers and again, we’re working on improving this constantly."
"We are deeply apologetic for the impact and duration of this System Incident," Media Temple wrote in its incident report, noting that it has developed a new storage architecture "which dramatically improves uptime for GRID customers. This technology increases stability overall and reduces the need for these lengthy file-system checks - the cause of this weekend’s extended unavailability."
Media Temple previously promised a new internally developed storage system for Grid-Service following an outage in December 2007.