Some grid hosting customers of Media Temple have been offline for more than a day due to a recurrence of the storage problems that caused a 38-hour outage earlier this year. Sites hosted on Cluster.02 of Media Temple’s Grid-Service platform went down Monday afternoon at about 4:15 p.m., an outage the Los Angeles hosting company attributed to file corruption in a storage system.
The problems affect only customers of its grid hosting service, not those using MT for shared hosting. On Tuesday afternoon, Media Temple said 1,800 customers remained offline, and the company apparently disagrees with its storage vendor about the best restoration option and the prospect of additional downtime.
Media Temple said the problems were similar to those that crashed Cluster.02 for nearly two days in early March. At the time, Media Temple blamed the problems on its legacy “first generation” storage system from BlueArc and announced plans to move its grid hosting service to a new “second generation” system. Two months later, that migration is still not complete.
“This System Incident is taking place on a cluster using our 1st-generation storage architecture,” Media Temple said on its status page. “While we have made significant progress transitioning this cluster to gen 2 storage … there are still components that rely heavily on the gen 1 technology.”
Media Temple first promised a new internally developed storage system for Grid-Service following an outage in December 2007.
Media Temple said that some customer sites were returned to service early Tuesday morning, but others remained offline pending a lengthy file check of a storage subsystem. As of 11 a.m. Pacific time Wednesday, Media Temple said the process was nearly complete, but still had no firm ETA for when service would be restored.
UPDATE: As of 4 p.m. Pacific on Tuesday, Media Temple says 1,800 sites remain offline. “Our storage vendor has recommended a full file system check on Segment.01, amounting to an additional estimated 48 hours of downtime for those 1800 users,” MT writes in its latest update. “This is not acceptable to us. Our engineers are already in the process of restoring all Segment.01 sites from our backup servers. … The vendor-recommended repair action will be completed in tandem to prevent data loss. When it is complete we will then merge the repaired data to bring all data current.”