As the outage affecting Joyent's online storage systems continues into its fourth day, the company has issued a more complete explanation of the issues.
A customer advisory yesterday by CEO David Young blamed a "massive bug" in the ZFS file system in OpenSolaris. It turns out that the issue was a known bug that had been identified and has been fixed since Solaris Nevada Build 60, which was published in Feb. 2007. A post at the OpenSolaris forum indicates that Joyent was running build 43, which was published in June 2006. Young addresses this in a customer update today:
"It has been pointed out elsewhere that we were running an older version of the OpenSolaris operating system on this X4500. That is true. However, since this particular X4500 also housed two services (rather than just backups), we had been waiting to upgrade the X4500 in anticipation of some software updates that were/are in the pipeline for OpenSolaris itself. ... Unfortunately, OpenSolaris does not currently provide a straightforward upgrade process from build-to-build. If all the stars aligned, an upgrade takes about six hours. Realistically, we estimated we would have needed to schedule a multiday downtime given the historical uncertainties around importing zpools from older version of ZFS into newer versions of ZFS."
Managing updates and system stability on live services is a complex task. But given that a fix for the bug had been available for nearly a year, some commenters at TechCrunch felt ZFS was being unfairly blamed for the outage.
In an interesting collision of downtime sagas, it turns out that one of Joyent's marquee clients for much of 2007 was Twitter, the fast-growing microblogging service that had already logged more than six days of downtime this year before yesterday's crash during Steve Jobs' MacWorld keynote. Twitter said last month that it was switching data centers, and it's not clear whether it continues to use Joyent's Accelerators.