Sun’s ‘Data Center Meltdown’ Prediction
December 11th, 2007 By: Rich Miller
There’s been a growing buzz in the IT blogosphere about a recent statement by Sun Microsystems’ Subodh Bapat, who said the Internet would see “a massive failure within a year.” Bapat, Sun’s eco-computing vice president, said there would be a data center failure that would cause disruptions on the scale of the Morris Worm from 1988.
Bapat’s comments were first noted on News.com, and have since been picked up by Nicholas Carr, Michael Krigsman, Deal Architect and Ashlee Vance at The Register, which had great fun with a headline about “voracious, collapsing data centers.” Vinnie Mirchandani said the importance of Bapat’s warning was missed or underplayed by the blogosphere and media.
Data center failures are newsworthy. That’s why we write about them here at Data Center Knowledge. But what’s new here? The only thing unusual about Bapat’s prediction is the scope of the failure he’s anticipating. Frankly, the Morris Worm doesn’t strike fear into many hearts because it happened nearly 20 years ago and didn’t affect Main Street. Do you know anyone outside IT who remembers what they were doing when the Morris Worm struck?
To anticipate data center failures in 2008, one need only look at what has happened in 2007. Let’s review the last six months:
- July 23: A generator failure at 365 Main knocks its San Francisco data center offline for hours, disrupting many of the web’s most popular destinations.
- July 31: A data center migration at ValueWeb goes badly, leaving thousands of sites offline, some for as long as three days. About 500 servers experience hardware failures while being moved from Affinity Internet’s Miami, Fla. hosting facility to a Hostway data center in Tampa.
- Early November: Up to 175,000 sites at Alabanza are offline for three to five days as they are migrated to a data center at NaviSite (NAVI), which acquired Alabanza in August. NaviSite is moving customer accounts from Alabanza’s main data center in Baltimore to a NaviSite facility in Andover, Mass.
- Nov. 13: A Rackspace data center in Dallas loses power after a traffic accident takes out a utility transformer, knocking popular sites offline.
- Nov. 26: On Cyber Monday, the payment processing system at Yahoo Small Business crashes, leaving web merchants unable to process orders on the most heavily hyped day for online shopping.
In this context, the notion that a major failure will occur next year isn’t exactly a stretch. I’ve also learned to be cautious about apocalyptic predictions from vendors. In the late 1990s I covered technology for a daily newspaper and wrote about the Y2K bug. Later at Netcraft I blogged about Internet security, and found that vendor predictions of imminent doom were a monthly event.
Failure happens, as noted by EYP’s Richard Sawyer in a recent presentation at Data Center World, in which he urged a roomful of data center professionals to engineer their facilities to “fail small.” The best way to fail small, he said, is to acknowledge that failure happens and design around it. In an industry that exists to keep data centers online all the time, Sawyer says being honest about the likelihood of failure is critical to developing effective mitigation strategies.
Yes, there will be data center failures in 2008. Whether any of them resemble the “meltdown” foreseen by Bapat depends upon the fences currently being built around the foreseeable risks and small failures that work together to create large failures.
Although Bapat’s comments about the future state of the Internet were extreme, they also bring up a good point about the state of the data center. As companies keep adopting new technologies to improve operations (virtualization, SOAs, etc.) they are increasing the complexity of their systems and the management challenge for their IT Ops teams.
Unfortunately, software vendors have adopted a fundamentally reactive posture, focusing on fixing problems after the fact (“mean time to resolution” is their preferred metric).
A new generation of systems management vendors is emerging that applies real-time analytics to performance monitoring data in order to correlate alarm and metric behavior and identify a building pattern of problem precursors BEFORE they impact a mission-critical application or business service.
The challenge is to process the metrics on a massive scale (my company’s product “Integrien Alive” processes 4 million metrics every 5 minutes), but the benefits of automating this correlation and providing proactive, preventive insights to the Ops team are tremendous.