Lessons Learned from The Planet’s Outage

In the aftermath of major data center outages, there are usually lessons to be learned from the incident, which can help identify refinements going forward. Those lessons can be useful to the community as well as the staff at the affected facility. As The Planet continues its recovery from Saturday’s electrical explosion and fire, the staff at the Internet Storm Center (ISC) have identified some possible lessons from the outage for business continuity and disaster recovery:

  • Communications is critical, but even the best-laid plans can go awry from a Slashdotting. Center Networks, RepuMetrix and ISC handler Swa Frantzen were impressed with the regular updates The Planet’s staff was sharing on its customer forum. But the forum was soon bogged down by heavy traffic, which only got worse when the outage was posted on Slashdot. “I think this is about the worst moment to get on Slashdot you can imagine,” the ISC notes. “But it’s a likely result of the incident that those things you still have will attract more visitors than ever before. Again something to plan for … Make sure to have your emergency communication as solid as you can, as static as possible, and as lightweight on the server(s) as you can imagine. The last thing you want to do during an emergency is to have to survive a DDoS from curious people.”

  • Consider DNS Redundancy and Diversity: The ISC notes that both The Planet and some of its customers are rethinking DNS management to address some of the “what if” scenarios raised by this event. The Planet has six data centers, and the facility that went offline housed 9,000 servers. But the power outage also appears to have affected sites in functioning data centers whose DNS servers were hosted in Houston. In addition, customers who registered their domains through The Planet may have found themselves without access to their control panel, making it impossible to repoint their DNS at a backup version of their sites.
  • Think Like A Fire Marshal: CEO Doug Erwin noted that The Planet was “not allowed to activate our backup generator plan based on instructions from the fire department.” This issue also came up in last November’s outage at Rackspace, as well as earlier incidents at Alabanza in Baltimore in 2004 and 2003, when public safety officials shut off power to the building. “I had seen plans for BCP/DRP derail before due to officials stepping in and doing their response to an emergency in their way and not in the way the organization itself had planned it,” Frantzen wrote on the ISC blog. “I think it would be interesting for most of us to actually talk to fire departments and/or police officers on what their normal responses are and take them into account in our plans.”
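
The DNS lesson above lends itself to a quick self-check. The following is a minimal sketch, not anything The Planet or the ISC published: it assumes you already know the IP address of each of your authoritative name servers, and it uses "same /24 network" as a rough stand-in for "same facility," which will not always hold. The function names, hostnames and addresses are all hypothetical (the addresses come from the ranges reserved for documentation).

```python
import ipaddress
from collections import defaultdict


def group_by_network(nameservers, prefix_len=24):
    """Group name server hostnames by the IPv4 network that contains them.

    `nameservers` maps hostname -> IPv4 address string. Sharing a network
    is only a rough proxy for sharing a data center (an assumption).
    """
    groups = defaultdict(list)
    for host, addr in nameservers.items():
        # strict=False lets us pass a host address and get its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[net].append(host)
    return dict(groups)


def has_network_diversity(nameservers, prefix_len=24):
    """Return True if the name servers span more than one network."""
    return len(group_by_network(nameservers, prefix_len)) > 1


# Hypothetical data: both name servers sit in the same network
single_site = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.11",
}

# Hypothetical data: the second name server is hosted elsewhere
two_sites = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.net": "198.51.100.7",
}

if __name__ == "__main__":
    for label, servers in (("single_site", single_site), ("two_sites", two_sites)):
        status = "diverse" if has_network_diversity(servers) else "single point of failure"
        print(f"{label}: {status}")
```

A check like this only catches the most obvious case. Name servers on different networks can still share a building, a power feed or a registrar control panel, so the result is a prompt for further questions rather than proof of redundancy.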

The staff at The Planet have their hands full, and presumably have not had time to develop or share a root cause analysis. But it’s good to see the incident has prompted some productive discussion as well as complaints.

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

5 Comments

  1. Bullie Pups R Us

    All I know is that I need a way to get my website up and running. As a small business, this timeframe is crucial to our overall family survival. Each day is a huge loss to our financial stability. Bullie Pups R Us

  2. Tracy

    I'm with you on that. This outage is seriously hurting my business, too. What ever happened to redundancy and disaster recovery plans?

  3. Robert

    Not only does the outage continue to hurt our small family business, but the mail server associated with our web hosting being down is known to have caused several problems, and until I get email back I will not know how bad the problem really is.

  4. Rusty

    I should have just run everything on my DSL line at home; it would have proven to have more uptime than The Planet. :-( The worst part is that they're not communicating useful info. For the last 48 hours, they've been telling me that it will be fixed in 4 hours. We're still down. Their communications are terrible. It's always "we'll have it fixed in 4 hours". We've heard that 15+ times already.

  5. Bullie Pups R Us

    We left and went with another, more reliable company after this. How many times can a data center have a massive fire before you realize there is a problem? This is the second fire located at the same center. HELLO!