Major data center outages usually yield lessons that can help refine operations going forward. Those lessons can be useful to the community as well as the staff at the affected facility. As The Planet continues its recovery from Saturday’s electrical explosion and fire, the staff at the Internet Storm Center (ISC) have identified some possible lessons from the outage for business continuity and disaster recovery:
- Communication is critical, but even the best-laid plans can go awry from a Slashdotting. Center Networks, RepuMetrix and ISC handler Swa Frantzen were impressed with the regular updates The Planet’s staff was sharing on its customer forum. But the forum was soon bogged down by heavy traffic, which only got worse when the outage was posted on Slashdot. “I think this is about the worst moment to get on Slashdot you can imagine,” the ISC notes. “But it’s a likely result of the incident that those things you still have will attract more visitors than ever before. Again something to plan for … Make sure to have your emergency communication as solid as you can, as static as possible, and as lightweight on the server(s) as you can imagine. The last thing you want to do during an emergency is to have to survive a DDoS from curious people.”
- Consider DNS Redundancy and Diversity: The ISC notes that both The Planet and some of its customers are rethinking DNS management to address some of the “what if” scenarios raised by this event. The Planet has six data centers, and 9,000 servers were housed in the facility that went offline. But the power outage also appears to have affected sites in functional data centers whose DNS servers were hosted in Houston. Also, customers who registered their domains through The Planet may have found themselves without access to their control panel, making it impossible to repoint their DNS to a backup version of their sites.
- Think Like A Fire Marshal: CEO Doug Erwin noted that The Planet was “not allowed to activate our backup generator plan based on instructions from the fire department.” This issue also came up in last November’s outage at Rackspace, as well as earlier incidents at Alabanza in Baltimore in 2003 and 2004, when public safety officials shut off power to the building. “I had seen plans for BCP/DRP derail before due to officials stepping in and doing their response to an emergency in their way and not in the way the organization itself had planned it,” Frantzen wrote on the ISC blog. “I think it would be interesting for most of us to actually talk to fire departments and/or police officers on what their normal responses are and take them into account in our plans.”
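For readers who want to act on the DNS diversity point above, here is a minimal sketch of a rough check. It assumes you already have a list of your domain's nameservers and their IPv4 addresses; the function names, hostnames, and addresses are hypothetical illustrations, not anything used by The Planet or the ISC. Sharing a /24 network is only a crude proxy for sharing a building, so treat a "diverse" result as a starting point for asking where each nameserver physically lives.

```python
import ipaddress
from collections import defaultdict


def group_by_network(nameservers, prefix_len=24):
    """Group nameserver hostnames by the network their address falls in.

    `nameservers` maps hostname -> IPv4 address string (assumed input).
    """
    groups = defaultdict(list)
    for host, addr in nameservers.items():
        # strict=False lets us pass a host address and get its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(host)
    return dict(groups)


def is_diverse(nameservers, prefix_len=24):
    """Return True when the nameservers span at least two distinct networks."""
    return len(group_by_network(nameservers, prefix_len)) >= 2


# Hypothetical data using documentation-only address ranges
same_site = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "192.0.2.20"}
split_sites = {"ns1.example.com": "192.0.2.10", "ns2.example.net": "198.51.100.20"}

print(is_diverse(same_site))    # both nameservers share one /24
print(is_diverse(split_sites))  # nameservers sit in different /24s
```

A check like this cannot confirm geographic separation on its own, but it can flag the obvious case where every authoritative nameserver sits behind the same network and would likely go dark together.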
The staff at The Planet have their hands full, and presumably have not had time to develop or share a root cause analysis. But it’s good to see the incident has prompted some productive discussion as well as complaints.