Disaster Recovery Scenario: Stick to the Plan


Richard Dolewski is a certified systems integration specialist and disaster recovery planner, and is Chief Technology Officer and Vice President of Business Continuity Services for WTS. His recent book, System i – Disaster Recovery Planning, is available at amazon.com.

RICHARD DOLEWSKI
WTS

Imagine this scenario. Your servers are down. The computer room is dark. A major disaster has occurred. You don’t know the details, but you need to determine your next move. What task should you do first? What are your priorities? Should you start recovery of your servers, and if so, in what order? If you ask the business experts, they’ll tell you everything is a business priority, but you have to make some critical decisions. Advice: lock the doors before the rush of self-proclaimed experts comes through and starts telling you what has to be done.

Will you simply listen to the person who screams the loudest and get his server back up and running first? If not, then what IS your top priority? The computer systems may or may not be recoverable in the short term, and perhaps not in the longer term either. You’ll need to take a deep breath and remember that this is what you’ve been documenting and practicing for all these years. But, even if you have a disaster recovery plan, does it include prioritization of server recovery in a disaster?

Managing Mission Critical Servers for Business Continuity

A lot of work goes into managing the ongoing requirements for mission critical servers. When you have downtime, for any reason, data is unavailable to your customers, and this generally means that business, both yours and your customers’, abruptly stops. When business stops, it gets very expensive in a hurry. That is why critical server requirements need to be reviewed twice a year, to ensure that effective server processes are carried out to support the needs of your business and that the identified servers are still in alignment with business priorities and goals. The list below includes elements that should be reviewed on a regular basis to support critical server definition requirements.

  • Business impact analysis and risk assessment
  • Strategy for server recovery
  • Change in prioritization based on different business cycles
  • Application dependencies and interdependencies
  • Application downtime considerations for planned and unplanned outages
  • Backup procedures
  • Offsite storage for vital records
  • Data retention policies
  • Recovery time objectives (RTO)
  • Recovery point objectives (RPO)
  • Hardware for critical server recovery
  • Alternate recovery site selection
  • IT and business management signoff

Classifying Systems for Disaster Recovery Priority

Your computer room is likely filled with rows and rows of servers. Numerous hardware platforms are powered on and ready to serve some business purpose. Your servers most likely span several hardware generations. You should have a planned roadmap and prioritized recovery plan for your complete critical server infrastructure. You’ll need to understand the supporting business needs of all servers in advance of any disaster ever occurring. Don’t wait for that phone call in the middle of the night to decide your server recovery strategy. Not all servers in your computer room are of equal importance to your business. That is why you need to consider the difference between:

  • what you need
  • what you want to have
  • what you don’t need at all to run your business in a disaster.

The backup recovery team should assign priorities to the servers as they relate to your business support priorities. There will be a mix of opinions, of course, but a good Business Impact Analysis will reveal which of those opinions carry the most weight. You should categorize the business requirements and supporting servers as Critical, Essential, Necessary, or Optional, as follows:

  • Critical Systems – These servers must be in place for any business process to continue at all. These systems have a significant financial impact on the viability of your organization. Extended loss of these servers will cause a long term disruption to the business, and potentially cause legal and financial ramifications. These should be on the A-List of your disaster recovery strategy.
  • Essential Systems – These servers must be in place to support day-to-day operations and are typically integrated with Critical Systems. These systems play an important role in delivering your business solution. These should also be on the A-List recovery strategy.
  • Necessary Systems – These servers contribute to improved business operations and provide improved productivity for employees. However, they are not mandatory at a time of disaster. These might include business forecasting tools, reporting, or improvement tools utilized by the business. In other words, they have minimal business or financial impact. These systems can be easily restored as part of the B-List recovery strategy.
  • Optional Systems – These servers may or may not enhance the productivity of your organization. Optional systems may include test systems, archived or historical data, company Intranet and non-essential complementary products. These servers can be excluded from your recovery strategy.

The above server classifications will provide you with the baseline for your decision-making matrix. Most importantly, your IT recovery team and your business management team must agree on the disaster recovery planning scope for the server classifications. Differentiating between critical, essential, necessary and optional servers reduces the number of servers the disaster recovery plan has to support, which both increases backup and recovery efficiency and reduces your disaster recovery budget.
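The classification scheme can be sketched as a small decision matrix. This is only an illustration: the server names and their assigned tiers below are hypothetical, and in practice the tier of each server comes from the Business Impact Analysis and the agreement between IT and business management.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four classifications, ordered by recovery priority."""
    CRITICAL = 1
    ESSENTIAL = 2
    NECESSARY = 3
    OPTIONAL = 4


@dataclass(frozen=True)
class Server:
    name: str   # hypothetical server name
    tier: Tier  # classification agreed by IT and business management


def recovery_lists(servers):
    """Split servers into the A-List, the B-List, and those excluded."""
    ordered = sorted(servers, key=lambda s: (s.tier, s.name))
    a_list = [s.name for s in ordered if s.tier in (Tier.CRITICAL, Tier.ESSENTIAL)]
    b_list = [s.name for s in ordered if s.tier is Tier.NECESSARY]
    excluded = [s.name for s in ordered if s.tier is Tier.OPTIONAL]
    return a_list, b_list, excluded


# Hypothetical inventory, for illustration only.
inventory = [
    Server("intranet", Tier.OPTIONAL),
    Server("order-entry", Tier.CRITICAL),
    Server("forecasting", Tier.NECESSARY),
    Server("warehouse", Tier.ESSENTIAL),
    Server("test-erp", Tier.OPTIONAL),
]

a_list, b_list, excluded = recovery_lists(inventory)
print("A-List:", a_list)
print("B-List:", b_list)
print("Excluded:", excluded)
```

Keeping the tier as an explicit field on each server makes the twice-yearly review a matter of updating data rather than rewriting the plan.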

The Big Picture

When compiling the list of mission critical applications, you must also consider application interdependencies. First, many software solutions are considered modular in design, but the software must be 100 percent intact, in other words fully restored, to function correctly. You cannot break the applications apart from the supporting infrastructure of the server. You may choose not to use specific business functions, but the entire solution must be rebuilt 100 percent to function normally.
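One way to surface interdependencies is to walk a dependency map outward from the servers the Business Impact Analysis marked as critical. The sketch below assumes such a map exists. The `depends_on` dictionary and all server names are hypothetical, and the map itself would have to come from IT's knowledge of the applications.

```python
def promote_dependencies(critical, depends_on):
    """Return every server the critical set needs, directly or indirectly."""
    required = set()
    stack = list(critical)
    while stack:
        server = stack.pop()
        if server in required:
            continue  # already visited; also protects against cycles
        required.add(server)
        stack.extend(depends_on.get(server, ()))
    return required


# Hypothetical map: each server lists the servers it needs in order to work.
depends_on = {
    "order-entry": ["pricing", "inventory-db"],
    "pricing": ["edi-gateway"],
    "reporting": ["inventory-db"],
}

bia_critical = {"order-entry"}
must_restore = promote_dependencies(bia_critical, depends_on)

# Servers the BIA did not flag, but which the critical set cannot run without.
promoted = sorted(must_restore - bia_critical)
print(promoted)
```

Here the hypothetical `reporting` server is left out because nothing critical depends on it, while `edi-gateway` is pulled in even though it is two steps removed from `order-entry`.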

Second, consider the flow of information. Follow the flow of a transaction from initial order through to product delivery. You may find that a server not considered critical by the Business Impact Analysis does indeed have a significant role in feeding information back to another mission critical application. Therefore, IT input is needed in addition to the defined business needs.

Most servers are generally restored in their entirety, including every user library saved on the system. The question is, are you restoring too much? Omitting non-critical libraries can save hours, which translates to the business coming online more quickly in a disaster. The libraries and user directories that could be omitted include:

  • Performance data
  • Audit journals
  • Test libraries
  • ERP walk-through libraries
  • Online education
  • Developer libraries
  • User test environments
  • Data archives
  • EDI successful transmission objects
  • Trial software
  • Temporary product work directories
  • Auxiliary Storage Pools (ASPs)
  • Independent Auxiliary Storage Pools (IASPs)
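
The effect of omitting non-critical libraries can be estimated before any disaster occurs. The following is a minimal sketch: the library names, category labels, and sizes are all hypothetical, and a real plan would take them from the actual backup catalog.

```python
# Hypothetical category labels for libraries that can be left out of a restore.
OMIT_CATEGORIES = {
    "performance-data", "audit-journal", "test", "education",
    "developer", "archive", "trial", "temp",
}


def restore_plan(libraries):
    """Split (name, category, size_gb) entries into restore and skip lists."""
    restore = [lib for lib in libraries if lib[1] not in OMIT_CATEGORIES]
    skip = [lib for lib in libraries if lib[1] in OMIT_CATEGORIES]
    return restore, skip


# Hypothetical backup catalog, for illustration only.
libraries = [
    ("ORDERS", "production", 400),
    ("PERFDATA", "performance-data", 120),
    ("DEVLIB", "developer", 60),
    ("CUSTMAST", "production", 250),
    ("ARCH2009", "archive", 300),
]

restore, skip = restore_plan(libraries)
total_gb = sum(size for _, _, size in libraries)
skipped_gb = sum(size for _, _, size in skip)

print("Restore:", [name for name, _, _ in restore])
print(f"Skipping {skipped_gb} GB of {total_gb} GB")
```

How much time the skipped volume actually saves depends on the restore throughput of the recovery hardware, which is best measured during a recovery test.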

Required Hardware for Your Disaster Recovery Plan

In the development of every disaster recovery plan, you must determine the minimum hardware requirements for your mission critical servers. Some IT professionals will say that any equipment is better than none in a disaster, but this statement, while true, should not be accepted at face value. The reality is that only mission-critical applications absolutely need to be restored in a disaster, not everything. However, you will need to ask whether your business will accept running the “Mission Critical” business functions at, say, 50 percent less capacity or throughput. In most cases, the answer will be no – totally unacceptable.

In the Business Impact Analysis you identified the financial impacts for your organization of being down for an extended period of time. Running your business at half speed will only further cripple your long term business capabilities and will not ensure customer satisfaction. Reduce the disaster recovery footprint by eliminating non-essential applications rather than providing lower processing capabilities. Invest your disaster recovery budget wisely by supporting your business requirements in a disaster, and that means getting the right hardware. The last thing you want is your sales order desk telling customers to be patient because you can only process half the orders right now due to a disaster.

The Human Element

What if you declared a disaster and your staff did not show? Your servers can’t recover themselves. Many companies have plans that address their equipment requirements and recovery processes but underestimate the number of staff required to successfully execute the plan. Equipment only works if somebody is able to operate it. During Gulf Coast hurricanes, key personnel have been displaced or unavailable due to health risks or personal priorities. When regional disasters hit, transportation within the area can be difficult, and your staff may be unable to reach their assigned locations. Equipment may be available, but it will be ineffective if your staff cannot reach the recovery site.

What level of expertise do your employees possess when they finally do reach the recovery site? Too many companies, especially those that perform recovery tests with no more than their data center staff, count on IT heroics to pull them out of a crisis. Expecting IT to perform a miracle in an outage is hard on your staff, and it is avoidable today, when full recovery tests can be performed without impacting your production users. When your disaster recovery plan includes cross-departmental staffing, it is important to have detailed and precise documentation. Companies should create recovery documentation so that anyone in the business, from the shipping manager to the CFO, can start a recovery.

In a well-tested plan, an employee from another department should be able to start the recovery in the event employees from your IT staff are not available. You cannot know in advance whether all your key personnel will be able to assist with the recovery. After identifying your critical equipment, it is a good idea to test your disaster recovery plan with a subgroup of assigned individuals while leaving the remainder of the team to run normal business operations. The success or failure of that test will be a good indicator of your corporate readiness.

A Good Plan Assists Recovery

When the servers are down, your disaster recovery plan will determine the precise server recovery strategy and recovery priorities. So, lock the doors to keep the stampeding herd of users away, and start recovering the business as stated in the plan. Step through the tasks and follow the precise order of server recovery by predetermined importance criteria, rather than listening to whoever screams the loudest.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.

2 Comments

  1. Gregg Jacobsen, CBCP

    We do BIAs to assess the impacts of business operations interruption so we can establish recovery requirements and, to some extent, priorities. But until an actual event occurs, no one can know what the actual impacts are. Damage assessment views everything in terms of impacts to business operations but with an "all things being equal" perspective. At time of disaster, equalities fall away and the "real" priorities become clearer. BIA data is not just for setting RTOs and RPOs for developing availability and storage requirements. Good BIA data necessarily establishes which business operations have peak demands and when they occur. Peaks may be calendar-driven, like payroll (e.g., a run every other Thursday night). Other peaks may be event-driven, like a new product roll-out or a transition cutover from one ERP system to another. So, the DR plan needs to include a "recovery action planning" process in which business operations leadership (CEO, COO, CFO, and others) quickly meets and assesses whether any particular operations are more critically exposed than normal. If not, the established DR plan priorities can be followed. But if some specific impact to ongoing operations threatens a far more dire downside risk, a recovery action plan should be developed to move supporting systems' recoveries to the front of the line. And no, this does not mean ignoring the basic infrastructure recoveries that must come first. One last point: any application/system that can put the enterprise at risk of serious loss of revenue stream, market share, stock price, or other such measures of impact should already have hot failover in place, a "hands off" solution that bypasses the declaration process. Such recoveries need only be described or referenced in the DR plan, since they are self-executing.