Richard Dolewski is a certified systems integration specialist and disaster recovery planner. He is Chief Technology Officer and Vice President of Business Continuity Services for WTS. His recent book, System i – Disaster Recovery Planning, is available at amazon.com.
When it comes to backup and recovery, only regular testing will ensure success.
Business success is directly tied to the availability of our computer systems, and the common goal of companies today is to tolerate no downtime. Business disasters can happen at any time, anywhere. They don’t have to be on the scale of a major disaster, like a flood or earthquake, to cause serious damage to a business; any disaster can negatively affect the bottom-line profits of any organization.
To minimize your risk, the first thing you need is a solid disaster recovery plan (DRP). A DRP is a set of processes developed for your company outlining the actions for your IT staff to take to quickly resume operations in the event of a major service interruption or outage. By establishing a firm list of activities to be followed, your organization can minimize potential losses incurred by downtime. But once you’ve developed and implemented a DRP, your job is just beginning; you must test it — not just once, but regularly — to reflect the changing dynamics of your computing environment.
It Can Happen to You
For years, I have had the opportunity to study disasters firsthand. The one similarity that all those events shared was that none of the victims believed a disaster would happen to them. As a disaster recovery planner, I sometimes see organizations take all of the necessary steps to anticipate and plan for disaster, then place the plan on the shelf or the network drive and forget about it. Developing a plan does not guarantee success because planning itself is not the total solution — you must exercise your DRP to ensure its readiness and integrity and to train your staff.
Disaster recovery testing is an essential part of developing an effective disaster recovery strategy for iSeries or other systems environments. In most cases, however, these tests are not truly predictive of the actual response needed in a real disaster scenario. An actual disaster can reveal where a company’s disaster recovery plan may fall short and can assist in finding ways to better prepare for possible future events.
To be truly ready for disaster, you need to experience simulated disasters and evaluate the effectiveness of your current procedures in meeting the disaster challenge. Disaster testing is more than just running through the motions — it requires postmortem analysis of every test to identify where the plan has failed. The failure might not be due to a bad plan; it could be the result of changing business conditions or the performance of an outside organization, such as a backup data center. Testing is thus a cycle of exercise, evaluation, and remediation. Learn what’s important in a disaster recovery test and how to make such tests effective, and you’ll be well positioned to ensure your organization’s survival in a calamity.
Practice Just Like the Pros
How many professional sports teams do you see taking the field without any preparation? Even the most talented teams do not assume “we have the skills, so practicing would be a complete waste of time.”
Testing has several objectives:
- Ensure the accuracy, completeness, and validity of recovery procedures
- Verify the capabilities of the personnel executing the recovery procedures
- Validate the information stored in the disaster recovery plan
- Verify that the time estimates for recovery are realistic
- Ensure that all changes in the computing environment are reflected in the disaster recovery plan
- Familiarize IT personnel with the disaster recovery plan and its procedures
- Verify that outside agencies, such as backup data centers, perform adequately
- Discover business conditions that require changes to the plan
Different parts of the team need to practice working together to improve performance, determine what works, and most importantly, plan for the unexpected. In the event of a systems failure, your DRP requires flawless execution and teamwork. Your disaster recovery team needs to practice; in IT terms, it needs to test. You may have the best-written plan money can buy, and the best personnel, but the whole reason you put the plan in place is to prepare for the unexpected. You cannot assume that everything will run smoothly in the event of a disaster. Your staff needs to know in advance what actions to take and how to execute them.
As a continuous process, DRP testing ranges from simple reviews of the test plan document to detailed exercises of your company’s ability to restore your computing environments quickly, either locally or at alternate facilities. It’s not sufficient to conduct only a single technical test once a year. You should incorporate a variety of tests, staggered throughout the year, that together exercise all components of the plan.
Furthermore, you should incorporate the element of surprise into some of these tests; disasters vary in the amount of warning they give you before they actually occur. Some disaster scenarios, such as those that might occur during a data center move, offer a substantial amount of warning. Others, such as power outages or employee sabotage, can occur with no warning at all. Blizzards and hurricanes offer some warning, but their exact impact is hard to predict. All disasters offer some element of surprise; therefore, your recovery testing should do the same.
I recommend two categories of testing — active and passive — to ensure the accuracy of the plan.
Active testing requires that the procedures under review be executed exactly as written. You should test the procedure for declaring a disaster with your hot-site vendor, test the ability of your off-site tape storage provider to deliver to the hot site in a timely manner, and test your method for restoring your systems. Each step must be executed completely and the data tested thoroughly by end-user departments to validate recovery. A wide variety of active tests should be performed, including:
- a full technical test of restoration of production application systems on the iSeries and other mission critical hardware
- a technical test of LAN and WAN, including any existing WAN failover mechanisms
- a test of the high availability solution that switches your users to the alternate facility, then checks the validity of the data
A technical test demonstrates your ability to move processing into the recovery facility within the required time. Planning for the test should proceed as follows:
1. At least 60 days in advance, schedule the test with your hot-site provider. Notify plan participants of your selected date and time.
2. Meet with your IT recovery team to establish test objectives 30 days before the test date. This will determine the participants’ requirements for the test and let you develop a suitable test schedule.
3. One week before the test, publish the test plan to participants and confirm your test date.
4. Initiate the transfer of tapes from the off-site tape storage office to the recovery services facility.
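The lead times in the steps above can be turned into calendar deadlines once a test date is chosen. The following is a minimal sketch, assuming a hypothetical helper named `planning_milestones`; the names are illustrative and not part of any DRP tool. The tape transfer in step 4 is left out because no lead time is stated for it.

```python
from datetime import date, timedelta

# Lead times (in days before the test) taken from planning steps 1-3.
# Step 4 (tape transfer) has no stated lead time, so it is not modeled.
LEAD_TIMES = [
    ("Schedule test with hot-site provider and notify participants", 60),
    ("Meet with IT recovery team to establish test objectives", 30),
    ("Publish test plan to participants and confirm test date", 7),
]


def planning_milestones(test_date: date) -> list[tuple[date, str]]:
    """Return (latest date, task) pairs in the order the tasks come due."""
    return [(test_date - timedelta(days=days), task) for task, days in LEAD_TIMES]


if __name__ == "__main__":
    # Hypothetical test date, chosen only for illustration.
    for due, task in planning_milestones(date(2025, 10, 18)):
        print(f"{due.isoformat()}  {task}")
```

Treating each date as a "no later than" deadline keeps the sketch consistent with the "at least 60 days in advance" wording of step 1.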
The role of the plan manager during a technical test is to:
- ensure that each objective is fully realized
- ensure that each test participant follows the procedures from the DRP as precisely as possible
- document changes necessary to make the DRP procedures work
- record problems and their resolutions as they arise
- record the duration of each procedure
- summarize all the changes to the DRP
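One way to keep the plan manager's record of durations and problems is a small structured log that can be compared against the time estimates in the DRP. This is a sketch only: the `ProcedureResult` class, its field names, and the `overruns` helper are assumptions made for illustration, and any sample values you feed it would come from your own test.

```python
from dataclasses import dataclass, field


@dataclass
class ProcedureResult:
    name: str
    estimated_minutes: int  # time estimate written in the DRP
    actual_minutes: int     # duration recorded during the test
    problems: list[str] = field(default_factory=list)  # problems and resolutions

    @property
    def overrun_minutes(self) -> int:
        """Minutes by which the procedure exceeded its estimate (0 if on time)."""
        return max(0, self.actual_minutes - self.estimated_minutes)


def overruns(results: list[ProcedureResult]) -> list[ProcedureResult]:
    """Procedures that ran longer than the DRP estimate, worst first."""
    late = [r for r in results if r.overrun_minutes > 0]
    return sorted(late, key=lambda r: r.overrun_minutes, reverse=True)
```

The list returned by `overruns` is a natural starting point for the postmortem: each entry is a procedure whose time estimate in the DRP may need to be revised.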
These exercises will help change senior management’s perception and maybe yours. Many times, testing will reveal nontechnical issues. We in the IT industry are generally technically sound in our work, but the “procedural stuff” will bite us. A common problem I find is that management is unable to declare a disaster properly because they are unfamiliar with crucial procedures. Testing creates a safe “make believe” situation that is free of embarrassment. Everyone can demonstrate their abilities and understand the relative importance of these procedures without suffering damage or great costs.
Making a management commitment to regularly testing, validating, and refreshing your DRP can protect your company against the greatest risk of all: complacency. Today’s computing environments face rapid business and technological changes; the smallest alteration to a critical application or system can cause an unanticipated failure that you might not be able to recover from if you do not test.
Passive testing does not exercise the procedures or actions of the plan. It is a walk-through of the procedures, typically with the members of the IT recovery team jointly reading and reviewing the procedures, literally page by page. A dry run of a procedure will verify the completeness of the steps in the procedure.
Twice a year, the disaster recovery owner should outline the testing objectives and develop the test plan for computer systems recovery. This test should include a combination of active and passive tests of the DRP. The test plan shows the planned tests with their timing, duration, staff resource requirements, and explanatory comments.
As a general guideline, IT should conduct passive and active tests as follows:
- At least once a year, the IT technical recovery teams will conduct an incident-based walk-through of the DRP. This test will verify that the plan is consistent with team members’ expectations and that it can work regardless of the type of disaster.
- Twice a year, the technical recovery teams will need to test their ability to recover at a hot site or redundant data center site with an active technical test. In the second test, they should test the recovery of the network infrastructure.
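The test plan and the frequency guideline above (at least one walk-through and two active technical tests per year) can be captured in a simple data structure and checked automatically. The sketch below assumes a hypothetical `PlannedTest` record whose fields mirror the items a test plan shows (timing, duration, staff, comments); it is one possible layout, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PlannedTest:
    kind: str            # "passive" (walk-through) or "active" (technical test)
    when: date           # planned timing
    duration_hours: int  # planned duration
    staff: list[str]     # staff resource requirements
    comments: str = ""   # explanatory comments


def meets_guideline(plan: list[PlannedTest], year: int) -> bool:
    """True if the year holds at least one passive and two active tests."""
    in_year = [t for t in plan if t.when.year == year]
    passive = sum(t.kind == "passive" for t in in_year)
    active = sum(t.kind == "active" for t in in_year)
    return passive >= 1 and active >= 2
```

A check like this only confirms that the tests are on the calendar. Whether the second active test actually covers network infrastructure recovery still has to be verified by reading the plan's comments and objectives.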
As the disaster recovery owner, you are the chairperson for a controlled walk-through of DRP procedures, and one of your responsibilities is developing a realistic scenario that will be used during the exercise.
You must both communicate this scenario to the team and formally document it. Typically, you achieve this via a series of handouts for the participants. The first handout describes the disaster, its timing, and its impact. As the walk-through progresses, additional handouts describe how the events of the disaster have progressed. For example, in a fire scenario, the first handout would be quite vague on the extent of the damage. The second handout would offer more information outlining the extent of the damage, and the third handout could introduce a complicating factor; for example, the fire department might have concerns for the safety of the physical structure of the building and prohibit you from accessing the computer room for an extended period.
You conduct passive testing with all IT recovery team participants, but you may wish to invite additional primary participants, depending on the scenario and the components of the plan to be tested. Participants should bring their current copy of the DRP with them. Each participant is assigned a specific role (or set of roles) to play in the disaster scenario, usually the one he or she would play in a real disaster. Sometimes, you may wish to have participants switch roles for cross-training purposes. Remember to test everything, even the obvious: have staff members listed in the DRP validate their contact numbers; call vendors after normal business hours to ensure that their hotline and service numbers are correct and staffed; and execute the notification, escalation, and assembly tasks on a non-business day.
Rules for the walk-through are simple:
1. Using only the DRP and the formal scenario descriptions, decide which tasks to execute.
2. Have the team leader and team members verbalize how they would execute the procedures using the scenario and the required information.
3. After the discussion, jointly approve or modify each task and procedure.
Sometimes the scenario does not unfold as you expect, so as the walk-through progresses, you may find it necessary to make some changes and clarifications to achieve the walk-through objectives. Hold a post-walk-through debriefing to recap the action points noted during the exercise, and make sure each participant leaves with a copy of the list of action points. The summary report should contain:
- the objectives of the walk-through
- a list of the participants
- a scenario summary
- the scenario definition handouts
- a summary of changes to the computer contingency plan, with a plan and schedule for their completion
Be a Survivor
It’s true that disasters — even simulated ones — don’t happen often. However, it is also true that without DRP testing, you will never know whether your plan will work when “the big one” hits. Companies have suffered and survived disasters, but only when they have properly tested. Backup and recovery can be a good experience if you plan, and more importantly, if you test.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.