How to Create a Reliable DR Strategy: Best Practices

Modern IT platforms are designed to handle more users than ever, but what happens when these systems become the primary access point for most, if not all, users? What happens when a critical system experiences a fault or goes down entirely?

Two years ago, a survey by the Disaster Recovery Preparedness Council found that only 27 percent of companies received a passing grade for disaster readiness. The more we rely on data centers, the more costly data center outages become. A recent study by the Ponemon Institute and Emerson Network Power found that:

The cost of downtime has increased 38 percent since 2010.
Downtime costs for the most data center-dependent businesses are rising faster than average.
Maximum downtime costs increased 32 percent since 2013 and 81 percent since 2010.
Maximum downtime costs for 2016 are $2,409,991.
UPS system failure continues to be the number one cause of unplanned data center outages, accounting for one-quarter of all such events.
Cybercrime represents the fastest growing cause of data center outages, rising from 2 percent of outages in 2010 to 18 percent in 2013 to 22 percent in the latest study.
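
To relate figures like these to your own environment, a rough per-outage estimate can help. The Python sketch below is only an illustration, not the method used in the study cited above: the function name, its parameters, and the sample numbers are all hypothetical, and it counts only lost revenue and lost staff productivity.

```python
def estimate_downtime_cost(hours, revenue_per_hour, affected_staff,
                           loaded_hourly_rate, productivity_loss=1.0):
    """Rough outage cost: lost revenue plus lost staff productivity.

    All inputs are assumptions you supply; productivity_loss is the
    fraction (0.0 to 1.0) of staff output lost while systems are down.
    """
    if not 0.0 <= productivity_loss <= 1.0:
        raise ValueError("productivity_loss must be between 0.0 and 1.0")
    lost_revenue = hours * revenue_per_hour
    lost_productivity = (hours * affected_staff * loaded_hourly_rate
                         * productivity_loss)
    return lost_revenue + lost_productivity


# Hypothetical inputs: compare a one-hour outage with a full day.
one_hour = estimate_downtime_cost(1, 10_000, 50, 40, productivity_loss=0.5)
one_day = estimate_downtime_cost(24, 10_000, 50, 40, productivity_loss=0.5)
print(f"1 hour: ${one_hour:,.0f}   24 hours: ${one_day:,.0f}")
```

Real outages also carry costs this leaves out, such as recovery work and contractual penalties, so the result is best treated as a lower bound.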

With this in mind, what is your DR strategy? Are you ready for an emergency?

DR Sizing and Planning

Since every environment is unique, disaster recovery capacity planning may take different shapes and forms depending on the goals of the organization. However, the following five metrics are a good starting point:

User requirements. By establishing user count and future growth, you’ll be able to tell how much storage, RAM, and CPU resources are necessary. In DR planning, this number helps you align resources to the number of users you must support to remain operational.
Apps, desktops, workloads, and user resources. By knowing the workload type, you can size and plan more effectively. For DR purposes, what will keep your users most productive? Is it a virtual application? Or is it a full desktop? Maybe it’s a cloud-based DR solution offering Office 365. Know your workload in a DR scenario and how it will be delivered to your users.
WAN link considerations. Bandwidth must be considered in designing a DR environment, and building in redundancy is critical. Do you have multiple Ethernet services? Are you ready for a primary link to fail? Make sure your plan covers that scenario as well.
Planning around the data center. There are many kinds of data center technologies to work with. For the most part, data centers provided as a service are very flexible, offering management options for almost all levels. Resources must be managed and distributed appropriately – otherwise, an organization may be wasting money on misallocated workloads. In a DR scenario, where is the data center located? Do you own it? Make sure your secondary site is well-planned and ready for an emergency.
Content and resource delivery methodologies. How is the workload delivered to the end user by a DR environment? What speeds are optimal? Where will certain types of content be rendered? Do we need to make adjustments to compensate for latency? Does the user have easy access to the app or resource? These points must be worked out for a solid DR plan.
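
The user-count and bandwidth points above can be turned into a first-pass sizing estimate. The following Python sketch is a simple illustration under stated assumptions: the `UserProfile` fields, the `size_dr_site` function, and every number in the example are hypothetical placeholders, and real sizing should use per-user figures measured in your own environment.

```python
import math
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Hypothetical per-user needs; replace with values you have measured."""
    ram_gb: float
    vcpu: float
    storage_gb: float
    bandwidth_mbps: float


def size_dr_site(current_users, annual_growth, years, dr_fraction, profile):
    """Estimate resources for the users the DR site must keep operational.

    dr_fraction is the share of projected users who must stay productive
    during a DR event (1.0 means everyone).
    """
    projected_users = current_users * (1 + annual_growth) ** years
    # Round before ceil so floating-point noise does not add a phantom user.
    dr_users = math.ceil(round(projected_users * dr_fraction, 6))
    return {
        "dr_users": dr_users,
        "ram_gb": dr_users * profile.ram_gb,
        "vcpu": dr_users * profile.vcpu,
        "storage_gb": dr_users * profile.storage_gb,
        "wan_mbps": dr_users * profile.bandwidth_mbps,
    }


# Hypothetical example: 1,000 users, 10% yearly growth, 2-year horizon,
# half of the users must remain operational at the secondary site.
profile = UserProfile(ram_gb=4, vcpu=0.5, storage_gb=20, bandwidth_mbps=1.5)
for resource, amount in size_dr_site(1000, 0.10, 2, 0.5, profile).items():
    print(f"{resource}: {amount}")
```

The WAN figure assumes every DR user is active at once, which is deliberately pessimistic. It also says nothing about redundancy, so a second link still has to be planned separately.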

DR Documentation

With DR planning comes the important task of documentation. The reality is that this step is often either forgotten or put off until the last minute. Poor documentation can lead to a very bad DR experience. Administrators must not only create current distributed environment documentation, but they must also create what is known as a “living DR workbook.”

Consider the following when working on a DR plan and documentation:

This workbook is a truly all-encompassing document, which will evolve as the environment changes.
The document will reflect each IT team and their direct responsibilities should an event occur.
This document will also spell out different scenarios for different departments.
There will be remediation steps for each team, and each responsible person will have an assigned task when an outage or pre-designated event occurs.
Managers must continuously present this workbook to their staff and ensure that they understand their roles and functions should an event happen.

And don’t let these documents get stale. Update them and ensure that DR plans are set in place and kept fresh.
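
One way to keep a living workbook from going stale is to record a last-reviewed date for each section and flag anything past a review window. The Python sketch below assumes a hypothetical workbook index; the section names, owners, dates, and the 90-day default are illustrative choices, not a recommendation from any standard.

```python
from datetime import date, timedelta


def find_stale_sections(sections, today, max_age_days=90):
    """Return names of workbook sections not reviewed within max_age_days.

    sections maps a section name to (owner, last_reviewed_date).
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, (_owner, last_reviewed) in sections.items()
        if last_reviewed < cutoff
    )


# Hypothetical workbook index; names, owners, and dates are made up.
workbook = {
    "Network failover": ("Network team", date(2024, 1, 10)),
    "Storage recovery": ("Storage team", date(2024, 5, 1)),
    "Business contacts": ("Operations manager", date(2023, 11, 20)),
}
for name in find_stale_sections(workbook, today=date(2024, 6, 1)):
    owner = workbook[name][0]
    print(f"Review needed: {name} (owner: {owner})")
```

Passing `today` in explicitly, rather than reading the system clock inside the function, keeps the check repeatable when it is run as part of a scheduled review.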

DR Testing, Maintenance, and Best Practices

What good is a robust DR plan if no one knows what to do when a disaster actually happens? A DR environment is only useful if the right people can make good decisions based on a planned-out directive.

All IT team staff and key business personnel must be trained in DR event management. Should an actual disaster occur, all key people involved, business or IT, must know the course of action to be taken. This will include alerting, immediate remediation, and damage control.

The only way a DR plan stays relevant is if there is continuous training happening at all levels.

This includes the business layer. Today’s businesses are heavily reliant on their IT infrastructure, which means business stakeholders must have a say and action items in the living DR plan.

DR environments must be tested and verified to be optimally functional. These tests can happen during off hours or through a mirrored offsite environment. There are numerous testing options, and the best one will be dependent on the needs of the IT team.

You don’t have to pull the plug on a data center to make sure things are working. Consider the following testing recommendations to validate DR environments:

Create shadow users. There are powerful tools that can help create very robust DR strategies. For example, LoginVSI allows organizations to shadow users to mimic impacts on an environment, system, application, and even the business. Using these kinds of tools can help you understand threshold planning, see how users interact with an environment, and even test out a secondary site without actually having to fail over live users.
Leverage virtualization. Load-balancing technologies and failover systems have come a long way. For example, Citrix’s NetScaler and the F5 ADC each have powerful global load-balancing capabilities. They can also be deployed as virtual appliances. You can test failover by confirming that load balancing is working and that users are seamlessly transferred to a secondary environment.
Use infrastructure intelligence to test DR. Physical systems can help with DR testing as well. Multi-pathing features allow you to fail over entire network components. You can ensure that critical systems stay live by testing critical networking components without having to take your systems down.
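
The idea of testing failover without taking anything down can be illustrated with a small simulation. The Python sketch below is a toy model of priority-based failover, not how NetScaler, F5, or any other product implements global load balancing, and it calls no vendor API. The function names and site names are hypothetical.

```python
def route_users(sites_by_priority, is_healthy):
    """Return the first healthy site in priority order, or None."""
    for site in sites_by_priority:
        if is_healthy(site):
            return site
    return None


def run_failover_drill(sites_by_priority, simulated_failures):
    """Simulate outages and report where users would be sent.

    Nothing is taken offline: health is faked from simulated_failures.
    """
    def is_healthy(site):
        return site not in simulated_failures

    return route_users(sites_by_priority, is_healthy)


# Hypothetical site names, listed in priority order.
sites = ["primary-dc", "secondary-dc"]
scenarios = [set(), {"primary-dc"}, {"primary-dc", "secondary-dc"}]
for failures in scenarios:
    target = run_failover_drill(sites, failures)
    print(f"down={sorted(failures)} -> users routed to {target}")
```

In a real drill, the faked health check would be replaced by whatever probe your load balancer or monitoring system actually exposes. The case where every site is down is worth including, since it shows what users see when there is nowhere left to fail over to.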

Remember, a DR strategy is absolutely critical for your business. With a sound one in place, you’ll be able to get back up and running quickly if something happens. Just think of how much it costs your business to be down for an hour, or a whole day. These strategies are critical to keeping a business agile and resilient. Make sure to plan, test, document, and maintain your entire DR strategy.
