Your Amazon Cloud Just Went Down. Now What?

With more organizations leveraging the power of the cloud, it’s important to understand the infrastructure that supports it all. With more users, growing IT consumerization, and a steady stream of new data, the infrastructure of tomorrow must be as resilient as possible.

Organizations are seeing the beauty of working with a cloud provider like Amazon. There are a lot of benefits too – data distribution, multiple access points, bringing the data closer to the users, fog computing deployments, lower hardware expense, and a lot more workload flexibility.

However, there is something important to understand about the cloud. Nothing in IT is ever perfect. This includes the cloud. Think your cloud won’t go down? Think you’ve got resiliency built in? You should review some of the very recent outages:

April 2011 – Major Amazon Outage Ripples Across Web
June 2012 – Power Outage Affects Amazon Customers
October 2012 – Software Bug, Cascading Failures Caused Amazon Outage
December 2012 – Amazon Cloud Back Online After Major Christmas Outage
September 13, 2013 – Network Issues Cause Amazon Cloud Outage

When deploying a cloud environment, an Amazon cloud in this case, you must plan around disaster recovery, business continuity, and infrastructure failures. Remember, whether a networking component goes out or there is a complete power failure, a cloud outage is a cloud outage. Regardless of the circumstances, you will lose access to data, workloads, and core resources. As a result, your users are negatively affected, and your business is financially impacted.

What do you do if your cloud infrastructure just went down? Now what? Here's a look at key pieces of your failover and recovery plan:

Develop a contingency plan. How will you know what to recover if you don’t know what’s running in your cloud? I don’t mean just VMs or applications – I’m referring to the entire infrastructure. You can have the best environment in place, but if you have little to no visibility into dependencies, data access, advanced networking, and data control, you’ll have a very hard time recovering your cloud should an event occur. When your cloud infrastructure is a critical part of your business, you must develop a Business Impact Analysis and/or a Disaster Recovery/Business Continuity document. This plan not only outlines your core systems, it also details infrastructure dependencies, core components, and what actually needs to be recovered. Remember, not everything in your cloud is critical. To keep recovery time short, it’s crucial to know which systems hold priority. Without this document and a recovery plan, working out what to recover and in what order during an outage can really slow down the process.
Understanding WAN traffic and optimization. As cloud platforms become more distributed, there will be a direct need for data to retain its integrity and arrive at its destination with minimal latency. Organizations have spent millions of dollars on physical hardware components to help with the optimization task. Now, organizations are able to create entire virtual network architectures where traffic is handled at the virtual routing layer. WAN traffic control and WAN optimization (WANOP) play a big role in the cloud DR industry. Products from vendors like Silver Peak and Riverbed are designed to create highly efficient network paths across vast distances. Furthermore, these appliances can be deployed at the virtual layer. Controlling cloud-based WAN traffic not only helps with optimization, but it also helps redirect users should there be an ISP or other networking-related outage.
Utilizing dynamic load-balancing (at the cloud layer). Load balancing has come a long way from just directing traffic to the most appropriate resource. Now, with network virtualization, new types of load balancers are helping not only with traffic control, but with cloud-based Disaster Recovery (DR) and High Availability (HA) as well. Features like NetScaler’s Global Server Load Balancing (GSLB) and F5’s Global Traffic Manager (GTM) not only direct users to the appropriate data center based on their location and IP address, they can assist with a disaster recovery plan as well. By setting up a globally load-balanced environment, users can be pushed to a recovery data center completely transparently. A virtual cross-WAN heartbeat checks the availability of a data center and pushes users to the next available resource. Furthermore, policies can be set around latency, network outages, and even network load. How can this help? Not only are you able to direct users across the Internet to the most available resource – you can recover your cloud from a completely different data center if needed.
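
The heartbeat-and-redirect behavior described above can be illustrated with a small sketch. It is not how NetScaler GSLB or F5 GTM are implemented or configured; the DataCenter class, the site names, and the 250 ms latency policy are all hypothetical, chosen only to show the selection logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCenter:
    name: str          # hypothetical site label
    healthy: bool      # result of the latest heartbeat check
    latency_ms: float  # latency measured from the user's location


def pick_data_center(sites: list[DataCenter],
                     max_latency_ms: float = 250.0) -> Optional[DataCenter]:
    """Return the healthy site with the lowest latency, or None if none qualifies."""
    # Policy: skip sites that failed the heartbeat or exceed the latency limit.
    candidates = [s for s in sites
                  if s.healthy and s.latency_ms <= max_latency_ms]
    return min(candidates, key=lambda s: s.latency_ms, default=None)


if __name__ == "__main__":
    sites = [
        DataCenter("primary", healthy=False, latency_ms=20.0),  # simulated outage
        DataCenter("dr-east", healthy=True, latency_ms=80.0),
        DataCenter("dr-west", healthy=True, latency_ms=140.0),
    ]
    chosen = pick_data_center(sites)
    print(chosen.name if chosen else "no site available")
```

With the primary site marked unhealthy, the user lands on the closest healthy recovery site without changing the URL they use. In a real deployment the health and latency values would come from the load balancer's own monitors.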

Data and workload replication. There is a set list of core cloud components, and data sits squarely in the middle. But how do you know which data is critical? How do you know which data has dependencies tied to it? How will you know which database to recover first if there is an emergency? The only way to truly know this information is through a disaster recovery and business continuity plan. The idea that all data must be replicated is not true. In some fairly rare situations, an organization may require that its entire data structure be replicated. However, most others only need critical data components to continue operating if there is a cloud outage. With improvements within the modern cloud and general infrastructure, your organization can replicate data cross-cloud as needed. You can even push information from your Amazon cloud to a separate cloud backup provider.
Minimizing the impact on the end-user. A cloud outage will certainly have negative impacts on your environment. However, never forget that the user is the one that feels the brunt of it. Throughout the planning and design phases of the contingency plan, you must take into consideration how the user will experience the entire emergency event. A truly robust cloud strategy would allow the user to simply refresh their screen or workload and be given access to new resources within a set amount of time. Network-layer and DNS configurations would allow users coming in from the outside to use the same access URL while being ported to a DR data center. The less the user has to experience an outage, the better. In fact, when you experience an emergency situation and the end-user doesn’t notice, you’ve probably done something right.
Internal and external disaster recovery. There can be bad situations, and then there can be really bad situations. Contingency plans need to include as many logical scenarios as possible. This means that internal cloud resiliency, or HA, must be built in. Internal recovery might include things like VMware HA or Veeam backup strategies on the workload and data layer. Or, it might include power, cooling, and server HA from a hardware perspective. You can even create switch-layer multipath algorithms which allow entire racks to stay up even if part of the network goes down. These sorts of internal recovery measures are crucial because components within a cloud can and will fail. External disaster recovery takes a look at major outages. This is where you would examine recovering to a separate data center or some type of backup solution. A good contingency plan will account for both scenarios.
Existing Amazon Cloud features will help you survive outages. The Amazon cloud is a powerful platform. With a vast grid, regional outages can be completely mitigated when utilizing existing Amazon failover features. Amazon’s Regions and Availability Zones allow customers to logically spread out their workloads to reduce the chance of a single point of failure. The Amazon cloud infrastructure is hosted in distributed locations all over the world. These sites are organized into regions and Availability Zones. Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon lets you place resources, such as instances and data, in multiple Availability Zones. Remember, in this model, resources aren't replicated across regions unless you configure them to be. There are a lot of ways to create very robust failover designs utilizing existing Amazon architecture. The following are some detailed resources revolving around Amazon design, architecture, failover, and disaster recovery:
- AWS Reference Architecture – Application Hosting, Fault Tolerance and HA, DR for Local Apps, File Synchronization, Log Analysis and more.
- AWS Reference Architecture: Fault Tolerance and HA [PDF] – Amazon EC2, Amazon EBS, Elastic Load Balancing, and Amazon S3.
- RightScale: Designing Failover Architectures on Amazon EC2 – Best practices for Elastic IPs and availability zones, failover architecture, setting up intermediate failover, and advanced failover methodologies.
- RightScale: High-Availability in the Cloud | Architectural Best Practices [PDF] – Designing a complete failover solution, using Amazon cloud best practices, deploying backup and replication, and creating true resource availability.
- AWS Webcast - High Availability with Route 53 DNS Failover – How to use DNS failover, high-availability architecture, and advanced multi-region failover designs.
- AWS Webcast - Best Practices in Architecting for the Cloud – Designing scalability, deploying applications, utilizing monitoring, applying auto-scaling and replication, and creating cloud replication.
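
As a rough illustration of spreading workloads across Availability Zones, the sketch below assigns instances to zones round-robin and checks how many would survive the loss of one zone. It makes no AWS API calls; the instance names are hypothetical and the zone names are only illustrative labels.

```python
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign instances to zones round-robin so no single zone holds everything."""
    zone_cycle = cycle(zones)
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}


def surviving_capacity(placement, failed_zone):
    """Count the instances that remain if one zone is lost."""
    return sum(1 for zone in placement.values() if zone != failed_zone)


if __name__ == "__main__":
    # Hypothetical instance names; zone labels are examples only.
    placement = spread_across_zones(
        ["web-1", "web-2", "web-3", "web-4"],
        ["us-east-1a", "us-east-1b"],
    )
    print(placement)
    print(surviving_capacity(placement, "us-east-1a"))  # 2 of 4 instances remain
```

A check like this makes the single-point-of-failure question concrete: if losing any one zone leaves too little capacity to serve users, the design needs more zones or more instances per zone.
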
DRBC testing and development. Testing and DR development must be part of both the recovery plan and the cloud environment itself. Having the plan only on paper won’t work. You actually have to conduct structured tests around your environment. This type of proactive cloud maintenance is absolutely required when the cloud infrastructure is a part of the core business model. Furthermore, documenting tests will allow other engineers to understand how the environment is functioning and what they must do if an event occurs.
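
One way to make a contingency plan testable is to record system dependencies as data and derive the recovery order from them. The sketch below is a minimal example under stated assumptions: the system names, the dependency map, and the set of critical systems are hypothetical, and the ordering uses Python's standard-library topological sort rather than any particular DR tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each system maps to the systems it depends on.
DEPENDENCIES = {
    "database": set(),
    "auth": {"database"},
    "web-app": {"database", "auth"},
    "reporting": {"database"},
}

# Hypothetical outcome of a Business Impact Analysis: what must come back first.
CRITICAL = {"database", "auth", "web-app"}


def recovery_order(dependencies, critical):
    """List critical systems so that each one follows the systems it depends on."""
    # static_order() yields dependencies before the systems that need them.
    # This assumes critical systems depend only on other critical systems.
    ordered = TopologicalSorter(dependencies).static_order()
    return [system for system in ordered if system in critical]


if __name__ == "__main__":
    print(recovery_order(DEPENDENCIES, CRITICAL))
```

A structured DR test can then walk this list, restore each system in turn, and record how long each step took, which gives other engineers a documented runbook rather than a plan that exists only on paper.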

Let’s face facts – there is no silver bullet for a failed cloud environment. There are some organizations that are willing to spend the money on a completely mirrored environment. That type of parallel infrastructure can be very pricey – but in some cases completely necessary.

The first concept to understand about cloud computing is that, just as with any other IT component, anything can and will happen to the environment. This can range from a small networking hiccup to a full-blown data center outage. Regardless of the type of outage, having a plan in place with a solid testing and training methodology can help bring an environment back online quickly.

The modern cloud has come a long way and can be deployed with a lot more resiliency and redundancy. It’s important to continuously test your cloud platform and never become complacent about your infrastructure. Current IT trends indicate that there will be growth in cloud-based data and cloud-ready workloads. This means that high uptime will be even more critical. Take the necessary time and steps to ensure that when your cloud experiences an issue – you’ll know exactly what to do next.
