Your Amazon Cloud Just Went Down. Now What?
September 16th, 2013 By: Bill Kleyman
With more organizations leveraging the power of the cloud, it’s important to understand the infrastructure that supports it all. With more users, growing IT consumerization, and far more data, the infrastructure of tomorrow must be as resilient as possible.
Organizations are seeing the benefits of working with a cloud provider like Amazon: data distribution, multiple access points, data that sits closer to users, fog computing deployments, lower hardware expense, and far more workload flexibility.
However, there is something important to understand about the cloud. Nothing in IT is ever perfect. This includes the cloud. Think your cloud won’t go down? Think you’ve got resiliency built in? You should review some of the very recent outages:
- April 2011 – Major Amazon Outage Ripples Across Web
- June 2012 – Power Outage Affects Amazon Customers
- October 2012 – Software Bug, Cascading Failures Caused Amazon Outage
- December 2012 – Amazon Cloud Back Online After Major Christmas Outage
- September 13, 2013 – Network Issues Cause Amazon Cloud Outage
When deploying a cloud environment, an Amazon cloud in this case, you must plan around disaster recovery, business continuity, and infrastructure failures. Remember, whether a networking component fails or there is a complete power failure, a cloud outage is a cloud outage. Regardless of the circumstances, you will lose access to data, workloads, and core resources. As a result, your users are negatively affected, and your business is financially impacted.
What do you do if your cloud infrastructure just went down? Now what? Here’s a look at key pieces of your failover and recovery plan:
- Develop a contingency plan. How will you know what to recover if you don’t know what’s running in your cloud? I don’t mean VMs or applications – I’m referring to the entire infrastructure. You can have the best environment in place, but if you have little to no visibility into dependencies, data access, advanced networking, and data control, you’ll have a very hard time recovering your cloud should an event occur. When your cloud infrastructure is a critical part of your business, you must develop a Business Impact Analysis and/or a Disaster Recovery/Business Continuity document. This plan not only outlines your core systems, it also details infrastructure dependencies, core components, and what actually needs to be recovered. Remember, not everything in your cloud is critical. To shorten recovery time and recover efficiently, it’s crucial to know which systems hold priority. Without this document and a recovery plan, working out what to recover and in what order can really slow down the process.
- Understand WAN traffic and optimization. As cloud platforms become more distributed, there will be a direct need for data to retain its integrity and arrive at its destination with minimal latency. Organizations have spent millions of dollars on physical hardware components to help with the optimization task. Now, organizations are able to create entire virtual network architectures where traffic is handled at the virtual routing layer. WAN traffic control and WAN optimization (WANOP) play a big role in the cloud DR industry. Products from vendors like Silver Peak and Riverbed are designed to create highly efficient network paths across vast distances. Furthermore, these appliances can be deployed at the virtual layer. Controlling cloud-based WAN traffic not only helps with optimization, it also helps redirect users should there be an ISP or other networking-related outage.
- Use dynamic load balancing (at the cloud layer). Load balancing has come a long way from just directing traffic to the most appropriate resource. Now, with network virtualization, new types of load balancers are helping not only with traffic control, but with cloud-based Disaster Recovery (DR) and High Availability (HA) as well. Features like NetScaler’s Global Server Load Balancing (GSLB) and F5’s Global Traffic Manager (GTM) not only direct users to the appropriate data center based on their location and IP address, they can assist with a disaster recovery plan as well. In a globally load-balanced environment, users can be pushed to a recovery data center completely transparently. A virtual cross-WAN heartbeat checks the availability of each data center and pushes users to the next available resource. Furthermore, policies can be set around latency, network outages, and even network load. How can this help? Not only are you able to direct users across the Internet to the most available resource – you can recover your cloud from a completely different data center if needed.
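The global load-balancing behavior described above can be illustrated with a small sketch. This is not how NetScaler GSLB or F5 GTM are implemented or configured; it is a simplified Python model of the decision they make. The `Site` fields, the `pick_site` function, the thresholds, and the data center names are all hypothetical, and the heartbeat results are supplied as plain data rather than measured over a network.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Site:
    name: str          # hypothetical data center label
    healthy: bool      # result of the most recent heartbeat check
    latency_ms: float  # latency measured from the user's region
    load: float        # fraction of capacity in use, 0.0 to 1.0


def pick_site(sites: Iterable[Site],
              max_latency_ms: float = 250.0,
              max_load: float = 0.9) -> Optional[Site]:
    """Return the site a user should be sent to, or None if all are down."""
    # Sites that failed their heartbeat are never candidates.
    alive = [s for s in sites if s.healthy]
    if not alive:
        return None
    # Prefer sites that satisfy the latency and load policies (assumed values).
    within_policy = [s for s in alive
                     if s.latency_ms <= max_latency_ms and s.load <= max_load]
    # If no site satisfies the policies, fall back to any healthy site.
    pool = within_policy or alive
    return min(pool, key=lambda s: s.latency_ms)


sites = [
    Site("primary-dc", healthy=False, latency_ms=20.0, load=0.3),
    Site("recovery-dc-1", healthy=True, latency_ms=80.0, load=0.5),
    Site("recovery-dc-2", healthy=True, latency_ms=140.0, load=0.2),
]
chosen = pick_site(sites)
print(chosen.name if chosen else "no site available")
```

With the primary site failing its heartbeat, users are sent to the closest healthy recovery site. A real deployment would typically make this decision at the DNS or traffic-management layer and refresh the health data continuously.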
Allocating power usage to align with critical business applications will increasingly become an important part of every disaster recovery plan. The ability to divert workload from a cloud-based environment to another facility during an outage will depend on your global load-balancing strategy, but it can be greatly aided by the energy management policies in place and by how well you understand your workload. You can learn more about this subject here: http://bit.ly/17LXDl2
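Understanding your workload also means knowing what to bring back first. The contingency plan item above calls for documenting dependencies and priorities; the sketch below shows one way that information could be turned into a recovery order. It assumes Python 3.9 or later (for the standard library `graphlib` module), and the system names, priority numbers, and the `recovery_order` function are hypothetical examples rather than anything from a specific DR tool.

```python
from graphlib import TopologicalSorter


def recovery_order(dependencies, priority):
    """Order systems so each is restored only after what it depends on.

    dependencies maps a system to the set of systems it needs first.
    priority maps a system to a number, where lower means more critical.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        # Each batch holds systems whose dependencies are already restored.
        # Within a batch, restore the most critical systems first.
        ready = sorted(sorter.get_ready(),
                       key=lambda name: (priority.get(name, 99), name))
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order


# Hypothetical inventory taken from a Business Impact Analysis.
dependencies = {
    "network": set(),
    "database": {"network"},
    "auth": {"network"},
    "web-app": {"database", "auth"},
    "reporting": {"database"},
}
priority = {"network": 1, "database": 1, "auth": 1, "web-app": 1, "reporting": 3}

print(recovery_order(dependencies, priority))
```

The ordering respects dependencies first and priority second, so a low-priority system such as reporting is restored only after the critical systems that became recoverable at the same time.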