How to Survive a Cloud Meltdown

One of the biggest questions following Amazon’s cloud outage last week was whether you can use the world’s biggest cloud provider and still avoid downtime when the provider has a major outage – an infrequent but recurring event. If you can, how do you do it? And if there is a way to do it, why isn’t everybody doing it?

The answer to the first question is clearly "yes." While lots of websites and other services delivered over the internet took a hit – by one analyst firm’s estimate, the outage collectively cost AWS customers among S&P 500 companies and financial services companies alone hundreds of millions of dollars – many did not.

There are several potential answers to the “how” question. The reason not everyone uses those methods appears to be mostly cost.

The ways to avoid going down along with the cloud provider are essentially different ways to build redundancy into your system. You can keep multiple copies of stored objects and virtual machines in multiple data centers located in different regions and use a database that spans multiple data centers, Adam Alexander, senior cloud architect at the cloud management firm RightScale, told Data Center Knowledge via email.

More on the incident: AWS Outage that Broke the Internet Caused by Mistyped Command

The ways to implement this include using multiple regions from the same cloud provider, using multiple cloud providers (Microsoft Azure or Google Cloud Platform in addition to AWS, for example), or using a mix of cloud services and your own data centers, either on-premises or leased from a colocation provider. You can spend the time and money to set this architecture up on your own, or you can outsource the task to one of the many service providers that help companies do exactly that.
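All of these options share one basic pattern: try the primary location first and fall back to a replica if it fails. Below is a minimal sketch of that pattern in Python. The reader functions are hypothetical stand-ins (one per region, provider, or data center), not calls from any real cloud SDK, and a production setup would also need health checks and data synchronization.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no location could serve the requested object."""


def fetch_with_failover(key: str, readers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each location in order and return the first successful read."""
    errors = []
    for read in readers:
        try:
            return read(key)
        except Exception as exc:  # any failure means: move on to the next copy
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} location(s) failed for {key!r}")


# Hypothetical readers used only for illustration.
def primary_region(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def secondary_region(key: str) -> bytes:
    return b"copy of " + key.encode()


print(fetch_with_failover("logo.png", [primary_region, secondary_region]))
```

The order of the list expresses preference: the cheapest or closest location goes first, and the more expensive standby copies are only touched when something upstream has failed.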

You can also use a caching service by a CDN provider like Cloudflare, which stores redundant copies of the data you store in Amazon’s S3 service – the cloud storage service that was the culprit in the AWS outage – in its own data centers. This in effect is outsourcing redundancy completely (at least for storage).

Companies that used S3 “behind” Cloudflare did not lose access to that data during the incident, Cloudflare CTO John Graham-Cumming said in an interview. “That’s quite a common configuration for us,” he said. “We can deal with any kind of outage like this.”
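The idea of a cache sitting in front of a storage origin can be illustrated with a small Python sketch. This is not how Cloudflare implements it; the class and the `flaky_origin` function are hypothetical, and the sketch contacts the origin on every request purely to keep the fallback logic visible. The point is only that a previously stored copy can still be served when the origin is down.

```python
class CachingFront:
    """Keeps the last good copy of each object and serves it if the origin fails."""

    def __init__(self, origin):
        self.origin = origin  # callable: key -> bytes, may raise on outage
        self.cache = {}

    def get(self, key):
        try:
            value = self.origin(key)
        except Exception:
            if key in self.cache:
                return self.cache[key]  # origin is down: serve the stored copy
            raise  # nothing cached, so the outage is visible to the caller
        self.cache[key] = value  # refresh the stored copy on every success
        return value


# Hypothetical origin that can be switched into an "outage" state.
state = {"up": True}


def flaky_origin(key):
    if not state["up"]:
        raise ConnectionError("origin storage unavailable")
    return f"contents of {key}".encode()


front = CachingFront(flaky_origin)
first = front.get("index.html")      # origin is up; copy is stored
state["up"] = False                  # simulate the storage outage
during_outage = front.get("index.html")
print(first == during_outage)
```

Objects that were never requested before the outage would still fail in this sketch, which is one reason a cache alone is not a complete redundancy plan.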

Asked why all cloud users don’t have some sort of redundant, fault-tolerant scheme in place, Graham-Cumming cited cost and complexity. “It’s not necessarily easy to do that,” he said. “There’s a financial cost to it.” One of the most attractive characteristics of cloud services is the pay-for-what-you-use model. If you have redundant VMs and multiple copies of data, your cloud bill can easily skyrocket, and that’s before you take the cost of setting up the automatic failover mechanism into account.

The extra cost of cloud resiliency includes the cost of “storing additional copies of your data in another location, maintaining standby compute resources to handle disasters, and additional in-network bandwidth required to keep the two locations in sync,” Philip Williams, principal architect at Rackspace, wrote in a blog post last week.
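To make those three cost components concrete, here is a rough back-of-the-envelope sketch in Python. Every price and quantity in it is a made-up placeholder, not a real rate from AWS or any other provider, and the function name is hypothetical.

```python
def monthly_redundancy_overhead(
    replica_storage_gb: float,
    storage_price_per_gb: float,
    standby_instances: int,
    instance_price_per_hour: float,
    sync_traffic_gb: float,
    transfer_price_per_gb: float,
    hours_per_month: float = 730.0,
) -> dict:
    """Estimate the extra monthly spend for a second, standby location."""
    storage = replica_storage_gb * storage_price_per_gb
    compute = standby_instances * instance_price_per_hour * hours_per_month
    bandwidth = sync_traffic_gb * transfer_price_per_gb
    return {
        "storage": storage,      # additional copies of the data
        "compute": compute,      # standby compute kept running for disasters
        "bandwidth": bandwidth,  # traffic needed to keep both locations in sync
        "total": storage + compute + bandwidth,
    }


# Placeholder inputs for illustration only.
estimate = monthly_redundancy_overhead(
    replica_storage_gb=1000,
    storage_price_per_gb=0.02,
    standby_instances=2,
    instance_price_per_hour=0.10,
    sync_traffic_gb=500,
    transfer_price_per_gb=0.02,
)
for item, cost in estimate.items():
    print(f"{item}: ${cost:.2f}")
```

Even with small placeholder numbers, the standby compute term dominates in this example, because it is billed for every hour of the month whether or not a disaster ever happens.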

And cost is already a big deal for companies using cloud services at any significant scale. Managing cost was the most frequently cited challenge among mature cloud users in this year’s State of the Cloud survey by RightScale. Along with security, spend is one of the two top challenges cloud users report.

While 85 percent of enterprise respondents to RightScale’s cloud survey said they had multi-cloud strategies in place, it’s unclear whether they have those strategies for resiliency or simply use different cloud services for different purposes. For example, a company could use AWS for Infrastructure-as-a-Service for testing and development and Google's Platform-as-a-Service for a production website: two different clouds being used for completely different things while technically amounting to a multi-cloud strategy.

Cloudflare’s Graham-Cumming said multi-cloud strategies for the purpose of resiliency are on the rise, however. “More and more customers are looking for a multi-cloud solution,” he said. And resiliency isn’t the only reason they’re interested. It’s partly to avoid outages, but many companies also want to avoid being locked into a single cloud provider. “That’s going to become more and more of a trend.”

Helping that trend along is the recent rise of application containers and container orchestration tools, such as Docker, Kubernetes, and Mesosphere's DC/OS, designed with the aim of making applications independent of the type of infrastructure they run on. If the promise of portable workloads is truly realized, multi-cloud strategies will be a whole lot easier to execute.

"Outages happen; that's a fact," Zac Smith, CEO of the cloud upstart Packet, said via email. "However, there's a huge silver lining in the promise of workload portability and the power of open source orchestration tools like Kubernetes and others. In short, my advice is: If you're moving to the cloud or are already there, make sure you're portable and that you control your own orchestration."

Correction: A previous version of this article incorrectly used Salesforce's cloud CRM as an example of a cloud service used by respondents to RightScale's State of the Cloud survey. The CRM is a Software-as-a-Service solution, and RightScale specifically excluded SaaS from its survey data.
