Not Every Cloud Outage is Created Equal

Andres Rodriguez is the CEO and Founder of Nasuni, a unified storage company that serves the needs of distributed enterprises. Previously he was a CTO at Hitachi Data Systems and CTO of the New York Times.

ANDRES RODRIGUEZ
Nasuni

Amazon is down again! We’ve all heard the phrase recently. Replace Amazon with Microsoft Azure or Rackspace or Google and it still seems like “the cloud is down” every few months. Outages from public cloud providers do happen, but not every outage is created equal and not every outage impacts customers in the same way.

We can compare cloud services to similar real-world services like electricity. As with electricity, cloud compute is readily available, has no real limits, and is purchased completely on demand. A power outage means you have no access to electricity, just as a cloud compute outage means you have no access to your servers. However, Amazon, Rackspace, Azure, and others offer much more than just a single service like electricity. They also provide the equivalent of water, cleaning, payroll, and a broad variety of other services. Just as water and electricity are not the same thing, compute and object storage are not the same thing, and neither are databases and block storage. When the power company has an outage, only one service – electricity – becomes unavailable. When the cloud has an outage, one or several services may become unavailable. And while all cloud services are rarely, if ever, down at the same time, an outage can still have a big impact on your business if you haven’t given the issue some attention beforehand.

What Happens During An Outage?

Microsoft Azure had the most recent black eye, thanks to its SSL certificate fiasco. In February, Azure failed to renew the SSL certificate for its “blob” storage before it expired. Did the storage crash? Nope. Did the storage layer stop working? Nope. Was data lost? No. However, because of the expired certificate, HTTPS traffic failed to reach the storage layer and, as a result, a variety of services across the Azure spectrum that rely on the storage layer were impacted.
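An expired certificate is one of the few outage causes that can be spotted well in advance. The following is a minimal Python sketch of such a check, not a description of anything Azure runs. The function names, the 30-day threshold, and the sample date in the usage note are all hypothetical; only the standard-library `ssl` and `socket` calls are real.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical alert threshold, chosen for illustration


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example "Feb 22 20:00:00 2013 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within `warn_days` (or already has)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read the notAfter field from a live endpoint's certificate.

    With the default context the handshake itself fails if the certificate
    has already expired, so this is only useful as an early warning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run on a schedule against every public endpoint, a check like `needs_renewal(fetch_not_after(host), datetime.now(timezone.utc))` would flag a certificate weeks before it lapses. The date `"Feb 22 20:00:00 2013 GMT"` in the docstring is a made-up example, not the actual expiry of any Azure certificate.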

In December 2012, Amazon had a public outage in its U.S. East region when its Elastic Load Balancers (ELB) failed to scale up and down efficiently. The cause was an errant maintenance delete by one of the developers. In this case, however, the storage systems were completely unaffected. In fact, many customers using ELB were unaffected; only those who were actively scaling their load balancers up or down were impacted.

To extend the real-world analogy, Azure stopped providing local water services, while Amazon failed to deliver payroll services. They were not the same outage, did not affect the same services, and were not caused by the same problem. The fact is that each cloud outage is unique and often affects only one or two services. Why do we so often assume that all outages are the same? Mostly because many of the obvious public-facing users of cloud services don’t just use compute or storage or load balancing; they rely on all of them working together. So whenever one service fails, the user’s whole system fails – that’s what happened to Netflix during the Amazon ELB outage.
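The arithmetic behind that dependency is worth seeing once. When an application needs every service to be up at the same time, its availability is at best the product of the individual availabilities. The Python sketch below assumes the services fail independently, and the availability figures are invented for illustration; they are not any provider's SLA or measured uptime.

```python
from math import prod

HOURS_PER_YEAR = 8760


def composite_availability(availabilities):
    """Availability of a system that needs ALL listed services at once.

    Assumes failures are independent, which real outages often are not.
    """
    return prod(availabilities)


def expected_downtime_hours(availability, hours=HOURS_PER_YEAR):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * hours


# Hypothetical figures for illustration only.
services = {
    "compute": 0.9995,
    "object storage": 0.999,
    "load balancing": 0.9995,
}

combined = composite_availability(services.values())
print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {expected_downtime_hours(combined):.1f} hours/year")
```

With these made-up numbers, the combined figure is lower than that of the weakest single service, which is why an application built on several services feels every one of their outages.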

Know Your Architecture

Enterprises considering (or actively using) cloud systems should seek to understand the architecture of the offerings. Whether building your own applications in the cloud or investigating solutions from third-party vendors, be prepared to ask:

  • What cloud storage services are employed?
  • Are multiple cloud providers used? Are they deployed in different geographies?
  • What redundancies are built in? What happens if storage becomes unavailable? Compute?
  • How did you handle the Amazon outage? Azure? How will you handle the next one?
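One concrete answer to the redundancy question is a read path that falls back to a second provider when the first is unavailable. The Python sketch below shows only the shape of that idea. The `StorageUnavailable` exception, the `read_with_failover` function, and the backend callables are hypothetical stand-ins, not any vendor's API, and the sketch assumes the data has already been replicated to both providers.

```python
from typing import Callable, Sequence, Tuple


class StorageUnavailable(Exception):
    """Raised by a backend that cannot serve the request (hypothetical)."""


# A backend is modeled as a function that takes a key and returns bytes.
Reader = Callable[[str], bytes]


def read_with_failover(
    key: str, backends: Sequence[Tuple[str, Reader]]
) -> Tuple[str, bytes]:
    """Try each backend in order and return (backend name, data).

    Raises StorageUnavailable only if every backend fails.
    """
    errors = []
    for name, read in backends:
        try:
            return name, read(key)
        except StorageUnavailable as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise StorageUnavailable("all backends failed: " + "; ".join(errors))
```

A vendor that can describe its equivalent of this path, including how data gets to the second provider and how stale it may be, has probably thought through the next outage.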

Tips for Success

When evaluating partners to help deliver infrastructure, you can never ask too many questions. Make sure you understand what the vendor promises (its SLA) and how well it has delivered on that promise. Talk to customers, especially ones who were affected during the outages. You often learn the most about your vendors when their backs are against the wall.

Cloud solutions are entering the enterprise at light speed. While infrequent, outages are a real challenge that IT leaders will need to understand and, in all likelihood, confront. The best way to minimize the impact is to ask the right questions, make informed decisions, and prepare for the inevitable.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.