Users of Amazon Web Services saw fewer service interruptions than users of the other two big providers of public cloud infrastructure services – Google Cloud Platform and Microsoft Azure – in 2014, according to two different companies that analyze cloud downtime.
Azure saw more than 500 service interruptions and two major outages last year, according to CloudEndure, a company that helps businesses migrate applications onto public cloud infrastructure. AWS experienced about 230 service interruptions and no widespread outages, the company said.
CloudEndure’s analysis, released this month, is consistent with that of CloudHarmony, which specializes in cloud reliability analysis. CloudHarmony’s CloudSquare service provides much more granular data across many more providers than CloudEndure does.
Amazon EC2, the compute component of AWS, saw about 10 outages across multiple regions in 2014, lasting from 19 seconds to about 9 minutes, according to CloudHarmony. Google Compute Engine had about 60 that lasted from 10 seconds to 37 minutes. Azure Virtual Machines saw more than 100 outages, lasting from 10 seconds to about 12 hours on one occasion.
Each outage CloudHarmony tracks is an outage of an entire region, whereas CloudEndure tracks service errors.
Azure’s big multi-region outages happened in August, when the provider reported full service interruptions across multiple regions, and in November, when about 20 services were interrupted in most of its availability zones. The 12-hour single-zone outage was on November 5 in the asia-east zone, which went down multiple times that day and once on the preceding day.
The uptime ranking of the three providers’ storage services was different. Google Cloud Storage had eight single-zone outages during 2014; Amazon’s S3 service had about 20; and Azure Object Storage saw nearly 140 outages.
Azure and AWS error rates declined consistently throughout the year. AWS went from 127 errors in the first quarter to 26 in the fourth, according to CloudEndure. Azure went from about 260 errors in the first quarter to about 200 in the fourth. CloudEndure has not analyzed Google’s cloud reliability record.
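To put those quarterly figures in proportion, the short Python sketch below works out the percentage decline for each provider. It uses only the first-quarter and fourth-quarter counts quoted above; the Azure numbers are approximate in the source, so its result is a rough estimate. The function and variable names are illustrative.

```python
def percent_decline(start: float, end: float) -> float:
    """Return the percentage drop from start to end."""
    return (start - end) / start * 100


# First-quarter and fourth-quarter error counts as reported by CloudEndure.
# The Azure figures are approximate ("about 260", "about 200").
quarterly_errors = {
    "AWS": (127, 26),
    "Azure": (260, 200),
}

declines = {
    provider: round(percent_decline(q1, q4), 1)
    for provider, (q1, q4) in quarterly_errors.items()
}

for provider, decline in declines.items():
    print(f"{provider}: errors fell by roughly {decline}% from Q1 to Q4")
```

By this measure, the AWS error count fell by roughly four-fifths over the year, while Azure's fell by a little under a quarter.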
As Ofir Ehrlich, vice president of research and development at CloudEndure, pointed out in a blog post, service providers’ ability to maintain uptime is important, but the top reason user applications go down is human error. To guard against that, it’s important to set up a resilient multi-region failover architecture, whichever cloud provider you use.
The most resilient option is the ability to fail over from one provider to another. As Verizon Cloud Services demonstrated earlier this month, outages across a single provider’s entire cloud are possible. The two days of downtime were intentional, to apply some upgrades, yet customers that had not set up cross-provider failover had to accept the unusually long maintenance window. Verizon later said part of the upgrade was to make it possible to apply updates in the background, without taking services offline.
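As a rough illustration of the failover idea, the Python sketch below picks the first healthy endpoint from a priority-ordered list that spans regions and providers. It is a minimal sketch under stated assumptions, not a production design: the endpoint names are hypothetical, and the health check is passed in as a function because a real probe (an HTTP request, a DNS health check, a load balancer status) depends on the deployment.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as an unhealthy endpoint.
            continue
    return None


# Hypothetical endpoints, ordered by preference: the primary region first,
# then a second region with the same provider, then a different provider.
ENDPOINTS = [
    "provider-a/region-1",
    "provider-a/region-2",
    "provider-b/region-1",
]

# Simulate a provider-wide outage at provider A.
down = {"provider-a/region-1", "provider-a/region-2"}
active = pick_endpoint(ENDPOINTS, lambda endpoint: endpoint not in down)
print(active)
```

With both provider A regions marked as down, the selection falls through to the provider B endpoint. A setup limited to a single provider would have returned nothing.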