Will Monday’s downtime for some customers of Amazon Web Services have a lasting impact on AWS, the largest player in the cloud computing arena? The incident is unlikely to cause any kind of mass exodus, and doesn’t discredit cloud as a business model, as some pundits may claim. Some Amazon customers have been through five outages in 18 months, and remain on the AWS cloud.
But the recurring outages may hamper Amazon’s enterprise ambitions, and have prompted a slew of its competitors to leap into action, attempting to lure away the massive AWS customer base.
Lingering Issues for Some Customers
Amazon attributed Monday’s incident to a “small number” of storage volumes in a single availability zone located in its outage-plagued US-East-1 region. The outage affected Minecraft, Reddit, Imgur, and (to some extent) Pinterest, to name a few of the more public ones. While the outage has largely been resolved, there are lingering issues. Volumes affected during this event continued to re-mirror on day two, leading to increased volume IO latency.
US-East-1 has faced continued problems, and some users assert that it is overcrowded. However, customers can use the service without confining themselves to a single Availability Zone, which allows for proper failover. Netflix is one example.
Netflix learned from a serious service outage in 2011 and changed its architecture to avoid using Amazon Elastic Block Store (EBS) as its main data storage service. The company released a thorough explanation that is a worthwhile read for those interested in reliability on Amazon’s platform.
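To make the multi-zone idea concrete, here is a minimal sketch of spreading instances across several availability zones and checking how much capacity survives if one zone fails. It is an illustration only, not Netflix’s or Amazon’s actual method: the function names are hypothetical, and the zone names are placeholders.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin so no single zone holds them all."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_capacity(placement, failed_zone):
    """Count the instances still available if one zone goes down."""
    return sum(1 for zone in placement if zone != failed_zone)


# Placeholder zone names, used only for illustration.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = spread_across_zones(6, zones)

print(Counter(placement))                           # instances per zone
print(surviving_capacity(placement, "us-east-1a"))  # capacity left after one zone fails
```

With six instances in one zone, a zone failure takes out everything; spread evenly across three zones, the same failure leaves four of six running. The trade-off, as the next section notes, is the cost of running and coordinating that extra footprint.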
High Profile Customers Make Outages More Visible
However, many sites with massive amounts of traffic don’t make Netflix-style money. There are cost considerations, and the complex architecture used by Netflix isn’t automatically built into AWS. High-traffic consumer sites are sometimes particularly vulnerable to the costs of proper failover, or at least are the ones that decide to roll the dice with a lower level of redundancy.
They’re also among the most publicly visible AWS customers. Yesterday’s downtime affected a lot of consumer web properties that receive massive amounts of traffic, but generally like to keep costs low. Unfortunately, outages at widely-used consumer web properties – like Reddit, Imgur and Minecraft – are what often get the most attention.
Adding to the problem, Amazon said some users attempting to shift workloads to unaffected zones may have been unable to do so. “Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery,” the dashboard said. “Customers receiving this error can retry failed requests.”
The crux of the issue is this: Is this more of a warning of the dangers of remaining on the AWS cloud, or a warning that sites need to be architected better and not confined to a single availability zone?
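Amazon’s advice to “retry failed requests” is usually implemented with exponential backoff, so that retries don’t add to the load the throttling is meant to relieve. The sketch below shows one way to do that. It is an assumption-laden illustration, not an AWS SDK call: the `ResourceLimitExceeded` exception class and the `retry_with_backoff` helper are hypothetical names, and `call` stands in for whatever function launches an instance.

```python
import random
import time


class ResourceLimitExceeded(Exception):
    """Hypothetical stand-in for the throttling error named in the status update."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ResourceLimitExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling each time, and pick a random wait below
            # it so many clients don't all retry at the same instant.
            ceiling = base_delay * (2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A caller would wrap its launch request, for example `retry_with_backoff(launch_replacement)`, where `launch_replacement` is the caller’s own function. Any other error is deliberately not retried here, since only the throttling error was described as safe to retry.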
Problematic for Enterprise Push
Arguably, this outage is problematic for AWS’ efforts to win over the enterprise. The company has highlighted high-profile enterprise wins, such as Nasdaq for its FinQloud, in a bid to reverse the belief that AWS isn’t suitable for mission-critical systems or enterprise usage. These outages undercut those efforts.
The cycle occurs once again: AWS goes down, the discussion about the importance of redundant availability zones picks up, and competitors pounce on a perceived opportunity to poach customers. Some examples:
- ExtraHop, which provides tools for application performance management, argues that its API monitoring capability would have been able to quickly identify the AWS problem, allowing customers to work with Amazon to resolve it.
- Following the outage, SunGard emphasized the importance of SLAs, and how it can offer SLAs that AWS can’t.
- PEER 1 Hosting issued a press statement that its cloud was recently tested against Amazon EC2 for both speed and consistency by Cloud Spectator. The results showed that PEER 1 Hosting’s cloud outperformed Amazon in performance measurements, including CPU and network performance.
- Centrilogic CEO Robert Offley offered an analogy. “Overall Amazon isn’t a bad product, but has been designed for a different market and application,” Offley wrote. “The same way both Four Seasons and a low cost hotel brand have the same basic product, but they serve very different markets. Good for them that they were first to market, but as the market develops it will become clearer that they are focused on a specific segment.”
- Cloud provider Joyent made a specific plea to Reddit on its blog, in a post titled “If I was your cloud provider, I’d never let you down.”
- Even providers that remain on AWS’ infrastructure attempt to capitalize on these outages. For instance, PaaS provider Heroku went down again. After Heroku’s very public outage last time, fellow PaaS provider Engine Yard was quick to point out that its own service remained available.
This strategy often works – particularly with enterprises that get spooked. It’s another reminder that traditional hosting service providers with IaaS offerings can provide levels of service and customer care that AWS simply cannot, because hands-on, individual customer support isn’t economically feasible for Amazon given the volume of customers on its cloud. Still, this outage is one in a string of outages over the years, and AWS continues to grow and perform well. This won’t cause a mass exodus, but it does give other service providers an opportunity to differentiate and highlight their own clouds.