This past Tuesday morning Pacific Time, an Amazon Web Services engineer debugging an issue with the billing system for the company’s popular cloud storage service S3 accidentally mistyped a command. What followed was an hours-long cloud outage that wreaked havoc across the internet and resulted in hundreds of millions of dollars in losses for AWS customers and others who rely on third-party services hosted by AWS.
The long list of popular web services that suffered either full blackouts or degraded performance because of the AWS outage includes the likes of Coursera, Medium, Quora, Slack, Docker (which delayed a major news announcement by two days because of the issue), Expedia, and AWS’s own cloud health status dashboard, which, as it turned out, relied on S3 infrastructure hosted in a single region.
Cyence, a startup that models the economic impact of cyber risk, estimated that the four-hour AWS outage cost S&P 500 companies $150 million, a company spokeswoman said via email. US financial services companies lost $160 million, the firm estimated.
Those estimates don’t include the countless other businesses that rely on S3, on other AWS services that depend on S3, or on service providers that built their services on Amazon’s cloud. Many credit unions, for example, have third-party firms provide their mobile banking applications, which are often hosted in AWS.
The engineer who made the expensive mistake meant to execute a command that would remove only a small number of servers running one of the S3 subsystems. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” according to a post-mortem Amazon published Thursday, which also included an apology.
The removed servers supported two other crucial S3 subsystems: one that manages metadata and location information for all S3 objects in Amazon’s largest data center cluster, located in Northern Virginia, and one that manages allocation of new storage and relies on the first subsystem.
Once the two subsystems lost a big chunk of capacity, they needed to be restarted, which is where another problem occurred. Restarting them took much longer than AWS engineers expected, and while they were being restarted, other services in the Northern Virginia region (US-East-1) that rely on S3 – namely the S3 console, launches of new cloud VMs by the flagship Elastic Compute Cloud service, Elastic Block Store volumes, and Lambda – were malfunctioning.
Amazon explained the prolonged restart by saying the two subsystems had not been completely restarted for many years. “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
To keep similar issues from occurring in the future, the AWS team modified its capacity-removal tool in two ways: it now removes capacity more slowly, and it blocks any removal that would take a subsystem below its minimum required capacity.
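The two safeguards described above can be illustrated with a short sketch. This is not Amazon’s tool, whose internals have not been published; the names `Subsystem`, `remove_capacity`, `min_required`, and `max_per_step` are all hypothetical, and a simple per-request cap stands in for whatever rate limiting AWS actually uses.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor; removals may not go below it


def remove_capacity(subsystems, requested, max_per_step):
    """Apply `requested` ({subsystem name: servers to remove}) only if safe.

    Two illustrative checks, both assumptions about how such a guard
    could work:
      1. no single request removes more than `max_per_step` servers;
      2. no subsystem ends up below its `min_required` floor.
    """
    by_name = {s.name: s for s in subsystems}

    # Validate every request first so a rejected command changes nothing.
    for name, count in requested.items():
        sub = by_name[name]
        if count > max_per_step:
            raise CapacityRemovalError(
                f"{name}: {count} servers exceeds the per-step limit "
                f"of {max_per_step}"
            )
        if sub.active_servers - count < sub.min_required:
            raise CapacityRemovalError(
                f"{name}: removing {count} servers would leave fewer than "
                f"the required {sub.min_required}"
            )

    # All checks passed, so apply the removals.
    for name, count in requested.items():
        by_name[name].active_servers -= count

    return {s.name: s.active_servers for s in subsystems}
```

Under these assumptions, a mistyped input that asks for far more servers than intended is rejected before any server is touched, and a series of individually small removals still stops once a subsystem reaches its floor.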
The team also reprioritized work to partition one of the affected subsystems into smaller “cells,” which was planned for later this year but will now begin right away.
Finally, the AWS Service Health Dashboard now runs across multiple AWS regions so that customers don’t have to rely on Twitter to learn about the health of their cloud infrastructure in case of another outage.
Correction: A previous version of this article said S&P 500 companies lost $150 million to $160 million as a result of the outage. Cyence has provided us with a more precise figure ($150 million) and the additional estimate of the $160 million lost by US financial services companies.