Last month, AWS customers who used the firm’s site-to-site VPN and internet connectivity through its US-East-2 availability zone experienced an outage. The outage lasted 40 minutes, from 12:26 p.m. to 1:06 p.m. PST.
During that span, AWS customers who depended on that availability zone faced a myriad of issues. Some were notified by their own customers of the outage, which caused embarrassment if not brand degradation or even a loss of customers. They also received pressure from executives to provide a mitigation plan in case a similar outage occurs. To provide such a plan, organizations need a key element: the cause of the outage.
As of press time, AWS has not issued what some call a ‘post-mortem’ or a post-incident summary of the outage. When asked when or if AWS would provide a post-mortem, an official with the firm had this to say:
“We do not publish Post-Event Summaries … for every service event,” wrote an AWS spokesperson to Data Center Knowledge in an email response to our query this week. “When an issue has broad and significant customer impact that results in the failure of a significant percentage of control plane API calls, impacts a significant percentage of a service’s infrastructure, resources or APIs or is the result of total power failure or significant network failure, AWS is committed to providing a public Post-Event Summary (PES) following the closure of the issue.”
AWS customers whose businesses rely on the firm’s US-East-2 availability zone are left scratching their heads, wondering how to mitigate an issue with no reported cause.
AWS Had Another US-East-2 Availability Zone Outage Last Year
The Dec. 5 incident was the second of 2022 in the US-East-2 availability zone for AWS. The July 28 outage was farther-reaching, and one could argue that it met the criteria the AWS representative shared.
For 2.8 hours, AWS customers had no access to 38 of the leading cloud service provider’s services in the US-East-2 availability zone. Those services included API Gateway, CloudWatch, DynamoDB, and the firm’s flagship offering, Elastic Compute Cloud (commonly referred to as EC2).
The AWS Health Dashboard listed EC2 as degraded for the duration of the “loss of power” incident.
AWS has not yet published a post-incident summary detailing the cause and mitigation of that issue, either.
While there’s no additional guidance AWS offers in these instances of smaller outages (other than their customer support options), customers do have access to the AWS Health Dashboard and the Personal Health Dashboard (which is a customer-specific dashboard of services and environments).
The last post-incident summary AWS provided for any service disruption was for a Dec. 10, 2021, internal connectivity issue that affected EC2 API and container API, among other service access issues.
Plan Around CSPs to Mitigate Impact of Outages
In our previous coverage of the AWS outage in December, we shared a five-step plan for addressing outages. Two key points from that coverage apply directly to the issue of cloud service providers (CSPs) and disaster planning.
We asked Mike Gibbs, CEO of Go Cloud Careers, to expand on his guidance.
1. Plan around cloud failures and the CSPs themselves.
Enterprises don’t design cloud architecture around brands but rather to solve customer challenges. For business continuity, however, considering brands can’t be avoided.
The cloud is really nothing more than a virtual network and data center, and these systems fail. Because of the additional cloud software, there is more to go wrong in a cloud computing environment than in a traditional, on-premises data center. It’s essential for cloud architects designing mission-critical customer systems to use multiple clouds and multiple network providers to connect their systems to those clouds. This follows the Navy SEAL motto, “Two is one and one is none.”
2. Business continuity starts in the C-suite.
Architects would do well to consider every threat, and to plan a mitigation for each, when designing systems for the cloud.
The business requirements determine what type of contingency planning needs to occur. For instance:
- How much downtime can a business tolerate?
- What percentage of the business’s systems need to function in a disaster situation?
- What are the security concerns?
These are the requirements that the cloud architect needs to know in order to properly design a disaster recovery plan and the technology to support that plan.
Updated on Jan. 6, 2023 in paragraph 10 to include a mention of AWS' customer support options.