Can Your Company Endure an AWS-size Outage?

Whether hackers worm their way into key data center systems and cause an outage, or a technical mishap takes down multiple servers, human error is usually to blame. As you know, outages can be extremely costly to businesses due to loss of customers, reputation, and revenue.

Most companies today rely on digital information sharing and connectivity as critical elements in their business model. When this function breaks, business grinds to a halt.

The most recent example happened when an engineer for Amazon Web Services mistyped a command while debugging the company’s billing system for its cloud storage service, S3. The correct command would have removed just a small number of servers running on one of the S3 subsystems.

“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” according to an Amazon blog that also apologized for the ramifications.

This proverbial pushing of the wrong button resulted in a four-hour outage that brought down or compromised web services across the internet. It cost AWS customers, and others reliant on third-party services hosted by AWS, hundreds of millions of dollars. Affected services included Coursera, Medium, Quora, Slack, Docker, Expedia, and more.


Cyence, a company that estimates the economic impact of cyber risk, told Data Center Knowledge that the outage caused S&P 500 companies to lose $150 million and US financial services companies another $160 million. The financial services figure could easily exceed $160 million because it doesn’t include many other types of companies, such as credit unions, that host mobile applications on Amazon’s cloud.

An international study by the Business Continuity Institute showed that IT and telecom outages were among the top three sources of disruption for financial services, IT and communications, transport and storage, and government. Fourteen percent of those surveyed estimated that a previous disruptive incident cost their businesses between $1.3 million and $13 million.

John Parker, who is responsible for disaster recovery and global data center operations management for ESRI, thought AWS handled the outage efficiently. “I was impressed with the AWS Root Cause Analysis (RCA) report, but even more impressed that they also looked at other processes during the incident that may have slowed down the resolution process,” he said. “Every company should not only determine the RCA but look at opportunities to improve all processes during an incident, thus reducing the amount of downtime.”

Parker is speaking at the Data Center World conference next month in Los Angeles. Details below.

AWS is taking precautions to prevent similar outages. It is modifying its capacity-removal tool so that the tool cannot remove too much capacity too quickly and cannot remove capacity once any subsystem reaches its minimum required capacity. The team also reprioritized work to partition one of the affected subsystems into smaller ‘cells,’ which was planned for later this year but will now begin right away.
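To make the two safeguards concrete, here is a minimal Python sketch of a capacity-removal routine that caps how many servers a single command may remove and refuses to take a subsystem below a minimum. It is an illustration only, not AWS's tool: the `Subsystem` class, the `remove_capacity` function, the `RemovalRejected` exception, and the `max_per_step` parameter are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


class RemovalRejected(Exception):
    """Raised when a removal request would violate the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; a real system would track far more state.
    name: str
    active_servers: int
    min_required: int  # assumed floor the subsystem needs to keep operating


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int) -> int:
    """Remove up to `requested` servers and return how many were removed."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Safeguard 1: never remove more than max_per_step in a single call,
    # so a mistyped, oversized request is clamped instead of executed.
    allowed = min(requested, max_per_step)

    # Safeguard 2: never drop the subsystem below its minimum capacity.
    headroom = subsystem.active_servers - subsystem.min_required
    if headroom <= 0:
        raise RemovalRejected(f"{subsystem.name} is already at minimum capacity")
    allowed = min(allowed, headroom)

    subsystem.active_servers -= allowed
    return allowed
```

With these assumed names, a request to remove 500 servers from a 100-server subsystem with `max_per_step=5` removes only five, and further requests are rejected once the subsystem is down to its floor. Clamping rather than rejecting an oversized request is a design choice made for this sketch; a stricter tool might refuse the command outright and ask the operator to confirm.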

Of course, developing a plan to restore systems and business as soon as possible goes hand-in-hand with preventing outages from the get-go. Prevention is just not always realistic, considering the nature of humans and the unpredictability of natural disasters.

“No company is immune to outages as long as humans are involved, as human error is the leading cause of outages,” Parker concurred. “That’s why having solid incident management processes in place is so critical for companies: Companies can recover much faster with them.”

Data Center World Keynote Speaker and professional hacker Kevin Mitnick blames humans for cybersecurity weaknesses in corporate America as well.

“My presentation will clearly illustrate why people are the weakest link in the security chain,” Mitnick says. “Attendees will see real demonstrations of some of the most current combinations of hacking, social engineering and cutting-edge technical exploits my team and I actually use to penetrate client systems, with a 100 percent success rate. They will also gain strategies to protect their organizations, and themselves, from harm and to help mitigate the risks they face.”

Once on the FBI’s Most Wanted list for hacking into 40 major corporations, Mitnick will present the Data Center World keynote address on Tuesday, April 4, from 4 p.m. to 5:15 p.m.

Parker will be part of a two-hour panel discussion, “Will the Cloud Replace the Need for Data Centers?” on Wednesday, April 5, from 9:30 a.m. to 11:45 a.m. He will also present “Cloud Computing 101: A Primer” as part of the All Access Pass Workshop series on Monday, April 3, from 8 a.m. to 11:30 a.m.

This article originally appeared at AFCOM.
