Lessons Learned from Recent Major Outages

Today’s more interconnected business world makes infrastructure and cloud outages all the more impactful. Here’s a recap of recent outages and their root causes.

In 1988, one broken power line kicked off a series of events that cut off phone service to over 50,000 Chicago-area businesses, hospitals, Chicago’s O’Hare and Midway airports, and consumers for more than two weeks. At the time, that event, the Hinsdale Central Office Fire, was called the greatest telecommunications disaster ever.

Yet even the impact of the largest pre-Internet/cloud event ever does not compare to what happens on a regular basis these days with cloud outages.

The nature of today’s more interconnected business world makes cloud infrastructure and service disruptions more damaging. In the past, an outage was typically restricted to a small geographical area, and there were relatively easy ways to minimize the impact. For example, a cable cut would disrupt service only to those on that one circuit. Many companies would routinely protect themselves by using services from two providers, such as a leased T1 line from one and an ISDN line from another. If the primary line went down due to a cable cut, a site could still run core traffic over the lower-speed link until service was restored.

Putting an Outage’s Impact into Perspective

 

Cloudflare, June 2022

The provider suffered a roughly one-hour outage impacting many companies and sites, including Discord, Shopify, Fitbit, and Peloton. A change to the network configuration at 19 of Cloudflare’s locations disrupted traffic in those locations and caused the outage.

Microsoft Azure and M365 Online, June 2022

East Coast companies that accessed services via Microsoft’s Virginia data center suffered a 12-hour outage. The cause of the outage, according to Microsoft, was “an unplanned power oscillation in one of our data centers” … “Components of our redundant power system created unexpected electrical transients, which resulted in the Air Handling Units (AHUs) detecting a potential fault, and therefore shutting themselves down pending a manual reset.” Customers with always-available or zone-redundant services in that region were not impacted.

...

Read the full article on our sister site, InformationWeek.
