Microsoft experienced outages yesterday across its online services including Teams, M365, and Outlook, according to Bloomberg News.
The outages come on the heels of Microsoft’s positive earnings report on Tuesday, which itself contrasts with the firm’s announcement of a 5% workforce reduction that leaves 10,000 of its workers jobless. The layoffs included members of Azure, Microsoft’s cloud services offering and the firm’s revenue growth engine. Notably, while Azure remains a growth engine for Microsoft, growth across the cloud services industry has slowed, signaling that the market is maturing.
Azure is at the center of yesterday’s outage, and Microsoft continued its track record of disclosing the root cause of outages by posting an impact summary on its Azure status history site. The outage lasted for three hours and affected Azure resources across multiple Public Azure regions. Popular services M365 and PowerBI were also affected.
Wide area network (WAN) troubles were the cause of the outage, according to Microsoft’s own disclosures on the matter. A change the firm made to its WAN severed connectivity between the internet and Microsoft’s core suite of services.
The U.S. Federal Aviation Administration (FAA) also experienced an outage last week in its critical pilot safety notification system, known as NOTAM. That outage, too, was due to system changes. According to the FAA, it was caused by a corrupted file in both its primary and secondary databases. When a contractor deleted the files, the system slowed and NOTAM alerts became unavailable to pilots, grounding domestic flights across the U.S.
Outages remain a critical downside of our growing dependence on cloud service providers and, in the case of the FAA, on antiquated systems.
While the two outages differ in source, widespread impact is a common feature of these and other outages at major organizations. The financial impact of system outages, no matter the source, is hard to overstate. The Uptime Institute found that outages costing firms more than $100,000 now account for more than 60% of all connectivity failures (up from 39% in 2019). And more firms are paying upward of $1 million to recover from an outage: the share paying seven figures rose to 15%, up from 11% in previous years.
Azure is the second-largest cloud service provider (CSP), according to reports, behind only Amazon, the originator and market leader of the CSP segment.
Microsoft has committed to providing a full root cause analysis, or Post Incident Report (PIR), within the next three days, followed by a final PIR 14 days after that.
We spoke with Chip Gibbons, CISO at managed services firm Thrive, to discuss post-outage mitigation plans. Here are the highlights:
- Planning is imperative for companies of all sizes – Many businesses can implement a comprehensive data backup and recovery plan with relative ease. Larger organizations might need to address more details, specifically how systems, applications, and working conditions are to be recovered. However, certain aspects of data recovery always need to be addressed, such as understanding how the backup system works, who is in charge of it, what an appropriate recovery point objective (RPO) is, and how much data you need to back up. Settling these in advance can dramatically reduce the time it takes to get back in business following a disaster and help you meet your specified recovery time objective (RTO).
- Routine testing of DR strategies – Testing is a must, even though it can interfere with business operations and potentially cut into productivity. Whenever systems are tested, IT teams are bound to find something wrong with the disaster recovery (DR) strategy and will have to adapt it over time as they address these issues. If the issues are addressed during the testing phase, organizations will have a better chance of success when they truly need to use the DR strategy.
- Remember that IT infrastructure is governed by people – A DR strategy must therefore take human behavior into account. For example, if a company’s location is compromised by a disaster, the organization needs to check whether employees can still access the data they need to do their jobs effectively.
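
To make the RPO and RTO points concrete, here is a minimal sketch in Python of how a team might compare a backup schedule and a measured restore time against its stated objectives. The `RecoveryPlan` class, the `check_plan` function, and all of the numbers are hypothetical and for illustration only; they do not come from Gibbons, Thrive, or Microsoft.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    """Hypothetical description of one system's backup and recovery targets."""
    backup_interval: timedelta    # time between successive backups
    estimated_restore: timedelta  # restore time measured in the last DR test
    rpo: timedelta                # maximum tolerable window of lost data
    rto: timedelta                # maximum tolerable downtime


def check_plan(plan: RecoveryPlan) -> list[str]:
    """Return the gaps between a plan and its stated objectives."""
    gaps = []
    # Worst case, a failure strikes just before the next backup runs,
    # so up to one full backup interval of data can be lost.
    if plan.backup_interval > plan.rpo:
        gaps.append("backup interval exceeds RPO")
    # The restore time observed during testing must fit inside the RTO.
    if plan.estimated_restore > plan.rto:
        gaps.append("restore time exceeds RTO")
    return gaps


# Illustrative values only: nightly backups against a four-hour RPO.
plan = RecoveryPlan(
    backup_interval=timedelta(hours=24),
    estimated_restore=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(check_plan(plan))  # ['backup interval exceeds RPO']
```

In this example the restore time fits within the RTO, but nightly backups could lose up to a day of data against a four-hour RPO. This is the kind of gap that routine DR testing is meant to surface before a real outage does.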
Continue to check this space for updates on this emerging story.