Yesterday’s outage that grounded more than 7,000 flights landing in the U.S. was triggered by a corrupt file in the U.S. Federal Aviation Administration’s (FAA) primary and backup systems, according to CNN.
While not a data center outage, the FAA’s system outage holds many lessons for the data center industry, including the need to confront “paleoware” (outdated software) and unmaintained systems before they become a major issue. Of the outages cloud service providers (CSPs) have experienced in the last six months, many were caused by faulty equipment, such as Alibaba’s recent cooling-related outage and Kakao’s outage caused by a lithium-ion battery fire. It’s also important to note that for some outages, CSPs disclose no cause at all.
Current statements from leaders surrounding the FAA outage place blame on a single point of failure in an otherwise workable system. Yet there is also a nod to the role complexity played in setting off that single trigger.
“This is an incredibly complex system,” U.S. Department of Transportation Secretary Pete Buttigieg told NBC News’ Andrea Mitchell. “So glitches or complications happen all the time.”
How complex systems fail
Let’s dive deeper into the systems complexity issue to unearth key insights for data center management and uptime.
“How Complex Systems Fail,” a classic paper by Dr. Richard I. Cook, MD, of the Cognitive Technologies Laboratory at the University of Chicago, highlights 18 key considerations for managing complex systems. Here’s a selection of the points most applicable to data centers:
Catastrophe requires multiple failures – single point failures are not enough.
For simplicity’s sake, many organizations publish only a spare post-mortem of data center outages. Yet it’s the operational mitigation of complex system failures that prevents catastrophes. “Most initial failure trajectories are blocked by designed system safety components,” wrote Dr. Cook. “Trajectories that reach the operational level are mostly blocked, usually by practitioners.”
Complex systems run despite inherent flaws.
No matter how clear a process is or how ‘automatic’ an automation is, human intervention and institutional knowledge are critical to consistent uptime and smooth operations. Case in point: CNN interviewed a source familiar with the FAA outage. The source said, “the NOTAM system is an example of aging infrastructure due for an overhaul.”
“Because of budgetary concerns and flexibility of budget, this tech refresh has been pushed off,” the source said to CNN. “I assume now they're going to actually find money to do it.”
Human practitioners are the adaptable element of complex systems.
These adaptations include adjusting to vulnerable parts and components; concentrating critical resources in areas of highest demand; creating pathways to quickly retreat or recover from unexpected system faults; and developing early-detection systems that alert operators to the need for more system resiliency.
One application of these concepts in the data center world is data center infrastructure management (DCIM). Even with the powerful visibility DCIM systems give users, rack-level engineering expertise and institutional knowledge may be the key elements standing between your organization and a severe downtime incident.
Updated on Jan. 13, 2023 to include the "FAA Outage – A Timeline" graphic.