Data Center Backup Failure Blamed on One Person, Not NYSE Leadership

Monday night an NYSE staffer neglected to stop a backup of that day’s trades. The next day, mayhem erupted. Here’s how to avoid a similar fate.

Data Center Knowledge

January 26, 2023

On Monday night, a New York Stock Exchange (NYSE) staffer failed to manually shut off the organization's backup system after it finished backing up the day's trades, according to Bloomberg News. This set off a chain of events that brought the NYSE to its knees for nearly two days.

This is the second high-profile system crash caused by human error this month, which suggests that an organization's size is no guarantee that it follows best practices in data center and disaster recovery management.

Two weeks ago, the Federal Aviation Administration (FAA) experienced an outage that grounded domestic flights, causing thousands of delays and cancellations. That outage was caused by a contractor who deleted a corrupt file in the FAA's primary and backup systems.

This week's system failure began on Monday night, when the NYSE staffer didn't shut off the trading system's connection to the Cermak data center in Chicago, where NYSE backs up daily trading data. As a result, the systems treated Tuesday's trades as a continuation of Monday's trading session, which caused wild price fluctuations on the exchange. NYSE has yet to disclose the financial damage stemming from this oversight.

The fallout for NYSE: investment firms are filing claims with the exchange to recoup losses incurred from the unpredictable price swings. Some estimate the cost of this disaster recovery error could reach hundreds of millions of dollars.

Data Center Disaster Recovery Issues Snuffed Out with C-Suite Support

Both the FAA and NYSE identified an individual as the source of their system failures. Yet analysts believe both incidents were 100% avoidable.

"Automation removes human error," Omdia analyst Dennis Hahn tells Data Center Knowledge. "If this [disaster recovery system] needed to be manually shut down, this is ridiculous and asking for trouble."
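
To make Hahn's point concrete, here is a minimal Python sketch of what replacing a manual shutdown step with automation can look like. It is hypothetical: `ReplicationLink`, `run_backup`, and `pre_open_check` are illustrative names, not NYSE's systems or any vendor's API, and a real exchange's disaster recovery architecture is far more complex. The backup job closes its own connection when it finishes (even if the copy fails), and a start-of-day check refuses to open a new session while an earlier session's link is still active.

```python
# Hypothetical sketch: names are illustrative, not a real exchange or vendor API.
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, List, Optional


@dataclass
class ReplicationLink:
    """Hypothetical stand-in for a primary-to-DR-site replication connection."""
    active: bool = False
    session_date: Optional[date] = None
    events: List[str] = field(default_factory=list)

    def open(self, session_date: date) -> None:
        self.active = True
        self.session_date = session_date
        self.events.append(f"opened for {session_date}")

    def close(self) -> None:
        self.active = False
        self.events.append(f"closed for {self.session_date}")


def run_backup(link: ReplicationLink, session_date: date,
               copy_trades: Callable[[date], None]) -> None:
    """Copy one session's data, then close the link with no manual step.

    The `finally` clause runs the shutdown even when the copy itself fails.
    """
    link.open(session_date)
    try:
        copy_trades(session_date)
    finally:
        link.close()


def pre_open_check(link: ReplicationLink, today: date) -> None:
    """Block a new session while a link from an earlier session is still open."""
    if link.active and link.session_date != today:
        raise RuntimeError(
            f"replication link still open for {link.session_date}; "
            f"refusing to start session {today}"
        )


if __name__ == "__main__":
    link = ReplicationLink()
    run_backup(link, date(2023, 1, 23), lambda d: print(f"copying trades for {d}"))
    pre_open_check(link, date(2023, 1, 24))  # passes: the link was closed automatically
    print(link.events)
```

The design choice is a guard on both ends: the shutdown is tied to the backup job's own lifecycle, and the next session independently verifies that it happened rather than trusting that someone remembered.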

Hahn went on to say enterprises would do well to:

  1. Add more intelligence to automation through AI technologies in software.
    “Today's backup vendors are increasingly adding AI to their systems to detect misconfigurations,” says Hahn.

  2. Remove the specter of human error through C-suite buy-in for systems automation.
    While both the FAA and NYSE lay their recent disasters at the feet of individuals, Hahn says strategic leadership is ultimately at fault.
    “This [the NYSE systems error] would most likely be a misconfiguration in the DR systems’ scheduling. DCIM could have helped, but it typically seems higher level than this [staffer-level] problem.”

  3. Leverage AI automation to secure backups while preventing cyberattacks.
    “Know that these same AI technologies are also trending towards being invaluable in thwarting ransomware attacks and protecting recovery data in today's backup and DR systems,” says Hahn.

While AI features have been used in DCIM software for several years, new data center infrastructure management systems designed around AI are now emerging, especially for automation and predictive modeling.

In a recent article, Marc Garner, vice president of the Secure Power Division at Schneider Electric, said that in addition to offering cloud capabilities, next-generation DCIM systems must "connect to a data lake to use AI and deliver in-depth insights."

Could AI-enhanced DCIM have prevented the NYSE system failure? Yes, if the backup component of the disaster recovery system had been automated and integrated into DCIM.

How AI-powered Backup Automation Removes System Failure Risks

Our sister publication AI Business touched on the topic of AI systems learning backup patterns to predict and prevent catastrophic system errors or full-on failures. Here’s an excerpt from a recent article on the topic:

For situations in which business continuity requires backups, policy-based backups incorporate what JG Heithcock, software engineering manager at Google, characterized as "a strong focus on algorithms" that delivers a number of advantages through AI.

Instead of rigidly scheduling jobs in a predefined sequence at a set time, this type of back-end intelligence lets organizations specify the requirements for their backups (which machines to back up, how frequently, and where) and then leaves the rest to the system. Specifically, a combination of machine learning and static algorithms is responsible for:
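
The policy-driven model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google's or any backup product's implementation: `BackupPolicy` and `out_of_policy` are hypothetical names, and a machine is assumed to be out of policy when more time than the policy interval has passed since its last backup.

```python
# Hypothetical sketch: BackupPolicy and out_of_policy are illustrative names.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Tuple


@dataclass(frozen=True)
class BackupPolicy:
    """What the organization declares: which machines, how often, and where."""
    machines: Tuple[str, ...]
    interval: timedelta
    destination: str


def out_of_policy(policy: BackupPolicy,
                  last_backup: Dict[str, datetime],
                  now: datetime) -> Dict[str, timedelta]:
    """Map each overdue machine to how far past its deadline it is.

    A machine with no recorded backup is treated as maximally overdue.
    """
    overdue = {}
    for machine in policy.machines:
        last = last_backup.get(machine)
        if last is None:
            overdue[machine] = timedelta.max
            continue
        lateness = (now - last) - policy.interval
        if lateness > timedelta(0):
            overdue[machine] = lateness
    return overdue


if __name__ == "__main__":
    policy = BackupPolicy(("laptop-a", "laptop-b", "laptop-c"),
                          interval=timedelta(days=1), destination="offsite-vault")
    now = datetime(2023, 1, 24, 8, 0)
    last = {"laptop-a": datetime(2023, 1, 23, 9, 0),
            "laptop-b": datetime(2023, 1, 21, 8, 0)}
    print(out_of_policy(policy, last, now))
```

The caller states intent (the policy), and deciding what is due is left to the scheduler, which is the shift away from fixed job sequences that the excerpt describes.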

Non-sequential backups: Intelligent backup solutions run backups in different sequences depending on device availability, which is bound to fluctuate for laptops, tablets and smartphones. If a certain employee's laptop isn't available for a daily backup at eight in the morning, the system will back up another machine while "still looking to see if [the first machine] has come online," Heithcock said.
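
Non-sequential scheduling comes down to not letting one unavailable device hold up the queue. A minimal sketch, assuming a hypothetical `is_online` probe and `back_up` action supplied by the caller (neither is a real product API):

```python
# Hypothetical sketch: is_online and back_up stand in for real device probes/actions.
from typing import Callable, Iterable, List


def backup_pass(due: Iterable[str],
                is_online: Callable[[str], bool],
                back_up: Callable[[str], None]) -> List[str]:
    """Back up every due machine that is reachable right now.

    Offline machines are skipped instead of blocking the rest, and are
    returned so the next pass can check whether they have come online.
    """
    still_waiting = []
    for machine in due:
        if is_online(machine):
            back_up(machine)
        else:
            still_waiting.append(machine)
    return still_waiting


if __name__ == "__main__":
    online = {"laptop-b", "laptop-c"}   # laptop-a is offline at 8 a.m.
    done: List[str] = []
    waiting = backup_pass(["laptop-a", "laptop-b", "laptop-c"],
                          lambda m: m in online, done.append)
    online.add("laptop-a")              # laptop-a comes online later
    waiting = backup_pass(waiting, lambda m: m in online, done.append)
    print(done, waiting)                # ['laptop-b', 'laptop-c', 'laptop-a'] []
```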

Preferential backup hierarchies: Part of the decision-making capability of the AI powering these systems goes toward determining which backups take priority over others. For instance, if a policy calls for daily backups and one employee has been offline for three days while another has been offline for only a day and a half, the system will back up the former first. "That's the proactive part, or the AI part," Heithcock noted. "It's going to jigger its priority list to try and get to people who are most out of policy first."
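
The "most out of policy first" rule can be sketched as a sort on lateness. Here lateness is assumed to be the time since the last backup minus the policy interval; `priority_order` is a hypothetical name, and a production system would likely weigh more signals than this single one.

```python
# Hypothetical sketch: one plausible reading of "most out of policy first".
from datetime import datetime, timedelta
from typing import Dict, List


def priority_order(last_backup: Dict[str, datetime],
                   interval: timedelta,
                   now: datetime) -> List[str]:
    """Return overdue machines, furthest out of policy first."""
    lateness = {machine: (now - last) - interval
                for machine, last in last_backup.items()
                if now - last > interval}
    return sorted(lateness, key=lambda machine: lateness[machine], reverse=True)


if __name__ == "__main__":
    now = datetime(2023, 1, 24, 12, 0)
    last = {
        "employee-1": now - timedelta(days=3),     # offline three days
        "employee-2": now - timedelta(days=1.5),   # offline a day and a half
        "employee-3": now - timedelta(hours=6),    # still within policy
    }
    print(priority_order(last, timedelta(days=1), now))  # ['employee-1', 'employee-2']
```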

Temporal prioritization: Another crucial aspect of the machine intelligence in smart backups is the ability to prioritize jobs by how long they take. If two jobs are equally due (meaning there's little difference in the time since each machine's last backup), but "Able is going to be somebody you can backup in 10 minutes and Fred takes you two hours, then it backs up the [former] first to get done with him and have more time to do the other," Heithcock said.
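
The Able-and-Fred example amounts to a shortest-job-first tie-break among jobs that are about equally overdue. A minimal sketch under assumptions: `Job`, `order_jobs`, and the `tie_window` parameter are hypothetical, and durations are assumed to be estimated in advance (for instance from past runs).

```python
# Hypothetical sketch: Job, order_jobs, and tie_window are illustrative names.
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class Job:
    machine: str
    overdue: timedelta        # how far past its policy deadline the machine is
    est_duration: timedelta   # assumed predicted run time, e.g. from past runs


def order_jobs(jobs: List[Job], tie_window: timedelta) -> List[Job]:
    """Most overdue first; within the same tie_window bucket, shortest job first.

    Bucketing by whole windows is a simplification: two jobs just either side
    of a bucket boundary are not treated as tied.
    """
    return sorted(jobs, key=lambda j: (-(j.overdue // tie_window), j.est_duration))


if __name__ == "__main__":
    jobs = [
        Job("Fred", overdue=timedelta(hours=1, minutes=10), est_duration=timedelta(hours=2)),
        Job("Able", overdue=timedelta(hours=1), est_duration=timedelta(minutes=10)),
    ]
    print([j.machine for j in order_jobs(jobs, tie_window=timedelta(hours=1))])  # ['Able', 'Fred']
```

Finishing the short job first frees the remaining window for the long one, which is the rationale Heithcock gives.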

Bottom line: As organizations modernize, sole dependence on human intervention can lead to damages in the hundreds of millions of dollars, as the estimates in the NYSE case suggest. Humans and systems can work in concert, but it's clear that without automation and predictive analysis from AI, companies may be creating their own disasters by relying on "wetware," or human intelligence, alone.
