On the afternoon of June 1, a Friday, many people across the UK couldn’t use their Visa credit cards in shops, pubs, restaurants, and wherever else electronic payments are accepted.
Visa Europe, the European unit of the global payment processing giant, saw its data center infrastructure fail in an incident that took the company’s operations team about 10 hours to resolve. Foster City, California-based Visa, Inc. acquired London-based Visa Europe in 2016.
The incident, caused by a network-switch failure, resulted in millions of credit-card transactions failing to go through. A few days later, UK regulators sent the company a letter requesting a detailed explanation of what happened, how it planned to compensate those affected, and how it would prevent a similar incident in the future.
In a response addressed to Nicky Morgan, the member of Parliament who chairs the House of Commons’ Treasury Committee, Visa’s European CEO Charlotte Hogg said the company held itself accountable for the incident and would compensate cardholders. She also provided some technical details, attributing the incident to the failure of a hardware component in a switch in one of the company’s data centers.
There was never a full system outage, Visa claims. According to Hogg’s letter, most of the transactions attempted over the course of the incident (91 percent) went through normally; the remaining 9 percent, some 5.2 million swipes, failed to process initially, though about half of those went through on a second attempt.
During the two periods of “peak disruption,” a 10-minute period followed by a 50-minute period in the afternoon, 35 percent of transactions failed to process on average. Outside the peak periods, the failure rate was about 7 percent.
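Taken together, the figures above imply a rough total transaction volume for the incident window. This is a back-of-the-envelope check based only on the two numbers in the letter, not a figure Visa reported:

```python
# Back-of-the-envelope: if 5.2 million failed swipes were roughly
# 9 percent of all attempts, total attempts were about 58 million.
failed = 5_200_000       # initially failed transactions (from the letter)
failure_share = 0.09     # roughly 9 percent of attempts failed

total_attempts = failed / failure_share
print(f"{total_attempts:,.0f}")  # ≈ 57,777,778 attempted transactions
```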
The impact wasn’t limited to Visa cards. Visa offers some third-party companies that provide payment services to merchants a gateway for sending transactions from other networks through its systems, but “fewer than 1,000 transactions on this gateway service were impacted,” the company said.
Here’s what happened, according to Hogg:
Failure to Switch
Visa has two data centers in the UK in an “active-active” configuration. That means one is a mirror image of the other, and if one of them fails, the other is designed to handle the full transaction load on its own automatically.
They’re kept in sync by constantly communicating system status. This synchronization system was involved in the outage.
Two core switches (primary and backup) direct transactions for processing in each of the facilities. The backup switch is designed to take over if the primary one fails, and the switch failover scheme is also implicated in the incident.
When a component within a core switch in one of the data centers failed, the nature of the failure prevented the backup switch from activating. It was a “very rare partial failure,” the letter said, without naming the component or the switch’s manufacturer.
Because the failure was partial, the system continued trying to synchronize with the secondary data center, creating a backlog of messages there, and hampering the secondary site’s ability to process transactions.
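The failure mode described above, where the backup never activates because the primary is only partially down, is a classic pitfall of failover logic that checks basic liveness but not actual processing health. Visa has not published its failover logic, so the sketch below uses hypothetical names and is only an illustration of the general pattern:

```python
from dataclasses import dataclass

@dataclass
class Switch:
    name: str
    responds_to_ping: bool   # basic liveness signal
    forwards_traffic: bool   # deeper functional health

def naive_failover(primary: Switch, backup: Switch) -> Switch:
    """Fail over only when the primary stops responding entirely.

    A partial failure (alive but no longer forwarding) is invisible
    to this check, so the backup never activates.
    """
    return backup if not primary.responds_to_ping else primary

def health_aware_failover(primary: Switch, backup: Switch) -> Switch:
    """Fail over when the primary fails either liveness or a
    functional check, catching partial failures too."""
    healthy = primary.responds_to_ping and primary.forwards_traffic
    return primary if healthy else backup

# A partially failed primary: still answering, no longer forwarding.
primary = Switch("core-1", responds_to_ping=True, forwards_traffic=False)
backup = Switch("core-2", responds_to_ping=True, forwards_traffic=True)

print(naive_failover(primary, backup).name)         # stays on the broken primary
print(health_aware_failover(primary, backup).name)  # hands off to the backup
```

The same partial-failure state also explains the backlog: a half-alive primary keeps emitting synchronization traffic it can no longer usefully complete.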
As of the date on the letter, the exact cause of the component failure had not been identified. Visa’s data center operations team was still investigating together with the company that made the switch.
Hours of Manual Labor
It took hours for the data center team just to deactivate the system that was causing transaction failures.
Doing that involved switching off applications at the primary site and clearing message backlogs at the secondary site, both manually and automatically. Shortly after 7 p.m., the secondary data center was processing nearly all transactions normally.
Things were “largely resolved” another hour later, and service levels were back to normal in both data centers after midnight on Saturday.
A Short-Term Fix and a Long-Term One
The malfunctioning part has been replaced in the offending switch. While the forensic analysis of the incident has not been completed, Visa Europe is planning to use software recommended by the switch’s vendor that it said will monitor component health on all four switches and take a switch offline immediately if a fault is detected, preventing a repeat of the partial-failure scenario.
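The monitoring approach Hogg describes amounts to withdrawing a switch from service on the first detected component fault, rather than waiting for it to fail outright. A minimal sketch of that policy, with hypothetical switch names (the letter names neither the vendor nor the software):

```python
from dataclasses import dataclass, field

@dataclass
class CoreSwitch:
    name: str
    component_faults: list = field(default_factory=list)

def monitor_pass(switches):
    """One pass of a component-health monitor: any switch reporting a
    fault is taken offline immediately, so a partial failure cannot
    linger and disrupt cross-site synchronization."""
    in_service, withdrawn = [], []
    for sw in switches:
        (withdrawn if sw.component_faults else in_service).append(sw)
    return in_service, withdrawn

# The four core switches: primary and backup in each data center.
fleet = [
    CoreSwitch("dc1-primary", ["line-card fault"]),  # partial failure detected
    CoreSwitch("dc1-backup"),
    CoreSwitch("dc2-primary"),
    CoreSwitch("dc2-backup"),
]

in_service, withdrawn = monitor_pass(fleet)
print([s.name for s in withdrawn])  # ['dc1-primary']
```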
But that’s a short-term fix. The project that should provide a long-term fix is something Visa has been working on since February, and that is migrating the European system onto VisaNet, the company’s global processing system. The migration is part of the post-acquisition process of integrating Visa Europe with Visa Inc.
VisaNet runs in more data centers, serves multiple geographies, has “significantly more capacity and scale” than the European system, and supports four active-active images working in tandem, or double the redundancy of the current European setup.
VisaNet also can “isolate and remove a failing component with one command, taking mere minutes to remove the malfunctioning component from the processing environment.”
Over the last several years, the global network has properly processed 99.9995 percent of the 319 billion transactions that have gone through it, according to the CEO’s letter.