There's been a bit of a kerfuffle in Australia over a series of storage outages on a 3PAR 20850 SAN that Hewlett Packard Enterprise built for the Australian Taxation Office, an organization that serves over 12 million taxpayers. HPE's official report on the exact cause is still pending (it is expected in "late 2017"), but a report from the ATO indicates that a poorly designed system, compounded by a lack of preparedness on HPE's part, was largely responsible. The report reads like a cautionary tale for service providers on how not to instill confidence.
Although designed specifically to meet the ATO's needs, the SAN was owned and operated by HPE, and ATO staff had no direct access to it. Shortly before 1:00 a.m. on December 12, 2016, a little over a year after it went online, the system went down. In a cascading effect, data volumes entered preserved states to protect their data and so became unavailable to applications and services.
According to the report, this eventually "resulted in a systems outage, causing the majority of the ATO’s online services to become unavailable." By 3:35 a.m., 455 of 3,063 data volumes (roughly 15 percent) were out of service, evidently because faulty "firmware supporting impacted disk drives in the SAN prevented those drives from re‑booting."
By this time, enough of the conditions the ATO had specified in its contract with HPE had been met to classify the event as a Priority 1 incident, but HPE didn't do so until around 7:00 a.m. The SAN wouldn't return to full functionality for about eight days.
Luckily for the ATO and Australian taxpayers, no data was lost, but "the impact of pre‑incident design and build decisions were material in extending the time to recover data and bring production and supporting systems online."
It turned out that the SAN had been signaling problems that needed attention for months.
"Analysis of SAN log data for the six months preceding the incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage. While HPE had taken some actions in response to these indicators – including the replacement of specific cables – alerts continued to be reported, indicating these actions did not resolve the potential SAN stability risk."
In all, from May through November of 2016, the ATO's incident resolution tool logged 77 events related to components that later failed. And although HPE replaced some cables in an attempt to remedy the situation, the tax agency was "not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN."
Stressed or ill-fitting fiber-optic cables turned out to be the most likely trigger for the failure (a similar outage occurred in February, when a data card was dislodged during a cable replacement), but design issues with the SAN made matters much worse. According to the report, the system had been designed for performance over stability, with some monitoring and resilience features disabled.
"This particular SAN configuration leverages a feature known as wide‑striping which is designed to significantly improve performance by reading and writing blocks of data to and from multiple drives at the same time, preventing single‑drive performance bottlenecks. When several physical disk drives were impacted by a drive firmware issue which prevented those drives from re‑booting, the result was that a small number of drives temporarily and in some cases permanently prevented access to a significant amount of application data. This also had the effect of extending the duration and complexity of the recovery effort."
Recovery was further delayed because the tools needed to restore the system were stored on the failed SAN itself.
Although the ATO report lays much of the problem at HPE's feet, the agency admits to sharing some of the blame. For instance, a backup SAN running in a separate data center couldn't be used for recovery due to a design issue, in part because the ATO had "relied heavily on HPE recommendations," which in hindsight was a mistake.
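The wide-striping trade-off the report describes can be illustrated with a toy model. The sketch below is not how a 3PAR array actually places data: the round-robin layout, the function name `affected_volumes`, and all the numbers are assumptions made for illustration, and the model ignores RAID redundancy entirely. It only shows why the wider each volume is striped, the more volumes a handful of unavailable drives can touch.

```python
def affected_volumes(num_volumes, num_drives, stripe_width, failed_drives):
    """Return the volumes that keep at least one chunk on a failed drive.

    Hypothetical layout (an assumption, not the 3PAR algorithm):
    volume v starts on drive v % num_drives and spans the next
    stripe_width drives, wrapping around the end of the drive list.
    """
    failed = set(failed_drives)
    affected = []
    for vol in range(num_volumes):
        drives = {(vol + i) % num_drives for i in range(stripe_width)}
        # Without redundancy, one unavailable drive blocks the whole volume.
        if drives & failed:
            affected.append(vol)
    return affected


if __name__ == "__main__":
    # Illustrative numbers only: 100 volumes, 20 drives, 2 drives unavailable.
    for width in (1, 4, 20):
        hit = affected_volumes(100, 20, width, failed_drives=[3, 11])
        print(f"stripe width {width:2d}: {len(hit)} of 100 volumes affected")
```

With the same two drives unavailable, the share of affected volumes in this model grows as the stripe width grows, which is the performance-versus-blast-radius trade-off in miniature.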
"Full automated fail‑over for the entire suite of applications and services in the event of a complete SAN failure in Sydney was not part of the storage solution for the SAN. The cost of automatic fail‑over systems, as they exist in some areas of critical infrastructure or in large financial institutions, is very high."
In response to the failure, the ATO has commissioned a new 3PAR solution. Once the data has been transferred off the old system, that system will be taken out of service and undergo forensic analysis by HPE.
The tax agency also indicates that when its current contract with HPE expires, it might consider other options.