The pandemic was just one factor among many believed responsible for changing patterns of data center outages in 2020. While the overall number of data center outages appears to still be growing, its rate of growth is being outpaced by the rate of IT infrastructure expansion, which means individual facilities may be experiencing fewer incidents, according to new data released by Uptime Institute.
“You might think there’d be more outages because of COVID,” Andy Lawrence, Uptime’s executive director of research, said during a webinar last week. “Actually, there were fewer serious and severe outages in 2020 than in previous years. Nevertheless… the impact and the cost of outages are definitely growing, even if the amount of outages per kilowatt of IT load is dropping. That’s because of growing dependency on IT.”
Uptime’s findings represent more good news for an industry that had been gearing up for a triple-threat: much greater demand for IT services from remote users, fewer personnel on-site to handle service issues, and the side-effects from fewer personnel at power stations and electrical service facilities. In perhaps the happiest of all possible shocks one can receive from electricity, 2020 turned out to be a good year in terms of service delivery.
“The big takeaway from this,” explained Uptime CTO Chris Brown, “is that [data center] outages are still continuing to happen. We’re not getting worse, but we’re not getting any better. As an industry, we really do need to figure out where those outages are coming from and start figuring out how to address them.”
The data from Uptime’s 2021 annual survey on the topic of data center outages and their causes acts as something of a double-edged sword. Although only 6 percent of respondents said their facilities experienced severe (“Uptime category 5”) outages in 2020, compared with 11 percent for 2019, the very fact of fewer outages renders it more difficult to assess the trends behind them.
Static Transfer Switches Under Scrutiny in Data Center Outages
Coupled with Uptime’s direct analysis of electricity usage patterns for its own clients, however, its experts used the survey data to draw some conclusions, and they’re not what you might expect. While on-site power systems continue to account for the greatest number of power-related data center outage events by volume, it’s the components of those systems, such as UPS batteries and automatic transfer switches (ATS), that are more often the eventual culprits.
“The transfer switch between the engine generator and the utility,” remarked Brown, “is typically a highly intelligent switch that’s looking for power problems in multiple ways.” A shift in the electricity frequency, or a sudden reduction or increase in voltage, may be at least as likely a cause of a power event as a simple bottoming out of power. When the ATS works the way it should, it triggers the startup of on-site backup power generation and transfers the load to the generator.
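As a rough illustration of the decision Brown describes, the sketch below models a transfer switch that watches both voltage and frequency, starts the generator when utility power drifts out of range, and moves the load once the generator is ready. The nominal values, tolerances, and the names `Reading`, `utility_ok`, and `ats_action` are assumptions made for this example; they do not come from Uptime or from any vendor's switchgear.

```python
from dataclasses import dataclass

# Hypothetical settings for illustration only; real switches use
# site-specific values.
NOMINAL_VOLTS = 480.0
NOMINAL_HZ = 60.0
VOLT_TOLERANCE = 0.10  # allow +/-10 percent of nominal voltage
HZ_TOLERANCE = 0.5     # allow +/-0.5 Hz around nominal frequency


@dataclass
class Reading:
    """One sample of the utility feed as seen by the switch."""
    volts: float
    hertz: float


def utility_ok(reading: Reading) -> bool:
    """Check voltage and frequency, not just whether power is present."""
    volts_ok = abs(reading.volts - NOMINAL_VOLTS) <= NOMINAL_VOLTS * VOLT_TOLERANCE
    hertz_ok = abs(reading.hertz - NOMINAL_HZ) <= HZ_TOLERANCE
    return volts_ok and hertz_ok


def ats_action(reading: Reading, generator_ready: bool) -> str:
    """Return the next step for the switch given the current utility sample."""
    if utility_ok(reading):
        return "stay_on_utility"
    if not generator_ready:
        # Utility is out of range: start backup generation first.
        return "start_generator"
    # Generator is up to speed: move the load onto it.
    return "transfer_load_to_generator"
```

In this simplified model a frequency sag with healthy voltage triggers the same sequence as a total loss of power, which is the point Brown makes about the switch looking for problems in multiple ways.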
In recent years standard ATS components have been replaced with customized switchgear, Brown told the audience. The switchgear presents the profile of an ATS for all the other power components downstream. Such custom parts give maintenance technicians a much greater number of controls and along with them an exponentially larger number of transactions between components that must always be optimally maintained. Everything must go off without a hitch at those moments where, by definition, there’s a hitch. What’s more, said Brown, these events must be synchronized.
“If a transfer switch fails,” he said, “it’s going to have the same impact as engine generators failing. Then, you just don’t have power to the facility.”
To reduce construction costs, Brown explained further, architects are advising facilities operators to install so-called distributed redundant systems (DRS). That sounds like a beefy, premium option, especially the way it’s presented in the literature. In practice, it’s the deployment of as few as two independent UPS arrays, each of which is capable of delivering the entire load to facilities, not just part. Because any single array is susceptible to failure (again, not partly but entirely), static ATS components are often put in place instead of rotary to connect a backup array to the primary power stream quickly.
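The redundancy claim behind such a design can be expressed as a simple check: the IT load must still be covered after any one UPS array is lost. The sketch below is a minimal illustration of that arithmetic; the function name `survives_single_failure` and the capacities in the usage lines are hypothetical and not drawn from Uptime's data.

```python
def survives_single_failure(array_capacities_kw, it_load_kw):
    """Return True if the load is still covered after losing any one array."""
    if len(array_capacities_kw) < 2:
        # A single array has nothing to fall back on.
        return False
    total_kw = sum(array_capacities_kw)
    # Remove each array in turn and check the remaining capacity.
    return all(total_kw - lost >= it_load_kw for lost in array_capacities_kw)


# Hypothetical figures: two arrays that can each carry the full 800 kW load.
print(survives_single_failure([1000, 1000], 800))  # True
# Hypothetical figures: the smaller array cannot carry the load alone.
print(survives_single_failure([1000, 500], 800))   # False
```

The check says nothing about how quickly the load moves between arrays, which is where the transfer switch comes in.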
“What we’re seeing over time is, the amount of static transfer switches used in data centers are increasing to help improve availability. One concern I do have is, they still account for 22 percent of the power-related outages on this list,” continued Brown, indicating his survey data. “The fact that they are such a fundamental part... and an increasingly growing part of those electrical distribution systems – and that we’re having quite a few outages – concerns me. Because if we’re having problems with static transfer switches, then the IT equipment’s going to face interruptions of power more often.”
Soft Causes of Hard Data Center Outages
When the cause of power failure is located off-site, survey respondents told Uptime, it’s twice as likely to be related to a software configuration problem (in a carrier’s network, for example) as to a capacity or overload event at the utility company.
“As with all types of software, we’re seeing more configuration issues,” remarked Rhonda Ascierto, Uptime’s VP of research. “It’s not that the IT hardware is failing. It’s problems with the way systems are communicating and the way they’re being configured.”
Ascierto cited a routing configuration snafu in June 2020 by a fiber optic service provider, the cascade effect of which was an IP traffic storm on the T-Mobile data network, affecting customers nationwide. Some customers were locked out of placing 911 calls.
“Just the nature of networking means that failures can impact really, really large numbers of people,” said Ascierto. “Those impacts can be really severe, because we rely on those communications for emergency services.”
Uptime’s new report, as with prior editions, cited human error as a continuing cause of service outages. Could the lessons of the Year of COVID serve as evidence, we asked the Uptime team, that more and better automation in power systems could reduce the likelihood of outages caused by human error, and thus improve reliability and quality of service?
“I think we could see fewer outages as we move to automation,” Chris Brown told DCK. “Automation systems were used, at least historically, to reposition and re-coordinate electrical and cooling distribution systems and pathways. That took a lot of the decision-making out of the hands of humans... The big challenge with automation, though, is that computer systems can only make the decisions with the proper information coming in and the proper decision-making processes programmed into them. I think that, early on, automation is going to help some, but it may also cause some problems. But as we refine automation and gain back some of those skills we lost over the years... I think automation can, and should, start to reduce the number of outages [that are] just due to human error.”
“I think it’s worth noting, too,” added Ascierto, “that to date, most of the development and investment we’ve seen in newer areas of facility automation have been focused on efficiency rather than risk reduction. I think we’re probably quite a few years away yet from automation having an impact in terms of lowering risk and having fewer people in data centers.”
But counter-intuitively, Ascierto added, it may not necessarily follow that reduction of human error is accomplished with fewer people. She cited the increased use of so-called “remote hands” services by colocation facilities, where more competent and experienced IT and facilities professionals carry out on-site work on behalf of tenants. “You could argue that there were more professionalized, very well-trained folks doing work in colocation data centers,” she told us, “than a year ago, when customers had to meet those requirements [themselves], without being fully prepared.”