Alibaba Group’s Hong Kong availability zone experienced an outage on Sunday at 7 a.m. local time, affecting users who depend on the firm’s cloud services Elastic Compute Service (ECS) and PolarDB. The story was first reported by the South China Morning Post, which is owned by Alibaba.
Outages affecting availability zones of the world’s largest cloud service providers (CSPs) spanned the globe this year, most notably at AWS just two weeks ago. Amazon didn’t reveal the cause of that outage and resolved the issue in a matter of hours. Alibaba, by contrast, was transparent throughout its outage and disclosed the cause: a malfunctioning refrigeration unit.
This led us to wonder just how common it is for a cooling unit to cause a data center outage. We reached out to Christopher Brown, chief technical officer at the Uptime Institute, for insights.
“It is not common for cooling units to cause outages,” said Brown. “The 2022 Uptime Global Data Center Survey found that cooling was the primary cause of outages or impactful incidents in only 13-14% of cases.”
Power, unsurprisingly, accounts for 43% of outages, whereas cooling issues cause 14%, according to the data center professionals Uptime surveyed for its 2022 Global Data Center Survey. But among publicly reported outages, such as those affecting CSPs, cooling accounted for only 3%, Uptime found.
While cooling-caused outages aren’t common, their impact, like that of data center fires, can be far-reaching. Over the 24 hours Alibaba was down, customers’ losses may have reached into the hundreds of millions of dollars, if not the billions. Here’s a breakdown:
Data center outage math
The Uptime Institute reports that outages cost most enterprises (60%) $100,000. Spread Alibaba’s 24 hours of downtime across every firm depending on that availability zone, and the damages reach into the millions. For 15% of enterprises, the cost of an outage is at least $1 million, according to the institute’s recent findings.
In July 2020, Alibaba said 38% of the Fortune 500 use its cloud services. That’s 190 enterprises. To allow for growth over the last two years, let’s round that up to 300 firms.
Three hundred enterprises, each hit by the outage at $100,000 per incident, add up to $30,000,000. We can be confident Alibaba had more than 300 customers accessing its cloud services at 7 a.m. local time. For its part, Alibaba pledged to compensate customers “based on its product or service agreements with the relevant customers,” according to the South China Morning Post.
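For readers who want to rerun the numbers with their own assumptions, the back-of-envelope math above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are ours, the 300-firm figure is the rounded-up guess from this article rather than a measured count, and the $100,000 figure is the per-incident cost attributed to the Uptime Institute.

```python
# Back-of-envelope outage cost using only the figures quoted in the article.
# All names below are illustrative; none come from Alibaba or Uptime.

FORTUNE_500 = 500
ALIBABA_SHARE = 0.38           # share of Fortune 500 using Alibaba Cloud (July 2020)
ASSUMED_AFFECTED_FIRMS = 300   # the article's rounded-up guess, not a measured count
COST_PER_INCIDENT = 100_000    # USD per outage, as cited from the Uptime Institute


def fortune_500_customers(total: int, share: float) -> int:
    """Number of firms implied by a percentage share of a list."""
    return round(total * share)


def estimated_loss(firms: int, cost_per_incident: int) -> int:
    """Total loss if every firm absorbs one incident at the same cost."""
    return firms * cost_per_incident


if __name__ == "__main__":
    print(fortune_500_customers(FORTUNE_500, ALIBABA_SHARE))          # 190
    print(estimated_loss(ASSUMED_AFFECTED_FIRMS, COST_PER_INCIDENT))  # 30000000
```

Swapping in a larger customer count, or the $1 million figure that applies to 15% of enterprises, shows how quickly the total climbs.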
Watch your cloud … service providers
Enterprises simply aren’t trusting public cloud companies with critical workloads, found the Uptime Institute. Thirty-two percent of survey respondents would trust only some of their workloads to the public cloud, and 14% don’t trust the public cloud with critical loads at all. Those that do trust the public cloud ensure they have sufficient redundancies to weather the storm of outages similar to those Alibaba and AWS experienced this month.
"In regard to the implementation of better protection from challenges such as [outages caused by cooling equipment malfunctions], implementing additional available redundancy into the system is the best way to reduce unplanned events such as this one. This would be best served with a Tier IV Certification with a requirement for Fault Tolerance," advised Brown.
Updated on Dec. 20, 2022 at 3:08 p.m. EST to include comments from Christopher Brown, chief technical officer, Uptime Institute.