Scott Guthrie, executive VP of Cloud and Enterprise at Microsoft (Photo: Stephen Lam/Getty Images)

Azure Outage Proves the Hard Way That Availability Zones Are a Good Idea

A lightning strike exposes a glaring hole in Azure cloud’s availability strategy.

Microsoft Azure began patching a glaring hole in its availability strategy last September, when it started previewing in-region Availability Zones. Its rivals Amazon Web Services and Google Cloud had already been using the multi-zone strategy for some time.

Earlier this month – almost exactly one year later – came the strongest possible confirmation that multi-zone cloud regions are a good idea. Lightning during a powerful storm caused a voltage swell in the utility feeds powering one of the Azure data centers in San Antonio, Texas. The swell overwhelmed the facility’s surge suppressors, knocking out its cooling systems. The “load-dependent thermal buffer” baked into the cooling system for such occasions was eventually depleted; air temperature inside rose, triggering an automatic shutdown of hardware.

“This shutdown mechanism is intended to preserve infrastructure and data integrity, but in this instance, temperatures increased so quickly in parts of the data center that some hardware was damaged before it could shut down,” an incident postmortem on the Azure Status History page read. “A significant number of storage servers were damaged, as well as a small number of network devices and power units.”

‘It Was a Painful Experience’

The outage affected close to 40 Azure services hosted in the South Central US cloud availability region (which consists of multiple data centers), a few Azure services in other regions, and Office 365 services such as Exchange, SharePoint, and Teams. The affected services either relied directly on the failed storage systems or depended on services that did.

The director of engineering for one of the customers – Microsoft’s own Visual Studio Team Services, or VSTS, the application lifecycle management suite that was rebranded this week as Azure DevOps – wrote a postmortem of his own, apologizing for letting VSTS customers down and describing the incident as “unprecedented for us.” It was the longest outage in VSTS history, starting at 2:45 am Pacific on September 4 and ending at 5:05 pm on September 5.

“I've talked to customers through Twitter, email, and by phone whose teams lost a day or more of productivity. We let our customers down,” the director, Buck Hodges, wrote. “It was a painful experience, and for that I apologize.”

The Problem With Long Distances

Before it introduced its own Availability Zones – which are currently supported in three Azure regions and in preview in two more (South Central US not being one of them) – Azure’s answer to its competitors’ multi-zone strategies had been automatic SQL database backup and storage replication. Users’ SQL databases are automatically backed up to a region different from the original one, and so is the data associated with their Azure Storage accounts.

VSTS relies on both of those features but, as Hodges pointed out in his post, they don’t really provide seamless failover. If you don’t want to wait until full recovery to access a backup copy of your data, Azure Storage only allows access to it in read-only mode. That would cause degradation of some critical VSTS services “to the point of not being usable,” he wrote. Failing over to backed-up databases “would have resulted in data loss due to the latency of the backups.”
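To make that failure mode concrete, here is a minimal, purely illustrative sketch – not Azure’s actual API – of what a read-only secondary means for a service that has to accept writes:

```python
# Illustrative sketch only -- not Azure's actual API. It models why a
# read-only secondary is of limited use to a write-heavy service:
# reads keep working, but every write fails until the primary recovers.

class ReadOnlyReplicaError(Exception):
    """Raised when a write is attempted against a read-only secondary."""

class StorageReplica:
    def __init__(self, data, read_only):
        self.data = dict(data)
        self.read_only = read_only

    def get(self, key):
        return self.data[key]

    def put(self, key, value):
        if self.read_only:
            raise ReadOnlyReplicaError("secondary accepts reads only")
        self.data[key] = value

# Primary is down; the service fails over to the geo-replicated secondary.
secondary = StorageReplica({"work_item/42": "open"}, read_only=True)

print(secondary.get("work_item/42"))         # reads still work: "open"
try:
    secondary.put("work_item/42", "closed")  # any state change fails
except ReadOnlyReplicaError as err:
    print(f"write rejected: {err}")          # service effectively degraded
```

For a system like VSTS, where nearly every user action – pushing code, queuing a build, updating a work item – is a write, a read-only copy keeps the lights on without keeping the service usable.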

Replicating data synchronously across cloud availability regions, such as across South Central US and North Central US (located in Illinois), also isn’t an option for VSTS. “Even at the speed of light, it takes time for the data to reach the other data center and for the original data center to receive the response,” Hodges wrote. The roundtrip for each write would add 70 milliseconds of latency. “For some of our key services, that’s too long.”
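A rough back-of-the-envelope calculation shows why. The route length and fiber overhead below are illustrative assumptions, not Microsoft’s measurements, but they put the physical floor for a cross-region round trip in the tens of milliseconds before any protocol overhead is added:

```python
# Back-of-the-envelope latency floor for synchronous cross-region writes.
# The distance and fiber-route overhead are assumptions for illustration.

GREAT_CIRCLE_KM = 1_700                   # San Antonio <-> Chicago area, roughly
FIBER_ROUTE_KM = GREAT_CIRCLE_KM * 1.4    # real fiber paths are not straight lines
LIGHT_IN_FIBER_KM_PER_MS = 200            # ~2/3 the speed of light in vacuum

one_way_ms = FIBER_ROUTE_KM / LIGHT_IN_FIBER_KM_PER_MS
round_trip_ms = 2 * one_way_ms
print(f"physical floor per write: ~{round_trip_ms:.0f} ms round trip")
# ~24 ms is the hard physical floor; routing hops, serialization, and
# storage-level acknowledgements push the observed figure toward 70 ms.
```

No amount of software optimization gets under that floor, which is why the fix has to be architectural: put the replicas closer together.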

Microsoft has also launched a multi-zone feature just for storage replication – zone-redundant storage – which is now available in more regions (eight, according to Microsoft’s documentation) than the broader Availability Zones feature, but not yet in South Central US.

Going forward, the VSTS team (or the Azure DevOps team, as it’s now called) is building its resiliency strategy around Availability Zones. Wherever possible, it plans to move services into regions that already have Availability Zones, while also exploring asynchronous replication across regions.

Because Availability Zones within a single region are much closer to each other than separate regions are, the low-latency, high-bandwidth network links between them can enable synchronous replication, providing the kind of application resiliency that could keep services running even if a lightning strike took out an entire data center, Hodges wrote. “Availability Zones would enable VSTS services in a region to continue to be available so long as the entire region does not become unavailable.”
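Here is a minimal sketch of the idea, assuming three hypothetical in-region zone replicas – conceptual only, not the Azure DevOps implementation. A write is acknowledged only after every zone has applied it, which is affordable when an inter-zone round trip costs a millisecond or two rather than 70:

```python
# Conceptual sketch of synchronous replication across availability zones.
# Zone names and replica behavior are hypothetical, for illustration only.

from concurrent.futures import ThreadPoolExecutor

class ZoneReplica:
    def __init__(self, zone_name):
        self.zone_name = zone_name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value    # stand-in for a durable, in-zone write
        return self.zone_name     # acknowledgement back to the writer

def synchronous_write(replicas, key, value):
    # Fan the write out to all zones and block until every one confirms.
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        acks = list(pool.map(lambda r: r.apply(key, value), replicas))
    return acks  # caller sees success only after all zones acknowledged

zones = [ZoneReplica(z) for z in ("zone-1", "zone-2", "zone-3")]
print(synchronous_write(zones, "build/123", "succeeded"))
# If zone-1's data center is struck by lightning, zone-2 and zone-3
# already hold the committed write and can keep serving it.
```

The trade-off is that every write pays the small inter-zone round trip; in exchange, losing any single data center costs no committed data and no availability.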

A Big Catchup Job

Only one AWS region (Osaka) has a single availability zone. Most other Amazon cloud regions have three, a handful have two, and one (Northern Virginia) has six. All Google Cloud regions have three availability zones except Iowa, which has four.

It’s undoubtedly not lost on Azure executives that the number-two cloud provider in the market has a lot of catching up to do in this area. While the company managed to outpace its biggest rivals in adding availability regions around the world, it now has to either build more data centers in all those regions or rearchitect the existing ones to give all its customers the multi-zone option.
