A weekly column on innovation in data center infrastructure design and management. More about the column and the author here.
Microsoft recently announced that it was adding availability zones to the existing regional setup for its Azure cloud service. Availability zones are essentially multiple data centers within a single cloud availability region.
“Availability Zones are fault-isolated locations within an Azure region, providing redundant power, cooling, and networking,” Tom Keane, head of global infrastructure for Azure, wrote in a blog post. “Availability Zones allow customers to run mission-critical applications with higher availability and fault tolerance to data center failures.”
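The idea behind availability zones can be illustrated with a minimal sketch: spread an application's replicas across zones so that the loss of any one zone still leaves a majority running. The zone names and the round-robin placement policy below are purely illustrative, not Azure's actual placement algorithm.

```python
from itertools import cycle

def place_replicas(replica_count, zones):
    """Assign each replica to a zone in round-robin order."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]

def survives_zone_failure(placement, failed_zone):
    """True if a majority of replicas remain after one zone fails."""
    survivors = [z for z in placement if z != failed_zone]
    return len(survivors) > len(placement) // 2

# Three replicas spread across three hypothetical zones: the failure of
# any single zone leaves two of three replicas, so the service survives.
zones = ["zone-1", "zone-2", "zone-3"]
placement = place_replicas(3, zones)
```

Because each zone has independent power, cooling, and networking, a fault that takes down one zone is unlikely to affect the replicas placed in the others.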
Rivals such as Amazon Web Services, Google Cloud Platform, and Oracle Cloud already offer similar availability zones, so it could be argued that Microsoft was merely following the crowd with its announcement. But while the approach may not be that innovative in the world of public cloud service providers, it is indicative of wider industry advances in so-called distributed data center resiliency.
There is a view among some data center industry insiders that, at a very basic level, software and networks will take on an even greater role when it comes to ensuring service availability. That could have significant long-term implications for how future data centers are designed and managed, with some types of site requiring less redundant power and cooling equipment.
But, as with most things to do with data centers, the reality is more nuanced and complex. Most large operators – the sensible ones, at least – already build some form of resiliency into the IT and network layer. Even with this IT resiliency in place, there is often still a requirement for redundant power, cooling, and overall physical designs that conform to Uptime Institute’s Tier III specification or equivalent. This physical resiliency obviously comes at a price: up to $18 million per MW of IT load. When the cost of resilient IT architectures is added to that of redundant facilities infrastructure – including active backup facilities, for example – the overall bill is greater still.
Still, while many operators already employ some form of IT and network-based resiliency, there is scope to take existing strategies to the next level, according to some experts. For example, a number of public cloud providers already employ advanced resiliency technologies that have been defined by organizations like Uptime as “cloud-based resiliency.”
According to Uptime, data center resiliency can be broken into four main types (the organization says these are more of a spectrum than discrete categories):
- Traditional single-site availability - One site with resilient facilities and IT
- Linked-site resilience - Two or more sites that are tightly connected
- Distributed site resilience - Two or more active sites using shared networks (IBM does this, for example)
- Cloud-based resilience - Based on the use of distributed, virtualized applications, instances, or containers using middleware, orchestration, and distributed databases across multiple data centers
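The fourth model on that spectrum can be sketched in a few lines: an orchestration layer tracks application instances across multiple data centers and routes work only to sites reporting healthy, so a whole-site failure degrades capacity rather than availability. The class and site names below are hypothetical, a simplified illustration rather than any vendor's actual orchestration logic.

```python
class Site:
    """A data center running instances of a distributed application."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

class Orchestrator:
    """Routes requests only to sites currently reporting healthy."""
    def __init__(self, sites):
        self.sites = sites

    def available_sites(self):
        return [s for s in self.sites if s.healthy]

    def dispatch(self, request):
        targets = self.available_sites()
        if not targets:
            raise RuntimeError("no healthy sites: service unavailable")
        return f"{request} -> {targets[0].name}"

sites = [Site("dc-east"), Site("dc-west"), Site("dc-central")]
orch = Orchestrator(sites)
sites[0].healthy = False  # simulate a whole-data-center failure
# Subsequent requests are served from the remaining healthy sites.
```

In practice this depends on the middleware, orchestration, and distributed databases Uptime describes, plus enough inter-site bandwidth to keep state synchronized – which is why the approach is currently limited to the largest providers.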
Speaking recently on the topic, Andy Lawrence, executive director of Uptime Institute Research, said the move towards advanced distributed resiliency could have major implications for the industry. “The choice of which system to use is based on business need,” he said. “However, cloud-based resiliency is probably the most effective but requires heavy IT investment, many sites, and a lot of bandwidth, so is only open to a handful of suppliers. Google is already quite advanced, and AWS and IBM are doing elements of this.”
Advanced cloud-level resiliency probably won’t be economical for enterprises for perhaps another five to ten years, unless the organization in question already has a highly evolved hybrid data center strategy, according to Uptime. The specific costs and benefits of distributed resiliency are directly related to the type of facility and the business it serves. “It’s a bit like asking how much 5MW of data center capacity costs – it depends on what kind of data center you want to build,” said Lawrence. “You may be able to reduce some of your physical infrastructure, but you will be paying for it in terms of redundant IT, software engineering, and staff.”
Some forms of advanced distributed resiliency do offer a compelling return on investment, or ROI, compared with traditional designs, but operators still usually build Tier III sites or equivalent; the efficiencies come from not having to spend on disaster recovery and from utilizing all available data center capacity. In certain circumstances – edge facilities, for example – there could be more opportunity for capital cost savings by building data centers with less physical redundancy. “There will be situations at the edge where it makes more sense to use software-based resiliency. Facebook is known for doing this with less physical resiliency, but I see it more at the edge than the core,” said Lawrence.
In the short term at least, it doesn’t seem likely that many enterprise or colocation operators will be able to use advanced distributed resiliency to eliminate generators and UPSs or to drop a Tier level. However, as more workloads migrate to cloud and edge data centers, a greater proportion of IT work will inevitably be executed in sites that can take advantage of these emerging and disruptive technologies.