Microsoft’s Journey: Solving Cloud Reliability With Software


David Gauthier is director of data center architecture for Microsoft Global Foundation Services. He is responsible for the technical direction and reliability of the integrated infrastructure architectures of Microsoft’s global data center footprint. 

Traditional vs. Cloud-Scale

As people around the world entrust ever-increasing amounts of their digital lives to the cloud, the need for online services to operate continuously, regardless of physical failures, becomes imperative. As my colleague David Bills discussed in the first installment of this series, cloud service providers need to move beyond the traditional reliance on complex hardware redundancy schemes and instead focus on developing more intelligent software that can monitor, anticipate, and efficiently manage the failure of physical infrastructure. When service availability is engineered into more resilient software, there is greater opportunity to materially rethink how the physical data center is engineered.

Until 2008, like much of the industry, Microsoft followed a traditional enterprise IT approach to data center design and operation: delivering highly available hardware through multiple levels of redundancy. This allowed software developers to count on that hardware always being available, or so nearly always that active redundant copies of data were thought of only as a disaster recovery measure. By relying on hardware availability, we enabled the development of brittle software. While we had our fair share of hardware failures and human errors under this early model, we still successfully delivered highly available services. However, as we grew significantly to what we term cloud-scale, we quickly saw that the level of investment and complexity required to stay this course would become untenable. We also recognized that brittle software could cause much more significant outages than hardware failures.

Compile Code, Compile Availability

The cloud-scale data centers we operate today still require a lot of hardware, but software has become the key driver of service availability. In some cases, it significantly reduces the need for physical redundancy. By solving the availability equation in software, we can look at every aspect of the physical environment, from the central processing unit (CPU) to the building itself, as a systems integration and optimization exercise. Through software development tools and workload placement engines, we can ‘compile’ a data center availability solution much faster than we can install physically redundant hardware. This shift in mindset rapidly creates compounding improvements in reliability, scalability, efficiency, and sustainability across our cloud portfolio.

In our data centers, we have embraced the fact that no amount of money will eliminate hardware failures or human errors. As such, service availability must be engineered at the software layer. No matter what happens, the application or service should gracefully fail over to another cluster or data center while maintaining the customer’s experience. These failures are anticipated as regular operating conditions for the service and are not a reason to wake up the CIO at 2 a.m.
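To make the idea of failing over gracefully more concrete, here is a minimal Python sketch. It is an illustration only, not a description of Microsoft's software: the function name `serve_with_failover`, the exception `AllClustersFailed`, and the idea of modeling each cluster as a plain callable are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


class AllClustersFailed(Exception):
    """Raised only when every cluster in the list failed (hypothetical name)."""


def serve_with_failover(request: str,
                        clusters: Sequence[Callable[[str], str]]) -> str:
    """Try each cluster in order and return the first successful response.

    Each cluster is modeled as a callable that either returns a response
    or raises an exception. A failure is treated as a routine operating
    condition: it is recorded and the next cluster is tried.
    """
    errors: List[Exception] = []
    for handler in clusters:
        try:
            return handler(request)
        except Exception as exc:  # any failure simply triggers failover
            errors.append(exc)
    # Only the loss of every cluster surfaces as an error to the caller.
    raise AllClustersFailed(errors)
```

In this sketch, the caller sees an error only if every cluster is unavailable at once; the loss of any single cluster is absorbed without changing the response the customer receives.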

This approach has led us to accept measured risks in our environment and to delete significant portions of our redundant infrastructure. We have been running elevated supply temperatures and have forgone chillers in all but one of our facilities since 2009, resulting in a tremendous reduction in water usage (averaging 1 percent of what traditional data centers use) and significant energy savings (averaging 50 percent). We have also been operating tens of thousands of servers without backup generators since 2009, with no impact on the user experience even though outages have occurred. By optimizing the size of our application clusters against uncorrelated failure domains in the physical world, we’ve implemented a fail-small topology that allows us to compartmentalize the impact of failures and maintenance.
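One way to picture a fail-small topology is as a placement problem: spread the copies of a service across physical failure domains that do not fail together, so that losing any one domain removes only a small, known fraction of capacity. The Python sketch below illustrates that idea under simplifying assumptions. The function names, the server names, and the notion that each server belongs to exactly one failure domain (for example, one power feed) are hypothetical and are not taken from Microsoft's placement engines.

```python
from collections import defaultdict
from typing import Dict, List


def place_replicas(servers: Dict[str, str], replicas: int) -> List[str]:
    """Pick `replicas` servers so that no two share a failure domain.

    `servers` maps a server name to the id of its failure domain.
    Servers are considered in name order to keep the result deterministic.
    """
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    chosen: List[str] = []
    used_domains = set()
    for name, domain in sorted(servers.items()):
        if domain in used_domains:
            continue  # a second copy here would fail together with the first
        chosen.append(name)
        used_domains.add(domain)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough independent failure domains")


def worst_case_loss(placement: List[str], servers: Dict[str, str]) -> float:
    """Fraction of replicas lost if the single worst failure domain goes down."""
    per_domain: Dict[str, int] = defaultdict(int)
    for name in placement:
        per_domain[servers[name]] += 1
    return max(per_domain.values()) / len(placement)
```

For example, with four servers where two share a power feed, asking for three replicas skips the second server on the shared feed, and the worst single-domain failure then removes one replica out of three.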
