Designing For Dependability In The Cloud


David Bills is Microsoft’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs.


This article kicks off a three-part series on designing for dependability. Today I will provide context for the series and outline the challenges facing all cloud service providers as they strive to deliver highly available services. In the second article, David Gauthier, director of data center architecture at Microsoft, will discuss the journey Microsoft is on in our own data centers, and how software resiliency has become increasingly critical in the move to cloud-scale data centers. In the final piece, I will discuss the cultural shift and evolving engineering principles that Microsoft is pursuing to help improve the dependability of the services we offer.

Matching the Reliability to the Demand

As the adoption of cloud computing continues to grow, expectations for utility-grade service availability remain high. Consumers demand access to their digital lives 24 hours a day, seven days a week, and outages can significantly damage a company's financial health or brand equity. But the complex nature of cloud computing means that cloud service providers — whether they sell infrastructure as a service (IaaS), platform as a service (PaaS), or software as a service (SaaS) — must accept that things will go wrong. It's not a question of "if," but strictly a matter of "when." As cloud service providers, we therefore need to design our services to maximize reliability and minimize the impact on customers when things do go wrong. Providers need to move beyond the traditional premise of relying on complex physical infrastructure to build redundancy into their cloud services, and instead combine less complex physical infrastructure with more intelligent software that builds resiliency into those services and delivers high availability to customers.

The reliability-related challenges that we face today are not dramatically different from those that we’ve faced in years past, such as unexpected hardware failures, power outages, software bugs, failed deployments, people making mistakes, and so on. Indeed, outages continue to occur across the board, reflecting not only on the company involved, but also on the industry as a whole.

In effect, the industry is dealing with fragile (sometimes called brittle) software. Software continues to be designed, built, and operated on what we believe is a fundamentally flawed assumption: that failure can be avoided by rigorously applying well-known architectural principles during design, testing the system extensively while it is being built, and relying on layers of redundant infrastructure and replicated copies of the system's data. Mounting evidence further invalidates this assumption: articles regularly describe failures of heavily relied-upon online services, and service providers routinely explain what went wrong, why it went wrong, and what steps they have taken to avoid repeat occurrences. The media continues to report failures despite the tremendous investment cloud service providers keep making in the very practices noted above.

Resiliency and Reliability

If we assume that all cloud service providers are striving to deliver a reliable experience for their customers, then we need to step back and ask what really constitutes a reliable cloud service. Essentially, it is a service that functions as its designer intended, functions when it's expected to, and works from wherever the customer is connecting. That is not to say every component making up the service must operate flawlessly 100 percent of the time. This last point is why we need to understand the difference between reliability and resiliency.

Reliability is the outcome that cloud service providers strive for. Resiliency is the ability of a cloud-based service to withstand certain types of failure and yet remain fully functional from the customer's perspective. A service could be characterized as reliable simply because no part of it (for example, the infrastructure or the software supporting the service) has ever failed — and yet not be regarded as resilient, because its design completely ignores the notion of a "Black Swan" event: something rare and unpredictable that significantly affects the functionality or availability of one or more of the company's online services. A resilient service assumes that failures will happen, and for that reason it is designed and built to detect failures when they occur, isolate them, and then recover from them in a way that minimizes impact on customers. Put differently, a resilient service will, over time, come to be viewed as reliable because of how it copes with known failure points and failure modes.
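The detect-isolate-recover cycle described above is commonly implemented with a circuit-breaker pattern. The sketch below is illustrative only — the class and parameter names are my own, not anything described in this article — and shows one minimal way a service might trip to a fallback after repeated dependency failures, then probe the dependency again after a cool-down:

```python
import time

class CircuitBreaker:
    """Detect repeated failures, isolate the failing dependency,
    and recover by probing it again after a cool-down period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, fallback):
        # Isolate: while the circuit is open, skip the failing dependency
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # Recover: allow a trial request through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # Detect: trip the breaker
            return fallback()
        self.failures = 0  # Success resets the failure count
        return result
```

The key design point, matching the article's framing, is that the customer-facing call always returns something — a cached or degraded result via the fallback — so a component failure never becomes a service outage.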

Changing Our Approach

As an industry, we have traditionally relied heavily on hardware redundancy and data replication to improve the resiliency of cloud-based services. While cloud service providers have had success applying these design principles, and hardware manufacturers have contributed significant advancements in these areas as well, we cannot become overly reliant on them as the path to a reliable cloud-based service.

It takes more than just hardware-level redundancy and multiple copies of data sets to deliver reliable cloud-based services — we need to factor resiliency in at all levels and across all components of the service.
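One simple example of building resiliency in at the software level — rather than leaning solely on redundant hardware — is retrying transient failures with exponential backoff and jitter, so that many clients recovering at once do not hammer a struggling dependency in lockstep. This is a generic sketch of that well-known technique, not Microsoft's implementation; all names are illustrative:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry a transient failure with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Retries exhausted: surface the failure to the caller
            # Full jitter: sleep a random fraction of the exponential backoff,
            # spreading retries out so recovering clients don't synchronize
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)
```

A retry loop like this handles the brief, self-healing failures (a dropped connection, a rebooting node) in software, reserving hardware redundancy for the failures that genuinely require it.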

That’s why, at Microsoft, we’re changing the way we build and deploy services intended to operate at cloud scale. We’re moving toward less complex physical infrastructure and more intelligent software to build resiliency into cloud-based services and deliver highly available experiences to our customers. We are focused on creating an operating environment that is more resilient and enables individuals and organizations to better protect information.

In the next article of this series, David Gauthier, director of data center architecture at Microsoft, discusses the journey that Microsoft is making with our own data centers. This shift underscores how important software-based resiliency has become in the move to cloud-scale data centers.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.
