Designing for Dependability in the Cloud

David Bills is Microsoft’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs.


This article builds on the previously published articles in this series, “Designing for Dependability in the Cloud” and “Microsoft’s Journey: Solving Cloud Reliability With Software.” In part three, I discuss the cultural shift and evolving engineering principles Microsoft is using to help improve the dependability of the services we offer and help customers realize the full potential of the cloud.

From the customer’s perspective, cloud services should just work. But, as we’ve discussed throughout this series, service interruption is inevitable — it’s not a matter of if, it’s strictly a matter of when. No matter how expertly online services are designed and built, unexpected events can — and will — occur. The differentiator is in how service providers anticipate, contain, and recover from these kinds of situations. We need to protect the customer experience from these inevitabilities.

Guiding Design Principles

There are three guiding design principles for cloud services: 1) data integrity, 2) fault tolerance, and 3) rapid recovery. These are three attributes that customers expect, at a minimum, from their service. Data integrity means preserving the fidelity of the information that customers have entrusted to a service. Fault tolerance is the ability of a service to detect failures and automatically take corrective measures so the service is not interrupted. Rapid recovery is the ability to restore service quickly and completely when a previously unanticipated failure occurs.
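To make the fault tolerance definition concrete, here is a minimal sketch of one common corrective measure: failing over to a redundant replica when a call fails. It is an illustration under stated assumptions, not a description of how any Microsoft service is built. The names `call_with_failover`, `AllReplicasFailed`, `primary`, and `secondary` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request (hypothetical name)."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result.

    A failure is detected as a raised exception; the corrective measure
    is to move on to the next replica so the caller is not interrupted.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # detection: any error counts as a fault
            errors.append(exc)
    # Nothing left to fail over to; this is where rapid recovery takes over.
    raise AllReplicasFailed(errors)


# Hypothetical replicas, for illustration only.
def primary() -> str:
    raise ConnectionError("primary unavailable")


def secondary() -> str:
    return "ok from secondary"


result = call_with_failover([primary, secondary])
```

In this sketch the caller never sees the primary's failure. Only when every replica fails does an error surface, and that is the case the rapid recovery principle addresses.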

As service providers, we have to try to identify as many potential failure conditions as possible in advance, and then account for them during the service design phase. This careful planning helps us decide exactly how the service is supposed to react to unexpected challenges. The service has to be able to recover from these failure conditions with minimal interruption. Though we can’t predict every failure point or every failure mode, with foresight, business continuity planning, and a lot of practice, we can put a process in place to prepare for the unexpected.

Cloud computing can be characterized as a complex ecosystem consisting of shared infrastructure and loosely coupled dependencies, many of which will be outside the provider’s direct control. Traditionally, many enterprises maintained on-premises computing environments, giving them direct control over their applications, infrastructure, and associated services. However, as the use of cloud computing continues to grow, many enterprises are choosing to relinquish some of that control to reduce costs, take advantage of resource elasticity (for example, compute, storage, networking), facilitate business agility, and make more effective use of their IT resources.

Understanding the Team’s Roles

From the service engineering teams’ perspective, designing and building services (as opposed to box products, or on-premises solutions) means expanding the scope of their responsibility. When designing on-premises solutions, the engineering team designs and builds the service, tests it, packages it up, and then releases it along with recommendations describing the computing environment in which the software should operate. In contrast, services teams design and build the service, and then test, deploy, and monitor it to ensure the service keeps running and, if there’s an incident, ensure it is resolved quickly. And the services teams frequently do this with far less control over the computing environment the service is running in!

Using Failure Mode and Effects Analysis

Many services teams employ fault modeling (FMA) and root cause analysis (RCA) to help them improve the reliability of their services and to help prevent faults from recurring. It’s my opinion that these are necessary but insufficient. Instead, the design team should adopt failure mode and effects analysis (FMEA) to help ensure a more effective outcome.

FMA refers to a repeatable design process that is intended to identify and mitigate faults in the service design. RCA consists of identifying the factors that resulted in the nature, magnitude, location, and timing of harmful outcomes. The primary benefits of FMEA, a holistic, end-to-end methodology, include the comprehensive mapping of failure points and failure modes, which results in a prioritized list of engineering investments to mitigate known failures.

FMEA uses systematic techniques developed by reliability engineers to study problems that might arise from malfunctions of complex systems. Each possible problem is assessed for its severity, its frequency of occurrence, and how readily it can be detected. The engineering investment required to cope with these malfunctions is then prioritized according to the risk each one represents.
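The prioritization described above can be sketched in a few lines. The article does not prescribe a scoring scheme, so this example assumes the conventional FMEA approach: rate severity, occurrence, and detection on a 1 to 10 scale and multiply them into a Risk Priority Number (RPN). The failure modes and their ratings are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = rare, 10 = very frequent
    detection: int   # 1 = almost always detected, 10 = nearly undetectable

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 1..10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes and ratings, for illustration only.
modes = [
    FailureMode("storage replica loss", severity=9, occurrence=3, detection=2),
    FailureMode("expired TLS certificate", severity=7, occurrence=4, detection=6),
    FailureMode("dependency timeout", severity=5, occurrence=8, detection=3),
]
ranked = prioritize(modes)
```

With these ratings, the hard-to-detect certificate expiry ranks first even though replica loss is more severe. That is the kind of result the detection dimension is meant to surface.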

The FMEA process has five key steps.


Figure 1.0 FMEA key steps

One Comment

  1. Satish Mehta

    Hi David: Great article! Good to see this contribution from you! I would add the following two to the guiding design principles: capacity provisioning and failover transparency. Users should be totally immune to capacity provisioning on the cloud. If users have to be heckled and hassled over capacity or the lack of it, the cloud fails in one of its most important attributes. This also means that the cloud provider has to stay about 30% ahead on capacity; the lead time to provision hardware determines this percentage. Users should also be kept somewhat removed from the travails of BCP events between different clouds. Users should not have to know the difference between cloud A and cloud A' (the BCP of A). The reasons for BCP events could range from planned to unplanned outages on the cloud. Please continue to educate us with thought-provoking articles. Best, Satish