David Bills is Microsoft’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs.
This article builds on the previously published articles in this series, "Designing for Dependability in the Cloud" and "Microsoft’s Journey: Solving Cloud Reliability With Software." In part three, I discuss the cultural shift and evolving engineering principles Microsoft is using to help improve the dependability of the services we offer and help customers realize the full potential of the cloud.
From the customer’s perspective, cloud services should just work. But, as we’ve discussed throughout this series, service interruption is inevitable — it’s not a matter of if, it’s strictly a matter of when. No matter how expertly online services are designed and built, unexpected events can — and will — occur. The differentiator is in how service providers anticipate, contain, and recover from these kinds of situations. We need to protect the customer experience from these inevitabilities.
Guiding Design Principles
There are three guiding design principles for cloud services: 1) data integrity, 2) fault tolerance, and 3) rapid recovery. These are three attributes that customers expect, at a minimum, from their service. Data integrity means preserving the fidelity of the information that customers have entrusted to a service. Fault tolerance is the ability of a service to detect failures and automatically take corrective measures so the service is not interrupted. Rapid recovery is the ability to restore service quickly and completely when a previously unanticipated failure occurs.
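As a rough illustration of the fault-tolerance principle, the following Python sketch detects a failed call and automatically falls back to another replica. The replica functions and the AllReplicasFailed exception are hypothetical names invented for this example; a real service would also need health checks, timeouts, and backoff.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request (hypothetical name)."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # Failure detected: record it and move on to the next replica.
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical replicas, for illustration only.
def primary() -> str:
    raise ConnectionError("primary unavailable")


def secondary() -> str:
    return "ok from secondary"


print(call_with_failover([primary, secondary]))
```

The caller never sees the primary's failure. It sees an error only when every replica has failed, and that is the point at which rapid recovery takes over.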
As service providers, we have to try to identify as many potential failure conditions as possible in advance, and then account for them during the service design phase. This careful planning helps us decide exactly how the service is supposed to react to unexpected challenges. The service has to be able to recover from these failure conditions with minimal interruption. Though we can’t predict every failure point or every failure mode, with foresight, business continuity planning, and a lot of practice, we can put a process in place to prepare for the unexpected.
Cloud computing can be characterized as a complex ecosystem consisting of shared infrastructure and loosely coupled dependencies, many of which will be outside the provider’s direct control. Traditionally, many enterprises maintained on-premises computing environments, giving them direct control over their applications, infrastructure, and associated services. However, as the use of cloud computing continues to grow, many enterprises are choosing to relinquish some of that control to reduce costs, take advantage of resource elasticity (for example, compute, storage, networking), facilitate business agility, and make more effective use of their IT resources.
Understanding the Team's Roles
From the service engineering teams’ perspective, designing and building services (as opposed to box products, or on-premises solutions) means expanding the scope of their responsibility. When designing on-premises solutions, the engineering team designs and builds the service, tests it, packages it up, and then releases it along with recommendations describing the computing environment in which the software should operate. In contrast, services teams design and build the service, and then test, deploy, and monitor it to ensure the service keeps running and, if there’s an incident, ensure it is resolved quickly. And the services teams frequently do this with far less control over the computing environment the service is running in!
Using Failure Mode and Effects Analysis
Many services teams employ fault modeling (FMA) and root cause analysis (RCA) to help them improve the reliability of their services and to help prevent faults from recurring. It’s my opinion that these are necessary but insufficient. Instead, the design team should adopt failure mode and effects analysis (FMEA) to help ensure a more effective outcome.
FMA refers to a repeatable design process that is intended to identify and mitigate faults in the service design. RCA consists of identifying the factors that resulted in the nature, magnitude, location, and timing of harmful outcomes. The primary benefits of FMEA, a holistic, end-to-end methodology, include the comprehensive mapping of failure points and failure modes, which results in a prioritized list of engineering investments to mitigate known failures.
FMEA uses systematic techniques developed by reliability engineers to study problems that might arise from the malfunctions of (complex) systems. Each possible malfunction is assessed for the severity of its effects, its frequency of occurrence, and the ability to detect it. Those assessments are then used to prioritize the engineering investment required to cope with each malfunction, based on the risk it represents.
The FMEA process has five key steps.
Figure 1.0 FMEA key steps
Preparation: In this step, it is important to understand the complete ecosystem. Generate a complete logical diagram for the ecosystem, including its components, data sources, and data flows. Using templates to accomplish this improves the overall outcome of the analysis by providing important visual cues about possible failure points, which the design team can then examine in detail.
Interaction discovery: Everything is in scope in this step. Start with the logical diagram previously noted to identify all of the components that are vulnerable to failure. Understand the interactions (the connectors) between all components, and how each component in the complete ecosystem works.
Failure brainstorming: In this step, identify all potential failure modes for each component, including the infrastructure elements and dependencies between all of the elements captured during discovery.
Effect, likelihood analysis: Identify all potential effects in this step for each failure mode, whether benign or catastrophic, and identify the downstream impact (follow cascading impacts beyond your own system).
Prioritization of investment: Typical FMEA templates contain a calculation based on the severity of a given failure, how frequently it happens, and the ability to detect the failure. The resulting value, which is often referred to as a “risk priority number” (RPN), enables the design team to rank the engineering investments needed to address each of the failures captured in the FMEA worksheets.
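The prioritization calculation can be sketched in a few lines of Python. This example assumes one common convention, in which severity, occurrence, and detection are each rated from 1 to 10 and multiplied together; the templates referred to above may use a different scale or formula. The FailureMode class and the worksheet entries are hypothetical and exist only to show the ranking.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FailureMode:
    component: str
    description: str
    severity: int    # assumed scale: 1 (negligible) to 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (rare) to 10 (frequent)
    detection: int   # assumed scale: 1 (easily detected) to 10 (hard to detect)

    @property
    def rpn(self) -> int:
        """Priority value as the product of the three ratings."""
        return self.severity * self.occurrence * self.detection


def rank(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest priority value."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet rows, for illustration only.
worksheet = [
    FailureMode("auth dependency", "token service times out", 6, 5, 2),
    FailureMode("load balancer", "health probe misreports node state", 7, 3, 6),
    FailureMode("storage", "replica falls behind primary", 9, 2, 4),
]

for mode in rank(worksheet):
    print(f"{mode.rpn:4d}  {mode.component}: {mode.description}")
```

In this made-up worksheet, the load balancer failure ranks first even though the storage failure is more severe, because it is both more frequent and harder to detect.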
The primary benefit of adopting failure mode and effects analysis, rather than a more targeted approach composed of only fault modeling and root cause analysis, is that the design team emerges from it with a more comprehensive analysis, based on the deep exploration of every aspect of the service that is required to complete the exercise. The results of the failure mode and effects analysis process provide the team with a deeper understanding of where the failure points are, what the impact of the failure modes is likely to be, and most importantly, the order in which to tackle these potential risks to produce the most reliable outcome in the shortest amount of time.
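The cascading-impact tracing described in the effect and likelihood analysis step can be approximated as a breadth-first walk over a dependency graph. The following Python sketch assumes the logical diagram from the preparation step has been reduced to a simple map from each component to the components that depend on it directly; the component names are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical map: component -> components that depend on it directly.
DEPENDENTS: Dict[str, List[str]] = {
    "storage": ["database"],
    "database": ["api", "reporting"],
    "api": ["web frontend"],
    "reporting": [],
    "web frontend": [],
}


def downstream_impact(failed: str, dependents: Dict[str, List[str]]) -> List[str]:
    """Return every component affected by `failed`, in breadth-first order."""
    seen = {failed}
    affected: List[str] = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:  # guard against cycles and repeats
                seen.add(dependent)
                affected.append(dependent)
                queue.append(dependent)
    return affected


print(downstream_impact("storage", DEPENDENTS))
```

A real ecosystem also has dependencies outside the provider's control, so a map like this is only as complete as the discovery work behind it.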
Disaster preparedness and business continuity are also important considerations, and FMEA can be applied both to routine, or typical, failures and to less predictable, or unforeseen, events.
Moving Beyond the Traditional Premise
It’s important for cloud providers to design their services to withstand unplanned interruptions, because things will go wrong — it isn’t a matter of if, it’s strictly a matter of when. It’s no longer sufficient to rely heavily on hardware redundancy and data replication to improve the resiliency of cloud-based services. Instead, we need to move beyond the traditional premise of relying on complex physical infrastructure to build redundancy into cloud services, and utilize a combination of less complex physical infrastructure and more intelligent software to build resiliency into services and deliver high availability for customers.
This article is the third in this series from Microsoft. See "Designing for Dependability in the Cloud" and "Microsoft’s Journey: Solving Cloud Reliability With Software" for previous articles.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.