Designing for Dependability in the Cloud

1 comment

Preparation: In this step, it is important to understand the complete ecosystem — generate a complete logical diagram for the ecosystem, including its components, data sources, and data flows. Using templates to accomplish this improves the overall outcome of the analysis by providing important visual cues of possible failure points that the design team can use to drill down into them.

Interaction discovery: Everything is in scope in this step. Start with the logical diagram previously noted to identify all of the components that are vulnerable to failure. Understand the interactions (the connectors) between all components, and how each component in the complete ecosystem works.

Failure brainstorming: In this step, identify all potential failure modes for each component, including the infrastructure elements and dependencies between all of the elements captured during discovery.

Effect, likelihood analysis: Identify all potential effects in this step for each failure mode, whether benign or catastrophic, and identify the downstream impact (follow cascading impacts beyond your own system).

Prioritization of investment: Typical FMEA templates contain a calculation based on the severity of a given failure, how frequently it happens, and the ability to detect the failure. The resulting value that is determined in this step, which is often referred to as a “risk prioritization number,” enables the design team to rank the engineering investments needed to address each of the failures captured in the FMEA worksheets.

The primary benefit of adopting failure mode and effects analysis versus a more targeted approach comprised of only fault modeling and root cause analysis, is that the design team emerges from it with a more comprehensive analysis based on the deep exploration of every aspect of the service required to complete the exercise. The results of the failure mode and effects analysis process provide the team with a deeper understanding of where the failure points are, what the impact of the failure modes is likely to be, and most importantly, the order in which to tackle these potential risks to produce the most reliable outcome in the shortest amount of time.

Disaster preparedness and business continuity are also important considerations, and FMEA can be applied to both routine, or typical, failures, as well as less predictable, or unforeseen, events.

Moving Beyond the Traditional Premise

It’s important for cloud providers to design their services to withstand unplanned interruptions, because things will go wrong — it isn’t a matter of if, it’s strictly a matter of when. It’s no longer sufficient to rely heavily on hardware redundancy and data replication to improve the resiliency of cloud-based services. Instead, we need to move beyond the traditional premise of relying on complex physical infrastructure to build redundancy into cloud services, and utilize a combination of less complex physical infrastructure and more intelligent software to build resiliency into services and deliver high availability for customers.

This article is the third in this series from Microsoft. See “Designing for Dependability in the Cloud” and Microsoft’s Journey: Solving Cloud Reliability With Software for previous articles.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.

Pages: 1 2

Add Your Comments

  • (will not be published)

One Comment

  1. Satish Mehta

    Hi David: Great article! Good to see this contribution from you! I would add following two as well to the guiding design principles:- - Capacity Provisioning - Failover Transparency Users should be totally immune to capacity provisioning on the cloud. If the users have to be heckled and hassled for capacity or the lack of it, the cloud fails in one of the most important attribute. This also means that the cloud provider has to say about 30% ahead on capacity. The lead time to provision hardware determine this %age. Users should also be should be kept somewhat removed from travails of BCP events between different clouds. Users should not have to know the difference between cloud A and cloud A' (BCP of A). Reasons for doing BCP events could be anything ranging from planned or unplanned outages on the cloud. Please continue to educate us with thought provoking articles. Best, Satish