Microsoft’s Journey: Solving Cloud Reliability With Software
April 3rd, 2013 By: Industry Perspectives
Every GPS Needs a Map
Resilient software solves problems beyond the physical world, but to get there, that software needs to have an intimate understanding of the physical environment it resides on top of. While the role of a data center manager might come with a satellite phone, it rarely comes with a GPS. Very few data center operators have a comprehensive view of how server or workload placements affect service availability. Typical placement activities are more art than science – balancing capacity constraints, utilization targets, virtualization initiatives, and budgets. Relying on hardware takes a variable off the table in this complex dance. But there are a few things you can do to start building the maps and turn-by-turn directions that will enable resilient software in your environment, whether you prefer private, hybrid or public cloud.
- Map physical environment and availability domains: From a hardware standpoint, it’s important to look at where hardware is physically placed relative to the infrastructure it depends on. We automate, then integrate that automation so that the data center, the network, the servers, and the operations teams running them can communicate. Understanding the failure and maintenance domains of your data center, server, network, and manageability infrastructure is key to placing virtualized workloads for high availability. Trace the single line diagram to identify common failure points and place software replication pairs in uncorrelated environments. In most data centers, you’re limited to one or a handful of failure domains at best. However, with a cloud services application platform like Windows Azure, a developer or IT professional can now choose from many different regions and availability domains to spread their applications across many physical hardware environments.
- Define hardware abstractions: As you are looking at private, public, and hybrid cloud solutions, now is a good time to start thinking about how you present the abstraction layer of your data center infrastructure. How workloads are placed on top of data center, server, and network infrastructure can make a significant difference in service resiliency and availability. Rather than assign physical hardware to a workload, can you challenge your systems integrator or software developer to consume compute, storage, and bandwidth resources tied to an availability domain and network latency envelope? In a hardware-abstracted environment, there is a lot of room for the data center to become an active participant in the real-time availability decisions made in software. Resilient software solves for problems beyond the physical world. But to get there, the development of this software requires an intimate understanding of the physical infrastructure in order to abstract it away.
- Total cost of operations (TCO) performance and availability metrics: Measure constantly and focus on TCO-driven metrics like performance per dollar per kW-month, and balance that against revenue, risk, and profit. At cloud-scale, each software revision cycle is an opportunity to improve the infrastructure. The tools available to software developers, whether debuggers or coding environments, allow them to understand failures much more rapidly than we can model in the data center space. Enabling shared key performance indicators (KPIs) across the business, developer, IT operations, and data center is key to demonstrating the value of infrastructure to the business’s bottom line. Finally, building bi-directional service contracts with software and business teams will enable these key business, service, and application insights to be holistically leveraged on your journey to the cloud.
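To make the first point concrete, here is a minimal Python sketch of how a placement tool might choose a replication pair with no shared failure or maintenance domain. It is an illustration only, not a description of Microsoft's tooling: the `Host` fields, the host names, and the feed, switch, and zone labels are all hypothetical stand-ins for what you would read off your own single line diagram.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Host:
    """A server plus the shared infrastructure it depends on (all hypothetical)."""
    name: str
    power_feed: str        # upstream power path traced from the single line diagram
    network_switch: str    # switch the host's traffic depends on
    maintenance_zone: str  # group of hosts taken down together for maintenance


def failure_domains(host):
    """Return the shared components whose loss would take this host down."""
    return {
        ("power", host.power_feed),
        ("network", host.network_switch),
        ("maintenance", host.maintenance_zone),
    }


def uncorrelated(a, b):
    """True when two hosts share no power, network, or maintenance domain."""
    return failure_domains(a).isdisjoint(failure_domains(b))


def place_replica_pair(hosts):
    """Return the first pair of hosts with no common domain, or None if none exists."""
    for a, b in combinations(hosts, 2):
        if uncorrelated(a, b):
            return a, b
    return None


# Hypothetical inventory: srv-01 and srv-02 share a power feed,
# so they are a poor choice for a replication pair.
hosts = [
    Host("srv-01", "feed-A", "sw-1", "zone-1"),
    Host("srv-02", "feed-A", "sw-2", "zone-2"),
    Host("srv-03", "feed-B", "sw-2", "zone-3"),
    Host("srv-04", "feed-B", "sw-3", "zone-2"),
]

pair = place_replica_pair(hosts)
print([h.name for h in pair] if pair else "no uncorrelated pair available")
```

When the function returns `None`, the environment has effectively a single failure domain for that workload, which is the situation the article describes for most data centers. A real placement engine would also weigh capacity, utilization, and latency, which this sketch leaves out.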
Resilient software is a key enabler of service availability in today’s complex IT landscape when operating at cloud-scale. By shifting the mindset away from hardware redundancy, Microsoft has made significant gains in service reliability (uptime), while lowering costs and increasing scalability, efficiency, and sustainability. So while we continue to deliver mission-critical services to more than one billion people and 20 million businesses in 76 marketplaces around the world, we’re doing it on significantly more resilient, highly integrated software that is delivered via hardware that is decidedly less than mission-critical.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.