The Hotmail servers got entirely too hot Tuesday, causing a major outage for Microsoft’s web-based email services. Both Hotmail and Outlook.com were offline for up to 16 hours after a failed software update caused the heat to spike in one part of a data center supporting those services. The outage also affected the SkyDrive cloud storage service.
The temperature rose so swiftly that Microsoft was unable to implement automated failover processes designed to shift workloads within its infrastructure, according to an incident report posted on the Outlook.com blog.
Microsoft said the problems were focused on a single data center, where software managing the facility’s physical plant was receiving a firmware upgrade. Although previous updates had gone smoothly, this update “failed in an unexpected way,” according to Microsoft’s Arthur de Haan.
“This failure resulted in a rapid and substantial temperature spike in the datacenter,” de Haan wrote. “This spike was significant enough before it was mitigated that it caused our safeguards to come in to place for a large number of servers in this part of the datacenter. These safeguards prevented access to mailboxes housed on these servers and also prevented any other pieces of our infrastructure to automatically failover and allow continued access.”
Warmer Temperatures Offer Benefits, Risks
The report doesn’t provide details on the software or equipment involved, but it’s clear that the data center’s cooling system was affected, and that the temperature in the server area rose very quickly. Microsoft has been a pioneer in operating its data centers at warmer temperatures, a strategy that can generate significant energy savings by requiring less use of power-hungry chillers and cooling equipment. The flip side of raising the temperature in the data center is that it provides less thermal “cushion,” allowing server inlet temperatures to reach their limits sooner and leaving less time to recover from a cooling failure. That’s particularly true in high-density data centers like those operated by Microsoft.
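The thermal cushion trade-off can be illustrated with some rough arithmetic. The sketch below assumes, for simplicity, that inlet temperature climbs at a constant rate once cooling is lost. The function name, setpoints, limit and rise rate are all hypothetical illustration values and do not come from Microsoft's incident report.

```python
def minutes_until_limit(setpoint_c: float, limit_c: float, rise_c_per_min: float) -> float:
    """Minutes until server inlet air reaches its limit after cooling is lost.

    Assumes a constant (linear) temperature rise, which is a simplification.
    """
    if rise_c_per_min <= 0:
        raise ValueError("rise rate must be positive")
    # The "cushion" is the gap between the normal setpoint and the limit.
    cushion_c = limit_c - setpoint_c
    return max(0.0, cushion_c / rise_c_per_min)


# Hypothetical figures: same room, same rise rate, two different setpoints.
cool_room = minutes_until_limit(setpoint_c=20.0, limit_c=35.0, rise_c_per_min=2.5)
warm_room = minutes_until_limit(setpoint_c=27.0, limit_c=35.0, rise_c_per_min=2.5)
print(f"cool room: {cool_room:.1f} min, warm room: {warm_room:.1f} min")
```

Under these assumed numbers, running seven degrees warmer cuts the recovery window roughly in half. A denser room would likely have a faster rise rate, which shrinks the window further.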
For some reason, automated failover systems were unable to manage the situation. “Based on the failure scenario, there was a mix of infrastructure software and human intervention that was needed to bring the core infrastructure back online,” de Haan wrote. “Requiring this kind of human intervention is not the norm for our services and added significant time to the restoration.”
The growth of web-scale infrastructure, with companies operating networks of huge data centers, has enabled changes in how the industry thinks about redundancy. In the past, redundancy meant having backup equipment on-site, requiring the purchase of additional generators and UPS units. With a network of cloud data centers, redundancy can be managed by moving workloads from one data center to another to route around problems.
Microsoft has been working to enhance its software to automate failure management by shifting workloads to maintain reliability. In some cases, workloads may move from one location to another within the same facility. In other cases, they might move to a different geographic location. Failure domains are created to address different scenarios, and these domains then guide utilization planning and the amount of reserved capacity maintained both on-site and off. But clearly, data center operations remain complex enough that it’s impossible to predict and manage every failure scenario.
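The idea of failure domains with reserved capacity can be sketched in a few lines. This is a minimal illustration, not Microsoft's software: the FailureDomain type, the pick_failover_target function, the hall and site names, and the capacity figures are all hypothetical. The sketch assumes a simple policy that prefers a healthy domain in the same facility and otherwise falls back to another location.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureDomain:
    name: str
    site: str
    reserved_capacity: int  # spare workload units held back for failover
    healthy: bool = True


def pick_failover_target(
    failed: FailureDomain, domains: List[FailureDomain], load: int
) -> Optional[FailureDomain]:
    """Pick a domain able to absorb `load` from `failed`, or None if none can."""
    candidates = [
        d for d in domains
        if d is not failed and d.healthy and d.reserved_capacity >= load
    ]
    # Same-site domains sort first (False < True), then larger reserves first.
    candidates.sort(key=lambda d: (d.site != failed.site, -d.reserved_capacity))
    return candidates[0] if candidates else None


# Hypothetical layout: two halls in one facility, one hall at a remote site.
domains = [
    FailureDomain("hall-a", "site-1", reserved_capacity=0, healthy=False),
    FailureDomain("hall-b", "site-1", reserved_capacity=40),
    FailureDomain("hall-c", "site-2", reserved_capacity=100),
]
local = pick_failover_target(domains[0], domains, load=30)
remote = pick_failover_target(domains[0], domains, load=80)
stuck = pick_failover_target(domains[0], domains, load=500)
print(local.name, remote.name, stuck)
```

When no domain has enough reserve, the function returns None, which corresponds to the case where automation runs out of options and people have to step in.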