At Microsoft, Smarter Software Equals Fewer Generators
It takes both software and hardware for a data center to operate successfully. Sometimes a relatively small amount of software code can make a huge difference in how much you spend on hardware.
That’s the case at Microsoft, which has been able to reduce the number of diesel backup generators at its data centers by using software to move workloads from one location to another. Managing reliability with sophisticated software, rather than redundant hardware, has allowed the company to eliminate generators for portions of its facilities in Chicago and southern Virginia, according to David Gauthier, Director of Data Center Architecture and Design at Microsoft.
“It’s a heck of a lot easier to solve availability in software versus hardware,” said Gauthier.
While the Chicago data center is equipped with generators, many of the server-filled containers housed within the facility have no generator backup. The Chicago facility also includes traditional raised-floor space, primarily for storage, that has generator backup. A similar approach is taken in Boydton, Virginia, where the latest phase of modular data centers is housed outdoors and operates without generator support.
$15 Billion on Data Centers
Diesel generators cost hundreds of thousands of dollars apiece, so eliminating them can save serious cash. Microsoft isn’t bashful about investing in cloud infrastructure – it has spent more than $15 billion on data centers since 1989 – but has been increasingly focused on getting the most mileage out of its data center expenses. There are also compelling operational arguments for shifting reliability management from hardware to software.
“We’ve seen that our environment, with all its redundancy, was getting really complicated,” said Gauthier. “We had been trusting in hardware redundancy. But to succeed in the cloud we need to focus on service resiliency. The services that run in the cloud must be able to gracefully migrate to other clusters and maintain customer SLAs.”
The growth of web-scale infrastructure, with companies operating networks of huge data centers, has enabled changes in how the industry thinks about redundancy. In the past, redundancy meant having backup equipment on-site, requiring the purchase of additional generators and UPS units. With a network of cloud data centers, redundancy can be managed by moving workloads from one data center to another.
“Software error handling routines can resolve an issue far faster than a human with a crash cart,” Gauthier writes in a blog post. “For example, during a major storm, algorithms could decide to migrate users to another data center because it is less expensive than starting the emergency back-up generators.
Faster Than A Crash Cart
“Of course, software isn’t infallible, but it is much easier and faster to simulate and make changes in software than physical hardware,” he continued. “Every software revision cycle is an opportunity to improve the performance and reliability of infrastructure. The telemetry and tools available today to debug software are significantly more advanced than even the best data center commissioning program or standard operating procedure.”
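The storm scenario Gauthier describes is, at its core, a cost comparison with a feasibility check. The sketch below is purely illustrative and is not Microsoft's implementation: the `OutageOptions` fields, the `choose_action` function, and the idea of expressing both options as a single cost figure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class OutageOptions:
    """Hypothetical inputs; none of these figures come from the article."""
    migration_cost: float         # estimated cost of moving users to another site
    generator_cost: float         # estimated cost of starting and running generators
    remote_spare_capacity: float  # spare capacity available at the other site
    load_to_move: float           # load that would have to move (same units)


def choose_action(opts: OutageOptions) -> str:
    """Return "migrate" when migration is feasible and cheaper, else "generators"."""
    # Migration is only an option if the other site can absorb the load.
    can_migrate = opts.remote_spare_capacity >= opts.load_to_move
    if can_migrate and opts.migration_cost < opts.generator_cost:
        return "migrate"
    return "generators"
```

A real system would presumably weigh many more factors, such as latency, SLAs, and data locality. The point of the sketch is only that the decision can be made and revised in code rather than by provisioning more hardware.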
Microsoft isn’t the first company to discuss shifting workloads between data centers to manage reliability. Google has been doing this for years, and Yahoo has explored using data motion to eliminate generators and UPS units in some of its facilities. Microsoft has been focused on this topic since 2008, when it began shifting from an “enterprise IT” model using backup equipment to modular deployments and cloud-level workload management.
In some cases, workloads may move from one location to another within the same facility. In other cases, they might move to a different geographic location. Failure domains are created to address different scenarios, which then guide utilization planning and the maintenance of reserved capacity, both on-site and off.
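One way to picture how failure domains drive reserved capacity is to ask whether the surviving domains have enough spare room to take on the load of a failed one. The following is a minimal sketch under that assumption. The function names, the dictionary layout, and the "largest single domain" sizing rule are hypothetical and are not described in the article.

```python
from typing import Dict


def required_reserve(domain_loads: Dict[str, float]) -> float:
    """Reserve needed to absorb the loss of the largest single failure domain."""
    return max(domain_loads.values(), default=0.0)


def can_absorb(
    domain_loads: Dict[str, float],
    spare_by_domain: Dict[str, float],
    failed: str,
) -> bool:
    """True if spare capacity outside the failed domain covers its load."""
    # Spare capacity inside the failed domain is lost along with it.
    surviving_spare = sum(
        spare for name, spare in spare_by_domain.items() if name != failed
    )
    return surviving_spare >= domain_loads[failed]
```

A failure domain here could be a single container, a whole facility, or a utility feed. The same check applies at each level, with only the set of surviving domains changing.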
“It all comes down to the type of failure,” said Gauthier. “What do we do in utility failures? We want to think about how we move the load. What’s more likely is a failure in a single container.”
Using software to manage workload placement also allows more sophisticated matching of hardware to capture the best economics. “We consider more than just cost per megawatt or transactions per second, measuring end-to-end performance per dollar per watt,” said Gauthier. “We look to maintain high availability of the service while making economic decisions around the hardware that is acquired to run them. So while these services are being delivered in a mission critical fashion, in many cases the hardware underneath is decidedly not mission critical.
“The real magic is the agility to make these changes, which you just can’t do in hardware,” he said.
Joe Brunner (Posted February 11th, 2013)
Great. How much cash will they waste when machines don’t come back up when they lose power and blow parts spontaneously? Generators are not just to keep stateless code from losing data – they also work with dc battery plants to prevent against non-graceful shutdowns frying servers… Lol
I guess MSFT should buy a generator company to get a better handle on their generator costs before surrendering the reliable power front
Curt Gibson (Posted February 12th, 2013)
Amazon is doing something like this (ELB), and it has not seemed to work well. They have had four outages due to complexity of the software that seems to have become confused.
Servers are becoming more efficient, and mechanical loads smaller, so the generator and infrastructure is becoming smaller.
Generator technology is improving and the cost of emergency power is also decreasing. By paralleling smaller generators, the cost is now down to $200/kW, where it was as high as $500/kW 5 years ago.