Our theme this month is intelligent data center management software tools. Data center management technologies have come a long way, as companies find themselves having to manage ever bigger and more diverse environments. From using machine learning to improve data center efficiency to using automation to manage everything from servers to cooling systems, we explore some of the latest developments in this space.
Overprovisioning, viewed in data center design and management as something between a best practice and a necessary evil, is built into the industry’s collective psyche because of its core mission: maintain uptime at all costs. If a data center team spends more than it really has to, it merely needs to improve efficiency; if a data center goes down, somebody has failed to do their job.
Data center managers and designers overprovision everything from servers to facility power and cooling capacity. More often than not, they do it just in case demand unexpectedly spikes beyond the capacity they expect to need most of the time. The practice is common because few data center operators have made it a priority to measure and analyze actual demand over time. Without reliable historical usage data, overprovisioning is the only way to ensure you don’t get caught off-guard.
Chief Suspect: the Server Power Supply
Today, however, more and more tools are available that can help you extract that data and put it to use. The key is figuring out what to measure. Should it be CPU utilization? Kilowatts per rack? Temperature?
The best answer is all of the above and then some, but one data center management software startup suggests server power supplies are a good place to start. The company, Coolan, recently measured power consumption by a cluster of servers at a customer’s data center and found a vast discrepancy between the amount of power the servers consumed and the amount of power their power supplies were rated for.
It’s the latter number – so-called nameplate power – that is used in capacity planning to figure out how much power and cooling a facility will need. Overprovisioning at this basic-component level leads to overprovisioning at the level of the entire facility: companies buy transformers, UPS systems, chillers, and other equipment with more capacity than they will ever need.
The common practice in data center power provisioning is to assume that each system will operate at 80 percent of the maximum power its power supply is rated for, according to Coolan. Few systems ever run close to that once deployed.
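The gap between the 80-percent planning rule and measured draw can be put in concrete terms. The sketch below is illustrative only, not Coolan’s methodology; the wattage figures are hypothetical, chosen to echo the roughly 35-percent utilization described later in this article.

```python
# Illustrative sketch: compare power budgeted under the common
# 80%-of-nameplate planning rule with a server's measured draw.
# All numbers here are hypothetical.

def provisioned_watts(nameplate_w, planning_factor=0.8):
    """Power budgeted per server under the 80%-of-nameplate rule."""
    return nameplate_w * planning_factor

def overprovision_ratio(nameplate_w, measured_w, planning_factor=0.8):
    """How many times more power is budgeted than is actually drawn."""
    return provisioned_watts(nameplate_w, planning_factor) / measured_w

# A server with a 750 W supply drawing a measured average of 260 W
# (roughly 35% of nameplate):
ratio = overprovision_ratio(750, 260)
print(f"Budgeted {provisioned_watts(750):.0f} W for 260 W of actual draw "
      f"({ratio:.1f}x overprovisioned)")
```

Because the same ratio propagates up through rack, room, and facility planning, a modest per-server gap compounds into substantially oversized electrical and mechanical plant.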
A Case in Point
The customer, whom the startup could not name due to confidentiality agreements, is a cloud service provider, and the cluster that was analyzed consists of a diverse group of 1,600 nodes, including a range of HP and Dell servers. Not even 500 systems are of the same model.
More than 60 percent of the boxes in the cluster consumed about 35 percent of their power supplies’ nameplate power; the rest consumed under 20 percent. Considering that the sweet spot for maximum power-supply efficiency is between 40 and 80 percent, according to Coolan (see chart below), every single server in the cluster runs inefficiently and the facility infrastructure that supports it is grossly overprovisioned.
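The comparison implied above is simple to automate: check each server’s power-supply utilization against the 40-to-80-percent band Coolan cites as the efficiency sweet spot. This is a rough sketch, and the utilization figures are hypothetical stand-ins mimicking the mix described in the cluster, not the customer’s data.

```python
# Rough sketch: flag servers whose power-supply utilization falls
# outside the 40-80% efficiency band. Utilization values below are
# hypothetical, mimicking the cluster mix described in the article.

EFFICIENT_BAND = (0.40, 0.80)

def in_efficient_band(utilization, band=EFFICIENT_BAND):
    lo, hi = band
    return lo <= utilization <= hi

# Most boxes near 35% of nameplate, the rest under 20%:
utilizations = [0.35] * 6 + [0.18] * 4
outside = [u for u in utilizations if not in_efficient_band(u)]
print(f"{len(outside)} of {len(utilizations)} servers run outside "
      "the efficient band")
```

In a mix like this, every server lands below the band, which is exactly the situation the study describes.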
Because the customer is a large software developer, Coolan says it’s safe to assume many cloud-based service providers are in the same situation, overpaying for underutilized infrastructure.
Power supply utilization efficiency curve (Credit: Coolan)
How to Narrow the Gap?
So what are the action items here? Amir Michael, Coolan’s co-founder and CEO, says ultimately the answer can be anything that narrows the gap between the workload on every server and the amount of power it’s being supplied. It can be narrowed on the side of the workload itself, by loading the server more to get it into that efficient utilization rate, or on the side of the power supply: if it’s a new deployment, select lower-capacity power supplies (which are also cheaper), and if it’s an existing deployment, take a look at the way power-supply redundancy is configured.
Making adjustments on the power-supply side is often much easier than increasing server workload. “It’s a challenge for them to actually load the boxes, and there are lots of companies trying to solve that problem,” Michael said. Docker containers and server virtualization are the most straightforward ways to do it: the more VMs or containers run on a single server, the higher its overall workload. But it’s not always that simple.
Changing the way redundant power supplies on a server are configured is much lower-hanging fruit. Two redundant supplies can be set up to share the load equally, which makes it very difficult to get either of them close to the optimal operating range. But you can change the configuration to have one power supply serve the entire load while the other sits on hot standby. That will at least get you closer to an optimal utilization rate, Michael explained, adding that data center managers are seldom aware that they even have the choice to change this configuration.
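The arithmetic behind that redundancy point is worth spelling out: with two supplies sharing the load, each one runs at half the utilization it would otherwise; switching to an active/standby arrangement doubles the active supply’s utilization without touching the workload. This back-of-the-envelope sketch uses hypothetical wattages.

```python
# Back-of-the-envelope sketch of redundant power-supply configurations.
# Numbers are hypothetical: a 300 W server load on two 750 W supplies.

def per_supply_utilization(load_w, supply_rating_w, mode="shared"):
    if mode == "shared":
        # Both supplies split the load evenly.
        return (load_w / 2) / supply_rating_w
    elif mode == "standby":
        # One supply carries the full load; the other idles on hot standby.
        return load_w / supply_rating_w
    raise ValueError(f"unknown mode: {mode}")

load, rating = 300, 750
print(f"shared:  {per_supply_utilization(load, rating, 'shared'):.0%} per supply")
print(f"standby: {per_supply_utilization(load, rating, 'standby'):.0%} on the active supply")
```

In this example the shared configuration leaves each supply at 20 percent of its rating, while active/standby lifts the working supply to 40 percent, just inside the efficient band, with no change to the server itself.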
Wisdom of the Hyper Scale
Hyper-scale data center operators like Facebook and Google have been aware of the problem of underutilized power supplies for years. In a paper on data center power provisioning published in 2007, Google engineers highlighted a gap of 7 to 16 percent between achieved and theoretical peak power usage for groups of thousands of servers in the web giant’s own data centers.
The data center rack Facebook designed and contributed to the Open Compute Project several years ago features a power shelf that’s shared among servers. You can add or remove compute nodes, or increase or decrease the shelf’s power capacity, independently of one another to get an optimal match.
Michael is deeply familiar with server design at Google and Facebook, as well as with OCP, which he co-founded. He has spent years designing hardware at Google and later at Facebook.
Arm Yourself with Data
The best-case scenario is when a data center operator has spent some time tracking power usage of their servers and has a good idea of what they may need when they deploy their next cluster. Armed with real-life power usage data, they can select power supplies for that next cluster whose nameplate capacity is closer to actual use and design the supporting data center infrastructure accordingly. “Vendors have a whole host of power supplies you can choose for your system,” Michael says.
Of course, there’s little guarantee that your power demand per server will stay the same over time. As developers add more features and algorithms become more complex, server load per transaction or per user increases. “The load changes all the time,” Michael says. “People update their applications, they get more users.”
This is why it’s important to measure power consumption over periods of time long enough to expose patterns and trends. Having and using good data translates into real dollars and cents. Coolan estimated that the aforementioned customer would have saved $33,000 per year on energy costs alone had the servers in its cluster operated within the power supplies’ efficient range. That’s not counting the money they would save by deploying lower-capacity electrical and mechanical equipment (assuming their cluster sits in their own data center and not in a colocation facility).
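The mechanics of such an estimate are straightforward even if the inputs aren’t public: a lightly loaded supply converts power less efficiently, so every watt the server consumes costs more at the wall. The sketch below uses entirely hypothetical efficiency figures and electricity rates, not Coolan’s actual data.

```python
# Illustrative annual-energy-cost sketch. Efficiency percentages and
# the $0.10/kWh rate are hypothetical assumptions, not Coolan's figures.

HOURS_PER_YEAR = 8760

def annual_cost(load_w, psu_efficiency, rate_per_kwh=0.10):
    """Yearly electricity cost of a server drawing load_w at its output,
    accounting for power-supply conversion losses."""
    wall_watts = load_w / psu_efficiency
    return wall_watts / 1000 * HOURS_PER_YEAR * rate_per_kwh

# Same 260 W load through an underloaded supply (assume 85% efficient)
# vs. one operating in its sweet spot (assume 93% efficient):
waste = annual_cost(260, 0.85) - annual_cost(260, 0.93)
print(f"~${waste:.0f} per server per year lost to conversion inefficiency")
```

Per server the difference looks small, but multiplied across a cluster of 1,600 nodes it reaches the tens of thousands of dollars per year, which is the scale of savings the study describes.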
Read more about Coolan’s study in today’s blog post on the company’s site.