Clemens Pfeiffer is the CTO of Power Assure and a 22-year veteran of the software industry, where he has held leadership roles in process modeling and automation, and data center management and optimization technologies.
Data center operators are constantly assessing server, storage and network capacity needs as part of a multi-site disaster recovery (DR) strategy, but often neglect to take into account capacity considerations within an individual data center. The most important aspect of this additional planning is the uptime level that can be sustained with reduced power and/or cooling, particularly in high-availability environments.
This article highlights some best practices as part of capacity management in an effective DR strategy by addressing two common situations: a complete power outage and a cooling system failure.
Partial Power Triage
Power blackouts and brownouts always seem to occur at the worst possible time, and most data centers lack the UPS and generator capacity needed to operate at 100% load, especially when neither source of backup power can keep the cooling system fully functional. The top priority during this localized “disaster” becomes shedding and/or shifting the load, and that requires some advance planning by performing “triage” on the applications.
The triage determines how much server capacity is needed to run:
- all mission-critical applications; and
- any highly desirable applications (perhaps with diminished performance).
The triage also determines how much power can be saved by shedding all non-essential applications and lower storage tiers.
It is also important to take into account both the power needed to run the IT equipment and the increase in temperature anticipated while operating with no or only partial cooling.
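The triage described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular DCIM product works: the tier names, the `App` record, the `triage` function and the per-application power figures are all hypothetical, and the sketch ignores the heat side of the calculation.

```python
from dataclasses import dataclass

# Hypothetical tier names, ordered from most to least important.
TIERS = ("mission_critical", "highly_desirable", "non_essential")

@dataclass
class App:
    name: str
    tier: str         # one of TIERS
    power_kw: float   # assumed IT power draw at normal service level

def triage(apps, budget_kw):
    """Keep applications in tier order until the backup power budget is used.

    Mission-critical applications are always kept, so the caller should
    check whether the kept load still exceeds budget_kw.
    Returns (kept, shed, saved_kw).
    """
    kept, shed = [], []
    used_kw = 0.0
    for app in sorted(apps, key=lambda a: TIERS.index(a.tier)):
        if app.tier == "mission_critical" or used_kw + app.power_kw <= budget_kw:
            kept.append(app)
            used_kw += app.power_kw
        else:
            shed.append(app)
    saved_kw = sum(a.power_kw for a in shed)
    return kept, shed, saved_kw

# Illustrative inventory; the numbers are made up.
apps = [
    App("billing", "mission_critical", 40.0),
    App("reporting", "highly_desirable", 25.0),
    App("dev-test", "non_essential", 30.0),
    App("email", "mission_critical", 20.0),
]
kept, shed, saved_kw = triage(apps, budget_kw=90.0)
```

With a 90 kW budget, this example keeps billing, email and reporting and sheds dev-test, saving 30 kW. Running the same function with several budgets gives a rough version of the multiple scenarios a planner would want to compare.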
A Data Center Infrastructure Management (DCIM) system with Dynamic Power Optimization (DPO) is the ideal tool to both plan and implement partial power triage. In the planning stage, a capable DCIM system is able to determine the power required for and the heat generated by all individual applications. What-if analyses can then be performed to assess possible trade-offs, such as keeping more applications available, but at lower service levels. Multiple scenarios should also be created to address situations ranging from a brief brownout to an extended blackout.
The DCIM’s DPO capability is used to implement the triage during a power outage. DPO solutions optimize server utilization and reduce power consumption (typically by 50% or more). But the same ability to match capacity with demand is just as important during a power outage. The best practice here is to employ runbooks that fully automate the many steps involved in shedding and/or shifting loads. For example, one runbook might migrate all critical applications to a core set of virtual machines, then shut down the offloaded servers. Another runbook might simply shut down the servers being used for all (or some) non-essential applications. When the power outage is over, a different set of runbooks can then be used to restore normal operation.
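As a rough sketch of what an automated runbook might look like, the following Python models a runbook as an ordered list of steps that halts at the first failure. The `Runbook` class and the two action functions are hypothetical placeholders; a real implementation would call whatever interface the DCIM system or hypervisor actually provides.

```python
from typing import Callable, List, Tuple

class Runbook:
    """An ordered list of named steps that stops at the first failure."""

    def __init__(self, name: str):
        self.name = name
        self.steps: List[Tuple[str, Callable[[], bool]]] = []

    def step(self, description: str, action: Callable[[], bool]) -> "Runbook":
        self.steps.append((description, action))
        return self

    def run(self) -> List[str]:
        """Run steps in order; return the descriptions of completed steps."""
        done = []
        for description, action in self.steps:
            if not action():
                break  # do not shut anything down after a failed step
            done.append(description)
        return done

log = []  # stands in for real side effects

def migrate_critical_apps() -> bool:
    log.append("migrate")   # placeholder for a real migration call
    return True

def shut_down_offloaded_servers() -> bool:
    log.append("shutdown")  # placeholder for a real power-off call
    return True

outage = (
    Runbook("power-outage")
    .step("migrate critical apps to core VMs", migrate_critical_apps)
    .step("shut down offloaded servers", shut_down_offloaded_servers)
)
completed = outage.run()
```

Stopping at the first failed step is a deliberate choice in this sketch: if the migration fails, the servers that still host critical applications are not powered off. A restore runbook would be a second `Runbook` with the steps in reverse.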
Like power, air conditioners seem to fail when they are needed (and stressed) the most: at a peak period during the heat of the day. A similar form of triage is also required for this situation. With the application triage already performed, the anticipated time to repair the A/C system becomes the all-important consideration here. Will it take at least a day? Better start shedding load now! Or will it take less than an hour? Depending on how far the current temperature is below the target maximum, it may be possible to keep all applications running by consolidating them onto fewer virtualized servers, or by power capping the existing servers, which diminishes performance somewhat but reduces heat output.
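The time-to-repair decision can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that inlet temperature rises at a constant rate once cooling is lost and that the rate scales linearly with IT load; real rooms do not behave this neatly, so the rate would have to come from measurement or modeling. The function names and the `capped_load_fraction` parameter are hypothetical.

```python
def minutes_until_limit(current_c, limit_c, rise_c_per_min):
    """Minutes before inlet temperature reaches the limit,
    assuming a constant rate of rise (a simplification)."""
    if rise_c_per_min <= 0:
        return float("inf")
    return max(0.0, (limit_c - current_c) / rise_c_per_min)

def cooling_response(current_c, limit_c, rise_c_per_min, repair_min,
                     capped_load_fraction=0.5):
    """Pick a response to an A/C failure.

    Assumes the rise rate is proportional to IT load, so running at
    capped_load_fraction of normal load slows the rise by that factor.
    """
    if minutes_until_limit(current_c, limit_c, rise_c_per_min) >= repair_min:
        return "keep running"
    capped_rate = rise_c_per_min * capped_load_fraction
    if minutes_until_limit(current_c, limit_c, capped_rate) >= repair_min:
        return "consolidate or power cap"
    return "shed load"
```

For example, with a current inlet temperature of 21°C, a 27°C limit and an assumed rise of 0.25°C per minute, the room has 24 minutes of headroom. A 20-minute repair can be ridden out, a 45-minute repair calls for consolidation or power capping, and a day-long repair calls for shedding load.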
The DCIM system is also the ideal tool to use for planning and implementing a response to a partial or complete A/C outage. The planning here was (or should have been) part of an effort to improve Power Usage Effectiveness (PUE) and extend the life of the data center. For example, DCIM modeling tools can be used to optimize the placement of systems in suitable hot/cold aisles and even within the individual rack, and to minimize stranded power. What-if analyses allow the various permutations and combinations of power and cooling considerations to be evaluated easily and accurately to achieve the most efficient result.
The good news (at least in the context of loss of cooling) is that most data centers today operate far below the 80°F (27°C) cold aisle temperature that ASHRAE recommends, normally for fear of hot spots. Taking constant and accurate measurements of the server inlet temperature can minimize this risk, however, and the use of DPO can adjust capacity before any hot spots form, which they will during a cooling outage.
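A minimal sketch of the inlet-temperature check follows, using the 27°C figure cited above as the limit. The rack IDs, the readings and the 2°C warning margin are assumptions made for the example, and `hot_spots` is a hypothetical helper rather than part of any DCIM product.

```python
RECOMMENDED_MAX_C = 27.0  # cold aisle limit cited in the article

def hot_spots(inlet_temps_c, limit_c=RECOMMENDED_MAX_C, margin_c=2.0):
    """Return rack IDs whose inlet temperature is within margin_c
    of the limit (or above it), hottest first."""
    threshold = limit_c - margin_c
    flagged = {rack: t for rack, t in inlet_temps_c.items() if t >= threshold}
    return sorted(flagged, key=flagged.get, reverse=True)

# Illustrative readings in degrees Celsius, keyed by a made-up rack ID.
readings = {"A1": 22.5, "A2": 25.5, "B1": 27.5}
at_risk = hot_spots(readings)
```

Here racks B1 and A2 are flagged, in that order, which is the kind of signal that could trigger a capacity adjustment before a hot spot fully forms.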
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.