Patrick Donovan is a Senior Research Analyst with Schneider Electric’s Data Center Science Center. He has over 18 years of experience developing and supporting critical power and cooling systems for Schneider Electric’s IT Business unit including several award-winning power protection, efficiency and availability solutions.
Historically, the total electrical power consumed by IT equipment in data centers and network rooms has varied only slightly with computational load or mode of operation. However, once notebook processors were redesigned to extend battery life – allowing processor power consumption to drop by up to 90 percent when lightly loaded – server processor design soon followed suit. As a result, newly developed servers with energy management capabilities can experience dramatic fluctuations in power consumption as workload varies over time – causing a variety of new problems for the design and management of data centers and network rooms.
Once negligible (historically on the order of five percent), total power variation for a small business or enterprise server is now much greater. These fluctuations in power consumption can lead to unplanned and undesirable consequences in the data center and network room environment, including tripped circuit breakers, overheating, and loss of redundancy – entirely new challenges for the design and operation of data centers and network rooms.
Additionally, the growing popularity of cloud computing and virtualization has greatly increased the ability to utilize and scale compute power, while in turn heightening the risk of physical infrastructure issues. In a virtualized environment, the sudden creation and movement of virtual machines requires careful management and policies that account for physical infrastructure status and capacity down to the individual rack level. Failure to do so could undermine the software layer's fault tolerance.
Data Center Virtualization and Magnitude of Dynamic Power Variation
Two decades ago, server power variation was largely independent of the computational load placed on processors and memory subsystems; significant fluctuations were most often caused only by disk drive spin-up and fans, and typical power variation was approximately five percent. Modern processing equipment, however, employs new techniques to achieve low-power states, such as scaling the clock frequency and the voltage applied to the processors to better match the workload in the non-idle state, and shifting virtual loads between machines. Depending on the server platform, power variation can be on the order of 45 to 106 percent – a significant increase from just twenty years ago. This type of dynamic power variation gives rise to the following four types of problems.
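The compounding effect of scaling both voltage and frequency can be seen in the classic first-order CMOS dynamic power model, P ≈ C·V²·f. The sketch below uses purely hypothetical capacitance, voltage and frequency values – not figures for any particular server – simply to show why a low-power state can cut consumption so sharply:

```python
# Illustrative only: dynamic CMOS power scales roughly as P = C * V^2 * f,
# so lowering both supply voltage and clock frequency compounds the savings.
# All numbers below are hypothetical, not measurements from any real server.

def dynamic_power(c_eff, voltage, freq_hz):
    """First-order dynamic power model: P = C * V^2 * f (watts)."""
    return c_eff * voltage**2 * freq_hz

full = dynamic_power(c_eff=1e-9, voltage=1.2, freq_hz=3.0e9)  # full speed
idle = dynamic_power(c_eff=1e-9, voltage=0.8, freq_hz=1.2e9)  # low-power state
print(f"full-load: {full:.1f} W, low-power state: {idle:.1f} W")
print(f"reduction: {1 - idle / full:.0%}")
```

With these assumed numbers, halving-plus the frequency and dropping the voltage a third yields a reduction of roughly 80 percent, in line with the large swings described above.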
1. Branch Circuit Overload
Typically, servers operate at light computational loads, with actual power draw well below the server's maximum power draw capability. However, because many data center and network managers are unaware of this discrepancy, they often plug more servers into a single branch circuit than the circuit can support at maximum draw. This creates the potential for circuit overloads, as the total maximum server power consumption can exceed the branch circuit rating. While the servers will operate successfully at lower loads, overloads will occur when the servers are simultaneously subject to heavy loading. The most significant result of a branch circuit overload is the tripping of the circuit breaker, which shuts off power to the computing equipment. These instances are always undesirable, and because they occur during periods of high workload, they can be extremely detrimental to business continuity.
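A simple worst-case check against nameplate figures illustrates the risk. The server wattages and the 120 V / 15 A branch circuit below are assumed values for illustration only:

```python
# Hypothetical nameplate figures: check whether a branch circuit could be
# overloaded if every server on it hit its maximum draw at once.

def overload_risk(max_draws_w, circuit_voltage, breaker_amps):
    """Return (worst-case amps, True if the summed maximum draws
    could exceed the breaker rating)."""
    worst_case_amps = sum(max_draws_w) / circuit_voltage
    return worst_case_amps, worst_case_amps > breaker_amps

# Six servers that idle near 150 W but can each draw 400 W under load,
# sharing a 120 V / 15 A branch circuit (assumed values).
idle_amps, _ = overload_risk([150] * 6, circuit_voltage=120, breaker_amps=15)
peak_amps, at_risk = overload_risk([400] * 6, circuit_voltage=120, breaker_amps=15)
print(f"idle: {idle_amps:.1f} A, worst case: {peak_amps:.1f} A, "
      f"overload risk: {at_risk}")
```

At idle the circuit looks comfortably loaded, yet a simultaneous spike to maximum draw would exceed the breaker rating – exactly the hidden overload described above.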
2. Hot Spots
In the data center or network room, most of the electrical power consumed by computing equipment is released as heat. When power consumption varies with load, heat output varies as well. As such, sudden fluctuations in power consumption can cause dangerous increases in heat production, creating hot spots. While data center cooling systems are put in place to regulate overall temperature, they may not be designed to handle specific, localized hot spots caused by increases in power consumption. As temperature rises, equipment is likely to shut down or behave abnormally. Furthermore, even if equipment continues to function, heat spikes may degrade it over time or void warranties.
Hot spots can also occur in a virtualized environment, where servers are more often installed and grouped in ways that create localized high-density areas. While this problem may seem surprising given virtual machines' inherent ability to dramatically decrease power consumption, the act of grouping or clustering these high-density virtualized servers can result in cooling problems.
3. Loss of Redundancy
To protect against potential power failure, many servers in data centers and network rooms utilize dual redundant power inputs designed to share the load equally between two paths. When one path fails, the load once supported by the failed feed is transferred to the remaining active feed – doubling that feed's load so it can fully support the server. To ensure that the remaining feed has the capacity to take over the complete load if necessary, the main AC branch circuits feeding the equipment must always be loaded to less than 50 percent of their ampacity. However, this can be difficult when loads experience variations in power consumption – equipment that measured at less than 50 percent during installation can, over time, begin to operate at much higher loads.
Should the inputs begin operating at greater than 50 percent of their rating, the system's redundancy and protection capabilities are eliminated. In this case, should one feed fail, the second will overload, its breaker will trip and power will be lost, causing data loss or corruption.
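The 50 percent rule above can be sketched as a simple check; the branch rating and feed readings below are hypothetical:

```python
# Sketch: with dual redundant feeds, a surviving feed must be able to carry
# the full load alone, so each feed must stay at or below 50% of its
# branch rating. Readings and ratings here are assumed values.

def redundancy_ok(feed_load_amps, branch_rating_amps):
    """True if a single feed failure would not overload the other feed."""
    return all(load <= 0.5 * branch_rating_amps for load in feed_load_amps)

# Two 20 A rated branches shared by dual-corded gear (hypothetical readings):
print(redundancy_ok([8.0, 8.5], branch_rating_amps=20))    # safe
print(redundancy_ok([11.0, 10.5], branch_rating_amps=20))  # redundancy lost
```

In the second case neither breaker has tripped, but a failure of either feed would push the survivor past its rating – redundancy has been silently lost.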
4. Masking the Problem
Because the equipment that exhibits variations in power consumption may represent only a small portion of the total equipment in the data center or network room, the potential issues this equipment can cause are often overlooked. For instance, if just five percent of the equipment in a given server environment experiences power variation of 2:1, and the remaining equipment draws constant power, the resulting bulk power measurement at the main feed or power distribution unit (PDU) will vary by only 2.5 percent. As such, an operator may be led to believe there is no real power variation issue, when in fact it is simply hidden.
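The arithmetic behind this example can be worked through directly (using an arbitrary 100 kW total for concreteness):

```python
# Worked version of the example above: 5% of the total load varies 2:1
# while the rest is constant, so the bulk reading moves only 2.5%.
# The 100 kW total is an arbitrary illustrative figure.

total_kw = 100.0               # hypothetical total at full load
varying_max = 0.05 * total_kw  # the varying 5% of equipment, at maximum draw
varying_min = varying_max / 2  # the same equipment at its 2:1 minimum

constant_kw = total_kw - varying_max
bulk_max = constant_kw + varying_max   # all equipment at maximum
bulk_min = constant_kw + varying_min   # varying equipment at minimum

swing = (bulk_max - bulk_min) / bulk_max
print(f"bulk reading swings only {swing:.1%}")
```

A 2:1 swing at the rack level thus shrinks to a 2.5 percent ripple at the PDU, which is why bulk measurements can mask a serious local problem.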
Managing Dynamic Power Variation: Solutions
To alleviate the aforementioned problems, data center and network room operators should become more aware of the potential for, and consequences of, dynamic power consumption. Below are several suggested ways to mitigate such issues.
1. Utilize Separate Branch Circuits for Each Server
When a separate branch circuit is provided to each server, every server operates from a dedicated circuit, so overloads and loss of redundancy cannot occur. While effective, this solution can be expensive and complex to deploy for small-server systems, as it can require large numbers of branch circuits per rack. For example, a rack of dual-corded 1U servers could require up to 84 individual branch circuits and two separate circuit breaker panelboards. When larger servers, such as blade servers, are used, this technique is more practical. Note that this type of solution does not mitigate thermal problems, such as hot spots.
2. Establish Safety Margin Standards for Worst Case and Measure Compliance at Install or on an Ongoing Basis
Most data center and network room operators have standards for loading margins, typically expressed as a fraction of the full-load branch circuit rating. Most often these values fall between 60 and 80 percent of the branch rating, with 75 percent considered a reasonable balance between power capacity, cost, and availability. To verify compliance with the standard, actual branch circuit loads must be measured. However, problems with this approach can arise when systems exhibit dramatically varying power consumption, as it is difficult to know the computational load at the time of measurement. Ideally, a heavy computational load would be placed on the protected equipment during measurement to ensure compliance under a worst-case scenario.
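A compliance check against such a standard is straightforward to express; the 75 percent margin and the measured values below are illustrative assumptions:

```python
# Sketch of a compliance check against a loading-margin standard.
# The 75% margin and the measured currents are assumed example values.

def compliant(measured_amps, branch_rating_amps, margin=0.75):
    """True if the measured load is within the loading-margin standard."""
    return measured_amps <= margin * branch_rating_amps

# A 20 A branch (15 A limit at a 75% margin), measured under heavy load:
print(compliant(13.0, 20.0))  # within the standard
print(compliant(16.5, 20.0))  # exceeds the standard
```

The check is only as good as the measurement, which is why the text recommends measuring while the equipment is under a heavy computational load.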
Additionally, keeping an extensive inventory of what equipment is connected to each branch circuit and summing the maximum potential load draws can help ensure that branch circuits do not suffer from overload (information regarding the maximum load of various equipment is available from the individual equipment manufacturers). This type of inventory is commonplace in large data centers but is not practical in all installations, as it requires that operators know exactly what equipment is plugged into every branch circuit at all times. For small data centers and network rooms, where operators can more easily protect against accidental equipment movement, this approach isn't necessary.
A third way to mitigate issues caused by dynamic power variation is to establish safety margins and continuously monitor all branch circuits with an automatic monitoring system. In this case, operators are alerted when branch loading enters the safety margin area. For example, when using a 60 percent branch loading standard, alerts should be sent as soon as loading passes 60 percent. This safety margin gives operators significant advance warning of a problem area, allowing them to take corrective action before an overcurrent condition occurs. This approach can also warn of impending loss of redundancy. Its specific advantage is that it applies to situations where users may, without the data center manager's knowledge, install, move or plug equipment into a different outlet – a scenario that usually occurs within a colocation facility or medium-security data center, where various personnel have access to the equipment. It is recommended that this method be used in conjunction with the aforementioned techniques.
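Such a monitoring rule amounts to a threshold check over the latest readings; the circuit names, readings and 60 percent threshold below are hypothetical:

```python
# Sketch of automated branch-circuit monitoring: flag any circuit whose
# load has crossed the safety-margin threshold (60% here). Circuit names,
# ratings and readings are all hypothetical example values.

THRESHOLD = 0.60  # alert once loading passes 60% of the branch rating

def check_branches(readings, rating_amps=20.0):
    """Return the circuits whose load has entered the safety margin."""
    return [name for name, amps in readings.items()
            if amps / rating_amps > THRESHOLD]

latest = {"rack-01/A": 9.5, "rack-01/B": 12.6, "rack-02/A": 7.0}
for circuit in check_branches(latest):
    print(f"ALERT: {circuit} above {THRESHOLD:.0%} of branch rating")
```

In practice this logic would run inside a monitoring system polling metered rack PDUs, rather than as a standalone script.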
3. Integrate a Data Center Management Solution
An additional method for ensuring protection against the problems caused by power variations is the use of data center infrastructure management (DCIM) software, which can monitor and report on the health and capacity status of the power and cooling systems and keep track of the various relationships between the IT gear and the data center or network room’s physical infrastructure.
DCIM software can provide insight into which servers, physical and virtual, are installed in a given rack and which power path and cooling system they are associated with. It can also help eliminate the risk of human error, a leading cause of downtime, which can take the form of IT load changes made without accounting for the status and availability of power and cooling at a given location. Automating both the monitoring of DCIM information (available rack space, power and cooling capacity, and health) and the implementation of suggested actions greatly reduces this risk.
Dynamic power variation in IT loads is an increasingly important issue, one that can give rise to a number of physical infrastructure problems that can be detrimental to the overall continuity of a business. To mitigate the risks of potential server downtime, data center and network room operators should consider the above suggested steps for proper planning and monitoring.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.