DCIM Implementation – the Challenges | Data Center Knowledge

This is Part 4 of our five-part series on the many decisions an organization needs to make as it embarks on the DCIM purchase, implementation, and operation journey. The series is produced for the Data Center Knowledge DCIM InfoCenter.

In Part 1 we gave an overview of the promises, the challenges, and the politics of DCIM. Read Part 1 here.

In Part 2 we described the key considerations an organization should keep in mind before starting the process of selecting a DCIM solution. Read Part 2 here.

In Part 3 we weighed DCIM’s benefits against its direct, indirect, and hidden costs. Read Part 3 here.

The first three parts of this series examined vendor promises, purchasing guidelines, and potential benefits of DCIM. However, while it all may look good on the whiteboard, actual implementation may not be quite as simple as vendors’ sales teams would suggest. Existing facilities, especially older ones, tend to have lower energy efficiency and far less energy monitoring. In this part we will review some of the challenges of retrofitting an operating data center, as well as some of the considerations for incorporating a DCIM system into a new design.

Facility Systems Instrumentation

Virtually all data centers have Building Management Systems (BMS) to supervise the operation of primary facility components. These generally monitor the status of the electrical power chain and its subsystems, including utility feeds, switchboards, automatic transfer switches, generators, UPS, and downstream power distribution panels. They are also connected to cooling system components. However, in many cases, BMS installations are not very granular in the amount and type of data they collect. In some cases, the information is limited to very basic device status (on-off) and alarm conditions.

Therefore, these sites are prime candidates for reaping the potential benefits of DCIM. In order for DCIM systems to gather and analyze energy usage information, they require remotely readable energy metering. Unfortunately, some data centers may not have any real-time utility energy metering at all and can only base their total energy usage on the monthly utility bill. While this has been the de facto practice for some sites in the past, it does not provide enough discrete data (or sometimes any data) about where the energy is used or about facility efficiency. More recently, DCIM systems (and some BMS) have been designed to measure and track far more granular information from all of these systems. However, the typical bottleneck in older facilities is the lack of energy meters in power panels, or the lack of remotely pollable internal temperature or other sensors within older cooling equipment such as CRAC/CRAH units or chillers.

Retrofitting energy metering and environmental sensors is one of the major impediments to DCIM adoption. This is especially true in sites with lower levels of redundancy in power and cooling systems. Metering requires the installation of current transformers (CTs) to measure current and potential transformers (PTs) to measure voltage. Although there are “snap-on” type CTs that do not require disconnecting a conductor to install them, OSHA has more recently restricted so-called “hot work” on energized panels, so some systems may need to be shut down to do the electrical work safely. And of course in the mission-critical data center world “shutdown” is simply not in the vernacular. So, in addition to getting funding, internal support, and resources for a DCIM project, this type of potentially disruptive retrofit work requires management approval and cooperation between the facility and IT domains, an inherent bottleneck in many organizations.

Basic Facility Monitoring: Start With PUE

At its most elementary level, a DCIM system should display real-time data and historic trends and provide annualized reporting of Power Usage Effectiveness (PUE). This involves installing energy metering hardware at the point of utility handoff to the facility and, at a minimum, also collecting IT energy usage (typically at the UPS output). However, for maximum benefit, other facilities-related equipment (chillers, CRAH/CRACs, pumps, cooling towers, etc.) should have energy metering and environmental monitoring sensors installed. This allows DCIM to provide an in-depth analysis and permits optimization of the cooling infrastructure performance, as well as early failure detection warnings and predictive maintenance functions.
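To make the calculation concrete, PUE is total facility energy divided by IT equipment energy over the same interval. The following is only a minimal sketch: the function name and the meter readings are hypothetical, and it assumes the utility handoff meter and the UPS output meter cover the same period.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive to compute PUE")
    return total_facility_kwh / it_kwh


# Hypothetical readings for the same interval (not from any real site).
utility_handoff_kwh = 1800.0  # energy metered at the utility handoff
ups_output_kwh = 1000.0       # IT energy metered at the UPS output

example_pue = pue(utility_handoff_kwh, ups_output_kwh)
print(f"PUE = {example_pue:.2f}")
```

A value closer to 1.0 means less of the facility's energy goes to cooling, power conversion, and other overhead.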

Whitespace: IT Rack-Level Power Monitoring

While metering total IT energy at the UPS output is the simplest and most common method to derive PUE readings, it does not provide any insight into how IT energy is used. This insight is necessary to fulfill the promised holistic view of the overall data center, not just the facility. However, compared to the facility equipment, the number of racks (and IT devices), and therefore the number of required sensors, is far greater. The two areas that have been given the most attention at the rack level are power/energy metering and environmental sensors. The two most common places to measure rack-level power/energy are at the floor-level PDU (with branch circuit monitoring) or at metered PDUs within the rack (intelligent power strips, some of which can even meter per outlet to track the energy used by each IT device).

From a retrofit perspective, if the floor-level PDU is not already equipped with branch-circuit current monitoring, adding CTs to each individual cable feeding the racks is subject to the same “hot work” restrictions as any other electrical work, another impediment to implementation. However, another method of measuring rack-level IT equipment power, which has been used for many years, is the installation of metered rack power distribution units (rack power strips). This normally avoids any hot work, since the rack PDUs plug into existing receptacles. While installing a rack PDU does require briefly disconnecting the IT equipment to replace a non-metered power strip, it can potentially be far less disruptive than the shutdown of a floor-level PDU, since it can be done one rack at a time (and if the IT hardware is equipped with dual power supplies, it may not require shutting down the IT equipment). While this is also true for A-B redundant floor-level PDUs, some people are more hesitant to shut one down, in case some servers do not have their dual-feed A-B power supply cords correctly plugged in to the matching A-B PDUs.

The rack-level PDU also commonly communicates over TCP/IP (using SNMP), so it can connect via the existing cabling and network. However, while this avoids the need to install specialized cabling to each rack, it is not without cost. Network cabling positions are an IT resource, as are network ports on an expensive production switch. The most cost-effective option may be to add a low-cost 48-port switch for each row to create a dedicated network, which can also be isolated for additional security.

Environmental Monitoring

Besides power monitoring at the rack level, environmental monitoring is a top concern to ensure that the IT equipment is not overheating.

The expected reliability of IT equipment is in part related to maintaining proper environmental conditions within the manufacturer’s requirements. According to ASHRAE’s Thermal Guidelines for Data Processing Environments, 3rd edition (2012), just taking average readings in the middle of the cold aisle is no longer sufficient. This is something that most BMS generally do not address very well, or at all.

Environmental monitoring at every rack has become critical in facilities where power densities have risen. Each rack should have one or more temperature sensors inside the front face of the rack to properly monitor the intake temperatures of the IT equipment, in order to ensure they remain within the recommended range (ASHRAE recommends up to three per rack, depending on power/heat density). However, this can also be cable-intensive and expensive if separately hardwired. The two most common solutions are to use wireless sensors or rack-level PDUs that can be equipped with environmental sensors as plug-in accessories.
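As an illustration of rack-level intake monitoring, the sketch below flags any rack with a sensor reading outside a configured range. The rack names and readings are hypothetical. The 18 to 27 °C defaults are the commonly cited ASHRAE recommended envelope and are used here as an assumption; confirm the limits against the guideline edition and equipment class that apply to your site.

```python
from typing import Dict, List

# Assumed limits in degrees Celsius; confirm against the ASHRAE edition and
# equipment class that apply to your site before relying on them.
RECOMMENDED_MIN_C = 18.0
RECOMMENDED_MAX_C = 27.0


def racks_out_of_range(
    readings: Dict[str, List[float]],
    low: float = RECOMMENDED_MIN_C,
    high: float = RECOMMENDED_MAX_C,
) -> Dict[str, float]:
    """Map each offending rack to its reading furthest outside [low, high]."""
    flagged: Dict[str, float] = {}
    for rack, temps in readings.items():
        offenders = [t for t in temps if t < low or t > high]
        if offenders:
            # Keep the reading with the largest excursion beyond either limit.
            flagged[rack] = max(offenders, key=lambda t: max(low - t, t - high))
    return flagged


# Hypothetical intake readings: bottom, middle, and top of each rack face.
intake_c = {
    "R01": [21.5, 23.0, 24.5],
    "R02": [22.0, 26.0, 29.5],
    "R03": [17.0, 19.0, 20.0],
}
flagged = racks_out_of_range(intake_c)
print(flagged)
```

Checking every sensor rather than a per-rack average matters because a hot spot at the top of a rack can be hidden by cooler readings lower down.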

There is also another potential benefit of monitoring the power being used by each rack: it allows DCIM to create a real-time “heat map.” This is different from the intake temperature sensing discussed above, since it provides the additional indication of the sources of heat loads. By using this in conjunction with rack-level intake temperature monitoring, airflow management issues (the major cause of “hot spots”) can be identified, corrective changes can then be made, and the results can be immediately monitored and tracked. Once the airflow management issues have been resolved or substantially mitigated, the cooling system temperatures can be slowly raised and the rack-level temperature ranges closely monitored to see the results. This allows energy savings while reducing the risk of IT equipment failure due to exposure to high temperatures. It is also the basis for DCIM systems to provide granular rack-level capacity management, showing the percentages of power and cooling capacity available for additional IT equipment.
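The capacity figure mentioned above is simple arithmetic once per-rack power is metered. The sketch below is a hypothetical illustration, not any vendor’s method: the function name, the rack ratings, and the measured loads are all assumptions.

```python
from typing import Dict, Tuple


def remaining_capacity_pct(used_kw: float, rated_kw: float) -> float:
    """Percentage of a rack's rated power still available (never below zero)."""
    if rated_kw <= 0:
        raise ValueError("rated capacity must be positive")
    return max(0.0, (rated_kw - used_kw) * 100.0 / rated_kw)


# Hypothetical racks: (measured load in kW, rated capacity in kW).
racks: Dict[str, Tuple[float, float]] = {
    "R01": (3.5, 5.0),
    "R02": (4.8, 5.0),
    "R03": (6.0, 5.0),  # overloaded rack, reported as 0% available
}

headroom = {name: remaining_capacity_pct(used, rated)
            for name, (used, rated) in racks.items()}
for name, pct in headroom.items():
    print(f"{name}: {pct:.0f}% power capacity available")
```

The same calculation can be applied to cooling capacity where per-rack cooling ratings are known.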

Communications Interoperability and Network Bandwidth Issues

There are a number of facility-side communications interoperability issues. While generally speaking there are far fewer facility devices to monitor, both in the main power chain (e.g., utility, generator, power distribution panels, UPS) and among cooling system components (CRAC/CRAH, chillers, pumps, cooling towers, etc.), they use a variety of communications protocols, which creates a problem.

First and foremost, there is BACnet, which was originally developed by ASHRAE in conjunction with cooling and control equipment manufacturers in the late 1980s. In 1995, ASHRAE published the standard as ASHRAE 135-1995 (which also became ANSI/ASHRAE 135-1995). However, before then many equipment vendors chose to keep portions of the protocol proprietary, which is still a problem for many existing systems and even some newer systems.

There is also Modbus, a real-time signaling protocol used in many applications, which can interface with BACnet. Modbus was originally implemented over dedicated serial communications lines (and commonly still is). However, it is also now implemented as Modbus TCP, which can use standard Ethernet networks. In addition, there is the proprietary LonTalk protocol, which can use unshielded twisted pair (UTP) cable, similar to standard Ethernet cable, but it does not use the Ethernet protocol and therefore cannot be sent over IT Ethernet networks. Because of the widespread use and global acceptance of TCP/IP, ASHRAE updated the BACnet protocol in 2001 to include BACnet/IP, which runs over UDP/IP.
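To show what moving Modbus onto Ethernet means in practice, the sketch below builds the bytes of a Modbus TCP “read holding registers” request: a 7-byte MBAP header followed by the same function code and register fields used on serial links. The function name, unit ID, and register address are hypothetical; a real meter’s register map comes from its vendor documentation.

```python
import struct

READ_HOLDING_REGISTERS = 0x03  # standard Modbus function code


def build_read_holding_registers(transaction_id: int, unit_id: int,
                                 start_address: int, quantity: int) -> bytes:
    """Return the raw bytes of a Modbus TCP read-holding-registers request."""
    if not 1 <= quantity <= 125:
        raise ValueError("quantity must be between 1 and 125 registers")
    # PDU: function code (1 byte), start address (2 bytes), quantity (2 bytes).
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    # MBAP header: transaction id, protocol id (always 0), length of the
    # bytes that follow the length field (unit id + PDU), then the unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


# Hypothetical request: read 2 registers starting at address 0 from unit 1.
frame = build_read_holding_registers(transaction_id=1, unit_id=1,
                                     start_address=0, quantity=2)
print(frame.hex())
```

A frame like this would typically be sent over a TCP connection to port 502 on the device or gateway, which is why it can share a standard Ethernet network where serial Modbus cannot.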

While technically speaking BACnet is now a nonproprietary protocol that became an ISO standard in 2003, many older implementations tended to be vendor-specific and somewhat proprietary. Moreover, these multilayered, intertwined protocols are still the BMS and cooling equipment manufacturers’ protocols of choice (conceivably for competitive advantage or a vested financial interest in customer lock-in). As was mentioned previously in this series, BMS vendors generally like to keep their systems within their domain, making system interfacing and integration with multivendor equipment difficult and more expensive to implement. Nonetheless, BACnet/IP, as well as Modbus TCP, are slowly being accepted by the facilities world and implemented over standard Ethernet IP-based networks (see the issues in the security section below).

Whitespace Communications TCP/IP

Things are more straightforward for IT systems, since TCP/IP is the universal lingua franca; moreover, interoperability is the mantra of virtually all IT hardware and software. However, in the whitespace the number of racks (and IT devices) is far greater. And although they can most commonly be accessed over a common TCP/IP network, this can generate a very large volume of traffic, as well as a huge number of data points that need to be written and stored in the centralized database, which over time can become very large. While large data storage is not an insurmountable impediment, managing the polling intervals and the length of data retention and archiving in order to generate useful trend reports requires paying close attention to the system design parameters and its administration. While the data can be transmitted via the production network, good network practice is to keep it separate: at the very least on a separate management VLAN, or, as mentioned above, on a physically separate Ethernet network (which requires more cabling) with its own row-based switches, perhaps connected to separate routers and even firewalls (see security below).
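A rough sizing exercise shows how polling interval and retention drive database growth. This is a back-of-the-envelope sketch only: the device count, points per device, and bytes per stored sample are hypothetical assumptions, and real DCIM databases add indexing and other overhead.

```python
SECONDS_PER_DAY = 86_400


def storage_estimate_bytes(devices: int, points_per_device: int,
                           poll_interval_s: int, retention_days: int,
                           bytes_per_sample: int) -> int:
    """Estimate raw storage for polled data points, before any overhead."""
    if poll_interval_s <= 0:
        raise ValueError("polling interval must be positive")
    samples_per_point_per_day = SECONDS_PER_DAY / poll_interval_s
    total_samples = (devices * points_per_device
                     * samples_per_point_per_day * retention_days)
    return int(total_samples * bytes_per_sample)


# Hypothetical deployment: 500 rack PDUs, 10 points each, one year retained,
# 16 bytes per stored sample (timestamp plus value; an assumption).
every_minute = storage_estimate_bytes(500, 10, 60, 365, 16)
every_five_minutes = storage_estimate_bytes(500, 10, 300, 365, 16)

print(f"60 s polling:  {every_minute / 1e9:.1f} GB per year")
print(f"300 s polling: {every_five_minutes / 1e9:.1f} GB per year")
```

Under these assumptions, lengthening the polling interval from one minute to five cuts raw storage by a factor of five, which is the kind of trade-off against trend resolution that the system design has to settle.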

Direct IT Monitoring

There is another, more sophisticated method of monitoring whitespace operations, which can reduce the need for, and cost of, a myriad of metering and environmental sensors. This involves polling the IT equipment directly. Virtually all modern IT hardware has onboard management systems that can report key operating parameters, such as power usage, air intake and internal temperatures, CPU load percentages, and many other conditions. In fact, for many years asset and network IT management systems have been monitoring the processing, storage, and network information, but generally not directly tracking the power and environmental information. Some DCIM systems are capable of polling IT hardware directly for these parameters, but this requires cooperation from IT administrators, who are likely to view it as a security issue.

Security

In today’s environment, security is clearly a top concern across virtually every aspect of every business. Data center facilities, and especially their IT loads, obviously represent high-value targets. Therefore, every aspect now needs to be scrutinized as a potential threat vector. One of the challenges for new builds or retrofits is the increased potential for previously unforeseen security holes related to DCIM access to both facility and IT systems. This is not to say that DCIM software is inherently insecure. However, by its very nature of having centralized, octopus-like tentacles with access to various critical systems, it increases both the number of potential points of entry and the breadth of targets once a weakness is discovered and exploited. This type of potential threat scenario was recently demonstrated by the breach of Target, wherein IT systems were infiltrated (and millions of credit cards compromised) via the login of an HVAC system vendor that had been given remote network access to monitor the HVAC equipment. Although DCIM may not have been directly involved in this scenario, it only heightens security concerns and further impedes implementation.

For example, BACnet was originally meant to operate over dedicated wiring within the facility and was later “upgraded” to be accessible remotely over standard TCP/IP networks (BACnet/IP). It uses UDP/IP, which is typically blocked by most corporate firewalls for security reasons. And while in the past some firewall administrators could sometimes be persuaded to allow limited UDP access, the recent rash of massive security breaches will further impede the adoption of BACnet/IP as a preferred protocol.

Furthermore, one of the features of DCIM is the centralized management of larger multisite environments, which also increases both the potential threat points and the scope of damage. At the very least, filters and access lists of firewalls and routers will need to be examined and revised to allow DCIM to communicate with devices across internal and external boundaries and domains. Even Simple Network Management Protocol (SNMP), a protocol commonly used by IT devices and most rack PDUs (and one that has been repeatedly revised to improve security), is still not considered very secure and is therefore seen as another potential threat vector. So, in addition to all the other aspects and costs of implementation, prudence may dictate consideration of additional internal and external firewalls and intrusion detection systems.
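When reviewing access lists, it can help to write down exactly which monitoring flows need to be permitted. The sketch below is an illustrative allow-list check, not firewall configuration: the class and function names are hypothetical, and the ports shown are the well-known defaults for each protocol, which a given site may have changed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    name: str
    transport: str  # "tcp" or "udp"
    port: int


# Well-known default ports; treat these as assumptions and verify them
# against the actual device and gateway configuration on site.
MONITORING_RULES = (
    Rule("SNMP polling", "udp", 161),
    Rule("SNMP traps", "udp", 162),
    Rule("Modbus TCP", "tcp", 502),
    Rule("BACnet/IP", "udp", 47808),
)


def is_allowed(transport: str, port: int,
               rules: Sequence[Rule] = MONITORING_RULES) -> bool:
    """Return True only if the flow matches an explicitly listed rule."""
    return any(r.transport == transport.lower() and r.port == port
               for r in rules)


print(is_allowed("udp", 47808))  # BACnet/IP default port
print(is_allowed("tcp", 23))     # Telnet is not on the allow-list
```

Listing the required flows explicitly, and denying everything else, keeps the number of openings between the facility, monitoring, and production networks as small as possible.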

Considerations for New Builds

From a facilities perspective, new builds offer the best and easiest opportunity to incorporate DCIM metering and sensors into the design of the power and cooling systems before the facility is built. When budgeting for the DCIM project, remember that facility-side systems have a comparatively long lifecycle (10-15 years or even more) when compared to IT equipment. So, when faced with making any compromises due to initial budgetary limitations, consider that DCIM software can be purchased after facility construction, and upgraded or replaced, without impacting data center operations. Conversely, installing energy metering, which typically requires that the electrical panel be de-energized, can be disruptive and is best done once, up front, rather than adding more energy metering or other instrumentation (such as chilled water flow meters) afterwards, which may require further equipment shutdowns.

Adding rack-level energy/power metering and environmental monitoring for IT offers more flexible options, some of which can be less disruptive, as described above. Nonetheless, every floor-level PDU in a new build should have branch circuit monitoring.

The Bottom Line

The above examples represent only a sample cross-section of the challenges of a DCIM implementation. And while potentially disruptive electrical work is a significant factor inhibiting retrofit implementations, it is not insurmountable. Consider a phased approach to DCIM projects, especially for a retrofit. Even if the work must be done in coordinated phases (assuming there are some redundant systems), it is still highly recommended that the additional metering be installed to gain maximum functionality from a DCIM implementation.

The IT monitoring aspects can be less intrusive, but face heightened (and justified) security issues. This is not insoluble, but is an important consideration in general. To put this into context, IT systems are constantly being probed for unknown weaknesses from many avenues, and these threats need to be dealt with and mitigated virtually every day.

These issues will need to be weighed in relation to the size of your data center facility, and some tradeoffs and compromises may also need to be made. This could affect the relative value of the potential benefits when assessed against the overall costs. Consider doing a pilot project first to gain experience, and then use that as a basis to document the issues and more accurately project the resources required to overcome implementation challenges in a full-scale deployment.

How do you justify the price tag of DCIM? Come back to read part five of this series in the Data Center Knowledge DCIM InfoCenter.
