With more than 100 companies offering some type of Data Center Infrastructure Management (DCIM) solution (see Appendix 1 of the DCK Guide to DCIM for a partial list of vendors), it is difficult to narrow down a defined set of functional components. There are some critical elements found in many of the solutions, which include:
Asset, Change and Configuration Management
Asset management is a key component of DCIM. A data center can contain thousands of assets, from servers, storage and network devices to power and cooling infrastructure equipment. Tracking these assets is an ongoing and often monumental task. A Digital Realty Trust survey asked data center managers how long it could take to find a server that has gone down. Only 26% of the respondents said they could locate the server within minutes. Only 58% could find the server within 4 hours, and 20% required more than a day. The inability to locate equipment in the data center increases the mean time to repair (MTTR) for the equipment and decreases overall availability.
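The link between MTTR and availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function name and the hour figures are assumptions chosen for the example, not values from the survey.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical server that fails about once a year (8760 hours).
# Case 1: the server is located and repaired within an hour.
# Case 2: a day is lost just finding it before the same one-hour repair.
found_quickly = availability(mtbf_hours=8760, mttr_hours=1)
found_slowly = availability(mtbf_hours=8760, mttr_hours=25)

print(f"located in minutes: {found_quickly:.5f}")
print(f"located after a day: {found_slowly:.5f}")
```

The time spent searching counts toward MTTR in full, so any improvement in locating equipment translates directly into higher availability.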
However, asset management encompasses more than simply locating a data center asset. It also involves knowing detailed information about the asset’s configuration. For example, a server may be powered by one or more rack power strips. Disconnecting these power sources will shut down the server. The server may be connected to one or more switches or routers. Rerouting these network devices may make the server unreachable. The server may host multiple virtual machines. Shutting down the server will disable these virtual machines. Without knowing the details of the server configuration, it is very difficult to make reasonable decisions concerning that server and its supporting infrastructure. Changes to any part of the configuration may render the server, and its associated services, unusable.
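One way to picture this kind of configuration data is as a dependency graph that can be walked to see what a change might affect. The following is a minimal sketch under stated assumptions: the asset names and the `SUPPORTS` mapping are hypothetical, and the traversal treats every link as a potential impact without modeling redundant power or network paths.

```python
from collections import deque

# Hypothetical inventory: each key supports the assets listed as its value.
SUPPORTS = {
    "power-strip-a": ["server-01"],
    "switch-1": ["server-01"],
    "server-01": ["vm-web", "vm-db"],
}


def impacted_by(asset: str, supports: dict[str, list[str]]) -> set[str]:
    """Return every asset downstream of `asset`, found breadth-first."""
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for dependent in supports.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Which assets could be affected if the power strip is disconnected?
print(sorted(impacted_by("power-strip-a", SUPPORTS)))
```

A real DCIM tool would hold far richer relationships, but even this simple walk shows why a change to a power strip needs to be evaluated against the virtual machines two levels above it.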
In order to accurately manage assets and their detailed configurations, we must also manage change. It is estimated that change is often the cause of as much as 80% of system downtime and that 80% of mean time to repair (MTTR) is used trying to determine what changed. Change management therefore becomes an important part of a DCIM solution. In the book The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps, the authors examined a number of high performing IT organizations and found that by just looking at the scheduled and authorized changes for an asset (as well as the actual detected changes on the asset) problem managers could recommend a fix to the problem over 80% of the time, with a first fix rate of over 90%. The authors also found that organizations which implemented automated change auditing were “surprised and alarmed to see how many changes are being made ‘under the radar’.” The ability to track both authorized changes and detected changes — changes made but not necessarily authorized — is key DCIM functionality which can reduce MTTR and increase overall system availability.
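The distinction between authorized and detected changes can be sketched as a simple set comparison. The record format below, an (asset, description) pair, and the sample entries are assumptions made for illustration; an actual change-auditing system would track far more detail.

```python
def unauthorized_changes(
    authorized: list[tuple[str, str]],
    detected: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return changes that were detected on assets but never authorized."""
    return sorted(set(detected) - set(authorized))


# Hypothetical change records: (asset, description of the change).
authorized = [
    ("server-01", "firmware update"),
    ("switch-1", "vlan reassignment"),
]
detected = [
    ("server-01", "firmware update"),
    ("server-01", "power feed moved"),
    ("switch-1", "vlan reassignment"),
]

for asset, description in unauthorized_changes(authorized, detected):
    print(f"unauthorized change on {asset}: {description}")
```

When a problem is reported, a list like this gives the problem manager a short set of candidates to check first.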
Real-Time Monitoring
There are three categories of real-time monitoring systems in the data center:
- Building Management System (BMS) – A BMS is typically a hardware-based system utilizing Modbus, BACnet, OPC, LonWorks or Simple Network Management Protocol (SNMP) to monitor and control the building mechanical and electrical equipment. These are often custom-built systems priced on the number of individual data points being monitored (a data point might be the output load on a UPS or the return temperature on a computer room air conditioner unit). In some cases, the BMS is extended into the data center to monitor and control power and cooling equipment.
- Network Management System (NMS) – An NMS is typically a software-based system utilizing SNMP to monitor the network devices in the data center. Network devices can usually be auto-discovered, so installation can be automated to some degree.
- Data Center Monitoring System (DCMS) – A DCMS can be hardware-based and/or software-based and is used to monitor a data center or computer room. Device communication is typically done using SNMP, although some data center monitoring systems can also communicate using Modbus, IPMI or other protocols.
There are some important attributes to consider when evaluating the real-time monitoring capabilities of a DCIM solution. One of the key considerations is what devices you intend to monitor. The answer to this question may have the biggest impact on the solution chosen.
If, for example, you want to monitor some devices which use SNMP to communicate and others which use Modbus, it would be important to choose a solution which supports both SNMP and Modbus protocols. Avoid solutions which only work with one vendor’s specific equipment as you will then need to purchase multiple disparate systems to monitor your entire data center. Ideally, you want a DCIM solution that can work with a wide variety of hardware “out of the box” — in other words, without any vendor customization — and can also integrate with other existing monitoring systems such as a BMS.
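The value of multi-protocol support can be illustrated with a small dispatch table that routes each device to a reader for its protocol. This is a sketch only: `read_snmp` and `read_modbus` are placeholder functions that return canned values and stand in for real client code, and the device names and addresses are hypothetical.

```python
from typing import Callable


def read_snmp(address: str) -> float:
    """Placeholder for a real SNMP query; returns a canned value."""
    return 4.2


def read_modbus(address: str) -> float:
    """Placeholder for a real Modbus register read; returns a canned value."""
    return 21.5


# Protocols this hypothetical tool supports out of the box.
READERS: dict[str, Callable[[str], float]] = {
    "snmp": read_snmp,
    "modbus": read_modbus,
}


def poll(devices: list[tuple[str, str, str]]) -> tuple[dict[str, float], list[str]]:
    """Poll (name, protocol, address) devices; report the unsupported ones."""
    readings: dict[str, float] = {}
    unsupported: list[str] = []
    for name, protocol, address in devices:
        reader = READERS.get(protocol)
        if reader is None:
            unsupported.append(name)
            continue
        readings[name] = reader(address)
    return readings, unsupported


devices = [
    ("rack-pdu-1", "snmp", "10.0.0.11"),
    ("crac-2", "modbus", "10.0.0.20"),
    ("chiller-1", "bacnet", "10.0.0.30"),
]
readings, unsupported = poll(devices)
print(readings)
print("needs another system:", unsupported)
```

Every device that lands in the unsupported list represents equipment that would require a separate monitoring system, which is the cost the paragraph above warns about.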
Another attribute to consider is whether or not the real-time monitoring utilizes a hardware component. There is nothing inherently wrong with a hardware-based system. In fact, a hardware-based system may be capable of gathering data more quickly and frequently than a software-based system. Depending on the number of hardware components required and the price of each component, however, the hardware cost may cause the overall DCIM solution to become prohibitively expensive.
One additional attribute to consider is whether or not the system supports auto-discovery of devices. Auto-discovery provides many benefits, including faster, easier installation and less chance for user error in manually configuring a device. It is important to note that not all devices can be auto-discovered, because discovery depends on the device configuration and the communication protocol used. For example, SNMP devices can usually be discovered while Modbus devices cannot.
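In outline, auto-discovery amounts to walking an address range and asking each address whether a manageable device answers. The sketch below uses Python's standard `ipaddress` module for the range; the `probe` callback is an assumption that stands in for whatever protocol query a real tool would perform, and here it is faked so the example runs without a network.

```python
import ipaddress
from typing import Callable


def discover(cidr: str, probe: Callable[[str], bool]) -> list[str]:
    """Return the host addresses in `cidr` for which `probe` reports a device."""
    network = ipaddress.ip_network(cidr)
    return [str(ip) for ip in network.hosts() if probe(str(ip))]


# Fake probe for illustration: pretend only these addresses respond.
RESPONDING = {"192.0.2.2", "192.0.2.5"}


def fake_probe(address: str) -> bool:
    return address in RESPONDING


print(discover("192.0.2.0/29", fake_probe))
```

Devices that cannot answer such a probe, for whatever reason, still have to be entered by hand, which is why discovery coverage varies by protocol.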
Workflow
Many data centers have implemented at least some level of ITIL-like processes. A DCIM solution can help you to orchestrate these processes. For example, the installation of a new server typically has multiple steps, some of which may be performed by different groups within the data center.
A DCIM solution might allow tracking of the various steps, with different groups able to report the status of their individual tasks in order to verify that all required steps have been completed. In this case, workflow functionality will coordinate the server installation steps so that all preparatory work has been completed before the technician installs the server in the rack, thereby streamlining the entire process.
It is important that the workflow functionality provided by the DCIM tool is adaptable to work within your defined process structure rather than having to modify your processes to match a pre-defined workflow.
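The server installation example can be sketched as a set of tasks with prerequisites, where a task becomes available only once everything it depends on is done. The task names, owning groups and `Task` structure below are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    owner: str
    depends_on: list[str] = field(default_factory=list)
    done: bool = False


def ready_tasks(tasks: dict[str, Task]) -> list[str]:
    """Return unfinished tasks whose prerequisites are all complete."""
    return [
        task.name
        for task in tasks.values()
        if not task.done and all(tasks[dep].done for dep in task.depends_on)
    ]


# Hypothetical server installation workflow spanning three groups.
tasks = {
    "provision-power": Task("provision-power", owner="facilities"),
    "assign-network-port": Task("assign-network-port", owner="network"),
    "install-server": Task(
        "install-server",
        owner="operations",
        depends_on=["provision-power", "assign-network-port"],
    ),
}

print(ready_tasks(tasks))  # preparatory tasks only
tasks["provision-power"].done = True
tasks["assign-network-port"].done = True
print(ready_tasks(tasks))  # installation is now unblocked
```

Because the steps and dependencies are plain data, a structure like this can be reshaped to match an existing process rather than forcing the process to match the tool.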
Analytics and Reporting
Another important capability of a DCIM solution is data analysis and reporting. With thousands of devices in the data center each reporting multiple measurements, the amount of data collected can quickly become overwhelming. It is imperative that the DCIM tool can quickly sort through this data and provide actionable recommendations for the management team. These recommendations can be presented in the form of alarm messaging, graphing of historical data to show changes over time, dashboards and reports. The DCIM tools may come with pre-defined reports but should also support ad-hoc reporting based on user-selectable parameters.
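As a small illustration of turning raw measurements into something actionable, the sketch below summarizes readings per sensor and flags any that exceed a limit. The sensor names, the temperature values and the 30-degree limit are assumptions for the example.

```python
from statistics import mean


def summarize(readings: dict[str, list[float]], limit: float) -> dict[str, dict]:
    """Summarize each sensor's readings and flag those exceeding `limit`."""
    report = {}
    for sensor, values in readings.items():
        peak = max(values)
        report[sensor] = {
            "avg": round(mean(values), 1),
            "max": peak,
            "alarm": peak > limit,
        }
    return report


# Hypothetical inlet temperatures in degrees Celsius.
readings = {
    "rack-a-inlet": [22.0, 23.0, 24.0],
    "rack-b-inlet": [26.0, 31.0],
}

for sensor, stats in summarize(readings, limit=30.0).items():
    status = "ALARM" if stats["alarm"] else "ok"
    print(f"{sensor}: avg={stats['avg']} max={stats['max']} {status}")
```

The same summary could feed an alarm message, a dashboard tile or a scheduled report, depending on who needs to see it.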
Visualization of the Physical and Virtual Infrastructure
One important component of a DCIM solution is the ability to view the physical and virtual infrastructure. The DCIM tools on the market today vary widely in their capabilities here. Some interact with visualization tools such as AutoCAD or Visio, while others provide a visual editor to allow you to lay out your infrastructure entirely within the tool. While most of the current solutions provide top-down views, some also provide 3-D views with the ability to “fly through” the data center. Many solutions provide various layered views of the data center with the ability to view various parameters such as temperature, rack utilization, power and so on.
This visualization typically extends down to the rack level, with DCIM tools showing the devices in each rack. The rack view shows the actual location of a device within the rack and also provides additional data such as the temperature at various points in the rack and the power usage within the rack.
User Interface
If DCIM boils down to information, a good DCIM user interface boils down to providing that information in such a way as to allow the user to make informed decisions. In his white paper Five Essential Components of an Elegantly Engineered Data Center Operating System, Kevin Malik describes the importance of the DCIM user interface, saying, “It is essential for a data center operating system to have an intuitive interface so users can quickly navigate through alerts, review environmental levels and review other detailed analytics.” He goes on to add, “Companies should be able to customize the views of real-time data of mechanical, power, cooling and electrical usage so decision-makers see information needed based on their roles to optimize data center operations.”
Like the visualization component, DCIM user interfaces vary widely in both their look and feel and their overall capabilities. While most DCIM products are web-based, allowing access to the data from anywhere, the user interfaces can take many forms, including dashboards, touch-screen technology and application support for hand-held devices such as iPads and smartphones.
Capacity Planning
One of the primary uses for the data collected by DCIM applications is to provide information for capacity planning. Data centers operate most efficiently when they maximize the use of key resources, particularly power and cooling. By storing the resource consumption over time and analyzing growth patterns, data center managers can more accurately predict when a given resource will be exhausted. Through the use of DCIM tools, data center builds can frequently be postponed due to more effective management of key resources.
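Predicting when a resource will run out can be sketched as fitting a straight line to historical consumption and extending it to the capacity limit. This is a simplified illustration that assumes linear growth and evenly spaced monthly samples; the function name and the kilowatt figures are hypothetical, and real growth patterns are rarely this tidy.

```python
from typing import Optional


def months_until_exhausted(monthly_kw: list[float], capacity_kw: float) -> Optional[float]:
    """Estimate months from the latest sample until usage reaches capacity.

    Fits a least-squares line to the samples. Returns None if usage is
    flat or falling, since the fitted line then never reaches capacity.
    """
    n = len(monthly_kw)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(monthly_kw) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_kw)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Month index where the fitted line meets capacity, relative to the last sample.
    return (capacity_kw - intercept) / slope - (n - 1)


# Hypothetical power draw over four months against a 200 kW limit.
history = [100.0, 110.0, 120.0, 130.0]
print(months_until_exhausted(history, capacity_kw=200.0))
```

An estimate like this is only as good as the stored history behind it, which is why retaining consumption data over time matters.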
Integration with Other Data Center Management Solutions
Contrary to what some DCIM vendors might have you believe, DCIM solutions will likely never replace all of the management tools available for the data center space. Typical management solutions include change management, CFD modeling, asset management, building management systems, maintenance management and a number of other third-party or in-house developed tools. A good DCIM solution will provide some type of integration with external systems, ranging from loading Excel spreadsheets to direct database interaction to a sophisticated web-based API (application programming interface) that allows data to be passed both into and out of the DCIM solution.
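At the simple end of that range, integration can be as basic as reading a spreadsheet export and handing the records to another system in a common format. The sketch below uses only Python's standard `csv` and `json` modules; the column names and the sample row are assumptions, and it works on CSV text rather than a native Excel file.

```python
import csv
import io
import json


def csv_to_records(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of records."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def records_to_json(records: list[dict[str, str]]) -> str:
    """Serialize records as JSON for hand-off to another system."""
    return json.dumps(records, indent=2)


# Hypothetical asset list exported from a spreadsheet.
export = "asset,rack,unit\nserver-01,R12,18\nswitch-1,R12,42\n"

records = csv_to_records(export)
print(records_to_json(records))
```

A web-based API performs the same kind of translation, but continuously and in both directions, which is what makes it the more capable option.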