Sev Onyshkevych is the Chief Marketing Officer at FieldView Solutions.
TechTarget defines resilience as “the ability of a server, network, storage system or an entire data center to continue operating even when there has been an equipment failure, power outage or other disruption.”
The more complex the system, the harder resilience is to define and to calculate. At the data center level, resilience is especially hard to pin down because data centers are like living things: they’re constantly changing. So the methods for keeping a data center up and running are evolving too.
Traditionally, data center managers ensured uptime with redundancy. They had duplicates of everything: two power sources, two servers, two connections, in effect two whole data centers. Often they had more still, with spare units, backup and disaster recovery sites, and extra capacity for that “black swan” day. In many cases, all that redundancy means running at 10 to 15 percent of your total capacity. That way, if something fails, you have a cascading hierarchy of redundant systems to take over.
That's great for peace of mind, but not great for the bottom line, especially as data centers expand and operational costs increase. There had to be a better way to ensure uptime than having equipment sitting idle, not doing a bit of work, but still drawing power, creating heat, and taking up space.
Enter Data Center Infrastructure Management (DCIM) monitoring. DCIM software monitors all the critical systems in a data center in real time so users know how to optimize the use of space, power, cooling, and network capacity. What’s more, DCIM monitoring raises an alarm when something is trending toward failure, before the catastrophe happens, so operators can make changes to head off the risk.
While it’s important for data center operators to see what’s happening in their facility in real time, DCIM can also help with future planning. When data center managers know what equipment they currently have, how much power it’s drawing and where that power is coming from, among other vital information, they can determine how much more equipment their facility can handle. And by optimizing capacity, they can delay, or altogether eliminate, the need to construct a new facility.
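To make the capacity-planning idea concrete, here is a minimal Python sketch of a headroom calculation. The circuit names, the wattage figures, and the 80 percent safety margin are all illustrative assumptions, not values from any particular DCIM product or facility.

```python
from dataclasses import dataclass


@dataclass
class Circuit:
    name: str
    rated_w: float     # hypothetical rated capacity of the circuit, in watts
    measured_w: float  # hypothetical peak draw reported by monitoring, in watts


def headroom_w(circuit: Circuit, safety_margin: float = 0.8) -> float:
    """Watts still usable on a circuit while staying under the safety margin.

    The 0.8 default is an assumed planning margin, not a standard.
    """
    return max(0.0, circuit.rated_w * safety_margin - circuit.measured_w)


def servers_that_fit(circuits, server_w: float, safety_margin: float = 0.8) -> int:
    """Count how many more servers of a given draw fit across all circuits."""
    return sum(int(headroom_w(c, safety_margin) // server_w) for c in circuits)


# Illustrative data only.
circuits = [
    Circuit("PDU-A", rated_w=10_000, measured_w=5_500),
    Circuit("PDU-B", rated_w=10_000, measured_w=7_000),
]
print(servers_that_fit(circuits, server_w=500))
```

The calculation is only as good as its inputs: it depends on measured draw from monitoring rather than nameplate ratings, which is the point the paragraph above makes.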
A recent article in DataCenter Dynamics states: “In the case of data centers, most root causes (of failure) are down to human error… But very often failures occur through a combination of two or three faults happening simultaneously, none of which would have caused an outage on its own.”
That's why failure simulation is a valuable tool for mission-critical facility operators.
The ability to simulate device failures and review “what if?” scenarios gives data center managers the information they need to make wise business-critical decisions and avoid disasters. It answers questions like:
- If I took this piece of equipment off-line for updates or maintenance, what would happen?
- What if something else failed while I was doing maintenance?
- Where would the load go?
- Would something else fail as a result?
- Would there be a cascading failure situation?
Redundancy may be reduced during planned or unplanned outages, or because of errors in connections. That is why identifying potential single points of failure, knowing where your system is most vulnerable, and understanding how resilient the data center is are all critical for improving infrastructure reliability.
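The “what if?” questions above can be explored with a very small model. The following Python sketch is an illustration under stated assumptions, not how any DCIM product works: each load splits its draw evenly across its surviving feeds, and a source trips as soon as it carries more than its capacity. The names `UPS-A`, `rack1` and so on, and all the numbers, are hypothetical.

```python
def simulate(capacity_kw, loads, failed):
    """Return (tripped_sources, dropped_loads) after taking sources offline.

    capacity_kw: {source_name: capacity in kW}
    loads:       {load_name: (draw in kW, [sources feeding it])}
    failed:      sources taken offline at the start (maintenance or fault)
    """
    down = set(failed)
    while True:
        # Assumption: each load shares its draw evenly across surviving feeds.
        carried = {s: 0.0 for s in capacity_kw if s not in down}
        dropped = []
        for name, (kw, feeds) in loads.items():
            alive = [s for s in feeds if s not in down]
            if not alive:
                dropped.append(name)  # no feed left, so the load goes dark
                continue
            for s in alive:
                carried[s] += kw / len(alive)
        # Any surviving source pushed past its capacity trips as well.
        overloaded = {s for s, kw in carried.items() if kw > capacity_kw[s]}
        if not overloaded:
            return sorted(down - set(failed)), sorted(dropped)
        down |= overloaded  # cascade: re-run with the newly tripped sources


# Hypothetical power chain: three sources, three dual-fed racks.
capacity = {"UPS-A": 10, "UPS-B": 10, "UPS-C": 6}
loads = {
    "rack1": (6, ["UPS-A", "UPS-B"]),
    "rack2": (6, ["UPS-B", "UPS-C"]),
    "rack3": (4, ["UPS-A", "UPS-C"]),
}

# Try each single failure in turn to see where the system is most vulnerable.
for source in capacity:
    tripped, dropped = simulate(capacity, loads, [source])
    print(source, "offline -> tripped:", tripped, "dropped:", dropped)
```

In this made-up layout, taking `UPS-C` offline is harmless, losing `UPS-B` overloads `UPS-C` and drops one rack, and losing `UPS-A` cascades until every rack is dark. A real power chain has more levels and does not share load this simply, but the same loop structure answers the questions in the list.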
The Building Blocks of DCIM
Let’s also be clear that DCIM is not a single piece of software but a software category. It consists of two core building blocks: DCIM monitoring, and IT Asset Management (ITAM).
DCIM monitoring concentrates on collecting data about what is going on in the physical data center environment. It tells you what’s happening.
ITAM keeps you informed about the IT equipment that’s inside your data center. It tells you what you have.
When selecting a DCIM solution, make sure that it can provide:
- Monitoring of power and environmental factors. Energy expenses account for 25 percent of a typical data center’s operating costs. That includes both the power to run equipment and the cooling to neutralize the heat the equipment produces. If you know the temperature at a variety of spots in your facility, you can raise the ambient temperature slowly, eliminate the hot and cold spots, and safely repeat this process to spend less on cooling without endangering your equipment. With a clear view of your power chain, you know where the power is coming from, where it’s going, and what’s connected to what. That lets you understand the upstream and downstream impacts of an element failing or being removed, see how much capacity is available, and grow responsibly.
- Alarms and alerts. Knowing when a value reaches a pre-established limit helps you take action to correct problems before they become critical.
- Trending. Real-time information is important, but trending values over time helps in responsible future planning. When is your data center busiest? How much power does it draw at those times?
- Scalability. As data centers get bigger and more complex, and as their density soars, a DCIM system must scale to keep up with the constant changes in these packed facilities.
- Failure simulation. Resilience metrics and information relating to system vulnerability and failure help maintain the highest possible uptime and predict what would happen in the event of a single failure or multiple failures in the power chain.
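The alarm and trending capabilities in the list above come down to simple checks on monitored values. Here is a minimal Python sketch, assuming readings arrive as plain dictionaries and tuples; the sensor names, limits, and sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def check_alarms(readings, limits):
    """Return (sensor, value, limit) for each reading at or above its limit."""
    return [
        (sensor, value, limits[sensor])
        for sensor, value in readings.items()
        if sensor in limits and value >= limits[sensor]
    ]


def busiest_hour(samples):
    """Given (hour_of_day, kW) samples, return (hour, mean kW) for the peak hour."""
    by_hour = defaultdict(list)
    for hour, kw in samples:
        by_hour[hour].append(kw)
    averages = {hour: mean(values) for hour, values in by_hour.items()}
    peak = max(averages, key=averages.get)
    return peak, averages[peak]


# Hypothetical sensor names, limits, and readings.
limits = {"rack12_inlet_c": 27.0, "rack13_inlet_c": 27.0, "pdu_a_kw": 8.0}
readings = {"rack12_inlet_c": 27.5, "rack13_inlet_c": 22.0, "pdu_a_kw": 7.9}
print(check_alarms(readings, limits))

# Hypothetical power samples collected over several days.
samples = [(9, 40.0), (9, 44.0), (14, 52.0), (14, 56.0), (22, 30.0)]
print(busiest_hour(samples))
```

The first function answers “has a value reached its pre-established limit?”, and the second answers “when is the data center busiest, and how much power does it draw then?”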
Cisco’s Global Cloud Index predicts that annual data center traffic will reach a total of 6.6 zettabytes by 2016. With data processing and storage demands on the rise, operations become more costly and there is a definite shift toward virtualization. This underscores the value of DCIM as a critical, “must-have” component of any well-run data center.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.