While data centers aren’t necessarily something CIOs think about on a daily basis, there are some essential things every executive in this role must know about their organization’s data center operations. They all have to do with data center outages, past and future. These incidents can seriously damage the entire organization’s performance and profitability, both of which fall squarely within a typical CIO’s scope of responsibilities.
CIOs need to know answers to these questions, and those answers need to be updated on a regular basis. Here they are:
- If you knew that your primary production data center was going to take an outage tomorrow, what would you do differently today? This is the million-dollar question, although not knowing the answer usually costs the CIO a lot more. Simply put, if you don’t know your data center’s vulnerabilities, you are more likely to take an outage. Working with experienced consultants will usually help, both by tapping into their expertise and by bringing a fresh set of eyes to the matter. At least two things should be reviewed: 1) how your data center is designed; and 2) how it operates. This review will help identify downtime risks and point to potential ways to mitigate them.
- Has your company ever experienced a significant data center outage? How do you know it was significant? The key here is defining “significant outage.” The definition can vary from one organization to another, and even between roles within a single company. It can also vary by application. Setting common definitions is essential to identifying and eliminating unplanned outages. Once the definitions are set, communicate them within your organization and begin to track and measure outages against them.
- Which applications are the most critical ones to your organization, and how are you protecting them from outages? The lazy uniform answer would be, “Every application is important.” But every organization has applications and services that are more critical than others. A website going down in a hospital doesn’t stop patients from being treated, but a website outage for an e-commerce company means missed sales. Once you identify your most critical apps and services, determine who will protect them and how, based on your specific business case and risk tolerance.
- How do you measure the cost of a data center outage? Having this story clear helps the business make better decisions: with a model for estimating outage costs and weighing them against the cost of mitigating the risk, it can decide on an informed basis. Total outage cost can be nebulous, but spending the time to get as close to it as possible and getting executive buy-in on that story will help the cause. We have witnessed generator projects and UPS upgrades turned down simply because the manager couldn’t tell this story to the business. A word of warning: The evidence and the costs for the outage have to be realistic. Soft costs are hard to calculate, and inflating them can make the choice seem simple, but sometimes an outage may just mean a backlog of information that needs to be processed, without significant top-line or bottom-line impact. Even the most naïve business execs will sniff out unrealistic hypotheticals. Outage cost estimates have to be real.
- What indirect business costs will a data center outage result in? These vary greatly from organization to organization, and they are the costs that are harder to quantify: loss of productivity, loss of competitive advantage, reduced customer loyalty, regulatory fines, and many other types of losses.
- Do you have documented processes and procedures in place to mitigate human error in the data center? If so, how do you know they are being precisely followed? According to recent Uptime Institute statistics, around 73% of data center outages are caused by human error. Until we can replace all humans with machines, the only way to address this is to have clearly defined processes and procedures. The fact that this statistic hasn’t improved over time indicates that most organizations still have a lot of work to do in this area. Enforcement of these policies is just as critical. Many organizations do have sound policies but don’t enforce them adequately.
- Do your data center security policies gel with your business security policies? We could write an entire article on this topic (and one is in the works), but in short, now that IT and facilities are figuring out how to collaborate better inside the data center, it’s time for IT and security departments to do the same. One common problem we’ve observed arises when a corporate physical security system needs to operate within the data center but under different usage requirements than the rest of the company. Getting corporate security and data center operations to integrate, or at least share data, is usually problematic.
- Do you have a structured, ongoing process for determining which applications run in on-premises data centers, in a colo, or in a public cloud? As your business requirements change, so do your applications and the resources needed to operate them. All applications running in the data center should be assessed and reviewed at least annually, if not more often, and the best type of infrastructure should be chosen for each application based on the reliability, performance, and security requirements of the business.
- What is your IoT security strategy? Do you have an incident response plan in place? Now that most organizations have solved or mitigated BYOD threats, IoT devices are likely the next major category of input devices to track and monitor. As we have seen over the years, many organizations are monitoring activity on the application stack, while IoT devices are left unmonitored and often unprotected. These devices play a major role in the physical infrastructure (such as power and cooling systems) that operates the organization’s IT stack. Leaving them unprotected increases the risk of data center outages.
- What is your Business Continuity/Disaster Recovery process? And the follow-up questions: Does your entire staff know where they need to be and what they need to do if you have a critical and unplanned data center event? Has that plan been tested? Again, processes are key here. Most organizations we consult with do have these processes architected, implemented, and documented. The key issue is once again the human factor: Most often, personnel don’t know about these processes, and if they do, they haven’t practiced them enough to stay alert and know what to do when a major event actually happens.
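To make the outage-cost question above more concrete, here is a minimal Python sketch of weighing expected outage cost against the cost of mitigation. It is an illustration under our own assumptions, not a standard model: the class `OutageScenario`, the function `mitigation_net_benefit`, and every figure in the example are hypothetical placeholders, and real inputs should come from your own outage history and finance team.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """Hypothetical inputs describing outage exposure; all values are placeholders."""
    outages_per_year: float        # expected number of outages per year
    hours_per_outage: float        # expected duration of each outage
    direct_cost_per_hour: float    # e.g. missed sales
    indirect_cost_per_hour: float  # e.g. lost productivity, estimated conservatively

    def expected_annual_cost(self) -> float:
        """Expected yearly cost = frequency x duration x hourly cost."""
        hourly = self.direct_cost_per_hour + self.indirect_cost_per_hour
        return self.outages_per_year * self.hours_per_outage * hourly


def mitigation_net_benefit(before: OutageScenario,
                           after: OutageScenario,
                           annual_mitigation_cost: float) -> float:
    """Yearly savings from reduced outage exposure, minus the yearly cost of mitigating.

    A positive result suggests the project pays for itself under these assumptions.
    """
    savings = before.expected_annual_cost() - after.expected_annual_cost()
    return savings - annual_mitigation_cost


if __name__ == "__main__":
    # Made-up numbers: one outage every two years today, one every eight
    # years after a (hypothetical) UPS upgrade costing 15,000 per year.
    today = OutageScenario(0.5, 4.0, 20_000.0, 5_000.0)
    upgraded = OutageScenario(0.125, 4.0, 20_000.0, 5_000.0)
    print(today.expected_annual_cost())
    print(upgraded.expected_annual_cost())
    print(mitigation_net_benefit(today, upgraded, 15_000.0))
```

Keeping direct and indirect costs as separate inputs makes it easy to show the business how much of the estimate rests on hard numbers and how much on soft ones.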
Many other questions could (and should) be asked, but we believe that these address the areas of greatest risk and impact to an organization’s IT operations in a data center. Can you thoroughly answer all of these questions for your company? If not, it’s time to look for answers.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.
About the Author: Tim Kittila is Director of Data Center Strategy at Parallel Technologies. In this role, Kittila oversees the company’s data center consulting and services to help companies with their data center, whether it is a privately owned data center, a colocation facility, or a combination of the two. Earlier in his career at Parallel Technologies, Kittila served as Director of Data Center Infrastructure Strategy, where he was responsible for data center design/build solutions and led the mechanical and electrical data center practice, including engineering assessments, design-build, construction project management, and environmental monitoring. Before joining Parallel Technologies in 2010, he was vice president at Hypertect, a data center infrastructure company. Kittila earned his bachelor of science in mechanical engineering from Virginia Tech and holds a master’s degree in business from the University of Delaware’s Lerner School of Business.