Effective Risk Management in the Data Center

Data center managers are fighting a constant battle with risk. Their job, aside from cramming computing resources into a constrained space using limited power and cooling capacity, involves ensuring that those resources are available all of the time. That means identifying and managing risks from various sources.

A standards-based risk management methodology can help with that challenge. It can help data center managers to prioritize their risks, and to prepare for a data center or critical environments audit. Where to start?

Understanding Different Types of Risk

Before a data center can manage risk, it has to understand the different categories of threat to operations. Kevin Read, GIO UK senior delivery center manager at French multinational IT consulting company Capgemini, is responsible for managing data center risk in his organization, which runs its own facilities to serve clients. He identifies several categories for data center managers to be worried about.

“The first risk category in a mission-critical data center is loss of power,” he warns. This risk is existential for a data center, but there are frameworks incorporating the management of that risk. Like many other data centers, Capgemini uses tier ratings, which help to classify their exposure to disruptive risks such as these.

“Capgemini designs and implements Tier 3 facilities to provide the resilience for its clients with N+1, & N+N UPS-backed power routes to the racks and cooling systems,” said Read. “Also, connecting dual power into the site protects against local sub-station power failure, with backup generators as a last resort.”
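To make the N+1 and N+N shorthand concrete, here is a minimal sketch of how the two schemes translate into installed UPS modules. The function name, the load figure, and the module rating are illustrative assumptions, not Capgemini's numbers, and the calculation counts capacity only; it ignores how modules are split across power paths.

```python
import math


def ups_modules(load_kw: float, module_kw: float, scheme: str) -> dict:
    """Count UPS modules for a redundancy scheme (capacity-only simplification).

    N is the number of modules needed to carry the load with no spare.
    "N+1" adds one spare module; "N+N" duplicates the full set.
    """
    if load_kw <= 0 or module_kw <= 0:
        raise ValueError("load_kw and module_kw must be positive")

    needed = math.ceil(load_kw / module_kw)  # this is N

    if scheme == "N+1":
        spare = 1
    elif scheme == "N+N":
        spare = needed
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")

    return {"needed": needed, "installed": needed + spare, "spare": spare}


# Hypothetical figures: a 400 kW critical load served by 150 kW modules.
n_plus_1 = ups_modules(400, 150, "N+1")
n_plus_n = ups_modules(400, 150, "N+N")
print(n_plus_1)
print(n_plus_n)
```

With these made-up figures, three modules carry the load, so N+1 installs four and N+N installs six. The extra modules are the price of the additional resilience.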

The second risk involves service disruption caused by fires in malfunctioning plant and IT equipment, he said, adding that the company uses inert gas suppression systems in all IT rooms, including plant rooms, to douse fires before they spread.

“The third risk category is flooding (rivers and extreme weather), aircraft, pandemics and air contamination from other properties,” he continued. “Sites on flight paths, close to flood risk areas and close to factories that pollute or could contain explosive chemicals should never be selected.”

Finally, Read points to security as risk category number four. This includes both physical security, and the risk of logical security breaches (hacks). The firm even lumps terrorist threats into this risk category.

Like the other categories of risk, security naturally breaks down into many subcategories, and those can be divided still further. Within logical security, for example, managers may look at employee access to applications as a particular risk area, and mobile and device access as another.

Some risks emerge as new technologies become mainstream. For example, Paul Ferron, director of security solutions at CA Technologies, warns about virtualization sprawl as a particular security risk. This phenomenon, more often described as a management and resource risk, can have consequences for data security too, he said.

“Virtual machines can easily be copied without the appropriate security privileges,” he warned. “When users have finished with them, they may not be shut down.”

In this case, as with many others, designing secure processes for certain operations helps to standardize them and reduce the risk of vulnerabilities slipping through the net. The use of, say, IT service management tools to codify and automate those processes reduces it still further.

Matt Lovell, CTO at cloud hosting company Pulsant, adds health and safety risks to the mix.

These are multi-faceted, he warned, ranging from electrical best practice and mechanical operational safety through to environmental and noise controls, and the challenges of working in restricted space areas.

“This requires a significant degree of compliance and safety of work measurements to ensure all personnel who work in the environment do so with the minimum of risk to themselves and others,” he said.

Risk Management Methodologies

These risks won’t all be equal, though. Some will be more likely than others, while some will have a bigger potential impact. Juggling them all and understanding which ones to prioritize from a budgetary perspective is an important part of the process.

Ferron advises managers to use variations on the traditional risk management matrix, with the probability of risk along one side, and the potential business impact along the other. “This can be a 3-D graph,” he added, suggesting that a third dimension could highlight the projected expenditure to mitigate the risk in question.
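The matrix Ferron describes can be sketched in a few lines of code. This is only an illustration: the 1-to-5 scales, the probability-times-impact score, the tie-breaking rule, and every figure below are assumptions made for the example, not values from Ferron or any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: int        # 1 (rare) to 5 (almost certain); assumed scale
    impact: int             # 1 (minor) to 5 (severe); assumed scale
    mitigation_cost: float  # projected spend to mitigate; the third dimension

    @property
    def score(self) -> int:
        # Classic matrix score: probability multiplied by business impact.
        return self.probability * self.impact


def prioritize(risks):
    """Highest score first; among equal scores, cheapest mitigation first."""
    return sorted(risks, key=lambda r: (-r.score, r.mitigation_cost))


# Hypothetical entries, loosely based on the risk categories in this article.
risks = [
    Risk("Loss of power", 2, 5, 250_000),
    Risk("Fire in plant room", 1, 5, 120_000),
    Risk("Virtual machine sprawl", 4, 3, 30_000),
    Risk("Flooding", 1, 4, 400_000),
]

ranked = prioritize(risks)
for r in ranked:
    print(f"{r.name}: score={r.score}, mitigation cost={r.mitigation_cost:,.0f}")
```

Sorting by score and then by cost is one possible policy. A manager with a fixed budget might prefer to rank by score per unit of mitigation spend instead.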

Read’s operation has a similar approach, designed to identify and quantify risks and their potential mitigation cost. Significantly, his risk management system is designed to be a living, breathing document that changes over time.

“At Capgemini, we have put in place a monthly risk management system that logs all risks and issues with containment and action plans,” he said. “An investment budget is made available if changes are required.”
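A register of this kind can be modeled very simply. The sketch below is an assumption about what such a log might contain (the field names, the review function, and the sample entries are all hypothetical, and it does not describe Capgemini's actual system); it shows how a monthly review could total the budget requested for open items.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RegisterEntry:
    title: str
    raised: date
    containment: str               # what limits the damage right now
    action_plan: str               # what will remove or reduce the risk
    budget_requested: float = 0.0  # investment needed, if any
    closed: bool = False


def monthly_review(entries, review_date: date) -> dict:
    """Summarize entries that are still open on the review date."""
    open_entries = [
        e for e in entries if not e.closed and e.raised <= review_date
    ]
    return {
        "review_date": review_date.isoformat(),
        "open_count": len(open_entries),
        "budget_requested": sum(e.budget_requested for e in open_entries),
        "titles": [e.title for e in open_entries],
    }


# Hypothetical sample data.
register = [
    RegisterEntry("UPS battery string near end of life", date(2014, 1, 10),
                  "Weekly battery health checks", "Replace string", 40_000),
    RegisterEntry("Unowned virtual machines", date(2014, 2, 3),
                  "Restrict copy privileges", "Audit and decommission", 5_000),
    RegisterEntry("Blocked fire exit", date(2014, 1, 20),
                  "Area cleared", "Add to daily walk-round", closed=True),
]

summary = monthly_review(register, date(2014, 2, 28))
print(summary)
```

Keeping containment and action plan as separate fields mirrors the distinction in Read's description between limiting a risk now and fixing it later.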

While data centers face their own unique kinds of risks, the methods used for managing them aren’t specific to that environment. More generic risk management methodologies are as suitable for describing and handling data center risk as they are in other domains.

One commonly understood risk management standard is ISO 31000:2009, said Lovell. This standard sets out generic principles and guidelines for risk management, and is designed to be tailored to the risk types that each user sees fit. It is more a framework for risk management than an accreditation, but Lovell said that it can also be used to audit risk preparedness within a data center.

“The audit program must seek to identify that the correct response procedures are in place and that these are rehearsed and understood by staff, which will change over time, so they must be continually updated,” he said.

Data centers don’t function alone, though. They exist on a broader continuum that marries technology with business objectives. Risk management in technology will be part of a broader risk management story. Competent companies will be exploring all kinds of risk, from financial through to regulatory and organizational.

How the data center’s risk fits into this will vary between companies. In Capgemini’s case, the data center manager is responsible for the facility and will manage the monthly risks and issues process. That manager, along with the head of UK data centers, has monthly meetings with the chief financial officer’s team to forecast any major risk expenditures.

Data center compliance teams will typically report to the board in some form, said Pulsant’s Lovell.

“There are director responsibilities which must be managed and reported as legal obligations. This may differ from other IT governance programs which may report through various project or organizational structures,” he said.

Ideally, there should be some separation of duties when managing risk and reporting on the results, Lovell added. “The recommendation is always to manage risk appropriately, and this should involve a level of independent management and verification of compliance outside of the operational teams which monitor and deliver data center services. This can be an independent internal or external governance team.”

Choosing an Audit Methodology

The key word here is verification. Quantifying, prioritizing and mitigating risk is one part of the risk management challenge, but measuring a data center’s performance in these areas is an important part of the process. An audit for risk will help internal staff—and potentially clients, if necessary—to see how well a data center has controlled the various sources of risk in the operation.

Before choosing an audit to cover risk in the data center, managers must understand what they want to achieve from it. Is the risk audit customer-driven? If so, are there any specific standards that the customer is looking for? Are there any risk management metrics that a client particularly wants the data center to hit?

Audits may also be driven by suppliers of risk mitigation services to the data center. For example, Capgemini’s data centers are audited regularly by its own group and by government clients, but also by Capgemini’s insurers, Read said.

Audit Standards

One of the biggest challenges for a risk audit is the diversity of risk categories involved. It is difficult to audit all of these under one standard, meaning that data center managers may have to apply a variety of standards when conducting an audit.

When looking at security, ISO 27002 covers the code of practice for information security management. It explores a variety of different aspects, including human resource security, physical and environmental security, and access control.

The Payment Card Industry Data Security Standard (PCI-DSS) also covers information security, and is a highly prescriptive standard focusing on the organization and retention of credit card data in the data center. It covers the building and maintenance of a secure network, the management of vulnerabilities, and network and system monitoring among other things.

For commercial operators handling government information, other audits may be necessary. In the UK, List X is a commonly understood security clearance system for contractors handling government data, while in the U.S., Facility Clearance Levels are the alternative.

“From a health and safety perspective, many data center operators are working toward, or at least to, the principles of OHSAS 18001, which is an internationally recognized standard for health and safety management and associated systems,” added Lovell.

Environmental protection audits will often fall under ISO 14001. Data centers may wish to consider this auditing standard, and environmental risks in general, given the tendency to store diesel onsite in bulk to handle generator requirements.

Stakeholders

There are often multiple stakeholders involved when it comes to defining and mitigating risks, said Gavin Millard, technical director of Tenable Network Security, which sells software designed to scan networks for security threats. He divides them into three main groups: the security team, the operations team and the business.

The problem is that not all of them have the same agendas, he warned: “As many organizations have discovered, the goals and needs of each are often conflicting, causing issues with prioritizing the actions needed to reduce each specific group’s definition of risk.”

What do these conflicts look like? One example involves software patching. This is one of the most effective ways to reduce security risks in an organization. In July 2013, the Australian Signals Directorate published a set of strategies to mitigate cyber-intrusions. Patching operating systems was one of these measures, and patching applications was another. Doing both, along with application whitelisting and minimizing administrative privileges, would eliminate 85 percent of hacks, the agency said.

The problem is that the IT security group’s priority is to focus on eliminating holes in the system through which an attacker might creep, so that it can reduce the risk of data breaches. That requires it to patch critical vulnerabilities quickly. Conversely, the IT operations team needs to minimize the risk of downtime, meaning that any changes to the system must be structured, planned, and controlled. This can often lead operations teams to ask for less frequent patching schedules to reduce availability risk.

Business managers have their own, separate agenda: maintaining the bottom line and hitting their performance targets. So they will only want patches deployed if the benefit to the bottom line outweighs the cost of completing the work.

“Conflicting goals can be hard to address, but one of the most effective methods of doing so is to have a highly efficient process for continuously identifying where a risk resides,” said Millard. “You also need a predictable, reliable method of updating systems without impact to the overarching business goals of the organization.”

Managing risk effectively, then, involves not only an assessment of threats to the data center, but a willingness among team members to work together cooperatively so that all agendas can be happily accommodated. In some cases, this may create opportunities for new working practices.

The introduction of DevOps (development/operations) disciplines to streamline the workflow between development, test, and deployment might help to offset tensions such as the one that Millard describes.

As with most things in IT, effective risk management is as much a people-centric process as a technology-focused one. The use of standardized methodologies and audits can help to quantify just how much risk a data center faces, and how this may affect future budgets. It always helps to measure what must be managed.

Danny Bradbury has 20 years of experience as a technology journalist. He writes regularly about enterprise technology issues including data center management, security, software development and networking.
