Human Error, Downtime and Design

Rob McClary is the Senior Vice President & General Manager at FORTRUST.

If one of the leading causes of data center outages is human error, why do we spend so much time focused on the tier classification or rating of a data center? I would agree that the design plays an important part in reliability, but it’s a small part relative to people, process, operations, maintenance, lifecycle and risk mitigation strategies.

Human error puts any data center design at risk for downtime, so why do we continue to overbuild and wait for the inevitable? Have we arrived at the point in this relatively young industry where we believe human error is unavoidable and that trying to fix the root causes of “human error” is just too hard?

This thought process has caused gross over-provisioning and waste of resources, sparking a debate among industry thought leaders about the true value of a tier rating.

The Tier System

The tier classification system has traditionally been viewed as the industry benchmark for design standards and site reliability. A CIO searching for a data center for their organization will refer to the tier classification to predict a data center’s expected reliability. However, tier ratings are often relied upon too heavily.

What matters is not a methodology that emphasizes the predictability of outcomes, but the actual results the data center delivers.

I’m not ready to ignore the “most likely” causes of data center outages in favor of trying to design around human error. I will argue that the management and operations of a data center matter far more than a tier rating. A tier rating, or any design for that matter, is no guarantee of reliability.

Until recently, data center reliability was always about excess capability and excess capacity, forcing clients to over-provision and waste important resources in the process. Today’s world favors economy over excess, hence the noticeable shift in the industry’s mentality toward a more progressive approach to data center reliability: a fit-for-purpose design combined with world-class operations and management.

Going Beyond Design to Operations

Despite the changes, too much importance is still placed on data center design, which is only one small part of creating high availability.

More time needs to be spent on the Uptime Institute’s expansion of its Management and Operations (M&O) Stamp of Approval Program. Why? Once the data center is designed and built, it will inevitably be operated by people. No design I know of completely removes human error from data centers.

This is a problem you can’t just throw resources at, or design around. This is a problem that can only be solved by creating an organizational structure that mitigates or eliminates human error. It’s no small effort to do so, however. Fostering ownership, process discipline, procedural compliance, training, and a positive work environment will bestow an operational mindset on your team, which will in turn ensure your data center is reaching its maximum potential—no matter its tier rating.

At the end of the day, the bottom-line metric of a data center’s success is simple: the years of continuous uptime that you deliver against the number of unplanned downtime events that you experience. With a focus on operational mindset and organizational strategy, continuous uptime can be sustained over a long period of time. That’s not luck, that’s not just design, that’s strategy.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.
