How to Prevent Downtime Due to Human Error

Data center downtime is often the result of equipment failure, or a chain reaction of unexpected events. But one of the leading causes of data center downtime is human error, as ComputerWorld reminds us in Stupid Data Center Tricks, which relays anecdotes of data center mishaps. The story notes a study by The Uptime Institute, which estimates that human error causes roughly 70 percent of the problems that plague data centers today.

How can this problem be mitigated? “There is no doubt that human errors in the data center causes a great deal of downtime and some of these can be avoided by adhering to some simple steps,” said Ahmad Moshiri, director of power technical support for Emerson Network Power’s Liebert Services business.

Here’s a look at Emerson Network Power’s Best Practices to Avoid Data Center Failure by Human Error:

1. Shielding Emergency OFF Buttons – Emergency Power Off (EPO) buttons are generally located near doorways in the data center. Often, these buttons are not covered or labeled, and are mistakenly pressed during an emergency, which shuts down power to the entire data center. Labeling and covering EPO buttons can prevent someone from accidentally pushing the button. See Averting Disaster with the EPO Button and Best Label Ever for an EPO Button for more on this topic.

2. Documented Method of Procedure – A documented step-by-step, task-oriented procedure mitigates or eliminates the risk associated with performing maintenance. Don’t limit the procedure to one vendor, and ensure back-up plans are included in case of unforeseen events.

3. Correct Component Labeling – To operate a power system correctly and safely, all switching devices, as well as the facility one-line diagram, must be labeled correctly to ensure the correct sequence of operation. Procedures should be in place to double-check device labeling.

4. Consistent Operating Practices – Sometimes data center managers get too comfortable and don’t follow procedures, forget or skip steps, or perform the procedure from memory and inadvertently shut down the wrong equipment. It is critical to keep all operational procedures up to date and follow the instructions to operate the system.

5. Ongoing Personnel Training – Ensure all individuals with access to the data center, including IT, emergency, security and facility personnel, have basic knowledge of equipment so that it’s not shut down by mistake.

6. Secure Access Policies – Organizations without data center sign-in policies run the risk of security breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know who is entering and exiting the facility at all times.

7. Enforcing Food/Drinks Policies – Liquids pose the greatest risk for shorting out critical computer components. The best way to communicate your data center’s food/drink policy is to post a sign outside the door that states what the policy is, and how vigorously the policy is enforced.

8. Avoiding Contaminants – Poor indoor air quality can cause unwanted dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all personnel who access the data center wear antistatic booties, or by placing a mat outside the data center. It also helps to pack and unpack equipment outside the data center, because moving boxed equipment inside increases the chances that fibers from boxes and skids will end up in server racks and other IT infrastructure.
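
As one illustration of the label double-check described in item 3, here is a minimal Python sketch that compares the labels on the one-line diagram with the labels recorded on the devices themselves. It is not part of Emerson's guidance. The function name `check_labels`, the device IDs, and the label text are all hypothetical, and the sketch assumes both sets of labels have already been collected into simple dictionaries.

```python
def check_labels(diagram_labels, field_labels):
    """Compare one-line diagram labels with labels recorded on the devices.

    Both arguments map a (hypothetical) device ID to its label text.
    Returns a list of (device_id, problem) tuples; an empty list means
    no mismatch was found.
    """
    problems = []
    # Walk every device that appears in either source, in a stable order.
    for device_id in sorted(set(diagram_labels) | set(field_labels)):
        on_diagram = diagram_labels.get(device_id)
        in_field = field_labels.get(device_id)
        if on_diagram is None:
            problems.append((device_id, "missing from one-line diagram"))
        elif in_field is None:
            problems.append((device_id, "no label found on device"))
        elif on_diagram.strip().lower() != in_field.strip().lower():
            problems.append(
                (device_id, f"diagram says {on_diagram!r}, device says {in_field!r}")
            )
    return problems


# Hypothetical example data, for illustration only.
diagram = {"CB-1": "UPS A Input", "CB-2": "UPS A Bypass", "CB-3": "PDU 1 Feed"}
field = {"CB-1": "UPS A Input", "CB-2": "UPS B Bypass", "CB-4": "PDU 2 Feed"}

for device_id, problem in check_labels(diagram, field):
    print(f"{device_id}: {problem}")
```

A check like this only flags discrepancies. Someone still has to walk the floor to decide whether the diagram or the device label is wrong before any switching is done.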

3 Comments

  1. Rajesh

    Good article.

  2. Steve

    I work in the industry and I'm surprised that the article from Liebert made no mention of using FACTORY-trained technicians for equipment service. For all those out there who think one service organization is as good as the next, see what happens when one of these pieces of equipment fails and the guy who shows up has no clue what model he's looking at or where and when the required parts would be available! I can understand the items posted above without a doubt, but the HUMAN error part is usually related to service/maintenance procedures. Factory service teams only work on their own equipment, which allows them to focus on quality service procedures for that equipment. Next time someone who uses third-party services is conducting preventative maintenance, ask the technician a question like, "During the PM, were the DC and AC capacitors checked, and if so, what's their current age and how many years can I expect out of them before proactively replacing them?" This is a very simple question, and the answer can mean the difference between a critical load loss and a systematic shutdown and replacement.