Data center downtime is often the result of equipment failure, or a chain reaction of unexpected events. But one of the leading causes of data center downtime is human error, as ComputerWorld reminds us in Stupid Data Center Tricks, which relays anecdotes of data center mishaps. The story notes a study by The Uptime Institute, which estimates that human error causes roughly 70 percent of the problems that plague data centers today.
How can this problem be mitigated? “There is no doubt that human errors in the data center causes a great deal of downtime and some of these can be avoided by adhering to some simple steps,” said Ahmad Moshiri, director of power technical support for Emerson Network Power’s Liebert Services business.
Here’s a look at Emerson Network Power’s Best Practices to Avoid Data Center Failure by Human Error:
1. Shielding Emergency OFF Buttons – Emergency Power Off (EPO) buttons are generally located near doorways in the data center. Often, these buttons are not covered or labeled, and are mistakenly shut off during an emergency, which shuts down power to the entire data center. Labeling and covering EPO buttons can prevent someone from accidentally pushing the button. See Averting Disaster with the EPO Button and Best Label Ever for an EPO Button for more on this topic.
2. Documented Method of Procedure: A documented step-by-step, task-oriented procedure mitigates or eliminates the risk associated with performing maintenance. Don’t limit the procedure to one vendor, and ensure back-up plans are included in case of unforeseen events.
3. Correct Component Labeling: To correctly and safely operate a power system, all switching devices must be labeled correctly, as well as the facility one-line diagram to ensure correct sequence of operation. Procedures should be in place to double check device labeling.
4. Consistent Operating Practices – Sometimes data center managers get too comfortable and don’t follow procedures, forget or skip steps, or perform the procedure from memory and inadvertently shut down the wrong equipment. It is critical to keep all operational procedures up to date and follow the instructions to operate the system.
5. Ongoing Personnel Training – Ensure all individuals with access to the data center, including IT, emergency, security and facility personnel, have basic knowledge of equipment so that it’s not shut down by mistake.
6. Secure Access Policies – Organizations without data center sign-in policies run the risk of security breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know who is entering and exiting the facility at all times.
7. Enforcing Food/Drinks Policies – Liquids pose the greatest risk for shorting out critical computer components. The best way to communicate your data center’s food/drink policy is to post a sign outside the door that states what the policy is, and how vigorously the policy is enforced.
8. Avoiding Contaminants – Poor indoor air quality can cause unwanted dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all personnel who access the data center wear antistatic booties, or by placing a mat outside the data center. This includes packing and unpacking equipment outside the data center. Moving equipment inside the data center increases the chances that fibers from boxes and skids will end up in server racks and other IT infrastructure.