Critical Thinking is a weekly column on innovation in data center infrastructure design and management. More about the column and the author here.
The phrase “all the gear but no idea” is sometimes used to describe hobbyists – cyclists, skiers, or sailors – who think they can spend their way to proficiency without investing the time to accrue the necessary skills for their chosen pastime.
But the notion that competence can simply be purchased by procuring the newest and shiniest technology could equally apply to some data center operators.
Arguably there has been more innovation in data center design over the last five to ten years than in the previous two decades. Some of that momentum has been created by the almost relentless focus on efficiency by hyper-scale cloud platform operators who are willing to accept the potential risks of disruptive technologies in order to lower capital or operational costs. As a result, colocation providers and enterprise operators have also had to invest in innovative designs in order to remain relevant.
The pace of innovation looks set to continue and even accelerate over the coming decade due to a triumvirate of technology drivers: the Internet of Things (IoT) and edge computing, AI and machine learning, and the inexorable growth of public cloud. New data center form factors are also emerging, including highly compact micro-modular designs.
This isn’t to say that data center owners are ignoring resiliency and availability: operators continue to invest in highly resilient designs and third-party certifications.
For example, Uptime Institute recently announced that it would also begin to certify so-called prefabricated modular (PFM) equipment – something it’s shied away from in the past. These pre-integrated units of capacity can speed up the design and construction of resilient sites – from years down to months – and drive out some of the capital costs associated with bespoke designs.
Uptime is also developing a new initiative to certify the increasing number of sites that employ some form of so-called “distributed resiliency,” where availability is maintained not only by redundant mechanical and electrical (M&E) equipment but also by resilient applications and networks.
But it could be argued that the pace of innovation in data center technology is, in some cases, actually a distraction from effective management: unplanned downtime is still worryingly prevalent. In the first few months of 2017 alone there was a spate of outages at high-profile facilities operated by financial services firms, retailers, cloud service providers, and airlines. Delta Air Lines, United Airlines, and British Airways were just three of the companies that suffered serious incidents, with estimated costs running into the tens or even hundreds of millions of dollars.
A number of those outages were reportedly down to human error. A 2016 study by security research organization Ponemon Institute revealed that 22 percent of all unplanned data center downtime was due to human factors rather than malfunctioning equipment. Uptime thinks the problem could be worse: “Uptime Institute analysis of 20 years of abnormal incident data from its members showed human error to blame for more than 70% of all data center outages,” the organization states.
For some, the idea of human error will conjure up the image of a hapless admin flipping the wrong switch. In reality, such incidents are rarely about the fallibility of a single human being; they are failures at the organizational level. Arguably, having the right operations and maintenance (O&M) practices in place is as important as, if not more important than, even the most resilient 2N infrastructure design.
Some of the issues around poor O&M practices can be attributed to lack of communication between teams. For example, equipment manufacturers and design teams should create O&M documents and procedures and hand these off to the operations staff responsible for day-to-day management. However, all too often, this handover is not well managed, and the documentation is not fit for purpose. Operations teams also often fail to keep O&M documents up to date to reflect changes (upgrades, replacements) to the infrastructure.
There are tools that can help improve O&M practices, such as the asset management modules of some data center infrastructure management (DCIM) suites, or dedicated computerized maintenance management systems (CMMS). Specialist O&M tools from suppliers such as MCIM and Icarus Ops are also available. However, some operators continue to rely on manual processes and static documents – PDFs, for example – that can be hard to edit and keep up to date.
Third-party consultants can help refine management practices. It’s also possible to outsource some of the problem to a facilities management (FM) service provider such as CBRE, JLL, or Schneider Electric. As well as bringing their own staff on-site, these FM companies often promise to introduce stricter management practices and improved service levels over those provided by internal teams.
Longer term, it also seems likely that innovations such as cloud-based data center management tools, already in development by suppliers such as Schneider and Eaton, could help automate some management tasks using data lakes and advanced analytics, and enable approaches such as predictive maintenance.
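At its core, predictive maintenance of this kind rests on a simple idea: watch equipment telemetry for drift away from a healthy baseline and raise a flag before the drift becomes a failure. The sketch below illustrates the principle only; the sensor values, window size, and threshold are invented for the example and do not reflect any vendor's product or API.

```python
# Illustrative sketch of baseline-drift detection on equipment telemetry.
# All values, window sizes, and thresholds here are hypothetical examples.

def flag_drift(readings, window=5, limit=2.0):
    """Return the index where the rolling average first drifts above
    baseline + limit, or None if no drift is detected.

    `readings` is a time-ordered list of sensor values, e.g. cooling-unit
    supply-air temperatures in Celsius.
    """
    # Treat the average of the first window as the healthy baseline.
    baseline = sum(readings[:window]) / window
    for i in range(window, len(readings) - window + 1):
        avg = sum(readings[i:i + window]) / window
        if avg > baseline + limit:
            return i
    return None

# Hypothetical cooling-unit temperatures: stable, then creeping upward.
temps = [18.0, 18.1, 17.9, 18.0, 18.1, 18.2,
         18.4, 19.1, 19.8, 20.5, 21.3, 22.0]
print(flag_drift(temps))  # flags the window where the creep exceeds 2 °C
```

A real system would of course use far richer models – trained on fleet-wide data lakes rather than a single sensor – but the goal is the same: turn telemetry into an early warning rather than a post-mortem.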
But technology on its own won’t prevent human error. Operators – especially those with a number of facilities – need to develop comprehensive O&M processes that cover everything from emergency operating procedures to staff training and qualifications. After all, even the latest carbon-fiber racing bike won’t do much for a rider who hasn’t bothered to learn how to fix a puncture.