Designing for High Availability and System Failure
May 8th, 2013 By: Julius Neudorfer
This is the third article in a series on the DCK Executive Guide to Data Center Designs.
In the world of mission critical computing, the term “data center” and its implied and projected level of “availability” have always referred to the physical facility and its power and cooling infrastructure. With the advent of the “cloud”, what constitutes the “availability” of a “data center” may be up for re-examination.
Designing for failure and accepting equipment failure (facility or IT) as part of the operational scenario is imperative. As was discussed previously (see part 1, Build vs Buy), the ascending tier levels of power and cooling equipment redundancy can mitigate the impact of a facility based hardware failure. However, the IT architects are responsible for the overall availability of the IT resources, by means of redundant servers, storage and networks, as well as the software that monitors, manages, re-allocates and re-directs applications and processes to other resources in the event of an IT systems failure.
Traditionally there has been very little discussion or interaction between IT architects and data center facility designers regarding the ability of IT systems to handle failover. As more enterprise organizations begin to use public and private cloud resources, the amount of redundant IT resources needed within any single physical data center may change, replaced in part by logical redundancy shared among two or more sites. The ability to shift live computing loads across hardware and sites is not new. Server clustering technology, coupled with redundant replicated data storage arrays, has been available and used successfully for over 20 years. While not every application may fail over perfectly or seamlessly yet, we should not underestimate the long-term importance of treating the failover ability of the IT systems as part of the overall availability goal when deciding what redundancy levels the facility infrastructure needs in order to meet the desired level of overall system availability.
The holistic approach of including an evaluation of the resiliency of the IT architecture in the “availability” design and calculations should be part and parcel of the overall business requirements when making decisions regarding the facility tier level, the number of physical data centers, and their geographic locations. This can potentially reduce costs and greatly increase overall “availability”, as well as business continuity and survivability during a crisis. Even basic decisions, such as how much fuel should be stored locally for generator back-up (e.g. 24 hours, 3 days or a week), need to be re-evaluated in light of recent events such as Superstorm Sandy, which devastated the general infrastructure in New York City and the surrounding areas (see part 4, Global Strategies).
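To illustrate how IT-level failover across sites can enter the availability calculation, here is a minimal sketch in Python. It assumes that failures are independent (which a regional event such as a major storm can violate) and that failover always works. The function names and the availability figures are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def series(*availabilities):
    """All components must be up (e.g. facility AND IT stack at one site)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Any one component suffices (e.g. failover between sites).

    Assumes independent failures and perfect failover, both optimistic.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_hours_per_year(availability):
    """Expected unavailable hours in a year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures, not taken from any tier standard.
facility = 0.999   # power and cooling infrastructure
it_stack = 0.999   # servers, storage, network

one_site = series(facility, it_stack)
two_sites = parallel(one_site, one_site)

print(f"one site : {one_site:.6f}, {downtime_hours_per_year(one_site):.2f} h/yr")
print(f"two sites: {two_sites:.6f}, {downtime_hours_per_year(two_sites):.2f} h/yr")
```

Under these assumptions, two modest sites with working failover can exceed the availability of a single site, which is why the facility tier level and the number of sites are best decided together.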
Ideally, this realistic re-assessment and analysis should be a catalyst for a sense of shared responsibility between the IT and Facilities departments, as well as for the re-evaluation of how data center “availability” is ultimately architected, defined and measured in the age of virtualization and cloud based computing. These types of conversations and decisions must be motivated and made at the higher executive level of management.
Designing an enterprise type of user-owned data center is different from designing a co-lo, hosting or cloud data center. Also, the level of system redundancy does not have to exactly match the tier structure. Many sites have been designed with a higher level of electrical redundancy (e.g. 2N) while using an N+1 scheme for cooling systems. This is particularly true for sites that use individual CRAC units (which are autonomous), rather than a central chilled water plant.
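The difference between redundancy schemes can be made concrete with a simple k-of-n model. The sketch below is an illustration only: it assumes identical units that fail independently, ignores common-mode failures and maintenance windows, and every figure and name in it is hypothetical.

```python
from math import comb


def k_of_n(n, k, unit_availability):
    """Probability that at least k of n identical, independent units are up."""
    a = unit_availability
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical figures for illustration only.
crac_needed = 4           # N: CRAC units required to carry the cooling load
crac_availability = 0.98  # availability of a single CRAC unit
power_path = 0.999        # availability of one complete electrical path

# N: every one of the required units must be up.
cooling_n = k_of_n(crac_needed, crac_needed, crac_availability)
# N+1: one spare unit, so any N of the N+1 units suffice.
cooling_n_plus_1 = k_of_n(crac_needed + 1, crac_needed, crac_availability)
# 2N: two full electrical paths, either of which can carry the load.
power_2n = k_of_n(2, 1, power_path)

print(f"cooling N  : {cooling_n:.6f}")
print(f"cooling N+1: {cooling_n_plus_1:.6f}")
print(f"power 2N   : {power_2n:.6f}")
```

The autonomy of individual CRAC units is what makes the independence assumption plausible for cooling. A central chilled water plant has shared components that this model would not capture.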
Site Selection and Sustainable Energy Availability and Cost
The design and site selection processes need to be intertwined. Many issues go into site selection, such as geographic stability, power availability and climatic conditions, which will directly impact the type and design of the cooling system (see part 2, Total Cost of Ownership). Generally, the availability of sufficient power, along with the cost of energy, is near the top of the critical checklist of site evaluation questions. However, in our present era of social consciousness about sustainability issues, and with watchdog organizations such as Greenpeace, the source of the power has also become a factor, based on the type of fuel used to generate it, even if the data center itself is extremely energy efficient. Previously, those decisions were typically driven by the lowest cost of power. Some organizations have picked locations based on the ability to purchase commercial power that has some percentage of its generation from a sustainable source. The Green Grid has defined the Green Energy Coefficient (GEC), a metric that quantifies the portion of a facility’s energy that comes from green sources.
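As The Green Grid defines it, GEC is the green energy used by the facility divided by the total energy the facility consumes, so it ranges from 0 to 1. A minimal sketch of the calculation follows, with a hypothetical function name and made-up meter readings:

```python
def green_energy_coefficient(green_kwh, total_kwh):
    """Share of a facility's energy from green sources, from 0.0 to 1.0."""
    if total_kwh <= 0:
        raise ValueError("total_kwh must be positive")
    if not 0 <= green_kwh <= total_kwh:
        raise ValueError("green_kwh must be between 0 and total_kwh")
    return green_kwh / total_kwh


# Made-up annual figures for illustration.
gec = green_energy_coefficient(green_kwh=2_500_000, total_kwh=10_000_000)
print(f"GEC = {gec:.2f}")
```

Both inputs must cover the same period and be expressed in the same unit for the ratio to be meaningful.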
In other cases, some high profile organizations have built new leading edge data centers with on-site generation capacity such as fuel cell, solar and wind, to partially offset or minimize their use of less sustainable local utility generation fuel sources, such as coal. This impacts the TCO economics, since it requires a larger upfront capital investment; however, there may be some local and government tax or financial incentives available to offset the upfront costs. While this option may not be practical for every data center, green energy awareness is increasing and should not be ignored.
The complete Data Center Knowledge Executive Guide on Data Center Design is available in PDF, compliments of Digital Realty.