Pete Mastin has over 25 years of experience in business strategy, software architecture, operations and strategic partnership development. His background in cloud and CDN informs his current role with Cedexis as their lead evangelist.
In spite of what many have predicted, data centers continue to grow in popularity. The prevalence of “server huggers” and cloud privacy concerns will continue to keep a significant number of enterprises from taking their applications to the cloud.
As Ron Vokoun mentioned in his article on Top 10 Data Center Predictions 2015, fear, uncertainty and doubt (FUD) will also play a part in maintaining a steady need for new data centers (as opposed to wholesale migration to the cloud). We agree with his assessment that an optimized hybrid model of both is much more likely.
Bigger, Stronger, Faster … Or Are They?
Today’s data centers are built to be bigger, stronger and more resilient. Yet not a month goes by without news of a commercial or private data center failure. A survey of AFCOM members found that 81 percent of respondents had experienced a failure in the past five years, and 20 percent had been hit with at least five failures.
Most seasoned operations managers have a wall of RFO (Reason for Outage) reports. I called mine “The Wall of Shame.” Whether it is the UPS system, the cooling system, the connectivity or any of the myriad subsystems that keep the modern data center working, N+1 (or its derivatives) does not guarantee 100 percent uptime. Until the robots take over, nothing can mitigate human failure.
Outside These Four Walls
Furthermore, if your applications require top performance, there is a near-infinite number of things outside the data center that can impact you. Connectivity issues (both availability and latency) are out of your control, from acts of God to acts of man. Peering relationships change, backhoes continue to cut fiber and ships at sea continue to drag their anchors. Hurricanes, earthquakes, tornadoes, tsunamis and rodents on high wires will continue to avoid cooperation with data center needs.
So how do we overcome these challenges? The simple answer: multi-home your data center.
Split Them Up, Spread Them Out
There is no reason for the type of outages described above to impact the correctly configured enterprise application. Architects and designers have long realized that data center outages are a fact of life. Every disaster recovery and high availability architecture of the past 10+ years relies on the use of geographically diverse deployment. Generally, the best practices for critical application deployment are:
- Have your technology deployed across multiple availability zones to maximize uptime in case of natural disasters such as hurricanes or earthquakes.
- Have your technology deployed across multiple vendors. Vendor-specific outages are more common than natural disasters. Even carrier-neutral data centers often have backchannel loops between their own facilities, and these loops can be damaged. Other software-related failures can plague specific vendors and cause issues. Further, having multiple vendors can help lower your costs during annual renewals.
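To see why geographic and vendor diversity matters, a rough back-of-the-envelope calculation helps. The sketch below is only illustrative: the 99.9 percent per-site availability is an assumed figure, not one from the survey cited above, and the function names are hypothetical. It also assumes site failures are fully independent, which is exactly what the two rules above try to approximate; sites that share a vendor or a region will do worse than this model suggests.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` sites is up.

    Assumes every site has the same availability and that failures
    are independent (an idealization, not a guarantee).
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # All sites must be down at once for the application to be down.
    return 1.0 - (1.0 - site_availability) ** sites


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.999 is an assumed, illustrative per-site availability.
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)
        print(f"{n} site(s): {expected_downtime_hours(a):.5f} hours down per year")
```

Under these assumptions, a single site is down about 8.76 hours a year, while two independent sites are both down at the same moment for only about half a minute a year.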
One More Piece to the Puzzle
The first two are well-understood rules. But many architects miss the third leg of this stool: Adequate monitoring of your applications (and their attendant infrastructure), combined with global load balancing based on real-time performance data, is critical for 100 percent uptime.
All too often we see applications having performance issues because the monitoring solution used is measuring the wrong things or perhaps the right things but too infrequently. The type of monitoring can and will change based on a variety of factors. Our findings show that the best solution is to mix Application Performance Monitoring (APM) with a good Real User Measurements (RUM) tool to get the best of both types. Performance issues are avoidable when real-time traffic management is deployed. We propose the following addendum to the traditional rules above:
- Use real-time global traffic management based on a combination of APM and RUM data, and feed that data to your Global Traffic Management (GTM) tool so it can distribute traffic in an active-active configuration.
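As a rough illustration of this rule, the routing decision can be sketched in a few lines. This is a minimal sketch under stated assumptions, not any vendor's actual GTM logic: `SiteStatus`, `choose_site`, the site names and the `min_samples` threshold are all hypothetical. It treats the synthetic (APM) check as a health gate and then ranks the remaining sites by median real-user (RUM) latency.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class SiteStatus:
    """Hypothetical snapshot of one data center's monitoring data."""
    name: str
    apm_healthy: bool              # result of the latest synthetic (APM) check
    rum_latencies_ms: List[float]  # recent real-user (RUM) latency samples


def choose_site(sites: List[SiteStatus], min_samples: int = 3) -> Optional[str]:
    """Return the healthy site with the lowest median RUM latency.

    Returns None when no site passes its synthetic check.
    """
    # Sites failing the synthetic check are removed from rotation at once.
    healthy = [s for s in sites if s.apm_healthy]
    if not healthy:
        return None

    # Only rank sites with enough real-user samples to be meaningful.
    measured = [s for s in healthy if len(s.rum_latencies_ms) >= min_samples]
    if not measured:
        # No usable RUM data yet: fall back to the first healthy site.
        return healthy[0].name

    best = min(measured, key=lambda s: median(s.rum_latencies_ms))
    return best.name


if __name__ == "__main__":
    sites = [
        SiteStatus("us-east", True, [80.0, 95.0, 90.0]),
        SiteStatus("eu-west", True, [60.0, 70.0, 65.0]),
        SiteStatus("ap-south", False, [40.0, 45.0, 50.0]),  # fast but failing
    ]
    print(choose_site(sites))
```

Because every healthy site stays eligible on each decision, this is an active-active arrangement: traffic shifts whenever the measurements change, not only when a site fails outright. A production system would make this choice per user network or region rather than globally.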
Following these best practices will allow applications to maintain 100 percent uptime and the best possible performance, regardless of their providers’ maintenance or acts of God. There is no substitute for RUM in this equation. While synthetic measurements (via APM) are a very important part of the mix, you do not really understand what your end users experience unless you measure it. While this seems like a tautology, far too many failover solutions miss this vital point. If your data center goes down, you must immediately route traffic away. The true measure of a failure is what your end users experience.
The upside of this approach (if deployed correctly) is that you actually get improved performance – since the traffic will flow to the best performing data center – even when nothing is going catastrophically wrong. This will make your end users happy which, after all, is what we’re here for. At least until the robots take over.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.