Insight and analysis on the data center space from industry thought leaders.

In Distributed Systems, Plan to Fail

When planning for failure, a faltering DNS can cause applications and data to become slow or unavailable, resulting in lost revenue and negative experiences.

Shannon Weyrick is VP of Architecture at NS1.

There is an age-old adage in business: “Failing to plan is planning to fail.” But when it comes to distributed systems, planning to fail (or, more accurately, planning for failure) is instrumental in ensuring uptime, security, performance, and resilience. Failure comes in many forms: human error, system outages, natural disasters, and sometimes an organized attack. Organizations that successfully navigate failure have two things in common: they build an architecture that plans for these failures, and they routinely stress-test their systems to identify weaknesses and areas for improvement.

When planning for failure, a key area to consider is DNS. Because DNS is the entry point to everything your business and customers do online, DNS failures can cause applications and data to become slow or unavailable, resulting in lost revenue and negative experiences. This failure is not binary; it comes in degrees. For example, routing to suboptimal resources can create just as many issues as a system outage. Here are a few best practices to help your organization plan in advance and reduce the chance of DNS failure.

To Err is Human, to Automate is Divine

To plan for human error, you should turn to automation. DNS has become integral to modern infrastructure and application delivery systems, making it a complex environment to manage. Manual policy management and other monotonous workflow processes are time-consuming and error-prone. Automation presents an opportunity to reduce errors and their associated failures. Additionally, automating tedious manual processes frees employees to improve infrastructure and to develop and deploy next-generation systems.

One such improvement may be to implement an analytics engine for better visibility into traffic routing, which may subsequently inform further infrastructure innovations. Automating error-prone processes is intended to boost uptime and availability, with the benefit of improving application experiences. Automation is an important first step in mitigating failure because it can be applied at every step along the way.
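As one illustration of automating an error-prone manual step, the sketch below checks a proposed set of DNS records before anyone applies them. It is a minimal example under stated assumptions, not a description of any particular provider's tooling: the record layout (dicts with "name", "type", "value", and "ttl" keys) and the function name `validate_records` are hypothetical, and the three checks are only a sample of what a real pipeline might enforce.

```python
import ipaddress
from collections import defaultdict


def validate_records(records):
    """Return a list of problems found in a proposed set of DNS records.

    Each record is a dict with the hypothetical keys
    "name", "type", "value", and "ttl".
    """
    problems = []
    types_by_name = defaultdict(set)

    for rec in records:
        name, rtype, value, ttl = rec["name"], rec["type"], rec["value"], rec["ttl"]
        types_by_name[name].add(rtype)

        # A negative TTL is never meaningful.
        if ttl < 0:
            problems.append(f"{name} {rtype}: negative TTL {ttl}")

        # A and AAAA records must hold an address of the matching family.
        if rtype in ("A", "AAAA"):
            try:
                addr = ipaddress.ip_address(value)
            except ValueError:
                problems.append(f"{name} {rtype}: {value!r} is not an IP address")
                continue
            expected_version = 4 if rtype == "A" else 6
            if addr.version != expected_version:
                problems.append(f"{name} {rtype}: {value} is IPv{addr.version}")

    # A CNAME must not share its name with records of other types.
    for name, types in types_by_name.items():
        if "CNAME" in types and len(types) > 1:
            others = sorted(types - {"CNAME"})
            problems.append(f"{name}: CNAME coexists with {others}")

    return problems


# Hypothetical change set containing two typical hand-editing mistakes.
proposed = [
    {"name": "www.example.com", "type": "A", "value": "192.0.2.300", "ttl": 300},
    {"name": "app.example.com", "type": "CNAME", "value": "lb.example.net", "ttl": 300},
    {"name": "app.example.com", "type": "A", "value": "192.0.2.10", "ttl": 300},
]
for problem in validate_records(proposed):
    print(problem)
```

Run as a gate in a deployment pipeline, a check like this would reject the change set above for the malformed address and the conflicting CNAME before either reaches production.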

Redundant Systems Eliminate Single Points of Failure

Step one in planning for system outages is to develop redundant systems. The popular phrase 'one is none' rings true in the realm of networking. DNS outages can be mitigated, but not if there is only one server. Likewise, if all of your DNS servers are in one data center, then a data center outage will affect all of your servers. Multiple points of presence remove these single points of failure.

Secondary DNS increases server availability across two separate DNS networks with unique addresses and resources. Importantly, secondary DNS does not always require an additional service provider. In fact, a second provider may cause integration headaches and diminished feature functionality.
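The redundancy checks described in this section can be sketched as a small audit over an inventory of name servers. This is only a rough illustration: the inventory format, the field names "data_center" and "network", and the function `single_points_of_failure` are all assumptions made for the example, not part of any real product.

```python
def single_points_of_failure(servers):
    """List the ways a set of name servers lacks redundancy.

    Each server is a dict with the hypothetical keys
    "host", "data_center", and "network".
    """
    findings = []

    # 'One is none': a single server cannot survive its own outage.
    if len(servers) < 2:
        findings.append(f"only {len(servers)} name server(s) configured")

    # Every server sharing one data center or one DNS network
    # means a single outage can take all of them down together.
    for dimension in ("data_center", "network"):
        values = {server[dimension] for server in servers}
        if len(values) == 1:
            findings.append(f"all servers share one {dimension}: {values.pop()}")

    return findings


# Hypothetical inventory: two servers, but both in the same data center.
inventory = [
    {"host": "ns1.example.com", "data_center": "ams-1", "network": "primary"},
    {"host": "ns2.example.com", "data_center": "ams-1", "network": "secondary"},
]
for finding in single_points_of_failure(inventory):
    print(finding)
```

Here the audit reports the shared data center even though the two servers sit on separate DNS networks, which is the kind of hidden dependency that is easy to miss by inspection.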

Stress It, Then Fix It

To prepare for the eventuality of coordinated attacks, operations teams should practice incident response. Companies operating in the critical path of internet traffic are constantly exposed to DDoS attacks of all types and scales. While Mirai-scale attacks generate the biggest headlines, most attacks are much smaller. Ideally, in most at-scale systems, the smaller and more mundane attacks are mitigated automatically. But because scale can vary, and attacks can progress dynamically as attackers get creative, operations teams need to be ready to respond.

Incident response drills, or war games, can help keep your skills sharp. Without practice, you may be unaware that your tools have broken or forget how to use them. Stressing systems reveals their weaknesses, so they can be improved. Examples of areas to stress test include capacity, data center outages, and disaster recovery. Examples of output from these tests include operator confidence, better documentation, bug fixes, and improved architecture.
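Before running a live drill, a team might do a quick paper exercise on capacity and data center outages. The sketch below is one such tabletop check, with loud caveats: the site names, capacity figures, and the function `outage_drill` are invented for illustration, and a real drill would measure actual behavior under load, not just add up numbers.

```python
def outage_drill(site_capacity_qps, peak_qps):
    """Simulate losing each site in turn.

    site_capacity_qps maps a site name to the queries per second it can serve.
    Returns a dict mapping each site to True if the remaining sites
    could still carry peak_qps with that site down.
    """
    results = {}
    for down_site in site_capacity_qps:
        remaining = sum(
            capacity
            for site, capacity in site_capacity_qps.items()
            if site != down_site
        )
        results[down_site] = remaining >= peak_qps
    return results


# Hypothetical figures for three points of presence.
sites = {"ams": 50_000, "nyc": 50_000, "sin": 20_000}
for site, survives in outage_drill(sites, peak_qps=80_000).items():
    status = "ok" if survives else "insufficient capacity"
    print(f"lose {site}: {status}")
```

With these made-up numbers, losing either of the two large sites leaves too little capacity for peak load, which would point to an architecture fix before the live test is even scheduled.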

Get By With a Little Help from Your Friends 

Planning for failure becomes a more important investment as the cost of failure rises. Building redundant infrastructure from the ground up can be quite the endeavor, and until failure occurs, much of this infrastructure may go unused or underutilized. Organizations should therefore consider partnering with a DNS service provider instead of building their own network. In this way, they can get the best of both worlds: an architecture that accounts for failure, at an economy of scale that any organization can afford.

Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating.

