Robert Haynes is a Networking Architecture Expert with F5 Networks.
Applications fail. Large applications. Small applications with the potential to be the next big thing. Applications with redundant infrastructures. And even applications in the cloud. Sometimes they fail suddenly, and sometimes they just can’t cope with demand.
When they do fail, it’s not long before you give your vendors a call. You are paying them for support, after all. We’ve been on the other side of that phone call many times because we have thousands of customers running every kind of app, in almost every infrastructure model you can think of.
Here are the top five causes of application failures we’ve seen over the years:
Mostly, it’s Human Error
Most failures are due to admin error. In fact, several of my colleagues put this as reasons 1-3 of their top 5. These errors can be simple mistakes, such as rebooting the production database cluster instead of the QA cluster. Or they can be systemic errors in the overarching architecture design, like synchronous storage mirroring without checkpoints, which copies a database corruption to the DR site in real time. My advice: mitigate these risks through increased automation and testing. Your changes should be preconfigured, tested, and then executed in production with minimal opportunity for error. For this to work, it’s important that every component from your live environment is represented in the test environment. Fortunately, nearly all vendors now offer virtual versions of their components in a ‘lab’ edition. This allows you to create a test environment that behaves as much like your production environment as possible.
It Looks Like it’s Working, but it’s Not
Another common cause of failure is an application server failure or misbehavior that remains undetected by monitoring systems. Just because the application server responds to an ICMP ping or returns a “200 OK” to an HTTP request does not mean things are working properly. Monitoring and health checking services must report application health accurately. For a web application, make sure that your health checks perform a realistic request and look for a valid response. Some organizations even create a specific monitor page that exercises critical application functions and returns a valid response only if they succeed.
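As a rough illustration of a health check that looks past the status code, here is a minimal Python sketch. It assumes a hypothetical monitor page that returns JSON with a `database` field and an `orders_readable` flag. Those field names, and whatever URL you point it at, are invented for this example and would need to match your own application.

```python
import json
import urllib.request


def evaluate_health(status, body):
    """Return True only if the response shows the application did real work."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # A 200 with an HTML error page or an empty body is not healthy.
        return False
    if not isinstance(payload, dict):
        return False
    # Hypothetical contract: the monitor page reports on the critical
    # functions it exercised (field names are assumptions).
    return payload.get("database") == "ok" and payload.get("orders_readable") is True


def check_app(url, timeout=5.0):
    """Fetch the monitor page and validate its content, not just its status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return evaluate_health(resp.status, body)
    except (OSError, ValueError):
        # Connection errors, timeouts, HTTP errors, and malformed URLs all count as unhealthy.
        return False
```

Keeping the evaluation separate from the fetch means the rule for what counts as healthy can be tested in the lab environment without a live server.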
Capacity Planning Failures
Sure, the application worked in test, and flew through user acceptance testing. What happens once it goes live and twice as many users as predicted turn up? An unusably slow application is effectively offline. In an ideal world, of course, we would test our production applications against the expected load and beyond. But testing applications at scale can be complex and expensive. And predicting application demand can be difficult. The best mitigation is to build application architectures that can scale reliably and rapidly. Fortunately, mostly thanks to cloud computing, there are plenty of design patterns for applications that can scale horizontally to meet demand. Designing an application architecture built to scale from the start will help you respond rapidly to unexpected demand.
With a Whimper, not a Bang
Some of the hardest application failures to detect don’t happen with a shower of sparks before the (virtual) lights go out. They occur over time, slowly building up until their effects become noticeable: memory leaks, connections held open, database cursors consumed. Because applications are now complicated, interconnected entities composed of many processes, finding the culprit can be tricky. Under pressure to fix the problem quickly, the old standby of “have you tried turning it off and on again?” can be a tempting fix. However, unless you have appropriate resource monitoring, you’re probably going to be back here soon. If the application doesn’t restart cleanly, you might need to rely on your backup or DR procedures.
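One way to catch this kind of slow build-up is to watch the trend of a resource as well as its current value. The Python sketch below fits a straight line to evenly spaced samples (memory in MB, open connections, cursors in use) and estimates how many more sampling intervals remain before a limit is reached. The function names and the linear-growth assumption are illustrative only, and real leaks are not always linear.

```python
def slope_per_sample(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(samples, limit):
    """Estimate the sample intervals left before `limit` is hit.

    Returns None when usage is flat or falling, i.e. no leak-like trend.
    """
    slope = slope_per_sample(samples)
    if slope <= 0:
        return None
    remaining = limit - samples[-1]
    return max(0.0, remaining / slope)


# Example: memory (MB) sampled once an hour, with a 200 MB limit.
hours_left = samples_until_limit([100, 110, 120, 130, 140], limit=200)
print(hours_left)  # 6.0
```

An alert on "hours left" gives the team time to find the culprit before a restart becomes the only option.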
When the DR, Backup, or Failover Doesn’t Work
Although you might scoff at the thought of backups and DR not being tested, it’s surprisingly common. Stories abound of backups silently failing for months, critical servers being missed from schedules, and, in an extreme example, DR equipment being ‘repurposed’ to meet another project’s deadlines. In an example I personally witnessed, this last case led to eight days of downtime during a critical business period. When the production database went down and stayed down, the operations team enacted the DR failover procedure. Except, half the DR site was now missing. The result: the company was unable to sign up any new users, leading to a significant financial impact on the business. This is an unusual example, but unless your backups, your DR, and your procedures are testable and tested, you might find yourself in a similar situation.
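A small automated check can at least catch backups that have silently stopped. The Python sketch below flags the case where the newest backup is too old or suspiciously small. The 26-hour threshold, the minimum size, and the idea that backups land as files in a single directory are all assumptions for illustration. It is no substitute for a real test restore.

```python
from pathlib import Path


def backup_problems(backups, now, max_age_hours=26, min_bytes=1):
    """Check a list of (name, size_bytes, mtime_epoch_seconds) records.

    Returns a list of human-readable problems; an empty list means the
    newest backup looks recent and non-empty.
    """
    if not backups:
        return ["no backups found"]
    # Only the most recent backup matters for freshness.
    name, size, mtime = max(backups, key=lambda b: b[2])
    problems = []
    age_hours = (now - mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"newest backup {name} is {age_hours:.0f}h old")
    if size < min_bytes:
        problems.append(f"newest backup {name} is only {size} bytes")
    return problems


def scan_backup_dir(directory):
    """Collect (name, size, mtime) for each file in a backup directory."""
    return [
        (p.name, p.stat().st_size, p.stat().st_mtime)
        for p in Path(directory).iterdir()
        if p.is_file()
    ]
```

In use, something like `backup_problems(scan_backup_dir(some_dir), time.time())` could run on a schedule and page someone when the list is not empty, with `some_dir` standing in for wherever your backups are written.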
What Probably isn’t to Blame? Hardware
When in doubt, blame the build, not the bricks. About the least common cause of application outage is hardware failure, which happens when a device just crashes or stops working. Clean failures are usually easy to deal with, and most critical components run in clusters of two or more. Application server farms span multiple physical hosts, and storage subsystems have RAID and other technologies to protect data. Everyone I’ve consulted uniformly places individual hardware failure at the bottom of the list (and human error at the top).
Looking at this list, it’s clear that focusing on reducing the chances for human error and designing for scalability can prevent application failure. Just as important is excellent visibility for spotting problems early, combined with robust and tested recovery procedures for when all else fails.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.