Peter Waterhouse is a Senior Strategist for CA Technologies.
In so many ways, IT operations has developed a military-style culture. If IT ops teams are not fighting fires, they’re triaging application casualties. Tech engineers are the troubleshooters and problem solvers who hunker down in command centers and war rooms.
For the battle-weary on-call staff who are regularly dragged out of bed in the middle of the night, constantly dealing with flaky infrastructure and poorly designed applications carries a heavy personal toll. So, what are the signs that an IT organization is engaged in bad on-call practices? Three obvious ones to consider:
Support teams are overloaded - Any talk of continuous delivery counts for squat if systems are badly designed, hurriedly released and poorly tested. If teams are constantly running from one problem to another, then someone or something will eventually break. Of course, good application support engineers try to do the right thing by patching up systems to keep them in action. But such are the stresses of working in these environments that no time is ever available to work on permanent solutions. The result: Applications with Band-Aids just limp from one major outage to the next.
Bad practice becomes the norm - If on-call staff are constantly asked to deal with floods of false alarms, then any sense of urgency in responding to those alerts will be diminished – staff become desensitized. It’s a problem well understood in the field of healthcare, where clinical staff have been known to dial back cardiac alarm systems due to a nuisance factor. Similarly in IT, when on-call staff have alert fatigue they might be inclined to rejig some alert thresholds or hack up an automation to put old incident paging systems into snooze mode. Whatever the cheat, the results are never good.
Poor visibility and insight - What could be worse than being woken up at 3 a.m. to deal with a tech crisis? Being woken up at 3 a.m. and being absolutely powerless to do anything about it. Even with a swag of open source monitoring tools at their disposal, including log aggregators and dashboard systems, on-call teams still struggle to address complex problems. Not because these tools are bad per se, but because narrowly focused monitoring only provides partial answers. That’s always been troublesome, but it is now even more problematic due to the distributed, API-centric nature of microservice-style architectures.
Poor visibility doesn’t only manifest technically; there are people issues too. If senior managers aren’t aware of on-call burnout or just turn a blind eye, then methods should be employed to help them wake up and smell the stink. A good place to start is discussing the people cost associated with stressful on-call rotations. If, however, the empathetic approach falls short, try presenting all those latency, saturation and utilization issues in the context of business impact – like revenue, profit and customer satisfaction.
Improving Conditions for Better Business Results
Apart from using monitoring to present on-call calamities in clear business terms, there are many other common-sense approaches that can help give on-callers their lives back.
Make alerts actionable - What’s the point of alerting on machine-related issues when they have no tangible impact on the business? Good monitoring avoids this by aggregating metrics at the service level and only alerting on-call staff when customers are hurting and problems need fixing immediately. Anything else can wait until tomorrow, when everyone’s had a good night’s sleep.
Automate runbooks - It’s a good practice to develop concise documentation that guides on-call staff during major service disruptions. That’s all fine and dandy but runbook effectiveness is highly dependent on development teams providing clear and up-to-date instructions, which isn’t always top of mind. Although there’s no substitute for good support documentation, advanced analytics-based monitoring tools can augment manual detective work with fully automated evidence gathering, correlation and recovery workflows.
Put developers on-call - However good on-call support engineers are, no one knows the idiosyncrasies of an application better than the people who wrote the actual code. Putting developers on call means the people who’ve most likely caused the problem are the ones being put on the spot to fix it. Witnessing programming stuff-ups firsthand in the small hours of the morning is also a great motivator to put things right - permanently.
Audit continuously - Even if the ultimate goal is to never page on-call staff, a more realistic objective is to ensure staff never get paged for the same problem twice. Again, good monitoring tools and analytics can support this by, for example, reviewing performance alerts over historical time periods and correlating them with infrastructure or code changes.
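To make two of these ideas concrete, here is a minimal Python sketch of service-level alerting and of a repeat-page audit. It is an illustration only, not the behavior of any particular monitoring product: the HostSample shape, the function names, the 5 percent error-rate threshold and the sample data are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class HostSample:
    """One host's request counts for a service over a short window (hypothetical shape)."""
    service: str
    requests: int
    errors: int


def services_to_page(samples, max_error_rate=0.05):
    """Aggregate host samples per service and return the services whose
    overall error rate exceeds the (assumed) threshold."""
    totals = {}
    for s in samples:
        req, err = totals.get(s.service, (0, 0))
        totals[s.service] = (req + s.requests, err + s.errors)
    return sorted(
        name for name, (req, err) in totals.items()
        if req > 0 and err / req > max_error_rate
    )


def repeat_pages(page_log):
    """Given (service, cause) tuples from past pages, return those seen more than once."""
    counts = Counter(page_log)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative data: one checkout host fails every request, but it carries so
# little traffic that the service as a whole stays under the threshold.
samples = [
    HostSample("checkout", 1000, 2),
    HostSample("checkout", 50, 50),
    HostSample("search", 400, 60),
]
print(services_to_page(samples))  # ['search']

page_log = [
    ("search", "disk full"),
    ("checkout", "db timeout"),
    ("search", "disk full"),
]
print(repeat_pages(page_log))  # {('search', 'disk full'): 2}
```

In this sketch the failing checkout host would be left for the morning, because customers of the service as a whole are barely affected, while the repeated "disk full" page is exactly the kind of item an audit would flag for a permanent fix. A real system would also need time windows, change correlation and latency or saturation signals, which are left out here.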
When employees are continuously placed in stressful on-call situations, they burn out, and so will your business. By combining constant on-call reviews and auditing with advanced monitoring practices, organizations can reduce alert fatigue, increase service reliability and cut the need for costly unplanned work.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Penton.