Ralph Eck is General Manager at Monitis.
Sink or swim. This is precisely what it boils down to when system administrators (SysAdmins) are dealing with the influx of data coming from all directions. Do this, drop that, careful there! While IT monitoring is meant to provide some guidance and give direction, it very often does the exact opposite. This is where monitoring de-escalation management comes into play to change things for the better.
Monitoring is about collecting the data you need in order to keep your crucial IT systems running. And even though this may sound blatantly obvious, there is more to it than first meets the eye. Monitoring may easily leave you with tons of data that means next to nothing – if you do not structure it right.
The most obvious distinction to make is whether you are more of a reports or an alerts kind of person. Reports and alerts both help account for the health of a system, yet reports are primarily used to document its overall state. Say, for instance, you are a web hosting provider and want to demonstrate the quality of your service to your clients: a report will serve this purpose just fine, assuming that everything is as it should be.
But then again, a report will not come out right automatically. Too many issues will affect your overall service quality and bring it down to a level where it definitely should not be. So you need to act as soon as you get the first indication that something is going wrong, and that is precisely where an alert will help you keep matters on track. In other words: alerts allow you to catch an issue before it becomes a problem. Alerts are therefore what SysAdmins must tend to so that the reports show a healthy system.
The Need for Incident Management is Clear
Today’s monitoring technology lets SysAdmins receive automatic alerts whenever a monitor detects a problem – that is certainly not a major bit of news. Even the fact that you can decide whether an alert arrives as an email, text message or phone call will not necessarily sweep you off your feet.
Yet there are a couple of crucial factors that need to be recognized and dealt with, including proper incident management. Each incident needs to be handled appropriately, and a proper escalation routine is the first step to ensure an alert is brought to the right person’s attention at the right time. For instance, a text message alert sent in the middle of the night, while the recipient is away from their desk and presumably sound asleep, is likely to go unnoticed. This would just be a dead end.
While this may not be a problem if it concerns a minor issue, it may be a totally different story if it is about a vital object of your system. If this critical object is possibly in danger, you will want to make sure the alert gets to the right person in the right manner. So incident management sure matters, but it is still not all that needs to be taken into account.
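To make the escalation idea a little more tangible, here is a minimal sketch in Python of how a delivery channel might be picked from the severity of an alert and the recipient's local time. The function name, the channel labels and the 22:00 to 07:00 night window are assumptions made for illustration only; they do not describe any particular monitoring product.

```python
from datetime import time

# Hypothetical night window; a real escalation policy would define its own hours.
NIGHT_START = time(22, 0)
NIGHT_END = time(7, 0)


def choose_channel(severity: str, local_time: time) -> str:
    """Pick a delivery channel for an alert (illustrative policy only)."""
    is_night = local_time >= NIGHT_START or local_time < NIGHT_END
    if severity == "critical":
        # A text message may go unnoticed at night, so escalate to a phone call.
        return "phone_call" if is_night else "text_message"
    # Minor issues can wait in the inbox until someone is back at their desk.
    return "email"


print(choose_channel("critical", time(3, 0)))
print(choose_channel("warning", time(3, 0)))
```

The point of the sketch is only that the channel depends on both how serious the issue is and whether the recipient can reasonably be expected to notice it.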
Threshold to Determine Severity Level
To determine the correct escalation path, thresholds first need to be defined so each problem state can be assigned a severity level. This will help determine whether an alert is critical or can simply be handled as a warning that something is outside its usual parameters.
This matters all the more because what is important to one organization may not be as important to another. So while some users need to know whether a server fails to respond within a predefined timeout, others will want to find out whether page elements fail to load, or whether their RAM usage crosses a set threshold.
However, as any SysAdmin will tell you, inconsistent or even conflicting alerts can sometimes be almost as frustrating as not getting any results at all. The key to preventing that is to have more than one monitoring location, so that the various locations can challenge each other's results.
Finally, it is certainly critical to make sure that one issue is not being fixed by two people at the same time. This would most definitely be a waste of resources. So when an alert is delivered to multiple parties, it is important to inform all involved hands when somebody takes ownership of this particular issue. Everyone needs to be informed about the status, so that unnecessary task repetition can be avoided.
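As a rough illustration of thresholding, the following Python sketch maps a measured value to a severity level. The metric names and the numeric limits are invented for the example; every organization would set its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float   # value at which a warning is raised
    critical: float  # value at which the alert becomes critical


def severity(value: float, threshold: Threshold) -> str:
    """Assign a severity level to a single measurement."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Hypothetical limits for two of the metrics mentioned above.
THRESHOLDS = {
    "response_time_ms": Threshold(warning=800, critical=2000),
    "ram_used_percent": Threshold(warning=80, critical=95),
}

readings = {"response_time_ms": 1200, "ram_used_percent": 97}
levels = {name: severity(value, THRESHOLDS[name]) for name, value in readings.items()}
print(levels)
```

With these made-up limits, the slow response would be treated as a warning while the memory reading would be critical and follow the stricter escalation path.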
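One simple way to let locations challenge each other is a majority vote: an outage only counts as confirmed when more than half of the locations report a failure. The Python sketch below assumes exactly that rule, and the location names are made up; other confirmation rules are equally conceivable.

```python
from typing import Dict


def confirmed_down(results: Dict[str, bool], quorum: float = 0.5) -> bool:
    """Return True if more than `quorum` of the locations report a failure.

    `results` maps a location name to True when the check succeeded there.
    """
    if not results:
        return False  # no data means nothing to confirm
    failures = sum(1 for ok in results.values() if not ok)
    return failures / len(results) > quorum


# One location fails while two succeed: likely a local network issue.
print(confirmed_down({"us-east": False, "eu-west": True, "ap-south": True}))
# Two of three locations fail: the outage is treated as confirmed.
print(confirmed_down({"us-east": False, "eu-west": False, "ap-south": True}))
```

A single failing location then no longer triggers a page on its own, which cuts down on conflicting alerts.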
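Alert acknowledgement can be sketched in a few lines of Python. The class below is purely hypothetical: the first person to acknowledge becomes the owner, and the method returns the notifications that would be sent to everyone else on the alert.

```python
class Alert:
    """An alert sent to several recipients (illustrative sketch only)."""

    def __init__(self, summary, recipients):
        self.summary = summary
        self.recipients = list(recipients)
        self.owner = None  # nobody has taken the issue yet

    def acknowledge(self, person):
        """Claim the alert; return (recipient, message) pairs to send out."""
        if self.owner is not None:
            # Someone else is already on it, so only the latecomer is told.
            return [(person, f"'{self.summary}' is already owned by {self.owner}")]
        self.owner = person
        return [
            (r, f"{person} took ownership of '{self.summary}'")
            for r in self.recipients
            if r != person
        ]


alert = Alert("db01 unreachable", ["ana", "ben", "cho"])
for recipient, message in alert.acknowledge("ben"):
    print(recipient, "<-", message)
```

A second acknowledgement by another person leaves the owner unchanged and simply tells that person the issue is already taken.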
De-escalation Management Can Make a World of Difference
Monitoring certainly remains a key job for system admins and IT managers alike. Yet its success hinges on a proper setup, prioritization, and key elements like incident management, thresholding, and alert acknowledgement. If these elements are not in place, monitoring can easily add to the misery of the folks who make IT systems work. But if they are, they can make a world of difference.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.