Everywhere we turn we’re surrounded by “Policies”. Foreign policies, “Social Media” policies, you name it. Even 7-Eleven has a policy: “No shirt, No shoes, No service”. Policies have also wound their way into corporate IT departments. However, one area where defined guidelines are often overlooked is alerting for systems and network problems. Let’s dig in, because why should everybody else get to have all the rule-making fun?
What is an alerting policy? Well, it’s exactly what it sounds like: A collection of practices and procedures centered on the alerting of “bad stuff” in your IT infrastructure. However, even that definition is too vague for our use. The goal here is to discuss the detailed building blocks of an effective alerting policy. A lot of the explanation here may seem like common sense, but a codified policy will make your life as an administrator easier and day-to-day activities more transparent to end users and management alike.
A couple months back in this blog we discussed the difference between reporting and alerting in your NMS system. The takeaway is that you shouldn’t configure variables for alerting unless you’re prepared to DROP EVERYTHING to fix that particular problem the INSTANT that alert comes in. Alert generation is the first element to consider in your policy. If a thing isn’t CRITICAL to the operation of your infrastructure or business, then it’s best not to send alerts for it. Generally speaking, variables such as host outages, service outages, and disk consumption thresholds are examples of critical problems that can be addressed immediately and need alerts tied to them. Of course, every environment is a little different, but that is a good starting point.
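To make the alert-versus-report split concrete, here is a minimal sketch in Python. The check names, thresholds, and the `classify` helper are all hypothetical and not tied to any particular NMS; the only point is that a check marked non-critical never produces an alert, no matter how far over its threshold it goes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    critical: bool    # would you drop everything to fix this?
    threshold: float  # value at or above which the check is in violation


# Hypothetical checks; real names and thresholds depend on your environment.
CHECKS = [
    Check("disk_used_percent", critical=True, threshold=90.0),
    Check("host_down", critical=True, threshold=1.0),
    Check("cpu_used_percent", critical=False, threshold=95.0),
]


def classify(check, value):
    """Return 'alert', 'report', or 'ok' for a single measurement."""
    if value < check.threshold:
        return "ok"
    # Only critical checks interrupt someone; the rest go into a report.
    return "alert" if check.critical else "report"
```

With these assumed values, a disk at 95% full raises an alert, while a CPU at 99% only shows up in reporting.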
Once you’ve decided what variables to alert on versus those to ignore, the next thing to consider is “who” should get the alerts. Do three “Jack-of-All-Trades” engineers share responsibility in maintaining your IT infrastructure? Alternatively, do the systems people and network engineers lord over their own turfs and refuse to play nice with one another? Depending on levels of specialization, you don’t want all alerts to go to all people. And, based on the “Potayto, Potahto” idea from the link above, you don’t want them to either. Instead, you only want your NMS generating alerts that are meaningful and actionable by the recipient.
An extension of this topic requires an answer to the question “where should we do this alert configuration?” Most, if not all, NMS systems will allow you to input individual contacts for recipients, but this isn’t the best idea. Rather, go into your email system and configure distribution groups for inclusion on your NMS system. You now have a single point of alteration when IT personnel change.
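As a rough illustration of routing by specialty, the sketch below maps an alert category to a distribution group rather than to named individuals. The categories and the `example.com` addresses are made up for this example; group membership would be maintained in the email system, so nothing here changes when staff come and go.

```python
# Hypothetical mapping of alert category to email distribution group.
ROUTES = {
    "network": "network-oncall@example.com",
    "systems": "systems-oncall@example.com",
    "applications": "apps-oncall@example.com",
}

# Assumed catch-all group for categories nobody has claimed yet.
DEFAULT_GROUP = "it-oncall@example.com"


def recipient_for(category):
    """Return the distribution group that should receive this alert."""
    return ROUTES.get(category, DEFAULT_GROUP)
```

The fallback group is a design choice: an unrouted alert still reaches someone instead of disappearing.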
What if the applications folks are busy fighting a fire with your CRM system? Or if the phone is ringing off the hook because the CEO can’t log into the mail server? In that case we get into the third major part of an effective alerting policy: Alert Escalation.
How long is too long before your NMS system sends a follow-up alert for a persistent problem? 60 minutes? 24 hours? As with the other aspects of alerting policy, the proper value depends on the business need, type of alert, etc. For re-notifications you want to strike a balance between not causing too much “noise” in your inbox and still conveying the importance of addressing the issue (because you are only alerting on immediate/actionable things, right?).
There is another aspect of alert escalation to consider as well. What if your systems administrator and network engineer would feel more comfortable facing off against one another in Thunderdome rather than maintaining a healthy IT environment? Or it’s the day after a long July 4th weekend and there’s not an engineer in sight to deal with problems? Who do you want the problem escalated to when the primary contact isn’t available or doesn’t respond? Again, since the idea is that only really important matters are sending alerts, the ideal escalation person is somebody who can address the issue. However, it’s also fairly common to set a manager as the secondary contact. Whoever is chosen, they must have the gravitas within the organization to get the right people onto remediation ASAP.
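Re-notification and escalation can be thought of as two timers running against an unacknowledged alert. The sketch below is one hypothetical way to express that; the 60-minute and 2-hour values are placeholders, and the function and parameter names are invented for illustration rather than taken from any NMS.

```python
from datetime import datetime, timedelta

# Assumed policy values; the right numbers depend on the business need.
RENOTIFY_EVERY = timedelta(minutes=60)
ESCALATE_AFTER = timedelta(hours=2)


def next_action(opened_at, last_notified_at, now,
                acknowledged=False, escalated=False):
    """Decide what to do with an open alert: 'none', 'renotify', or 'escalate'."""
    if acknowledged:
        # Somebody owns the problem, so stop adding noise.
        return "none"
    if not escalated and now - opened_at >= ESCALATE_AFTER:
        # The primary contact has not responded in time; go to the secondary.
        return "escalate"
    if now - last_notified_at >= RENOTIFY_EVERY:
        return "renotify"
    return "none"
```

In this sketch the caller is expected to record that an escalation happened and pass `escalated=True` afterwards, so the secondary contact is brought in once rather than on every check.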
Last on the list of elements to consider in your alerting policy is the idea of planned outage handling. This topic, of course, refers to things such as “Patch Tuesday” or “Reboot Sunday” when it is expected that devices or services in the infrastructure will be unavailable. What is the proper way to handle those scenarios? Should we create a recurring maintenance window that mutes alerting for a small subset of devices? Is it preferable to shut off alerting for the NMS system as a whole? Once again, answers are business-dependent.
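For the narrower option, a recurring maintenance window that mutes only a subset of devices, a minimal sketch might look like the following. The window times, device names, and the `should_alert` helper are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class MaintenanceWindow:
    weekday: int        # 0 = Monday ... 6 = Sunday, as in datetime.weekday()
    start: time         # inclusive
    end: time           # exclusive
    devices: frozenset  # only these devices are muted

    def mutes(self, device, when):
        """True if alerts for this device are suppressed at the given time."""
        return (device in self.devices
                and when.weekday() == self.weekday
                and self.start <= when.time() < self.end)


# Hypothetical "Reboot Sunday" window: 02:00 to 04:00 for two servers.
REBOOT_SUNDAY = MaintenanceWindow(
    weekday=6,
    start=time(2, 0),
    end=time(4, 0),
    devices=frozenset({"db01", "web01"}),
)


def should_alert(device, when, windows):
    """Alert unless some maintenance window mutes this device right now."""
    return not any(w.mutes(device, when) for w in windows)
```

Scoping the window to named devices means an unrelated outage during “Reboot Sunday” still generates an alert, which is the main argument against muting the NMS as a whole.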
On dictionary.com the idiom “Separate the wheat from the chaff” is literally defined as “sorting the valuable from the worthless.” While it refers to the ancient act of winnowing harvested grain, it makes perfect sense within the context of an alerting policy. By defining the most important variables to alert on, who those alerts should go to, notification times, escalation paths, and planned outage procedures, your NMS system will be broadcasting the valuable and jettisoning the noise. Happy winnowing and rule-making, gang!