The longer you work at a landfill, the less the smell alarms you. The longer your network monitoring system dashboard is lit up in red and yellow when nothing’s really wrong, the less the alerts will mean to you. Configure your network monitor alert thresholds so red really means “Do something now.”
One of our senior service engineers (SEs) was visiting a client not long after installing a network monitoring system. Looking at the client’s dashboard, the SE saw something rare and wonderful.
He said to the client, “Everything on your main page is green. Is that normal? Is everything (meaning the monitoring system) working okay?”
“Oh yeah. I run a pretty tight ship around here,” the client answered. “I know what’s supposed to be monitored and when. My thresholds are all dialed in. That way, if I go to OmniCenter’s UI and I see something yellow or red, I know for sure there’s a problem.”
Granted, this client’s network is nowhere near the scale of those our largest clients maintain, and a network that size is unlikely to show zero alerts at any given time. But the client’s point is still valid. Properly configured alerts are essential, and poorly configured alerts can be worse than having none at all.
Here are 5 ways to make sure your alerts are doing what you’re paying for them to do:
1. Inventory your devices (get help if necessary) to make sure you’ve got visibility where you need it — and only where you need it.
Before addressing alert thresholds, establish whether you even need to be monitoring a given device or system.
As I mentioned in a previous post about preparing your network for a monitoring system installation, you may be able to use an “auto discovery” tool to inventory the systems and devices.
We’ve had clients tell us the auto discovery process helped them identify some key devices they hadn’t realized they could monitor so easily. Now, they’re finding it very useful to have visibility into these devices.
For other clients, it works better to inventory the network manually, often with help from one of our SEs. A manual inventory helps you identify devices you probably don’t need to monitor. You know, like that guest VM that’s been “critical” for 250 days, and probably hasn’t actually existed for 249 of them?
Keep in mind that if you’re using an auto discovery feature with pre-set alert thresholds, you may need to adjust the thresholds right away.
For thresholds you don’t adjust immediately, set aside time to review them after they’ve been running a couple of weeks or so. Make sure you’ve got the visibility you think you have.
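One way to run that review is to list every alert that has been stuck in a non-OK state for longer than your review window. Here is a minimal sketch in Python. The record fields (`device`, `state`, `since`) and the 14-day window are assumptions for illustration, not the export format of any particular monitoring product.

```python
from datetime import datetime, timedelta

def stale_alerts(alerts, now, max_age_days=14):
    """Return alerts that have stayed in a non-OK state longer than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [a for a in alerts if a["state"] != "ok" and a["since"] < cutoff]

# Hypothetical sample data; a real list would come from your monitoring system.
now = datetime(2024, 6, 1)
alerts = [
    {"device": "guest-vm-07", "state": "critical", "since": now - timedelta(days=250)},
    {"device": "core-sw-01", "state": "warning", "since": now - timedelta(days=2)},
    {"device": "db-01", "state": "ok", "since": now - timedelta(days=90)},
]

for a in stale_alerts(alerts, now):
    print(a["device"], (now - a["since"]).days, "days")  # guest-vm-07 250 days
```

Anything this turns up is a candidate for either a threshold adjustment or removal from monitoring altogether.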
2. Determine who will be directly notified of each alert, and how they’ll be notified.
If you’ve got an alert that generates an action-item communication, be sure only the people who need to act on the alert (perhaps along with a backup and/or direct supervisor) receive the email, text, etc.
As with dashboard alerts, an in-box full of irrelevant notifications simply trains people to ignore all alerts, including those they should be acting upon.
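To make the idea concrete, here is a rough sketch of a routing table that maps each class of device to the one person who acts, a backup, and a delivery channel. Every name in it (the classes, the on-call aliases, the `noc-triage` catch-all) is hypothetical, and the rule that only critical alerts reach the backup is just one assumed policy among many you could choose.

```python
# Hypothetical routing table: device class -> who acts, who backs up, and how.
ROUTES = {
    "database": {"owner": "dba-oncall", "backup": "dba-lead", "channel": "text"},
    "switch": {"owner": "netops-oncall", "backup": "netops-lead", "channel": "email"},
}

def recipients(alert, routes=ROUTES):
    """Return (channel, [people]) for an alert.

    Assumed policy: the owner always gets the alert; the backup is added
    only when the severity is "critical".
    """
    route = routes.get(alert["device_class"])
    if route is None:
        # Assumed catch-all so unmapped device classes are not silently dropped.
        return ("email", ["noc-triage"])
    people = [route["owner"]]
    if alert["severity"] == "critical":
        people.append(route["backup"])
    return (route["channel"], people)

print(recipients({"device_class": "database", "severity": "warning"}))
print(recipients({"device_class": "switch", "severity": "critical"}))
```

The point of the table is that nobody outside it hears about the alert, which keeps in-boxes quiet enough that a notification still means something.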
3. Consider migrating alerts from individual device monitoring tools into one integrated monitoring system.
As the creators of OmniCenter, at Netreo we’re obviously believers in a comprehensive monitoring, alerting, and reporting appliance. I’m not saying, however, that certain devices’ proprietary monitoring tools, or home-grown tools built for particular elements of your infrastructure, aren’t useful.
The point is to keep these different tools from creating noise in the form of alerts that most of your crew don’t fully understand and can’t always respond to properly.
Different tools’ alarm systems may use different protocols to poll their devices. Each is likely to have a different UI. This may be forcing you to unnecessarily create silos within your network management operation.
An integrated monitoring system promotes a more cross-functional, flexible staff. You can avoid service bottlenecks (and the inevitable finger-pointing that goes with them).
A management system that polls every device using SNMP gives you visibility across your infrastructure, including devices you can’t manage with an agent. For example, SNMP can show you utilization or status for components that can’t run an agent of their own, like switch ports or a UPS battery.
Whichever devices you decide require alert configuration, you should be able to get an overview of all of them using a single, coherent UI.
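For switch ports, utilization is typically derived from two successive polls of the interface octet counters (`ifInOctets` or `ifOutOctets` in the standard IF-MIB) divided by the port speed (`ifSpeed`). The sketch below shows only the arithmetic. The function name is mine, the sample values are made up, and fetching the counters over SNMP is left out entirely.

```python
def utilization_percent(prev_octets, curr_octets, interval_s, if_speed_bps, counter_bits=32):
    """Percent utilization for one direction of an interface between two polls.

    prev_octets / curr_octets: octet counter readings from consecutive polls.
    interval_s: seconds between the polls.
    if_speed_bps: interface speed in bits per second.
    counter_bits: 32 for ifInOctets/ifOutOctets, 64 for the high-capacity counters.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    # Modulo arithmetic absorbs a single counter wrap between polls.
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    return delta_octets * 8 * 100 / (interval_s * if_speed_bps)

# 750,000,000 octets in 5 minutes on a 100 Mb/s port.
print(utilization_percent(0, 750_000_000, 300, 100_000_000))  # 20.0
```

On fast links a 32-bit counter can wrap more than once between polls, so the 64-bit counters are usually the safer choice where the device supports them.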
4. Configure alert thresholds to spot significant anomalies, rather than predictable and/or momentary spikes.
As I described in a post about discovering anomalies hiding in virtualized networks, static thresholds, such as for server CPU or memory, can generate misleading alerts.
A classic example is a SQL database server. It might hit 100% CPU four times per day, but you don’t really want to be alerted every time that happens. What you really need to know is when it’s behaving unusually.
Maybe at 10 a.m. on a Friday it’s normally running at 80%, but this Friday at the same time it’s running at 20%. This might indicate a problem with your application or users — but it won’t trigger a static alert. With anomaly detection, you will be alerted so you can find out what’s going on.
Or you might have users who normally never use more than about 20% of the bandwidth on their port, and right now they’re using 70% — still probably not enough to trigger a static “high water” threshold, but definitely something you should look into.
If your monitoring system is only looking for fixed threshold values, you might not be seeing the whole picture. It depends on the device’s customary workload.
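Here is a minimal sketch of the difference, assuming you keep a history of readings for the same weekday and hour and flag anything more than a few standard deviations from that baseline. This is a simple stand-in for illustration, not a description of how OmniCenter or any other product implements anomaly detection, and the sample numbers are invented.

```python
from statistics import mean, stdev

def is_anomalous(history, value, k=3.0, min_samples=4):
    """Flag value if it sits more than k standard deviations from the history mean.

    history: past readings for the same weekday and hour (assumed baseline scheme).
    """
    if len(history) < min_samples:
        return False  # too little data to call anything unusual
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change is unusual
    return abs(value - mu) > k * sigma

STATIC_HIGH_WATER = 90  # hypothetical static threshold, in percent

friday_10am_cpu = [78, 82, 80, 79, 81, 80]  # normally about 80%
print(20 > STATIC_HIGH_WATER)             # False: the static alert stays quiet
print(is_anomalous(friday_10am_cpu, 20))  # True: far below the usual load

port_bandwidth = [18, 20, 19, 21, 20, 22]  # normally about 20%
print(70 > STATIC_HIGH_WATER)             # False
print(is_anomalous(port_bandwidth, 70))   # True
```

Both of the situations described above slip under the static threshold, and both are caught once the comparison is against the device’s own customary workload.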
5. Automate the process of pre-setting thresholds for new devices when possible.
Once you’ve inventoried your system, mapped all the required devices to your monitoring system and customized the thresholds, your alert management work is done. Just kidding. It’s never done, as long as your network is changing and growing, right?
But you can make this work easier going forward.
Configure your network monitoring system to pre-set alert thresholds for specific device classes, based on the parameters you’ve already set.
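In practice this usually amounts to a table of defaults per device class, with any device-specific adjustments layered on top. The sketch below assumes hypothetical class names, metric names, and threshold values. Your monitoring system will have its own way of expressing templates, so treat it as an outline of the logic only.

```python
# Hypothetical per-class defaults, carried over from thresholds you have already tuned.
CLASS_DEFAULTS = {
    "windows_server": {"cpu_pct": 90, "memory_pct": 85, "disk_pct": 90},
    "access_switch": {"port_util_pct": 80},
}

def thresholds_for(device_class, overrides=None, defaults=CLASS_DEFAULTS):
    """Return the class defaults with any device-specific overrides applied.

    An unknown class yields only the overrides, so it starts with no preset thresholds.
    """
    merged = dict(defaults.get(device_class, {}))  # copy so the defaults stay untouched
    merged.update(overrides or {})
    return merged

# A new server gets the class defaults the moment it is added.
print(thresholds_for("windows_server"))
# A busier box can be tuned later without touching the class template.
print(thresholds_for("windows_server", {"cpu_pct": 95}))
```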
Even if a new device doesn’t require exactly the same alert parameters as its device class in general, the pre-set thresholds will probably be close. You’ll at least have some protection if the device’s alerts aren’t re-configured immediately.
And you’ll probably find that most of the time the pre-set thresholds will be right on the money. That’s one of the tell-tale signs that you’re running a pretty tight ship yourself.