Alert noise, false positives, and too few alerts all undermine the effectiveness of any monitoring solution. Inaccurate alerts condition users to draw poor conclusions. Too many alerts allow serious alerts to go undetected. Too many false positives or non-actionable alerts diminish the significance of all alerts over time. And too few alerts can lead to misreading system performance and missing critical problems.
Alerts should be handled strategically to maintain trust in any monitoring solution. Without that trust, stakeholders are bound to miss the benefits alerts provide in the worst possible situations. In this post, we'll discuss how to identify and reduce alert noise and make every alert meaningful with Netreo.
Finding Where to Make Changes
Start by eliminating the completely useless alerts that are filling up inboxes. After clearing out that tier of noise, move on to the next level of your alert hierarchy. Identify any alerts that don’t require immediate action. Once those are removed (or reclassified), only meaningful alerts that require action should remain.
Obviously, this is an oversimplification of the process. Each environment is different and will shape how you make these determinations. But in general, these steps should leave you with alerts that require action at the moment they are sent, as opposed to alerts whose response requires planning or decision making.
Once you have your list of alerting issues that need to be addressed, trace them to the device or groups of devices that are generating the high volume of alerts, so that you can adjust their configurations. Normally, configuration changes are controlled and managed from a Template. You can also make changes directly on an individual device or by “unlinking” the alerting rule from an applied template.
After you find an entry on a device, typically in the Instances or Service tab of the device's Admin page, you can click the icon to its right to navigate to the Template that locks that setting to the device. From there, you can edit the Template entry that corresponds to what is applied on the device.
The first thing to consider is the renotification interval for your alert. By default, this is either 0 (no renotifications) or 1440, which is one day expressed in minutes. Lowering the value makes renotification actions fire more frequently. Intervals can be left at the one-day mark or reduced to every few hours for critical issues.
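To see how the interval plays out, here is a minimal sketch of how many renotification actions an open, unacknowledged incident would generate over a day at different settings. The function and the 24-hour horizon are purely illustrative, not Netreo's internals:

```python
from datetime import datetime, timedelta

def renotification_times(opened_at, interval_minutes, horizon_hours=24):
    """Illustrative only: list the times renotifications would fire for an
    incident that stays open, given an interval in minutes (0 disables them)."""
    if interval_minutes <= 0:
        return []
    times = []
    t = opened_at + timedelta(minutes=interval_minutes)
    end = opened_at + timedelta(hours=horizon_hours)
    while t <= end:
        times.append(t)
        t += timedelta(minutes=interval_minutes)
    return times

opened = datetime(2024, 1, 1, 9, 0)
# Default of 1440 minutes: one renotification per day.
print(len(renotification_times(opened, 1440)))  # 1
# Every 4 hours (240 minutes) for critical issues: 6 per day.
print(len(renotification_times(opened, 240)))   # 6
```

The trade-off is visible in the counts: shortening the interval multiplies the alert volume for every incident that stays open, which is exactly why overly short intervals are a common source of noise.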
Since Netreo polls and records a statistic’s value every 5 minutes, selecting a TIME PERIOD of 5 minutes when configuring a static Threshold Check means that it would only take one poll that exceeded the warning or critical threshold values to trigger a change in state. However, selecting a period of 15 minutes would require three consecutive polls (with an average value exceeding the warning or critical threshold values) to trigger a state change. This field is an important adjustment for reducing false alarms.
Divide the TIME PERIOD value configured in the check by 5 to figure out the number of recent samples that will be averaged before being compared to the threshold values. For a storage drive that has a background process to clean up, it may briefly exceed the threshold while the cleanup is running. In that situation, extending the TIME PERIOD can help ensure that any Incidents created are due to an actual, actionable event and not something that will be handled automatically through other means.
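The arithmetic above can be sketched as follows. This is an illustrative model of a static threshold check and a brief disk-cleanup spike, not Netreo's actual implementation:

```python
def polls_averaged(time_period_minutes, poll_interval_minutes=5):
    """Number of recent polls averaged before comparing to the thresholds."""
    return max(1, time_period_minutes // poll_interval_minutes)

def check_state(samples, warning, critical):
    """Compare the average of the most recent samples to static thresholds."""
    avg = sum(samples) / len(samples)
    if avg >= critical:
        return "critical"
    if avg >= warning:
        return "warning"
    return "ok"

# A brief spike from a background cleanup job: one poll at 96% disk usage.
recent = [70, 72, 96]            # last three 5-minute polls (TIME PERIOD = 15)
print(polls_averaged(15))        # 3
print(check_state(recent, warning=85, critical=95))       # averaged -> "ok"
print(check_state(recent[-1:], warning=85, critical=95))  # 5-min period -> "critical"
```

With a 15-minute TIME PERIOD the transient spike is averaged away and no Incident is created, while a 5-minute period would have fired on a problem the cleanup job was already handling.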
Strategic Alert Parameters
Identifying the sources of noise can be a recursive process: as you bring your overall noise level down to a new baseline, new items stand out as noise. That is why you need a strategy for how you go about refining alert parameters to eliminate it.
Every alert strategy needs to effectively and efficiently identify what makes an alert useful. Without actionable parameters, efforts to curb alert noise will most certainly fail. However, every implementation is different, and different devices within the same infrastructure require different parameters. Therefore, your efforts at this juncture are necessary to ensure all your alerts are meaningful. Finally, you need to know how to implement your strategy within Netreo to reap maximum benefits.
The Right Metrics for the Right Devices
Monitoring the right metrics for different devices is always important. By default, a Bandwidth Threshold on all devices alerts you when a device exceeds a percentage of its maximum usage. This is often helpful for finding bottlenecks or misconfigurations in your network, but it isn't valid in every situation. A common problem with the Bandwidth Threshold is that backup devices and similar applications will often use the entire bandwidth of a network interface for an extended time. For these situations, disable the threshold in a lower-level template applied only to your backup devices.
Alert Response Planning
Within reason, you should only alert on actionable problems for which a response is prepared. Let's look at an example of monitoring Disk Usage in your environment. If a critical application is known to fill its disks over time, it might make sense to alert a team. For the drive where this application's data is stored, you'd set an alert to fire when remaining capacity drops below a threshold, leaving enough headroom before a service outage occurs. The immediate response plan might be to use reserve disk space kept on hand while a longer-term plan is put together.
In most environments, you wouldn't extend this alerting to every drive that crosses the same threshold. It's unlikely you'd have reserve disk space and a plan in place to expand some obscure disk that may be managed by a completely different team. Similarly, the same alert isn't needed for a drive that is expected to fill and then be cleared without harming services.
These latter situations suggest using Report functionality, instead of Alerts. When information is used to get an overview of resource usage or meant for scheduled planning meetings, then a scheduled report is more appropriate. More on Reports in a bit.
The Right Tool for the Job
Consideration should also be given to the monitoring vector where you add your alerting. Say you simply need to verify that a backup ran successfully. You could write a custom PowerShell or SSH Service Check for that. Or perhaps setting an alert on network errors for the interfaces used by the backup is enough.
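As a sketch of what such a custom check might look like, here is a hypothetical script that treats the backup as healthy if a success marker file is recent enough. The marker path, the age limit and the 0=OK / 2=CRITICAL exit-code convention are all assumptions for illustration, not Netreo specifics:

```python
import os
import time

# Hypothetical marker file your backup job touches on success; adjust to taste.
BACKUP_MARKER = "/var/backups/last_success"
MAX_AGE_HOURS = 26  # daily backup plus a little slack

def backup_status(marker_path, max_age_hours, now=None):
    """Return (exit_code, message) using the common 0=OK / 2=CRITICAL convention."""
    now = time.time() if now is None else now
    if not os.path.exists(marker_path):
        return 2, "CRITICAL: no successful backup recorded"
    age_hours = (now - os.path.getmtime(marker_path)) / 3600
    if age_hours > max_age_hours:
        return 2, f"CRITICAL: last backup {age_hours:.1f}h ago"
    return 0, f"OK: last backup {age_hours:.1f}h ago"

code, message = backup_status(BACKUP_MARKER, MAX_AGE_HOURS)
print(message)  # in a real check script, you would exit with `code`
```

A check like this answers the question you actually care about ("did the backup run?") rather than a proxy like interface traffic, which is the point of choosing the monitoring vector deliberately.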
For tunnels or other connections, you also have choices. In some cases, operations appear fine when the state of the interface remains up, while the important part of the connection fails. Using a Service Check that pings across that connection and reports any failures or packet loss would be more helpful. The key to this part of the cleanup is to ensure that you are looking for a reliable “signal” for the problem you want to monitor. Relying on the default metrics for most devices may not suffice.
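To illustrate the idea of a reliable signal, here is a small sketch that classifies a ping-based check from packet counts. The loss thresholds are examples only, not Netreo defaults:

```python
def connection_state(sent, received, warn_loss_pct=10, crit_loss_pct=50):
    """Classify a ping-based check from packet counts (example thresholds)."""
    if sent == 0:
        return "unknown"
    loss = 100 * (sent - received) / sent
    if loss >= crit_loss_pct:
        return "critical"
    if loss >= warn_loss_pct:
        return "warning"
    return "ok"

# The interface can report "up" while pings across the tunnel tell the real story:
print(connection_state(sent=10, received=10))  # "ok"
print(connection_state(sent=10, received=8))   # 20% loss -> "warning"
print(connection_state(sent=10, received=0))   # 100% loss -> "critical"
```

Note the last case: interface state alone would show nothing wrong, while the end-to-end signal flags a total failure of the connection.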
As noted above, alerts should notify the right team before critical failures impact users, and only where action plans are in place. Netreo excels in this area by routing the right alerts to the right teams with actionable insights for quick remediation. What's more, Netreo's AIOps: Autopilot feature continues to learn how your action plans are working, proactively modifies settings and can even fix problems automatically.
Reports, on the other hand, are extremely helpful for initially reducing alert noise, refining alert parameters and maintaining alert effectiveness.
Leveraging Reports to Reduce Alert Noise
The first Netreo Report I would pull up to gauge the overall alert noise in your environment is the Detailed Notification Report, found under Reports > Alerts > Notifications > Detailed. You don't have to select any groupings; simply leave it open to view all notifications for all devices and set the time range to the last 24 hours.
This report shows one entry for each email, webhook, SMS text or other action generated by an alert, with separate entries for the Critical, Recovery and Acknowledgement alerts that go out. The table at the top is a useful benchmark for checking whether your changes are reducing alerts compared to what you saw in the last week or month.
There are two other tricks with this report that make it easy to find alerts worth fixing. One is to sort by Incident ID and look for IDs that are overrepresented. If one incident has generated many more alerts than the average incident, its renotification interval may be too low. Another is to filter for a specific contact or action after generating the report. Using the box in the upper right, you can set multiple filters on multiple columns to fine-tune your search. When you find excessive alerts, make a note to review the configuration when we get to that part of the process.
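The same outlier hunt can be done on an exported copy of the report. This sketch uses hypothetical incident IDs and a simple double-the-average cutoff, purely for illustration:

```python
from collections import Counter

# Hypothetical rows exported from a notification report: (incident_id, action).
notifications = [
    ("INC-101", "email"), ("INC-101", "email"), ("INC-101", "email"),
    ("INC-101", "email"), ("INC-101", "email"), ("INC-101", "email"),
    ("INC-102", "email"), ("INC-103", "sms"),
]

# Count notifications per incident and flag anything well above the average.
counts = Counter(incident for incident, _ in notifications)
average = sum(counts.values()) / len(counts)
noisy = [inc for inc, n in counts.items() if n > 2 * average]
print(noisy)  # ['INC-101'] -- a candidate for a longer renotification interval
```

Sorting and filtering in the report UI gets you the same answer interactively; the script form is just handy when you want to track the noisy list over time.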
Problem and Incident Management Reports
Problem and Incident Management Reports are also very helpful. These recent additions are only available to Ultimate tier license users, and both can export data as CSV files. The Problem Management Report is perfectly designed for finding alert noise in your environment. Most options are prepopulated, and all options are sortable. Simply set the Groupings option to “All” and run the report to see the most common, recurring incidents in your environment. Evaluating the outliers in this report is a quick way to find places that could use configuration changes.
The Incident Management Report includes an Incidents list, which provides another great way to identify alert noise. In your Active Incidents page, look under Quick Views. If these Incidents are all identified as Acknowledged, they will not be sending out alerts. However, the process for how these incidents are acknowledged and handled may still be contributing to alert noise. Identifying old incidents to clean up, as well as incidents that repeatedly sit in the Open state for extended periods while sending out alerts, is an important step toward ensuring alerts are meaningful.
Baseline Templates Report
One last place to look for problems with alerting is the Baseline Templates Report – found on the Administration > Templates page. Service Checks and thresholds that have gone critical will appear in this report. You can drill down to find problem devices that are contributing to active alarms. Armed with this information, you can determine if the alerting is legitimate or a source of noise that should be removed with configuration adjustments.
Alert noise can be a problem with any monitoring solution, so ensuring your system generates only meaningful alerts is paramount to success. But every environment is different, so our Netreo platform is designed with the flexibility and intelligence to help you maximize your efforts. By combining powerful infrastructure visibility with customizable Templates, Service Checks, Reports and more, you can ensure that what you monitor, and how you monitor it, produces meaningful alerts without the noise. Be sure to check out Netreo Documentation for more on tuning your Netreo solution for your environment.