We’re not going to sugarcoat it – network monitoring is a drag. The most common pain points we hear customers talking about are the lack of a single view into all of IT, over-alerting, and the admin overhead required to keep the whole thing working. In this blog, we’re going to talk about how to alleviate what is usually the most painful of these: the problems and fatigue caused by over-alerting and drowning your engineers and techs in a sea of hard-to-understand data, which we call the ‘data deluge.’
The Monitoring Noise or “Data Deluge” Challenge
The issue is a “deluge” of information from multiple platforms. It generates so many alerts that employees are too busy reacting to problems to prevent them, and it is made worse by the gaps left when the right people are never notified at all. To deal with these issues, we need to start by finding their causes.
The deluge of information from multiple platforms is caused by a lack of a defined notification plan and by over-configured monitoring and alerts. Try creating one coherent system confined to a single platform that only alerts you about problems that you can or will fix. Reactive alarming, static alerts, and a lack of automation and long-term visibility leave potential issues unreported and unresolved until they become major issues. Failure to notify results in missed issues, which can be compounded by spotty monitoring policies that let issues slip by. Spotty policies, in turn, usually come from a lack of communication between dev and prod, which keeps monitoring systems from keeping up with configuration changes.
Creating a New Monitoring Policy
The first step to getting your information under control is to create a new monitoring policy. This is done by laying the groundwork for your configuration, defining an action and escalation plan, and scheduling a timeline to update that plan regularly. Define your priorities based on the number of users impacted and the severity of the impact. Remember, if everything is critical then nothing is critical. Follow this up by making an action plan that is as specific as possible. You will need to define: who to notify, when, and how; when to escalate, and to whom; when changes are to be made; and what you can and cannot automate.
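To make “priority based on users impacted and severity” concrete, here is a minimal sketch of one way to encode it. The `Impact` type, the severity scale, and the cut-off numbers are all assumptions made up for illustration; they are not taken from any product and should be tuned to your own environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    users_affected: int
    severity: int  # assumed scale: 1 = cosmetic, 2 = degraded, 3 = outage


def classify_priority(impact: Impact) -> str:
    """Map user count and severity to a priority label (hypothetical cut-offs)."""
    if impact.severity >= 3 and impact.users_affected >= 100:
        return "critical"
    if impact.severity >= 2 and impact.users_affected >= 10:
        return "high"
    if impact.severity >= 2 or impact.users_affected >= 10:
        return "medium"
    return "low"
```

Keeping the rules this explicit makes it easy to see that only a small slice of incidents can ever be labeled critical.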
Defining an Action Plan
The first priority is to define an action plan that can be relied on for detailed instructions during a crisis. You need to define, at a minimum, exactly who should be notified during an incident and exactly what methods should be used, both for initial notifications, as well as defining a timeline for escalation. How long should an outage persist before someone in management is notified? That will also depend heavily on the scope of the outage and priority, so make sure you outline clearly exactly what will be communicated to them. Remember that badly worded notifications are counterproductive – they can cause people to panic and take incorrect actions.
Also, remember to factor in the time of day and day of the week. Some enterprises are 24/7/365 operations, but many aren’t, or at least not for all their resources. When a system goes down after hours but won’t be needed until the next day, what is the process? When is someone available to work on the problem? A different response to an outage during off-hours is reasonable and appropriate in many cases, and your monitoring plan should reflect this.
Lastly, don’t forget to see what you can automate. Automating the initial response to incidents, such as rebooting servers, spinning up new cloud resources, or restarting services in a cluster, can drastically reduce your MTTR and often resolves the issue before you have to intervene manually.
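An escalation timeline that respects business hours can be written down as data plus a small lookup. The sketch below is only an illustration: the roles, the 15 and 60 minute steps, and the 08:00 to 18:00 weekday window are assumed values, and `who_to_notify` is a hypothetical helper rather than part of any monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical escalation ladder: (minutes since the alert opened, who to notify).
ESCALATION_STEPS = [
    (0, "on-call engineer"),
    (15, "team lead"),
    (60, "IT manager"),
]

BUSINESS_HOURS = range(8, 18)  # 08:00-17:59, assumed local time


def in_business_hours(ts: datetime) -> bool:
    # Monday is 0, so weekday() < 5 means Monday through Friday.
    return ts.weekday() < 5 and ts.hour in BUSINESS_HOURS


def who_to_notify(opened: datetime, now: datetime, always_on: bool) -> list[str]:
    """Return everyone who should have been notified by `now`."""
    if not always_on and not in_business_hours(now):
        return []  # resource is not needed off-hours: hold the alert
    elapsed = (now - opened) / timedelta(minutes=1)
    return [who for minutes, who in ESCALATION_STEPS if elapsed >= minutes]
```

A resource flagged `always_on=True` escalates around the clock, while everything else waits until someone is actually available to work on it.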
Managing Your Monitoring Notifications
During a mass outage, you need to be able to reliably communicate what is going on to all the teams and stakeholders. Your monitoring plan should address exactly HOW you are going to get notifications to people, especially factoring in the difficulties inherent in a mass outage. Email and SMS are usually the foundation that most notifications are based on, but those may be impacted during a crisis, or simply saturated with alerts and notifications, if we haven’t fully implemented the techniques we’ve talked about to minimize them.
Consider some alternative methods. For example, mobile app push notifications take a path that is entirely separate from email, and these can even be initiated from the cloud, so the state of your internal networks or systems won’t affect them. You can also use API integration into team communication platforms like Slack, Spark, Bitrix, or others to make sure all the members of the team are aware of what’s being done, when, and by whom.
Your plan should also leverage the existing operational systems you have. This can help to eliminate redundant alarm sources and consolidate incidents into a manageable number of alerts. Integrating with notification managers can simplify on-call schedules and provide a way to distribute alerts that isn’t dependent on internal mail infrastructure, for example. Many organizations will also be using a ticketing or ITSM system like ServiceNow, and you’ll want to make sure your plan includes how these tools should be used during a crisis and how to integrate them with your monitoring and alerting systems.
Reducing Alert Volume
Alert volume can also be reduced by automating your responses to certain types of alerts, such as by automatically restarting services, servers, and ports, while still allowing you to execute commands manually. When responses are automated, maintenance windows become critical, so that planned work doesn’t trigger unwanted actions. Another way to reduce the number of alerts is to group related problems into a single notification, which is especially useful for cascade failures and infrastructure dependency alerts, e.g., latency. Of course, the easiest way to reduce alert volume is to stay on top of your process by acknowledging alerts as they are handled, using maintenance windows for planned outages, and integrating with existing processes, such as ticketing and cc. Don’t bother to alert on non-actionable items.
When dealing with email alerts and reports, use distribution lists to keep in sync with personnel changes, and create an archive for historical reports. Use the archive of historical reports to identify “noisy birds”, which makes it easier to find and tune out false alarms. Template baseline reports and notification summary reports can help you track them down. Let’s look at a few examples of alerts to see what’s really actionable.
If you get a high CPU utilization alarm, is that actionable? Well, it might be if the CPU has been stuck there for a while. Netreo allows you to set a delay for these types of alerts, so it won’t generate an alarm until it’s been over 90% for some length of time. If your SQL server pegs the CPU, that’s probably normal, and you don’t want that alarm, but if it’s been pegged for 30 minutes you might have a job you need to go kill.
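The idea of delaying an alarm until a condition has persisted can be sketched in a few lines. This is a generic illustration of the technique, not Netreo’s implementation; the class name `SustainedThreshold` and its parameters are assumptions.

```python
from datetime import datetime, timedelta


class SustainedThreshold:
    """Alarm only after a metric has stayed above a limit for `hold` time."""

    def __init__(self, limit: float, hold: timedelta):
        self.limit = limit
        self.hold = hold
        self._breach_started = None  # when the current breach began

    def observe(self, value: float, ts: datetime) -> bool:
        if value <= self.limit:
            self._breach_started = None  # breach ended, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = ts
        return ts - self._breach_started >= self.hold
```

With `SustainedThreshold(90, timedelta(minutes=30))`, a SQL server that briefly pegs the CPU stays quiet, and the alarm fires only once the CPU has been above 90% for a full 30 minutes.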
Optimizing Your Alerts
Netreo also builds in tools to help you optimize your alerts, which will allow you to identify the ‘noisy birds’: the few configured alerts that are responsible for a disproportionate number of alarms, or those where an adjustment to the sensitivity of the alarm might be required. You can also use template baseline reports to see exactly which alerts are being generated by your configuration templates and make changes to them directly that can then be automatically rolled out to all of the devices where that template is applied. Also, things like the notification summary reports are helpful to see exactly what alarms are going out and get a count of how many alarms are coming from each device, or from each type of service check or threshold alarm.
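If you only have a raw export of alert history, you can approximate a ‘noisy birds’ report yourself. The sketch below assumes each alert has already been reduced to a source label such as `"db01:cpu"` (a made-up format), and the `noisy_birds` function is a hypothetical helper, not a Netreo feature.

```python
from collections import Counter


def noisy_birds(alert_sources, share=0.5):
    """Return the top sources that together produce at least `share` of all alerts."""
    counts = Counter(alert_sources)
    total = sum(counts.values())
    noisy, covered = [], 0
    for source, count in counts.most_common():
        if covered >= share * total:
            break
        noisy.append((source, count))
        covered += count
    return noisy
```

If a single check accounts for more than half of your alarms, that check is the first candidate for a sensitivity adjustment.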
If a switch port goes down, is that actionable? If it’s plugged into something important, generally yes, so this is a good candidate for an immediate alarm. But what if it’s part of a port-channel or it’s just one server in a large cluster? That might warrant a different priority or a different method of notification that doesn’t require a ‘drop everything and fix this’ response.
What about if the website goes down? Well, that’s generally a big yes. But you want to make sure it’s down from outside your environment as well as inside so you can properly orchestrate the response. And it’s important to know if it’s just the web page returning an error, or the back end going down, or something else – so using things like application checks launched from multiple remote locations can help you assess the issue before you even start troubleshooting.
Using Email Distribution Lists
One of the best practice recommendations we often give customers is to use distribution lists, not individual email addresses, for your email alerts and automated reports. There are a few good reasons to do this, but one of the key ones is to keep in sync with personnel changes. There have been more than a few incidents where a problem was detected by the monitoring system, but the alert went out to an abandoned or invalid mailbox, and this caused a significant delay in detecting and acting on the problem. Distribution lists also allow you to create an easy historical archive of the automated reports and share them in a central location, so if you need to go back to the weekly performance reports from two years ago in September, they’re easy to find, even if the detailed data has been archived by now.
One of the best ways to reduce the number of alerts coming in is to leverage your tools to take advantage of automatic alert suppression technologies like root cause detection. By configuring a topology into your tools (or letting the tool discover it), the tool can determine that the reason all the remote devices at that location suddenly went unreachable is because of a failure on the WAN router, and instead of sending out notifications for every remote device separately – every switch, server, application, and wireless AP – you can instead get a single alert that tells you right away the problem is the WAN router, for example.
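Root cause detection over a topology can be illustrated with a simple parent map. This is a minimal sketch of the general idea, assuming a tree-shaped topology with one upstream device per node; the device names and the `root_causes` function are hypothetical.

```python
def root_causes(down, parent):
    """Keep only down devices whose upstream path has no other down device.

    `down` is a set of unreachable device names; `parent` maps each device
    to its upstream device (None for the top of the tree).
    """
    roots = set()
    for device in down:
        upstream = parent.get(device)
        while upstream is not None and upstream not in down:
            upstream = parent.get(upstream)
        if upstream is None:
            roots.add(device)  # nothing upstream is down: this is a root cause
    return roots
```

For a site where the WAN router, a switch, a server, and a wireless AP all go unreachable at once, this returns only the WAN router, so one notification goes out instead of four.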
Automate Your Monitoring and Alert Response
Also, a great way to get control of the number of alerts being sent is to integrate automation. Automation comes in a few flavors, and the first place to look at is automating the response to alerts so we can eliminate the need to send a notification at all, in many cases. You can link in vendor APIs like webhooks so your monitoring system can restart ports, retest applications, or even dump real-time diagnostics in response to an issue.
Netreo allows you to control those commands manually through operator intervention, so if you want to add a ‘click to restart server’ function directly into the web interface of the monitoring system, and limit access to just administrators, there’s an easy way to do so. However, keep in mind that if you’re automating responses, maintenance windows become critical. Otherwise, a scheduled software upgrade may not go as you expect, as your monitoring system starts taking action in the background. Nothing is more frustrating than shutting down your application for a planned upgrade and having the server suddenly reboot.
If you use multiple platforms to receive notifications, make sure that the priority of the issue affects the method of the messaging so that urgent notifications don’t get buried in inboxes in several different apps. Some common methods of notification include email, SMS, mobile push notifications, Slack, and Spark.
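The interaction between automated responses and maintenance windows boils down to one check before any action runs. The following is a sketch under assumed data shapes (a dict of device names to lists of start and end times); `respond` and `in_maintenance` are illustrative names, not an actual API.

```python
from datetime import datetime
from typing import Callable


def in_maintenance(device: str, now: datetime, windows: dict) -> bool:
    """True if `now` falls inside any (start, end) window for the device."""
    return any(start <= now < end for start, end in windows.get(device, []))


def respond(device: str, now: datetime, windows: dict,
            action: Callable[[str], None]) -> str:
    """Run the automated action unless the device is under maintenance."""
    if in_maintenance(device, now, windows):
        return "suppressed"
    action(device)
    return "executed"
```

Here `action` stands in for whatever your automation does, such as a restart call, so the reboot-during-upgrade scenario is avoided by returning before it is ever invoked.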
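Tying the delivery method to priority can be as simple as a lookup table. The channel names and the mapping below are assumptions chosen for illustration; your own plan decides which channels count as urgent.

```python
# Hypothetical mapping of priority to delivery channels.
CHANNELS_BY_PRIORITY = {
    "critical": ["push", "sms"],
    "high": ["push", "chat"],
    "medium": ["chat"],
    "low": ["email"],
}


def route(priority: str) -> list[str]:
    """Pick channels for an alert; unknown priorities fall back to email."""
    return CHANNELS_BY_PRIORITY.get(priority, ["email"])
```

Because low-priority alerts never reach the push or SMS channels, anything that does arrive there is worth reading immediately.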
To manage your notifications, centralize alerting if possible to avoid duplicate reports and conflicting statuses. The most important parts of centralization are having a single point of acknowledgment, a single point of communication, and a single place for teams to check status. You can integrate centralization with existing systems via API, notification managers such as OpsGenie and PagerDuty, and ticketing systems such as ServiceNow and Remedy.
Leverage Automated Configuration
Let’s talk about another way to automate the monitoring process, and that’s automating the monitoring system’s configuration. It’s key to make sure that your centralized visibility stays up to date while your environment constantly changes. Netreo has easy-to-use web-based APIs that allow you to link your provisioning process directly into monitoring, or automatically discover resources via API, and then drive that configuration forward using rules to define the category, strategic groups, and templates that control every aspect of monitoring configuration with minimal overhead. You can even link your deployments with automatic maintenance windows via API. Less time spent putting new systems into monitoring (or taking them out) also means more time to focus on the tasks that move the business forward. You should also make sure you’ve linked your monitoring system into your cloud resources, so those can automatically be discovered and added to monitoring as well.
Netreo uses a flexible auto-configuration system to automate your configuration. Autoconfiguration comes with a set of rules to help you get started, and they can easily be customized or created to fit your environment. You can use these to set device attributes, like categories, site, and application groups based on any device criteria – including things like which processes are running, what ports are open, what the device is named, or SNMP values. This makes sure that there’s no manual step in your provisioning process. This way you can easily ensure the devices end up in the correct reports, and that they always get the right settings applied. These are automatically applied to devices as they’re discovered, and can also be re-applied as desired, so if you want to make sure everything STAYS configured the way you want, you can enforce that.
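To show what rule-based attribute assignment looks like in general, here is a small sketch. It is not Netreo’s rule syntax; the device dictionary shape, the rules, and the `auto_configure` function are all assumptions made for illustration.

```python
import fnmatch

# Hypothetical rules: each pairs a predicate over device facts with attributes to set.
RULES = [
    (lambda d: 1433 in d["open_ports"], {"category": "SQL Server"}),
    (lambda d: "nginx" in d["processes"], {"category": "Web Server"}),
    (lambda d: fnmatch.fnmatch(d["name"], "nyc-*"), {"site": "New York"}),
]


def auto_configure(device, rules=RULES):
    """Return attributes from every matching rule; later rules win on conflict."""
    attributes = {}
    for matches, assigned in rules:
        if matches(device):
            attributes.update(assigned)
    return attributes
```

Because the function is pure, it can be re-run against every device on a schedule to enforce that attributes stay the way the rules say they should.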
Do not rely on manual processes to manage configuration. Automated discovery should NOT be limited to scanning; make sure to use deep links into APIs as well. Monitoring can be linked into provisioning through vCenter, Puppet/Chef, Hyper-V, and hyper-converged platforms. Autodiscover the cloud resources and systems that you use, including Azure, AWS, and GCP. Make use of anomaly-based thresholds to detect and alert on unusual behavior and to automatically generate baselines, but carefully consider your sensitivity.
Automating with Cascading Templates
Using those criteria we set with the AutoConfig engine, Netreo can dynamically apply all the relevant templates to your devices. Multiple templates will be automatically applied, and the settings from each of them will intelligently roll down onto the devices. This way, a new SQL server coming online would not only get the basic Windows server settings you want but would also get the SQL-specific application checks and settings. You can align your templates with your monitoring plan to make sure they set the appropriate priorities and timings, and have them define the correct authentication, escalation, and active response automated actions. It’s designed to be flexible enough to meet the needs of the global enterprise customer while still being simple enough for a small IT department to use without a lot of training or dedicated personnel.
When using template-based configuration: set all your alerts, thresholds, and services; create templates by priority, application, and platform; and define escalation, actions, and automation. Use rule-based automation to apply templates. Rules can be auto-applied to discovered devices, which makes sure you never miss a configuration step, and they can be re-applied as desired.
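The way settings from several templates roll down onto one device can be pictured as an ordered merge. This sketch assumes a simple template shape (a list of checks and a dict of thresholds) and a general-to-specific ordering; both are illustrative choices, not the product’s actual data model.

```python
def apply_templates(templates):
    """Merge templates in order; more specific templates override earlier ones."""
    effective = {"checks": [], "thresholds": {}}
    for template in templates:
        for check in template.get("checks", []):
            if check not in effective["checks"]:
                effective["checks"].append(check)
        effective["thresholds"].update(template.get("thresholds", {}))
    return effective


# Hypothetical templates, ordered from general to specific.
windows_base = {"checks": ["ping", "disk"], "thresholds": {"cpu_pct": 90}}
sql_server = {"checks": ["sql_query"], "thresholds": {"cpu_pct": 95}}
```

Calling `apply_templates([windows_base, sql_server])` gives a new SQL server the base Windows checks plus the SQL-specific check, with the SQL template’s CPU threshold taking precedence.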
Use Anomaly Detection
Detecting unusual behavior, instead of just relying on static alarm settings, is a key way to get proactive with your monitoring and find issues before they impact users. With Netreo, you can easily use anomaly detection to find changes in application behavior, and it can be applied almost anywhere – CPU, memory use, running processes, even log messages. If your application goes from 10 login failures an hour to 1000, that last deployment may not have gone as smoothly as you expected, and now you have a starting place to troubleshoot.
Netreo will automatically generate a baseline behavior model using the large volume of historical data we retain, automatically adapting the baseline as your environment changes and evolves. Anomalies can be detected based on changes in baseline behavior looking at the time of day, day of the week, or even hour by hour. This allows you to find unexpected impacts, like a software change causing unusual behavior in the CPU on a back-end SQL server.
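A time-aware baseline can be approximated by grouping history into weekday and hour slots and comparing new values against each slot’s mean and standard deviation. This is a deliberately simple sketch of the concept, not the model Netreo uses; the function names, the sample data, and the default `sensitivity` of 3 standard deviations are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and summarize each slot."""
    slots = defaultdict(list)
    for ts, value in samples:
        slots[(ts.weekday(), ts.hour)].append(value)
    # stdev needs at least two points, so skip slots with less history.
    return {slot: (mean(vals), stdev(vals)) for slot, vals in slots.items() if len(vals) >= 2}


def is_anomaly(baseline, ts, value, sensitivity=3.0):
    """Flag values more than `sensitivity` standard deviations from the slot mean."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # not enough history for this time slot
    avg, spread = baseline[slot]
    return abs(value - avg) > sensitivity * max(spread, 1e-9)


# Hypothetical history: five Wednesdays at 10:00 with CPU between 50% and 60%.
first = datetime(2024, 1, 3, 10, 0)
history = [(first + timedelta(weeks=i), cpu) for i, cpu in enumerate([50, 55, 60, 52, 58])]
baseline = build_baseline(history)
```

Note that the check is two-sided, so an unusually low reading is flagged just like an unusually high one, and lowering `sensitivity` trades fewer missed anomalies for more false alarms.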
One customer of ours discovered an issue where normally at 10 am on a Wednesday the database server runs at 50-60%, but suddenly it was running at 15%. It turns out that a software change to the user interface broke the CSS on a form deep within the application, where the front-end team didn’t notice it, and customers weren’t able to complete orders properly. That sort of anomaly is the kind of unusual behavior that would never trigger a static threshold, but in this case, revealed a problem long before they noticed the sharp drop in orders and the customer complaints that would have resulted.
Managing the Process
A key point of having a successful monitoring plan is to make sure it’s being used and to stay on top of the monitoring system. Make sure your techs and engineers are acknowledging problems as they are dealt with. Don’t let the monitoring screens get flooded with alerts. A monitoring screen that’s flooded with alerts makes it very easy to miss a critical alert. Alerts need to be acted upon and acknowledged or corrected as soon as possible. It is not acceptable to let alerts sit in the queue until the support teams’ regular work hours. If the condition is acceptable during off-hours, then the alert should only be generated when it is not acceptable and when someone is available to work on the problem. Alerts that don’t get addressed within a specific timeframe should get escalated.
Also, using maintenance windows for your planned outages will not only help reduce the volume of alerts, but it will also normalize your uptime reports, so systems with no unplanned outages in a given month can show 100%, without manually adjusting your reports or writing explanations to roll them up to senior management. Finally, don’t ignore alerts. Ignoring alerts is a sure way of getting into trouble. At some point, a mistake will be made and someone will ignore the wrong alert. Avoid the problem altogether by making sure non-actionable alerts are not even seen. You can still report on adverse or unusual conditions after the fact for proactive investigation.
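The effect of maintenance windows on an uptime report is easy to see with a little arithmetic. The sketch below assumes times are expressed as minute offsets within the reporting period and that maintenance windows do not overlap each other; `uptime_percent` is a hypothetical helper for illustration.

```python
def uptime_percent(period_minutes, outages, maintenance):
    """Uptime over a period, ignoring outage minutes inside maintenance windows.

    `outages` and `maintenance` are lists of (start_minute, end_minute) pairs.
    """
    unplanned = 0
    for out_start, out_end in outages:
        # Minutes of this outage that overlap a planned maintenance window.
        planned = sum(
            max(0, min(out_end, m_end) - max(out_start, m_start))
            for m_start, m_end in maintenance
        )
        unplanned += (out_end - out_start) - planned
    return 100.0 * (period_minutes - unplanned) / period_minutes
```

A two-hour outage in a 30-day month drops the figure to roughly 99.72%, while the same outage fully covered by a maintenance window leaves it at 100%.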