Let’s be honest, alert fatigue is a real thing and anyone telling you otherwise is flat-out lying. If you have tools generating tens of thousands of daily alerts, eventually people will burn out and simply start ignoring alerts. Even if you have enough team members to divvy up alert reviews, the approach only works for a while.
Trouble is, any alerting setup generates false positives, and people will eventually start ignoring whatever looks like one. But how do you really know which alerts are truly false positives? The answer is, of course, that you don’t. You can’t know which alerts are real and which are not without knowing the content and context of each alert.
There are lots of vendors out there selling tools that they claim will eliminate false positive alerts. These vendors throw around fancy terms – machine learning, AIOps, analytics, etc. – as their approach to dealing with false alerts. And, I’m not going to lie. Some tools work a lot of the time and help you reduce the number of alerts significantly.
But how do you really know that the tools are only filtering out the false positives (or low-level alerts, etc.)? Being brutally honest here again, you can’t. Sure, there are studies, analyses and other ways to mathematically or scientifically prove that the algorithms work as indicated. But when it comes to the real world, all those things go right out the window.
To Boldly Go …
I’m reminded of a (very) old Star Trek book titled “Enterprise: The First Adventure,” published in 1986 (yeah, I’m old). There’s a scene where Spock is playing chess against himself, since no one else plays at his level. Kirk walks up, looks at the board, and says “White to checkmate in three” before walking away. Spock is shocked and, after studying the board for a while, is unable to determine how Kirk arrived at that move. Spock asks Kirk to explain, leading to the following exchange:
Jim moved his queen’s knight.
Spock regarded the chessboard. One black eyebrow tilted to a steeper slant. He stared at the positions as if he had shifted into computer mode, as if he were calculating the effects of every possible move of every piece on the board. Jim had seen the opening in a flash of insight. Now, abruptly doubtful, he searched the board for some overlooked move, some schoolchild error.
Spock reached out. Jim forced himself to stay as collected as any Vulcan while he waited for Mr. Spock to make a move that Jim’s intuition had not taken into account.
Spock tipped his king and let it settle back onto its squat base.
“I resign,” Spock said.
Jim wondered if he saw the barest hint of a frown, the barest suggestion of confusion, in the Vulcan’s expression.
“Your move,” Spock said, “risked your queen and your knights. It was … illogical.”
“But effective,” Jim said.
“Indeed,” Spock said softly. “What mode of calculation do you use? Sinhawk, perhaps? Or a method of your own devising?”
“One of my own devising, you might say. I didn’t calculate it, Spock. I saw it. Call it intuition, if you like. Or good luck.”
I think this exchange is quite appropriate when discussing things like alert fatigue. Vendors use all sorts of advanced technologies and algorithms to determine which events are important and which are not, identify false positives, false negatives and things like that. But, even the most advanced algorithms currently available are not perfect and will make mistakes. Sometimes the best approach to identifying which problems are real and which are not is just seeing them.
Back to Alerts on Earth
So, will tools promising to eliminate misidentified alerts and generate a reasonable number of accurate alerts ALSO eliminate real alerts? Is eliminating alert fatigue hopeless? The good news is that avoiding alert fatigue is not hopeless. But, you will need to set realistic expectations of what tools can do and what you want them to do.
At Netreo, we believe in an approach to handling alerts that is very well balanced and far more productive. But the challenge isn’t an easy one. On the one hand, we want to make sure you get all the alerts you need. On the other hand, we don’t want to give you any alerts you don’t need.
For example, when a switch goes down, you will lose connectivity to the switch and everything behind it. Many vendor tools generate an alert for the switch, as well as all the unreachable servers behind that switch. The infrastructure team receives an alert about the switch, AND the server team receives a whole bunch of alerts about the servers.
Since nothing is wrong with the servers, your server team wastes time spinning their wheels trying to figure out what to fix. Your only hope is that the infrastructure team quickly solves the problem or tells your server team to stand down. But we all know how often that happens!
The Netreo approach is to analyze the context of the event before generating multiple alerts for multiple teams. In this case, Netreo recognizes that the servers are not accessible because the switch is down. All alerts for servers behind the switch are suppressed until the switch is brought back online.
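To make the idea concrete, here is a minimal Python sketch of dependency-based suppression. To be clear, this is not Netreo’s implementation. The `Alert` class, the `suppress_downstream` function, the `parent_of` topology map and the device names are all hypothetical, and the sketch assumes each device has at most one upstream device and that "unreachable" is the only message that marks a device as down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record: which device raised it and why."""
    device: str
    message: str


def suppress_downstream(alerts, parent_of):
    """Drop alerts from devices that sit behind a device reported unreachable.

    parent_of maps a device name to its upstream device (an assumed,
    hand-maintained topology; real tools discover this differently).
    """
    down = {a.device for a in alerts if a.message == "unreachable"}

    def behind_down_device(device):
        seen = set()  # guards against accidental loops in the topology map
        parent = parent_of.get(device)
        while parent is not None and parent not in seen:
            if parent in down:
                return True
            seen.add(parent)
            parent = parent_of.get(parent)
        return False

    return [a for a in alerts if not behind_down_device(a.device)]


# Hypothetical topology: two servers sit behind one switch.
parent_of = {"server-1": "switch-a", "server-2": "switch-a"}
alerts = [
    Alert("switch-a", "unreachable"),
    Alert("server-1", "unreachable"),
    Alert("server-2", "unreachable"),
]

remaining = suppress_downstream(alerts, parent_of)
print([a.device for a in remaining])  # prints ['switch-a']
```

In this sketch only the switch alert survives, so the infrastructure team gets paged and the server team does not. If a server alert arrives while the switch is healthy, it passes through untouched.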
Getting the Right Solution
When considering how tools deal with alert management and fatigue, it’s best to know what you want. Within reason, it’s better to have too many alerts but catch the ones you need, than to have too few alerts and miss a critical notification. Don’t be fooled by tools that use advanced (but unproven) algorithms or technologies to distinguish alerts – nothing is perfect. Look for tools that take a realistic approach to dealing with incidents and provide an intelligent approach to managing them. Also, be prepared with specific questions on the capabilities of vendor tools:
- When a tool claims to show only “true” alerts, ask the vendor how it manages to do that.
- Find out how flexible alert notifications are – does the tool support alert suppression like that in the switch example above?
- How flexible is the tool when it comes to identifying incidents?
- Does the tool support customizing incident thresholds, accessibility, event correlation and more?
Once you have a plan, make sure you keep focused on your objectives. Ask the questions you’ve prepared to ensure vendor solutions align with your expectations. Investigate vendor resources for additional information and real-world examples of managing alert fatigue.
And be sure to check out Netreo’s on-demand webinar series, featuring three episodes specifically on alert management:
- Integrating Incident Management into Monitoring
- Eliminating Event Noise & Alert Fatigue via ITSM + Monitoring Integration
- Winning the Battle Against Data Deluge and Alarm Fatigue
Or schedule a Netreo demo today and see how intelligent alerts help you eliminate alert fatigue, while helping you automate incident management, speed MTTR and focus on delivering a great user experience.