Jumbo Shrimp. Great Depression. Deafening Silence. Virtual Reality. These are all terms known as “Oxymorons”, defined by Webster as “a figure of speech in which apparently contradictory terms appear in conjunction”.
Put your mind to it and you can probably come up with several more that are peppered into our everyday vernacular. However, there’s one commonly overlooked oxymoron that’s closely tied to systems and network management: the dreaded “False Positive”. Although this sounds like it could be some obscure math term you were supposed to have learned in 9th grade Algebra, it isn’t. Instead, it’s an alert or alarm generated by your IT monitoring tools for a condition that’s worthy of neither. Think of the “False Positive” as a digital manifestation of “The Boy Who Cried Wolf”.
You may be thinking, “Okay. I got it. False positives are bad. Tell me something I don’t know.” However, consider the consequences of not taking this problem seriously:
What happened when there was a real wolf nearby getting ready to chow down on our aforementioned boy for dinner? Nobody believed him. If you’re getting deluged with high CPU utilization alerts or SNMP trap messages every time an edge switch port goes up or down, how will you be able to pinpoint legitimate outages?
Ever watch a dog chase its tail? It’s funny and sad all at the same time. Constantly having to run down technical problems that aren’t really problems is similarly futile (although only funny and sad to those who DON’T work in IT, I suppose). Today, information systems are a core part of most businesses, and it takes time and skill to maintain them. Why waste those resources playing whack-a-mole with your technology stack?
What if your revenues are directly tied to a properly functioning Information Technology architecture? False Positives can lead to actions that compromise the information provided to end-users (think eBay, Ticketmaster, and the Nasdaq Stock Market). Your customers will become “former” customers faster than your SysAdmin can say “The monitoring system told me the production Oracle Database was down, so I rebooted it.”
Of course, what kind of a technology blog only offers doom and gloom and not solutions as well? There are effective ways to thwart this Oxymoron-offensive.
The first thing you want to do is establish a good baseline on which all your other monitoring configurations are based. This task needn’t be time-consuming or difficult. Simply pick a one- to two-week period when things were stable in your system and run reports from your monitoring tools telling you the state of things. For example, if memory utilization on your MS-SQL Servers is running at 96% in your reports, but all applications were functioning well and no users were complaining, then THAT’S the baseline on which all SQL Server memory thresholds should be based. Setting a threshold below that level, say at 80%, is just asking for an inbox full of Oxymorons.
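To make the baseline idea concrete, here is a minimal sketch in Python. It assumes you can export utilization samples (as percentages) from your monitoring tool for the stable period. The function name `baseline_threshold`, the choice of the 95th percentile, and the 2-point headroom are all illustrative assumptions, not settings from any particular product.

```python
from statistics import quantiles


def baseline_threshold(samples, headroom=2.0, ceiling=100.0):
    """Suggest an alert threshold (percent) from a stable baseline period.

    samples  -- utilization readings in percent from the stable period
    headroom -- assumed margin added above normal behavior (hypothetical default)
    ceiling  -- a percentage can never usefully alert above 100
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    # The last of the 19 cut points is the 95th percentile, so a single
    # odd spike in the baseline period does not dominate the result.
    p95 = quantiles(samples, n=20, method="inclusive")[-1]
    return min(p95 + headroom, ceiling)


# Memory readings from a stable two-week period on a SQL Server host
# (made-up numbers for illustration).
readings = [94, 95, 96, 96, 97]
threshold = baseline_threshold(readings)
print(f"alert when memory utilization exceeds {threshold:.1f}%")
```

The point of the sketch is only that the threshold comes from what "normal" looked like, not from a round number picked in advance.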
Next, you only want to configure alerts for items that are immediately actionable. In other words, unless you are prepared to drop everything and deal with an alarm that comes in, do not configure it to send you an alert. Consider the following scenario: You have a regular backup that runs at 11pm on a Friday night that always spikes CPU utilization to 98%. First, if it happens regularly, why are you alerting on it? Second, even if you are prepared to deal with the CPU spike, what can you really do about it other than kill off the backup job, which needs to finish at some point anyway? Proper configuration of most monitoring tools adequately deals with these situations.
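Most monitoring tools have their own way of expressing this kind of suppression, so treat the following Python as a rough sketch of the logic rather than any product’s configuration. It assumes the Friday 11pm backup from the scenario above runs for about two hours (the duration is a guess for illustration), and the names `in_window` and `should_alert` are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed recurring maintenance window: Friday 23:00, lasting two hours.
BACKUP_WEEKDAY = 4               # Monday is 0, so Friday is 4
BACKUP_START = time(23, 0)
BACKUP_DURATION = timedelta(hours=2)


def in_window(when, weekday=BACKUP_WEEKDAY, start=BACKUP_START,
              duration=BACKUP_DURATION):
    """Return True if `when` falls inside the weekly window."""
    # Step back to the most recent occurrence of the window's weekday.
    days_back = (when.weekday() - weekday) % 7
    window_start = datetime.combine(when.date() - timedelta(days=days_back), start)
    if window_start > when:
        # Same weekday, but earlier than the start time: use last week's window.
        window_start -= timedelta(days=7)
    return window_start <= when < window_start + duration


def should_alert(cpu_percent, threshold, when):
    """Alert only when the threshold is crossed outside the known window."""
    return cpu_percent >= threshold and not in_window(when)


# 2024-01-05 was a Friday.
print(should_alert(98, 90, datetime(2024, 1, 5, 23, 30)))  # during the backup
print(should_alert(98, 90, datetime(2024, 1, 5, 22, 0)))   # before the backup
```

The window check handles the case where the backup runs past midnight into Saturday, which is easy to get wrong if you only compare the day of the week.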
Lastly, make sure your monitoring tools are customized for your environment. Network and systems management doesn’t have to be an IT project that costs hundreds of thousands of dollars and takes years to implement. However, it isn’t a one-size-fits-all proposition either. When coming up with a monitoring game plan, make sure you’re clear-eyed about the specific requirements of your infrastructure, your visibility requirements, and the problems you’re trying to solve. The monitoring requirements for a managed services provider are going to be vastly different from those of a company that does heavy equipment manufacturing.
When all’s said and done, you’ll get out of your monitoring system what you put into it. If you take the time to configure it properly, you have the opportunity to turn your False Positives into True Negatives. Hey, look at that … we just invented a new Oxymoron.