Weather an IT Incident Storm

Build monitoring noise reduction into your IT landscape by leveraging Artificial Intelligence and Machine Learning

Ever watch news coverage of an incoming hurricane? You’ve got those correspondents out there in the elements, wearing their yellow rain ponchos, fighting the wind, and describing the scene to an audience watching at home. That situation reminds me of life as an engineer managing a large-scale IT infrastructure. Although I’m no longer a sysadmin, there were certainly days when I had to put on my metaphorical poncho and weather an incoming storm. To quote a comedian, “In a hurricane it isn’t the wind or the rain by themselves that cause the problem. It’s the combination of the two plus ordinary objects like Volkswagens and mailboxes that get turned into deadly projectiles during storm conditions. These things are what cause issues during inclement weather.”

In this second article in our “Taming IT Chaos” blog series, we’ll dig into distilling the signals from all that noise in a typical IT environment.

Like in a hurricane, it’s pretty rare to find standalone incidents in modern IT landscapes. A single crack in your IT infrastructure can easily trigger a tsunami of alerts and rampant traffic. There’s no question that information of this nature is useful. After all, isn’t that why you added a monitoring system to your infrastructure in the first place? The problem, however, ultimately becomes the so-called “signal-to-noise” ratio. In other words, of all that inbound data, how much can actually be utilized by IT personnel to diagnose and fix problems?

In actuality, the “signal-to-noise” ratio is a myth, insofar as it’s meant to make us, as humans, feel better about doing our jobs. Think about that statement for a moment. It’s practically impossible to track every inbound diagnostic feed competing for our attention, especially in complex infrastructures. Yet that data is being generated for a reason. It’s signaling something; we just lack the ability to process it all. So what’s the best way to deal with this challenge?

There are a couple of strategies.

I. Stop the problem at the source and limit the received signals.

Don’t configure something to alert your IT Operations team unless, hyperbolically speaking, they are prepared to drop everything and immediately fix the problem. We’ve written about this topic in the past when discussing the differences between reporting and alerting and formulating an alert policy for your environment. While this strategy can work, it’s like using a sledgehammer when the delicate touch of a ball-peen hammer is the more appropriate tool. The obvious risk is that if you completely disable the signal, you might miss something important.

II. Use technology to your advantage.

Rather than disable the signal entirely, restructure the relationship between the individual alarms emanating from your monitoring system. By leveraging artificial intelligence and machine learning (AI/ML) to gain a better understanding of the events, IT operators can quickly identify useful signals and zero in on the root cause.

It’s All About Levels, Jerry!

This strategy sounds neat, but how do we get there? First things first: think about how your traffic is organized into “levels.” Signal-processing success and noise reduction start with a thorough study of the underlying IT infrastructure.

Level 1: Physical Networks

At the first level you have the physical networks, which are relatively easy to depict. Graph algorithms can be used to detect key network paths and single points of failure. Based on the network topology, a complicated network can be divided into multiple functionally related areas.
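As a rough sketch of what that graph analysis might look like, here is a minimal, stdlib-only example that flags single points of failure by checking whether removing a node disconnects the topology. The device names are purely illustrative, and a real monitoring platform would use far more efficient algorithms (e.g. articulation-point detection in a single DFS pass):

```python
from collections import defaultdict, deque

def is_connected(adj, nodes, removed=None):
    """BFS connectivity check, optionally ignoring one removed node."""
    remaining = [n for n in nodes if n != removed]
    if not remaining:
        return True
    seen, queue = {remaining[0]}, deque([remaining[0]])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr != removed and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(remaining)

def single_points_of_failure(links):
    """Return the nodes whose removal disconnects the network."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    nodes = set(adj)
    return {n for n in nodes if not is_connected(adj, nodes, removed=n)}

# Illustrative topology: one core feeding two distribution switches.
links = [("core", "dist1"), ("core", "dist2"),
         ("dist1", "access1"), ("dist2", "access2")]
print(single_points_of_failure(links))  # the core and both distribution switches
```

Even this brute-force version makes the point: topology tells you, before any alert fires, which devices are capable of producing a storm.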

Level 2: Virtualization and Containerization

At the next level are virtualization and containerization. Both technologies are fantastic for increasing development flexibility. Unfortunately, even in IT, there’s no such thing as a ‘free lunch’. What you gain in flexibility, you lose in simplicity. Managing virtual environments can be difficult. An accurate inventory/map of host, virtual machine, and container parenting needs to be maintained before any investigation can be conducted in this kind of environment. Classification algorithms based on the attributes of your objects (hosts, guests, containers, etc.) can be useful for dynamically keeping track of the changes that may in turn spike signal quantity.
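A toy version of that parenting inventory might look like the following. All object names and attribute keys here are hypothetical, not tied to any particular hypervisor or container platform’s API; the point is simply that classification plus a parent map lets you roll a container alert up to its VM and host:

```python
from dataclasses import dataclass, field

@dataclass
class InventoryObject:
    name: str
    attributes: dict = field(default_factory=dict)

def classify(obj):
    """Classify an object as host, container, or VM from its attributes."""
    if obj.attributes.get("hypervisor"):
        return "host"
    if obj.attributes.get("image"):
        return "container"
    return "vm"

def build_parent_map(objects, parent_of):
    """Map each object to its (parent, classification) pair."""
    return {child: (parent_of.get(child), classify(obj))
            for child, obj in objects.items()}

# Hypothetical inventory: a container on a VM on a physical host.
objects = {
    "esx01": InventoryObject("esx01", {"hypervisor": "ESXi"}),
    "web-vm": InventoryObject("web-vm", {"guest_os": "linux"}),
    "web-ctr": InventoryObject("web-ctr", {"image": "nginx:1.25"}),
}
parent_of = {"web-vm": "esx01", "web-ctr": "web-vm"}
print(build_parent_map(objects, parent_of))
```

In practice the attributes would come from discovery, and the map would be refreshed continuously as workloads move.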

Level 3: Application Traffic Patterns

Going up one more level, you have application traffic patterns. These are the least organized and require the most creativity to learn. Clustering algorithms applied to traffic metadata are helpful in this situation. The combination of heuristic traffic rules and machine discovery is a powerful one-two punch: it allows you to gain an unbiased understanding of how application traffic is distributed among IT resources and what the traffic patterns are across your infrastructure.

To Everything There is a Season (and a Trend and a Cycle)

Once you have a good understanding of your traffic patterns at the various levels of your infrastructure, basic time-based monitoring can be set up in the form of moving averages or moving windows. Monitored metrics fluctuating outside a healthy range should be alerted upon. To build a smarter monitoring system, time series analysis becomes necessary to detect trends, cycles, and seasonality.
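That basic moving-window monitoring can be sketched in a few lines. This is a minimal example, assuming a simple mean ± k·stdev healthy range over a sliding window; production systems tune the window size and threshold per metric:

```python
from collections import deque

class MovingWindowMonitor:
    """Alert when a metric drifts outside mean ± k·stdev of a sliding window."""
    def __init__(self, size=30, k=3.0):
        self.window = deque(maxlen=size)
        self.k = k

    def observe(self, value):
        alert = False
        if len(self.window) == self.window.maxlen:  # window warmed up
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            stdev = var ** 0.5
            alert = abs(value - mean) > self.k * max(stdev, 1e-9)
        self.window.append(value)
        return alert

monitor = MovingWindowMonitor(size=10, k=3.0)
for t, cpu in enumerate([20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 95]):
    if monitor.observe(cpu):
        print(f"t={t}: anomalous reading {cpu}")  # the final spike to 95 is flagged
```

Its weakness is exactly what the rest of this section addresses: a fixed window has no notion of trends, cycles, or seasonality, so a perfectly normal nightly spike would trip it.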

Trends

Trends are defined as the future movement tendency of a given signal. A family of moving regression algorithms can be used to detect trends of different natures. Variants of LOESS (locally estimated scatterplot smoothing) and LOWESS (locally weighted scatterplot smoothing) can be applied to different use cases.
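As a stand-in for full LOESS/LOWESS local regression, here is the simplest member of that moving-regression family: an ordinary least-squares slope computed over a sliding window. Consistently positive slopes indicate an upward trend; the disk-usage numbers are invented for illustration:

```python
def rolling_slope(values, window=5):
    """Least-squares slope over each sliding window: a crude stand-in
    for LOESS/LOWESS-style local regression trend detection."""
    n = window
    xs = range(n)
    x_mean = (n - 1) / 2
    denom = sum((x - x_mean) ** 2 for x in xs)
    slopes = []
    for i in range(len(values) - n + 1):
        chunk = values[i:i + n]
        y_mean = sum(chunk) / n
        num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, chunk))
        slopes.append(num / denom)
    return slopes

# Disk usage creeping upward: every window reports a positive slope.
usage = [50, 50.5, 51, 51.4, 52, 52.6, 53, 53.7, 54.2]
print(rolling_slope(usage))
```

Real LOESS adds distance-based weighting and polynomial fits, but the principle is the same: fit locally, watch how the fitted tendency moves.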

Cycles

Cycles are special trends that recur at a regular frequency. For example, a regularly scheduled cleanup job can lower CPU usage on a daily basis. These regular changes can be treated as a cycle. When certain cycles don’t happen at the expected frequency, it can be a symptom of more deeply rooted issues.
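Detecting a missed cycle can be as simple as comparing the gaps between occurrences against the expected period. A minimal sketch, assuming hourly timestamps for a daily cleanup job (the numbers are illustrative):

```python
def missed_cycles(event_times, period, tolerance):
    """Return the inter-event gaps that exceed the expected period,
    i.e. places where at least one cycle appears to have been skipped."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return [g for g in gaps if g > period + tolerance]

# Cleanup job expected every 24 hours; one daily run went missing.
runs = [0, 24, 48, 96, 120]
print(missed_cycles(runs, period=24, tolerance=1))  # [48]
```

A gap of 48 hours where 24 was expected is exactly the kind of quiet symptom, an absence rather than an alarm, that pure threshold monitoring never surfaces.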

Seasonality

Seasonality is the periodic fluctuation that repeats on top of trends and cycles. To detect seasonality, you normally need to decouple the seasonal data from the trends or cycles first and assess them separately. Finally, overlay the seasonality data on top of the trend or cycle data to reach the overall prediction.

As the image above illustrates, global anomalies can be detected by trends. In contrast, more subtle, local anomalies can only be noticed when cycle and seasonality detection are in place as well. And, for what it’s worth, you can thank me later for getting this snappy Byrds tune from the mid-’60s stuck in your head.
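The decompose-then-overlay process described above can be sketched as a naive additive decomposition: estimate the trend with a centered moving average, then average the detrended values per phase of the cycle. This simplified version uses an odd centered window rather than the textbook two-pass averaging for an even period, and the sample series is invented:

```python
def decompose(values, period):
    """Naive additive decomposition: moving-average trend, then a
    seasonal component averaged per phase of the cycle."""
    n = len(values)
    half = period // 2
    # Trend via a centered moving average spanning roughly one period.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(values[i - half:i + half + 1]) / (2 * half + 1)
    # Seasonal component: mean detrended value for each phase.
    buckets = {p: [] for p in range(period)}
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(values[i] - t)
    seasonal = {p: (sum(v) / len(v) if v else 0.0) for p, v in buckets.items()}
    return trend, seasonal

# Two "days" of samples with a repeating bump at phase 2.
series = [10, 10, 14, 10, 10, 10, 11, 11, 15, 11, 11, 11]
trend, seasonal = decompose(series, period=6)
print(seasonal)  # phase 2 stands out as the seasonal spike
```

With the seasonal component in hand, a bump at phase 2 is expected and stays quiet, while the same bump at any other phase is a local anomaly worth an alert.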

Three of These Things Belong Together

Not unlike the Sesame Street game where a Muppet and the guest host try to figure out which items in a group of four belong together, once you have an idea of your traffic “levels” and have done some initial time series analysis, the next step is to group and collapse events by time and impact. Noise reduction is a combination of clustering different events and time series analysis. First, based on the network topological information, related events are grouped and examined together. Second, time series models can be applied to extract relationships in the grouped events. This step allows you to distill leading signals from followers. The collection of leading signals from different dimensions can then be presented to human experts to help quickly identify the root cause.
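The grouping-and-collapsing step can be sketched very simply: bucket events by topology group and time window, then treat the earliest event in each bucket as the leading signal. The event data and the `topology_group` mapping are invented for illustration; in a real system that mapping comes out of the level analysis above, and "leading" would be established statistically rather than by timestamp alone:

```python
from collections import defaultdict

# Toy events: (timestamp_seconds, device, message).
events = [
    (100, "switch-3", "CPU warning"),
    (104, "router-1", "latency high"),
    (105, "router-2", "latency high"),
    (107, "switch-3", "CPU critical"),
]
# Hypothetical topology grouping derived from the earlier level analysis.
topology_group = {"switch-3": "site-A", "router-1": "site-A", "router-2": "site-A"}

def group_and_lead(events, window=30):
    """Collapse events into per-group time windows; the earliest event
    in each window is treated as the leading signal."""
    grouped = defaultdict(list)
    for ts, device, msg in sorted(events):
        grouped[(topology_group[device], ts // window)].append((ts, device, msg))
    return {key: evts[0] for key, evts in grouped.items()}

print(group_and_lead(events))
```

Here four alarms collapse into one incident whose leading signal is the switch CPU warning, which is exactly what an operator wants to see first.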

Putting All the Pieces Together

So we’ve discussed the various components to signal processing success and noise reduction, but what would an end-to-end example look like in a real IT infrastructure? Consider the following scenario and diagram:

  • Multiple users have reported access issues, which triggers the IT operators to look into recent network changes.
  • There is a slight surge of alarms from some network devices. Based on network topology and metric relationships, two event clusters are noticed by IT personnel: a spread of network performance issues and a switch reporting a couple of abnormal metrics.
  • Time series analysis, combined with a cycle and seasonality model, raises alarms about a CPU usage surge on the switch, first as a warning and then escalating to critical.
  • A quick investigation points to a misconfiguration of the switch as the root cause of all the symptoms; the misconfiguration is causing the CPU to overheat.
  • The alert is cleared as soon as the correct configuration is re-applied to the switch.

 

The above is a simple example, but the concepts are exactly the same whether you’re talking about one device with two variables (perhaps latency and CPU usage in this case) or hundreds of devices with potentially thousands of variables. In an environment of that magnitude, human intelligence alone cannot get the job done. If you rely on AI/ML instead, you’ve got a fighting chance: IT operators can focus on high-impact events and leading signals. The result is a lower mean time to repair, happier end users, and (best of all) no more need to break out that rain poncho for regular IT outage downpours.

Ready to get started? Get in touch or schedule a demo.