Former Major League pitcher Vernon Law is quoted as saying, “Experience is a hard teacher because she gives the test first … [and] the lesson afterwards.” Everybody reading this blog post right now probably has his or her own real-life example. For me, I was four years old and my family was having some renovations done to our house. I can dispense with the details and simply say that I will NEVER again put my finger into a coverless light switch. Misadventures of a curious little boy aside, is there a lesson from my shocking experience that can be applied to the world of infrastructure monitoring and management? As we continue with our “7 Habits” series, we’ll see that the answer to this question is an unequivocal “yes,” and that the most effective administrators engage in this sort of activity every day.
In addition to avoiding the prophecy of the aforementioned Pittsburgh Pirates right-hander, why should network and systems administrators care about past events, even if those events pre-dated their tenure? The answer is as simple as it is (hopefully) obvious: if you don’t know what caused a problem in the first place, it’s a pretty good bet it’ll happen again. Once I had a couple of years of professional sysadmin experience under my belt, a “Disk Full” alert at 3 a.m. wasn’t resolved by simply wiping the sleep from my eyes, dialing into my network (yes, I’m that old), and clearing drive space on the culprit server. It was also necessary to dig into the root cause of the problem. However, since the most effective administrators know history often repeats itself, they take things a step further. They look at longer time horizons and use the tools at their disposal to predict problems before they begin.
It’s easy to talk a good game when it comes to predicting infrastructure problems, but what’s actually necessary before we can confidently attach a Fortune-Teller’s turban to our head? In the real world, we don’t have access to Midi-chlorians or Flux Capacitors. However, we do have access to massive quantities of historical data. The key is to parse that data and look for patterns of behavior. Take a look at the following histogram:
What can we ascertain about this device on this network, besides the obvious stuff like the fact that it’s a Palo Alto firewall and that Ethernet 1/1 is the busiest network interface?
A) This company maintains typical business hours.
B) It looks like lunch for most employees starts at 11:30 a.m.
C) People apparently work harder on Thursdays and Fridays than the other days of the week.
I’m not a psychic, nor do I play one on TV. I simply noted patterns in the time-series data gathered over the last 30 days. If we can derive these insights from an “eyeball” test of one simple histogram, imagine what could be accomplished with terabytes of historical data. For example, the third cluster of data points contains six spikes, not five. In the fourth set, there are two abnormally large bandwidth spikes. Do these anomalies highlight potential trouble? Without more data we cannot definitively say. However, with the aid of machine-learning algorithms and a larger sample size, the sky is the limit. An information base of time-series data, availability knowledge, and feeds from monitored-device syslog and NetFlow allows engineers to up their “Fortune-Teller” game and spot trouble in advance.
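To make the “eyeball test” concrete, here is a minimal sketch of how an anomaly like that sixth spike could be flagged programmatically: compare each new sample against a rolling baseline of the preceding points and flag anything that strays more than a couple of standard deviations from it. The bandwidth figures below are hypothetical, stand-ins for whatever your NMS has collected.

```python
from statistics import mean, stdev

def find_anomalies(samples, window=7, threshold=2.0):
    """Flag points deviating more than `threshold` standard deviations
    from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical daily bandwidth peaks (Mbps); the eighth value is a spike.
daily_peaks = [410, 395, 405, 400, 398, 402, 399, 880, 401, 397]
print(find_anomalies(daily_peaks))  # [7] -- the index of the abnormal spike
```

Real anomaly-detection engines are far more sophisticated (seasonality, weekday patterns, and so on), but the core idea is the same: the baseline is learned from history rather than hard-coded.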
We’ve examined the “Why?” and the “What?” of using predictive analysis to spot infrastructure weak spots, but how do we get there? What are some concrete steps we can take to allow network and systems administrators to be more effective at their jobs?
As I’ve previously written, there are two primary ways of extracting diagnostic information from network and systems gear. Data can either be pulled into your NMS or pushed into it from outside sources. In a well-rounded monitoring regime, both methods are necessary. However, simply pulling all available data, or sending everything (no matter what it is) into your NMS, is like trying to take a small, refreshing sip of water from a fire hose. Don’t do it. You’ll be making way too much work for yourself. The promise of “AIOps” and machine-learning tools is that you can simply activate inbound data feeds and magically have visibility into every corner of your environment. There’s only one problem with that promise: oodles of data without insight is just noise. Start small to get a feel for what’s coming in. From there, you can begin to train your tools on what specifically to look for.
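“Starting small” with a pushed feed can be as simple as filtering on severity before anything reaches your NMS. The sketch below uses the standard syslog severity levels (0 = emergency through 7 = debug); the sample messages and the `worth_ingesting` helper are illustrative, not part of any particular product.

```python
# Standard syslog severity levels: lower number = more severe.
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}

def worth_ingesting(message, cutoff="warning"):
    """Keep only messages at or above the cutoff severity."""
    return SEVERITY[message["severity"]] <= SEVERITY[cutoff]

# Hypothetical inbound feed from two monitored devices.
feed = [
    {"host": "fw01", "severity": "info",    "text": "session opened"},
    {"host": "fw01", "severity": "err",     "text": "interface flap on eth1/1"},
    {"host": "db02", "severity": "debug",   "text": "query plan cached"},
    {"host": "db02", "severity": "warning", "text": "lock wait exceeded 5s"},
]

kept = [m for m in feed if worth_ingesting(m)]
print(len(kept))  # 2 -- only the err and warning messages survive
```

Once you’ve seen what a week of warnings and errors looks like, you can lower the cutoff deliberately rather than drinking from the fire hose on day one.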
The biggest hurdle to the recommendation of “starting small” when it comes to pushed and pulled data feeds is that it’s hard to know where to start. Put another way, “We don’t know what we don’t know.” That’s why we’re using machine learning in the first place. A good rule of thumb here is to pick one of your mission-critical applications and start as high in the stack as possible, then work your way down from there. If you’re monitoring a web-based application, start with user experience. Track URL load times from local and outside perspectives. Next, move to the variables responsible for the services that run the application (e.g., ‘Current Connections’ for IIS or ‘Lock Wait Time’ for a database in MS-SQL). Continue with the monitoring of the server(s) where this application lives, and then the networking gear that’s responsible for connecting it to your end-users. After each “application” is added to your NMS, take stock of the data being gathered. Is there too much? Not enough? Season to taste and move on to the next application. Following this methodology for each of your applications will eventually lead to a complete picture of your infrastructure, because everything is networked together.
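The top-of-stack check above can be sketched in a few lines: time the user-facing fetch and compare it to a load-time budget. The `fetch` callable here is a hypothetical stand-in for whatever actually retrieves the page (a wrapper around `urllib.request.urlopen`, a synthetic-transaction agent, etc.); the simulated fetch lets the sketch run without network access.

```python
import time

def check_load_time(fetch, budget_seconds=2.0):
    """Time a page fetch and report whether it met the budget."""
    start = time.monotonic()
    fetch()  # any callable that retrieves the monitored URL
    elapsed = time.monotonic() - start
    return {"elapsed": elapsed, "within_budget": elapsed <= budget_seconds}

# Simulated 50 ms fetch standing in for a real HTTP request.
result = check_load_time(lambda: time.sleep(0.05), budget_seconds=2.0)
print(result["within_budget"])  # True
```

Running the same check from both an inside and an outside vantage point is what separates “the server is up” from “the user experience is good.”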
Now we have the data we need as part of our information base. And it should be a manageable subset, because you did this one application at a time, right? At its core, machine learning and predictive analysis work by analyzing patterns. In all the data you’ve started to gather, patterns will likely start to form after a week or two. It is your job, as a human first, to understand what the data is showing you. Once you’re confident in the data being pushed/pulled to your NMS, it’s time to consider what conditions are alert-worthy. Flexibility in this aspect is key. All NMS systems will allow you to set specific thresholds for when an alert gets sent (e.g., alert when memory usage stays above 3 GB, on a server with 4 GB available, for the previous 10 minutes). Limiting threshold setup to static levels in this manner is advisable, but incomplete. You’re still in “fire-fighting” mode in this scenario. What you ultimately want from your tools is a way to look back over a longer time horizon and evaluate observed patterns. Let’s suppose the pattern for this particular server is that memory usage has been hovering at 1 GB used, and there’s an anomalous spike to 2 GB in the past hour. Your static 3 GB threshold won’t flag this situation, but it’s likely something a responsible engineer would want to know about. The most effective network and systems administrators seek out this kind of flexibility in their tools.
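The memory-usage scenario above can be sketched directly: a static threshold stays silent on the 1 GB-to-2 GB jump, while a check against the historical baseline catches it. The sample history and the 1.5× ratio are hypothetical values chosen for illustration; a real tool would learn the baseline and its tolerance from far more data.

```python
from statistics import mean

STATIC_LIMIT_GB = 3.0  # alert only above 3 GB on a 4 GB server

def static_alert(usage_gb):
    """The classic fixed threshold from the example above."""
    return usage_gb > STATIC_LIMIT_GB

def baseline_alert(history_gb, usage_gb, ratio=1.5):
    """Alert when current usage exceeds the historical mean by `ratio`."""
    return usage_gb > ratio * mean(history_gb)

# Hypothetical samples: memory has hovered near 1 GB for days...
history = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
current = 2.0  # ...then doubles in the past hour

print(static_alert(current))             # False: the spike goes unnoticed
print(baseline_alert(history, current))  # True: the pattern change is flagged
```

Both checks have their place: the static threshold guards against outright exhaustion, while the baseline-relative one surfaces the pattern change a responsible engineer would want to investigate.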
Back in the mid-1950s, when Vernon Law was pitching in baseball games, computer scientist Alan Turing was pitching “The Imitation Game.” He asked, “Can machines think?” The answer is still debated 65 years later. However, networks are faster and data feeds more plentiful than they’ve ever been. Luckily, hardware is more powerful and algorithms are more advanced, too. NMS systems may not quite be up to the level of ‘thinking’ about what to alert on, but we’re closer than ever to “Experience” being no more difficult a teacher than Mrs. Hodgkinson in 10th-grade Geometry.
Netreo’s long-term memory storage and user-friendly interface make tracing the beginning of an issue an easy task.