Implementing Alerts for a Refined DevOps Experience
In our last four blog posts, we have been sharing tips on eliminating DevOps monitoring challenges. In case you’ve missed them, make sure to catch up with our “Eliminating DevOps Monitoring Challenges” series.
The next key point is to make sure you’re engaging the right people at the right time. This sounds obvious, but you’d be surprised how often it doesn’t happen. If you’re not going to react to a notification as soon as it comes in, you shouldn’t be alerting on it.
Proactive DevOps Monitoring
If you’re getting alerts when hard drives reach 90% utilization and then ignoring them because they’re not urgent, two things will happen. First, sooner or later you’ll get an application failure because you forgot to circle back on that alert before the drive filled up. Second, you’re conditioning yourself and your teams to ignore alerts, and someday someone will ignore the wrong one and you’ll have a major problem on your hands.
Instead, use reports for things that don’t need to get folks out of bed, so you still stay informed. A morning email showing which printers are low on toner is a lot nicer than being alerted every time it happens, and it lets you schedule replacements for non-intrusive times, preventing any interruption of service.
A key success factor is to schedule and automate these reports, so nothing gets missed and you’re not depending on some poor soul to take time every week to run and send them. A manual task like this is sure to get skipped when the team is up to their necks in alligators, or when the designated engineer is out on vacation and the person covering for them has no idea how to work the reporting software.
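To make the idea concrete, here is a minimal sketch of the kind of digest a scheduled job could assemble each morning. It isn’t Netreo’s reporting engine; the build_digest function and the sample findings are made up for illustration, and we assume something like cron or a job scheduler calls it and mails out the result.

```python
from collections import defaultdict


def build_digest(findings):
    """Group low-urgency findings by category into a plain-text report.

    `findings` is an iterable of (category, item, detail) tuples.
    The function name and tuple shape are hypothetical.
    """
    grouped = defaultdict(list)
    for category, item, detail in findings:
        grouped[category].append(f"  - {item}: {detail}")

    lines = []
    for category in sorted(grouped):
        # Header line shows the category and how many items it contains.
        lines.append(f"{category} ({len(grouped[category])})")
        lines.extend(grouped[category])
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample data only; a real job would pull these from the monitoring system.
    print(build_digest([
        ("Printers", "prn-3f", "toner 8%"),
        ("Disks", "db01 /var", "91% used"),
        ("Printers", "prn-1a", "toner 5%"),
    ]))
```

Because the script takes no human input, it keeps running when the usual report owner is on vacation.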
Let’s look at a few examples of alerts to see what’s really actionable.
If you get a high CPU utilization alarm, is that actionable? Well, it might be if the CPU has been stuck there for a while. Netreo allows you to set a delay for these types of alerts, so it won’t generate an alarm until utilization has been over 90% for some length of time. If your SQL server briefly pegs the CPU, that’s probably normal, and you don’t want that alarm, but if it’s been pegged for 30 minutes you might have a job you need to go kill.
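The logic behind a delayed alarm is simple enough to sketch. The following is an illustration of the general technique, not Netreo’s implementation; the SustainedThreshold class, its field names, and its defaults are our own assumptions, with the 90% limit and 30-minute hold taken from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Alarm only after a metric stays above a limit for a hold period.

    Hypothetical helper: names and defaults are illustrative.
    """
    limit: float = 90.0           # percent CPU utilization
    hold_seconds: float = 1800.0  # 30 minutes
    breach_started: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True if the alarm should be active."""
        if value <= self.limit:
            # Back under the limit: forget the earlier breach entirely.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.hold_seconds


if __name__ == "__main__":
    check = SustainedThreshold()
    for t, cpu in [(0, 95), (900, 99), (1800, 97), (1860, 50)]:
        print(t, cpu, check.observe(t, cpu))
```

A short spike resets the clock as soon as utilization drops, so only a CPU that stays pegged for the full hold period raises the alarm.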
If a switch port goes down, is that actionable? If it’s plugged into something important, generally yes, so this is a good candidate for an immediate alarm. But what if it’s part of a port-channel or it’s just one server in a large cluster? That might warrant a different priority or a different method of notification that doesn’t require a ‘drop everything and fix this’ response.
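One way to picture that decision is as a small routing rule. This is a sketch under our own assumptions: the Route values, the PortDownEvent fields, and the route_port_down function are hypothetical, and a real system would derive the two flags from inventory or topology data.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the on-call engineer"
    TICKET = "open a ticket for business hours"
    REPORT = "include in the daily report"


@dataclass(frozen=True)
class PortDownEvent:
    device: str
    port: str
    critical_peer: bool  # is something important plugged into this port?
    redundant: bool      # port-channel member or clustered server still covered?


def route_port_down(event: PortDownEvent) -> Route:
    """Pick a notification method for a port-down event (illustrative rule)."""
    if not event.critical_peer:
        return Route.REPORT
    if event.redundant:
        # Service is still up, so this needs attention but not a wake-up call.
        return Route.TICKET
    return Route.PAGE


if __name__ == "__main__":
    uplink = PortDownEvent("core-sw1", "Gi1/0/1", critical_peer=True, redundant=False)
    print(route_port_down(uplink).value)
```

Only the non-redundant, critical case pages someone; everything else still gets recorded, just through a quieter channel.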
What about when the website goes down? Well, that’s generally a big yes. But you want to make sure it’s down from outside your environment and not just unreachable from the inside because of a VPN failure to your VPC, so using things like application checks launched from multiple remote locations can help you assess the seriousness of the issue. Netreo makes it easy to do that from anywhere, including your data center or our reflectors in the cloud.
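Here is a rough sketch of how results from several vantage points might be combined. It is not how Netreo evaluates its checks; the classify_outage function, the location names, and the four result labels are assumptions made for the example.

```python
from typing import FrozenSet, Mapping


def classify_outage(
    results: Mapping[str, bool],
    internal: FrozenSet[str] = frozenset({"datacenter"}),
) -> str:
    """Classify a site check from per-location results (True = check passed).

    Locations listed in `internal` sit inside your environment; all others
    are treated as external vantage points. Names are hypothetical.
    """
    if all(results.values()):
        return "ok"
    external = [ok for loc, ok in results.items() if loc not in internal]
    if not external:
        return "unknown"        # nothing outside to confirm with
    if all(external):
        return "internal-only"  # e.g. a VPN problem, customers unaffected
    if not any(external):
        return "site-down"      # every outside location agrees
    return "partial"            # some outside locations fail, some pass


if __name__ == "__main__":
    print(classify_outage({"datacenter": False, "cloud-east": True, "cloud-west": True}))
```

Only "site-down" clearly deserves the drop-everything page; "internal-only" is still worth fixing, but your customers probably haven’t noticed.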
Netreo builds in tools to help you optimize your alerts by identifying what I like to call the ‘noisy birds’. These are the few configured monitors that are responsible for a disproportionate number of alerts, or those where the sensitivity of the alarm might need adjusting.
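If you want to try the same analysis on your own alert history, the idea fits in a few lines. This is a sketch, not Netreo’s tooling: the noisy_birds function, the share parameter, and the monitor names are hypothetical, and we assume you can export one monitor identifier per alert.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_birds(alert_sources: Iterable[str], share: float = 0.5) -> List[Tuple[str, int]]:
    """Return the loudest monitors that together produce `share` of all alerts.

    `alert_sources` holds one monitor identifier per alert fired.
    """
    counts = Counter(alert_sources)
    total = sum(counts.values())
    noisy, covered = [], 0
    for monitor, count in counts.most_common():
        if covered >= share * total:
            break
        noisy.append((monitor, count))
        covered += count
    return noisy


if __name__ == "__main__":
    # Sample history only: ten alerts from three monitors.
    history = ["db01:cpu"] * 6 + ["sw02:port7"] * 3 + ["web01:http"]
    for monitor, count in noisy_birds(history):
        print(f"{monitor}: {count} alerts")
```

The monitors this returns are the first places to look at delays and thresholds before touching anything else.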
You can also use template baseline reports to see exactly which alerts are being generated by your configuration templates and make changes to them directly that can then be automatically rolled out to all of the devices where that template is applied.
Other specialized reports, like the notification summary, show exactly which alarms are going out, with a count of how many are coming from each device and from each type of service check or threshold alarm.
In an ever-evolving environment, agility is essential to keeping pace. With a proactive DevOps monitoring system, you can move faster and test earlier while improving quality and reducing costs.