DevOps Monitoring Challenges
Some of the most common DevOps monitoring challenges we hear about from customers may be all too familiar to some of you. One of the most frequent is that teams lack visibility into the whole environment. This is both a symptom and a cause of labor-intensive monitoring, loosely coupled point tools, and a lack of hard data for capacity planning or assessing success. Many conventional management tools are slow to adapt to the rapid changes and deployments that are key to DevOps, and this leads to blind spots. And if teams don’t share a single view of reality, you get finger-pointing between them, which can cripple your resolution times.
Visibility often suffers when it’s difficult and labor-intensive to manage. If a manual process is required to add monitoring onto new systems, there’s just no way it’ll ever stay up to date.
Often, this lack of an up-to-date view across the organization leads teams to implement their own point solutions, and the more tools you’re dealing with, the more likely they are to contradict each other, to say nothing of the increased admin and training burdens and the communication issues this can cause.
Also, if we don’t have good, hard data about how things are actually performing, it’s difficult to avoid slowing down response times, over-saturating infrastructure, damaging your reputation, and adding unnecessary costs. DevOps should be helping your organization, not causing more problems!
The State of DevOps report told us that, “High performing [organizations] resolve production incidents 168 times faster than their peers, with the median high performer having an MTTR measured in minutes, while the median low performer had an MTTR measured in days. The top two technical practices that enabled fast MTTR were the use of version control by Operations, and having telemetry and proactive monitoring in the production environment.”
A key finding here is that DevOps often focuses on producing applications faster, not on promoting operational visibility after deployment, and that hurts us when it comes time to deal with a critical failure. So why aren’t we already reducing MTTR with telemetry and proactive monitoring? Because we’re stuck in firefighting mode. If development isn’t thinking about the monitoring part of the equation DURING the development cycle, they’re crippling operations, and as a result, MTTR.
Reactive alarming means we’re not detecting unusual but significant behavior, and we’re not finding problems before they start impacting users.
Lack of automation is also a big factor because the admin and training burden of running multiple monitoring platforms means we’re wasting a lot of time and we’re probably never keeping up with the speed of DevOps deployment. It also means we’re manually intervening every time there’s a problem.
Most platforms also suffer from slow, cumbersome reporting, and little ability to customize or automate it.
This leads to a lack of real data for making planning decisions or scaling efficiently.
Utilizing Telemetry Data to Improve DevOps Monitoring
The first and most important key is to get telemetry on our applications. Telemetry is just data, but it’s data you can use; burying the important data in a petabyte of log files doesn’t help anyone. The most important step in improving DevOps visibility is making sure you’re collecting the right data, and the best way to get monitoring into Operations is to start by designing, planning, and building it in Development. Making it as easy as possible to get that data into your monitoring platform is the first step in getting development on board. That means you need a flexible set of APIs to push data to, and easy ways to manage the configuration, preferably fully automated ones. If we make it too difficult to add this visibility into the applications, the developers will resent having to do it, and it will slow down our deployment speed, which is the opposite of what we want. But if we make it easy enough, the developers can use this data to help them debug and optimize too, and they’ll embrace it, which helps everyone.
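As a sketch of how small that integration can be, here is a minimal metric push in Python using only the standard library. The endpoint URL and JSON payload schema here are assumptions for illustration, not Netreo’s actual API; substitute whatever your monitoring platform documents.

```python
import json
import urllib.request

# Hypothetical push endpoint -- replace with your monitoring platform's
# documented ingestion URL and payload schema.
MONITOR_URL = "https://monitoring.example.com/api/metrics"

def build_metric(name, value, tags):
    """Serialize one metric sample as a JSON payload for the push API."""
    return json.dumps({"metric": name, "value": value, "tags": tags}).encode("utf-8")

def push_metric(name, value, **tags):
    """POST a single metric sample to the (assumed) push endpoint."""
    req = urllib.request.Request(
        MONITOR_URL,
        data=build_metric(name, value, tags),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status  # a 2xx status means the sample was accepted

# From inside the application, reporting a number becomes one line:
# push_metric("checkout.latency_ms", 412.0, service="web", env="prod")
```

When the cost of instrumenting a code path is a single call like this, developers have little reason to skip it, and every reason to use the resulting data themselves.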
An important way to make this data really useful is to get a single view into the environment to make sure nothing gets missed. This has been a primary goal of Netreo since the beginning. By integrating application-level and system-level metrics into a single user interface with consolidated reporting, you’re enabling the system to detect correlations for you, and empowering your operators to better find the cause of problems without having to go on a scavenger hunt through multiple systems.
So we need to get beyond up/down monitoring. Having telemetry means you have good visibility into the entire application stack: the user experience, the applications, the devices, the systems, and the underlying infrastructure. To get there, we need to be able to see into the application itself, so we can detect sudden changes and anomalies, and know when something isn’t working right, when a problem exists only at scale, or when a deployment breaks something unexpectedly.
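To make “detect anomalies, not just outages” concrete, here is a deliberately simple sketch: flag any sample that sits far from the recent rolling baseline using a z-score. Real monitoring platforms use far more robust models; the window size and threshold here are illustrative assumptions.

```python
import statistics
from collections import deque

class AnomalyDetector:
    """Toy anomaly detector: flag samples far from a rolling baseline.

    Alerts on 'unusual', not just 'down' -- the core idea behind
    proactive monitoring, in its simplest possible form.
    """

    def __init__(self, window=60, threshold=3.0):
        self.samples = deque(maxlen=window)  # recent baseline values
        self.threshold = threshold           # z-score cutoff (assumed)

    def observe(self, value):
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 10:  # wait for a minimal baseline
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            if stdev > 0 and abs(value - mean) / stdev > self.threshold:
                anomalous = True
        self.samples.append(value)
        return anomalous
```

Feed this a latency or error-rate series and a sudden spike stands out long before a hard threshold would fire, because the baseline adapts to what “normal” looks like for that metric.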
A simple test is usually pretty easy to create for the applications, but you’ll have to work with development to settle on exactly what that test case is, what functions it links into, and what method should be used to collect that data. Continuous testing is a lot easier if you build a few hooks into the application that can show you where, and how, something is going wrong. Pushing that data into Netreo via API is as easy as making a web call, and tests that are too deep to run from the outside can be integrated right into your main operations dashboard.
One example of this is a customer who set up their application so that if a certain internal call took longer than its designed threshold in seconds, an alert was pushed directly to the operations console in Netreo. The operations team knew about it immediately and could pull long-term reports on how often it was happening. This let the development team optimize the way the application scaled and gave them good historical insight into how those optimizations improved performance over time.
Make sure to keep up to date with our blogs for the next parts of this series on how to improve DevOps Monitoring.