Resolve production incidents with clear DevOps Consulting
A crucial step toward efficient DevOps consulting is being completely transparent with data. When monitoring data is available to everyone in the value stream, everyone shares a common view of reality, which aids communication and demonstrates transparency, which in turn builds trust.
In this fourth post of our “Eliminating DevOps Monitoring Challenges” blog series, we’ll share with you the importance of monitoring your dev environment in order to improve your DevOps consulting.
Be Transparent with Your Data to Improve DevOps Monitoring
A key element in making this successful is making the information quick and easy to access, so teams don’t feel they need their own monitoring. Separate monitoring creates a lot of alert noise and redundant effort, and adds admin, support, and maintenance costs.
If the audience is non-technical (and in every company, there are always some non-technical people to keep in the loop), don’t hesitate to simplify or abstract the data into easier-to-understand formats. We do this with our strategic groups, which allow you to roll an entire application and all its component parts into a single percentage score.
Sharing this data can take a number of forms, depending on what works for your environment: APIs, common share pages or intranet pages, unauthenticated status pages, and information radiators.
An information radiator is just a system designed to display important status information in a public area – for example, a monitor on the wall of a hallway everyone walks through. The team can display metrics using Netreo’s custom dashboards or business workflows functionality in order to share useful data with the team – current incidents certainly, but also useful operational metrics like response time, transaction velocity, and system or application status.
Using information radiators also lets us show off our DevOps chops: everyone can see the team has nothing to hide from visitors (customers, stakeholders, etc.), and the team has nothing to hide from itself. It acknowledges and confronts problems.
A key factor in making information radiators work well is to make them simple to understand. Our Business Workflows functionality allows you to group arbitrary systems together to see all the different parts that are required to make an application work – databases, storage, load balancers, clusters – and roll them into a single easy-to-understand percentage metric that can be monitored and trended over time, to detect anomalies or manage service level agreements.
You can control how each monitored item is weighted, so only the key metrics or services on each system are counted. That way, these numbers reflect how your applications are actually doing. Systems can also be part of multiple groups, so you can have one group for ‘storage’ that your SAN team focuses on, while the storage arrays are also part of each application that depends on them. Your app teams can then see when a problem might be coming from storage, which cuts down on resolution time. You can then combine multiple groups into Aggregate groups, so the CIO’s dashboard can show a single metric for each line of business or area of responsibility, labeled simply “Network”, “Security”, or “Remote offices”, and she knows immediately if all is well or if something is being worked on. And Netreo’s Auto configuration rules let you do this configuration – and importantly, maintain it going forward – without adding admin overhead.
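To make the idea of a weighted percentage score concrete, here is a minimal Python sketch. It is not Netreo’s implementation or API: the `Check` class, the `group_score` and `aggregate_score` functions, and the choice of a simple average for aggregate groups are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One monitored item (hypothetical structure, not a Netreo object)."""
    name: str
    healthy: bool
    weight: float = 1.0  # a weight of 0 excludes the check from the score


def group_score(checks):
    """Return the weighted percentage (0-100) of healthy checks in a group."""
    total = sum(c.weight for c in checks)
    if total == 0:
        return 100.0  # nothing is counted, so nothing is reported as failing
    healthy = sum(c.weight for c in checks if c.healthy)
    return 100.0 * healthy / total


def aggregate_score(scores):
    """Combine several group scores into one number (assumed: plain average)."""
    if not scores:
        raise ValueError("at least one group score is required")
    return sum(scores) / len(scores)


# The same storage arrays belong to the 'storage' group and to the app group.
storage = [Check("array-1", True, 2.0), Check("array-2", False, 2.0)]
web_app = [Check("db", True, 3.0), Check("load-balancer", True, 1.0), *storage]

storage_pct = group_score(storage)
web_app_pct = group_score(web_app)
overall_pct = aggregate_score([storage_pct, web_app_pct])
```

Because the failing array carries weight in both groups, the storage team and the app team each see the problem reflected in their own score.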
By tagging systems and ports to identify key portions of the application delivery stack, you can make custom dashboards or on-demand reports available in seconds. Where it might once have taken hours to have an engineer run reports on CPU utilization on all the systems running a particular service, or bandwidth on all the ports affecting a particular cluster or VRF group, you can now have this data rendered in seconds with Netreo. A small time investment in tagging – or in creating rules to apply tags – can make generating these kinds of complex reports effortless. And of course, you can automate them so these reports get pushed out daily, weekly, or monthly with no extra effort.
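As a rough illustration of why tags make these reports cheap, the Python sketch below selects systems by tag and summarizes one metric. The inventory, the tag names, and the `report_by_tag` function are hypothetical and only show the general technique, not how Netreo stores or queries tags.

```python
from statistics import mean

# Hypothetical inventory: each system carries a set of tags and a CPU reading.
systems = [
    {"name": "web-01", "tags": {"service:checkout", "tier:web"}, "cpu_pct": 42.0},
    {"name": "web-02", "tags": {"service:checkout", "tier:web"}, "cpu_pct": 58.0},
    {"name": "db-01", "tags": {"service:billing", "tier:db"}, "cpu_pct": 71.0},
]


def report_by_tag(inventory, tag, metric):
    """Summarize one metric across every system that carries the given tag."""
    selected = [s for s in inventory if tag in s["tags"]]
    values = [s[metric] for s in selected]
    return {
        "tag": tag,
        "systems": sorted(s["name"] for s in selected),
        "average": mean(values) if values else None,
        "peak": max(values) if values else None,
    }


checkout_cpu = report_by_tag(systems, "service:checkout", "cpu_pct")
```

Once the tags exist, the report is a single filter over the inventory, which is why it can be generated on demand or on a schedule.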
Having good telemetry makes deployments safer.
If you’ve been monitoring the apps through the dev process as we talked about, you should have a good idea of what to expect even before deployment.
If Netreo is watching your logs for anomalies, you’ll find it easy to detect when something goes wrong. It also allows you to make sure that the deployment didn’t break something else.
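One simple way to think about log anomaly detection around a deployment is to compare the post-deployment error rate against a pre-deployment baseline. The Python sketch below uses a basic z-score test. The function name, the threshold of 3 standard deviations, and the sample counts are assumptions for illustration, and this is not a description of how Netreo detects anomalies.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, z_threshold=3.0):
    """Flag an observed error count that sits far above the baseline.

    baseline: error counts per minute collected before the deployment.
    observed: error count for one minute after the deployment.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    center = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any increase counts as an anomaly.
        return observed > center
    return (observed - center) / spread > z_threshold


errors_before_deploy = [2, 3, 2, 4, 3]  # hypothetical per-minute error counts
spike_detected = is_anomalous(errors_before_deploy, 12)
normal_minute = is_anomalous(errors_before_deploy, 3)
```

Running the same check against the logs of neighboring applications is one way to confirm that a deployment didn’t break something else.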
Comprehensive telemetry also means you can be sure your deployments aren’t creating collateral damage in systems you didn’t think would be affected. We’ve seen more than one deployment happen that monopolized a resource that no one realized was being shared by another application, especially at larger organizations where you have many dev teams working in parallel.
Telemetry data also provides a lot of decision-support metrics, and even allows you to do predictive capacity planning. By knowing in advance what your long-term trends are, even when they’re not obvious, you can prevent performance impacts before they happen.
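To show what predictive capacity planning can look like at its simplest, the following Python sketch fits a straight line to daily usage samples and estimates how many days remain before a threshold is crossed. The linear model, the function names, and the sample data are assumptions for illustration. Real trends are often seasonal or nonlinear, so treat this as a starting point only.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(daily_values, threshold):
    """Estimate days from the latest sample until the trend hits threshold.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_trend(daily_values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(daily_values) - 1))


disk_used_pct = [60, 62, 64, 66, 68]  # hypothetical daily disk usage, percent
days_left = days_until_threshold(disk_used_pct, 90)
```

With usage growing two points a day from 68 percent, the estimate gives about eleven days of headroom before the 90 percent mark.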
And of course, automating the reports to stakeholders improves trust and communications throughout the organization, and saves a considerable amount of operator time. Netreo uses a ‘make any view into a report’ model that means no special training or scripting is required to get exactly the reports you want sent exactly where you need them.
Make sure to keep up to date with our blogs for the next and final part of this DevOps consulting series.