Hybrid Root Cause Analysis Solutions Are the Key to Effective Problem Solving
IT engineers often refer to the term “MTTR” (Mean Time To Repair). It is a key metric that indicates how quickly an IT team can resolve outstanding issues. Although today's monitoring tools are more sophisticated and can raise alerts much earlier than before, teams still struggle to find the root cause of an issue and identify the appropriate remedy. This is often why MTTR is much longer than desired.
In this situation, Root Cause Analysis can help by combining past human experience with machine-powered data processing to provide visibility sooner and suggest the most promising candidate solutions.
In this fourth article in our “Taming IT Chaos” blog series, we’ll introduce a Root Cause Analysis solution using Machine Learning (ML) and related technologies.
Contextualize An Alert
When an alert fires, any of many failures may have caused it. The first step toward understanding the alert is to contextualize it. Typically, four kinds of context can be applied:
- Vertical stack
- Horizontal traffic path
- Transaction flow
- Time-series event association
1. Vertical Stack
A modern IT infrastructure consists of multiple layers: from the physical network, to hosts, services, and applications. For each layer, there are different monitoring mechanisms such as traffic monitoring for the network layer, and service checks for the service and application layers.
Associating an alert with the monitoring metrics of each layer is the first step in contextualizing it, as illustrated below:
2. Horizontal Traffic Path
The traffic path is another dimension along which to contextualize an alert. It connects alerts with monitoring metrics along the traffic flow. For example, the root cause of a website issue can be tracked down to its related gateway, web server, function service, or search service, as illustrated below:
3. Transaction Flow
A transaction is a series of actions executed consecutively that jointly fulfill a task. A transaction can be a high-level eCommerce flow that consists of search, shopping cart operation, and payment. Or it can be a low-level database SQL execution that includes multiple steps of in-memory computation and a final commit.
Relating an alert to its transaction flow allows IT operators to connect the business purpose with its underlying operations and outline the scope of the alert’s impact, which helps them pinpoint the critical path quickly.
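As a rough illustration of scoping an alert's impact, the sketch below models each transaction as an ordered list of steps and reports which transactions contain the alerting step, together with the steps from that point onward. The transaction names, the step names (including `product_page`), and the `impact_scope` function are all hypothetical, chosen only to mirror the eCommerce example above.

```python
# Hypothetical transaction definitions: name -> ordered list of steps.
TRANSACTIONS = {
    "checkout": ["search", "cart", "payment"],
    "browse": ["search", "product_page"],
}


def impact_scope(alerting_step, transactions):
    """Return, per affected transaction, the alerting step and all later steps."""
    scope = {}
    for name, steps in transactions.items():
        if alerting_step in steps:
            start = steps.index(alerting_step)
            # Everything from the failing step onward is potentially impacted.
            scope[name] = steps[start:]
    return scope


if __name__ == "__main__":
    print(impact_scope("cart", TRANSACTIONS))
    print(impact_scope("search", TRANSACTIONS))
```

An alert on a step shared by several transactions, such as `search` here, shows up in every flow that depends on it, which is what makes the business impact visible at a glance.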
4. Time-Series Event Association
All IT events can be subjected to time-series analysis: sequential relationships among events can be identified, and related events can be grouped and sorted.
A sequential relationship does not necessarily indicate a causal relationship. However, a frequent, repetitive sequential relationship can be viewed as a pattern that may lead to further discovery when combined with other information.
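One simple way to surface such repetitive sequences is to count how often one event type is followed by another within a short time window. The sketch below assumes events arrive as (timestamp in seconds, event type) pairs. The function name, the 60-second window, and the minimum count are illustrative assumptions, not part of any particular product.

```python
from collections import Counter


def frequent_sequences(events, window_seconds=60, min_count=2):
    """Count ordered pairs (A, B) where B occurs within the window after A.

    events: iterable of (timestamp_seconds, event_type) tuples.
    Returns a list of ((A, B), count) with count >= min_count, most frequent first.
    """
    ordered = sorted(events)  # sort by timestamp
    pair_counts = Counter()
    for i, (t_first, first_type) in enumerate(ordered):
        for t_next, next_type in ordered[i + 1:]:
            if t_next - t_first > window_seconds:
                break  # later events are even further away
            if next_type != first_type:
                pair_counts[(first_type, next_type)] += 1
    return [(pair, n) for pair, n in pair_counts.most_common() if n >= min_count]


if __name__ == "__main__":
    sample = [
        (0, "disk_full"), (10, "db_slow"),
        (300, "disk_full"), (320, "db_slow"),
        (600, "cpu_high"),
    ]
    print(frequent_sequences(sample))
```

A pair that passes the threshold is only a candidate pattern. As noted above, it still has to be combined with other context before anyone treats it as a cause.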
Supervised Training on Past Incidents
Past experience is a valuable asset for solving today’s issues. Past alert data, together with its contextual data and solution information, makes a natural training set for a machine to learn from. This is typically a supervised learning process, and multiple learning models can be applied, such as regression, decision trees, or neural networks.
Normally, no single model fits all scenarios, and no problem has one perfect solution. A strength of machine learning is that the results of multiple models can be combined to produce a series of candidate solutions ranked by confidence score. Humans can then examine the suggested solutions, investigate them further, or try them.
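The combination step can be sketched as follows, assuming each trained model has already produced a confidence between 0 and 1 for each candidate solution it proposes. Averaging the confidences is only one simple way to combine models and is an assumption made here for illustration. The function name and the candidate solution names are hypothetical.

```python
def rank_solutions(model_outputs):
    """Average per-solution confidence across models and rank the results.

    model_outputs: list of dicts mapping candidate solution -> confidence in [0, 1].
    A model that does not propose a solution contributes 0 for it.
    Returns a list of (solution, averaged_confidence), highest first.
    """
    totals = {}
    for output in model_outputs:
        for solution, confidence in output.items():
            totals[solution] = totals.get(solution, 0.0) + confidence
    n_models = len(model_outputs)
    averaged = {s: total / n_models for s, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    outputs = [
        {"restart_service": 0.8, "add_disk": 0.2},         # e.g. a decision tree
        {"restart_service": 0.6, "rollback_deploy": 0.4},  # e.g. a neural network
    ]
    for solution, score in rank_solutions(outputs):
        print(f"{solution}: {score:.2f}")
```

The ranked list is what a human operator would review. Nothing in this sketch executes a remedy.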
Beyond the Root Cause Analysis Solution
With a confidence score attached to each solution suggested by machine learning, human operators can even set up a self-remediation mechanism. For solutions that have high confidence and would have little negative impact if applied, human operators can allow machines to execute them automatically and monitor the outcome.
As more experience is learned and modeled by machine intelligence, less human intervention is required, even for catastrophic incidents. This can be the first step toward full playbook automation, the ultimate dream of IT operations.
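A minimal sketch of such a gate is shown below, assuming each suggestion carries a confidence score and an impact label assigned beforehand by operators. The `Suggestion` class, the impact labels, and the 0.9 threshold are hypothetical. A real policy would be set by the operations team.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    action: str
    confidence: float  # model confidence in [0, 1]
    impact: str        # operator-assigned label: "low", "medium", or "high"


def triage(suggestions, min_confidence=0.9):
    """Split suggestions into auto-executable actions and ones needing human review."""
    auto, review = [], []
    for s in suggestions:
        # Only high-confidence AND low-impact actions are allowed to run unattended.
        if s.confidence >= min_confidence and s.impact == "low":
            auto.append(s.action)
        else:
            review.append(s.action)
    return auto, review


if __name__ == "__main__":
    auto, review = triage([
        Suggestion("clear_tmp_files", 0.95, "low"),
        Suggestion("failover_database", 0.97, "high"),
        Suggestion("restart_service", 0.70, "low"),
    ])
    print("auto:", auto)
    print("review:", review)
```

Requiring both conditions keeps a confident but risky action, such as the database failover in this example, in front of a human.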