How to Establish a Sustainable ML/AI Practice in IT Monitoring
With more and more people working from home and the ever-increasing complexity of IT infrastructure, it’s important to understand the best way to leverage Machine Learning (ML) and Artificial Intelligence (AI) to improve IT operations.
ML and AI promise disruptive changes to IT operations, and many organizations have already adopted Artificial Intelligence for IT Operations (AIOps) or plan to do so soon. Yet implementing and deploying AIOps is still very challenging. Here, we would like to provide some tips for a successful AIOps implementation.
How to guarantee a successful AIOps implementation?
Tip 1: Data is gold
ML and AI are both well known for their hunger for data, and it is hard to overstate the importance of data to a successful AIOps implementation. IT monitoring tools are rich in device and incident metrics, but other data such as digital asset inventory, organization structure, and workflow information can dramatically increase the effectiveness of AIOps.
Tip 2: Data quality still matters
The quality of data is as important as its quantity. Although ML and AI models are more tolerant of noise than traditional analytical methods, the old saying “garbage in, garbage out” still holds true in most cases. Bogus alerts, outdated information, and disorganized data bring chaos rather than clarity. Good data quality, such as accurate timestamps and fresh records, sets a solid foundation for the whole AIOps journey.
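As a rough illustration, a quality gate in front of the models could look like the sketch below. The alert fields (`device`, `metric`, `timestamp`), the 24-hour `MAX_AGE` threshold, and the function names are assumptions made for this example, not part of any specific monitoring product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness threshold; tune it to your environment.
MAX_AGE = timedelta(hours=24)

def is_usable(alert, now):
    """Return True if the alert has a parseable, timezone-aware, fresh timestamp."""
    raw = alert.get("timestamp")
    if not raw:
        return False
    try:
        ts = datetime.fromisoformat(raw)
    except (TypeError, ValueError):
        return False  # malformed timestamp
    if ts.tzinfo is None:
        return False  # ambiguous time zone: treat as low quality
    age = now - ts
    return timedelta(0) <= age <= MAX_AGE

def clean(alerts, now):
    """Drop unusable alerts and exact duplicates (same device, metric, timestamp)."""
    seen = set()
    kept = []
    for alert in alerts:
        if not is_usable(alert, now):
            continue
        key = (alert.get("device"), alert.get("metric"), alert["timestamp"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(alert)
    return kept

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
alerts = [
    {"device": "sw-1", "metric": "cpu", "timestamp": "2024-01-02T11:30:00+00:00"},
    {"device": "sw-1", "metric": "cpu", "timestamp": "2024-01-02T11:30:00+00:00"},  # duplicate
    {"device": "sw-2", "metric": "mem", "timestamp": "2023-12-01T00:00:00+00:00"},  # stale
    {"device": "sw-3", "metric": "cpu", "timestamp": "not-a-time"},                 # malformed
    {"device": "sw-4", "metric": "cpu", "timestamp": "2024-01-02T11:00:00"},        # no time zone
]
cleaned = clean(alerts, now)
```

Only the first alert survives in this sample; the rest are rejected as duplicate, stale, malformed, or ambiguous.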
Tip 3: Differentiate real-time and non-real-time analytics
One trick to organizing data is to differentiate real-time data from non-real-time data. Real-time analytics demands a very different set of pipelines than batch processing, so it is generally good practice to keep the two separate in terms of storage, processing, and prediction. When real-time data becomes obsolete, it can be merged into the non-real-time data to leave room for the latest real-time data.
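A minimal sketch of that separation, assuming samples arrive in timestamp order: recent samples live in a small "hot" buffer, and anything older than the window is rolled into a "cold" list that stands in for batch storage. The class name `TieredStore` and the 60-second window are illustrative choices only.

```python
from collections import deque

class TieredStore:
    """Keep recent samples in a hot buffer and age older ones into cold storage."""

    def __init__(self, hot_window_seconds):
        self.hot_window = hot_window_seconds
        self.hot = deque()  # (timestamp, value) pairs, oldest first
        self.cold = []      # stand-in for a batch/historical store

    def add(self, timestamp, value):
        """Add a sample, then move anything older than the window to cold."""
        self.hot.append((timestamp, value))
        self._roll_over(timestamp)

    def _roll_over(self, now):
        while self.hot and now - self.hot[0][0] > self.hot_window:
            self.cold.append(self.hot.popleft())

store = TieredStore(hot_window_seconds=60)
for t, v in [(0, 10.0), (30, 12.0), (70, 11.0), (100, 15.0)]:
    store.add(t, v)
```

In a real deployment the hot tier would typically be an in-memory or streaming store and the cold tier a warehouse or object store; the point here is only that the two paths are kept apart and data flows one way between them.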
Tip 4: Feature engineering is as important as model training
Feature engineering is the work of bringing more structure to the data, for example by attaching labels, classifying or grouping records along different dimensions, or applying Principal Component Analysis (PCA). Although feature engineering does not directly generate predictions, the structure it brings to the dataset dramatically affects how efficient model training can be, how effective the trained model is, and how quickly the model can iterate.
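To make this concrete, the sketch below turns raw alert events into a few structured features: a group label looked up from an asset inventory, the hour of day, a weekend flag, and a per-device alert count. The `INVENTORY` mapping, the field names, and the chosen features are hypothetical examples rather than a recommended feature set.

```python
from collections import Counter
from datetime import datetime

# Hypothetical asset inventory mapping devices to a group label.
INVENTORY = {"sw-1": "network", "db-1": "database"}

def build_features(events):
    """Derive simple, structured features from raw alert events."""
    per_device = Counter(e["device"] for e in events)
    rows = []
    for e in events:
        ts = datetime.fromisoformat(e["timestamp"])
        rows.append({
            "device_group": INVENTORY.get(e["device"], "unknown"),  # label from inventory
            "hour_of_day": ts.hour,                                  # time dimension
            "is_weekend": ts.weekday() >= 5,                         # Saturday or Sunday
            "device_alert_count": per_device[e["device"]],           # grouping by device
        })
    return rows

events = [
    {"device": "sw-1", "timestamp": "2024-01-06T02:15:00"},
    {"device": "sw-1", "timestamp": "2024-01-06T02:20:00"},
    {"device": "db-1", "timestamp": "2024-01-08T14:00:00"},
]
features = build_features(events)
```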
Tip 5: Human experience still matters, but needs to be codified
ML and AI are very powerful, but they are not positioned to displace human intelligence. Rather, they are positioned to augment it. The IT industry has accumulated a large body of best practices that have saved us many times over the years. Once codified, this human knowledge becomes a valuable asset for ML and AI models to learn from and amplify.
Tip 6: Start with fewer factors and gradually increase the complexity
Many advanced machine learning models can take a large number of factors and build very complicated models. With this “superpower” at hand, people often dump as much data as is available into the machines and leave them to handle the deluge. However, more data does not always mean better outcomes. Feeding data into models without discipline tends to produce convoluted signals. We suggest beginning with simple models and limiting the factors to those identified by PCA as most important. Simple models can reveal the major trends in monitoring data and are easy for people to comprehend. With the insight gained from a small set of factors, the models can then be enhanced by adding more factors or by chaining them with another model for more advanced analytics.
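One way to start small is to rank candidate factors and keep only the top few. The sketch below deliberately uses a plain Pearson correlation ranking as a simpler, dependency-free stand-in for PCA-based selection; the factor names and sample numbers are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def top_factors(factors, target, k):
    """Return the k factor names most strongly correlated with the target."""
    ranked = sorted(
        factors,
        key=lambda name: abs(pearson(factors[name], target)),
        reverse=True,
    )
    return ranked[:k]

# Illustrative data only.
factors = {
    "cpu_load": [10, 20, 30, 40, 50],
    "fan_speed": [5, 5, 5, 5, 5],      # constant, carries no signal
    "queue_depth": [1, 3, 2, 5, 4],
}
incidents = [0, 1, 1, 2, 3]
selected = top_factors(factors, incidents, k=2)
```

Starting from `selected`, more factors can be added back one at a time while checking whether each one actually improves the model.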
Tip 7: Don’t rely on one model; run multiple models in parallel
No single model is omnipotent. Some models are good at simplifying information, while others are good at enriching it. Different models can draw different insights from the same set of data, so training and deploying multiple models helps provide a 360-degree view of the data. Don’t build one giant model with all factors. Instead, build a forest of small models; collectively they can be more powerful and easier to manage.
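The idea of a forest of small models can be sketched with three tiny detectors that each look at the same metric from a different angle and then vote. The detectors, their default limits, and the majority rule are assumptions chosen for this example, not a prescribed design.

```python
from statistics import mean, pstdev

def threshold_model(history, value, limit=90.0):
    """Flag values above a fixed limit."""
    return value > limit

def zscore_model(history, value, z_limit=3.0):
    """Flag values far from the historical mean, measured in standard deviations."""
    sigma = pstdev(history)
    if sigma == 0:
        return value != mean(history)
    return abs(value - mean(history)) / sigma > z_limit

def jump_model(history, value, max_step=20.0):
    """Flag a sudden jump from the most recent sample."""
    return abs(value - history[-1]) > max_step

MODELS = {"threshold": threshold_model, "zscore": zscore_model, "jump": jump_model}

def evaluate(history, value):
    """Run every model and report each vote plus a simple majority verdict."""
    votes = {name: model(history, value) for name, model in MODELS.items()}
    return votes, sum(votes.values()) >= 2

history = [40.0, 42.0, 41.0, 43.0, 42.0]
votes, is_anomaly = evaluate(history, 75.0)
```

Here the fixed threshold sees nothing wrong with 75.0, but the z-score and jump detectors both flag it, so the combined verdict is an anomaly. Keeping the per-model votes also helps operators see why a value was flagged.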
Tip 8: Prediction is important, and so is explanation
One inherent drawback of machine learning models is that it is hard to explain the cause-and-effect relationships they find in the data. However, understanding the root cause of alerts and incidents is crucial for IT operations, and this is where human intelligence comes to the rescue. Operator comments and records of past interventions can be fed into model iterations to make the models more and more explainable.
Tip 9: Don’t build a black box; make the tool interactive
Related to the topic of explanation, it is better to build a process that can request human intervention when needed and then continue along the adjusted path. Experienced IT operators can pick up early signals while problems are still looming and suggest the best shortcut if one is available. Machine analytics on large amounts of data, augmented by human intuition, can be phenomenal, and tools should be built to accommodate this combination.
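A minimal human-in-the-loop sketch, assuming the model emits a score between 0 and 1: confident scores are handled automatically, while uncertain ones are routed to an operator, and every decision is logged so it can feed later model iterations. The `triage` function, the 0.3/0.7 cut-offs, and the `demo_operator` stub are hypothetical.

```python
def triage(alert_id, score, ask_operator, feedback_log, low=0.3, high=0.7):
    """Decide automatically when the model is confident; otherwise ask a human."""
    if score >= high:
        decision, source = "escalate", "model"
    elif score <= low:
        decision, source = "suppress", "model"
    else:
        # Uncertain band: defer to the operator's judgment.
        decision, source = ask_operator(alert_id, score), "operator"
    # Record every decision so operator input can be reused as training labels.
    feedback_log.append(
        {"alert_id": alert_id, "score": score, "decision": decision, "source": source}
    )
    return decision

def demo_operator(alert_id, score):
    """Stand-in for a real prompt or ticketing integration."""
    return "escalate"

feedback_log = []
decisions = [
    triage(alert_id, score, demo_operator, feedback_log)
    for alert_id, score in [("a1", 0.95), ("a2", 0.10), ("a3", 0.50)]
]
operator_labels = [entry for entry in feedback_log if entry["source"] == "operator"]
```

Only the middle-scoring alert reaches the operator in this run, which keeps human attention on the cases where it adds the most value.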
Tip 10: Data-driven mindset is as important as data
Last but not least, having a data-driven mindset in an organization is critical to the success of an AIOps deployment. Building a disciplined process around data from data generation, to storage, to refinement, to recycling will ultimately guarantee the success and continuous improvement of AIOps.
These are tips Netreo has gained from over two decades of experience in IT monitoring and AIOps implementation. AIOps has become a foundational technology for IT monitoring, and we will continue exploring this area and periodically share our findings with you.