Artificial Intelligence (AI) is all the rage these days. Everywhere you look, companies are promising to cure your ills by applying AI to whatever problem you face. It doesn’t seem to matter what field you are in: medical, research, education, technology, software or anything else. Someone, somewhere is offering an AI-based tool that will solve all your problems.
Examples abound, with organizations offering tools to help you write code from natural language, hire the right person for a specific role, ensure your new tenants are reliable and will pay their rent on time, or enable the police to identify people with outstanding warrants.
Information Technology (IT) is no different. The term AIOps is commonly used when referring to the application of AI/ML (Machine Learning) to IT Operations. In fact, frequent visitors to the Netreo Blog already know about AIOps from a previous post discussing what it is.
So, if we already discussed AIOps, why another blog post on the topic? In truth, this post is not really specific to AIOps; it is about the more general concept of AI and just how effective AI really is, or can be.
Artificial Intelligence: One Ring To Rule Them All!
We know that AI is promised as the solution to all of our ills these days. But is there any basis for this claim, and can the technologies actually deliver? Based on a new study, The Fallacy of AI Functionality, the answer is a resounding “probably not today.” It’s not that AI/ML is insufficient for what we’re doing with it today. It’s more that AI/ML is in its infancy, yet we’re trying to use it like it has reached a significant level of maturity.
The study identifies a selection of examples where the application of AI went badly wrong. These range from the mild (a student who almost had their post-secondary education acceptance revoked) to the severe: people going to jail, filing for bankruptcy or losing access to healthcare. All of these issues stem from problems with AI algorithms and their implementation.
The trouble with algorithms and AI implementations
Of course, the problem is not solely the algorithms (as the authors are quick to point out). The authors grouped the issues with AI into four categories:
Impossible tasks
Engineering failures
Post-deployment failures
Communication failures
Each of these four categories includes a further breakdown of the different types of failures that can occur. For example, the impossible tasks category identifies two types: conceptually impossible and practically impossible. Keep in mind, no one is saying it is impossible to write an algorithm that will attempt such a task. However, there is no way such an algorithm could be accurate, due to other existing issues. From the article (pages 6 and 7):
Many predictive policing tools are arguably practically impossible AI systems. Predictive policing attempts to predict crime at either the granularity of location or at an individual level. The data that would be required to do the task properly—accurate data about when and where crimes occur—does not and will never exist. While crime is a concept with a fairly fixed definition, it is practically impossible to predict because of structural problems in its collection. The problems with crime data are well-documented—whether in differential victim crime reporting rates, selection bias based on policing activities [54, 120], dirty data from periods of recorded unlawful policing, and more.
What About AIOps?
AIOps is, for good or for bad, not that different from any other use case of AI. This means if you are expecting amazing results after implementing an AIOps solution, you are likely going to be disappointed. The reasons for disappointment are numerous, but some things to consider include (based on the taxonomy from the study above):
The first category, impossible tasks, is one AIOps can absolutely fall into. The authors break it into two subcategories: conceptually impossible and practically impossible.
Conceptually, there is no real issue. Infrastructure, networks, applications and so on all generate huge volumes of data, and that data is exactly what is needed to determine what a problem is. After all, that is what people do all the time when troubleshooting.
Practical impossibility is a more interesting conversation. As I mentioned, there is plenty of data available, so that part is done. Unfortunately, the process of looking at the data and making informed decisions from it is much more difficult. There is no practical way to model and understand every possible scenario, combination of devices, environment and so on. Just because an AI model works in a lab or a specific environment doesn’t mean it will work in YOUR environment.
The second category is engineering failures, which the authors break into the subcategories of design, implementation and safety failures. And again, AIOps can suffer from all of these.
Designing an algorithm for AIOps is difficult; if it were easy, we’d all be retired and living a life of luxury. Mistakes can and will happen. Whether the mistake is a lack of understanding of the problem, a rush to get a product to market, a simple typo or dozens of other possibilities is irrelevant. What is relevant is that if there is a design failure, the tool will not work as well as it should (if it works at all).
Next, we have the implementation of the model: taking the base model (which may or may not be well designed) and actually making it useful. There are, once again, plenty of places for things to go wrong. Is the system getting the data it needs, in a format it can understand, in sufficient quantities to make sound decisions? Was the model built for and trained on data relevant to your specific environment, or is it a generic model?
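As a sketch of what those questions can look like in practice, an AIOps pipeline might sanity-check incoming telemetry before letting a model make decisions from it. The field names and the minimum sample count below are illustrative assumptions, not any vendor’s actual schema:

```python
# Hypothetical sanity checks an AIOps pipeline might run before
# feeding telemetry to a model. Field names and thresholds are
# illustrative assumptions, not any product's real schema.

REQUIRED_FIELDS = {"timestamp", "device_id", "metric", "value"}
MIN_SAMPLES = 1000  # below this, model decisions are unreliable

def validate_telemetry(records):
    """Return (ok, reasons) for a batch of telemetry dicts."""
    reasons = []
    if len(records) < MIN_SAMPLES:
        reasons.append(f"only {len(records)} samples; need {MIN_SAMPLES}")
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            reasons.append(f"record {i} missing fields: {sorted(missing)}")
            break  # one bad record is enough to reject the batch
    return (not reasons, reasons)
```

Checks like these do not make a poorly designed model good, but they do keep an otherwise reasonable model from silently making decisions on starved or malformed input.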
Because automation is a critical part of AIOps (after all, once a system can identify a problem and come up with a solution, why would I want to implement that solution manually?), there is the potential for significant damage if appropriate safeguards are not in place. If a production switch needs to be rebooted due to a non-critical error condition, but that switch is the primary switch for the entire data center, should it be rebooted the moment the issue is detected? Perhaps, but if the issue is not impacting production, bringing down the entire environment for a non-critical issue is probably inappropriate.
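The switch example above can be sketched as a simple safety gate in front of any automated remediation. The device and issue attributes here are hypothetical labels I am inventing for illustration; real products would model blast radius far more richly:

```python
# A minimal sketch of a safety gate for automated remediation.
# The "critical" and "impacting_production" attributes are
# hypothetical labels, not from any specific AIOps product.

def should_auto_remediate(device, issue):
    """Only allow automatic action when the blast radius is small."""
    if device.get("critical"):
        # Critical infrastructure (e.g. the data center's primary
        # switch) always goes to a human for approval first.
        return False
    if not issue.get("impacting_production"):
        # A non-impacting issue can wait for a maintenance window
        # rather than trigger a disruptive reboot now.
        return False
    return True
```

The point is not the specific rules but that the safety check exists at all, sitting between the model’s recommendation and the action it triggers.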
The third category, post-deployment failures, once again has three subcategories (robustness issues, failure under adversarial attack and unanticipated interactions), and once again AIOps can fall into traps with each of them.
Robustness may or may not be a problem. I mentioned previously that one challenge with AIOps is that just because a model was trained on IT data does not mean that data is relevant or sufficient. Taking a model trained at a 200-person software company and trying to make it work at a 20,000-person manufacturing company could lead to challenges.
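One very rough way to catch this kind of mismatch is to compare production data against the statistics of the data the model was trained on. This is a generic sketch of that idea, with an arbitrary drift threshold of my own choosing, not a technique from the study or from any product:

```python
# A rough sketch of a robustness check: flag production metric
# windows whose mean drifts far from the training distribution.
# The z-score threshold is an illustrative assumption.

from statistics import mean, stdev

def looks_like_training_data(train_values, prod_values, max_z=3.0):
    """Return True if production data resembles the training data."""
    mu, sigma = mean(train_values), stdev(train_values)
    if sigma == 0:
        return mean(prod_values) == mu
    # How many training standard deviations away is production?
    z = abs(mean(prod_values) - mu) / sigma
    return z <= max_z
```

A model whose inputs routinely fail a check like this is being asked to generalize to an environment it never saw, which is exactly the 200-person-to-20,000-person problem above.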
Security and IT go together like a horse and carriage: both are useful on their own, but so much more useful together. While we have not seen (or at least are not aware of) attackers specifically targeting AIOps solutions, we expect that will happen at some point. If an attacker is trying to exfiltrate data from a company that uses AIOps to identify issues (whether security, performance or something else), attacking the AIOps solution to hide those activities may make sense. How many AIOps tools are built to defend themselves against this sort of attack?
I’ve mentioned this a few times and it remains relevant: every environment is different. An AIOps tool designed for one environment may work great in a second yet fail in a third. One reason for this is unanticipated interactions: devices interacting in unexpected ways, people doing things differently than expected, and so on.
Finally, we have perhaps the most basic type of failure of all: communication failures. When it comes to AIOps, we can make this a bit narrower, focusing on marketing and sales failures. The authors have broken this into two subcategories (they have a good thing going with these subcategories, so we may as well stick with them to the end).
Those subcategories are falsified or overstated capabilities and misrepresented capabilities. These types of failures are pretty straightforward: the vendor says its solution can do something it cannot. Whether that is intentional (flat-out lying) or unintentional (misunderstanding the question, stretching the truth a bit) is irrelevant. The fact remains that there is a failure, and it will negatively impact the operation of the solution.
So What Do I Do?
My takeaway from everything I just discussed is that you shouldn’t be afraid of AI/ML, but instead you should focus on using it in the places that make sense. Do not believe vendor hype that their AIOps solution will be all things to all people and make your life perfect. Eventually that may happen (though I expect I will be long retired and playing with my grandkids before it does), but we are not there yet. Do the following:
Look for narrow areas where it makes sense to leverage AI/ML
The narrower the use case, the more likely you are to succeed (and to avoid the challenges listed above). Netreo, for example, focuses on leveraging machine learning to help tune the system. While this is a fairly narrow area, it helps us avoid issues such as engineering and deployment failures (since we are not trying to do too much), communication failures (I just told you what we are doing with ML and that it is narrow), and the impossibility of the task (we aren’t telling you what is or is not a problem; instead we make suggestions to help you tune your system more effectively).
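To illustrate what suggestion-only, narrow ML can look like (this is a generic sketch I am inventing for illustration, not Netreo’s actual implementation), consider deriving a suggested alert threshold from a metric’s own history instead of shipping a hard-coded default:

```python
# A generic sketch of narrow, suggestion-only tuning: propose an
# alert threshold from a metric's historical distribution. This is
# an illustrative example, not any vendor's actual code.

def suggest_threshold(history, percentile=99):
    """Suggest an alert threshold at a high percentile of history."""
    ordered = sorted(history)
    # Index of the requested percentile, clamped to the last element.
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return ordered[idx]
```

Note that the tool only suggests a number; a human still decides whether to adopt it, which keeps the use case narrow and the failure modes contained.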
Ensure the solution you are looking at is a good match for your organization
If you are a massive government agency, choosing an AIOps solution that was designed for small retail outlets is probably not a great idea. Look for solutions that are designed for your specific area (both vertically and horizontally) and that have a proven track record of success in organizations similar to yours.
Manage your expectations
No matter what the vendor and press releases say about the benefits of a given solution, plan to start small and work up from there. Start with a small portion of your organization and use the solution for a narrow set of problems. As you gain experience and comfort with the tools, you can expand their remit over time. And if things do not work out, you have not spent a huge amount of time and money on a solution that does not fit your organization.