AIOps Tools – Top 5 Questions to Consider
This article shares how AIOps tools work and what to look for when choosing a solution.
Artificial Intelligence for IT Operations, or AIOps, is a term coined by Gartner to describe a class of technology that uses Machine Learning and Big Data to enhance IT operations. It works by ingesting data points from devices throughout your network and analyzing them in real time with Machine Learning models that have been trained to look for specific use cases.
AIOps can correlate information between your devices to find issues on your network and make predictions about potential issues before they occur.
This article explores the components that make up the system, practical examples of how AIOps is being used today, and the top five questions you should ask yourself when comparing AIOps tools.
Why do we need AIOps tools?
Anyone who has worked in IT knows how difficult and time-consuming triaging and finding issues can be. User complaints are often vague and require a long series of troubleshooting steps that vary according to the use case. Each network brings its own unique series of questions that need to be chased down and answered by operations. These triaging steps are often slow because they feel more like a process of elimination than a scientific method. Ideally, IT would have the answers to these questions in real time so it can make informed decisions about where the problem is and what to do next. This applies not only to the use case in question but also to other issues commonly seen in the field.
While current IT monitoring tools provide notifications of obvious issues, they lack the intelligence and contextual awareness required to diagnose complex problems. AIOps tools address this gap by using Artificial Intelligence. Event and telemetry data from devices throughout your network are run through many different scenarios and models to provide the answers to these triaging steps in real time.
There are several different methods by which a system can learn and make decisions. In AIOps, the learning method is typically the supervised model of Machine Learning, which means that the system must be trained on what to look for and how to score various data points that it receives.
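To make the idea of supervised training concrete, here is a minimal sketch in plain Python. It is not how any particular AIOps product works: the nearest-centroid technique, the two features (CPU percentage and interface errors per minute), and the labels are all assumptions chosen for illustration. The point is only that the system learns from labeled examples and then scores new data points against what it learned.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Compute one centroid (average feature vector) per label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new data point."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled telemetry: (cpu_percent, interface_errors_per_min)
labeled = [
    ((20.0, 0.0), "healthy"),
    ((35.0, 1.0), "healthy"),
    ((90.0, 40.0), "issue"),
    ((85.0, 55.0), "issue"),
]

model = train(labeled)
print(predict(model, (88.0, 47.0)))  # falls closest to the "issue" centroid
```

A real system would use far more features and far more labeled examples, but the dependency is the same: the quality of the predictions follows the quality of the training data.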
In IT operations, there are countless use cases that we must train the system on so that it can effectively find and remediate the issues it encounters as it works through its models. While Machine Learning is at the core of the intelligence in AIOps, we still need to ingest and normalize thousands of different data points throughout our network, and that takes us to the second major part of our system, which is Big Data.
A typical network consists of many data points that can come in the form of Syslog, Netflow, config changes, and other telemetry types. This results in a massive amount of data that must be ingested, stored, and retrieved. At its core, this is what most AIOps tools are looking to achieve – turn hundreds of thousands of data points into information that can be used to make decisions in real-time. The hope is that this would lead to not just finding and remediating issues quicker, but also finding issues before they are even noticed by the end-user.
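As a rough illustration of what normalizing different telemetry types might look like, the sketch below maps two kinds of input onto one common record. Everything here is an assumption for the example: the `Event` fields, the simplified "device severity message" Syslog layout, and the keys of the flow record are hypothetical and do not reflect any vendor's schema or the actual Syslog or NetFlow wire formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical common schema shared by all telemetry types."""
    device: str
    source: str  # where the data came from, e.g. "syslog" or "netflow"
    kind: str
    detail: str


def from_syslog(line: str) -> Event:
    # Assumed simplified layout: "<device> <severity> <message>"
    device, severity, message = line.split(" ", 2)
    return Event(device, "syslog", severity.lower(), message)


def from_flow(record: dict) -> Event:
    # Assumed flow record with "exporter", "src", "dst" and "bytes" keys
    detail = f"{record['src']} -> {record['dst']} ({record['bytes']} bytes)"
    return Event(record["exporter"], "netflow", "flow", detail)


events = [
    from_syslog("core-sw1 ERROR Interface Gi1/0/1 changed state to down"),
    from_flow({"exporter": "edge-rtr1", "src": "10.0.0.5",
               "dst": "10.0.9.20", "bytes": 48213}),
]

for event in events:
    print(event)
```

Once every data point shares one shape, it can be stored, indexed, and queried the same way regardless of which device or protocol produced it.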
How do AIOps tools work?
Gartner defines five major functions of AIOps tools that we should review:
The Big Data component of the system ingests, indexes, and normalizes events from devices throughout the network, across multiple device types and vendors. These events can be as simple as config changes, Syslog messages, SNMP alerts, Netflow, and other types of telemetry data. As you evaluate AIOps tools, this step is critical: make sure your devices can be integrated with and supported by the system. The more data points supported for your device, the better. A system that supports Syslog, SNMP, and Netflow is going to have better context than a system that only supports Syslog for that device.
Gartner also calls out two capabilities that the ingestion function must provide: real-time analysis and historical analysis of stored data. Both are critical to the other components of AIOps tools.
The topology function relates to the discovery and mapping of IT assets, including hardware and software, in the environment. This goes beyond just knowing about a device and extends to building relationships between devices. A human who begins to troubleshoot a potential issue does the same thing. A topology view helps establish the context of the issue between the end user and the resource they are trying to access.
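One simple way to picture a topology view is as a graph of devices and links, where the context of an issue is the path between the end user and the resource. The sketch below uses a breadth-first search over a hypothetical set of links. The device names and the `find_path` helper are made up for illustration, and real discovery would come from the tool itself rather than a hard-coded list.

```python
from collections import deque


def find_path(links, start, goal):
    """Breadth-first search for the shortest device path, or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical discovered links between assets
links = [
    ("laptop-sales-01", "access-sw1"),
    ("access-sw1", "core-sw1"),
    ("core-sw1", "fw1"),
    ("fw1", "invoice-srv"),
    ("core-sw1", "edge-rtr1"),
]

path = find_path(links, "laptop-sales-01", "invoice-srv")
print(" -> ".join(path))
```

Every device on the returned path is a candidate for investigation, which is the same shortlist a human troubleshooter would build by hand.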
With data coming in and relationships between devices established, the next function is to correlate the telemetry data between devices. That means understanding how the various assets on your network relate to one another and to the network at large. For example, a business-critical application used by third-party contractors might have no relationship with an end user who works in sales. That user's access to invoices, however, may be critical. For that business group, correlating information between all systems involved in that specific flow is crucial. This includes endpoints, switches along the path, routers, firewalls, and servers, any of which could be problematic and require further investigation. At this point we have our data, understand the relationships between the devices in our network, and know how data and devices are involved in the given use case.
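A very reduced picture of correlation is to keep only the events that come from devices on the relevant path and fall within the time window of the complaint. The sketch below does exactly that. The event dictionaries, the integer timestamps, and the `correlate` helper are assumptions for illustration, and real products apply much richer logic than a path-and-time filter.

```python
def correlate(events, path_devices, window_start, window_end):
    """Return events from devices on the path, inside the window, oldest first."""
    on_path = set(path_devices)
    matching = [
        event for event in events
        if event["device"] in on_path
        and window_start <= event["ts"] <= window_end
    ]
    return sorted(matching, key=lambda event: event["ts"])


# Hypothetical path between a sales laptop and the invoice server
path_devices = ["laptop-sales-01", "access-sw1", "core-sw1", "fw1", "invoice-srv"]

# Hypothetical normalized events; "ts" is seconds since some reference point
events = [
    {"ts": 130, "device": "fw1", "msg": "session table 95% full"},
    {"ts": 100, "device": "core-sw1", "msg": "interface flap"},
    {"ts": 120, "device": "edge-rtr1", "msg": "BGP neighbor reset"},  # off path
    {"ts": 900, "device": "invoice-srv", "msg": "disk warning"},      # too late
]

related = correlate(events, path_devices, window_start=60, window_end=300)
for event in related:
    print(event["ts"], event["device"], event["msg"])
```

The two surviving events both sit on the user's path to the invoice server, so they are presented together instead of as unrelated alerts.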
The next function is what Gartner calls recognition. This is where issues are detected or predicted based on the Machine Learning training that has been given to the system. This will undoubtedly be the most vital component of most AIOps tools, and it is this intelligence that varies from vendor to vendor. In a supervised learning model, the system is trained to look out for specific use cases. That means an AIOps tool is only as good as the models it has been trained on. Because each vendor has its own models and trained use cases, products can differ considerably. As this field evolves, you will notice that some platforms are cloud heavy while others focus on more traditional network routing. This is where you need a firm understanding of what you are looking to achieve with AIOps tools, and a proof of concept to make sure the tool works in your environment.
Ultimately, the recognition function, just like any other Machine Learning model, is about making a decision based on the data and training it has received.
Based on the decision or prediction that the system made in the previous step, the last phase is to actually do something with that information. This is where the remediation phase either makes a recommendation based on the situation or automates a response through an external system. It is unlikely that most customers will opt for full automation in the beginning. A more likely scenario is that a trigger alerts the network team to the findings so that a human can make the final decision on the appropriate next steps.
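The choice between recommending and automating can be sketched as a simple policy check. In the hypothetical example below, an action is automated only if it is on a pre-approved list and the confidence of the finding clears a threshold. Everything else is routed to a human. The field names, the threshold of 0.9, and the action names are assumptions, not part of any Gartner definition or vendor API.

```python
def remediate(finding, auto_approved, min_confidence=0.9):
    """Decide whether to automate a response or recommend it to a human."""
    action = finding["action"]
    if action in auto_approved and finding["confidence"] >= min_confidence:
        return {"mode": "automate", "action": action}
    return {"mode": "recommend", "action": action, "notify": "network-team"}


# Hypothetical list of low-risk actions the team allows without review
auto_approved = {"clear_arp_cache"}

findings = [
    {"action": "clear_arp_cache", "confidence": 0.97},
    {"action": "clear_arp_cache", "confidence": 0.60},
    {"action": "reload_firewall", "confidence": 0.99},
]

decisions = [remediate(finding, auto_approved) for finding in findings]
for decision in decisions:
    print(decision)
```

Starting with an empty approved list gives the recommend-only behavior described above, and the list can grow as the team gains trust in the system.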
AIOps is not a “set it and forget it” model, but a system that relies heavily on learning and improving over time. This means there is potential for many false positives in the beginning and, hopefully, more accurate decisions and predictions over time as the system continues to learn through the supervised learning model.
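One way to check whether the system is actually improving is to track the share of alerts that operators confirm as real issues, often called precision. The sketch below assumes each alert has been reviewed and marked true (real issue) or false (false positive). The weekly figures are invented purely to show the calculation.

```python
def precision(reviews):
    """Share of raised alerts that operators confirmed as real issues."""
    if not reviews:
        return None  # nothing was raised, so there is nothing to measure
    confirmed = sum(1 for is_real in reviews if is_real)
    return confirmed / len(reviews)


# Hypothetical operator reviews: True = real issue, False = false positive
week_1 = [True, False, False, False]
week_8 = [True, True, True, False]

print(precision(week_1))
print(precision(week_8))
```

If this number does not rise over time, that is a signal to revisit the training data or the use cases the tool has been trained on.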
Questions to consider when comparing AIOps tools
The AIOps market is still incredibly new and most AIOps tools are just now beginning to scratch the surface of what’s possible with artificial intelligence. Over time, we should expect to see increased use cases covered as AIOps tools introduce new training into their Machine Learning models. As you consider the landscape of potential solutions, here are some questions you may want to consider:
- Does the solution ingest data from all of my major IT assets? If so, what are the event types that are being supported?
- How well does the solution integrate with my current processes? In other words, how can this speed up my current triaging steps instead of it being another tool that IT uses when troubleshooting?
- What is my use case?
- What (or how many) use cases are covered by the solution's machine learning training?
- What kind of predictions can be expected?
With the right AIOps tool, you can increase IT efficiency, reduce cost, and truly transform the customer experience.
Gartner states in its 2021 Market Guide for AIOps Platforms that “AIOps platform adoption is growing rapidly across enterprises” and “there is no future of IT operations that does not include AIOps.” To help you better understand AIOps tools, download your complimentary copy of the Gartner report here.