Generative AI is opening up a wide range of possibilities. Industry verticals of every kind are harnessing the power of GenAI for creative content generation, efficiency improvement, and personalization of experiences. IT operations are no different. Autonomous IT operations (AIOps) apply artificial intelligence and automation to streamline and optimize IT operations. The AIOps industry is only scratching the surface of what GenAI makes possible, but early explorations show a promising opening for reimagining several aspects of closed-loop autonomous IT operations.
Elements of Autonomous IT Operations
Closed-loop autonomous IT operations are driven by five key aspects.
- Learn context: Capture factual and situational knowledge and create a blueprint of IT operations that connects businesses to applications and infrastructure.
- Manage alerts: Generate only the right alerts at the right time by suppressing false alerts, aggregating related alerts, prioritizing alerts, and predicting alerts.
- Handle incidents: Diagnose the root-cause and take corrective actions to auto-resolve incidents.
- Perform actions: Automate the actions required for various lifecycle operations.
- Optimize proactively: Plan for growth & change and identify opportunities for continuous improvement.
Human-in-the-loop
Analytics and automation are transforming these five aspects of autonomous IT operations. However, each of these aspects needs human intervention from time to time to leverage the knowledge of a subject matter expert and the instinct and intuition of an experienced practitioner.
- Creation of context requires human expertise to define the domain knowledge models.
- While AI/ML solutions can mine rules to suppress false alerts and group related alerts, managing alerts still requires human intervention to review and approve these rules before deploying them in production.
- Incident handling requires human experts to address resolution of exceptions and never-before-seen scenarios.
- Automation of actions requires domain experts to write the code to perform last-mile atomic actions. These actions are then chained together using intelligent automation.
- Proactive optimization often requires subject matter expertise to review analytical insights and translate them into actionable recommendations.
Changing the Human-in-the-loop Experience with GenAI
Generative AI can transform the human-in-the-loop experience of the first mile and the last mile of autonomous IT operations.
The first mile of autonomous IT operations relies on comprehensive knowledge for modeling IT operations. GenAI can transform this first mile process by creating a knowledge accelerator to capture enterprise context and generate automation scripts for various operations such as resource provisioning, service configurations, and patch management. This allows us to easily adapt to technological changes and accelerate the automation of lifecycle-specific service operations.
The last mile of autonomous IT operations requires human involvement to validate actions, provide guidance in case of exceptions, and consume insights for continuous improvement. GenAI can transform this last mile by creating an intelligent assistant that drives intelligent conversations, leveraging GenAI’s ability to understand language, capture user context, and learn from feedback. As a result, analytics insights can be consumed in a much simpler and more intuitive way, leading to higher AI adoption, faster incident resolution, and proactive problem management.
IT Operations Knowledge Accelerator
Autonomous IT operations rely on the availability of three types of knowledge – factual knowledge, situational knowledge, and operational knowledge. Let’s look at what this knowledge is and how GenAI can help accelerate its acquisition.
Factual Knowledge: This refers to knowledge about the facts of a technology, a process, or a domain of enterprise IT. Take the Autosys job scheduling technology, for example. Its factual knowledge consists of the following aspects.
- Understanding the associated entities such as streams, jobs, files, and feeds.
- Understanding attributes associated with these entities such as execution schedules, execution time constraints, and SLAs.
- Understanding the relationships between these entities. For example, a stream consisting of other sub-streams and jobs, where jobs are related to each other through precedence relationships.
- Understanding the data sources from which the above information can be fetched, parsed, and disambiguated. For instance, much of the above Autosys information can be fetched from Autosys JIL files. Factual knowledge includes knowing these data sources, the commands to fetch them, and how to parse the resulting data.
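To make the JIL example concrete, here is a minimal sketch of a parser that extracts jobs and their precedence relationships from JIL text. The attribute names `insert_job`, `command`, and `condition` are standard JIL; the sample jobs and script paths are invented, and a production parser would handle many more attributes (this sketch ignores, for instance, the `job_type` on the `insert_job` line).

```python
import re

def parse_jil(jil_text):
    """Parse Autosys JIL text into job definitions and precedence edges."""
    jobs = {}
    current = None
    for line in jil_text.splitlines():
        line = line.strip()
        if not line or line.startswith("/*"):
            continue  # skip blanks and comments
        if line.startswith("insert_job:"):
            # e.g. "insert_job: load_feed   job_type: cmd" -> keep only the name
            name = line.split(":", 1)[1].split()[0]
            current = {"name": name, "attrs": {}, "upstream": []}
            jobs[name] = current
        elif current and ":" in line:
            key, value = line.split(":", 1)
            current["attrs"][key.strip()] = value.strip()
            if key.strip() == "condition":
                # precedence edges, e.g. "success(extract) & success(validate)"
                current["upstream"] = re.findall(r"\w+\((\w+)\)", value)
    return jobs

sample = """
insert_job: extract   job_type: cmd
command: /opt/etl/extract.sh

insert_job: load_feed   job_type: cmd
command: /opt/etl/load.sh
condition: success(extract)
"""
jobs = parse_jil(sample)
```

A GenAI knowledge accelerator would generate parsers of this shape on demand for each technology, rather than relying on experts to hand-write them.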
Today, there is a heavy dependency on subject matter experts to provide this knowledge. Queries and parsers are in place to extract this information; however, there is still a reliance on expert knowledge for the what, where, and how of this knowledge.
A GenAI-powered knowledge accelerator can significantly simplify this process. It can be used to understand technology models and the associated entities, relationships, and attributes and automatically create the metamodel of the technology. It can capture the information of data sources to collect this information. It can also develop scripts to automatically fetch and parse this information from a live instance.
Situational Knowledge: While factual knowledge captures universal facts, situational knowledge captures instance-specific information about the environment. This can include the knowledge of business functions and their structure, the business-criticality of functions, the users, the business models, and the service level agreements, among others. Because this knowledge is custom and lacks any universal reference, there is today a complete reliance on line-of-business heads, architects, and operations managers to provide such information, which often makes the process time-consuming, inefficient, and error-prone.
Retrieval-Augmented Generation (RAG) can be used to accelerate this process. RAG builds knowledge repositories from an organization’s own data, which can be continuously updated to capture the most recent information. These repositories, together with an LLM, can be used for relevant text retrieval and generation. The resulting solution provides a very effective vehicle for extracting situational knowledge from an organization’s various data sources, including structured databases, blogs, news feeds, incident resolution notes, user guides, product collateral, and case studies.
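The retrieve-then-generate loop can be sketched in a few lines. In this toy version, a bag-of-words vector with cosine similarity stands in for the neural embedding model and vector store a real RAG pipeline would use; the documents and query are invented examples of situational knowledge.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (real RAG uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9%]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, documents):
    """Augment the user query with retrieved organizational context for the LLM."""
    context = retrieve(query, documents)
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "The payments service is business-critical with a 99.99% availability SLA.",
    "The reporting batch runs nightly and tolerates a 4-hour delay.",
    "Office catering is handled by the facilities team.",
]
prompt = build_prompt("What is the SLA of the payments service?", docs)
```

The final prompt carries the organization-specific facts to the LLM, so the generated answer is grounded in the enterprise’s own situational knowledge rather than the model’s training data.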
Operational Knowledge: A vital requirement for a closed-loop autonomous solution is the ability to perform actions. This ability is realized using operational knowledge, which consists of the ability to perform last-mile actions on target machines. These atomic actions can then be chained together to enable various use cases such as event management, incident management, security and compliance management, problem management, and patch management. Today, this knowledge mostly exists in the form of code scripts developed by experts in a technology or domain. The scripts range from restarting a service or updating an operating system to installing a patch or resetting a password. Writing them is a time- and effort-consuming task, and updating them after technology refreshes and version changes is a recurring activity. GenAI can significantly simplify this process in several ways.
Code Generator: A GenAI-powered code generator can take user instructions in natural language and use LLMs to generate equivalent code. Creative prompt engineering plays a major role here. Prompts can be designed to specify the expected input and output formats, the style of programming, the level of exception handling, the structure of the code, and so on. Another common requirement is to customize the code to integrate into the wider ecosystem, which often translates to expecting input arguments in specific data structures and returning output and error messages in specific formats.
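One way to realize this is a prompt builder that turns the requirements above (input/output contracts, style rules, examples) into a structured instruction for the LLM. The section layout and the restart-a-service task below are illustrative assumptions, not a prescribed template.

```python
def build_codegen_prompt(task, language="Python", input_spec=None, output_spec=None,
                         style_rules=(), examples=()):
    """Assemble a structured code-generation prompt for an LLM.

    The headings and rules are illustrative; real deployments tune them
    per model and per target ecosystem.
    """
    parts = [f"Write a {language} function that does the following:", task]
    if input_spec:
        parts.append(f"Input: {input_spec}")
    if output_spec:
        parts.append(f"Output: {output_spec}")
    if style_rules:
        parts.append("Follow these rules:")
        parts.extend(f"- {rule}" for rule in style_rules)
    for sample_in, sample_out in examples:
        parts.append(f"Example input: {sample_in}\nExample output: {sample_out}")
    return "\n".join(parts)

prompt = build_codegen_prompt(
    task="Restart a systemd service and report its status.",
    input_spec="a JSON object {'service': <name>} read from stdin",
    output_spec="a JSON object {'status': 'ok'|'error', 'message': <str>} on stdout",
    style_rules=("wrap all subprocess calls in try/except",
                 "log errors to stderr in the team's standard format"),
)
```

Encoding the ecosystem’s input/output contract directly in the prompt is what lets the generated script drop into the wider automation chain without manual rework.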
Code Translator: Another common style of knowledge creation is translating code from one language to another. This is commonly required when an environment goes through a technology refresh in which the previous language is no longer supported, or when an application migrates to a more optimized language. Where the LLM is aware of both the source and target programming languages, the translation can be performed through straightforward prompts. However, complexities arise when the code involves platform- or OS-specific features; the LLM needs to be carefully guided to look for equivalent functionality in the target environment.
The task of code translation becomes even more challenging when one of the two languages is unknown to the LLM. Domain-specific languages (DSLs) form a very common case for this scenario: these languages are custom-designed for an organization, and no pretrained LLMs are available for them. One way to address this challenge is by fine-tuning an LLM with knowledge of the new language, which often requires a significant amount of data and resources. Another approach is in-context learning, where the LLM is provided with detailed instructions on how to translate code from one language to the other, detailing specific constructs and corner cases, along with examples.
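The in-context approach can be sketched as a few-shot prompt builder. Everything named here is hypothetical: `OpsDSL` is an invented in-house DSL, and the notes and example pairs merely illustrate how constructs and corner cases would be spelled out for the model.

```python
def build_translation_prompt(source_lang, target_lang, notes, examples, code):
    """Few-shot (in-context) prompt for translating a DSL the LLM was not trained on.

    `notes` documents DSL-specific constructs and corner cases; `examples`
    are (source, target) pairs demonstrating the mapping.
    """
    lines = [f"Translate {source_lang} code to {target_lang}.",
             "Language notes:"]
    lines += [f"- {note}" for note in notes]
    for src, tgt in examples:
        lines += [f"{source_lang}:", src, f"{target_lang}:", tgt]
    # end with the code to translate and an open target slot for the LLM
    lines += [f"{source_lang}:", code, f"{target_lang}:"]
    return "\n".join(lines)

prompt = build_translation_prompt(
    source_lang="OpsDSL",  # hypothetical in-house DSL
    target_lang="Python",
    notes=["RETRY n { ... } re-runs the block up to n times",
           "CHECK <svc> maps to health_check('<svc>')"],
    examples=[("CHECK payments", "health_check('payments')")],
    code="RETRY 3 { CHECK billing }",
)
```

The prompt deliberately ends at the open `Python:` slot, so the model’s completion is the translated code.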
Code Quality Assessment: The code generated by LLMs may not be entirely accurate or meet the user’s expectations, so it is important to include an element of code quality assurance. Various approaches can be used: an ensemble of LLMs can generate the same code through multiple models, and LLMs can also evaluate the accuracy of the generated code and assign a confidence score to different segments of it.
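One simple instance of the ensemble idea: generate the same script through several models (or sampled runs), and use cross-candidate agreement as a per-line confidence proxy. The candidate snippets and the 0.7 review threshold below are invented for illustration.

```python
from collections import Counter

def line_confidence(candidates):
    """Score each line of the most common candidate by cross-model agreement.

    `candidates` are code strings produced by different LLMs for the same
    request; the fraction of candidates containing a line is its confidence.
    """
    best = Counter(candidates).most_common(1)[0][0]
    scores = []
    for line in best.splitlines():
        votes = sum(line in c.splitlines() for c in candidates)
        scores.append((line, votes / len(candidates)))
    return scores

candidates = [
    "import os\nos.remove(path)",
    "import os\nos.remove(path)",
    "import os\nos.unlink(path)",
]
scores = line_confidence(candidates)
# lines below the (illustrative) 0.7 threshold are flagged for human review
flagged = [line for line, conf in scores if conf < 0.7]
```

Only the low-agreement segments are routed to a human expert, which keeps the review workload proportional to actual uncertainty.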
GenAI can open many more possibilities for operational knowledge creation such as pseudo-code generation, code documentation, test case generation, and test data generation.
GenAI IT Operations Intelligent Assistant
Users face different types of challenges when interacting with an AIOps solution, primarily due to a lack of user-friendliness in its interaction paradigm. The following are some common last-mile interactions between a human and an AIOps solution.
- While using an AIOps solution, users often need help to understand a feature or troubleshoot issues they encounter. Today, this is usually done either by searching through user guides or by contacting the support teams.
- A commonly observed human interaction with an AIOps solution occurs when human expertise is required either to validate and approve the actions suggested by the AIOps engine or to guide it through exceptional scenarios.
- Another common challenge with the last mile of an AIOps solution is insight fatigue. Given the advances in observability and AI/ML algorithms, AIOps solutions inundate users with a flood of AI-driven insights. End-users often struggle to find the insights of interest, and further struggle with the explainability and trustworthiness of those insights.
GenAI-powered solutions can redefine this last mile experience by engaging with the user in intelligent conversations to simplify the entire user engagement with the AIOps solutions.
Product Q&A
Products usually come with knowledge articles, troubleshooting guides, user guides, release notes, case studies, and even incident resolution notes. To retrieve the relevant information, the user needs to know which documents to search, and their query must use the right terms and phrases to match the content of the document. In practice, however, users’ queries are often ambiguous, verbose, or even incomplete, so a significant amount of iteration may be required to identify the correct information. Hence, this task is typically limited to the product support team due to its relative complexity. This experience can be transformed with a GenAI solution.
RAG architectures come into play here. Knowledge repositories of the various product collateral can be maintained as vector stores, and RAG with an LLM can be used to create conversation engines that comprehend language variations, handle contextual information, and maintain a meaningful conversation with the user. Such an engine can also ensure the generation of meaningful and well-formed responses even when the source documents vary in language quality.
Source documents might be duplicated or ambiguous. At the same time, user queries may be incomplete, ambiguous, or lack clarity. The bot can use sentence transformers to identify problem statements that closely match the user’s query, ask for the user’s preference, and learn from it.
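The matching step can be sketched as follows. A Jaccard word-set overlap stands in here for the sentence-transformer embedding similarity a production bot would compute; the problem statements and query are invented.

```python
def similarity(a, b):
    """Jaccard overlap of word sets; a stand-in for sentence-transformer
    embedding similarity used in production bots."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def closest_problems(query, known_problems, top=2):
    """Rank documented problem statements by similarity to an ambiguous user
    query, so the bot can ask the user which one they mean."""
    ranked = sorted(known_problems, key=lambda p: similarity(query, p), reverse=True)
    return ranked[:top]

problems = [
    "agent fails to start after upgrade",
    "dashboard shows no data for new hosts",
    "license key rejected during installation",
]
matches = closest_problems("my dashboard is empty for the new host", problems)
```

Presenting the top matches back to the user ("Did you mean one of these?") both disambiguates the query and produces feedback the bot can learn from.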
Another aspect of this experience is to not only respond to user queries but also form a point-of-view and lead the conversation. This can be done by looking for other sources of information with similar context and performing sequence mining of past user conversations.
Collaborative Resolutions
An AIOps solution involves a human expert either to validate a resolution procedure or to get guidance in resolving exceptional cases. The expert often requires information that provides the context of the incident, the steps performed by the AIOps solution, health-check status, past statistics, any analytics insights, and so on.
GenAI can simplify this process by providing a plain-language summary of the incident resolution and by engaging with the user to provide any additional information in the context of the conversation. These conversations can also be remembered and applied in similar situations in the future, ensuring that the AIOps solution involves humans with just the right questions at the right time.
Insight Storytelling
Another common challenge with an AIOps solution is insight fatigue. An AIOps solution analyzes a wide range of data across business, applications, and infrastructure and generates insights ranging from behavior analysis to risk and capacity management to predictive and preventive insights. End-users are often overwhelmed by this information and struggle to find the insights of interest.
GenAI can address this problem in various creative ways. It can group and summarize insights into user-friendly reports and notifications, create chains of related insights, and enable a conversation engine that helps users navigate a wide range of insights through simple conversation. The GenAI solution can remember the context of the conversation, the relevant scope of the enterprise estate, and similar conversations from the past, allowing it not only to respond to users’ queries but to lead the conversation with relevant insights, making it easy for users to make the best use of this information. GenAI-based solutions can also pave the way for making AI-driven insights explainable and trustworthy by providing simple-language explanations of how the AI engine derived an insight.
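The grouping-and-summarizing step can be sketched deterministically; an LLM would then turn each group into a narrative report. The insight schema (`entity`, `severity`, `text`) and the sample insights below are illustrative assumptions.

```python
from collections import defaultdict

def group_insights(insights):
    """Group raw AI-driven insights by affected entity to reduce insight fatigue.

    Each insight is a dict with 'entity', 'severity', and 'text' keys
    (an assumed schema). An LLM would summarize each group into prose.
    """
    groups = defaultdict(list)
    for insight in insights:
        groups[insight["entity"]].append(insight)
    return {
        entity: {
            "count": len(items),
            "max_severity": max(i["severity"] for i in items),
            "headlines": [i["text"] for i in items],
        }
        for entity, items in groups.items()
    }

insights = [
    {"entity": "db-cluster", "severity": 3, "text": "Disk usage trending to 90% in 7 days"},
    {"entity": "db-cluster", "severity": 2, "text": "Query latency up 15% week over week"},
    {"entity": "web-tier", "severity": 1, "text": "Traffic follows normal weekly pattern"},
]
report = group_insights(insights)
```

A user then sees one prioritized card per entity instead of a flat flood of insights, and can drill into the headlines conversationally.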
Closing Thoughts
AI and automation have been transforming the IT enterprise with autonomous closed-loop operations. GenAI can advance this transformation by creating new metaphors of augmented intelligence. This not only improves the scope, accuracy, and effectiveness of an AIOps solution, but also increases its adoption by business teams by ensuring ease of access and explainability. We are only scratching the surface of the possibilities GenAI has to offer, and there are open questions with respect to its transparency and trustworthiness. However, the technology holds great potential to bridge machine intelligence and human creativity, and hence demands continuous exploration alongside ethical use and responsible development.