Baselining Normal Behaviour of Enterprise IT Systems With Event Correlations

Today’s enterprise IT systems place a major focus on observability, capturing events across the business, application, and infrastructure layers. Different kinds of events are captured. System-generated events are created by alerting tools on observing anomalies such as “high CPU utilization” or “node not reachable”. User-reported incidents capture problems experienced by end users, such as “unable to access an application” or “unable to download a report”. In addition, change requests log changes such as “patch update”, “new application installation”, or “hardware upgrade”. Lastly, there are anomalies captured from activity and error logs.

Analysis of these events can provide powerful insights to better understand enterprise operations and identify optimization opportunities. Event correlation is one of the most popular levers in this space. However, putting the theory of event correlation into practice presents various real-world challenges as well as opportunities. In this blog, we present our rubber-meets-the-road experiences on how to make the best use of event correlations in real-world enterprise IT systems.

Basics of Event Correlation

Correlation is the process of finding connections between different events which may seem unrelated at first glance. More specifically, temporal event correlation is the process of analyzing relationships between events based on their timing and sequence.

Consider a pair of event types A and B. Mining temporal correlations involves analyzing the occurrences of these events to assess whether there is any relationship between them. Do they co-occur? Do they follow a certain lead pattern, e.g., event A always follows event B? Do they co-occur only under certain preconditions, e.g., event A follows event B only on weekends? A correlation signature is associated with different properties, such as:

Support: Number of co-occurred instances of a pair of events.
Confidence: The ratio of the number of co-occurred instances to the total number of instances of the event pair.
Direction: The sequence of the events in which the event pair is observed.
Lead time: The average time window within which the event pair is observed together.
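
The four properties above can be sketched in a few lines of Python. This is a minimal illustration, not a description of how ignio computes them. The function names, the use of plain numeric timestamps, and the choice to measure confidence against the number of occurrences of the leading event are all our assumptions for the example.

```python
from statistics import mean

def _lags(lead_times, follow_times, window):
    """For each lead event, the gap to the nearest follow event within `window`."""
    lags = []
    for t in lead_times:
        gaps = [u - t for u in follow_times if 0 <= u - t <= window]
        if gaps:
            lags.append(min(gaps))
    return lags

def correlation_signature(a_times, b_times, window):
    """Summarise the temporal relationship between event types A and B.

    a_times, b_times: timestamps (e.g. seconds) of each event type.
    window: maximum gap for two events to count as co-occurring.
    """
    a_then_b = _lags(a_times, b_times, window)
    b_then_a = _lags(b_times, a_times, window)
    # Direction: whichever ordering is observed more often.
    if len(a_then_b) >= len(b_then_a):
        direction, lags, n_lead = "A->B", a_then_b, len(a_times)
    else:
        direction, lags, n_lead = "B->A", b_then_a, len(b_times)
    support = len(lags)
    return {
        "support": support,
        # Assumption: confidence is measured against the leading event's count.
        "confidence": support / n_lead if n_lead else 0.0,
        "direction": direction,
        "lead_time": mean(lags) if lags else None,
    }
```

For example, `correlation_signature([0, 100, 200, 300], [10, 115, 320], window=30)` finds three occurrences of A that are followed by B within 30 seconds, out of four occurrences of A.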

Spatio-temporal correlation brings another dimension to event correlation. It involves analyzing the spatial and temporal dimensions together to assess whether a spatial relationship exists in addition to the temporal one. Do these events happen in close geographic proximity to each other? Do these events have common influencers?

We next discuss the challenges in applying event correlations in real-world systems and their possible practical workarounds.

Selecting the Right Scope

The first challenge is to select the right scope to mine correlations. Failing to do so leads to too many or too few correlations. The relevance and usability of correlation signatures increase significantly when correlations are mined within a selected scope rather than across all events. A very effective lever is to use topological information to narrow down this scope. The inter-component influences can be captured in the form of graphs. Various graph levers such as connected components, cliques, and spanning trees can then be used to derive the influencers. Mining correlations within these influencers points to different types of insights.

Downtree traversal can help with the root-cause analysis by following the flow of dependencies through the system. For example, an application down issue can be diagnosed by traversing the influencer hierarchy and checking the health of underlying application server, database server, virtual machines, and other components.
Uptree traversal can help with the impact analysis of a fault. For example, in the event of a disk failure, an uptree traversal can help assess the impact of this failure on the upstream servers and applications that use this disk.
Connected components or cliques help point to common problems across homogeneous entities. For example, machines hosted in the same rack, or virtual machines hosted on the same physical machine can be captured using connected components.
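
The downtree and uptree traversals above can be sketched as a breadth-first search over a dependency graph. The topology below is hypothetical and exists only for illustration, as do the function names.

```python
from collections import deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "app": ["app_server", "db_server"],
    "app_server": ["vm1"],
    "db_server": ["vm2"],
    "vm1": ["disk1"],
    "vm2": ["disk1"],
    "disk1": [],
}

def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def invert(graph):
    """Reverse every edge, so that components point to their dependents."""
    inverted = {node: [] for node in graph}
    for node, deps in graph.items():
        for dep in deps:
            inverted.setdefault(dep, []).append(node)
    return inverted

def downtree(graph, start):
    """Components that `start` depends on: candidates for root-cause analysis."""
    return reachable(graph, start)

def uptree(graph, start):
    """Components that depend on `start`: the scope for impact analysis."""
    return reachable(invert(graph), start)
```

With this topology, `downtree(DEPENDS_ON, "app")` returns every server, virtual machine, and disk underneath the application, while `uptree(DEPENDS_ON, "disk1")` returns everything that a failure of that disk could affect. Correlations would then be mined only among events on the returned entities.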

Selecting the Right Time-window

Another important aspect is to set the right time window to mine correlations. The time window decides the acceptable time difference between two events for them to be considered potentially correlated. Setting this value too large ends up correlating unrelated events, whereas too small a window misses genuine correlations.

Instead of using fixed time windows for mining correlations, a better approach is to consider time windows that adapt based on the topology and the nature of events. The basic idea is to assess how long it takes for an event on one entity to cause another event on another entity. To understand this propagation time, we tap into the underlying logs of these entities and compute the lag time between these activities. For example, the impact of a “server down” event on application performance is almost instantaneous. However, the impact of a batch job starting late takes longer to manifest in the form of an SLA violation of the batch process.
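
One simple way to turn observed lag times into an adaptive window is to pick, for each event pair, the smallest lag that covers most of the historical observations. The sketch below illustrates that idea only. The function name, the `coverage` parameter, and the sample lags are hypothetical, and the percentile rule is our assumption rather than a prescribed method.

```python
import math

def adaptive_window(lags, coverage=0.95):
    """Smallest observed lag that covers a `coverage` share of the samples."""
    if not lags:
        raise ValueError("need at least one observed lag")
    ordered = sorted(lags)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]

# Hypothetical lag times (in seconds) extracted from logs for two event pairs.
observed_lags = {
    ("server down", "application slow"): [2, 3, 1, 4],
    ("batch job late", "SLA violation"): [3600, 5400, 7200, 4800],
}

# One window per event pair instead of a single fixed value.
windows = {pair: adaptive_window(lags) for pair, lags in observed_lags.items()}
```

With these samples, the near-instantaneous pair gets a window of a few seconds, while the batch pair gets a window of hours.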

Sometimes, the lag time between two events is multi-modal: the same pair of events exhibits different lag times under different conditions. It is important to understand the factors that lead to these different lag times. Classification and regression algorithms provide ways to assess which attributes, such as day of week, day of month, severity, and priority, best explain such multi-modal behavior.

Signature Fatigue

A frequently faced challenge with event correlations is that a large number of signatures get generated, making it very difficult to consume and use them effectively.

Clustering provides an effective tool to address this challenge. A correlation signature consists of the entity type, entity name, event name, and timestamp of two or more events. We use these attributes to create clusters of signatures with similar support, confidence, and lead time. Clusters can be created by event types, by entity types, and by entity names. Correlation confidence and support can also be used to create clusters of correlations of different strengths.
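
As a rough illustration, the sketch below groups signatures by entity types, event types, and a coarse strength band. It uses plain key-based grouping as a simplified stand-in for a clustering algorithm. The dictionary fields and the confidence thresholds are assumptions made for the example.

```python
from collections import defaultdict

def strength_band(confidence):
    """Bucket a confidence value into a coarse band (thresholds are assumed)."""
    if confidence >= 0.8:
        return "strong"
    if confidence >= 0.5:
        return "moderate"
    return "weak"

def group_signatures(signatures):
    """Group signatures sharing entity types, event types, and strength band."""
    groups = defaultdict(list)
    for sig in signatures:
        key = (sig["entity_types"], sig["event_types"],
               strength_band(sig["confidence"]))
        groups[key].append(sig)
    return dict(groups)
```

A reviewer can then look at one representative per group instead of reading every signature. Grouping on entity names rather than entity types gives a finer-grained view.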

Low Confidence Correlations

Another common problem observed while mining correlations on real data is that many correlations do not demonstrate high confidence values. In their base form, these signatures are not usable because users cannot confidently derive meaningful inferences from them. However, high-confidence signatures are often hidden within these low-confidence signatures. They just need to be extracted by applying the right filters. Filtering on attributes such as severity, priority, day of month, hour of day, day of week, and location often leads to the discovery of useful correlation signatures.

Applying classification algorithms on various event attributes such as severity, priority, day of month, hour of day, day of week, location, etc. helps to find the set of pre-conditions that increase the correlation confidence.

Let’s consider a scenario with 7 databases and a single backup server. Each database performs backup to this server on a specific day of the week. If we analyze events throughout the entire week, we’ll likely observe weak correlations, with roughly 1 in 7 confidence, due to the varying backup schedules. However, if we narrow down our analysis to only include events occurring on specific days, the likelihood of finding a strong correlation increases significantly.
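
The backup scenario can be reproduced with a small simulation. The event names, the four-week horizon, and the same-day definition of co-occurrence below are assumptions chosen to keep the example short.

```python
from collections import namedtuple

# day: integer day index, where day % 7 gives the day of the week.
Event = namedtuple("Event", "day name")

def confidence(events, lead, follow, keep=lambda day: True):
    """Share of `lead` events on kept days with a `follow` event the same day."""
    lead_days = {e.day for e in events if e.name == lead and keep(e.day)}
    follow_days = {e.day for e in events if e.name == follow}
    if not lead_days:
        return 0.0
    return len(lead_days & follow_days) / len(lead_days)

# Four weeks of data: the backup server is busy every day, and database
# number (day % 7) runs its backup on that day of the week.
events = []
for day in range(28):
    events.append(Event(day, "backup_server_busy"))
    events.append(Event(day, f"db{day % 7}_backup"))

# Across the whole week, db2 explains only 1 in 7 busy days.
overall = confidence(events, "backup_server_busy", "db2_backup")

# Restricted to db2's backup day, the correlation is perfect.
filtered = confidence(events, "backup_server_busy", "db2_backup",
                      keep=lambda day: day % 7 == 2)
```

Here `overall` is 4/28, or roughly 0.14, while `filtered` is 1.0. The events are the same; only the precondition on day of week changes.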

Interpreting Correlation Signatures

Correlation signatures can be used in different ways to understand and manage an enterprise IT system. Below are some real-world use cases:

Alert aggregation: A fault often generates many symptoms, and each of these symptoms manifests in the form of events. Consider a scenario wherein a database accessed by multiple applications goes down, triggering “database not accessible” events from all associated applications. Separate events are created for each of these applications. The command center teams treat each event in isolation and end up putting in a lot of redundant effort. Correlation signatures can help aggregate such related alerts and thus reduce alert fatigue for command center teams.

Not all correlation signatures are suited for alert aggregation. The correlated events should occur within a short time-window such that the incoming alerts can be grouped together to act on. The entities of the correlated events should be structurally related such that the correlations have semantic significance.
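
The two conditions above, a short time window and structurally related entities, can be sketched as a simple grouping routine. The alert tuple layout, the `shared_component` mapping, and the function name are hypothetical.

```python
def aggregate_alerts(alerts, shared_component, window):
    """Group alerts on structurally related entities that arrive close together.

    alerts: list of (timestamp, entity, message) tuples.
    shared_component: maps an entity to the component it depends on;
        entities mapped to the same component are treated as related.
    window: maximum gap between consecutive alerts within one group.
    """
    groups = []
    latest_group = {}  # shared component -> most recent group for it
    for ts, entity, message in sorted(alerts):
        root = shared_component.get(entity, entity)
        group = latest_group.get(root)
        if group is not None and ts - group[-1][0] <= window:
            group.append((ts, entity, message))
        else:
            group = [(ts, entity, message)]
            latest_group[root] = group
            groups.append(group)
    return groups
```

In the database example, alerts from all applications that map to the same database and arrive within the window collapse into a single group. An unrelated alert, or a repeat of the same alert much later, starts a new group.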

Alert prediction: Major problems often have early indicators. For example, high website traffic followed by high disk utilization on the database server is a strong early indicator of future disk-full and database crash events. Early identification of future issues can help the command center teams mitigate or contain their impact. Correlation signatures can be used to identify these issues early.

Correlation signatures that have a strong sense of direction are best suited for alert prediction. Furthermore, the correlated events should occur within a relatively longer time-window, such that early signals are helpful in taking any preventive actions. Consider a correlation signature where a chain of upstream job failures leads to downstream SLA violations three hours later. In this case, the SRE has both a clear understanding of events that are about to come and sufficient time to perform corrective actions.
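
A minimal sketch of this idea: given an incoming event, look up directional signatures that are confident enough and leave enough lead time to act. The signature fields and both thresholds are assumptions made for illustration.

```python
def predict_alerts(event_name, event_time, signatures,
                   min_confidence=0.8, min_lead=900):
    """Return (expected event, expected time) pairs triggered by an event.

    signatures: dicts with lead_event, follow_event, confidence, and
        lead_time (seconds). Only signatures that are confident enough and
        leave at least `min_lead` seconds to react are used.
    """
    predictions = []
    for sig in signatures:
        if (sig["lead_event"] == event_name
                and sig["confidence"] >= min_confidence
                and sig["lead_time"] >= min_lead):
            predictions.append(
                (sig["follow_event"], event_time + sig["lead_time"]))
    return predictions
```

For the job failure example above, a signature with a lead time of three hours (10,800 seconds) yields a predicted SLA violation three hours after the failure. A signature with a lead time of only a few seconds is skipped because it leaves no time for preventive action.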

Problem signature mining: Correlations also provide a very useful lever to analyze recurring issues. Correlations can help derive detailed signatures of how these issues manifest and also narrow down their root cause.

To use correlations for problem signature mining, look for correlations with high support, which indicates that the issue has occurred often enough to initiate the problem management process. It is also a good idea to use a larger time window to mine such correlations, as it may take time for the fault to manifest across different levels of the tech stack.

A recurring issue may be triggered by more than one cause, and these causes may not manifest together. Consequently, when all recurring issues are analyzed together, no strong correlation signature surfaces. However, the same events, when analyzed in subsets under different preconditions, demonstrate stronger correlations.

Closing Notes

Traditionally, IT operations have been reactive: issues create tickets that are then assigned to SMEs for resolution. Event correlation introduces a smarter, more proactive approach by analysing event data to better manage enterprise IT systems.

It enables early prediction and elimination of potential issues, reducing their occurrence. It also groups related alerts into logical clusters, cutting down noise and redundant notifications. Finally, by identifying likely root causes, event correlation speeds up resolution times and helps maintain a healthier, more resilient IT environment.

Dr. Maitreya Natu
