Today’s enterprise IT systems place a major focus on observability, capturing events across the business, application, and infrastructure layers. Different kinds of events are captured. System-generated events are created by alerting tools on observing anomalies such as “high CPU utilization” or “node not reachable”. User-reported incidents capture the problems experienced by end users, such as “unable to access an application” and “unable to download a report”. In addition, change requests are logged that capture changes such as “patch update”, “new application installation”, and “hardware upgrade”. Lastly, there are anomalies captured from activity and error logs.
Analysis of these events can provide powerful insights to better understand enterprise operations and identify optimization opportunities. Event correlation is one of the most popular levers in this space. However, putting the theory of event correlation into practice presents various real-world challenges as well as opportunities. In this blog, we present our rubber-meets-the-road experiences on how to make the best use of event correlations in real-world enterprise IT systems.
Basics of Event Correlation
Correlation is the process of finding connections between different events which may seem unrelated at first glance. More specifically, temporal event correlation is the process of analyzing relationships between events based on their timing and sequence.
Consider a pair of event types A and B. Mining temporal correlations involves analyzing the occurrences of these events to assess whether they are related. Do they co-occur? Do they follow a certain lead pattern, e.g. event A always follows event B? Do they co-occur only under certain preconditions, e.g. event A follows event B only on weekends? A correlation signature is described by several properties, such as:
- Support: Number of co-occurred instances of a pair of events.
- Confidence: The ratio of the number of co-occurred instances to the total number of instances of the event pair.
- Direction: The sequence of the events in which the event pair is observed.
- Lead time: The average time window within which the event pair is observed together.
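The four properties above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `correlation_signature`, the nearest-neighbour pairing rule, and the choice to measure confidence relative to the occurrences of event A are all assumptions made for this example.

```python
from statistics import mean


def correlation_signature(a_times, b_times, window):
    """Summarize the temporal relationship between event types A and B.

    a_times, b_times: occurrence timestamps in seconds.
    window: maximum gap (seconds) for two occurrences to count as co-occurring.
    """
    lags = []  # signed gap (B minus A) for each A that has a B nearby
    for a in a_times:
        nearby = [b - a for b in b_times if abs(b - a) <= window]
        if nearby:
            lags.append(min(nearby, key=abs))  # pair A with its closest B

    support = len(lags)
    # Assumption: confidence is measured relative to the occurrences of A.
    confidence = support / len(a_times) if a_times else 0.0
    if not lags:
        return {"support": 0, "confidence": confidence,
                "direction": None, "lead_time": None}

    a_first = sum(1 for lag in lags if lag > 0)
    b_first = sum(1 for lag in lags if lag < 0)
    if a_first > b_first:
        direction = "A before B"
    elif b_first > a_first:
        direction = "B before A"
    else:
        direction = "no clear order"

    lead_time = mean([abs(lag) for lag in lags])
    return {"support": support, "confidence": confidence,
            "direction": direction, "lead_time": lead_time}


signature = correlation_signature(
    a_times=[0, 100, 200, 300], b_times=[5, 110, 215], window=30)
print(signature)
```

In this made-up data, three of the four A occurrences have a B within 30 seconds, so the support is 3 and the confidence is 0.75. B always trails A, and the average lead time is 10 seconds.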
Spatio-temporal correlation brings another dimension to event correlation. It involves analyzing the spatial and temporal dimensions together to assess whether a spatial relationship exists in addition to the temporal one. Do these events happen in close geographic proximity to each other? Do these events have common influencers?
We next discuss the challenges in applying event correlations in real-world systems and their possible practical workarounds.
Selecting the Right Scope
The first challenge is to select the right scope for mining correlations. Too broad a scope produces too many correlations, while too narrow a scope produces too few. Rather than mining correlations across all events, restricting the scope significantly increases the relevance and usability of the resulting signatures. A very effective lever is to use topological information to narrow down this scope. Inter-component influences can be captured effectively in the form of graphs. Graph constructs such as connected components, cliques, and spanning trees can then be used to derive the influencers. Mining correlations within these influencers points to different types of insights.
- Downtree traversal can help with root-cause analysis by following the flow of dependencies through the system. For example, an application-down issue can be diagnosed by traversing the influencer hierarchy and checking the health of the underlying application server, database server, virtual machines, and other components.
- Uptree traversal can help with the impact analysis of a fault. For example, in the event of a disk failure, an uptree traversal can help assess the impact of this failure on the upstream servers and applications that use this disk.
- Connected components or cliques help point to common problems across homogeneous entities. For example, machines hosted in the same rack, or virtual machines hosted on the same physical machine can be captured using connected components.
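The three traversals above can be illustrated with a small dependency graph. The sketch below uses only the standard library. The topology in `DEPENDS_ON` and all component names are hypothetical, and a real deployment would more likely load this graph from a CMDB or discovery tool.

```python
from collections import defaultdict, deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "billing-app": ["app-server-1", "orders-db"],
    "reports-app": ["app-server-1", "orders-db"],
    "app-server-1": ["vm-1"],
    "orders-db": ["vm-2"],
    "vm-1": ["disk-1"],
    "vm-2": ["disk-1"],
    "hr-app": ["vm-3"],
}


def reachable(graph, start):
    """Breadth-first search: every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that dependencies point at their dependents."""
    rev = defaultdict(list)
    for node, deps in graph.items():
        for dep in deps:
            rev[dep].append(node)
    return rev


def downtree(node):
    """Components that node depends on (candidates for root cause)."""
    return reachable(DEPENDS_ON, node)


def uptree(node):
    """Components that depend on node (candidates for impact)."""
    return reachable(reverse(DEPENDS_ON), node)


def connected_components(graph):
    """Groups of components linked by any chain of dependencies."""
    undirected = defaultdict(set)
    for node, deps in graph.items():
        for dep in deps:
            undirected[node].add(dep)
            undirected[dep].add(node)
    components, seen = [], set()
    for node in undirected:
        if node not in seen:
            component = reachable(undirected, node) | {node}
            seen |= component
            components.append(component)
    return components


print(sorted(downtree("billing-app")))
print(sorted(uptree("disk-1")))
print(connected_components(DEPENDS_ON))
```

Restricting correlation mining to the events of `downtree("billing-app")`, `uptree("disk-1")`, or one connected component is one way to apply the scoping idea described above.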
Selecting the Right Time-window
Another important aspect is setting the right time window for mining correlations. The time window determines the acceptable time difference between two events for them to be considered potentially correlated. Too large a window ends up correlating unrelated events, whereas too small a window misses genuine correlations.
Instead of using fixed time windows for mining correlations, a better approach is to use time windows that adapt to the topology and the nature of the events. The basic idea is to assess how long it takes for an event on one entity to cause an event on another entity. To understand this propagation time, we tap into the underlying logs of these entities and compute the lag time between these activities. For example, the impact of a “server down” event on application performance is almost instantaneous. However, the impact of a batch job starting late takes longer to manifest in the form of an SLA violation of the batch process.
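One simple way to turn observed lag times into a per-pair window is to take a high percentile of the lags and add a safety margin. The sketch below assumes this approach. The lag values, the 95th percentile, and the 1.2 margin are illustrative choices for this example, not recommended settings.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def adaptive_window(lags, pct=95, margin=1.2):
    """Window (seconds) that covers most observed lags, plus a margin."""
    return percentile(lags, pct) * margin


# Hypothetical lag times in seconds, as might be computed from entity logs.
observed_lags = {
    ("server down", "application slow"):
        [2, 3, 1, 4, 2, 3, 2, 5, 3, 2],
    ("batch job started late", "batch SLA violated"):
        [3600, 5400, 4800, 7200, 6000, 5100, 6600, 4500, 5700, 6300],
}

windows = {pair: adaptive_window(lags) for pair, lags in observed_lags.items()}
for pair, window in windows.items():
    print(pair, round(window))
```

With this data the first pair gets a window of a few seconds while the second gets one of more than two hours, which is the kind of difference a single fixed window cannot accommodate.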
Sometimes, the lag time between two events is multi-modal: the same pair of events exhibits different lag times under different conditions. It is important to understand the factors that lead to these different lag times. Classification and regression algorithms provide ways to assess which attributes, such as day of week, day of month, severity, and priority, best explain such multi-modal behavior.
Signature Fatigue
A frequently faced challenge with event correlations is that a large number of signatures get generated, making it very difficult to consume and use them effectively.
Clustering provides an effective tool to address this challenge. A correlation signature consists of the entity type, entity name, event name, and timestamp of two or more events. We use these attributes, together with support, confidence, and lead time, to group similar correlation signatures into clusters. Clusters can be created by event types, by entity types, and by entity names. Correlation confidence and support can also be used to create clusters of correlations of different strengths.
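As a rough illustration of how grouping reduces the number of items a user has to review, the sketch below buckets signatures by entity types, event names, and a coarse strength label. This is plain key-based grouping standing in for a real clustering algorithm. The field names, the sample signatures, and the 0.8 and 0.5 thresholds are assumptions made for this example.

```python
from collections import defaultdict


def strength(confidence):
    """Coarse strength label; the thresholds are illustrative only."""
    if confidence >= 0.8:
        return "strong"
    if confidence >= 0.5:
        return "moderate"
    return "weak"


def group_signatures(signatures):
    """Group signatures that share entity types, events, and strength."""
    groups = defaultdict(list)
    for sig in signatures:
        key = (sig["entity_types"], sig["events"], strength(sig["confidence"]))
        groups[key].append(sig["entities"])
    return dict(groups)


# Hypothetical signatures produced by a correlation miner.
signatures = [
    {"entity_types": ("database", "application"),
     "events": ("db down", "db not accessible"),
     "entities": ("orders-db", "billing-app"), "confidence": 0.92},
    {"entity_types": ("database", "application"),
     "events": ("db down", "db not accessible"),
     "entities": ("orders-db", "reports-app"), "confidence": 0.88},
    {"entity_types": ("database", "application"),
     "events": ("db down", "db not accessible"),
     "entities": ("hr-db", "hr-app"), "confidence": 0.95},
    {"entity_types": ("vm", "application"),
     "events": ("high CPU", "slow response"),
     "entities": ("vm-1", "billing-app"), "confidence": 0.35},
]

groups = group_signatures(signatures)
for key, members in groups.items():
    print(key, len(members))
```

Here four signatures collapse into two groups, so a reviewer sees one strong database-to-application pattern with three members rather than three separate signatures.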
Low Confidence Correlations
Another common problem observed while mining correlations on real data is that many correlations do not show high confidence values. In their base form, these signatures are not usable, as users cannot draw meaningful inferences from them. However, high-confidence signatures are often hidden within these low-confidence signatures. They just need to be extracted by applying the right filters. Filtering on attributes such as severity, priority, day of month, hour of day, day of week, and location often leads to the discovery of useful correlation signatures.
Applying classification algorithms on various event attributes such as severity, priority, day of month, hour of day, day of week, location, etc. helps to find the set of pre-conditions that increase the correlation confidence.
Let’s consider a scenario with 7 databases and a single backup server. Each database performs backup to this server on a specific day of the week. If we analyze events throughout the entire week, we’ll likely observe weak correlations, with roughly 1 in 7 confidence, due to the varying backup schedules. However, if we narrow down our analysis to only include events occurring on specific days, the likelihood of finding a strong correlation increases significantly.
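The backup scenario can be reproduced with synthetic data. The sketch below is only an illustration: the entity names, the event names, and the rule that two events co-occur when they fall on the same day are assumptions. The exhaustive scan over weekdays is a simple stand-in for the classification step mentioned earlier.

```python
WEEKS = 4

# Synthetic events: (day_index, entity, event_name).
# Database k backs up on weekday k; the backup server is busy every day.
events = []
for day in range(WEEKS * 7):
    weekday = day % 7
    events.append((day, f"db-{weekday}", "backup started"))
    events.append((day, "backup-server", "high disk I/O"))


def confidence(events, db, weekday=None):
    """Fraction of backup-server events that share a day with a db backup.

    If weekday is given, only backup-server events on that weekday count.
    """
    server_days = {day for day, entity, _ in events if entity == "backup-server"}
    db_days = {day for day, entity, _ in events if entity == db}
    if weekday is not None:
        server_days = {day for day in server_days if day % 7 == weekday}
    return len(server_days & db_days) / len(server_days)


overall = confidence(events, "db-3")
filtered = confidence(events, "db-3", weekday=3)
# Scan every weekday filter and keep the one with the highest confidence.
best_weekday = max(range(7), key=lambda wd: confidence(events, "db-3", wd))

print(round(overall, 3), filtered, best_weekday)
```

Across the whole week the confidence for `db-3` is 1 in 7, but once the analysis is restricted to weekday 3 it rises to 1.0, and the scan recovers that weekday as the best precondition.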
Interpreting Correlation Signatures
Correlation signatures can be used in different ways to understand and manage an enterprise IT system. Below are some real-world use cases:
Alert aggregation: A fault often generates many symptoms, and each of these symptoms manifests in the form of events. Consider a scenario wherein a database accessed by multiple applications goes down, triggering “database not accessible” events from all associated applications. Separate events are created for each of these applications. The command center teams treat each event in isolation and end up putting in a lot of redundant effort. Correlation signatures can help aggregate such related alerts and thus reduce alert fatigue for command center teams.
Not all correlation signatures are suited for alert aggregation. The correlated events should occur within a short time-window such that the incoming alerts can be grouped together to act on. The entities of the correlated events should be structurally related such that the correlations have semantic significance.
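The two conditions for aggregation, a short time window and structurally related entities, can be sketched as follows. The `SHARED_DEPENDENCY` map, the alert tuples, and the 60-second window are hypothetical. In practice the structural relation would come from the topology and the window from the mined signatures.

```python
# Hypothetical map from each application to the database it relies on.
SHARED_DEPENDENCY = {
    "billing-app": "orders-db",
    "reports-app": "orders-db",
    "hr-app": "hr-db",
}


def aggregate(alerts, window):
    """Group alerts whose entities share a dependency and that arrive
    within `window` seconds of the first alert in the group."""
    groups = []
    for ts, entity, message in sorted(alerts):
        root = SHARED_DEPENDENCY.get(entity, entity)
        for group in groups:
            if group["root"] == root and ts - group["start"] <= window:
                group["alerts"].append((ts, entity, message))
                break
        else:
            # No open group matched, so this alert starts a new one.
            groups.append({"root": root, "start": ts,
                           "alerts": [(ts, entity, message)]})
    return groups


alerts = [
    (0, "billing-app", "database not accessible"),
    (4, "reports-app", "database not accessible"),
    (6, "hr-app", "database not accessible"),
    (900, "billing-app", "database not accessible"),
]

groups = aggregate(alerts, window=60)
for group in groups:
    print(group["root"], group["start"], len(group["alerts"]))
```

The two alerts that point at `orders-db` within seconds of each other are merged into one group, while the `hr-app` alert and the much later `billing-app` alert stay separate.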
Alert prediction: Major problems often have early indicators. For example, high website traffic followed by high disk utilization on the database server is a strong early indicator of future disk-full and database-crash events. Early identification of future issues can help the command center teams mitigate or contain their impact. Correlation signatures can be used to identify these issues early.
Correlation signatures that have a strong sense of direction are best suited for alert prediction. Furthermore, the correlated events should occur within a relatively longer time-window, such that early signals are helpful in taking any preventive actions. Consider a correlation signature where a chain of upstream job failures leads to downstream SLA violations three hours later. In this case, the SRE has both a clear understanding of events that are about to come and sufficient time to perform corrective actions.
Problem signature mining: Correlations also provide a very useful lever to analyze recurring issues. Correlations can help derive detailed signatures of how these issues manifest and also narrow down their root cause.
To use correlations for problem signature mining, look for correlations with high support, which indicates that the issue has occurred often enough to initiate the problem management process. It is also a good idea to use a larger time window to mine such correlations, as it may take time for the fault to manifest across different levels of the tech stack.
A recurring issue may be triggered by more than one cause, and these causes may not manifest together. Consequently, when all occurrences of the issue are analyzed together, no strong correlation signature surfaces. However, the same events, when analyzed in subsets defined by different preconditions, demonstrate stronger correlations.
Closing Notes
Traditionally, IT operations have been reactive: issues create tickets that are then assigned to SMEs for resolution. Event correlation introduces a smarter, more proactive approach by analyzing event data to better manage enterprise IT systems.
It enables early prediction and elimination of potential issues, reducing their occurrence. It also groups related alerts into logical clusters, cutting down noise and redundant notifications. Finally, by identifying likely root causes, event correlation speeds up resolution times and helps maintain a healthier, more resilient IT environment.