Baselining Normal Behavior of Enterprise IT Systems With Data-Driven Thresholds

Imagine setting a morning alarm. If you always wake up at 7 a.m., a fixed alarm works perfectly. But life isn’t that simple! On weekends, you might wake up at 9 a.m., and after late nights, you may sleep in until 11 a.m. A rigid alarm fails to adapt. This challenge mirrors the problems enterprises face in monitoring their complex IT estates. Just like a fixed alarm, traditional monitoring systems often struggle to keep up with the dynamic nature of modern IT infrastructure, and as a result they either raise false alerts or miss critical issues. The complexity of IT estates, with their constantly changing dependencies and temporal variations, requires smarter and more flexible monitoring solutions: ones that can adapt to fluctuations in system behaviour.

To detect an anomaly in an enterprise IT system, it is important to first define the normal behaviour of an entity. This normal behaviour is then used to define baseline thresholds, which monitoring tools in turn refer to when generating alerts for anomalous behaviour. However, given the dynamic nature of enterprise IT components, defining these thresholds presents several challenges. Failing to address them risks either generating too many false alerts or missing genuine anomalies. In this blog, we discuss our experiences in defining these normal behaviour thresholds and show how the process of deriving them evolves with growing complexity.

Blanket Thresholds based on Subject Matter Expert’s (SME) Tacit Knowledge

A common approach in enterprise monitoring is to set fixed blanket thresholds based on experts’ domain knowledge. For instance, set a fixed threshold of 90% on the CPU utilization of all virtual machines, and generate an alert when the CPU crosses 90%. Subject Matter Experts (SMEs) define these limits using operational experience, aiming to balance alerting accuracy and system stability.

However, such “one-size-fits-all” thresholds assume uniform behaviour across dynamic and evolving systems. They often fail to account for real-world variability and overlook the inherent differences in how individual systems behave.

Example:

The following figure shows the response times of 4 APIs. Here the SME has used tacit knowledge to set a blanket threshold of 20ms on response time for all the APIs. As the figure shows, different APIs have different value ranges, so a blanket threshold of 20ms generates too many alerts for some APIs and misses genuine anomalies in others.
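The effect of a blanket threshold can be illustrated with a minimal sketch. The API names and response times below are made up for illustration and are not the data behind the figure; only the 20ms threshold comes from the example above.

```python
# Hypothetical response times in milliseconds (illustrative, not the figure's data).
response_times_ms = {
    "api_1": [4, 5, 6, 5, 4, 15, 5],        # usually ~5 ms; the 15 ms spike is abnormal
    "api_2": [22, 25, 24, 23, 26, 24, 25],  # usually ~24 ms; nothing abnormal here
}

BLANKET_THRESHOLD_MS = 20  # the SME-defined blanket threshold from the example


def count_alerts(values, threshold):
    """Count the samples that exceed the threshold."""
    return sum(1 for value in values if value > threshold)


alerts = {
    api: count_alerts(values, BLANKET_THRESHOLD_MS)
    for api, values in response_times_ms.items()
}
for api, n in alerts.items():
    print(f"{api}: {n} alerts out of {len(response_times_ms[api])} samples")
```

In this made-up data, the spike on api_1 stays below 20ms and is never reported, while every normal sample of api_2 raises an alert.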

Statistical Fixed-value Thresholds

Instead of intuition-driven thresholds, a more promising option is to derive thresholds from data. The history of a metric can be analysed to define its normal ranges. These normal ranges can then be used to define the thresholds for generating alerts.

One of the simplest statistical approaches is to define a threshold using standard statistical functions such as mean and standard deviation. For instance, the upper threshold is set at mean + 3*standard deviation, and the lower threshold is set at mean - 3*standard deviation.
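As a minimal sketch of this rule, the snippet below computes both thresholds from a short history using Python's standard `statistics` module. The sample values, the function names, and the choice of the population standard deviation are assumptions made for illustration.

```python
import statistics


def three_sigma_thresholds(history, k=3.0):
    """Return (lower, upper) thresholds at mean -/+ k standard deviations."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # population std; a sample std is an equally valid choice
    return mean - k * std, mean + k * std


def is_anomalous(value, lower, upper):
    """A value is anomalous when it falls outside the [lower, upper] band."""
    return value < lower or value > upper


# Hypothetical response times in milliseconds.
history = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
lower, upper = three_sigma_thresholds(history)
print(f"lower={lower:.2f} ms, upper={upper:.2f} ms")
print(is_anomalous(20, lower, upper))
```

The multiplier `k` controls how wide the normal band is; 3 is the value used in the text, and smaller values make alerting more aggressive.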

However, several factors need to be considered before applying such statistical measures to define thresholds on real-world data.

Most enterprise IT components observe changes from time to time. These could be business changes or technology changes. Hence, it is important to first identify the steady state of the metric. Various change detection algorithms can be used to detect significant, persistent changes.
Once the most recent steady state is identified, its data needs to be analysed to remove noise. Various outlier detection algorithms can be used to detect outliers and remove them before baselining.
The data also needs to be analysed to look for trends using trend detection algorithms. If an increasing or decreasing trend is present, then it needs to be considered while deriving the baseline thresholds.
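The three preparation steps above can be sketched as a small pipeline. The algorithms used here are deliberately simple stand-ins chosen for illustration, not necessarily the ones used in practice: a single mean-shift test for change detection, a median/MAD filter for outliers, and a least-squares slope for trend. All function names, parameters, and data are hypothetical.

```python
import statistics


def latest_steady_state(series, min_len=5, k=3.0):
    """Return the data after the strongest mean shift, or everything if no shift is significant."""
    best_split, best_score = None, 0.0
    for i in range(min_len, len(series) - min_len + 1):
        left, right = series[:i], series[i:]
        spread = max(statistics.pstdev(left), statistics.pstdev(right), 1e-9)
        score = abs(statistics.fmean(right) - statistics.fmean(left)) / spread
        if score > best_score:
            best_split, best_score = i, score
    return list(series[best_split:]) if best_score > k else list(series)


def remove_outliers(values, k=3.0):
    """Drop points further than k scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge outliers against
    cutoff = k * 1.4826 * mad  # 1.4826 makes MAD comparable to a standard deviation
    return [v for v in values if abs(v - med) <= cutoff]


def trend_slope(values):
    """Least-squares slope of the values against their index (units per sample)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx if sxx else 0.0


# Hypothetical metric: an old regime near 50, then a new regime near 20 with one spike.
old = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50]
new = [20, 21, 19, 20, 22, 35, 20, 21, 19, 20, 21, 20]

steady = latest_steady_state(old + new)
cleaned = remove_outliers(steady)
slope = trend_slope(cleaned)
print(len(steady), len(cleaned), round(slope, 3))
```

The order matters in this sketch: outliers are removed only after the steady state is isolated, so that values from the old regime are not mistaken for noise. The mean-shift test itself is sensitive to large spikes, which is one reason a production system would likely use a more robust change detection algorithm.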

This processed time series can then be used to define the normal behaviour thresholds. Different statistical functions can be used to derive them. The most common approach is to compute the threshold as a function of mean and standard deviation. Other commonly adopted approaches use median and MAD (median absolute deviation), or quantiles.
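The two alternatives can be sketched with the standard `statistics` module. The data, the multiplier of 3, the 1.4826 scaling constant, and the 1st/99th percentile cut-offs are illustrative assumptions rather than recommended settings.

```python
import statistics


def mad_thresholds(values, k=3.0):
    """Thresholds at median -/+ k scaled MADs; robust to a few extreme values."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scaled = 1.4826 * mad  # scaling that makes MAD comparable to a standard deviation
    return med - k * scaled, med + k * scaled


def quantile_thresholds(values, lower_pct=1, upper_pct=99):
    """Thresholds at the given percentiles of the observed data."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


# Hypothetical response times in milliseconds, including one spike (35).
values = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 20, 35]

mad_lower, mad_upper = mad_thresholds(values)
q_lower, q_upper = quantile_thresholds(values)
print(f"median/MAD: {mad_lower:.2f} .. {mad_upper:.2f}")
print(f"quantiles:  {q_lower:.2f} .. {q_upper:.2f}")
```

In this made-up data, the median/MAD band stays close to the bulk of the values even though the spike was not removed first, which is the usual reason for preferring it over mean and standard deviation on noisy data.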

Example:

The following figure shows the response times of 4 APIs. Here, instead of a fixed threshold of 20ms, a different threshold is computed for each API based on its normal behaviour range.

Dynamic Thresholds

Many enterprise IT components demonstrate temporal patterns. For example, an application may face a high workload on weekdays and a low one on weekends, or certain virtual machines may show higher CPU utilization during office hours than outside them. Static thresholds do not take these patterns into account. Consequently, they do not accurately represent normal behaviour, which leads to false alerts or missed alerts. Such situations demand a dynamic threshold, where different thresholds are set for different time periods.
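A minimal sketch of a dynamic threshold is shown below, assuming that the relevant time periods are already known to be weekdays and weekends. The bucket names, the mean and standard deviation rule, and the sample data are assumptions made for illustration.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_of(ts):
    """Map a timestamp to its time period (Saturday and Sunday are the weekend)."""
    return "weekend" if ts.weekday() >= 5 else "weekday"


def dynamic_thresholds(samples, k=3.0):
    """Compute a separate (lower, upper) band for each time period."""
    grouped = defaultdict(list)
    for ts, value in samples:
        grouped[bucket_of(ts)].append(value)
    thresholds = {}
    for bucket, values in grouped.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        thresholds[bucket] = (mean - k * std, mean + k * std)
    return thresholds


def is_anomalous(ts, value, thresholds):
    """Check the value against the band of the period the timestamp falls in."""
    lower, upper = thresholds[bucket_of(ts)]
    return value < lower or value > upper


# Hypothetical daily workload: about 100 on weekdays and about 30 on weekends.
start = datetime(2024, 1, 1)  # a Monday
samples = []
for day in range(14):
    ts = start + timedelta(days=day)
    value = 30 + day % 2 if ts.weekday() >= 5 else 100 + day % 3
    samples.append((ts, value))

thresholds = dynamic_thresholds(samples)
print(is_anomalous(datetime(2024, 1, 15), 100, thresholds))  # a Monday
print(is_anomalous(datetime(2024, 1, 20), 100, thresholds))  # a Saturday
```

The same value of 100 is normal on a Monday but anomalous on a Saturday, which a single static threshold cannot express.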

Multimodal analysis of a metric can help identify whether the metric needs a static or a dynamic threshold. In contrast to a unimodal distribution, a multimodal distribution features multiple peaks, highlighting varying temporal behaviours, such as differing workloads on weekdays and weekends or peak-hour traffic. Spectral analysis can be used to identify the prominent modes. Once these modes are identified, classification techniques can be used to identify the temporal dimensions that best explain each mode. The temporal dimensions include week of month, day of month, day of week, hour of day, etc. The modes are often explained by a combination of these temporal dimensions, such as day of week of month, or hour of day of week.
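The idea of finding the temporal dimension that best explains a metric can be sketched as follows. This is a simplified stand-in for the spectral analysis and classification steps described above: it scores each candidate dimension by the share of variance explained when the data is grouped by that dimension. The candidate dimensions, the 0.5 cut-off, and the data are assumptions made for illustration.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta

# Candidate temporal dimensions (a hypothetical subset of those listed above).
DIMENSIONS = {
    "hour_of_day": lambda ts: ts.hour,
    "day_of_week": lambda ts: ts.weekday(),
}


def variance_explained(samples, key):
    """Share of total variance removed by grouping the samples with `key`."""
    values = [v for _, v in samples]
    total = statistics.pvariance(values)
    if total == 0:
        return 0.0
    groups = defaultdict(list)
    for ts, v in samples:
        groups[key(ts)].append(v)
    within = sum(len(g) * statistics.pvariance(g) for g in groups.values()) / len(values)
    return 1.0 - within / total


def best_dimension(samples, min_explained=0.5):
    """Return the best-scoring dimension (or None for a static threshold) and all scores."""
    scores = {name: variance_explained(samples, key) for name, key in DIMENSIONS.items()}
    name = max(scores, key=scores.get)
    return (name if scores[name] >= min_explained else None), scores


# Hypothetical hourly metric: about 100 on weekdays and about 30 on weekends.
start = datetime(2024, 1, 1)  # a Monday
samples = []
for h in range(14 * 24):
    ts = start + timedelta(hours=h)
    base = 30 if ts.weekday() >= 5 else 100
    samples.append((ts, base + h % 2))

dimension, scores = best_dimension(samples)
print(dimension, {name: round(s, 3) for name, s in scores.items()})
```

When no dimension explains enough of the variance, the function returns `None`, which in this sketch means a static threshold is sufficient. Combinations such as hour of day of week could be added as further entries in `DIMENSIONS`, at the cost of needing more history per group.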

Example:

The following figure shows that the response time of API 3 demonstrates a temporal pattern, where the response time is high on weekdays and low on weekends. Here, instead of a static threshold, a dynamic threshold is recommended, with higher thresholds on weekdays and lower thresholds on weekends.

Composite Thresholds

The thresholds discussed so far look at one metric in isolation. However, many real-world anomalies manifest only when more than one metric starts showing abnormal behaviour. For instance, consider an application that starts observing failures when heap, threadpool, and connectionpool all show high utilization levels. Alerting in such scenarios can be made a lot more effective by considering more than one metric when generating an alert.

Classification and regression techniques provide effective ways to analyse multiple metrics and identify the behaviour ranges of these metrics which, when collectively crossed, lead to an alert. These insights can be used to define a composite threshold that spans multiple metrics. An alert is generated only when all these metrics cross their normal behaviour thresholds.
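The alerting rule of a composite threshold can be sketched in a few lines. The snippet only shows how the rule is evaluated; the per-metric limits are hypothetical values and stand in for whatever a classification or regression analysis would produce.

```python
# Hypothetical upper thresholds (percent utilization) for each metric.
UPPER_THRESHOLDS = {"heap": 85.0, "threadpool": 90.0, "connectionpool": 80.0}


def breached_metrics(sample, thresholds):
    """List the metrics whose current value exceeds their own threshold."""
    return [metric for metric, limit in thresholds.items() if sample[metric] > limit]


def composite_alert(sample, thresholds):
    """Alert only when every metric in the composite threshold is breached."""
    return len(breached_metrics(sample, thresholds)) == len(thresholds)


one_high = {"heap": 92.0, "threadpool": 40.0, "connectionpool": 35.0}
all_high = {"heap": 92.0, "threadpool": 95.0, "connectionpool": 88.0}

print(breached_metrics(one_high, UPPER_THRESHOLDS), composite_alert(one_high, UPPER_THRESHOLDS))
print(breached_metrics(all_high, UPPER_THRESHOLDS), composite_alert(all_high, UPPER_THRESHOLDS))
```

Keeping the list of breached metrics separate from the alert decision makes it easy to report which metrics contributed when an alert does fire.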

Example:

The following figure shows an example of the response times of 3 APIs, and how an alert is generated only when all 3 APIs demonstrate a high response time.

Self-tuning and Self-learning Thresholds

Enterprise IT systems undergo change on a continuous basis. These include business changes such as onboarding of new applications or change in business workload. These also include technology changes such as version upgrades or topology changes. It is important to accommodate these changes while computing thresholds to ensure that the thresholds are accurately representing the normal behaviour ranges.
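One simple way to let thresholds follow such changes is to recompute them over a sliding window of recent data. The sketch below assumes this approach; the class name, the window size, and the data are hypothetical, and a real system might instead re-baseline only when a change is detected.

```python
import statistics
from collections import deque


class RollingBaseline:
    """Keeps the most recent samples and derives thresholds from them."""

    def __init__(self, window=50, k=3.0):
        self.values = deque(maxlen=window)  # older samples drop out automatically
        self.k = k

    def update(self, value):
        self.values.append(value)

    def thresholds(self):
        mean = statistics.fmean(self.values)
        std = statistics.pstdev(self.values)
        return mean - self.k * std, mean + self.k * std


baseline = RollingBaseline(window=50)

# Hypothetical metric: about 20 before a version upgrade, about 40 after it.
for i in range(50):
    baseline.update(20 + i % 3)
before = baseline.thresholds()

for i in range(50):
    baseline.update(40 + i % 3)
after = baseline.thresholds()

print(f"before upgrade: {before[0]:.1f} .. {before[1]:.1f}")
print(f"after upgrade:  {after[0]:.1f} .. {after[1]:.1f}")
```

The window size is a trade-off: a short window adapts quickly but can absorb a slowly developing problem into the baseline, while a long window is more stable but slower to reflect a legitimate change.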

Incorporating expert feedback also helps in making thresholds realistic and more applicable to real-world scenarios. User feedback helps to identify special business scenarios such as sales campaigns, promotional events, or housekeeping activities. It also helps to fine-tune the aggressiveness of the thresholds. For instance, business-critical applications need thresholds where even minor deviations from normal behaviour require alerting. On the other hand, machines in development or testing environments, or housekeeping applications, might prefer more relaxed thresholds for alerting. Such feedback from an SME can help further fine-tune the thresholds.
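Both kinds of feedback can be sketched as inputs to the threshold computation: tags that mark special periods to be excluded from the baseline, and a criticality level that sets how aggressive the band is. The tag names, criticality levels, and multipliers below are hypothetical.

```python
import statistics

# Hypothetical mapping from SME-assigned criticality to the band width multiplier.
SENSITIVITY_K = {"critical": 2.0, "standard": 3.0, "relaxed": 4.0}


def tuned_thresholds(samples, criticality="standard",
                     excluded_tags=("campaign", "housekeeping")):
    """Compute thresholds from (value, tag) pairs, skipping SME-tagged special periods."""
    values = [value for value, tag in samples if tag not in excluded_tags]
    k = SENSITIVITY_K[criticality]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - k * std, mean + k * std


# Hypothetical samples; the SME has tagged two values as part of a sales campaign.
samples = [
    (20, None), (21, None), (19, None), (20, None), (22, None),
    (80, "campaign"), (85, "campaign"),
    (20, None), (21, None), (19, None),
]

critical = tuned_thresholds(samples, "critical")
relaxed = tuned_thresholds(samples, "relaxed")
print(f"critical: {critical[0]:.2f} .. {critical[1]:.2f}")
print(f"relaxed:  {relaxed[0]:.2f} .. {relaxed[1]:.2f}")
```

Excluding the tagged campaign values keeps them from inflating the band, and the same history yields a tighter band for a business-critical application than for a development machine.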

Conclusion

Normal behaviour characterisation is more than a mere technical configuration; it is the backbone of effective anomaly detection and operational health management. It tackles the twin problems of missing genuine anomalies and generating false alerts, and helps generate just the right alerts at the right time. Consequently, it helps in reducing downtime, improving operational efficiency, and optimizing resource utilization. As organizations embrace smarter, adaptive threshold strategies, they can unlock the potential for proactive issue detection, enhanced resilience, and streamlined, noise-free operations. The result? A more agile, efficient business that stays ahead of challenges, drives productivity, and maintains seamless continuity.