The Oxford Dictionary defines an “incident” as “an instance of something happening; an event or occurrence”. In enterprise IT, however, the word is far less benign. An incident is usually unexpected and unwelcome, and it tends to send a flutter through the operations team on the floor.
IT infrastructure incidents are primarily of three types:
- Capacity-related incidents occur when a capacity threshold is breached. Examples include excessive CPU utilization, memory utilization, disk space utilization, and network bandwidth utilization.
- Workload- or Performance-related incidents occur when response times of a service are unacceptable, usually when there is an increase in transaction volumes. Examples include workload jobs taking too long to complete, extraordinarily high month-end processing requirements, and credit card transactions timing out.
- State-related incidents occur when an entity gets into an undesirable state. Examples include a server being unreachable, a service being down, a network port being down, or an application not being available.
Incidents have some interesting characteristics:
- They have a mob mentality. Incidents rarely occur individually. An event storm is a fairly common phenomenon in IT operations.
- Incidents are interrelated. Capacity constraints in processing power can lead to poor response times, high workloads can cause capacity problems, or unavailable services can cause high workloads on other resources and thus degrade service times.
- A vast majority of incidents are predictable. This is especially true for capacity and performance incidents. Capacity fills up over time and you can predict up front when the thresholds will be breached. Workloads tend to show some temporal behavior patterns in terms of peak hours, peak days, or peak seasons.
- A vast majority of predictable incidents can be avoided. For example, if you anticipate a capacity breach in the near future, you can avoid it by augmenting capacity beforehand.
- Incidents rarely occur just once; they have a tendency to repeat. Incidents that recur frequently usually share an underlying root cause; resolving it eliminates those incidents once and for all.
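To make the point about predictable capacity incidents concrete, here is a minimal sketch that fits a straight line through daily utilization samples and estimates how many days remain before a threshold is crossed. The function name `days_until_breach`, the daily sampling interval, and the choice of a simple linear trend are all assumptions made for illustration; real capacity data is often seasonal and may need a richer model.

```python
from typing import Optional, Sequence


def days_until_breach(
    samples: Sequence[float], threshold: float
) -> Optional[float]:
    """Estimate days until a linear trend through daily samples hits threshold.

    samples[i] is the utilization (in percent) observed on day i (assumed
    to be evenly spaced, one sample per day).
    Returns None when usage is flat or shrinking, and 0.0 when the latest
    sample has already reached the threshold.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if samples[-1] >= threshold:
        return 0.0

    # Ordinary least-squares fit of utilization against the day index.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so no breach is predicted
    intercept = mean_y - slope * mean_x

    # Day on which the fitted line crosses the threshold,
    # expressed relative to the most recent sample.
    breach_day = (threshold - intercept) / slope
    return max(breach_day - (n - 1), 0.0)
```

For example, `days_until_breach([70, 72, 74, 76, 78], threshold=90)` returns `6.0`: usage grows by two points a day, so the 90 percent mark is six days away, which leaves time to add capacity beforehand.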
Intelligent Incident Management involves understanding the characteristics of each incident and handling it appropriately. Smart incident processing should, at the very least, include the following components:
- Incident Filtering: Filtering out spurious or unimportant incidents
- Incident Handling: Triaging and fixing incidents that pass through the filters
- Incident Pre-emption: Foreseeing incidents before they happen and pre-empting them
- Incident Prevention: Understanding frequently occurring incidents and resolving them at the core
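Three of the components above can be sketched in a few lines of code. The `Incident` fields, the 1 to 5 severity scale, and the thresholds below are hypothetical choices made for illustration, not a prescribed design. Pre-emption is left out because it depends on forecasting from metric history rather than on the incident stream itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Incident:
    entity: str    # host or service name (hypothetical field)
    kind: str      # "capacity", "performance", or "state"
    severity: int  # assumed scale: 1 (informational) to 5 (critical)


def filter_incidents(
    incidents: Iterable[Incident], min_severity: int = 3
) -> List[Incident]:
    """Incident Filtering: drop low-severity events and event-storm repeats."""
    seen: Set[Tuple[str, str]] = set()
    kept: List[Incident] = []
    for inc in incidents:
        key = (inc.entity, inc.kind)
        if inc.severity < min_severity or key in seen:
            continue  # spurious, unimportant, or already reported
        seen.add(key)
        kept.append(inc)
    return kept


def triage(incidents: Iterable[Incident]) -> List[Incident]:
    """Incident Handling: order surviving incidents, most severe first."""
    return sorted(incidents, key=lambda inc: inc.severity, reverse=True)


def prevention_candidates(
    history: Iterable[Incident], min_repeats: int = 3
) -> List[Tuple[str, str]]:
    """Incident Prevention: find (entity, kind) pairs that keep recurring."""
    counts = Counter((inc.entity, inc.kind) for inc in history)
    return [key for key, n in counts.items() if n >= min_repeats]
```

A typical flow would call `triage(filter_incidents(events))` on the live stream to get a short, ordered work queue, and run `prevention_candidates` over a longer history to surface the recurring incidents that deserve a root-cause fix.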
Do you see this happening in your enterprise? What strategies do you apply for incident handling? Please reach out to us with your views and comments, and feel free to read our paper on Intelligent Incident Management.
by Harish Iyer, Head of Product Engineering