Incident Management and Problem Management are two distinct processes in IT service management, yet the terms are often used interchangeably. On the surface, this might seem like a matter of terminology. But what if one is a small hiccup and the other could dent your entire quarterly or annual results?
The distinction matters because, without a permanent fix, the same incidents keep reappearing. Companies that do not understand the difference cannot run the two processes separately, either because they do not know how or because they lack the right tools. This not only uses up a large share of the support team’s time but also undermines continual service improvement (CSI).
Recognizing the key differences between Incident and Problem Management gives you, as a business operator or technology vendor, the clarity to focus on the right solution architecture and allocate resources efficiently.
To understand the difference, you first need to understand each process.
What is Incident Management?
There are several definitions and analogies that explain Incident Management, but the simplest approach is to focus first on the term itself: Incident.
ITIL defines an incident as “an unplanned interruption to an IT service or a reduction in the quality of an IT service. An event that is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and customer productivity.”
In colloquial terms, incidents are unplanned disruptions that get in the way of business as usual.
In 2012, an incident at the Royal Bank of Scotland left 6.5 million customers facing a severe outage. The bank suffered a PR disaster and paid GBP 56 million in fines and damages.
Another infamous incident took place on September 14, 2004, when air traffic control in Los Angeles lost voice communication with over 400 airplanes in the southwestern United States. It could have led to collisions, causing severe loss of life and property.
Incident Management best practices include:
- Restoring the affected process to normal operation as quickly as possible.
- Minimizing the impact on business operations.
- Ensuring that the issue is resolved within the parameters of the Service-Level Agreements.
- Using as few resources as possible to control costs.
The larger an organization, the more likely it is to run into Incidents. This does not mean you have to devote a boatload of resources to resolving them. A structured approach to Incident Management yields results that are both efficient and effective.
The Incident Management Workflow
- Logging: Incidents coming in from the users are logged in this stage. Name, Contact Details, Employee ID, Asset Tags, Critical User Status, and Location are collected.
- Assignment: The raised ticket carries severity and category information. Based on that, it is assigned to technical support staff.
- Categorization: Each Incident should carry a clear category and sub-category. Based on this, it can automatically be sorted for prioritization.
- Prioritization: Priority is assigned based on projected user impact, business impact, and SLA requirements.
- Tracking: The ticket should carry precise details of the User’s Name, Contact, and Incident Description so that it can be tracked against SLAs.
- Closure: This is the stage where the rubber meets the road. Once the Incident has been resolved and the affected process has been restored, the Incident can be closed. An Incident Report is then prepared and added to the system. If similar incidents come up, referencing this report can help resolve them quickly.
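The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Incident` fields mirror the details collected at logging, while the priority thresholds, the `P1`/`P2`/`P3` labels, and the SLA targets in `SLA_TARGETS` are assumptions made up for the example and should be replaced with the values from your own SLAs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical SLA resolution targets per priority level.
SLA_TARGETS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}


@dataclass
class Incident:
    # Details captured at the Logging and Categorization stages.
    user_name: str
    contact: str
    description: str
    category: str
    subcategory: str
    users_affected: int
    critical_user: bool
    logged_at: datetime
    status: str = "open"


def prioritize(incident: Incident) -> str:
    """Derive a priority from user and business impact (illustrative thresholds)."""
    if incident.critical_user or incident.users_affected >= 100:
        return "P1"
    if incident.users_affected >= 10:
        return "P2"
    return "P3"


def sla_deadline(incident: Incident) -> datetime:
    """Time by which the incident must be resolved to stay within its SLA."""
    return incident.logged_at + SLA_TARGETS[prioritize(incident)]


def is_sla_breached(incident: Incident, now: datetime) -> bool:
    """Tracking: an incident breaches its SLA if it is still open past the deadline."""
    return incident.status != "closed" and now > sla_deadline(incident)
```

Deriving the deadline from the priority, rather than storing it, keeps tracking consistent if an incident is re-prioritized after it is logged.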
When Does an Incident Become a Problem?
You may ask how we can determine when an Incident becomes a Problem without first defining what a Problem is. The answer is simple: a Problem starts with an Incident.
When the same or a similar issue keeps occurring even after resolution, it may point to a larger underlying issue. If different users in different locations report similar issues at different times under similar conditions, the cause may be systemic. In such cases, you can conclude with reasonable confidence that you are dealing with a Problem and not an Incident.
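The recurrence test described above can be expressed as a simple grouping rule. The sketch below assumes each incident is a dictionary with `user`, `location`, `category`, and `subcategory` keys; the function name `find_problem_candidates` and the default thresholds are hypothetical and would need tuning against real ticket data.

```python
from collections import defaultdict


def find_problem_candidates(incidents, min_incidents=3, min_users=2, min_locations=2):
    """Flag (category, subcategory) groups that recur across users and locations.

    A group is a Problem candidate when it contains enough incidents reported
    by enough distinct users in enough distinct locations.
    """
    groups = defaultdict(list)
    for inc in incidents:
        groups[(inc["category"], inc["subcategory"])].append(inc)

    candidates = []
    for key, items in groups.items():
        users = {i["user"] for i in items}
        locations = {i["location"] for i in items}
        if (
            len(items) >= min_incidents
            and len(users) >= min_users
            and len(locations) >= min_locations
        ):
            candidates.append(key)
    return sorted(candidates)
```

A rule like this only surfaces candidates. Someone still has to confirm that the grouped incidents share a cause before a Problem is logged.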
What is Problem Management?
A standalone Incident is a one-off issue, by definition. If it repeats over time or across customer segments, you have a Problem on your hands. The logged Incident is the symptom, while the undiscovered Problem is the cause.
A problem is defined by ITIL as “a condition often identified as a result of multiple Incidents that exhibit common symptoms. Problems can also be identified from a single significant Incident, indicative of a single error, for which the cause is unknown, but for which the impact is significant.”
After the Royal Bank of Scotland was penalized, it took major steps to fix the issue as a matter of good practice. RBS discovered that a systemic fault in its batch scheduling was the underlying cause of the incident.
Similarly, the investigation into the Los Angeles air traffic control incident, in which voice communication with the planes was lost and the backup system also broke down, found that the problem lay in an internal countdown timer that shut the system down once it reached zero.
The common characteristics of all effective Problem Management systems include:
- A primary focus on Root Cause Analysis.
- Analyzing the systemic impact.
- Designing a solution that completely eradicates the cause or minimizes possible future damages.
Problem Management resembles inductive reasoning: we are establishing causality. Because it calls for deep analysis of a system and verification of potential causes, solving a Problem is more complex than resolving an Incident. Yet, with a logical framework and the right tools, you can establish a system that speeds up the resolution process.
The Problem Management Workflow
- Detection: The entire framework is dependent on this step. Rules should be in place to detect patterns of recurrence of issues across a timeline or segments of customers. This helps in the detection of a Problem. Once a problem is detected, it should be logged.
- Categorization: The Problem should be categorized according to the incidents under its umbrella. This helps in understanding which categories of incidents cause the most systemic failures.
- Prioritization: Establishing causality may require reviewing the entire codebase or inspecting the entire IT infrastructure, both of which are resource-intensive. Prioritization should therefore weigh the business impact: the number of users impacted, processes halted, and value affected.
- Analysis: Once the urgency has been established, the search for a root cause begins. The focus should be on diagnosing the problem, breaking it into fragments, and investigating the root causes. Once the root cause has been identified, it can either be fixed with a quick corrective change, or certain functionality may have to be paused while the solution is integrated into the system.
- Resolution & Closure: Once the Problem is resolved, it should be recorded, and the resolution process should be communicated accurately. This helps the team resolve the same Problem quickly if it recurs.
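To make the prioritization and closure steps concrete, here is a minimal Python sketch. The `Problem` record, the weights in `impact_score`, and the rule that a Problem cannot be closed without a recorded root cause and resolution are all illustrative assumptions, not part of ITIL or of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    title: str
    category: str
    incident_ids: List[str] = field(default_factory=list)  # incidents under its umbrella
    users_impacted: int = 0
    processes_halted: int = 0
    value_affected: int = 0  # e.g. revenue at risk, in whole currency units
    root_cause: Optional[str] = None
    resolution: Optional[str] = None
    status: str = "detected"


def impact_score(problem: Problem) -> int:
    """Combine users, halted processes, and value into one score (hypothetical weights)."""
    return (
        problem.users_impacted
        + 50 * problem.processes_halted
        + problem.value_affected // 1000
    )


def rank_problems(problems: List[Problem]) -> List[Problem]:
    """Order problems so the highest business impact is analyzed first."""
    return sorted(problems, key=impact_score, reverse=True)


def close_problem(problem: Problem, root_cause: str, resolution: str) -> Problem:
    """Record the root cause and resolution before closing, so they can be reused."""
    if not root_cause.strip() or not resolution.strip():
        raise ValueError("A Problem needs a recorded root cause and resolution to close.")
    problem.root_cause = root_cause
    problem.resolution = resolution
    problem.status = "closed"
    return problem
```

Refusing to close a Problem without a documented root cause is one way to make sure the record exists when the same Problem resurfaces.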
Incident vs. Problem Management: How Can Businesses Use the Difference to Their Advantage?
At the most basic level, Incidents are all about the ‘What’ and Problems are all about the ‘Why’. Incidents are meant to have quick resolutions, whereas Problems need deep analysis. Beyond this distinction, understanding the differences has a tangible business impact:
Incidents need to be resolved as quickly as possible to minimize the impact on business processes. Naturally, resources are allocated to them first to protect the user experience. Once the incidents are resolved, the focus should shift to identifying the root cause by detecting the underlying Problem.
However, most small to medium organizations have one team managing both Incidents and Problems. As a result, Problems keep losing priority to the more pressing Incidents. This can cause deep and irreversible damage in the long run. Understanding the difference between an Incident and a Problem helps you deploy resources to the latter as soon as it appears on the radar, in parallel with managing other Incidents.
Software teams like to promise great user experience round the clock. But systemic problems cannot be resolved overnight. If a team is facing a Problem and treating it like an Incident, they are bound to end up overpromising and under-delivering – the exact opposite of what they should be doing.
The reverse also has its repercussions. If an Incident is treated as a Problem, the issue becomes over-complicated and resources are overspent.
A quick fix or workaround can solve an Incident within hours of it being raised. Problems, on the other hand, call for more sophisticated solutions. If Incidents appear to be developing into a Problem, the IT team should prepare for a systemic update to eradicate or fix the underlying cause. This can have a big impact on the Service Level Agreements that are in place.
The two processes are quite different, but they depend on each other. In these unprecedented times more than ever before, if you think your IT support team can manage issues with only a simple Incident Management system, think again.
For your business to truly succeed while the world battles a global pandemic, it needs to provide strong support to the remote workforce. Motadata ServiceOps is an ITSM module with built-in Incident and Problem Management, along with other features such as Change/Release Management, Knowledge Management, and Asset Management. With ServiceOps, your business can identify, diagnose, and manage the complete lifecycle of Incidents and related Problems in a single pane of glass.