Incident Management

An incident is an unplanned interruption to an IT service or a reduction in the quality of an IT service. Incident management restores normal service operation as quickly as possible while minimising adverse impact on business operations. This procedure covers the complete incident lifecycle from detection through closure, enabling consistent response regardless of which team member handles the incident.

The distinction between incidents and other work types determines how you respond. A user unable to access email is an incident requiring immediate restoration. A user requesting a new shared mailbox is a service request handled through Request Fulfilment. A pattern of repeated email outages requiring root cause elimination is a problem addressed through Problem Management. Security-related incidents such as malware infections or account compromises follow separate playbooks in the Incident Response section, though initial detection and logging use the same mechanisms described here.

Prerequisites

Before handling incidents, ensure you have the following access and information:

ITSM tool access: Create, update, and close incident records; minimum of Incident Analyst role
Monitoring dashboards: View current alerts and service status; read access to primary monitoring platform
Knowledge base: Search and retrieve troubleshooting articles; read access to internal knowledge base
Communication channels: Post status updates; access to incident notification channel and service status page
Escalation contacts: Current on-call roster with phone numbers; access to escalation contact list
Configuration data: Service-to-owner mappings and CI relationships; read access to CMDB or service catalogue

Verify your ITSM tool access by confirming you can create a test incident (mark it immediately as test/cancelled). Confirm monitoring access by viewing the current service health dashboard. If either access is missing, contact your service desk manager before proceeding with incident handling duties.

You should also have completed incident management training covering your organisation’s specific priority matrix, escalation thresholds, and major incident procedures. Training records are maintained in the HR system; confirm completion with your manager if uncertain.

Procedure

Detecting and logging incidents

Incidents reach the service desk through multiple channels: monitoring alerts, user reports via phone, email, or portal, and direct observation by IT staff. Regardless of source, every incident requires a logged record within 15 minutes of detection.

  1. Receive the incident notification through any channel. For monitoring alerts, the alert itself provides initial details. For user reports, gather the symptoms, affected service, number of users impacted, and when the issue started. For phone calls, document during the conversation rather than relying on memory afterward.

  2. Search the ITSM tool for existing incidents matching the symptoms. Use the service name, error message, or affected system as search terms. If a matching open incident exists with the same root symptoms, link this report to the existing incident as an additional affected user rather than creating a duplicate.

  3. Create a new incident record if no match exists. Complete the required fields:

    • Title: Concise description starting with affected service, for example “Email - Outlook unable to connect to server” rather than “User cannot access email”
    • Description: Full symptom details including error messages (exact text), affected users (names or count), time issue started, and any recent changes the user is aware of
    • Contact: Reporting user’s name and preferred contact method
    • Service: Select from service catalogue; if uncertain, select the closest match and note uncertainty in the description
    • Category: Select initial category based on symptoms; this may change during diagnosis
  4. Record your incident number and provide it to the reporting user for reference. Set expectations for next contact, for example “I will update you within 2 hours or sooner if we resolve the issue.”

The logging step must complete even when resolution seems obvious. Unlogged incidents create gaps in service reporting, prevent pattern detection for problem management, and leave no audit trail for compliance purposes.
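
The search-before-create logic in steps 2 and 3 can be sketched as follows. This models incident records as plain dictionaries with illustrative field names; `log_report` is a hypothetical helper, not your ITSM tool's API.

```python
from datetime import datetime

def log_report(open_incidents, service, symptom, reporter):
    """Link a new report to an existing open incident with the same
    service and symptom (step 2), or create a new record (step 3)."""
    for inc in open_incidents:
        if (inc["status"] != "Closed"
                and inc["service"] == service
                and inc["symptom"] == symptom):
            inc["affected_users"].append(reporter)  # link, don't duplicate
            return inc
    new = {
        "title": f"{service} - {symptom}",  # title starts with the service
        "service": service,
        "symptom": symptom,
        "affected_users": [reporter],
        "status": "New",
        "created": datetime.now(),
    }
    open_incidents.append(new)
    return new
```

A second report of the same symptom returns the existing record with the reporter appended, which keeps user counts accurate for impact assessment.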

Categorising and prioritising

Categorisation groups incidents by the type of service or infrastructure affected, enabling routing to appropriate resolver groups and supporting trend analysis. Prioritisation determines the order in which incidents receive attention, based on the combination of business impact and urgency.

  1. Assign a category based on the affected service area. Your organisation’s category structure derives from the service catalogue. A three-level hierarchy provides sufficient granularity without excessive complexity:

     Level 1: Service area (Email, Network, Applications, Hardware, Access)
     Level 2: Specific service (Exchange Online, VPN, Finance System, Laptop, Active Directory)
     Level 3: Component (Mailbox, Attachment, Connection, Login, Password Reset)

Select the most specific category that accurately describes the incident. Miscategorised incidents route to wrong teams and delay resolution.

  2. Assess impact by determining how many users or business functions are affected:

     1 - Extensive: Entire organisation affected; core business function unavailable; over 100 users impacted
     2 - Significant: Department or office affected; important business function degraded; 20-100 users impacted
     3 - Moderate: Team or workgroup affected; business function has workaround; 5-20 users impacted
     4 - Minor: Individual user affected; workaround available; 1-4 users impacted
  3. Assess urgency by determining how quickly the business needs resolution:

     1 - Critical: Work cannot continue; no workaround; deadline or compliance at risk
     2 - High: Work significantly impaired; workaround inadequate; impact worsening
     3 - Medium: Work affected but continuing; acceptable workaround exists
     4 - Low: Minimal immediate effect; can wait for scheduled resolution
  4. Derive priority from the intersection of impact and urgency using this matrix:

                              URGENCY
+---------------+------------+--------+----------+-------+
| IMPACT        | 1-Critical | 2-High | 3-Medium | 4-Low |
+---------------+------------+--------+----------+-------+
| 1-Extensive   |     P1     |   P1   |    P2    |  P3   |
+---------------+------------+--------+----------+-------+
| 2-Significant |     P1     |   P2   |    P2    |  P3   |
+---------------+------------+--------+----------+-------+
| 3-Moderate    |     P2     |   P2   |    P3    |  P4   |
+---------------+------------+--------+----------+-------+
| 4-Minor       |     P3     |   P3   |    P4    |  P4   |
+---------------+------------+--------+----------+-------+
Priority definitions:
P1 - Critical: Resolve within 1 hour; continuous work until resolved
P2 - High: Resolve within 4 hours; takes precedence over normal work
P3 - Medium: Resolve within 8 business hours; normal queue position
P4 - Low: Resolve within 40 business hours; scheduled as capacity allows

Figure 1: Priority matrix determining response and resolution targets

The priority determines your response target (time to begin active work) and resolution target (time to restore service). Response targets are typically 15 minutes for P1, 30 minutes for P2, 2 hours for P3, and 8 hours for P4. Resolution targets appear in the matrix above. Your organisation’s specific targets are defined in SLA Management.
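
The matrix in Figure 1 can be encoded as a lookup table, which is useful for ITSM automation rules or reporting scripts. This is a sketch: the response targets encode the "typical" values quoted above and should be replaced with the targets defined in your SLA Management documentation.

```python
# Priority matrix from Figure 1 as a lookup: (impact, urgency) -> priority.
PRIORITY_MATRIX = {
    1: {1: "P1", 2: "P1", 3: "P2", 4: "P3"},  # 1-Extensive
    2: {1: "P1", 2: "P2", 3: "P2", 4: "P3"},  # 2-Significant
    3: {1: "P2", 2: "P2", 3: "P3", 4: "P4"},  # 3-Moderate
    4: {1: "P3", 2: "P3", 3: "P4", 4: "P4"},  # 4-Minor
}

# Typical response targets in minutes; confirm against your own SLAs.
RESPONSE_TARGET_MINUTES = {"P1": 15, "P2": 30, "P3": 120, "P4": 480}

def derive_priority(impact: int, urgency: int) -> str:
    """Derive P1-P4 from assessed impact and urgency (both 1-4)."""
    if impact not in PRIORITY_MATRIX or urgency not in PRIORITY_MATRIX[impact]:
        raise ValueError("impact and urgency must be integers 1-4")
    return PRIORITY_MATRIX[impact][urgency]
```

For example, a department-wide outage with no workaround (impact 2, urgency 1) derives P1 with a 15-minute response target.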

Initial diagnosis

Diagnosis identifies the cause of the incident sufficiently to enable resolution or to determine the correct escalation path. The goal is service restoration, not root cause analysis. Apply diagnostic effort proportionate to priority: a P1 incident should have someone actively diagnosing within 15 minutes, while a P4 incident can wait for scheduled diagnostic time.

  1. Review the incident details for clues: error messages, timing, affected scope, recent changes. Check whether the user has tried any troubleshooting steps and their results.

  2. Search the knowledge base for matching symptoms. Effective search terms include exact error message text, service name combined with symptom, and error codes. A matching article with verified solution moves directly to resolution.

  3. Check the monitoring dashboard for related alerts. An incident reported by one user may connect to infrastructure alerts showing broader issues. Cross-reference the incident time with alert timestamps.

  4. Query the CMDB or service documentation for the affected service’s dependencies. An email issue might stem from authentication services, network connectivity, or the mail platform itself. Understanding the service architecture guides diagnostic focus.

  5. Perform basic diagnostic steps appropriate to the category. For access issues, verify the account status in the identity provider. For connectivity issues, test network path from a known-good location. For application issues, confirm the service is running and accessible from the server side.

  6. Document findings in the incident record after each diagnostic step. Record what you checked, what you found, and what this rules in or out. This documentation prevents duplicate effort if the incident escalates or spans shift changes.

Diagnosis for P1 and P2 incidents should identify a resolution path or an escalation target within 15-30 minutes. If diagnosis stalls, escalate rather than continuing to investigate while the business impact grows.
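
Step 3 of the diagnosis, cross-referencing the incident time with alert timestamps, can be automated as a simple window query. A minimal sketch, assuming alerts are available as `(timestamp, alert_id)` pairs and that a 30-minute lookback window is appropriate; both assumptions should be tuned to your monitoring platform.

```python
from datetime import datetime, timedelta

def related_alerts(incident_start, alerts, window_minutes=30):
    """Return IDs of monitoring alerts raised in the window leading up
    to (and including) the incident start time. `alerts` is a list of
    (timestamp, alert_id) pairs; window size is an assumption."""
    window = timedelta(minutes=window_minutes)
    return [alert_id for ts, alert_id in alerts
            if incident_start - window <= ts <= incident_start]
```

An incident reported at 10:00 would surface alerts from 09:30 onwards, pointing diagnosis toward a broader infrastructure issue rather than a single-user fault.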

Escalation

Escalation transfers an incident to individuals or groups with greater expertise, authority, or access. Functional escalation moves the incident to a specialist team when first-line skills are insufficient. Hierarchical escalation notifies management when resolution targets are at risk or business impact requires executive awareness.

+------------------------------------------------------------------+
|                     ESCALATION DECISION FLOW                     |
+------------------------------------------------------------------+

                  +-------------------+
                  | Resolution within |
                  |    capability?    |
                  +---------+---------+
                            |
              +-------------+-------------+
              | Yes                       | No
              v                           v
    +-------------------+      +-------------------+
    |     Continue      |      | Identify correct  |
    |    resolution     |      |  resolver group   |
    +-------------------+      +---------+---------+
                                         |
                                         v
                               +-------------------+
                               |    Functional     |
                               |    escalation     |
                               +-------------------+

                  +-------------------+
                  | Resolution target |
                  |     at risk?      |
                  +---------+---------+
                            |
              +-------------+-------------+
              | No                        | Yes
              v                           v
    +-------------------+      +-------------------+
    |     Continue      |      |   Hierarchical    |
    |    resolution     |      |    escalation     |
    +-------------------+      +---------+---------+
                                         |
                                         v
                               +-------------------+
                               | Notify management |
                               |  per escalation   |
                               |      matrix       |
                               +-------------------+

Figure 2: Escalation decision flow for functional and hierarchical paths

  1. Determine whether functional escalation is needed. Indicators include: the issue requires access you do not have, diagnosis points to a system outside your expertise, or the technical complexity exceeds first-line capability. Identify the appropriate resolver group from the service-to-owner mapping in the CMDB.

  2. Execute functional escalation by reassigning the incident to the target group. Update the incident with a clear summary of diagnosis completed, findings, and what you believe the escalation target needs to investigate. Do not simply reassign without context; the receiving team should understand why they are receiving the incident and what has already been tried.

  3. Determine whether hierarchical escalation is needed. Triggers include: P1 incidents at any point, P2 incidents not resolved within 2 hours, any incident where the resolution target will be breached, incidents affecting executives or external stakeholders, and incidents with potential financial or reputational impact exceeding thresholds (for example, affecting donor systems during a campaign).

  4. Execute hierarchical escalation by notifying the appropriate management level. For P1 incidents, notify the IT manager or director immediately by phone, not just email. Provide: incident number, affected service, business impact, current status, and estimated time to resolution. Update the incident record to show that escalation occurred.

  5. Continue working the incident after escalation. Escalation transfers awareness and authority, not responsibility for resolution. You remain assigned until a resolver group accepts the incident or management explicitly reassigns it.
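
The hierarchical escalation triggers in step 3 reduce to a single boolean check. A sketch: the 2-hour P2 threshold comes from this procedure, while the remaining triggers are passed in as flags (`target_at_risk`, `affects_executives`, `high_business_impact` are illustrative parameter names).

```python
def needs_hierarchical_escalation(priority, minutes_open,
                                  target_at_risk=False,
                                  affects_executives=False,
                                  high_business_impact=False):
    """Apply the hierarchical escalation triggers from step 3:
    P1 at any point, P2 unresolved beyond 2 hours, a resolution
    target at risk, executive/external impact, or financial or
    reputational impact exceeding thresholds."""
    return (
        priority == "P1"
        or (priority == "P2" and minutes_open > 120)  # P2 > 2 hours
        or target_at_risk
        or affects_executives
        or high_business_impact
    )
```

Running this check whenever an incident is updated gives a consistent prompt to escalate before a breach, rather than after.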

Resolution and recovery

Resolution restores normal service operation. A workaround that restores service is a valid resolution even if the underlying cause remains; permanent fixes are the domain of problem management.

  1. Implement the identified fix or workaround. Follow the resolution steps from the knowledge base article if one exists, or apply the solution identified during diagnosis. For changes requiring formal approval, verify that emergency change procedures apply if this is a P1/P2 incident, or wait for approval if the incident priority permits.

  2. Verify that the fix resolves the symptom. Test from the user’s perspective where possible: can they now access the service, complete the transaction, or perform the work that was blocked? For widespread incidents, test with multiple affected users before declaring resolution.

  3. Document the resolution in the incident record. Record what was done, not just “fixed” or “resolved.” Specific documentation enables knowledge base updates and helps if the issue recurs. Example: “Cleared corrupted profile by renaming C:\Users\jsmith\AppData\Local\Microsoft\Outlook folder and restarting Outlook, which regenerated the profile.”

  4. Communicate resolution to affected users. For individually reported incidents, contact the reporter directly. For widespread incidents, update the service status page and notify the affected user group. Include what was wrong, what was done, and any actions users need to take.

Closure

Closure confirms that service is restored and the user accepts the resolution. Incidents should not remain open indefinitely awaiting user confirmation.

  1. Contact the reporting user to confirm resolution. Ask explicitly whether the issue is resolved from their perspective. If they report continuing issues, reopen diagnosis rather than closing and creating a new incident.

  2. If the user confirms resolution, close the incident. Select the appropriate closure code (resolved, workaround applied, no fault found, user error, duplicate, cancelled). Enter closure notes summarising the root symptom and the resolution.

  3. If the user does not respond within 48 hours after resolution notification, send a final confirmation request stating the incident will close automatically if no response is received within 24 hours. After that period, close with a closure code of “resolved - no response” and document the outreach attempts.

  4. Review whether the incident should trigger problem management. Recurring incidents (same symptom from same or different users), incidents with only workarounds (no permanent fix), and major incidents always warrant problem records. Create a problem record or add to an existing one as appropriate.

  5. Identify knowledge gaps. If no knowledge base article existed for this symptom, or the existing article was incomplete, create or update the article following Knowledge Management procedures.
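
The no-response timeline in step 3 (48 hours of silence, a final notice, then a 24-hour grace period, 72 hours in total) is easy to get wrong by hand. A sketch of the decision logic, assuming the relevant timestamps are available from the incident record:

```python
from datetime import datetime, timedelta

def closure_action(resolved_at, now, user_confirmed=False,
                   final_notice_sent_at=None):
    """Decide the next closure step per steps 1-3: close on user
    confirmation, send a final notice after 48 hours of silence,
    and auto-close 24 hours after that notice."""
    if user_confirmed:
        return "close"
    if final_notice_sent_at is not None:
        if now - final_notice_sent_at >= timedelta(hours=24):
            return "close: resolved - no response"
        return "wait"
    if now - resolved_at >= timedelta(hours=48):
        return "send final confirmation request"
    return "wait"
```

The returned strings are illustrative; map them to your ITSM tool's closure codes and notification actions.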

Major incident handling

A major incident causes significant business disruption and requires coordinated response beyond normal incident management. Criteria for declaring a major incident include: P1 priority, multiple services affected, public-facing service unavailable, incident duration exceeding 30 minutes with no resolution in sight, or explicit declaration by IT management or business leadership.
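
The declaration criteria above can be expressed as a single check, which is useful for alerting rules that prompt the on-call manager. A sketch; the parameter names are illustrative, and an explicit declaration by management always wins.

```python
def is_major_incident(priority, services_affected, public_facing_down,
                      minutes_unresolved, declared_by_management=False):
    """Apply the major incident criteria from this procedure: P1,
    multiple services affected, a public-facing service unavailable,
    more than 30 minutes with no resolution in sight, or explicit
    declaration by IT management or business leadership."""
    return (
        priority == "P1"
        or services_affected > 1
        or public_facing_down
        or minutes_unresolved > 30
        or declared_by_management
    )
```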

+------------------------------------------------------------------+
| MAJOR INCIDENT STRUCTURE |
+------------------------------------------------------------------+
| |
| +------------------+ +------------------+ |
| | Incident Manager | | Communications | |
| | (coordination) | | Lead | |
| +--------+---------+ +--------+---------+ |
| | | |
| +------------+-----------+ |
| | |
| +---------v----------+ |
| | Technical Lead | |
| | (diagnosis/fix) | |
| +---------+----------+ |
| | |
| +------------------+------------------+ |
| | | | |
| v v v |
| +---+---+ +---+---+ +---+---+ |
| | SME 1 | | SME 2 | | SME 3 | |
| +-------+ +-------+ +-------+ |
| |
+------------------------------------------------------------------+
| BRIDGE CALL |
| All participants remain connected until incident resolved |
| Incident manager controls speaking order and actions |
+------------------------------------------------------------------+

Figure 3: Major incident organisational structure showing roles and communication

  1. Declare the major incident formally. The on-call manager, IT director, or designated authority makes the declaration. Update the incident record to major incident status, which triggers notifications and changes escalation rules.

  2. Establish the incident bridge. Start a conference call or virtual meeting and circulate the join details to all required participants: incident manager, technical lead, subject matter experts for affected services, and communications lead. The bridge remains open until resolution.

  3. Assign roles explicitly. The incident manager coordinates activity and makes decisions; they do not perform technical work. The technical lead directs diagnostic and resolution efforts. The communications lead handles stakeholder updates. Subject matter experts perform technical investigation and fixes.

  4. Implement communications cadence. Send initial notification within 15 minutes of major incident declaration to: executive leadership, affected department heads, and the all-staff incident notification channel. Subsequent updates occur every 30 minutes or immediately on significant status change. Update the service status page with user-facing impact description.

  5. Focus on service restoration. The major incident bridge is not for root cause analysis, blame, or unrelated issues. The incident manager actively controls the discussion to maintain focus. Parallel investigation threads report findings to the technical lead, who coordinates the resolution path.

  6. Document actions in real time. A designated scribe (often the incident manager or a delegate) records all significant actions, findings, and decisions with timestamps. This log supports the post-incident review.

  7. Declare resolution when service is restored and verified. The incident manager makes the formal declaration. Send resolution notification to all stakeholders who received the incident notification. The bridge closes after resolution confirmation and handover instructions are clear.

  8. Schedule the post-incident review within 48 hours. Major incidents always require formal review. Create the review meeting, assign the facilitator, and ensure the incident log and timeline are preserved.

Incident record template

Use this structure when creating incident records. Fields marked with asterisks are mandatory at creation; others are completed during the lifecycle.

INCIDENT RECORD
===============
Record Information
------------------
Incident ID: [Auto-generated]*
Created: [Date/time]*
Created by: [Analyst name]*
Last updated: [Date/time]
Status: [New | In Progress | Pending | Resolved | Closed]*
Classification
--------------
Category: [Level 1 > Level 2 > Level 3]*
Service: [From service catalogue]*
Priority: [P1 | P2 | P3 | P4]*
Impact: [1 | 2 | 3 | 4]*
Urgency: [1 | 2 | 3 | 4]*
Major incident: [Yes | No]
Contact Information
-------------------
Reported by: [User name]*
Contact method: [Phone | Email | Portal | Walk-up]*
Contact details: [Phone/email for updates]*
VIP: [Yes | No]
Incident Details
----------------
Title: [Service - Brief symptom description]*
Description: [Full symptom details, error messages, timeline]*
Affected users: [Count or list]
Affected location: [Office, region, or remote]
Related alerts: [Monitoring alert IDs if applicable]
Related incidents: [Linked incident IDs if applicable]
Related changes: [Recent change IDs if applicable]
Assignment
----------
Assigned group: [Resolver group name]*
Assigned to: [Individual analyst if applicable]
Escalated to: [Higher tier group if escalated]
Escalated time: [Date/time of escalation]
Timeline
--------
Response target: [Date/time based on priority]
Resolution target: [Date/time based on priority]
Responded: [Date/time first action taken]
Resolved: [Date/time service restored]
Closed: [Date/time after user confirmation]
Work Log
--------
[Date/time] [Analyst] - [Action taken and findings]
[Date/time] [Analyst] - [Action taken and findings]
...
Resolution
----------
Resolution code: [Resolved | Workaround | No fault found | User error | Duplicate | Cancelled]
Resolution notes: [What was done to restore service]
Root cause: [If known; otherwise "Pending problem investigation"]
Knowledge article: [Link to related KB article or "Created: KB-XXXXX"]
Closure
-------
Closure code: [Confirmed by user | No response | Duplicate | Cancelled]
User satisfaction: [Survey result if collected]
Problem record: [Link if problem created]
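
The asterisked fields can be enforced programmatically at creation time rather than fixed at closure. A sketch, assuming the record arrives as a dictionary with snake_case keys mirroring the template labels (the key names are assumptions, not a defined schema):

```python
# Mandatory-at-creation fields: the asterisked fields in the template.
MANDATORY_AT_CREATION = [
    "incident_id", "created", "created_by", "status",
    "category", "service", "priority", "impact", "urgency",
    "reported_by", "contact_method", "contact_details",
    "title", "description", "assigned_group",
]

def missing_mandatory_fields(record: dict) -> list:
    """Return mandatory fields that are absent or empty, so the
    record can be rejected before it enters the queue."""
    return [field for field in MANDATORY_AT_CREATION
            if not record.get(field)]
```

A non-empty return value means the record should be bounced back to the logging analyst before assignment.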

Verification

After resolving and closing incidents, verify that the incident management process is functioning correctly through these checks.

The incident record is complete when all mandatory fields contain accurate data, the work log shows the diagnostic and resolution steps, and resolution notes describe specifically what was done rather than just “fixed” or “resolved.”

Response and resolution targets were met when the timestamps show first action within the response target and service restoration within the resolution target. For incidents that breached targets, verify that escalation occurred and breach reasons are documented.

The user was contacted when the incident record shows communication to the reporter at resolution and closure, or shows the documented outreach attempts for no-response closures.

Knowledge was captured when recurring or novel incidents resulted in new or updated knowledge base articles, and major incidents have post-incident review meetings scheduled.

Run this verification query monthly against closed incidents:

SELECT
    incident_id,
    priority,
    CASE WHEN responded <= response_target THEN 'Met' ELSE 'Breached' END AS response_sla,
    CASE WHEN resolved <= resolution_target THEN 'Met' ELSE 'Breached' END AS resolution_sla,
    CASE WHEN resolution_notes IS NOT NULL AND LENGTH(resolution_notes) > 20 THEN 'Yes' ELSE 'No' END AS documented,
    CASE WHEN closure_code IS NOT NULL THEN 'Yes' ELSE 'Open' END AS closed
FROM incidents
WHERE created >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
  AND status = 'Closed'
ORDER BY priority, created;

Expected output shows over 90% of incidents meeting response SLA, over 85% meeting resolution SLA, and 100% with documented resolution notes.
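
The per-incident rows from the query can be rolled up into the percentages those thresholds refer to. A sketch in Python, assuming the results are fetched as a list of dictionaries keyed by the query's column aliases:

```python
def sla_summary(rows):
    """Summarise verification query output into compliance percentages
    for comparison against the 90% / 85% / 100% targets. `rows` is a
    list of dicts with response_sla, resolution_sla, and documented keys."""
    n = len(rows)
    if n == 0:
        return {"response_pct": 0.0, "resolution_pct": 0.0, "documented_pct": 0.0}

    def pct(key, value):
        return 100.0 * sum(1 for r in rows if r[key] == value) / n

    return {
        "response_pct": pct("response_sla", "Met"),
        "resolution_pct": pct("resolution_sla", "Met"),
        "documented_pct": pct("documented", "Yes"),
    }
```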

Troubleshooting

Symptom: User reports issue but no matching service exists in catalogue
Cause: Service catalogue incomplete or user describes service incorrectly
Resolution: Select closest matching service, note the discrepancy in description, and flag for service catalogue update

Symptom: Cannot determine priority because user exaggerates impact
Cause: User conflates personal inconvenience with business criticality
Resolution: Ask objective questions: “How many people are affected?” and “What business process is blocked?” Override stated urgency with assessed urgency

Symptom: Incident assigned to wrong resolver group
Cause: Miscategorisation during logging or incorrect service-to-group mapping
Resolution: Reassign immediately with apology note to receiving group; update the mapping if it was incorrect

Symptom: Resolver group rejects escalation claiming insufficient information
Cause: Escalation documentation inadequate
Resolution: Document diagnosis steps already taken, specific findings, and clear explanation of why escalation is needed; re-escalate with improved context

Symptom: User insists incident is P1 but impact assessment shows P3
Cause: Priority matrix based on objective criteria, not user perception
Resolution: Explain the priority criteria and what would change the priority; do not override the matrix based on user insistence unless business leadership intervenes

Symptom: Incident keeps reopening after resolution
Cause: Root cause not addressed or workaround inadequate
Resolution: Escalate to problem management rather than repeatedly resolving symptoms; create problem record linking all related incidents

Symptom: Multiple incidents logged for same issue
Cause: Users unaware of existing incident or logging duplicate by mistake
Resolution: Merge duplicates into single incident, notify all affected users with the master incident number, monitor for additional duplicates

Symptom: Resolution target will be breached
Cause: Diagnosis stalled, waiting for vendor response, or insufficient resources
Resolution: Initiate hierarchical escalation before breach, not after; request additional resources or management intervention

Symptom: Incident requires change that needs CAB approval but cannot wait
Cause: Conflict between change control and incident urgency
Resolution: Invoke emergency change procedure for P1/P2 incidents; IT manager or designated authority can approve emergency changes outside normal CAB

Symptom: User does not respond to resolution confirmation
Cause: User busy, on leave, or email went to spam
Resolution: Send follow-up with explicit deadline; if still no response, close with “no response” code after documented outreach (typically 72 hours total)

Symptom: Major incident bridge becomes chaotic with cross-talk
Cause: Too many participants or unclear roles
Resolution: Incident manager asserts control: mute all, assign speaking order, eject non-essential participants; limit bridge to incident manager, tech lead, comms lead, and active SMEs

Symptom: Post-major incident, users report service still degraded
Cause: Resolution verified prematurely or secondary issue exists
Resolution: Reopen the major incident if same root cause; create new incident if different cause; do not force new reports into closed major incident

Symptom: Incident data quality poor across team
Cause: Inconsistent logging practices or training gaps
Resolution: Review sample incidents with team, identify specific quality issues, retrain on logging standards, implement quality checks in weekly review

Symptom: Historical incidents cannot be found when searching
Cause: Inconsistent titles, poor categorisation, or weak search terms in ITSM tool
Resolution: Improve title conventions (start with service name), standardise category usage, add searchable keywords to description field

See also