Problem Management
Problem management identifies and eliminates the root causes of incidents. Where incident management restores service as quickly as possible, problem management investigates why the incident occurred and implements permanent fixes to prevent recurrence. You perform problem management both reactively, triggered by incidents that have already occurred, and proactively, identifying potential failures before they cause service disruption.
A problem is the underlying cause of one or more incidents. Problems remain open until you identify the root cause, document a workaround or permanent resolution, and verify that the fix prevents recurrence. A known error is a problem with a documented root cause and either a workaround or a resolution path. The distinction matters: problems represent investigation work in progress, while known errors represent understood conditions awaiting resolution.
Prerequisites
Before initiating problem investigation, confirm you have the following access and resources available.
- Incident data access
- Read access to incident records in the service management system, including full incident history, related configuration items, and resolution notes. You need visibility into incident patterns across at least 90 days to identify meaningful trends.
- System access
- Appropriate access to investigate affected systems. For infrastructure problems, this includes read access to logs, monitoring data, and configuration. For application problems, this includes access to application logs, error tracking systems, and deployment history.
- Analysis tools
- Access to log aggregation and search tools, monitoring dashboards, and diagramming software for root cause visualisation. If your organisation uses a dedicated problem management tool, you need rights to create and update problem records.
- Time allocation
- Problem investigation requires uninterrupted analysis time. A typical problem investigation takes 2 to 8 hours of focused work. Major problems affecting critical services require 16 to 40 hours across multiple analysts. Secure this time before beginning investigation.
- Stakeholder availability
- Identify subject matter experts for affected systems. Root cause analysis requires input from people who understand normal system behaviour, recent changes, and historical issues. Confirm their availability before scheduling investigation sessions.
Verify your incident data is current and complete:
```sql
-- Check incident data completeness for problem analysis
SELECT
    category,
    COUNT(*) AS incident_count,
    COUNT(CASE WHEN root_cause IS NULL THEN 1 END) AS missing_root_cause,
    COUNT(CASE WHEN resolution_notes IS NULL THEN 1 END) AS missing_resolution,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at))/3600) AS avg_resolution_hours
FROM incidents
WHERE created_at > NOW() - INTERVAL '90 days'
  AND status = 'resolved'
GROUP BY category
ORDER BY incident_count DESC;
```

This query reveals gaps in incident documentation that will impede problem analysis. Categories with high missing_root_cause percentages indicate areas where incident management is not capturing sufficient diagnostic information.
Procedure
Problem management follows two distinct paths: reactive investigation triggered by incidents, and proactive identification through trend analysis. Both paths converge at root cause analysis and resolution.
Identifying problems reactively
Reactive problem identification begins when incidents indicate an underlying issue worth investigating. Not every incident warrants a problem record. Create a problem record when any of these conditions apply: a major incident (Priority 1 or 2) has occurred, three or more incidents share the same root cause within 30 days, or an incident reveals a systemic vulnerability regardless of impact.
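The three triggers above can be expressed as a small check. The following is a sketch in Python, assuming illustrative incident fields (`priority`, `root_cause`, `created_at`, and a manually set `systemic_vulnerability` flag) rather than any specific ITSM schema:

```python
from datetime import datetime, timedelta

def should_create_problem(incident, related_incidents):
    """Return True when any documented trigger for a problem record applies."""
    # Trigger 1: a major incident (Priority 1 or 2) has occurred.
    if incident["priority"] in (1, 2):
        return True
    # Trigger 2: three or more incidents share the same root cause within 30 days.
    window_start = incident["created_at"] - timedelta(days=30)
    same_cause = [
        i for i in related_incidents
        if i["root_cause"] is not None
        and i["root_cause"] == incident["root_cause"]
        and i["created_at"] >= window_start
    ]
    if len(same_cause) >= 3:
        return True
    # Trigger 3: the incident reveals a systemic vulnerability regardless
    # of impact -- a human judgement, recorded here as a flag.
    return incident.get("systemic_vulnerability", False)
```

Encoding the triggers this way makes the criteria testable and keeps problem creation consistent across incident managers.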
Review the incident record and all related incidents. Extract the symptoms reported by users, the technical indicators observed by support staff, and the resolution applied. Note whether the resolution was a workaround (service restored but underlying issue remains) or a permanent fix.
Query for related incidents using the affected configuration item, error codes, and symptoms:
```sql
-- Find related incidents for problem correlation
SELECT
    incident_id,
    created_at,
    summary,
    affected_ci,
    category,
    resolution_code,
    resolution_notes
FROM incidents
WHERE (
        affected_ci = 'CI-00847'                 -- Same configuration item
        OR summary ILIKE '%connection timeout%'  -- Similar symptoms
        OR error_code = 'ERR-5023'               -- Same error code
      )
  AND created_at > NOW() - INTERVAL '90 days'
ORDER BY created_at DESC;
```

Create a problem record linking all related incidents. Include the common symptoms, affected services, and business impact. Set the problem priority based on the highest-priority linked incident and the frequency of occurrence.
Assign the problem to an analyst with expertise in the affected technology. For problems spanning multiple technology domains, assign a primary investigator and identify supporting analysts from each domain.
Identifying problems proactively
Proactive problem identification detects patterns before they cause major incidents. This approach reduces service disruption by addressing root causes while their impact remains limited.
- Generate an incident trend report covering the previous 90 days. Group incidents by category, affected service, and configuration item:
```sql
-- Incident trend analysis for proactive problem identification
SELECT
    category,
    affected_service,
    affected_ci,
    COUNT(*) AS incident_count,
    COUNT(DISTINCT DATE(created_at)) AS affected_days,
    SUM(CASE WHEN priority IN (1, 2) THEN 1 ELSE 0 END) AS major_incidents,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at))/60) AS avg_mttr_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '90 days'
  AND status = 'resolved'
GROUP BY category, affected_service, affected_ci
HAVING COUNT(*) >= 3  -- Threshold for pattern significance
ORDER BY incident_count DESC, major_incidents DESC;
```

- Review monitoring data for near-miss conditions. Systems that approach but do not exceed thresholds indicate emerging problems. Query for metrics that reached 80% of alerting thresholds:
```promql
# Find near-threshold conditions over 7 days
min_over_time(
  (node_filesystem_avail_bytes / node_filesystem_size_bytes)[7d:1h]
) < 0.25
```

This identifies filesystems that dropped below 25% free space at any point in the window, even if they recovered before triggering alerts. Note the use of `min_over_time`: taking the minimum of the free-space ratio catches transient dips, whereas `max_over_time` would only match filesystems that stayed below the threshold for the entire window.
- Analyse change correlation. Problems frequently emerge from recent changes. Compare incident timing against the change log:
```sql
-- Correlate incidents with recent changes
SELECT
    c.change_id,
    c.summary AS change_summary,
    c.implemented_at,
    COUNT(i.incident_id) AS subsequent_incidents,
    MIN(i.created_at) AS first_incident,
    EXTRACT(EPOCH FROM (MIN(i.created_at) - c.implemented_at))/3600 AS hours_to_first_incident
FROM changes c
LEFT JOIN incidents i ON (
    i.affected_ci = ANY(c.affected_cis)
    AND i.created_at > c.implemented_at
    AND i.created_at < c.implemented_at + INTERVAL '7 days'
)
WHERE c.implemented_at > NOW() - INTERVAL '30 days'
GROUP BY c.change_id, c.summary, c.implemented_at
HAVING COUNT(i.incident_id) >= 2
ORDER BY subsequent_incidents DESC;
```

- Create problem records for identified patterns. Link the contributing incidents and document the pattern observed. Proactive problems begin at lower priority than reactive problems but escalate if investigation reveals significant risk.
Logging and categorising problems
Problem records require structured data to enable effective tracking, reporting, and knowledge reuse.
Create the problem record with required fields:
| Field | Content |
|---|---|
| Title | Concise description of the symptom pattern, not the suspected cause |
| Category | Technology domain (network, application, infrastructure, security) |
| Affected service | Primary business service impacted |
| Affected CIs | Configuration items involved, linked from the CMDB |
| Priority | Based on business impact and incident frequency |
| Status | New, Investigation, Known Error, Pending Change, Resolved |
| Related incidents | Links to all correlated incident records |

Document initial observations in the investigation log. Include timestamps for all entries. The investigation log becomes the audit trail for the analysis process.
Set the target resolution date based on priority:
| Priority | Target resolution | Review frequency |
|---|---|---|
| Critical | 5 business days | Daily |
| High | 15 business days | Twice weekly |
| Medium | 30 business days | Weekly |
| Low | 90 business days | Monthly |

These targets measure time to documented resolution, not necessarily implementation of a permanent fix. A problem is resolved when you have identified the root cause and either implemented a fix or created a change request for implementation.
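The target-date arithmetic can be sketched as follows. This version counts business days by skipping weekends only; a real implementation would also consult a holiday calendar:

```python
from datetime import date, timedelta

# Target resolution windows per priority, in business days.
TARGETS = {"critical": 5, "high": 15, "medium": 30, "low": 90}

def target_resolution_date(opened, priority):
    """Add the priority's allowance in business days, skipping weekends."""
    remaining = TARGETS[priority]
    current = opened
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current
```

For example, a critical problem opened on Monday 4 March 2024 gets a target of Monday 11 March, since the intervening weekend does not count.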
Root cause analysis
Root cause analysis determines why an incident occurred, not just what failed. Effective analysis distinguishes symptoms from causes and identifies the deepest actionable cause.
```
+------------------------------------------------------------------+
|                     ROOT CAUSE ANALYSIS FLOW                      |
+------------------------------------------------------------------+
                              |
                    +---------v---------+
                    |  Gather evidence  |
                    |  (logs, metrics,  |
                    |    interviews)    |
                    +---------+---------+
                              |
              +---------------+---------------+
              |                               |
    +---------v---------+           +---------v---------+
    |  Simple problem   |           |  Complex problem  |
    |  (single cause)   |           | (multiple factors)|
    +---------+---------+           +---------+---------+
              |                               |
    +---------v---------+           +---------v---------+
    |  5 Whys analysis  |           |    Ishikawa or    |
    |                   |           |    Fault Tree     |
    +---------+---------+           +---------+---------+
              |                               |
              +---------------+---------------+
                              |
                    +---------v---------+
                    |   Identify root   |
                    |     cause(s)      |
                    +---------+---------+
                              |
                    +---------v---------+
                    |   Validate with   |
                    |     evidence      |
                    +---------+---------+
                              |
                    +---------v---------+
                    |   Document and    |
                    |    create KEDB    |
                    +-------------------+
```

Figure 1: Root cause analysis flow from evidence gathering through documentation
Five Whys analysis works well for problems with a single causal chain. You ask “why” repeatedly until you reach a cause that, if addressed, would prevent the incident from recurring.
State the problem clearly: “The payment processing service was unavailable for 47 minutes on 14 March 2024.”
Ask why and answer with evidence:
- Why was the service unavailable? The application server stopped responding to requests.
- Why did the server stop responding? The JVM ran out of heap memory and entered a garbage collection loop.
- Why did the JVM exhaust heap memory? A memory leak in the session handling code accumulated objects over 6 days.
- Why did the memory leak occur? A code change on 8 March introduced a reference that prevented session objects from being garbage collected.
- Why was the defective code deployed? The change passed code review but the reviewer did not check for object lifecycle management.
Identify the root cause and contributing factors. In this example, the root cause is the code defect. Contributing factors include the absence of memory leak testing in the CI pipeline and the 6-day accumulation period before symptoms appeared.
Document the analysis chain in the problem record with evidence supporting each step.
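One way to keep the chain auditable is to record each "why" alongside the evidence that supports it. The following Python sketch uses illustrative evidence entries based on the example above; the specific log and dump references are examples, not taken from a real record:

```python
# Each step pairs a cause with the evidence supporting it. A step without
# evidence flags where the analysis rests on assumption rather than fact.
five_whys = [
    ("Application server stopped responding", "Load balancer health-check log"),
    ("JVM exhausted heap and looped in GC", "GC logs: >80% time in collection"),
    ("Memory leak accumulated session objects", "Heap dump: session object growth over 6 days"),
    ("Code change retained references to expired sessions", "Diff of 8 March release"),
    ("Code review did not cover object lifecycle", "Review checklist lacks lifecycle item"),
]

def unsupported_steps(chain):
    """Return the causes recorded without supporting evidence."""
    return [cause for cause, evidence in chain if not evidence]
```

Running `unsupported_steps` before closing the analysis catches steps that were asserted but never validated.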
Ishikawa (fishbone) analysis addresses complex problems with multiple contributing factors across different domains.
The example below analyses the problem "Database timeouts during peak hours", with contributing factors grouped by category:

- People: no DBA on call during peak hours; query optimisation skills gap
- Process: no capacity review in the change process; batch jobs scheduled during business hours
- Technology: connection pool sized for 2019 load; no query caching layer; 5-year-old indexes not rebuilt
- Environment: network latency to cloud database increased after ISP change
- Data: transaction volume 3x 2019 levels; table sizes exceed partition thresholds

Figure 2: Ishikawa contributing factors across categories
Draw the fishbone structure with the problem statement at the head. Use standard categories: People, Process, Technology, Environment, and Data.
Brainstorm potential contributing factors in each category. Include factors even if you are uncertain of their contribution. You will validate with evidence.
For each factor, gather supporting or refuting evidence. In the example above, validating “connection pool sized for 2019 load” requires comparing current pool configuration against current connection demand.
Identify the primary and contributing causes. Multiple factors can combine to cause a single problem. The root cause is the factor that, if addressed first, would have the greatest impact on prevention.
Fault tree analysis works backwards from a failure to identify all possible causes in a logical structure.
Define the top event (the failure you are analysing) and place it at the root of the tree.
Identify immediate causes using AND/OR logic gates. An AND gate means all child events must occur for the parent to occur. An OR gate means any child event is sufficient.
Continue decomposing until you reach basic events that cannot be further subdivided or events whose probability is known.
Analyse the tree to identify minimal cut sets: the smallest combinations of basic events that cause the top event.
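Minimal cut set extraction can be sketched for small trees. The gate structure and event names below are illustrative, not drawn from a real analysis:

```python
# A tiny fault-tree evaluator. Gates are nested tuples:
# ("OR", child, ...) or ("AND", child, ...); plain strings are basic events.
def cut_sets(node):
    """Return all cut sets of a node as a list of frozensets."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, *children = node
    if gate == "OR":
        # Any child alone is sufficient to cause the parent.
        return [cs for child in children for cs in cut_sets(child)]
    # AND: combine one cut set from every child.
    combined = [frozenset()]
    for child in children:
        combined = [acc | cs for acc in combined for cs in cut_sets(child)]
    return combined

def minimal_cut_sets(node):
    """Drop any cut set that is a superset of a smaller one."""
    minimal = []
    for cs in sorted(set(cut_sets(node)), key=len):
        if not any(m <= cs for m in minimal):
            minimal.append(cs)
    return minimal
```

For example, `("OR", ("AND", "pump_a_fails", "pump_b_fails"), "power_loss")` yields two minimal cut sets: the single event `power_loss`, and the pair of both pumps failing together.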
Managing the known error database
When you identify a root cause and document a workaround or resolution, the problem becomes a known error. The Known Error Database (KEDB) stores these records for reference during incident management.
- Create the known error record with structured fields:
```yaml
known_error:
  id: KE-2024-0147
  title: "JVM heap exhaustion in payment service under sustained load"
  root_cause: |
    Memory leak in session handling code introduced in release 2.4.1.
    SessionManager.createSession() stores reference in static map that
    is never cleared when session expires.
  symptoms:
    - Payment API response times exceed 5 seconds
    - JVM garbage collection consuming >80% CPU
    - OutOfMemoryError in application logs
  affected_cis:
    - APP-PAYMENT-PROD-01
    - APP-PAYMENT-PROD-02
  workaround: |
    Restart affected application server during low-traffic window.
    Schedule daily restart at 03:00 until permanent fix deployed.
    Monitor heap usage; restart if used heap exceeds 85%.
  permanent_resolution: |
    Deploy release 2.4.2 containing fix for session map cleanup.
    Change request CR-2024-0892 approved, scheduled for 2024-03-22.
  created: 2024-03-15
  status: pending_change
```

Link the known error to the originating problem record and all related incidents.
Notify service desk staff that a known error exists. Provide the symptoms and workaround so they can apply it during future incidents without escalation.
Update the known error status when the permanent resolution is implemented. Retain closed known errors for 2 years to support trend analysis and similar-problem identification.
Initiating changes for permanent resolution
Problems requiring system changes transition to the change management process. You create and track the change request while maintaining ownership of the problem until the change is verified effective.
- Document the proposed resolution in sufficient detail for change assessment:
```markdown
## Proposed Resolution

Deploy payment-service release 2.4.2 containing commit a7f3b2c1.

### Technical Details

- Modified SessionManager.java to use WeakReference for session map entries
- Added scheduled cleanup task running every 15 minutes
- Added heap usage metric exposed on /actuator/metrics endpoint

### Testing Completed

- Unit tests: 147 passed, 0 failed
- Load test: 10,000 sessions created/destroyed, heap stable at 2.1GB
- Soak test: 72 hours continuous load, no memory growth observed

### Rollback Plan

Redeploy release 2.4.1 from container registry. Reinstate daily restart schedule.

### Risk Assessment

Low risk. Change affects session lifecycle only. No database changes.
No external API changes. Rollback tested and verified.
```

Submit the change request referencing the problem record. The change advisory board evaluates risk and scheduling.
Track change implementation progress in the problem record. Update stakeholders on expected resolution dates.
After change implementation, verify the fix prevents recurrence. This verification occurs during problem closure.
Closing problems
Problem closure confirms that the root cause is addressed and provides data for continual improvement.
Verify the resolution effectiveness. For problems resolved by changes, confirm no related incidents have occurred for a defined observation period:
| Problem priority | Observation period |
|---|---|
| Critical | 14 days |
| High | 21 days |
| Medium | 30 days |
| Low | 45 days |

Document the final root cause and resolution in the problem record. This documentation feeds knowledge management and helps future analysts facing similar issues.
Update all linked incident records to reference the problem resolution. This enables reporting on incident reduction achieved through problem management.
Calculate problem management metrics:
```sql
-- Problem resolution metrics
SELECT
    DATE_TRUNC('month', resolved_at) AS month,
    COUNT(*) AS problems_resolved,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at))/86400) AS avg_days_to_resolve,
    SUM(linked_incident_count) AS incidents_addressed,
    COUNT(CASE WHEN resolution_type = 'permanent_fix' THEN 1 END) AS permanent_fixes,
    COUNT(CASE WHEN resolution_type = 'workaround_only' THEN 1 END) AS workaround_only
FROM problems
WHERE resolved_at > NOW() - INTERVAL '12 months'
GROUP BY DATE_TRUNC('month', resolved_at)
ORDER BY month DESC;
```

Close the problem record and notify stakeholders.
```
+------------------------------------------------------------------+
|               PROBLEM-INCIDENT-CHANGE RELATIONSHIP                |
+------------------------------------------------------------------+

 +-------------+       +-------------+       +-------------+
 |  Incident   |       |  Incident   |       |  Incident   |
 |  INC-001    |       |  INC-002    |       |  INC-003    |
 |  (resolved) |       |  (resolved) |       |  (resolved) |
 +------+------+       +------+------+       +------+------+
        |                     |                     |
        +---------------------+---------------------+
                              |
                     +--------v--------+
                     |     Problem     |
                     |     PRB-047     |
                     |                 |
                     |   Root cause    |
                     |   identified    |
                     +--------+--------+
                              |
               +--------------+--------------+
               |                             |
      +--------v--------+           +--------v--------+
      |   Known Error   |           |     Change      |
      |     KE-147      |           |     CHG-892     |
      |                 |           |                 |
      |   Workaround:   |           |  Permanent fix  |
      |  Daily restart  |           |  Release 2.4.2  |
      +--------+--------+           +--------+--------+
               |                             |
               +--------------+--------------+
                              |
                     +--------v--------+
                     |     Problem     |
                     |     PRB-047     |
                     |     (closed)    |
                     |                 |
                     |   Resolution    |
                     |    verified     |
                     +-----------------+
```

Figure 3: Relationship between incidents, problems, known errors, and changes
Proactive trend analysis
Beyond individual problem investigation, regular trend analysis identifies systemic issues and emerging risks.
Generate monthly trend reports covering:
- Incident volume by category and service
- Repeat incidents (same CI, same symptoms within 30 days)
- Incident-to-problem conversion rate
- Problem backlog age distribution
- Known error count and age
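Two of the report metrics above can be sketched in Python, using illustrative records in place of a real ITSM API query (the record shapes and IDs are assumptions for the example):

```python
from datetime import date

# Illustrative records; a real report would pull these from the service
# management system.
incidents = [
    {"id": "INC-001", "linked_problem": "PRB-047"},
    {"id": "INC-002", "linked_problem": "PRB-047"},
    {"id": "INC-003", "linked_problem": None},
    {"id": "INC-004", "linked_problem": None},
]
open_problems = [
    {"id": "PRB-047", "created": date(2024, 1, 10)},
    {"id": "PRB-051", "created": date(2024, 3, 1)},
]

def conversion_rate(incidents):
    """Share of incidents linked to a problem record."""
    linked = sum(1 for i in incidents if i["linked_problem"])
    return linked / len(incidents)

def backlog_ages(problems, today):
    """Age of each open problem in days, oldest first."""
    return sorted(((today - p["created"]).days for p in problems), reverse=True)
```

A low conversion rate suggests incident managers are not raising problem records; a long tail in the backlog age distribution points at investigations that have stalled.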
Identify patterns warranting proactive investigation:
```sql
-- Identify repeat incident patterns
SELECT
    affected_ci,
    category,
    COUNT(*) AS incident_count,
    COUNT(DISTINCT DATE(created_at)) AS distinct_days,
    ARRAY_AGG(DISTINCT symptom_code) AS symptom_codes,
    BOOL_OR(linked_problem_id IS NOT NULL) AS has_problem_record
FROM incidents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY affected_ci, category
HAVING COUNT(*) >= 3
   AND NOT BOOL_OR(linked_problem_id IS NOT NULL)
ORDER BY incident_count DESC;
```

This query finds incident patterns without associated problem records, indicating gaps in reactive problem identification.
- Present findings at the monthly problem review meeting. Prioritise proactive investigations based on potential business impact and investigation effort.
Verification
After completing problem closure, verify the effectiveness of your problem management activities.
Resolution verification: Confirm no related incidents have occurred during the observation period:
```sql
-- Verify no recurrence after problem resolution
SELECT COUNT(*) AS recurrence_count
FROM incidents i
JOIN problems p ON i.affected_ci = ANY(p.affected_cis)
WHERE p.problem_id = 'PRB-2024-0147'
  AND p.resolved_at IS NOT NULL
  AND i.created_at > p.resolved_at
  AND i.created_at < p.resolved_at + INTERVAL '30 days'
  AND i.category = p.category;

-- Expected result: 0
```

A non-zero result indicates the resolution was ineffective. Reopen the problem for further investigation.
Documentation verification: Confirm the known error database entry is complete and accessible:
```shell
# Verify KEDB entry accessibility
curl -s "https://itsm.example.org/api/v1/known-errors/KE-2024-0147" \
  -H "Authorization: Bearer $ITSM_TOKEN" \
  | jq '.status, .workaround, .permanent_resolution'

# Expected output:
# "closed"
# "Restart affected application server..."
# "Deploy release 2.4.2..."
```

Knowledge transfer verification: Confirm service desk staff can locate and apply the workaround:
- Ask a service desk analyst to search for the known error using likely symptoms
- Verify they can find the KEDB entry within 2 minutes
- Confirm the workaround instructions are clear and executable
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Root cause analysis produces multiple equally plausible causes | Insufficient evidence gathered before analysis | Return to evidence gathering. Query additional log sources, interview more stakeholders, extend the time window analysed |
| Five Whys analysis terminates at a human error | Stopped too early. Human error is a symptom, not a root cause | Continue asking why. What enabled the error? What controls should have prevented it? What process gap exists? |
| Problem investigation stalls awaiting expert availability | Critical expert is single point of knowledge | Document current findings and blockers. Escalate to management for resource prioritisation. Identify alternative experts or external consultants |
| Same problem reopened after declared resolved | Resolution addressed symptom, not root cause | Review the original root cause analysis. Apply Ishikawa to identify missed contributing factors. Extend the observation period before future closures |
| Known error workaround is not applied during incidents | Service desk unaware of KEDB entry | Review notification process. Integrate KEDB search into incident workflow. Add workaround summary to affected CI record |
| Proactive trend analysis produces false positives | Correlation mistaken for causation, or threshold too low | Increase the incident count threshold for pattern significance. Validate correlations with technical evidence before creating problem records |
| Change request rejected for problem resolution | Insufficient risk assessment or testing evidence | Document additional testing. Quantify business impact of not implementing the change. Request expedited CAB review if business impact is significant |
| Problem backlog growing faster than resolution rate | Insufficient analyst capacity or problem scope too broad | Prioritise critical and high problems. Close or merge low-priority duplicates. Consider workaround-only closure for low-impact problems |
| Incident-to-problem linking is inconsistent | No clear criteria for when to create or link to problems | Document specific triggers for problem creation. Train incident managers on linking criteria. Add problem identification to incident closure checklist |
| Root cause documentation is not reusable | Written for current investigation only, not future reference | Apply knowledge base article standards to KEDB entries. Include symptom-based titles, searchable keywords, and step-by-step workarounds |