Problem Management

Problem management identifies and eliminates the root causes of incidents. Where incident management restores service as quickly as possible, problem management investigates why the incident occurred and implements permanent fixes to prevent recurrence. You perform problem management both reactively, triggered by incidents that have already occurred, and proactively, identifying potential failures before they cause service disruption.

A problem is the underlying cause of one or more incidents. Problems remain open until you identify the root cause, document a workaround or permanent resolution, and verify that the fix prevents recurrence. A known error is a problem with a documented root cause and either a workaround or a resolution path. The distinction matters: problems represent investigation work in progress, while known errors represent understood conditions awaiting resolution.
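The lifecycle distinction can be modelled directly. The sketch below is illustrative Python, not a real ITSM schema: a problem transitions to a known error only once both a root cause and a workaround (or resolution path) are documented.

```python
# Illustrative sketch (not a real ITSM schema): a problem becomes a
# known error only when root cause AND workaround are both recorded.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    NEW = "new"
    INVESTIGATION = "investigation"
    KNOWN_ERROR = "known_error"   # root cause documented, workaround known
    RESOLVED = "resolved"

@dataclass
class Problem:
    problem_id: str
    title: str
    status: Status = Status.NEW
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    def record_root_cause(self, root_cause: str, workaround: str) -> None:
        """Transition from investigation work in progress to known error."""
        self.root_cause = root_cause
        self.workaround = workaround
        self.status = Status.KNOWN_ERROR

prb = Problem("PRB-047", "Payment service JVM heap exhaustion")
prb.record_root_cause("Memory leak in session handling code",
                      "Daily restart at 03:00")
print(prb.status.value)  # → known_error
```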

Prerequisites

Before initiating problem investigation, confirm you have the following access and resources available.

Incident data access
Read access to incident records in the service management system, including full incident history, related configuration items, and resolution notes. You need visibility into incident patterns across at least 90 days to identify meaningful trends.
System access
Appropriate access to investigate affected systems. For infrastructure problems, this includes read access to logs, monitoring data, and configuration. For application problems, this includes access to application logs, error tracking systems, and deployment history.
Analysis tools
Access to log aggregation and search tools, monitoring dashboards, and diagramming software for root cause visualisation. If your organisation uses a dedicated problem management tool, you need rights to create and update problem records.
Time allocation
Problem investigation requires uninterrupted analysis time. A typical problem investigation takes 2 to 8 hours of focused work. Major problems affecting critical services require 16 to 40 hours across multiple analysts. Secure this time before beginning investigation.
Stakeholder availability
Identify subject matter experts for affected systems. Root cause analysis requires input from people who understand normal system behaviour, recent changes, and historical issues. Confirm their availability before scheduling investigation sessions.

Verify your incident data is current and complete:

-- Check incident data completeness for problem analysis
SELECT
    category,
    COUNT(*) AS incident_count,
    COUNT(CASE WHEN root_cause IS NULL THEN 1 END) AS missing_root_cause,
    COUNT(CASE WHEN resolution_notes IS NULL THEN 1 END) AS missing_resolution,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)) / 3600) AS avg_resolution_hours
FROM incidents
WHERE created_at > NOW() - INTERVAL '90 days'
  AND status = 'resolved'
GROUP BY category
ORDER BY incident_count DESC;

This query reveals gaps in incident documentation that will impede problem analysis. Categories with high missing_root_cause percentages indicate areas where incident management is not capturing sufficient diagnostic information.
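For teams without direct SQL access to the service management system, the same check can be run over exported incident rows. This is a minimal sketch: the row shape and the 30% flagging threshold are illustrative assumptions, not part of any standard.

```python
# Flag categories whose missing-root-cause ratio exceeds a threshold.
# Incident rows and the 0.3 threshold are illustrative assumptions.
from collections import defaultdict

incidents = [
    {"category": "network",     "root_cause": None},
    {"category": "network",     "root_cause": "switch firmware bug"},
    {"category": "application", "root_cause": None},
    {"category": "application", "root_cause": None},
    {"category": "application", "root_cause": "memory leak"},
]

def completeness_gaps(rows, threshold=0.3):
    """Return {category: missing ratio} for categories above threshold."""
    totals, missing = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["category"]] += 1
        if row["root_cause"] is None:
            missing[row["category"]] += 1
    return {
        cat: missing[cat] / totals[cat]
        for cat in totals
        if missing[cat] / totals[cat] > threshold
    }

# network: 1 of 2 missing (0.5); application: 2 of 3 missing — both flagged
print(completeness_gaps(incidents))
```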

Procedure

Problem management follows two distinct paths: reactive investigation triggered by incidents, and proactive identification through trend analysis. Both paths converge at root cause analysis and resolution.

Identifying problems reactively

Reactive problem identification begins when incidents indicate an underlying issue worth investigating. Not every incident warrants a problem record. Create a problem record when any of these conditions apply: a major incident (Priority 1 or 2) has occurred, three or more incidents share the same root cause within 30 days, or an incident reveals a systemic vulnerability regardless of impact.
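The three trigger conditions can be expressed as a single check. The sketch below assumes plain-dict incident records with illustrative field names (`priority`, `root_cause`, `created_at`, `systemic_vulnerability`); a real service management schema will differ.

```python
# Sketch of the problem-record triggers: major incident, repeat root
# cause within 30 days, or systemic vulnerability. Field names are
# illustrative assumptions, not a real schema.
from datetime import datetime, timedelta

def warrants_problem_record(incident, related_incidents, now=None):
    now = now or datetime.now()
    # Trigger 1: a major incident (Priority 1 or 2) has occurred
    if incident["priority"] in (1, 2):
        return True
    # Trigger 2: three or more incidents share the root cause within 30 days
    shared = [i for i in related_incidents
              if i["root_cause"] == incident["root_cause"]
              and now - i["created_at"] <= timedelta(days=30)]
    if len(shared) >= 3:
        return True
    # Trigger 3: a systemic vulnerability was revealed, regardless of impact
    return incident.get("systemic_vulnerability", False)

low = {"priority": 4, "root_cause": "ERR-5023", "systemic_vulnerability": False}
print(warrants_problem_record(low, [], now=datetime(2024, 3, 15)))  # → False
```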

  1. Review the incident record and all related incidents. Extract the symptoms reported by users, the technical indicators observed by support staff, and the resolution applied. Note whether the resolution was a workaround (service restored but underlying issue remains) or a permanent fix.

  2. Query for related incidents using the affected configuration item, error codes, and symptoms:

-- Find related incidents for problem correlation
SELECT
    incident_id,
    created_at,
    summary,
    affected_ci,
    category,
    resolution_code,
    resolution_notes
FROM incidents
WHERE (
        affected_ci = 'CI-00847'                 -- Same configuration item
        OR summary ILIKE '%connection timeout%'  -- Similar symptoms
        OR error_code = 'ERR-5023'               -- Same error code
      )
  AND created_at > NOW() - INTERVAL '90 days'
ORDER BY created_at DESC;
  3. Create a problem record linking all related incidents. Include the common symptoms, affected services, and business impact. Set the problem priority based on the highest-priority linked incident and the frequency of occurrence.

  4. Assign the problem to an analyst with expertise in the affected technology. For problems spanning multiple technology domains, assign a primary investigator and identify supporting analysts from each domain.

Identifying problems proactively

Proactive problem identification detects patterns before they cause major incidents. This approach reduces service disruption by addressing root causes while their impact remains limited.

  1. Generate an incident trend report covering the previous 90 days. Group incidents by category, affected service, and configuration item:
-- Incident trend analysis for proactive problem identification
SELECT
    category,
    affected_service,
    affected_ci,
    COUNT(*) AS incident_count,
    COUNT(DISTINCT DATE(created_at)) AS affected_days,
    SUM(CASE WHEN priority IN (1, 2) THEN 1 ELSE 0 END) AS major_incidents,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)) / 60) AS avg_mttr_minutes
FROM incidents
WHERE created_at > NOW() - INTERVAL '90 days'
  AND status = 'resolved'
GROUP BY category, affected_service, affected_ci
HAVING COUNT(*) >= 3  -- Threshold for pattern significance
ORDER BY incident_count DESC, major_incidents DESC;
  2. Review monitoring data for near-miss conditions. Systems that approach but do not exceed thresholds indicate emerging problems. Query for metrics that reached 80% of alerting thresholds:
# Find filesystems that dipped below 25% free at any point over 7 days
min_over_time(
    (node_filesystem_avail_bytes / node_filesystem_size_bytes)[7d:1h]
) < 0.25

This identifies filesystems that dropped below 25% free space, even if they recovered before triggering alerts.

  3. Analyse change correlation. Problems frequently emerge from recent changes. Compare incident timing against the change log:
-- Correlate incidents with recent changes
SELECT
    c.change_id,
    c.summary AS change_summary,
    c.implemented_at,
    COUNT(i.incident_id) AS subsequent_incidents,
    MIN(i.created_at) AS first_incident,
    EXTRACT(EPOCH FROM (MIN(i.created_at) - c.implemented_at)) / 3600
        AS hours_to_first_incident
FROM changes c
LEFT JOIN incidents i ON (
    i.affected_ci = ANY(c.affected_cis)
    AND i.created_at > c.implemented_at
    AND i.created_at < c.implemented_at + INTERVAL '7 days'
)
WHERE c.implemented_at > NOW() - INTERVAL '30 days'
GROUP BY c.change_id, c.summary, c.implemented_at
HAVING COUNT(i.incident_id) >= 2
ORDER BY subsequent_incidents DESC;
  4. Create problem records for identified patterns. Link the contributing incidents and document the pattern observed. Proactive problems begin at lower priority than reactive problems but escalate if investigation reveals significant risk.

Logging and categorising problems

Problem records require structured data to enable effective tracking, reporting, and knowledge reuse.

  1. Create the problem record with required fields:

    Field               Content
    Title               Concise description of the symptom pattern, not the suspected cause
    Category            Technology domain (network, application, infrastructure, security)
    Affected service    Primary business service impacted
    Affected CIs        Configuration items involved, linked from the CMDB
    Priority            Based on business impact and incident frequency
    Status              New, Investigation, Known Error, Pending Change, Resolved
    Related incidents   Links to all correlated incident records
  2. Document initial observations in the investigation log. Include timestamps for all entries. The investigation log becomes the audit trail for the analysis process.

  3. Set the target resolution date based on priority:

    Priority    Target resolution    Review frequency
    Critical    5 business days      Daily
    High        15 business days     Twice weekly
    Medium      30 business days     Weekly
    Low         90 business days     Monthly

    These targets measure time to documented resolution, not necessarily implementation of a permanent fix. A problem is resolved when you have identified the root cause and either implemented a fix or created a change request for implementation.
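Because the targets count business days, the target date is not a simple calendar offset. A minimal sketch of the calculation (weekends skipped; public holidays would need an organisation-specific calendar and are omitted here):

```python
# Compute the target resolution date by walking forward day by day,
# counting only Monday-Friday. Holiday handling is deliberately omitted.
from datetime import date, timedelta

TARGET_BUSINESS_DAYS = {"critical": 5, "high": 15, "medium": 30, "low": 90}

def target_resolution_date(created: date, priority: str) -> date:
    remaining = TARGET_BUSINESS_DAYS[priority.lower()]
    current = created
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday .. 4 = Friday
            remaining -= 1
    return current

# A critical problem created on Friday 15 March 2024 is due the next Friday
print(target_resolution_date(date(2024, 3, 15), "critical"))  # → 2024-03-22
```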

Root cause analysis

Root cause analysis determines why an incident occurred, not just what failed. Effective analysis distinguishes symptoms from causes and identifies the deepest actionable cause.

+------------------------------------------------------------------+
|                     ROOT CAUSE ANALYSIS FLOW                     |
+------------------------------------------------------------------+
                              |
                    +---------v---------+
                    |  Gather evidence  |
                    |  (logs, metrics,  |
                    |    interviews)    |
                    +---------+---------+
                              |
              +---------------+---------------+
              |                               |
    +---------v---------+           +---------v---------+
    |  Simple problem   |           |  Complex problem  |
    |  (single cause)   |           |  (multiple        |
    |                   |           |   factors)        |
    +---------+---------+           +---------+---------+
              |                               |
    +---------v---------+           +---------v---------+
    |  5 Whys analysis  |           |    Ishikawa or    |
    |                   |           |    Fault Tree     |
    +---------+---------+           +---------+---------+
              |                               |
              +---------------+---------------+
                              |
                    +---------v---------+
                    |   Identify root   |
                    |     cause(s)      |
                    +---------+---------+
                              |
                    +---------v---------+
                    |   Validate with   |
                    |     evidence      |
                    +---------+---------+
                              |
                    +---------v---------+
                    |   Document and    |
                    |    create KEDB    |
                    +-------------------+

Figure 1: Root cause analysis flow from evidence gathering through documentation

Five Whys analysis works well for problems with a single causal chain. You ask “why” repeatedly until you reach a cause that, if addressed, would prevent the incident from recurring.

  1. State the problem clearly: “The payment processing service was unavailable for 47 minutes on 14 March 2024.”

  2. Ask why and answer with evidence:

    • Why was the service unavailable? The application server stopped responding to requests.
    • Why did the server stop responding? The JVM ran out of heap memory and entered a garbage collection loop.
    • Why did the JVM exhaust heap memory? A memory leak in the session handling code accumulated objects over 6 days.
    • Why did the memory leak occur? A code change on 8 March introduced a reference that prevented session objects from being garbage collected.
    • Why was the defective code deployed? The change passed code review but the reviewer did not check for object lifecycle management.
  3. Identify the root cause and contributing factors. In this example, the root cause is the code defect. Contributing factors include the absence of memory leak testing in the CI pipeline and the 6-day accumulation period before symptoms appeared.

  4. Document the analysis chain in the problem record with evidence supporting each step.
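The worked chain above can be captured as structured data so that each step carries its supporting evidence into the problem record. The pair structure and helper function are illustrative, not part of any ITSM tool.

```python
# A Five Whys chain as (statement, evidence) pairs. A step recorded
# without evidence is rejected; otherwise the deepest cause is returned.
why_chain = [
    ("Payment service unavailable for 47 minutes",
     "Monitoring shows outage window on 14 March"),
    ("Application server stopped responding",
     "Load balancer health-check failures in logs"),
    ("JVM exhausted heap and entered a GC loop",
     "GC logs show >90% time in collection"),
    ("Memory leak accumulated session objects over 6 days",
     "Heap dumps show growing session map"),
    ("Code change on 8 March prevented session garbage collection",
     "Release 2.4.1 diff; reviewer checklist gap"),
]

def deepest_cause(chain):
    """Reject unsupported steps, then return the last cause in the chain."""
    unsupported = [step for step, evidence in chain if not evidence]
    if unsupported:
        raise ValueError(f"No evidence recorded for: {unsupported}")
    return chain[-1][0]

print(deepest_cause(why_chain))
```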

Ishikawa (fishbone) analysis addresses complex problems with multiple contributing factors across different domains.

                         +---------------+
                         |   PROBLEM:    |
                         |   Database    |
                         |   timeouts    |
                         |   during      |
                         |   peak hours  |
                         +-------+-------+
                                 |
        +------------------------+------------------------+
        |                        |                        |
+-------v-------+        +-------v-------+        +-------v-------+
|    PEOPLE     |        |    PROCESS    |        |  TECHNOLOGY   |
+---------------+        +---------------+        +---------------+
| - No DBA on   |        | - No capacity |        | - Connection  |
|   call during |        |   review in   |        |   pool sized  |
|   peak hours  |        |   change      |        |   for 2019    |
|               |        |   process     |        |   load        |
| - Query       |        |               |        |               |
|   optimisation|        | - Batch jobs  |        | - No query    |
|   skills gap  |        |   scheduled   |        |   caching     |
|               |        |   during      |        |   layer       |
|               |        |   business    |        |               |
|               |        |   hours       |        | - 5-year-old  |
|               |        |               |        |   indexes not |
+---------------+        +---------------+        |   rebuilt     |
                                                  +---------------+

+---------------+        +---------------+
|  ENVIRONMENT  |        |     DATA      |
+---------------+        +---------------+
| - Network     |        | - Transaction |
|   latency to  |        |   volume 3x   |
|   cloud DB    |        |   2019 levels |
|   increased   |        |               |
|   after ISP   |        | - Table sizes |
|   change      |        |   exceed      |
|               |        |   partition   |
|               |        |   thresholds  |
+---------------+        +---------------+

Figure 2: Ishikawa diagram showing contributing factors across categories

  1. Draw the fishbone structure with the problem statement at the head. Use standard categories: People, Process, Technology, Environment, and Data.

  2. Brainstorm potential contributing factors in each category. Include factors even if you are uncertain of their contribution. You will validate with evidence.

  3. For each factor, gather supporting or refuting evidence. In the example above, validating “connection pool sized for 2019 load” requires comparing current pool configuration against current connection demand.

  4. Identify the primary and contributing causes. Multiple factors can combine to cause a single problem. The root cause is the factor that, if addressed first, would have the greatest impact on prevention.

Fault tree analysis works backwards from a failure to identify all possible causes in a logical structure.

  1. Define the top event (the failure you are analysing) and place it at the root of the tree.

  2. Identify immediate causes using AND/OR logic gates. An AND gate means all child events must occur for the parent to occur. An OR gate means any child event is sufficient.

  3. Continue decomposing until you reach basic events that cannot be further subdivided or events whose probability is known.

  4. Analyse the tree to identify minimal cut sets: the smallest combinations of basic events that cause the top event.
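The cut-set analysis in step 4 can be automated for small trees by simple expansion (a MOCUS-style sketch). The gate representation and event names below are illustrative.

```python
# Compute fault tree cut sets, then drop non-minimal ones. A node is a
# basic event (str) or a ("AND"/"OR", [children]) tuple.
def cut_sets(node):
    if isinstance(node, str):
        return [{node}]
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":                  # any child event is sufficient
        return [s for sets in child_sets for s in sets]
    combined = [set()]                # AND: all child events must occur
    for sets in child_sets:
        combined = [a | b for a in combined for b in sets]
    return combined

def minimal(sets):
    """Drop any cut set that strictly contains another."""
    return [s for s in sets if not any(other < s for other in sets)]

# Top event fails if both pumps fail together, or power is lost outright
tree = ("OR", [
    ("AND", ["pump_failure", "backup_pump_failure"]),
    "power_loss",
])
for s in minimal(cut_sets(tree)):
    print(sorted(s))
# → ['backup_pump_failure', 'pump_failure'] then ['power_loss']
```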

Managing the known error database

When you identify a root cause and document a workaround or resolution, the problem becomes a known error. The Known Error Database (KEDB) stores these records for reference during incident management.

  1. Create the known error record with structured fields:
known_error:
  id: KE-2024-0147
  title: "JVM heap exhaustion in payment service under sustained load"
  root_cause: |
    Memory leak in session handling code introduced in release 2.4.1.
    SessionManager.createSession() stores reference in static map
    that is never cleared when session expires.
  symptoms:
    - Payment API response times exceed 5 seconds
    - JVM garbage collection consuming >80% CPU
    - OutOfMemoryError in application logs
  affected_cis:
    - APP-PAYMENT-PROD-01
    - APP-PAYMENT-PROD-02
  workaround: |
    Restart affected application server during low-traffic window.
    Schedule daily restart at 03:00 until permanent fix deployed.
    Monitor heap usage; restart if used heap exceeds 85%.
  permanent_resolution: |
    Deploy release 2.4.2 containing fix for session map cleanup.
    Change request CR-2024-0892 approved, scheduled for 2024-03-22.
  created: 2024-03-15
  status: pending_change
  2. Link the known error to the originating problem record and all related incidents.

  3. Notify service desk staff that a known error exists. Provide the symptoms and workaround so they can apply it during future incidents without escalation.

  4. Update the known error status when the permanent resolution is implemented. Retain closed known errors for 2 years to support trend analysis and similar-problem identification.

Initiating changes for permanent resolution

Problems requiring system changes transition to the change management process. You create and track the change request while maintaining ownership of the problem until the change is verified effective.

  1. Document the proposed resolution in sufficient detail for change assessment:
## Proposed Resolution
Deploy payment-service release 2.4.2 containing commit a7f3b2c1.
### Technical Details
- Modified SessionManager.java to use WeakReference for session map entries
- Added scheduled cleanup task running every 15 minutes
- Added heap usage metric exposed on /actuator/metrics endpoint
### Testing Completed
- Unit tests: 147 passed, 0 failed
- Load test: 10,000 sessions created/destroyed, heap stable at 2.1GB
- Soak test: 72 hours continuous load, no memory growth observed
### Rollback Plan
Redeploy release 2.4.1 from container registry. Reinstate daily restart schedule.
### Risk Assessment
Low risk. Change affects session lifecycle only. No database changes.
No external API changes. Rollback tested and verified.
  2. Submit the change request referencing the problem record. The change advisory board evaluates risk and scheduling.

  3. Track change implementation progress in the problem record. Update stakeholders on expected resolution dates.

  4. After change implementation, verify the fix prevents recurrence. This verification occurs during problem closure.

Closing problems

Problem closure confirms that the root cause is addressed and provides data for continual improvement.

  1. Verify the resolution effectiveness. For problems resolved by changes, confirm no related incidents have occurred for a defined observation period:

    Problem priority    Observation period
    Critical            14 days
    High                21 days
    Medium              30 days
    Low                 45 days
  2. Document the final root cause and resolution in the problem record. This documentation feeds knowledge management and helps future analysts facing similar issues.

  3. Update all linked incident records to reference the problem resolution. This enables reporting on incident reduction achieved through problem management.

  4. Calculate problem management metrics:

-- Problem resolution metrics
SELECT
    DATE_TRUNC('month', resolved_at) AS month,
    COUNT(*) AS problems_resolved,
    AVG(EXTRACT(EPOCH FROM (resolved_at - created_at)) / 86400) AS avg_days_to_resolve,
    SUM(linked_incident_count) AS incidents_addressed,
    COUNT(CASE WHEN resolution_type = 'permanent_fix' THEN 1 END) AS permanent_fixes,
    COUNT(CASE WHEN resolution_type = 'workaround_only' THEN 1 END) AS workaround_only
FROM problems
WHERE resolved_at > NOW() - INTERVAL '12 months'
GROUP BY DATE_TRUNC('month', resolved_at)
ORDER BY month DESC;
  5. Close the problem record and notify stakeholders.
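The closure gate in step 1 can be enforced programmatically: a problem is eligible to close only after its observation period elapses with zero recurrences. A minimal sketch, assuming the recurrence count comes from the SQL check in the Verification section and the periods mirror the observation table:

```python
# Closure eligibility: observation window elapsed AND no recurrence.
# The OBSERVATION_DAYS mapping mirrors the priority table above.
from datetime import date, timedelta

OBSERVATION_DAYS = {"critical": 14, "high": 21, "medium": 30, "low": 45}

def can_close(priority: str, resolved_on: date,
              today: date, recurrences: int) -> bool:
    window_end = resolved_on + timedelta(days=OBSERVATION_DAYS[priority])
    return today >= window_end and recurrences == 0

print(can_close("high", date(2024, 3, 22), date(2024, 4, 15), 0))  # → True
print(can_close("high", date(2024, 3, 22), date(2024, 4, 1), 0))   # → False
```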
+------------------------------------------------------------------+
|               PROBLEM-INCIDENT-CHANGE RELATIONSHIP               |
+------------------------------------------------------------------+

+-------------+       +-------------+       +-------------+
|  Incident   |       |  Incident   |       |  Incident   |
|  INC-001    |       |  INC-002    |       |  INC-003    |
|  (resolved) |       |  (resolved) |       |  (resolved) |
+------+------+       +------+------+       +------+------+
       |                     |                     |
       +---------------------+---------------------+
                             |
                             v
                    +--------+--------+
                    |     Problem     |
                    |     PRB-047     |
                    |                 |
                    |   Root cause    |
                    |   identified    |
                    +--------+--------+
                             |
              +--------------+--------------+
              |                             |
              v                             v
     +--------+--------+           +--------+--------+
     |   Known Error   |           |     Change      |
     |     KE-147      |           |     CHG-892     |
     |                 |           |                 |
     |   Workaround:   |           |  Permanent fix  |
     |  Daily restart  |           |  Release 2.4.2  |
     +--------+--------+           +--------+--------+
              |                             |
              +--------------+--------------+
                             |
                             v
                    +--------+--------+
                    |     Problem     |
                    |     PRB-047     |
                    |    (closed)     |
                    |                 |
                    |   Resolution    |
                    |    verified     |
                    +-----------------+

Figure 3: Relationship between incidents, problems, known errors, and changes

Proactive trend analysis

Beyond individual problem investigation, regular trend analysis identifies systemic issues and emerging risks.

  1. Generate monthly trend reports covering:

    • Incident volume by category and service
    • Repeat incidents (same CI, same symptoms within 30 days)
    • Incident-to-problem conversion rate
    • Problem backlog age distribution
    • Known error count and age
  2. Identify patterns warranting proactive investigation:

-- Identify repeat incident patterns
SELECT
    affected_ci,
    category,
    COUNT(*) AS incident_count,
    COUNT(DISTINCT DATE(created_at)) AS distinct_days,
    ARRAY_AGG(DISTINCT symptom_code) AS symptom_codes,
    BOOL_OR(linked_problem_id IS NOT NULL) AS has_problem_record
FROM incidents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY affected_ci, category
HAVING COUNT(*) >= 3
   AND NOT BOOL_OR(linked_problem_id IS NOT NULL)
ORDER BY incident_count DESC;

This query finds incident patterns without associated problem records, indicating gaps in reactive problem identification.

  3. Present findings at the monthly problem review meeting. Prioritise proactive investigations based on potential business impact and investigation effort.
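One of the monthly metrics above, the incident-to-problem conversion rate, can be computed directly from exported incident rows. A minimal sketch; the row shape and field name `linked_problem_id` are illustrative assumptions.

```python
# Incident-to-problem conversion rate: the fraction of incidents that
# are linked to a problem record. Row shape is an illustrative assumption.
def conversion_rate(incidents):
    if not incidents:
        return 0.0
    linked = sum(1 for i in incidents if i.get("linked_problem_id"))
    return linked / len(incidents)

sample = [
    {"id": "INC-001", "linked_problem_id": "PRB-047"},
    {"id": "INC-002", "linked_problem_id": "PRB-047"},
    {"id": "INC-003", "linked_problem_id": None},
    {"id": "INC-004"},  # never triaged for problem linkage
]
print(conversion_rate(sample))  # → 0.5
```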

Verification

After completing problem closure, verify the effectiveness of your problem management activities.

Resolution verification: Confirm no related incidents have occurred during the observation period:

-- Verify no recurrence after problem resolution
SELECT COUNT(*) AS recurrence_count
FROM incidents i
JOIN problems p ON i.affected_ci = ANY(p.affected_cis)
WHERE p.problem_id = 'PRB-2024-0147'
  AND p.resolved_at IS NOT NULL
  AND i.created_at > p.resolved_at
  AND i.created_at < p.resolved_at + INTERVAL '30 days'
  AND i.category = p.category;
-- Expected result: 0

A non-zero result indicates the resolution was ineffective. Reopen the problem for further investigation.

Documentation verification: Confirm the known error database entry is complete and accessible:

# Verify KEDB entry accessibility
curl -s "https://itsm.example.org/api/v1/known-errors/KE-2024-0147" \
-H "Authorization: Bearer $ITSM_TOKEN" | jq '.status, .workaround, .permanent_resolution'
# Expected output:
# "closed"
# "Restart affected application server..."
# "Deploy release 2.4.2..."

Knowledge transfer verification: Confirm service desk staff can locate and apply the workaround:

  1. Ask a service desk analyst to search for the known error using likely symptoms
  2. Verify they can find the KEDB entry within 2 minutes
  3. Confirm the workaround instructions are clear and executable

Troubleshooting

Symptom: Root cause analysis produces multiple equally plausible causes
Cause: Insufficient evidence gathered before analysis
Resolution: Return to evidence gathering. Query additional log sources, interview more stakeholders, extend the time window analysed.

Symptom: Five Whys analysis terminates at a human error
Cause: Stopped too early; human error is a symptom, not a root cause
Resolution: Continue asking why. What enabled the error? What controls should have prevented it? What process gap exists?

Symptom: Problem investigation stalls awaiting expert availability
Cause: Critical expert is a single point of knowledge
Resolution: Document current findings and blockers. Escalate to management for resource prioritisation. Identify alternative experts or external consultants.

Symptom: Same problem reopened after declared resolved
Cause: Resolution addressed the symptom, not the root cause
Resolution: Review the original root cause analysis. Apply Ishikawa to identify missed contributing factors. Extend the observation period before future closures.

Symptom: Known error workaround is not applied during incidents
Cause: Service desk unaware of the KEDB entry
Resolution: Review the notification process. Integrate KEDB search into the incident workflow. Add a workaround summary to the affected CI record.

Symptom: Proactive trend analysis produces false positives
Cause: Correlation mistaken for causation, or threshold too low
Resolution: Increase the incident count threshold for pattern significance. Validate correlations with technical evidence before creating problem records.

Symptom: Change request rejected for problem resolution
Cause: Insufficient risk assessment or testing evidence
Resolution: Document additional testing. Quantify the business impact of not implementing the change. Request expedited CAB review if business impact is significant.

Symptom: Problem backlog growing faster than resolution rate
Cause: Insufficient analyst capacity or problem scope too broad
Resolution: Prioritise critical and high problems. Close or merge low-priority duplicates. Consider workaround-only closure for low-impact problems.

Symptom: Incident-to-problem linking is inconsistent
Cause: No clear criteria for when to create or link to problems
Resolution: Document specific triggers for problem creation. Train incident managers on linking criteria. Add problem identification to the incident closure checklist.

Symptom: Root cause documentation is not reusable
Cause: Written for the current investigation only, not future reference
Resolution: Apply knowledge base article standards to KEDB entries. Include symptom-based titles, searchable keywords, and step-by-step workarounds.

See also