Major Service Outage
A major service outage is an unplanned interruption to IT services that affects multiple users or critical business functions beyond the scope of normal incident management. This playbook governs the response when email systems fail for an entire office, when the grants management system becomes inaccessible during a reporting deadline, or when network connectivity drops across a region. The procedures here focus on service restoration and business continuity rather than security investigation, which is covered by dedicated security incident playbooks.
The distinction between a standard incident and a major outage lies in organisational impact. A single user unable to print is an incident. Fifty users unable to access beneficiary records during a distribution is a major outage requiring coordinated response, executive communication, and potentially manual workarounds that affect programme delivery.
Activation criteria
Invoke this playbook when any of the following conditions are met:
| Criterion | Threshold | Examples |
|---|---|---|
| User impact | 20+ users affected, or any executive/leadership affected | Email down for headquarters; finance system unavailable |
| Duration | Service unavailable for 30+ minutes with no resolution in sight | Database server unresponsive after initial troubleshooting |
| SLA breach | Imminent or actual breach of documented service level | 99.9% availability SLA breached; 4-hour response time exceeded |
| Business criticality | Any Tier 1 service unavailable regardless of user count | Payroll system during pay run; beneficiary database during emergency response |
| Cascading failure | Two or more services affected by same root cause | Authentication failure affecting all SSO-integrated applications |
| Field operations impact | Any outage affecting active humanitarian response | Data collection platform down during needs assessment |
Service tier classification determines activation thresholds. Tier 1 services (email, identity provider, core programme systems, finance) trigger immediate major outage response upon confirmed unavailability. Tier 2 services (document management, secondary applications) trigger major outage response after 30 minutes or when affecting 20+ users. Tier 3 services (convenience applications, non-critical tools) follow standard incident management unless impact escalates.
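The activation thresholds above can be sketched as a small decision helper. This is a minimal POSIX shell sketch; the function name and argument order are illustrative, and the tier rules mirror the paragraph above.

```shell
#!/bin/sh
# Decide whether to activate the major outage playbook from the tier
# thresholds described above. Arguments: service tier (1-3), number of
# affected users, minutes of unavailability. Prints "activate" or "standard".
should_activate() {
  tier=$1; users=$2; minutes=$3
  case "$tier" in
    1) echo "activate" ;;   # Tier 1: immediate response on confirmed unavailability
    2) if [ "$minutes" -ge 30 ] || [ "$users" -ge 20 ]; then
         echo "activate"    # Tier 2: 30+ minutes or 20+ users
       else
         echo "standard"
       fi ;;
    *) echo "standard" ;;   # Tier 3: standard incident management unless impact escalates
  esac
}

should_activate 2 25 10   # 25 users on a Tier 2 service -> activate
```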
Security incidents
If the outage results from or coincides with suspected malicious activity, invoke the appropriate security playbook instead. Ransomware, denial-of-service attacks, and compromised infrastructure require security-focused response even when service availability is the visible symptom.
Roles
| Role | Responsibility | Typical assignee | Backup |
|---|---|---|---|
| Incident commander | Overall coordination, decisions, external escalation, declares resolution | IT Manager or Head of IT | Senior Systems Administrator |
| Technical lead | Investigation, diagnosis, remediation execution, technical updates | Systems Administrator or Engineer | Application Administrator |
| Communications lead | Stakeholder updates, status page management, user communication | IT Service Desk Lead or Communications Officer | IT Manager |
| Business liaison | Programme impact assessment, workaround coordination, priority input | Programme Manager or Operations Director | Country Director representative |
| Scribe | Timeline documentation, action tracking, decision recording | Service Desk Analyst | Any available IT staff |
For organisations with single-person IT functions, the IT staff member assumes incident commander and technical lead roles while delegating communications to a designated non-IT colleague and business liaison to the relevant programme manager.
Phase 1: Initial assessment
Objective: Confirm outage scope, establish incident command, and initiate communication within 15 minutes of detection.
Timeframe: 0-15 minutes
- Confirm service unavailability through independent verification. Do not rely solely on user reports. Access monitoring dashboards, attempt service access from multiple locations, and check vendor status pages for cloud services.
```shell
# Quick service verification examples
curl -I https://mail.example.org/health
ping -c 3 fileserver.internal
nslookup grants.example.org
```

Document the exact time service was confirmed unavailable. This becomes the official outage start time for SLA calculations and post-incident reporting.
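The verification step can be wrapped in a small probe that records the confirmed-unavailable timestamp automatically. The function name is illustrative and the URL should be replaced with your service's real health endpoint.

```shell
#!/bin/sh
# Probe a service endpoint; on failure, print a timestamped line that can
# serve as the official outage start time for SLA calculations.
check_service() {
  url=$1
  if curl -fsS --max-time 10 -o /dev/null "$url"; then
    echo "$(date -u '+%Y-%m-%d %H:%M') UTC - $url reachable"
  else
    echo "$(date -u '+%Y-%m-%d %H:%M') UTC - $url UNREACHABLE - record as outage start"
  fi
}
```

Run it from more than one network location, since a probe that succeeds from the server room can still fail for users behind a broken VPN or DNS path.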
Identify the incident commander. If the designated incident commander is unavailable, the most senior IT staff member present assumes the role. The incident commander must be reachable by phone for the duration of the outage.
Assess initial scope by answering these questions:
- Which specific services are affected?
- How many users are affected and in which locations?
- What business functions are impacted?
- Are there dependent services that may fail as a consequence?
- Is this affecting active programme delivery or humanitarian operations?
Classify severity based on initial assessment:
| Severity | User impact | Business impact | Response level |
|---|---|---|---|
| Critical | 100+ users or all of a location | Core operations stopped; programme delivery halted | Full war room; executive notification within 30 minutes |
| High | 20-100 users | Significant degradation; workarounds difficult | Technical team assembled; management notification within 1 hour |
| Medium | Under 20 users but Tier 1 service | Limited impact; workarounds available | Standard escalation; management notification within 2 hours |

Establish the war room for Critical and High severity outages. The war room is a dedicated communication channel where all incident participants coordinate in real time.
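The classification rules can be expressed as a helper for consistent triage. This is a sketch; the function name and argument order are illustrative, and the thresholds mirror the severity table above.

```shell
#!/bin/sh
# Classify outage severity per the severity table above.
# Arguments: affected user count, Tier 1 service involved (1/0),
# entire location down (1/0).
classify_severity() {
  users=$1; tier1=$2; whole_location=$3
  if [ "$users" -ge 100 ] || [ "$whole_location" -eq 1 ]; then
    echo "Critical"   # full war room; executive notification within 30 minutes
  elif [ "$users" -ge 20 ]; then
    echo "High"       # technical team assembled; management notified within 1 hour
  elif [ "$tier1" -eq 1 ]; then
    echo "Medium"     # under 20 users but a Tier 1 service
  else
    echo "Standard"   # falls back to normal incident management
  fi
}
```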
For remote/distributed teams, create a dedicated video call and messaging channel:
```
Channel name: OUTAGE-[DATE]-[SERVICE]
Example: OUTAGE-20241116-EMAIL
```

For co-located teams, designate a physical meeting room. Post the room number or video link to all IT staff immediately.
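Channel names following this convention can be generated mechanically so they stay consistent under pressure (illustrative helper; the date defaults to today in UTC):

```shell
#!/bin/sh
# Build a war-room channel name following the OUTAGE-[DATE]-[SERVICE]
# convention. The service name is upper-cased and stripped to A-Z and 0-9.
war_room_name() {
  service=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -cd 'A-Z0-9')
  date_part=${2:-$(date -u +%Y%m%d)}
  echo "OUTAGE-${date_part}-${service}"
}

war_room_name email 20241116   # -> OUTAGE-20241116-EMAIL
```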
- Send initial notification to IT team and management:
Subject: [MAJOR OUTAGE] [Service name] - Investigation underway
Affected service: [Service name]
Start time: [HH:MM timezone]
Impact: [Brief description - X users, Y locations]
Severity: [Critical/High/Medium]

War room: [Location/link]
Incident commander: [Name, phone]

Next update: [Time - typically 30 minutes]

Decision point: If initial assessment reveals security indicators (unusual access patterns, ransom notes, defacement), transition to the appropriate security playbook immediately.
Checkpoint: Proceed to Phase 2 when incident commander is assigned, war room is active, initial scope is documented, and first notification is sent.
Phase 2: Impact analysis and communication
Objective: Fully characterise the outage impact, establish communication cadence, and implement immediate workarounds.
Timeframe: 15-60 minutes
Map the complete impact by working through affected services systematically. For each affected service, document:
- Direct users who cannot perform their normal functions
- Dependent systems that rely on the affected service
- Business processes that are blocked or degraded
- Data that may be at risk (unsaved work, incomplete transactions)
- Scheduled activities that will be affected (reports, payroll runs, distributions)
Example impact documentation:
```
Service: Microsoft 365 (Email and SharePoint)
Direct users: 340 (all HQ and regional offices)
Dependent systems:
- Grants portal (SSO authentication)
- HR system (email notifications)
- Approval workflows (stuck)
Blocked processes:
- Donor communication
- Document collaboration
- Calendar/scheduling
At-risk data: Emails composed offline will queue locally
Scheduled: Board report submission due 17:00 today
```

Identify business priorities with the business liaison. Ask specifically:
- What cannot wait until tomorrow?
- Who are the most affected individuals?
- Are there external deadlines (donor reports, regulatory filings)?
- Is there active programme delivery that depends on this service?
Determine and communicate workarounds. Effective workarounds maintain business function without the affected system.
| Affected service | Potential workarounds |
|---|---|
| Email | Personal email for urgent external communication; messaging platform for internal; phone for critical contacts |
| File storage | Local copies; USB transfer for critical files (note security implications); alternative cloud storage |
| Finance system | Manual tracking; defer non-urgent transactions; paper-based approvals |
| Data collection | Paper forms; offline mobile data collection; SMS-based reporting |
| Video conferencing | Alternative platform; audio-only dial-in; postponement |

Field office workarounds
Field offices often have existing offline procedures that can scale during outages. Consult field IT staff or programme managers for established manual processes before creating new workarounds.
Establish communication cadence based on severity:
| Severity | Internal IT updates | Management updates | User updates |
|---|---|---|---|
| Critical | Every 15 minutes | Every 30 minutes | Every 30 minutes |
| High | Every 30 minutes | Every hour | Every hour |
| Medium | Every hour | Every 2 hours | As significant changes occur |

Send first user communication through available channels. If email is affected, use messaging platforms, SMS, intranet, or phone trees.
Subject: [Service name] currently unavailable - Workarounds available
[Service name] became unavailable at [time] and our team is working to restore it.
IMPACT: [What you cannot do]
WORKAROUNDS:
- [Specific alternative 1]
- [Specific alternative 2]
We will provide updates every [timeframe].
For urgent needs, contact [name] at [phone/channel].
Next update: [time]

Notify affected external parties if the outage affects partner organisations, donors, or beneficiaries:
- Partners with system integrations: Direct contact to relationship owner
- Donors expecting deliverables: Proactive notification with revised timeline
- Beneficiaries: Through programme staff using established communication channels
Decision point: If the outage will exceed 4 hours, activate business continuity measures through the Service Continuity playbook.
Checkpoint: Proceed to Phase 3 when full impact is documented, workarounds are communicated, communication cadence is established, and external parties are notified.
Phase 3: Diagnosis and resolution
Objective: Identify root cause and restore service through systematic troubleshooting.
Timeframe: Ongoing until resolution (typically 1-4 hours for most outages)
Gather diagnostic information systematically. The technical lead directs troubleshooting while the scribe documents all findings.
For infrastructure services:
```shell
# Server status
systemctl status [service]
journalctl -u [service] --since "1 hour ago"

# Resource utilisation
df -h        # Disk space
free -m      # Memory
top -bn1     # CPU and processes

# Network connectivity
netstat -tlnp             # Listening ports
traceroute [destination]
```

For cloud services:
- Check vendor status page (bookmark these in advance)
- Review admin console for service health
- Check recent changes in audit logs
- Verify authentication and licensing status
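Status-page checks can be scripted when the vendor exposes a machine-readable endpoint. The URL pattern and JSON shape below assume an Atlassian Statuspage-style `/api/v2/status.json` feed, which many (but not all) vendors provide; confirm the real endpoint against your vendor's documentation.

```shell
#!/bin/sh
# Fetch a Statuspage-style status feed and extract the overall status
# description. The endpoint path and JSON field names are assumptions.
vendor_status() {
  url=$1
  curl -fsS --max-time 10 "$url" \
    | grep -o '"description": *"[^"]*"' \
    | head -1 \
    | sed 's/.*: *"\(.*\)"/\1/'
}

# Example (hypothetical endpoint):
# vendor_status https://status.example.com/api/v2/status.json
```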
For applications:
- Review application logs for errors
- Check database connectivity
- Verify integration endpoints
- Test with minimal/default configuration
Correlate timeline with recent changes. Most outages follow changes. Review:
- Changes deployed in the past 72 hours
- Automated updates (patches, definition updates)
- Certificate or credential expirations
- Vendor maintenance windows
- Infrastructure changes (network, storage, virtualisation)
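Part of this change review can be gathered with one script. The log paths below assume a Debian-family Linux server and are illustrative placeholders; adapt them to your environment.

```shell
#!/bin/sh
# Collect recent-change signals to correlate with the outage window.
recent_changes() {
  echo "== Package installs/upgrades (Debian-family path) =="
  grep -h ' install \| upgrade ' /var/log/dpkg.log 2>/dev/null | tail -20

  echo "== Automatic update activity =="
  grep -hi 'unattended-upgrade' /var/log/syslog 2>/dev/null | tail -5

  echo "== TLS certificate expiry (placeholder host; uncomment to use) =="
  # echo | openssl s_client -connect grants.example.org:443 2>/dev/null \
  #   | openssl x509 -noout -enddate
}

recent_changes
```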
Create a timeline correlating the outage start with any identified changes:
```
14:30 - Automated Windows Update deployed to file server
14:45 - File server rebooted
14:52 - First user reports of file access issues
14:55 - Monitoring alert: file server unresponsive
15:00 - Outage confirmed, this playbook activated
```

Engage vendor support for third-party and cloud services when internal troubleshooting does not identify the cause within 30 minutes.
Have this information ready before calling:
- Account/tenant ID
- Affected service and specific error messages
- Timeline of outage and troubleshooting steps taken
- Business impact (vendors prioritise based on severity)
Document the vendor case number and update the war room:
Vendor case opened: [Vendor] #[Case number]
Contact: [Name] [Phone/Email]
Estimated response: [Time vendor committed to]

Implement resolution. When root cause is identified, determine the resolution approach:
| Scenario | Resolution approach | Authorisation needed |
|---|---|---|
| Configuration error | Correct configuration | Technical lead |
| Failed update | Roll back change | Incident commander |
| Hardware failure | Failover or replacement | Technical lead |
| Vendor issue | Await vendor fix; implement workaround | Incident commander decides workaround |
| Capacity exhaustion | Free resources or scale | Technical lead; procurement if cost involved |

Change control during outages
Emergency changes to restore service do not require standard change approval but must be documented. Record all changes made, by whom, and with whose authorisation. These feed into post-incident review.
Verify resolution before declaring the outage resolved:
- Technical lead confirms service responds correctly
- Test from multiple locations (HQ, field office, external)
- Verify dependent services have recovered
- Confirm monitoring shows healthy status
- Have at least two users confirm they can work normally
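The technical verification above can be complemented by repeated probes: a single failed probe means the service is not yet stable. The function name, URL, and probe count below are illustrative.

```shell
#!/bin/sh
# Probe an endpoint several times before declaring resolution; any single
# failure means the service is not yet stable.
verify_restored() {
  url=$1; attempts=${2:-5}; i=0
  while [ "$i" -lt "$attempts" ]; do
    if ! curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "FAIL on probe $((i + 1)) - do not declare resolved"
      return 1
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "OK - $attempts consecutive probes succeeded"
}
```

Run this from headquarters, a field office, and an external network, matching the multi-location check in the list above.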
Update stakeholders on resolution progress every cycle as defined in Phase 2, using this structure:
[MAJOR OUTAGE UPDATE] [Service] - [Status]
Current status: [Investigating / Root cause identified / Implementing fix / Resolved]
Progress since last update:
- [Action taken 1]
- [Action taken 2]
Current theory/action: [What you're trying now]
Estimated restoration: [Time if known, or "Still assessing"]
Next update: [Time]

Decision point: If resolution is not achievable within the estimated time, consider whether to invoke disaster recovery procedures. Discuss with the incident commander and business liaison.
Checkpoint: Proceed to Phase 4 when service is restored and verified operational.
Phase 4: Recovery and stabilisation
Objective: Confirm full service restoration, clear any backlogs, and ensure stability before standing down.
Timeframe: 30-60 minutes post-restoration
- Announce service restoration to all stakeholders:
Subject: [RESOLVED] [Service name] restored
[Service name] was restored at [time]. Full functionality has been verified.
Total outage duration: [X hours Y minutes]
WHAT TO DO NOW:
- [Any user actions needed - retry failed operations, check for data loss]
- [How to report ongoing issues]
CAUSE: [Brief, non-technical explanation]
A full review will be conducted and improvements implemented to prevent recurrence.
Thank you for your patience. Contact [service desk] if you experience any continuing issues.

Clear backlogs created during the outage:
- Email: Verify mail queues are processing; monitor for delivery delays
- Transactions: Review any transactions that may have failed mid-process
- Scheduled jobs: Restart or manually trigger missed automated processes
- Integrations: Verify data synchronisation between systems
- Approvals: Notify approvers of pending items that accumulated
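Some of these backlog checks can be scripted. The commands below assume a Postfix mail relay and syslog-style cron logging, which may not match your stack; treat them as placeholders.

```shell
#!/bin/sh
# Post-restoration backlog checks (Postfix/syslog assumptions).
backlog_checks() {
  echo "== Mail queue depth (Postfix example) =="
  postqueue -p 2>/dev/null | tail -1    # last line summarises queued messages

  echo "== Recent cron activity (look for missed jobs to re-run) =="
  grep -h CRON /var/log/syslog 2>/dev/null | tail -10

  echo "== Flush deferred mail once the service is confirmed healthy =="
  # postqueue -f
}

backlog_checks
```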
Monitor for stability. Keep the war room active for 30-60 minutes after restoration to catch any recurrence or related issues.
Watch for:
- Service degradation or intermittent failures
- Performance problems (slow response, timeouts)
- User reports of ongoing issues
- Monitoring alerts related to the affected service
Document the outage timeline completely:
```
OUTAGE TIMELINE: [Service name] - [Date]

Detection:
- [Time]: First user report / monitoring alert
- [Time]: Outage confirmed, playbook activated

Response:
- [Time]: Incident commander [Name] assumed command
- [Time]: War room established
- [Time]: Initial notification sent

Investigation:
- [Time]: [Key diagnostic finding]
- [Time]: Root cause identified: [Cause]
- [Time]: Vendor engaged (if applicable)

Resolution:
- [Time]: Fix implemented: [Action taken]
- [Time]: Service restored
- [Time]: Resolution verified

Recovery:
- [Time]: Backlog cleared
- [Time]: Stability confirmed
- [Time]: War room closed

TOTAL OUTAGE DURATION: [X hours Y minutes]
TOTAL RESPONSE DURATION: [X hours Y minutes]
```

- Schedule post-incident review. All Critical and High severity outages require formal review within 5 business days. The incident commander initiates scheduling before closing the war room.
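The duration figures feed SLA reporting: a 99.9% availability target over a 30-day month (43,200 minutes) allows roughly 43 minutes of downtime. A sketch of the arithmetic, assuming GNU `date` and integer-only POSIX shell maths:

```shell
#!/bin/sh
# Outage duration in whole minutes between two timestamps (GNU date -d).
outage_minutes() {
  start_epoch=$(date -d "$1" +%s)
  end_epoch=$(date -d "$2" +%s)
  echo $(( (end_epoch - start_epoch) / 60 ))
}

# Monthly availability percentage to three decimal places, using integer
# maths (POSIX shell has no floating point). 43200 min = 30-day month.
availability_pct() {
  downtime_min=$1; period_min=${2:-43200}
  echo $(( (period_min - downtime_min) * 100000 / period_min )) \
    | sed 's/\(...\)$/.\1/'
}
```

For example, a 158-minute outage in an otherwise clean month yields 99.634% availability, breaching a 99.9% SLA.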
Checkpoint: Close the incident when service is verified stable for 30+ minutes, backlogs are cleared, timeline is documented, and post-incident review is scheduled.
Phase 5: Post-incident
Objective: Learn from the outage and implement improvements to prevent recurrence.
Timeframe: Within 5 business days of resolution
Conduct post-incident review meeting with all incident participants and relevant stakeholders. The incident commander facilitates. Focus on:
- What happened (factual timeline)
- What went well in the response
- What could have been handled better
- What systemic issues contributed to the outage
- What actions would prevent recurrence or improve detection
Avoid blame. The goal is system improvement, not individual accountability. Ask “what allowed this to happen” rather than “who made the mistake.”
Produce post-incident report documenting:
- Executive summary (1 paragraph)
- Impact (users affected, duration, business impact)
- Timeline (from detection through resolution)
- Root cause analysis
- Response effectiveness
- Improvement actions with owners and deadlines
Implement improvements. Common improvement categories:
| Category | Example actions |
|---|---|
| Monitoring | Add alerts for early warning; reduce detection time |
| Redundancy | Implement failover; add capacity headroom |
| Documentation | Update runbooks; document tribal knowledge |
| Process | Clarify escalation paths; update contact lists |
| Change control | Add validation steps; improve testing |
| Communication | Update templates; improve status page |

Track improvement actions to completion. Assign each action an owner and deadline. Review progress in regular IT team meetings until all actions are complete.
Update playbooks and runbooks based on lessons learned. If this playbook’s procedures proved inadequate, propose updates through normal documentation change processes.
Communications
Communication channels by priority
| Channel | Use when | Limitations |
|---|---|---|
| In-person/phone | Executives, critical decisions, escalations | Doesn’t scale; no record |
| Video call war room | Technical coordination, real-time decisions | Requires connectivity |
| Messaging platform | Technical updates, quick coordination | May be affected by outage |
| Email | Formal notifications, wide distribution | Slow; may be affected by outage |
| SMS | Emergency contact when email unavailable | Character limits; cost |
| Intranet/status page | User updates, self-service information | Users must check proactively |
Communication templates
Executive notification (Critical severity, within 30 minutes):
To: [Executive team distribution]
Subject: [CRITICAL OUTAGE] [Service] - Executive notification

A major outage is affecting [service name]. This notification provides initial information; updates will follow every 30 minutes.

IMPACT:
- [X] users affected across [locations]
- Critical function affected: [What can't happen]
- Business risk: [External deadline, programme impact, financial exposure]

STATUS:
- Incident commander: [Name, phone]
- Investigation underway since [time]
- [Current theory or action if known]

DECISIONS THAT MAY BE NEEDED:
- [Any executive decisions anticipated]

Next update: [Time]

For questions: Contact [Name] directly at [phone].

User notification (multiple versions for different phases):
Initial notification:
Subject: [Service name] unavailable - We're working on it
[Service] became unavailable at approximately [time]. Our IT team is investigating and working to restore service as quickly as possible.

WHAT THIS MEANS:
You cannot currently [specific user impact].

WHAT TO DO:
- For urgent matters: [Workaround or alternative]
- Do not [actions that won't help - e.g., "repeatedly try to log in"]

We will update you within [timeframe].

Update notification:
Subject: [Service name] update - [Status summary]
UPDATE: [Progress since last communication]
CURRENT STATUS: [Investigating / Fix in progress / Testing / etc.]
ESTIMATED RESTORATION: [Time if known, or "We're still working to determine this"]
WORKAROUNDS REMINDER:- [Workaround if still relevant]
Next update: [Time]

Resolution notification:
Subject: [RESOLVED] [Service name] restored
[Service] has been restored as of [time].
WHAT TO DO NOW:
1. [Any immediate user actions]
2. [How to report if still experiencing issues]

IMPACT SUMMARY:
- Service was unavailable for approximately [duration]
- [X] users were affected
- [Brief, simple explanation of cause]

We apologise for the disruption. A review is underway to prevent similar issues in future.

Questions? Contact [service desk].

Stakeholder notification matrix
| Stakeholder | When to notify | Who notifies | Channel |
|---|---|---|---|
| Executive team | Critical: within 30 min; High: within 1 hour | Incident commander | Email + phone for CEO/ED |
| All staff | All major outages | Communications lead | Email or messaging |
| Board members | Critical outages exceeding 4 hours | CEO/ED | Direct from leadership |
| Donors | If affecting deliverables or data | Programme lead | Per relationship norms |
| Partners with integrations | Any outage affecting shared systems | Relationship owner | Direct contact |
| Media | Only if outage becomes public | Communications/leadership | Per media policy |
Field office considerations
Field offices experience outages differently from headquarters. Connectivity constraints, time zone differences, and distinct operational contexts require an adapted response.
Field-specific impact assessment questions:
- Are field offices on different infrastructure that may be unaffected?
- Does the outage affect data synchronisation with offline-capable systems?
- Are field teams currently in active programme delivery (distributions, assessments)?
- What local workarounds exist that headquarters may not know about?
Communication adaptations for field contexts:
- Account for time zones when scheduling updates (avoid middle-of-night notifications unless Critical)
- Use SMS or phone for locations with unreliable internet
- Provide field-appropriate workarounds (paper processes, offline tools)
- Designate a field liaison in the war room for geographically distributed outages
Field offices often have established offline procedures from operating in low-connectivity environments. These procedures can provide workarounds for headquarters staff unfamiliar with manual alternatives. Consult field IT staff or programme managers early in impact assessment.
Recovery verification for field offices:
- Explicitly test from field locations before declaring resolution
- Verify data synchronisation has resumed for offline-capable systems
- Confirm field staff can access restored services through their typical connection paths
Service dependency mapping
Effective outage response requires understanding which services depend on which infrastructure. Maintain a dependency map showing relationships between services.
```
SERVICE DEPENDENCY MAP

Identity Provider (Entra ID)
├── Email (Exchange)
│   └── Calendaring / Scheduling
├── File Storage (SharePoint)
│   └── Document Workflows
└── Line of Business Apps
    └── Finance (SSO)

Grants Management     --> Reporting DB (Integration)
HR System (SSO)       --> Payroll (Integration)
Case Mgmt (Direct DB) --> Beneficiary Database
```

When a service fails, trace downstream to identify cascading impact. In the example above, an identity provider outage affects email, file storage, and every application using SSO authentication. A database server failure might affect only a single application.
Update the dependency map when implementing new services or integrations. Review during post-incident reviews to verify accuracy based on actual outage impact.
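Tracing downstream impact can be automated from a simple edge list. The service names below mirror the example map, and the helper is a sketch using POSIX shell plus awk:

```shell
#!/bin/sh
# Dependency edge list: "provider consumer" pairs mirroring the example map.
DEPS="
identity email
identity filestorage
identity lob-apps
email calendaring
filestorage doc-workflows
lob-apps finance
grants reporting-db
hr payroll
casemgmt beneficiary-db
"

# Print every service directly or transitively dependent on $1.
downstream() {
  for child in $(echo "$DEPS" | awk -v p="$1" '$1 == p { print $2 }'); do
    echo "$child"
    downstream "$child"
  done
}
```

`downstream identity` lists email, calendaring, filestorage, doc-workflows, lob-apps, and finance, i.e. everything the Phase 2 impact analysis must cover for an identity provider outage.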
Escalation structure
```
ESCALATION STRUCTURE

Initial report (anyone)              User report, monitoring alert, or
        |                            automated detection
        v
Service Desk (on-call tech)          Triage and initial assessment;
        |                            determine if major outage criteria met
        | major outage criteria met
        v
IT Manager (incident commander)      Assumes incident commander role;
        |                            activates this playbook
        | Critical severity or executive impact
        v
Director of Ops (or equivalent)      Business decisions, resource allocation,
        |                            external stakeholder communication
        | organisational impact or Critical >4 hours
        v
Executive Director / CEO             Board notification decisions;
                                     reputational/donor impact decisions
```

Escalation triggers automatically at time thresholds regardless of perceived progress:
| Severity | Auto-escalate to Director | Auto-escalate to Executive |
|---|---|---|
| Critical | 1 hour if unresolved | 4 hours if unresolved |
| High | 4 hours if unresolved | 8 hours if unresolved |
| Medium | No automatic escalation | No automatic escalation |
See also
- Service Continuity - for extended outages requiring business continuity activation
- Incident Management - for standard incident management procedures
- SLA Management - for service level agreement context
- Alerting and Escalation - for monitoring and escalation concepts
- Incident Triage Matrix - for incident classification criteria