Major Service Outage
A major service outage is an unplanned interruption to IT services that affects multiple users or critical business functions beyond the scope of normal incident management. This playbook governs the response when email systems fail for an entire office, when the grants management system becomes inaccessible during a reporting deadline, or when network connectivity drops across a region. The procedures here focus on service restoration and business continuity rather than security investigation, which is covered by dedicated security incident playbooks.
The distinction between a standard incident and a major outage lies in organisational impact. A single user unable to print is an incident. Fifty users unable to access beneficiary records during a distribution is a major outage requiring coordinated response, executive communication, and potentially manual workarounds that affect programme delivery.
Activation criteria
Invoke this playbook when any of the following conditions are met:
| Criterion | Threshold | Examples |
|---|---|---|
| User impact | 20+ users affected, or any executive/leadership affected | Email down for headquarters; finance system unavailable |
| Duration | Service unavailable for 30+ minutes with no resolution in sight | Database server unresponsive after initial troubleshooting |
| SLA breach | Imminent or actual breach of documented service level | 99.9% availability SLA breached; 4-hour response time exceeded |
| Business criticality | Any Tier 1 service unavailable regardless of user count | Payroll system during pay run; beneficiary database during emergency response |
| Cascading failure | Two or more services affected by same root cause | Authentication failure affecting all SSO-integrated applications |
| Field operations impact | Any outage affecting active humanitarian response | Data collection platform down during needs assessment |
Service tier classification determines activation thresholds. Tier 1 services (email, identity provider, core programme systems, finance) trigger immediate major outage response upon confirmed unavailability. Tier 2 services (document management, secondary applications) trigger major outage response after 30 minutes or when affecting 20+ users. Tier 3 services (convenience applications, non-critical tools) follow standard incident management unless impact escalates.
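The activation thresholds above can be sketched as a small decision helper. This is a minimal POSIX shell sketch; the function name and argument order are illustrative, and the tier rules mirror the paragraph above.

```shell
#!/bin/sh
# Decide whether to activate the major outage playbook from the tier
# thresholds described above. Arguments: service tier (1-3), number of
# affected users, minutes of unavailability. Prints "activate" or "standard".
should_activate() {
  tier=$1; users=$2; minutes=$3
  case "$tier" in
    1) echo "activate" ;;   # Tier 1: immediate response on confirmed unavailability
    2) if [ "$minutes" -ge 30 ] || [ "$users" -ge 20 ]; then
         echo "activate"    # Tier 2: 30+ minutes or 20+ users
       else
         echo "standard"
       fi ;;
    *) echo "standard" ;;   # Tier 3: standard incident management unless impact escalates
  esac
}

should_activate 2 25 10   # 25 users on a Tier 2 service -> activate
```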
Security incidents
If the outage results from or coincides with suspected malicious activity, invoke the appropriate security playbook instead. Ransomware, denial-of-service attacks, and compromised infrastructure require security-focused response even when service availability is the visible symptom.
Roles
| Role | Responsibility | Typical assignee | Backup |
|---|---|---|---|
| Incident commander | Overall coordination, decisions, external escalation, declares resolution | IT Manager or Head of IT | Senior Systems Administrator |
| Technical lead | Investigation, diagnosis, remediation execution, technical updates | Systems Administrator or Engineer | Application Administrator |
| Communications lead | Stakeholder updates, status page management, user communication | IT Service Desk Lead or Communications Officer | IT Manager |
| Business liaison | Programme impact assessment, workaround coordination, priority input | Programme Manager or Operations Director | Country Director representative |
| Scribe | Timeline documentation, action tracking, decision recording | Service Desk Analyst | Any available IT staff |
For organisations with single-person IT functions, the IT staff member assumes incident commander and technical lead roles while delegating communications to a designated non-IT colleague and business liaison to the relevant programme manager.
Phase 1: Initial assessment
Objective: Confirm outage scope, establish incident command, and initiate communication within 15 minutes of detection.
Timeframe: 0-15 minutes
- Confirm service unavailability through independent verification. Do not rely solely on user reports. Access monitoring dashboards, attempt service access from multiple locations, and check vendor status pages for cloud services.
```shell
# Quick service verification examples
curl -I https://mail.example.org/health
ping -c 3 fileserver.internal
nslookup grants.example.org
```

Document the exact time service was confirmed unavailable. This becomes the official outage start time for SLA calculations and post-incident reporting.
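The verification step can be wrapped in a small probe that records the confirmed-unavailable timestamp automatically. The function name is illustrative and the URL should be replaced with your service's real health endpoint.

```shell
#!/bin/sh
# Probe a service endpoint; on failure, print a timestamped line that can
# serve as the official outage start time for SLA calculations.
check_service() {
  url=$1
  if curl -fsS --max-time 10 -o /dev/null "$url"; then
    echo "$(date -u '+%Y-%m-%d %H:%M') UTC - $url reachable"
  else
    echo "$(date -u '+%Y-%m-%d %H:%M') UTC - $url UNREACHABLE - record as outage start"
  fi
}
```

Run it from more than one network location, since a probe that succeeds from the server room can still fail for users behind a broken VPN or DNS path.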
Identify the incident commander. If the designated incident commander is unavailable, the most senior IT staff member present assumes the role. The incident commander must be reachable by phone for the duration of the outage.
Assess initial scope by answering these questions:
- Which specific services are affected?
- How many users are affected and in which locations?
- What business functions are impacted?
- Are there dependent services that may fail as a consequence?
- Is this affecting active programme delivery or humanitarian operations?
Classify severity based on initial assessment:
| Severity | User impact | Business impact | Response level |
|---|---|---|---|
| Critical | 100+ users or all of a location | Core operations stopped; programme delivery halted | Full war room; executive notification within 30 minutes |
| High | 20-100 users | Significant degradation; workarounds difficult | Technical team assembled; management notification within 1 hour |
| Medium | Under 20 users but Tier 1 service | Limited impact; workarounds available | Standard escalation; management notification within 2 hours |

Establish the war room for Critical and High severity outages. The war room is a dedicated communication channel where all incident participants coordinate in real time.
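The classification rules can be expressed as a helper for consistent triage. This is a sketch; the function name and argument order are illustrative, and the thresholds mirror the severity table above.

```shell
#!/bin/sh
# Classify outage severity per the severity table above.
# Arguments: affected user count, Tier 1 service involved (1/0),
# entire location down (1/0).
classify_severity() {
  users=$1; tier1=$2; whole_location=$3
  if [ "$users" -ge 100 ] || [ "$whole_location" -eq 1 ]; then
    echo "Critical"   # full war room; executive notification within 30 minutes
  elif [ "$users" -ge 20 ]; then
    echo "High"       # technical team assembled; management notified within 1 hour
  elif [ "$tier1" -eq 1 ]; then
    echo "Medium"     # under 20 users but a Tier 1 service
  else
    echo "Standard"   # falls back to normal incident management
  fi
}
```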
For remote/distributed teams, create a dedicated video call and messaging channel:
```
Channel name: OUTAGE-[DATE]-[SERVICE]
Example: OUTAGE-20241116-EMAIL
```

For co-located teams, designate a physical meeting room. Post the room number or video link to all IT staff immediately.
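Channel names following this convention can be generated mechanically so they stay consistent under pressure (illustrative helper; the date defaults to today in UTC):

```shell
#!/bin/sh
# Build a war-room channel name following the OUTAGE-[DATE]-[SERVICE]
# convention. The service name is upper-cased and stripped to A-Z and 0-9.
war_room_name() {
  service=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -cd 'A-Z0-9')
  date_part=${2:-$(date -u +%Y%m%d)}
  echo "OUTAGE-${date_part}-${service}"
}

war_room_name email 20241116   # -> OUTAGE-20241116-EMAIL
```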
- Send initial notification to IT team and management:
Subject: [MAJOR OUTAGE] [Service name] - Investigation underway
Affected service: [Service name]
Start time: [HH:MM timezone]
Impact: [Brief description - X users, Y locations]
Severity: [Critical/High/Medium]

War room: [Location/link]
Incident commander: [Name, phone]

Next update: [Time - typically 30 minutes]

Decision point: If initial assessment reveals security indicators (unusual access patterns, ransom notes, defacement), transition to the appropriate security playbook immediately.
Checkpoint: Proceed to Phase 2 when incident commander is assigned, war room is active, initial scope is documented, and first notification is sent.
Phase 2: Impact analysis and communication
Objective: Fully characterise the outage impact, establish communication cadence, and implement immediate workarounds.
Timeframe: 15-60 minutes
Map the complete impact by working through affected services systematically. For each affected service, document:
- Direct users who cannot perform their normal functions
- Dependent systems that rely on the affected service
- Business processes that are blocked or degraded
- Data that may be at risk (unsaved work, incomplete transactions)
- Scheduled activities that will be affected (reports, payroll runs, distributions)
Example impact documentation:
```
Service: Microsoft 365 (Email and SharePoint)
Direct users: 340 (all HQ and regional offices)
Dependent systems:
- Grants portal (SSO authentication)
- HR system (email notifications)
- Approval workflows (stuck)
Blocked processes:
- Donor communication
- Document collaboration
- Calendar/scheduling
At-risk data: Emails composed offline will queue locally
Scheduled: Board report submission due 17:00 today
```

Identify business priorities with the business liaison. Ask specifically:
- What cannot wait until tomorrow?
- Who are the most affected individuals?
- Are there external deadlines (donor reports, regulatory filings)?
- Is there active programme delivery that depends on this service?
Determine and communicate workarounds. Effective workarounds maintain business function without the affected system.
| Affected service | Potential workarounds |
|---|---|
| Email | Personal email for urgent external communication; messaging platform for internal; phone for critical contacts |
| File storage | Local copies; USB transfer for critical files (note security implications); alternative cloud storage |
| Finance system | Manual tracking; defer non-urgent transactions; paper-based approvals |
| Data collection | Paper forms; offline mobile data collection; SMS-based reporting |
| Video conferencing | Alternative platform; audio-only dial-in; postponement |

Field office workarounds
Field offices often have existing offline procedures that can scale during outages. Consult field IT staff or programme managers for established manual processes before creating new workarounds.
Establish communication cadence based on severity:
| Severity | Internal IT updates | Management updates | User updates |
|---|---|---|---|
| Critical | Every 15 minutes | Every 30 minutes | Every 30 minutes |
| High | Every 30 minutes | Every hour | Every hour |
| Medium | Every hour | Every 2 hours | As significant changes occur |

Send first user communication through available channels. If email is affected, use messaging platforms, SMS, intranet, or phone trees.
Subject: [Service name] currently unavailable - Workarounds available
[Service name] became unavailable at [time] and our team is working to restore it.
IMPACT: [What you cannot do]
WORKAROUNDS:
- [Specific alternative 1]
- [Specific alternative 2]
We will provide updates every [timeframe].
For urgent needs, contact [name] at [phone/channel].
Next update: [time]

Notify affected external parties if the outage affects partner organisations, donors, or beneficiaries:
- Partners with system integrations: Direct contact to relationship owner
- Donors expecting deliverables: Proactive notification with revised timeline
- Beneficiaries: Through programme staff using established communication channels
Decision point: If the outage will exceed 4 hours, activate business continuity measures through the Service Continuity playbook.
Checkpoint: Proceed to Phase 3 when full impact is documented, workarounds are communicated, communication cadence is established, and external parties are notified.
Phase 3: Diagnosis and resolution
Objective: Identify root cause and restore service through systematic troubleshooting.
Timeframe: Ongoing until resolution (typically 1-4 hours for most outages)
Gather diagnostic information systematically. The technical lead directs troubleshooting while the scribe documents all findings.
For infrastructure services:
```shell
# Server status
systemctl status [service]
journalctl -u [service] --since "1 hour ago"

# Resource utilisation
df -h        # Disk space
free -m      # Memory
top -bn1     # CPU and processes

# Network connectivity
netstat -tlnp             # Listening ports
traceroute [destination]
```

For cloud services:
- Check vendor status page (bookmark these in advance)
- Review admin console for service health
- Check recent changes in audit logs
- Verify authentication and licensing status
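Status-page checks can be scripted when the vendor exposes a machine-readable endpoint. The URL pattern and JSON shape below assume an Atlassian Statuspage-style `/api/v2/status.json` feed, which many (but not all) vendors provide; confirm the real endpoint against your vendor's documentation.

```shell
#!/bin/sh
# Fetch a Statuspage-style status feed and extract the overall status
# description. The endpoint path and JSON field names are assumptions.
vendor_status() {
  url=$1
  curl -fsS --max-time 10 "$url" \
    | grep -o '"description": *"[^"]*"' \
    | head -1 \
    | sed 's/.*: *"\(.*\)"/\1/'
}

# Example (hypothetical endpoint):
# vendor_status https://status.example.com/api/v2/status.json
```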
For applications:
- Review application logs for errors
- Check database connectivity
- Verify integration endpoints
- Test with minimal/default configuration
Correlate timeline with recent changes. Most outages follow changes. Review:
- Changes deployed in the past 72 hours
- Automated updates (patches, definition updates)
- Certificate or credential expirations
- Vendor maintenance windows
- Infrastructure changes (network, storage, virtualisation)
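Part of this change review can be gathered with one script. The log paths below assume a Debian-family Linux server and are illustrative placeholders; adapt them to your environment.

```shell
#!/bin/sh
# Collect recent-change signals to correlate with the outage window.
recent_changes() {
  echo "== Package installs/upgrades (Debian-family path) =="
  grep -h ' install \| upgrade ' /var/log/dpkg.log 2>/dev/null | tail -20

  echo "== Automatic update activity =="
  grep -hi 'unattended-upgrade' /var/log/syslog 2>/dev/null | tail -5

  echo "== TLS certificate expiry (placeholder host; uncomment to use) =="
  # echo | openssl s_client -connect grants.example.org:443 2>/dev/null \
  #   | openssl x509 -noout -enddate
}

recent_changes
```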
Create a timeline correlating the outage start with any identified changes:
```
14:30 - Automated Windows Update deployed to file server
14:45 - File server rebooted
14:52 - First user reports of file access issues
14:55 - Monitoring alert: file server unresponsive
15:00 - Outage confirmed, this playbook activated
```

Engage vendor support for third-party and cloud services when internal troubleshooting does not identify the cause within 30 minutes.
Have this information ready before calling:
- Account/tenant ID
- Affected service and specific error messages
- Timeline of outage and troubleshooting steps taken
- Business impact (vendors prioritise based on severity)
Document the vendor case number and update the war room:
Vendor case opened: [Vendor] #[Case number]
Contact: [Name] [Phone/Email]
Estimated response: [Time vendor committed to]

Implement resolution. When root cause is identified, determine the resolution approach:
| Scenario | Resolution approach | Authorisation needed |
|---|---|---|
| Configuration error | Correct configuration | Technical lead |
| Failed update | Roll back change | Incident commander |
| Hardware failure | Failover or replacement | Technical lead |
| Vendor issue | Await vendor fix; implement workaround | Incident commander decides workaround |
| Capacity exhaustion | Free resources or scale | Technical lead; procurement if cost involved |

Change control during outages
Emergency changes to restore service do not require standard change approval but must be documented. Record all changes made, by whom, and with whose authorisation. These feed into post-incident review.
Verify resolution before declaring the outage resolved:
- Technical lead confirms service responds correctly
- Test from multiple locations (HQ, field office, external)
- Verify dependent services have recovered
- Confirm monitoring shows healthy status
- Have at least two users confirm they can work normally
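The technical verification above can be complemented by repeated probes: a single failed probe means the service is not yet stable. The function name, URL, and probe count below are illustrative.

```shell
#!/bin/sh
# Probe an endpoint several times before declaring resolution; any single
# failure means the service is not yet stable.
verify_restored() {
  url=$1; attempts=${2:-5}; i=0
  while [ "$i" -lt "$attempts" ]; do
    if ! curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "FAIL on probe $((i + 1)) - do not declare resolved"
      return 1
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "OK - $attempts consecutive probes succeeded"
}
```

Run this from headquarters, a field office, and an external network, matching the multi-location check in the list above.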
Update stakeholders on resolution progress every cycle as defined in Phase 2, using this structure:
[MAJOR OUTAGE UPDATE] [Service] - [Status]
Current status: [Investigating / Root cause identified / Implementing fix / Resolved]
Progress since last update:
- [Action taken 1]
- [Action taken 2]
Current theory/action: [What you're trying now]
Estimated restoration: [Time if known, or "Still assessing"]
Next update: [Time]

Decision point: If resolution is not achievable within the estimated time, consider whether to invoke disaster recovery procedures. Discuss with the incident commander and business liaison.
Checkpoint: Proceed to Phase 4 when service is restored and verified operational.
Phase 4: Recovery and stabilisation
Objective: Confirm full service restoration, clear any backlogs, and ensure stability before standing down.
Timeframe: 30-60 minutes post-restoration
- Announce service restoration to all stakeholders:
Subject: [RESOLVED] [Service name] restored
[Service name] was restored at [time]. Full functionality has been verified.
Total outage duration: [X hours Y minutes]
WHAT TO DO NOW:
- [Any user actions needed - retry failed operations, check for data loss]
- [How to report ongoing issues]
CAUSE: [Brief, non-technical explanation]
A full review will be conducted and improvements implemented to prevent recurrence.
Thank you for your patience. Contact [service desk] if you experience any continuing issues.

Clear backlogs created during the outage:
- Email: Verify mail queues are processing; monitor for delivery delays
- Transactions: Review any transactions that may have failed mid-process
- Scheduled jobs: Restart or manually trigger missed automated processes
- Integrations: Verify data synchronisation between systems
- Approvals: Notify approvers of pending items that accumulated
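Some of these backlog checks can be scripted. The commands below assume a Postfix mail relay and syslog-style cron logging, which may not match your stack; treat them as placeholders.

```shell
#!/bin/sh
# Post-restoration backlog checks (Postfix/syslog assumptions).
backlog_checks() {
  echo "== Mail queue depth (Postfix example) =="
  postqueue -p 2>/dev/null | tail -1    # last line summarises queued messages

  echo "== Recent cron activity (look for missed jobs to re-run) =="
  grep -h CRON /var/log/syslog 2>/dev/null | tail -10

  echo "== Flush deferred mail once the service is confirmed healthy =="
  # postqueue -f
}

backlog_checks
```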
Monitor for stability. Keep the war room active for 30-60 minutes after restoration to catch any recurrence or related issues.
Watch for:
- Service degradation or intermittent failures
- Performance problems (slow response, timeouts)
- User reports of ongoing issues
- Monitoring alerts related to the affected service
Document the outage timeline completely:
```
OUTAGE TIMELINE: [Service name] - [Date]

Detection:
- [Time]: First user report / monitoring alert
- [Time]: Outage confirmed, playbook activated

Response:
- [Time]: Incident commander [Name] assumed command
- [Time]: War room established
- [Time]: Initial notification sent

Investigation:
- [Time]: [Key diagnostic finding]
- [Time]: Root cause identified: [Cause]
- [Time]: Vendor engaged (if applicable)

Resolution:
- [Time]: Fix implemented: [Action taken]
- [Time]: Service restored
- [Time]: Resolution verified

Recovery:
- [Time]: Backlog cleared
- [Time]: Stability confirmed
- [Time]: War room closed

TOTAL OUTAGE DURATION: [X hours Y minutes]
TOTAL RESPONSE DURATION: [X hours Y minutes]
```

- Schedule post-incident review. All Critical and High severity outages require formal review within 5 business days. The incident commander initiates scheduling before closing the war room.
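The duration figures feed SLA reporting: a 99.9% availability target over a 30-day month (43,200 minutes) allows roughly 43 minutes of downtime. A sketch of the arithmetic, assuming GNU `date` and integer-only POSIX shell maths:

```shell
#!/bin/sh
# Outage duration in whole minutes between two timestamps (GNU date -d).
outage_minutes() {
  start_epoch=$(date -d "$1" +%s)
  end_epoch=$(date -d "$2" +%s)
  echo $(( (end_epoch - start_epoch) / 60 ))
}

# Monthly availability percentage to three decimal places, using integer
# maths (POSIX shell has no floating point). 43200 min = 30-day month.
availability_pct() {
  downtime_min=$1; period_min=${2:-43200}
  echo $(( (period_min - downtime_min) * 100000 / period_min )) \
    | sed 's/\(...\)$/.\1/'
}
```

For example, a 158-minute outage in an otherwise clean month yields 99.634% availability, breaching a 99.9% SLA.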
Checkpoint: Close the incident when service is verified stable for 30+ minutes, backlogs are cleared, timeline is documented, and post-incident review is scheduled.
Phase 5: Post-incident
Objective: Learn from the outage and implement improvements to prevent recurrence.
Timeframe: Within 5 business days of resolution
Conduct post-incident review meeting with all incident participants and relevant stakeholders. The incident commander facilitates. Focus on:
- What happened (factual timeline)
- What went well in the response
- What could have been handled better
- What systemic issues contributed to the outage
- What actions would prevent recurrence or improve detection
Avoid blame. The goal is system improvement, not individual accountability. Ask “what allowed this to happen” rather than “who made the mistake.”
Produce post-incident report documenting:
- Executive summary (1 paragraph)
- Impact (users affected, duration, business impact)
- Timeline (from detection through resolution)
- Root cause analysis
- Response effectiveness
- Improvement actions with owners and deadlines
Implement improvements. Common improvement categories:
| Category | Example actions |
|---|---|
| Monitoring | Add alerts for early warning; reduce detection time |
| Redundancy | Implement failover; add capacity headroom |
| Documentation | Update runbooks; document tribal knowledge |
| Process | Clarify escalation paths; update contact lists |
| Change control | Add validation steps; improve testing |
| Communication | Update templates; improve status page |

Track improvement actions to completion. Assign each action an owner and deadline. Review progress in regular IT team meetings until all actions are complete.
Update playbooks and runbooks based on lessons learned. If this playbook’s procedures proved inadequate, propose updates through normal documentation change processes.
Communications
Communication channels by priority
| Channel | Use when | Limitations |
|---|---|---|
| In-person/phone | Executives, critical decisions, escalations | Doesn’t scale; no record |
| Video call war room | Technical coordination, real-time decisions | Requires connectivity |
| Messaging platform | Technical updates, quick coordination | May be affected by outage |
| Email | Formal notifications, wide distribution | Slow; may be affected by outage |
| SMS | Emergency contact when email unavailable | Character limits; cost |
| Intranet/status page | User updates, self-service information | Users must check proactively |
Communication templates
Executive notification (Critical severity, within 30 minutes):
To: [Executive team distribution]
Subject: [CRITICAL OUTAGE] [Service] - Executive notification

A major outage is affecting [service name]. This notification provides initial information; updates will follow every 30 minutes.

IMPACT:
- [X] users affected across [locations]
- Critical function affected: [What can't happen]
- Business risk: [External deadline, programme impact, financial exposure]

STATUS:
- Incident commander: [Name, phone]
- Investigation underway since [time]
- [Current theory or action if known]

DECISIONS THAT MAY BE NEEDED:
- [Any executive decisions anticipated]

Next update: [Time]

For questions: Contact [Name] directly at [phone].

User notification (multiple versions for different phases):
Initial notification:
Subject: [Service name] unavailable - We're working on it
[Service] became unavailable at approximately [time]. Our IT team is investigating and working to restore service as quickly as possible.

WHAT THIS MEANS:
You cannot currently [specific user impact].

WHAT TO DO:
- For urgent matters: [Workaround or alternative]
- Do not [actions that won't help - e.g., "repeatedly try to log in"]

We will update you within [timeframe].

Update notification:
Subject: [Service name] update - [Status summary]
UPDATE: [Progress since last communication]
CURRENT STATUS: [Investigating / Fix in progress / Testing / etc.]
ESTIMATED RESTORATION: [Time if known, or "We're still working to determine this"]
WORKAROUNDS REMINDER:- [Workaround if still relevant]
Next update: [Time]

Resolution notification:
Subject: [RESOLVED] [Service name] restored
[Service] has been restored as of [time].
WHAT TO DO NOW:
1. [Any immediate user actions]
2. [How to report if still experiencing issues]

IMPACT SUMMARY:
- Service was unavailable for approximately [duration]
- [X] users were affected
- [Brief, simple explanation of cause]

We apologise for the disruption. A review is underway to prevent similar issues in future.

Questions? Contact [service desk].

Stakeholder notification matrix
| Stakeholder | When to notify | Who notifies | Channel |
|---|---|---|---|
| Executive team | Critical: within 30 min; High: within 1 hour | Incident commander | Email + phone for CEO/ED |
| All staff | All major outages | Communications lead | Email or messaging |
| Board members | Critical outages exceeding 4 hours | CEO/ED | Direct from leadership |
| Donors | If affecting deliverables or data | Programme lead | Per relationship norms |
| Partners with integrations | Any outage affecting shared systems | Relationship owner | Direct contact |
| Media | Only if outage becomes public | Communications/leadership | Per media policy |
Field office considerations
Field offices experience outages differently from headquarters. Connectivity constraints, time zone differences, and distinct operational contexts require an adapted response.
Field-specific impact assessment questions:
- Are field offices on different infrastructure that may be unaffected?
- Does the outage affect data synchronisation with offline-capable systems?
- Are field teams currently in active programme delivery (distributions, assessments)?
- What local workarounds exist that headquarters may not know about?
Communication adaptations for field contexts:
- Account for time zones when scheduling updates (avoid middle-of-night notifications unless Critical)
- Use SMS or phone for locations with unreliable internet
- Provide field-appropriate workarounds (paper processes, offline tools)
- Designate a field liaison in the war room for geographically distributed outages
Field offices often have established offline procedures from operating in low-connectivity environments. These procedures can provide workarounds for headquarters staff unfamiliar with manual alternatives. Consult field IT staff or programme managers early in impact assessment.
Recovery verification for field offices:
- Explicitly test from field locations before declaring resolution
- Verify data synchronisation has resumed for offline-capable systems
- Confirm field staff can access restored services through their typical connection paths
Service dependency mapping
Effective outage response requires understanding which services depend on which infrastructure. Maintain a dependency map showing relationships between services.
```
SERVICE DEPENDENCY MAP

Identity Provider (Entra ID)
├── Email (Exchange)
│   └── Calendaring / Scheduling
├── File Storage (SharePoint)
│   └── Document Workflows
└── Line of Business Apps
    └── Finance (SSO)

Grants Management     --> Reporting DB (Integration)
HR System (SSO)       --> Payroll (Integration)
Case Mgmt (Direct DB) --> Beneficiary Database
```

When a service fails, trace downstream to identify cascading impact. In the example above, an identity provider outage affects email, file storage, and every application using SSO authentication. A database server failure might affect only a single application.
Update the dependency map when implementing new services or integrations. Review during post-incident reviews to verify accuracy based on actual outage impact.
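Tracing downstream impact can be automated from a simple edge list. The service names below mirror the example map, and the helper is a sketch using POSIX shell plus awk:

```shell
#!/bin/sh
# Dependency edge list: "provider consumer" pairs mirroring the example map.
DEPS="
identity email
identity filestorage
identity lob-apps
email calendaring
filestorage doc-workflows
lob-apps finance
grants reporting-db
hr payroll
casemgmt beneficiary-db
"

# Print every service directly or transitively dependent on $1.
downstream() {
  for child in $(echo "$DEPS" | awk -v p="$1" '$1 == p { print $2 }'); do
    echo "$child"
    downstream "$child"
  done
}
```

`downstream identity` lists email, calendaring, filestorage, doc-workflows, lob-apps, and finance, i.e. everything the Phase 2 impact analysis must cover for an identity provider outage.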
Escalation structure
```
ESCALATION STRUCTURE

Initial report (anyone)              User report, monitoring alert, or
        |                            automated detection
        v
Service Desk (on-call tech)          Triage and initial assessment;
        |                            determine if major outage criteria met
        | major outage criteria met
        v
IT Manager (incident commander)      Assumes incident commander role;
        |                            activates this playbook
        | Critical severity or executive impact
        v
Director of Ops (or equivalent)      Business decisions, resource allocation,
        |                            external stakeholder communication
        | organisational impact or Critical >4 hours
        v
Executive Director / CEO             Board notification decisions;
                                     reputational/donor impact decisions
```

Escalation triggers automatically at time thresholds regardless of perceived progress:
| Severity | Auto-escalate to Director | Auto-escalate to Executive |
|---|---|---|
| Critical | 1 hour if unresolved | 4 hours if unresolved |
| High | 4 hours if unresolved | 8 hours if unresolved |
| Medium | No automatic escalation | No automatic escalation |
See also
- Service Continuity - for extended outages requiring business continuity activation
- Incident Management - for standard incident management procedures
- SLA Management - for service level agreement context
- Alerting and Escalation - for monitoring and escalation concepts
- Incident Triage Matrix - for incident classification criteria