
Major Service Outage

A major service outage is an unplanned interruption to IT services that affects multiple users or critical business functions beyond the scope of normal incident management. This playbook governs the response when email systems fail for an entire office, when the grants management system becomes inaccessible during a reporting deadline, or when network connectivity drops across a region. The procedures here focus on service restoration and business continuity rather than security investigation, which is covered by dedicated security incident playbooks.

The distinction between a standard incident and a major outage lies in organisational impact. A single user unable to print is an incident. Fifty users unable to access beneficiary records during a distribution is a major outage requiring coordinated response, executive communication, and potentially manual workarounds that affect programme delivery.

Activation criteria

Invoke this playbook when any of the following conditions are met:

| Criterion | Threshold | Examples |
| --- | --- | --- |
| User impact | 20+ users affected, or any executive/leadership affected | Email down for headquarters; finance system unavailable |
| Duration | Service unavailable for 30+ minutes with no resolution in sight | Database server unresponsive after initial troubleshooting |
| SLA breach | Imminent or actual breach of documented service level | 99.9% availability SLA breached; 4-hour response time exceeded |
| Business criticality | Any Tier 1 service unavailable regardless of user count | Payroll system during pay run; beneficiary database during emergency response |
| Cascading failure | Two or more services affected by same root cause | Authentication failure affecting all SSO-integrated applications |
| Field operations impact | Any outage affecting active humanitarian response | Data collection platform down during needs assessment |

Service tier classification determines activation thresholds. Tier 1 services (email, identity provider, core programme systems, finance) trigger immediate major outage response upon confirmed unavailability. Tier 2 services (document management, secondary applications) trigger major outage response after 30 minutes or when affecting 20+ users. Tier 3 services (convenience applications, non-critical tools) follow standard incident management unless impact escalates.

Security incidents

If the outage results from or coincides with suspected malicious activity, invoke the appropriate security playbook instead. Ransomware, denial-of-service attacks, and compromised infrastructure require security-focused response even when service availability is the visible symptom.

Roles

| Role | Responsibility | Typical assignee | Backup |
| --- | --- | --- | --- |
| Incident commander | Overall coordination, decisions, external escalation, declares resolution | IT Manager or Head of IT | Senior Systems Administrator |
| Technical lead | Investigation, diagnosis, remediation execution, technical updates | Systems Administrator or Engineer | Application Administrator |
| Communications lead | Stakeholder updates, status page management, user communication | IT Service Desk Lead or Communications Officer | IT Manager |
| Business liaison | Programme impact assessment, workaround coordination, priority input | Programme Manager or Operations Director | Country Director representative |
| Scribe | Timeline documentation, action tracking, decision recording | Service Desk Analyst | Any available IT staff |

For organisations with a single-person IT function, the IT staff member assumes both the incident commander and technical lead roles, delegating the communications role to a designated non-IT colleague and the business liaison role to the relevant programme manager.

Phase 1: Initial assessment

Objective: Confirm outage scope, establish incident command, and initiate communication within 15 minutes of detection.

Timeframe: 0-15 minutes

  1. Confirm service unavailability through independent verification. Do not rely solely on user reports. Access monitoring dashboards, attempt service access from multiple locations, and check vendor status pages for cloud services.
# Quick service verification examples
curl -I https://mail.example.org/health   # HTTP health endpoint responding?
ping -c 3 fileserver.internal             # host reachable on the network?
nslookup grants.example.org               # DNS resolving correctly?

Document the exact time service was confirmed unavailable. This becomes the official outage start time for SLA calculations and post-incident reporting.
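The confirmed start time also anchors SLA arithmetic. As an illustration (not taken from any specific SLA), a 99.9% monthly availability commitment leaves roughly 43 minutes of allowable downtime in a 30-day month:

# 30 days x 24 hours x 60 minutes = 43200 minutes; 0.1% is the downtime budget
awk 'BEGIN { print 43200 * 0.001, "minutes" }'   # => 43.2 minutes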

  2. Identify the incident commander. If the designated incident commander is unavailable, the most senior IT staff member present assumes the role. The incident commander must be reachable by phone for the duration of the outage.

  3. Assess initial scope by answering these questions:

    • Which specific services are affected?
    • How many users are affected and in which locations?
    • What business functions are impacted?
    • Are there dependent services that may fail as a consequence?
    • Is this affecting active programme delivery or humanitarian operations?
  4. Classify severity based on initial assessment:

    | Severity | User impact | Business impact | Response level |
    | --- | --- | --- | --- |
    | Critical | 100+ users or all of a location | Core operations stopped; programme delivery halted | Full war room; executive notification within 30 minutes |
    | High | 20-100 users | Significant degradation; workarounds difficult | Technical team assembled; management notification within 1 hour |
    | Medium | Under 20 users but Tier 1 service | Limited impact; workarounds available | Standard escalation; management notification within 2 hours |
  5. Establish the war room for Critical and High severity outages. The war room is a dedicated communication channel where all incident participants coordinate in real time.

    For remote/distributed teams, create a dedicated video call and messaging channel:

Channel name: OUTAGE-[DATE]-[SERVICE]
Example: OUTAGE-20241116-EMAIL

For co-located teams, designate a physical meeting room. Post the room number or video link to all IT staff immediately.

  6. Send initial notification to IT team and management:
Subject: [MAJOR OUTAGE] [Service name] - Investigation underway
Affected service: [Service name]
Start time: [HH:MM timezone]
Impact: [Brief description - X users, Y locations]
Severity: [Critical/High/Medium]
War room: [Location/link]
Incident commander: [Name, phone]
Next update: [Time - typically 30 minutes]

Decision point: If initial assessment reveals security indicators (unusual access patterns, ransom notes, defacement), transition to appropriate security playbook immediately.

Checkpoint: Proceed to Phase 2 when incident commander is assigned, war room is active, initial scope is documented, and first notification is sent.

Phase 2: Impact analysis and communication

Objective: Fully characterise the outage impact, establish communication cadence, and implement immediate workarounds.

Timeframe: 15-60 minutes

  1. Map the complete impact by working through affected services systematically. For each affected service, document:

    • Direct users who cannot perform their normal functions
    • Dependent systems that rely on the affected service
    • Business processes that are blocked or degraded
    • Data that may be at risk (unsaved work, incomplete transactions)
    • Scheduled activities that will be affected (reports, payroll runs, distributions)

    Example impact documentation:

Service: Microsoft 365 (Email and SharePoint)
Direct users: 340 (all HQ and regional offices)
Dependent systems:
- Grants portal (SSO authentication)
- HR system (email notifications)
- Approval workflows (stuck)
Blocked processes:
- Donor communication
- Document collaboration
- Calendar/scheduling
At-risk data: Emails composed offline will queue locally
Scheduled: Board report submission due 17:00 today
  2. Identify business priorities with the business liaison. Ask specifically:

    • What cannot wait until tomorrow?
    • Who are the most affected individuals?
    • Are there external deadlines (donor reports, regulatory filings)?
    • Is there active programme delivery that depends on this service?
  3. Determine and communicate workarounds. Effective workarounds maintain business function without the affected system.

    | Affected service | Potential workarounds |
    | --- | --- |
    | Email | Personal email for urgent external communication; messaging platform for internal; phone for critical contacts |
    | File storage | Local copies; USB transfer for critical files (note security implications); alternative cloud storage |
    | Finance system | Manual tracking; defer non-urgent transactions; paper-based approvals |
    | Data collection | Paper forms; offline mobile data collection; SMS-based reporting |
    | Video conferencing | Alternative platform; audio-only dial-in; postponement |

    Field office workarounds

    Field offices often have existing offline procedures that can scale during outages. Consult field IT staff or programme managers for established manual processes before creating new workarounds.

  4. Establish communication cadence based on severity:

    | Severity | Internal IT updates | Management updates | User updates |
    | --- | --- | --- | --- |
    | Critical | Every 15 minutes | Every 30 minutes | Every 30 minutes |
    | High | Every 30 minutes | Every hour | Every hour |
    | Medium | Every hour | Every 2 hours | As significant changes occur |
  5. Send first user communication through available channels. If email is affected, use messaging platforms, SMS, intranet, or phone trees.

Subject: [Service name] currently unavailable - Workarounds available
[Service name] became unavailable at [time] and our team is working
to restore it.
IMPACT: [What you cannot do]
WORKAROUNDS:
- [Specific alternative 1]
- [Specific alternative 2]
We will provide updates every [timeframe].
For urgent needs, contact [name] at [phone/channel].
Next update: [time]
  6. Notify affected external parties if the outage affects partner organisations, donors, or beneficiaries:

    • Partners with system integrations: Direct contact to relationship owner
    • Donors expecting deliverables: Proactive notification with revised timeline
    • Beneficiaries: Through programme staff using established communication channels

Decision point: If the outage will exceed 4 hours, activate business continuity measures through the Service Continuity playbook.

Checkpoint: Proceed to Phase 3 when full impact is documented, workarounds are communicated, communication cadence is established, and external parties are notified.

Phase 3: Diagnosis and resolution

Objective: Identify root cause and restore service through systematic troubleshooting.

Timeframe: Ongoing until resolution (typically 1-4 hours)

  1. Gather diagnostic information systematically. The technical lead directs troubleshooting while the scribe documents all findings.

    For infrastructure services:

# Server status
systemctl status [service]
journalctl -u [service] --since "1 hour ago"
# Resource utilisation
df -h    # Disk space
free -m  # Memory
top -bn1 # CPU and processes
# Network connectivity
netstat -tlnp # Listening ports (ss -tlnp on newer systems)
traceroute [destination]

For cloud services:

  • Check vendor status page (bookmark these in advance)
  • Review admin console for service health
  • Check recent changes in audit logs
  • Verify authentication and licensing status
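Many vendors host their status pages on Atlassian Statuspage, which exposes a machine-readable summary; a quick check, assuming the vendor follows that convention (URL hypothetical):

# Overall status indicator: "none", "minor", "major", or "critical"
curl -s https://status.vendor-example.com/api/v2/status.json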

For applications:

  • Review application logs for errors
  • Check database connectivity
  • Verify integration endpoints
  • Test with minimal/default configuration
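    For the database and endpoint checks above, a minimal sketch assuming a PostgreSQL-backed application (hostnames and paths are illustrative):

# Is the database accepting connections?
pg_isready -h db.internal -p 5432
# Does the integration endpoint return a healthy status code?
curl -s -o /dev/null -w "%{http_code}\n" https://app.internal/api/health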
  2. Correlate the timeline with recent changes. Most outages follow a recent change. Review:

    • Changes deployed in the past 72 hours
    • Automated updates (patches, definition updates)
    • Certificate or credential expirations
    • Vendor maintenance windows
    • Infrastructure changes (network, storage, virtualisation)

    Create a timeline correlating the outage start with any identified changes:

14:30 - Automated Windows Update deployed to file server
14:45 - File server rebooted
14:52 - First user reports of file access issues
14:55 - Monitoring alert: file server unresponsive
15:00 - Outage confirmed, this playbook activated
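    System logs can help reconstruct this timeline; a sketch for a Debian-based server (log paths vary by distribution, and the endpoint is illustrative):

last -x reboot | head -5                                      # recent reboots
grep -E " (install|upgrade) " /var/log/dpkg.log | tail -20    # recent package changes
# Certificate expiry on a service endpoint
echo | openssl s_client -connect grants.example.org:443 2>/dev/null | openssl x509 -noout -enddate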
  3. Engage vendor support for third-party and cloud services when internal troubleshooting does not identify the cause within 30 minutes.

    Have this information ready before calling:

    • Account/tenant ID
    • Affected service and specific error messages
    • Timeline of outage and troubleshooting steps taken
    • Business impact (vendors prioritise based on severity)

    Document the vendor case number and update the war room:

Vendor case opened: [Vendor] #[Case number]
Contact: [Name] [Phone/Email]
Estimated response: [Time vendor committed to]
  4. Implement resolution. When the root cause is identified, determine the resolution approach:

    | Scenario | Resolution approach | Authorisation needed |
    | --- | --- | --- |
    | Configuration error | Correct configuration | Technical lead |
    | Failed update | Roll back change | Incident commander |
    | Hardware failure | Failover or replacement | Technical lead |
    | Vendor issue | Await vendor fix; implement workaround | Incident commander decides workaround |
    | Capacity exhaustion | Free resources or scale | Technical lead; procurement if cost involved |

    Change control during outages

    Emergency changes to restore service do not require standard change approval but must be documented. Record all changes made, by whom, and with whose authorisation. These feed into post-incident review.

  5. Verify resolution before declaring the outage resolved:

    • Technical lead confirms service responds correctly
    • Test from multiple locations (HQ, field office, external)
    • Verify dependent services have recovered
    • Confirm monitoring shows healthy status
    • Have at least two users confirm they can work normally
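    A short loop can repeat the Phase 1 checks from each vantage point; a sketch with hypothetical hostnames and health paths:

for svc in mail.example.org grants.example.org; do
  curl -s -o /dev/null -w "$svc: HTTP %{http_code} in %{time_total}s\n" "https://$svc/health"
done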
  6. Update stakeholders on resolution progress at the cadence defined in Phase 2, using this structure:

[MAJOR OUTAGE UPDATE] [Service] - [Status]
Current status: [Investigating / Root cause identified / Implementing fix / Resolved]
Progress since last update:
- [Action taken 1]
- [Action taken 2]
Current theory/action: [What you're trying now]
Estimated restoration: [Time if known, or "Still assessing"]
Next update: [Time]

Decision point: If resolution is not achievable within the estimated time, consider whether to invoke disaster recovery procedures. Discuss with incident commander and business liaison.

Checkpoint: Proceed to Phase 4 when service is restored and verified operational.

Phase 4: Recovery and stabilisation

Objective: Confirm full service restoration, clear any backlogs, and ensure stability before standing down.

Timeframe: 30-60 minutes post-restoration

  1. Announce service restoration to all stakeholders:
Subject: [RESOLVED] [Service name] restored
[Service name] was restored at [time]. Full functionality has been
verified.
Total outage duration: [X hours Y minutes]
WHAT TO DO NOW:
- [Any user actions needed - retry failed operations, check for data loss]
- [How to report ongoing issues]
CAUSE: [Brief, non-technical explanation]
A full review will be conducted and improvements implemented to
prevent recurrence.
Thank you for your patience. Contact [service desk] if you
experience any continuing issues.
  2. Clear backlogs created during the outage:

    • Email: Verify mail queues are processing; monitor for delivery delays
    • Transactions: Review any transactions that may have failed mid-process
    • Scheduled jobs: Restart or manually trigger missed automated processes
    • Integrations: Verify data synchronisation between systems
    • Approvals: Notify approvers of pending items that accumulated
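    For the mail-queue check, a minimal sketch assuming an on-premises Postfix relay; cloud-hosted mail requires the vendor's admin console instead:

mailq | tail -1    # queue summary: how many messages are still waiting
postqueue -f       # ask Postfix to attempt immediate delivery of queued mail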
  3. Monitor for stability. Keep the war room active for 30-60 minutes after restoration to catch any recurrence or related issues.

    Watch for:

    • Service degradation or intermittent failures
    • Performance problems (slow response, timeouts)
    • User reports of ongoing issues
    • Monitoring alerts related to the affected service
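    A lightweight probe can supplement monitoring during this window (URL illustrative):

# Check the restored service once a minute; watch for non-200 codes or rising latency
watch -n 60 'curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://mail.example.org/health'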
  4. Document the outage timeline completely:

OUTAGE TIMELINE: [Service name] - [Date]
Detection:
- [Time]: First user report / monitoring alert
- [Time]: Outage confirmed, playbook activated
Response:
- [Time]: Incident commander [Name] assumed command
- [Time]: War room established
- [Time]: Initial notification sent
Investigation:
- [Time]: [Key diagnostic finding]
- [Time]: Root cause identified: [Cause]
- [Time]: Vendor engaged (if applicable)
Resolution:
- [Time]: Fix implemented: [Action taken]
- [Time]: Service restored
- [Time]: Resolution verified
Recovery:
- [Time]: Backlog cleared
- [Time]: Stability confirmed
- [Time]: War room closed
TOTAL OUTAGE DURATION: [X hours Y minutes]
TOTAL RESPONSE DURATION: [X hours Y minutes]
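    The duration figures can be computed from the recorded timestamps; a sketch using GNU date (BSD date takes different flags; times illustrative):

start="2024-11-16 15:00"; end="2024-11-16 17:25"
echo "$(( ($(date -d "$end" +%s) - $(date -d "$start" +%s)) / 60 )) minutes"   # => 145 minutes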
  5. Schedule post-incident review. All Critical and High severity outages require formal review within 5 business days. The incident commander initiates scheduling before closing the war room.

Checkpoint: Close the incident when service is verified stable for 30+ minutes, backlogs are cleared, timeline is documented, and post-incident review is scheduled.

Phase 5: Post-incident

Objective: Learn from the outage and implement improvements to prevent recurrence.

Timeframe: Within 5 business days of resolution

  1. Conduct post-incident review meeting with all incident participants and relevant stakeholders. The incident commander facilitates. Focus on:

    • What happened (factual timeline)
    • What went well in the response
    • What could have been handled better
    • What systemic issues contributed to the outage
    • What actions would prevent recurrence or improve detection

    Avoid blame. The goal is system improvement, not individual accountability. Ask “what allowed this to happen” rather than “who made the mistake.”

  2. Produce post-incident report documenting:

    • Executive summary (1 paragraph)
    • Impact (users affected, duration, business impact)
    • Timeline (from detection through resolution)
    • Root cause analysis
    • Response effectiveness
    • Improvement actions with owners and deadlines
  3. Implement improvements. Common improvement categories:

    | Category | Example actions |
    | --- | --- |
    | Monitoring | Add alerts for early warning; reduce detection time |
    | Redundancy | Implement failover; add capacity headroom |
    | Documentation | Update runbooks; document tribal knowledge |
    | Process | Clarify escalation paths; update contact lists |
    | Change control | Add validation steps; improve testing |
    | Communication | Update templates; improve status page |
  4. Track improvement actions to completion. Assign each action an owner and deadline. Review progress in regular IT team meetings until all actions are complete.

  5. Update playbooks and runbooks based on lessons learned. If this playbook’s procedures proved inadequate, propose updates through normal documentation change processes.

Communications

Communication channels by priority

| Channel | Use when | Limitations |
| --- | --- | --- |
| In-person/phone | Executives, critical decisions, escalations | Doesn’t scale; no record |
| Video call war room | Technical coordination, real-time decisions | Requires connectivity |
| Messaging platform | Technical updates, quick coordination | May be affected by outage |
| Email | Formal notifications, wide distribution | Slow; may be affected by outage |
| SMS | Emergency contact when email unavailable | Character limits; cost |
| Intranet/status page | User updates, self-service information | Users must check proactively |

Communication templates

Executive notification (Critical severity, within 30 minutes):

To: [Executive team distribution]
Subject: [CRITICAL OUTAGE] [Service] - Executive notification
A major outage is affecting [service name]. This notification provides
initial information; updates will follow every 30 minutes.
IMPACT:
- [X] users affected across [locations]
- Critical function affected: [What can't happen]
- Business risk: [External deadline, programme impact, financial exposure]
STATUS:
- Incident commander: [Name, phone]
- Investigation underway since [time]
- [Current theory or action if known]
DECISIONS THAT MAY BE NEEDED:
- [Any executive decisions anticipated]
Next update: [Time]
For questions: Contact [Name] directly at [phone].

User notification (multiple versions for different phases):

Initial notification:

Subject: [Service name] unavailable - We're working on it
[Service] became unavailable at approximately [time]. Our IT team is
investigating and working to restore service as quickly as possible.
WHAT THIS MEANS:
You cannot currently [specific user impact].
WHAT TO DO:
- For urgent matters: [Workaround or alternative]
- Do not [actions that won't help - e.g., "repeatedly try to log in"]
We will update you within [timeframe].

Update notification:

Subject: [Service name] update - [Status summary]
UPDATE: [Progress since last communication]
CURRENT STATUS: [Investigating / Fix in progress / Testing / etc.]
ESTIMATED RESTORATION: [Time if known, or "We're still working to
determine this"]
WORKAROUNDS REMINDER:
- [Workaround if still relevant]
Next update: [Time]

Resolution notification:

Subject: [RESOLVED] [Service name] restored
[Service] has been restored as of [time].
WHAT TO DO NOW:
1. [Any immediate user actions]
2. [How to report if still experiencing issues]
IMPACT SUMMARY:
- Service was unavailable for approximately [duration]
- [X] users were affected
- [Brief, simple explanation of cause]
We apologise for the disruption. A review is underway to prevent
similar issues in future.
Questions? Contact [service desk].

Stakeholder notification matrix

| Stakeholder | When to notify | Who notifies | Channel |
| --- | --- | --- | --- |
| Executive team | Critical: within 30 min; High: within 1 hour | Incident commander | Email + phone for CEO/ED |
| All staff | All major outages | Communications lead | Email or messaging |
| Board members | Critical outages exceeding 4 hours | CEO/ED | Direct from leadership |
| Donors | If affecting deliverables or data | Programme lead | Per relationship norms |
| Partners with integrations | Any outage affecting shared systems | Relationship owner | Direct contact |
| Media | Only if outage becomes public | Communications/leadership | Per media policy |

Field office considerations

Field offices experience outages differently from headquarters. Connectivity constraints, time zone differences, and distinct operational contexts require an adapted response.

Field-specific impact assessment questions:

  • Are field offices on different infrastructure that may be unaffected?
  • Does the outage affect data synchronisation with offline-capable systems?
  • Are field teams currently in active programme delivery (distributions, assessments)?
  • What local workarounds exist that headquarters may not know about?

Communication adaptations for field contexts:

  • Account for time zones when scheduling updates (avoid middle-of-night notifications unless Critical)
  • Use SMS or phone for locations with unreliable internet
  • Provide field-appropriate workarounds (paper processes, offline tools)
  • Designate a field liaison in the war room for geographically distributed outages

Field offices often have established offline procedures from operating in low-connectivity environments. These procedures can provide workarounds for headquarters staff unfamiliar with manual alternatives. Consult field IT staff or programme managers early in impact assessment.

Recovery verification for field offices:

  • Explicitly test from field locations before declaring resolution
  • Verify data synchronisation has resumed for offline-capable systems
  • Confirm field staff can access restored services through their typical connection paths

Service dependency mapping

Effective outage response requires understanding which services depend on which infrastructure. Maintain a dependency map showing relationships between services.

+------------------------------------------------------------+
|                   SERVICE DEPENDENCY MAP                    |
+------------------------------------------------------------+
|                                                            |
|                   +-------------------+                    |
|                   | Identity Provider |                    |
|                   |     (Entra ID)    |                    |
|                   +---------+---------+                    |
|                             |                              |
|          +------------------+------------------+           |
|          |                  |                  |           |
|          v                  v                  v           |
|  +-------+------+   +-------+------+   +-------+------+    |
|  |    Email     |   | File Storage |   |   Line of    |    |
|  |  (Exchange)  |   | (SharePoint) |   | Business Apps|    |
|  +-------+------+   +-------+------+   +-------+------+    |
|          |                  |                  |           |
|          v                  v                  v           |
|  +-------+------+   +-------+------+   +-------+------+    |
|  | Calendaring  |   |   Document   |   |   Finance    |    |
|  |  Scheduling  |   |  Workflows   |   |    (SSO)     |    |
|  +--------------+   +--------------+   +--------------+    |
|                                                            |
|  +--------------+   +--------------+   +--------------+    |
|  |    Grants    |   |  HR System   |   |  Case Mgmt   |    |
|  |  Management  |   |    (SSO)     |   | (Direct DB)  |    |
|  +-------+------+   +-------+------+   +-------+------+    |
|          |                  |                  |           |
|          v                  v                  v           |
|  +-------+------+   +-------+------+   +-------+------+    |
|  | Reporting DB |   |   Payroll    |   | Beneficiary  |    |
|  | (Integration)|   | (Integration)|   |   Database   |    |
|  +--------------+   +--------------+   +--------------+    |
|                                                            |
+------------------------------------------------------------+

When a service fails, trace downstream to identify cascading impact. In the example above, an identity provider outage affects email, file storage, and every application using SSO authentication. A database server failure might affect only a single application.

Update the dependency map when implementing new services or integrations. Review during post-incident reviews to verify accuracy based on actual outage impact.

Escalation structure

+----------------------------------------------------------------+
|                      ESCALATION STRUCTURE                      |
+----------------------------------------------------------------+
|                                                                |
|  +-----------------+                                           |
|  | Initial report  |  User report, monitoring alert, or        |
|  |    (Anyone)     |  automated detection                      |
|  +--------+--------+                                           |
|           |                                                    |
|           v                                                    |
|  +--------+--------+                                           |
|  |  Service Desk   |  Triage and initial assessment            |
|  |  (On-call tech) |  Determine if major outage criteria met   |
|  +--------+--------+                                           |
|           |                                                    |
|           |  Major outage criteria met                         |
|           v                                                    |
|  +--------+--------+                                           |
|  |   IT Manager    |  Assumes incident commander role          |
|  | (Incident Cmdr) |  Activates this playbook                  |
|  +--------+--------+                                           |
|           |                                                    |
|           |  Critical severity or executive impact             |
|           v                                                    |
|  +--------+--------+                                           |
|  | Director of Ops |  Business decisions, resource allocation  |
|  |  or equivalent  |  External stakeholder communication       |
|  +--------+--------+                                           |
|           |                                                    |
|           |  Organisational impact or >4 hours Critical        |
|           v                                                    |
|  +--------+--------+                                           |
|  | Executive Dir / |  Board notification decisions             |
|  |       CEO       |  Reputational/donor impact decisions      |
|  +-----------------+                                           |
|                                                                |
+----------------------------------------------------------------+

Escalation triggers automatically at time thresholds regardless of perceived progress:

| Severity | Auto-escalate to Director | Auto-escalate to Executive |
| --- | --- | --- |
| Critical | 1 hour if unresolved | 4 hours if unresolved |
| High | 4 hours if unresolved | 8 hours if unresolved |
| Medium | No automatic escalation | No automatic escalation |

See also