
Service Continuity

Service continuity procedures maintain organisational operations when critical IT services degrade or fail, bridging the gap between outage detection and full restoration. This playbook covers mail continuity, communications outages, and queue backlog management. Invoke these procedures when services remain unavailable beyond normal incident resolution timeframes and operational impact requires immediate workarounds.

Activation criteria

Activate this playbook when any of the following conditions are met. These thresholds represent points where standard incident management transitions to continuity operations.

| Indicator | Activation threshold |
| --- | --- |
| Email unavailability | Sending or receiving blocked for more than 30 minutes |
| Video conferencing failure | Platform unavailable for scheduled meetings within 2 hours |
| Team messaging outage | Primary messaging platform inaccessible for more than 15 minutes |
| Voice telephony failure | Inbound or outbound calling unavailable for more than 15 minutes |
| Queue backlog growth | Processing queue exceeds 200% of normal depth |
| Multi-service degradation | Two or more communication services simultaneously impaired |
| Vendor notification | Cloud provider announces expected outage exceeding 1 hour |
| Field office isolation | Field location loses all communication channels |
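
The table above can also be checked programmatically against monitoring data. A minimal sketch, assuming hypothetical names (`ACTIVATION_THRESHOLDS_MIN`, `should_activate`) that are not part of any real tooling:

```python
# Illustrative activation-threshold check; thresholds mirror the table above.
# Names and structure are assumptions for this sketch only.
ACTIVATION_THRESHOLDS_MIN = {
    "email": 30,      # sending or receiving blocked
    "messaging": 15,  # primary messaging platform inaccessible
    "voice": 15,      # inbound or outbound calling unavailable
    "video": 120,     # unavailable for meetings within 2 hours
}

def should_activate(service, outage_minutes):
    """Return True when the outage duration exceeds the playbook threshold."""
    threshold = ACTIVATION_THRESHOLDS_MIN.get(service)
    return threshold is not None and outage_minutes > threshold
```

A monitoring job could call this per service and page the continuity coordinator on the first `True` result.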

Relationship to other playbooks

This playbook addresses service-level workarounds. If underlying infrastructure has failed, also invoke Infrastructure Recovery. If the outage results from a security incident, defer to the appropriate incident response playbook.

Roles

| Role | Responsibility | Typical assignee | Backup |
| --- | --- | --- | --- |
| Continuity coordinator | Overall coordination, workaround decisions, user communication | IT Operations Lead | Senior Systems Administrator |
| Technical lead | Service diagnostics, workaround implementation, restoration verification | Systems Administrator | Application Support Specialist |
| Communications lead | User notifications, status updates, executive briefing | IT Service Desk Manager | Communications Officer |
| Service owner | Business impact assessment, priority decisions, acceptance of workarounds | Business Unit Representative | Programme Manager |

Phase 1: Initial assessment

Objective: Determine which services are affected, estimate restoration time, and identify appropriate workarounds.

Timeframe: Complete within 15 minutes of activation.

  1. Confirm service status through monitoring dashboards and direct testing. Access the service status page at https://status.example.org/ and verify against user reports. Record the precise failure mode: complete outage, degraded performance, or partial functionality.

  2. Contact the service provider or check their status page for known incidents. For cloud services, note the incident identifier, affected regions, and estimated time to resolution (ETR). If no ETR is available, assume a minimum 2-hour outage for planning purposes.

  3. Assess business impact by identifying critical operations affected. Determine whether scheduled meetings, deadlines, or time-sensitive communications are at risk within the next 4 hours.

  4. Select the appropriate continuity track based on affected services:

    • Email unavailable: proceed to Phase 2A
    • Video/messaging unavailable: proceed to Phase 2B
    • Queue backlog growing: proceed to Phase 2C
    • Multiple services affected: address in priority order (voice, email, messaging, video)

Decision point: If the vendor provides an ETR under 30 minutes and no critical operations are immediately affected, hold at monitoring status rather than implementing workarounds. Re-evaluate every 15 minutes.

Checkpoint: Before proceeding, confirm you have documented the affected services, the failure mode, any vendor incident reference, and the selected continuity track.

+---------------------------------------------------------------+
| SERVICE CONTINUITY DECISION                                   |
+---------------------------------------------------------------+

Service unavailable?
├── Yes ──→ ETR under 30 minutes?
│           ├── Yes → Monitor only
│           └── No  → Activate workaround
└── Degraded ──→ Performance below 50%?
            ├── Yes → Activate workaround
            └── No  → Monitor only

Figure 1: Service continuity activation decision tree
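
The decision tree in Figure 1 can be expressed as a small helper. This is an illustrative sketch; the function and argument names (`continuity_decision`, `etr_minutes`, `performance_pct`) are assumptions, not an existing tool:

```python
# Sketch of the Figure 1 decision tree; names are illustrative assumptions.
def continuity_decision(unavailable, etr_minutes=None, performance_pct=100.0):
    """Return 'monitor' or 'activate workaround' per the Figure 1 tree."""
    if unavailable:
        # Hold at monitoring status if the vendor promises a quick fix
        if etr_minutes is not None and etr_minutes < 30:
            return "monitor"
        return "activate workaround"
    # Degraded service: intervene only below 50% performance
    if performance_pct < 50:
        return "activate workaround"
    return "monitor"
```

Per the decision point above, a "monitor" result should be re-evaluated every 15 minutes.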

Phase 2A: Mail continuity

Objective: Ensure critical email communication continues during mail system outages.

Timeframe: Implement primary workaround within 30 minutes; complete alternative routing within 2 hours if needed.

  1. Verify the mail failure scope. Test both sending and receiving by sending a message from an external account (personal or alternative organisational account) to a monitored mailbox, and by sending outbound from webmail if available. Record which directions are affected.

  2. Check mail queue status on any on-premises mail transport servers:

    ```sh
    # For Postfix
    postqueue -p | tail -1
    # Shows queue depth, e.g., "-- 847 Kbytes in 234 Requests."

    # For Exchange (PowerShell)
    Get-Queue | Select-Object Identity,MessageCount,Status
    ```

    If queues are accepting messages, mail will be delivered once connectivity is restored. Inform users that sent messages are queued, not lost.

  3. Activate the alternative mail notification channel. Post to the designated Teams/Slack channel or SMS distribution list:

    MAIL SERVICE NOTICE - [TIME]
    Email sending/receiving is currently unavailable.
    Estimated restoration: [TIME or "under investigation"]
    For urgent communication, use [ALTERNATIVE].
    Updates every 30 minutes.
  4. For critical communications that cannot wait, implement one of the following workarounds based on available alternatives:

    Personal email forwarding (short-term only):

    • Identify staff with time-critical external communications
    • Authorise temporary use of personal email with organisational signature
    • Require BCC to a designated archive address when service restores

    Alternative domain routing (if pre-configured):

    • Activate secondary MX records pointing to backup mail service
    • DNS TTL determines propagation time (typically 1-4 hours)
    • Verify receiving capability before announcing to external parties

    Webmail from alternative provider:

    • If primary service is Microsoft 365, check Outlook Web Access at https://outlook.office.com
    • If primary service is Google Workspace, check Gmail at https://mail.google.com
    • Cloud providers sometimes maintain webmail when desktop clients fail
  5. Implement a mail queue hold if partial functionality creates delivery inconsistency. Holding the queue prevents a confusing state in which some messages deliver while others fail:

    ```sh
    # Postfix: hold all queued mail
    postsuper -h ALL
    # Release when ready
    postsuper -r ALL
    ```
  6. Monitor mail flow indicators every 15 minutes during the outage:

    • Queue depth trend (growing, stable, or draining)
    • External delivery test results
    • User-reported symptoms

Decision point: If mail remains unavailable after 4 hours and business-critical external communications are blocked, escalate to emergency communication procedures using the alternative domain or third-party relay service.

Checkpoint: Before proceeding to restoration, confirm that all temporary workarounds are documented, users have been notified of the current state, and you have a list of any messages that require manual follow-up after restoration.

Phase 2B: Communications continuity

Objective: Maintain voice, video, and messaging capabilities during platform outages.

Timeframe: Activate alternative channels within 15 minutes; complete full failover within 1 hour.

  1. Identify the specific communication failure. Test each channel independently:

    • Voice: attempt internal and external calls from desk phone and softphone
    • Video: attempt to join or create a meeting
    • Messaging: send and receive in primary channels

    Record which platforms, which directions (inbound/outbound), and which clients (desktop/mobile/web) are affected.

  2. Activate alternative channels according to the failover map:

    | Primary platform | First alternative | Second alternative |
    | --- | --- | --- |
    | Microsoft Teams (chat) | Slack, if licensed | SMS group, Signal |
    | Microsoft Teams (video) | Zoom | Google Meet |
    | Zoom | Microsoft Teams | Jitsi Meet |
    | Slack | Microsoft Teams | Email + SMS |
    | VoIP desk phones | Mobile phones | Satellite phone (field) |
    | Primary mobile network | Secondary SIM | WiFi calling |
  3. For video conferencing failover, notify meeting organisers of scheduled meetings within the next 4 hours:

    Subject: Meeting platform change - [MEETING NAME]
    Due to [PLATFORM] unavailability, your meeting at [TIME] will
    use [ALTERNATIVE PLATFORM] instead.
    New meeting link: [LINK]
    Dial-in (if available): [NUMBER]
    Please forward to external participants.
  4. Configure call forwarding for critical phone lines if VoIP is unavailable:

    From provider portal (if accessible):

    • Log into telephony admin portal
    • Navigate to call handling rules
    • Set unconditional forward to mobile numbers

    From handset (if lines partially functional):

    • Dial the forwarding activation code (commonly *72 or *21*)
    • Enter destination number
    • Confirm activation tone

    Document all forwarding configurations for later removal.

  5. Establish a communication bridge for coordinating responders if primary messaging is unavailable. Create a temporary Signal group or SMS thread including:

    • Continuity coordinator
    • Technical lead
    • Service desk representatives
    • Key business stakeholders

    Use this channel for response coordination only; do not share sensitive organisational data on personal devices unless encrypted.

  6. For field offices that lose all connectivity, activate the emergency communication check-in protocol:

    • Field office attempts contact via satellite phone at scheduled intervals
    • HQ monitors designated satellite phone line
    • Check-in confirms safety and captures urgent communication needs
    • HQ relays critical information during check-in windows
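
The failover map in step 2 amounts to an ordered lookup: try each alternative until one is available. A minimal sketch, with hypothetical keys and helper name (`FAILOVER_MAP`, `next_channel`):

```python
# Illustrative failover lookup based on the step 2 map; key names are
# assumptions for this sketch.
FAILOVER_MAP = {
    "teams-chat": ["slack", "sms-group/signal"],
    "teams-video": ["zoom", "google-meet"],
    "zoom": ["teams", "jitsi-meet"],
    "slack": ["teams", "email+sms"],
    "voip-desk": ["mobile", "satellite"],
}

def next_channel(primary, failed=()):
    """Return the first alternative that has not itself failed, else None."""
    for alt in FAILOVER_MAP.get(primary, []):
        if alt not in failed:
            return alt
    return None
```

Passing the set of currently failed platforms handles multi-service outages: the lookup skips an alternative that is also down.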

Decision point: If the primary platform vendor announces an outage exceeding 24 hours, transition from “workaround” to “temporary migration” mode, which involves more comprehensive user communication and potential calendar rescheduling.

Checkpoint: Before declaring communications continuity achieved, verify that users can reach the IT response team through at least two independent channels, and that external parties can contact the organisation for urgent matters.

+------------------------------------------------------------------+
| COMMUNICATION CHANNEL FAILOVER                                   |
+------------------------------------------------------------------+

PRIMARY               FIRST FAILOVER          SECOND FAILOVER
MS Teams Video    →   Zoom (licensed)     →   Jitsi (self-hosted)
MS Teams Chat     →   Slack (if licensed) →   Signal (emergency)
VoIP Phones       →   Mobile Phones       →   Satphone (field only)
Email (internal)  →   SMS Broadcast       →   WhatsApp (emergency)

Figure 2: Communication channel failover hierarchy

Phase 2C: Queue backlog management

Objective: Prevent data loss and maintain processing continuity when system queues grow beyond normal capacity.

Timeframe: Implement queue management within 30 minutes; clear backlog within 24 hours of service restoration.

  1. Identify the queue experiencing backlog and determine the cause. Common scenarios include:

    • Integration queue: upstream system sending faster than downstream can process
    • Mail queue: delivery failures causing retry accumulation
    • Job queue: worker processes failing or insufficient capacity
    • Sync queue: offline clients reconnecting simultaneously

    Query queue depth and oldest message:

    ```sh
    # RabbitMQ
    rabbitmqctl list_queues name messages messages_ready
    # Redis (using redis-cli)
    redis-cli LLEN queue_name
    # PostgreSQL job queue
    psql -c "SELECT COUNT(*), MIN(created_at) FROM job_queue WHERE status = 'pending';"
    ```
  2. Assess queue growth rate to determine urgency:

    ```sh
    # Sample queue depth every minute for 5 minutes
    for i in {1..5}; do
      echo "$(date): $(rabbitmqctl list_queues name messages | grep target_queue)"
      sleep 60
    done
    ```

    If the queue grows by more than 10% per minute, the system will exhaust resources before normal processing recovers. Immediate intervention is required.

  3. Implement queue stabilisation based on growth rate:

    Growth rate under 5% per minute (stable):

    • Monitor only; queue will self-clear when processing resumes
    • Ensure sufficient disk space for queue persistence

    Growth rate 5-10% per minute (concerning):

    • Reduce input rate if possible (pause non-critical integrations)
    • Increase consumer/worker capacity if bottleneck is processing speed

    Growth rate over 10% per minute (critical):

    • Pause input entirely using circuit breaker or upstream hold
    • Assess whether queue can be safely purged (transient data) or must be preserved
  4. Implement message prioritisation if the queue contains mixed-priority items:

    ```python
    # Example: reorder queue by priority field
    # This sketch represents the approach; actual implementation varies by platform
    high_priority = []
    normal_priority = []
    for message in queue:
        if message.priority == 'high' or message.age_hours > 4:
            high_priority.append(message)
        else:
            normal_priority.append(message)
    reordered_queue = high_priority + normal_priority
    ```
  5. For processing bottlenecks, scale worker capacity:

    ```sh
    # Kubernetes: scale worker deployment
    kubectl scale deployment queue-worker --replicas=10

    # Docker Compose: scale service
    docker-compose up -d --scale worker=10

    # Systemd: start additional worker instances
    for i in {2..5}; do
      systemctl start worker@$i
    done
    ```

    Monitor resource utilisation to ensure scaled workers do not exhaust CPU, memory, or database connections.

  6. Configure dead-letter handling for messages that fail repeatedly:

    ```sh
    # RabbitMQ: check dead-letter queue
    rabbitmqctl list_queues name messages | grep dead
    # Move dead letters to an inspection queue rather than losing them
    # (implementation varies by message broker)
    ```

    Dead-lettered messages require manual review after the incident to determine whether they should be reprocessed, corrected, or discarded.

  7. Document queue state for post-incident analysis:

    • Maximum queue depth reached
    • Oldest message age
    • Number of messages dead-lettered
    • Processing rate before, during, and after intervention
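
The growth-rate bands in step 3 can be computed directly from the depth samples gathered in step 2. A sketch with an illustrative function name (`classify_growth`); the thresholds come from the playbook, everything else is an assumption:

```python
# Classify queue growth per the step 3 bands; function name is illustrative.
def classify_growth(samples):
    """samples: queue depths taken one minute apart, oldest first."""
    if len(samples) < 2 or samples[0] <= 0:
        return "unknown"
    minutes = len(samples) - 1
    rate = (samples[-1] - samples[0]) / samples[0] / minutes  # fraction per minute
    if rate < 0.05:
        return "stable"       # monitor only
    if rate <= 0.10:
        return "concerning"   # reduce input rate
    return "critical"         # pause input entirely
```

For example, a queue growing from 1,000 to 1,080 messages in one minute (8%/min) falls in the "concerning" band.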

Decision point: If a queue contains more than 100,000 messages and normal processing would require more than 12 hours to clear, consider bulk processing options or temporary data archiving followed by selective replay.
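
The 12-hour clearance estimate in this decision point depends on the net drain rate (processing rate minus arrival rate). A minimal sketch; the helper names (`hours_to_clear`, `needs_bulk_processing`) are assumptions:

```python
# Estimate backlog clearance time for the decision point above.
# Names are illustrative assumptions for this sketch.
def hours_to_clear(depth, drain_rate_per_min):
    """drain_rate_per_min: processing rate minus arrival rate (messages/min)."""
    if drain_rate_per_min <= 0:
        return float("inf")  # backlog is not shrinking
    return depth / drain_rate_per_min / 60

def needs_bulk_processing(depth, drain_rate_per_min):
    """True when the playbook's bulk-processing decision point is reached."""
    return depth > 100_000 and hours_to_clear(depth, drain_rate_per_min) > 12
```

For instance, 120,000 queued messages draining at 100 messages/minute would take 20 hours to clear, triggering the decision point.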

Checkpoint: Before declaring queue management complete, verify that queue depth is decreasing, no messages are older than the acceptable threshold (typically 4 hours for operational data), and dead-letter queues have been reviewed.

+-----------------------------------------------------------------+
| QUEUE BACKLOG PRIORITISATION                                    |
+-----------------------------------------------------------------+

Backlog detected → Growth rate?
├── Stable      → No action required
├── <5%/min     → Monitor queue depth
├── 5-10%/min   → Reduce input rate
└── >10%/min    → Pause input entirely
        |
        v
Scale workers if processing bottleneck
        |
        v
Prioritise high-value messages
        |
        v
Clear backlog within 24 hrs

Figure 3: Queue backlog management decision flow

Phase 3: User communication

Objective: Keep users informed of service status, available workarounds, and expected restoration time.

Timeframe: Initial notification within 15 minutes of activation; updates every 30 minutes during active incident.

  1. Issue the initial user notification through all available channels (the affected service may not be available for notification):

    • Intranet status banner (if self-service portal is available)
    • Alternative messaging platform
    • SMS broadcast for critical roles
    • Email to personal addresses for extended outages (with user consent)
  2. Use the appropriate template from the Communications section below, customised with:

    • Specific services affected
    • Current workaround instructions
    • Expected restoration time (or “under investigation”)
    • Next update time
  3. Establish a predictable update cadence:

    • First 2 hours: update every 30 minutes
    • Hours 2-8: update every hour
    • Beyond 8 hours: update every 2 hours

    Even if no new information is available, confirm that the team is actively working and state the next update time.

  4. Communicate workaround details with sufficient specificity that users can self-serve. Include:

    • Exact steps to access alternative services
    • Any credentials or links needed
    • Limitations of the workaround
    • What to do if the workaround does not work for their use case
  5. Notify external stakeholders if the outage affects commitments:

    • Partners expecting data transfers
    • Donors expecting reports
    • Beneficiaries expecting communications
    • Vendors with integration dependencies

    The service owner, not IT, should deliver external notifications, with IT providing technical talking points.

  6. Prepare the restoration notification before service returns:

    • Confirm what has been restored
    • Note any residual impacts or catch-up activities
    • Thank users for patience
    • Request reports of any ongoing issues
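
The update cadence in step 3 can be encoded so that notification tooling schedules the next update automatically. A sketch with a hypothetical function name (`update_interval_minutes`):

```python
# Update cadence from step 3; returns the interval in minutes given how many
# hours the incident has been active. Function name is illustrative.
def update_interval_minutes(elapsed_hours):
    if elapsed_hours < 2:
        return 30   # first 2 hours: every 30 minutes
    if elapsed_hours < 8:
        return 60   # hours 2-8: every hour
    return 120      # beyond 8 hours: every 2 hours
```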

Decision point: If an outage extends beyond 4 hours and significantly impacts programme delivery, escalate to leadership communication. The Communications lead should prepare a brief for senior management.

Checkpoint: Before each scheduled update, verify the current service status, confirm or revise the estimated restoration time, and gather any new workaround information from the technical team.

Phase 4: Service restoration verification

Objective: Confirm services are fully operational before declaring the incident resolved.

Timeframe: Complete verification within 30 minutes of vendor or technical team reporting restoration.

  1. Verify core functionality through direct testing, not just monitoring dashboard status:

    Email restoration verification:

    ```sh
    # Send test message to an external address
    echo "Restoration test $(date)" | mail -s "Test" external@example.com
    # Verify delivery within 5 minutes
    # Check for bounce or delay notification
    ```

    Video conferencing verification:

    • Create a test meeting
    • Join from two different clients (desktop and mobile)
    • Test screen sharing and audio

    Messaging verification:

    • Send messages in multiple channels
    • Verify message delivery and notification
    • Test file sharing
  2. Check for data integrity issues caused by the outage:

    • Email: verify no duplicate deliveries or lost messages in queue
    • Queues: confirm backlog processing completed without errors
    • Sync: verify offline changes synchronised correctly

    Query for anomalies:

    ```sql
    -- Check for duplicate records created during outage window
    SELECT identifier, COUNT(*)
    FROM records
    WHERE created_at BETWEEN '[outage_start]' AND '[outage_end]'
    GROUP BY identifier
    HAVING COUNT(*) > 1;
    ```
  3. Remove temporary configurations implemented during the outage:

    • Call forwarding rules
    • Alternative routing configurations
    • Elevated permissions or bypass rules
    • Temporary integration holds

    Document each removal and verify the configuration matches the pre-incident state.

  4. Clear any remaining backlog according to priority. If queue backlog exceeds 4 hours of normal processing time, implement parallel processing:

    ```sh
    # Example: parallel message processing with GNU parallel
    cat pending_messages.txt | parallel -j 10 ./process_message.sh {}
    ```
  5. Verify from the user perspective by requesting confirmation from representative users in different locations and roles. Create a brief survey:

    • Can you send and receive email?
    • Can you join video meetings?
    • Can you access team messaging?
    • Are you experiencing any issues not present before the outage?
  6. Confirm with field offices if they were affected. Field locations may have different restoration timing due to network routing or local caching. Do not declare restoration complete until field verification is received.

Decision point: If verification reveals residual issues affecting more than 10% of users, do not declare restoration. Return to the appropriate continuity phase and issue an updated user communication.

Checkpoint: Before declaring the incident resolved, confirm that all temporary configurations have been removed, all user verification responses are positive, and queue backlogs are below normal operating thresholds.

Phase 5: Post-incident

Objective: Document the incident, capture lessons learned, and identify improvements.

Timeframe: Complete documentation within 24 hours; schedule review within 5 business days.

  1. Document the incident timeline:

    • Time of first user report or monitoring alert
    • Time of activation decision
    • Time workarounds were implemented
    • Time service was restored
    • Time incident was closed

    Include the total duration of user impact, not just technical outage time.

  2. Calculate the impact metrics:

    • Number of users affected
    • Duration of unavailability
    • Estimated productivity impact (user-hours lost)
    • Any financial impact (missed deadlines, failed transactions)
    • Reputational impact (external visibility)
  3. Document workaround effectiveness:

    • Which workarounds were used
    • How quickly they were implemented
    • User adoption rate
    • Limitations encountered
    • Improvements for future incidents
  4. Capture root cause if known. For vendor outages, record:

    • Vendor incident identifier
    • Vendor’s root cause statement
    • Whether SLA credits apply

    For internal issues, identify the contributing factors and corrective actions.

  5. Schedule a post-incident review if the outage exceeded 2 hours or significantly impacted operations. Include:

    • Continuity coordinator
    • Technical lead
    • Affected service owners
    • IT management

    Focus the review on systemic improvements, not individual performance.

  6. Update runbooks and procedures based on lessons learned. Common updates include:

    • New activation thresholds based on actual impact
    • Additional or modified workaround procedures
    • Updated contact lists or escalation paths
    • Revised user communication templates
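
The productivity figure in step 2's impact metrics is usually a simple product of users affected, outage duration, and how much of their work was blocked. A sketch; the formula and names are an illustrative simplification, not an organisational standard:

```python
# Rough productivity-impact calculation for step 2's metrics.
# Formula and names are illustrative assumptions.
def user_hours_lost(users_affected, outage_hours, dependency_factor=1.0):
    """dependency_factor: fraction of work blocked by the outage (0-1)."""
    return users_affected * outage_hours * dependency_factor
```

For example, 200 users blocked for 3 hours with half their work dependent on the service gives 300 user-hours lost.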

Checkpoint: Before closing the incident record, confirm that documentation is complete, any required vendor communications (SLA claims) have been initiated, and improvement actions have been assigned owners and due dates.

Communications

Stakeholder notification matrix

| Stakeholder | Timing | Channel | Message owner | Template |
| --- | --- | --- | --- | --- |
| All staff | Within 15 minutes | Intranet, messaging | Communications lead | Initial notification |
| Executive leadership | Within 30 minutes | Direct message/call | Continuity coordinator | Executive brief |
| Field offices | Within 30 minutes | Messaging, SMS | Communications lead | Field notification |
| External partners | Within 2 hours if affected | Email (alternative), phone | Service owner | Partner notification |
| Service desk | Immediate | Internal channel | Technical lead | Service desk brief |

Communication templates

Initial notification (all staff):

SERVICE NOTICE - [TIME]
[SERVICE NAME] is currently unavailable.
Impact: [BRIEF DESCRIPTION OF WHAT USERS CANNOT DO]
Workaround: [SPECIFIC INSTRUCTIONS]
We are working to restore service. Estimated restoration: [TIME or "under investigation"]
Next update: [TIME]
For urgent assistance: [CONTACT METHOD]

Executive brief (30 minutes):

Subject: Service Disruption - [SERVICE] - [TIME]
Current status: [SERVICE] unavailable since [TIME]
Business impact: [SPECIFIC IMPACT - e.g., "Staff cannot send external
email; 3 scheduled donor calls affected"]
Cause: [KNOWN CAUSE or "Under investigation - vendor incident #12345"]
Workaround in place: [YES/NO and brief description]
Estimated restoration: [TIME or "Awaiting vendor update"]
Next update to you: [TIME]
Escalation needed: [YES - specific request / NO]

User update (30-minute intervals):

SERVICE UPDATE - [TIME]
Status: [SERVICE] remains unavailable / is partially restored / is fully restored
What's changed: [ANY NEW INFORMATION]
Workaround reminder: [BRIEF INSTRUCTIONS]
Estimated restoration: [UPDATED TIME]
Next update: [TIME]

Restoration notification:

SERVICE RESTORED - [TIME]
[SERVICE] has been restored to normal operation.
What you need to do:
- [ANY USER ACTIONS REQUIRED, e.g., "Restart Outlook to reconnect"]
- [OR "No action required - service should work normally"]
If you experience issues, please contact [SERVICE DESK/CHANNEL].
Thank you for your patience during this disruption.

Field office notification:

FIELD NOTICE - [TIME]
[SERVICE] is unavailable at headquarters. This may affect:
- [SPECIFIC IMPACT FOR FIELD, e.g., "Email to HQ addresses"]
- [SPECIFIC IMPACT]
Your local systems: [AFFECTED / NOT AFFECTED]
For urgent HQ communication: [ALTERNATIVE METHOD]
Check-in schedule: [IF ACTIVATED, e.g., "Satellite phone check-in
at 10:00, 14:00, 18:00 UTC"]
Updates via: [CHANNEL]

Evidence preservation

During service continuity incidents, preserve the following for post-incident review and potential vendor SLA claims:

| Evidence type | What to capture | Retention |
| --- | --- | --- |
| Monitoring data | Dashboard screenshots showing outage period | 90 days |
| Vendor status | Screenshots of vendor status page with timestamps | 90 days |
| User reports | Initial reports and sample user communications | 90 days |
| Queue metrics | Queue depth logs, processing rates, dead-letter counts | 30 days |
| Configuration changes | Before/after of any temporary configurations | 90 days |
| Timeline | Detailed incident timeline with timestamps | 1 year |
| Impact metrics | User count, duration, business impact assessment | 1 year |

For vendor SLA claims, retain:

  • Vendor incident identifier
  • Service status page archives
  • Independent verification of outage (monitoring data)
  • Documentation of business impact
  • Evidence of workaround costs if claiming compensation

See also