
DR Site Invocation

Disaster recovery site invocation transfers organisational IT operations from an unavailable primary site to pre-established recovery infrastructure. This playbook governs full DR activation when component-level recovery proves insufficient and extended primary site unavailability requires sustained operations from alternate infrastructure. Invoke this playbook only after determining that targeted recovery through Cloud Failover, Infrastructure Recovery, or Backup Recovery playbooks cannot restore operations within acceptable timeframes.

Activation criteria

Invoke this playbook when the following conditions are met:

Criterion | Threshold | Verification method
Primary site unavailability | Confirmed or expected to exceed 4 hours | Facility management confirmation, provider status
Critical service impact | 3 or more Tier 1 services affected | Service monitoring dashboard
Component recovery insufficient | Individual playbooks cannot restore within RTO | Technical lead assessment
Recovery timeline | Primary site restoration exceeds 24 hours | Facility or provider estimate
Data centre access | Physical access impossible or unsafe | Security/facilities confirmation

Activation authority

DR invocation requires explicit authorisation from the IT Director or designated alternate. Unauthorised invocation creates coordination failures and potential data integrity issues from premature failover.

Invocation triggers by scenario

Physical facility loss occurs when fire, flood, structural damage, or utility failure renders the primary data centre inoperable. Verification requires facility management confirmation that restoration exceeds 24 hours. Power failures extending beyond UPS and generator capacity (typically 4-8 hours of fuel) trigger evaluation.

Regional infrastructure failure affects multiple organisations simultaneously through widespread power grid failure, telecommunications outage, or natural disaster. Provider status pages and regional news confirm scope. When the incident affects connectivity to the primary site rather than the site itself, evaluate whether staff can operate remotely before invoking full DR.

Cyber incident requiring isolation applies when ransomware, advanced persistent threat, or other compromise requires complete primary environment isolation for investigation. The incident commander from the relevant security playbook authorises DR invocation to maintain operations while preserving the compromised environment as evidence.

Planned invocation for testing follows the DR Testing procedures rather than this playbook. Test invocations use controlled conditions and predetermined rollback points.

Roles

Role | Responsibility | Primary assignee | Backup
DR Commander | Overall coordination, go/no-go decisions, stakeholder communication | IT Director | Deputy IT Director
Technical Lead | Infrastructure activation sequence, technical verification | Infrastructure Manager | Senior Systems Administrator
Application Lead | Application recovery sequencing, data validation | Applications Manager | Lead Developer
Communications Lead | Staff notification, external communications, status updates | Communications Manager | IT Director
Operations Lead | Staff logistics, access coordination, operational continuity | Operations Manager | HR Manager

The DR Commander holds decision authority throughout invocation. Technical recommendations flow to the DR Commander who authorises each phase transition. The DR Commander may delegate phase-level authority to leads for their domains while retaining overall coordination.

                       +------------------+
                       |   DR Commander   |
                       |  (IT Director)   |
                       +--------+---------+
                                |
          +---------------------+----------------------+
          |                     |                      |
          v                     v                      v
  +-------+-------+     +-------+-------+     +--------+-------+
  |   Technical   |     |  Application  |     | Communications |
  |     Lead      |     |     Lead      |     |      Lead      |
  +-------+-------+     +-------+-------+     +--------+-------+
          |                     |                      |
          v                     v                      v
  +-------+-------+     +-------+-------+     +--------+-------+
  | Infrastructure|     |   App Teams   |     |   Operations   |
  |     Team      |     |               |     |      Lead      |
  +---------------+     +---------------+     +----------------+

Figure 1: DR command structure showing decision authority flow

Phase 1: Decision and authorisation

Objective: Confirm DR invocation is appropriate and obtain authorisation
Timeframe: 30-60 minutes

  1. Assemble the DR decision team (DR Commander, Technical Lead, Application Lead) via the emergency communication channel. Use the predetermined conference bridge or messaging channel established in the BCDR plan. If primary communication tools are affected, fall back to mobile phones using the emergency contact list.

  2. Collect situation assessment from each lead:

    • Technical Lead: Primary site status, estimated restoration time, component recovery feasibility
    • Application Lead: Affected services, data synchronisation status, RPO implications
    • Operations Lead: Staff safety status, facility access status
  3. Verify DR site readiness by confirming:

    • Last successful replication timestamp (acceptable if within RPO, typically 1-4 hours)
    • DR infrastructure health check (automated monitoring or manual verification)
    • Network path availability to DR site
    • Staff ability to access DR site (VPN, physical access if applicable)

    Run the DR readiness check:

Terminal window
# Check replication lag
./dr-status.sh --check-replication
# Expected output: Replication lag: 47 minutes (within 4-hour RPO)
# Verify DR infrastructure
./dr-status.sh --health-check
# Expected output: All 12 critical systems responding
# Test network path
./dr-status.sh --network-test
# Expected output: DR site reachable, latency 23ms
  4. Document the decision rationale including:

    • Primary site status and estimated restoration
    • Services affected and business impact
    • Recovery options considered and rejected
    • Replication status and data loss implications
  5. Obtain DR Commander authorisation. The DR Commander reviews the assessment and provides explicit verbal authorisation: “DR invocation is authorised for [scenario]. Proceeding to Phase 2.” Record the authorisation timestamp.

Decision point: If DR site readiness verification fails (replication lag exceeds RPO, infrastructure unhealthy, or network unavailable), halt invocation and address readiness issues before proceeding. Partial DR invocation creates worse outcomes than delayed invocation with full readiness.

Checkpoint: Before proceeding to Phase 2, confirm:

  • DR Commander has provided explicit authorisation
  • DR site replication is within acceptable RPO
  • DR infrastructure health check passed
  • Decision rationale documented
  • Timestamp recorded

Phase 2: Communication and coordination

Objective: Notify all stakeholders and prepare staff for transition
Timeframe: 30-45 minutes (runs parallel to Phase 3 preparation)

  1. Notify executive leadership using the executive notification template below. The DR Commander or designated Communications Lead makes direct contact (phone call, not email) with:

    • Chief Executive
    • Chief Operating Officer
    • Chief Financial Officer (cost implications)
    • Board chair (if outage exceeds 24 hours or involves data loss)
  2. Activate the staff notification cascade. Use the emergency notification system (SMS broadcast, emergency app, or phone tree) to reach all staff within 15 minutes:

[URGENT] IT DR ACTIVATION
Primary systems unavailable. DR site activating.
Check email/[channel] for instructions within 1 hour.
Do NOT attempt to access primary systems.
Questions: [emergency contact]
  3. Notify critical vendors and partners:

    • Internet service providers (both primary and DR site)
    • Cloud service providers
    • Managed service providers
    • Key implementing partners who depend on shared systems
  4. Establish the operational communication rhythm:

    • Status updates every 2 hours during active recovery
    • Dedicated channel for DR team coordination
    • Separate channel for staff questions (staffed by Operations Lead)
  5. Prepare detailed staff instructions for system access from DR infrastructure. Staff need specific guidance on:

    • VPN configuration changes (if any)
    • New URLs for applications (if different)
    • Authentication changes (if DR uses different identity infrastructure)
    • Expected service limitations during DR operations

Checkpoint: Before proceeding to Phase 3 execution, confirm:

  • Executive leadership notified and acknowledged
  • Staff notification broadcast sent
  • Critical vendors notified
  • Communication channels established
  • Staff instruction document prepared

Phase 3: Infrastructure activation

Objective: Bring DR infrastructure to operational state
Timeframe: 1-4 hours depending on DR architecture

The activation sequence depends on your DR architecture. The three common models require different approaches:

Hot standby maintains running infrastructure with continuous replication. Activation involves DNS cutover and verification. Expected activation time: 15-60 minutes.

Warm standby maintains infrastructure in reduced-capacity state with periodic replication. Activation involves scaling resources, applying recent replication, and DNS cutover. Expected activation time: 1-2 hours.

Cold standby maintains infrastructure definitions and backup data only. Activation involves provisioning infrastructure, restoring from backups, and DNS cutover. Expected activation time: 4-8 hours.

                  +-------------------+
                  |  DR Architecture  |
                  |       Type?       |
                  +---------+---------+
                            |
        +-------------------+-------------------+
        |                   |                   |
        v                   v                   v
  +-----+-----+       +-----+-----+       +-----+-----+
  |    HOT    |       |   WARM    |       |   COLD    |
  |  STANDBY  |       |  STANDBY  |       |  STANDBY  |
  +-----+-----+       +-----+-----+       +-----+-----+
        |                   |                   |
        v                   v                   v
  +-----+-----+       +-----+-----+       +-----+-----+
  |  Verify   |       |   Scale   |       | Provision |
  |   sync    |       | resources |       |   infra   |
  |  status   |       +-----+-----+       +-----+-----+
  +-----+-----+             |                   |
        |                   v                   v
        |             +-----+-----+       +-----+-----+
        |             |   Apply   |       |  Restore  |
        |             |  recent   |       |   from    |
        |             |   repl.   |       |  backup   |
        |             +-----+-----+       +-----+-----+
        |                   |                   |
        +-------------------+-------------------+
                            |
                            v
                  +---------+---------+
                  |    DNS cutover    |
                  +---------+---------+
                            |
                            v
                  +---------+---------+
                  |   Verify access   |
                  +-------------------+

Figure 2: DR activation sequence by architecture type

  1. Verify final replication state before cutover. Record the last successful replication timestamp as this defines your Recovery Point:
Terminal window
# For database replication
psql -h dr-db.example.org -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
# Record: Last replay: 2024-11-16 14:23:47 UTC
# For file replication
rsync --dry-run --stats primary:/data/ dr:/data/
# Record: Files requiring sync (should be minimal)
# For cloud replication
az site-recovery show-recovery-point --vault-name dr-vault
# Record: Latest recovery point timestamp
  2. Stop replication from primary (if applicable and safe). For active-passive configurations, stopping replication prevents corruption from a partially-available primary:
Terminal window
# PostgreSQL streaming replication
psql -h dr-db.example.org -c "SELECT pg_promote();"
# Azure Site Recovery
az site-recovery planned-failover --direction PrimaryToRecovery
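
Before continuing, confirm the promotion actually took effect. A quick check for the PostgreSQL case (the expected result is f, meaning the DR database is no longer in recovery):
# Confirm the DR database is writable after promotion
psql -h dr-db.example.org -c "SELECT pg_is_in_recovery();"
# Expected output: f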
  3. Scale DR infrastructure to production capacity (warm standby only):
Terminal window
# Kubernetes
kubectl --context dr-cluster scale deployment --all --replicas=3
# Azure VM Scale Sets
az vmss scale --name dr-vmss --new-capacity 6
# AWS Auto Scaling
aws autoscaling set-desired-capacity --auto-scaling-group-name dr-asg --desired-capacity 6
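
After scaling, confirm the new capacity is actually available before relying on it. A minimal check for the Kubernetes case (the cloud scale sets have equivalent status commands):
# Confirm each deployment reports the desired number of ready replicas
kubectl --context dr-cluster get deployments
# READY column should show 3/3 (or your target) before proceeding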
  4. Provision infrastructure from definitions (cold standby only):
Terminal window
# Terraform
cd infrastructure/dr
terraform apply -var="environment=dr-active"
# Record provisioning start time for tracking
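
Once provisioning completes, capture the values later steps will need. A minimal sketch; the output names depend on your Terraform definitions:
# List endpoints, IP addresses, and other values exported by the DR configuration
terraform output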
  5. Restore from backups (cold standby only). Follow the Backup Recovery playbook for detailed restore procedures. Critical sequence:

    • Database servers first (longest restore time)
    • Application servers after database availability confirmed
    • File storage concurrent with application servers
  6. Verify infrastructure health before DNS cutover:

Terminal window
# Run comprehensive health check
./dr-health-check.sh --full
# Expected output:
# Database: HEALTHY (connections: 47, replication: N/A - promoted)
# App servers: HEALTHY (6/6 responding)
# Load balancer: HEALTHY (backend pool: 6 healthy)
# Storage: HEALTHY (capacity: 67% used)
# Network: HEALTHY (latency: 12ms to users)
  7. Execute DNS cutover. Update DNS records to point to DR infrastructure:
Terminal window
# Update A records for primary services
# Primary: app.example.org -> 203.0.113.10 (primary)
# DR: app.example.org -> 198.51.100.20 (DR site)
# Using your DNS provider CLI/API
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
--change-batch file://dr-dns-changes.json
# Set low TTL (300 seconds) initially for quick rollback capability
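
For reference, the change batch file referenced above follows the standard Route 53 format. A minimal sketch using the addresses from this example; the record name, IP, and TTL are values to adapt:
{
  "Comment": "DR cutover for app.example.org",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.org",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "198.51.100.20" }]
      }
    }
  ]
}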

DNS propagation timing depends on the previous TTL settings. If the TTL was 3600 seconds (1 hour), expect up to 1 hour for full propagation. Monitor propagation:

Terminal window
# Check propagation from multiple locations
dig +short app.example.org @8.8.8.8
dig +short app.example.org @1.1.1.1
dig +short app.example.org @208.67.222.222
  8. Verify external accessibility:
Terminal window
# Test from outside the network
curl -I https://app.example.org
# Expected: HTTP/2 200, response from DR infrastructure
# Verify certificate validity
echo | openssl s_client -servername app.example.org -connect app.example.org:443 2>/dev/null | openssl x509 -noout -dates

Decision point: If health checks fail after infrastructure activation, assess whether to proceed with degraded capability or halt for remediation. Partial DR operation may be preferable to no operation, but document degraded services and communicate limitations to users.

Checkpoint: Before proceeding to Phase 4, confirm:

  • Replication state recorded (defines RPO achieved)
  • DR infrastructure scaled to production capacity
  • All health checks passing
  • DNS cutover complete
  • External accessibility verified

Phase 4: Application activation and validation

Objective: Bring applications online and verify data integrity
Timeframe: 1-3 hours

Applications must activate in dependency order. The sequence below represents a typical organisation; adjust based on your application dependency map.

  +----------------+
  |  Tier 1: Core  |   Identity, DNS, Core Database
  |   (0-30 min)   |
  +-------+--------+
          |
          v
  +-------+--------+
  |  Tier 2: Auth  |   SSO, MFA, Directory Services
  |  (30-60 min)   |
  +-------+--------+
          |
          v
  +-------+--------+
  | Tier 3: Comms  |   Email, Messaging, Video
  |  (60-90 min)   |
  +-------+--------+
          |
          v
  +-------+--------+
  |  Tier 4: Core  |   ERP, CRM, Grants Management
  |    Business    |
  |  (90-150 min)  |
  +-------+--------+
          |
          v
  +-------+--------+
  | Tier 5: Prog.  |   Case Management, M&E, Data Collection
  |    Systems     |
  | (150-180 min)  |
  +----------------+

Figure 3: Application activation sequence by dependency tier
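
Where activation is scripted, a simple wrapper can enforce this ordering. This is a sketch only; activate-tier.sh and wait-for-healthy.sh are placeholders for whatever tooling wraps your own services:
# Activate tiers in dependency order, halting if a tier fails to become healthy
for tier in 1 2 3 4 5; do
  ./activate-tier.sh "tier${tier}" || { echo "Tier ${tier} activation failed"; break; }
  ./wait-for-healthy.sh "tier${tier}" || { echo "Tier ${tier} not healthy, halting"; break; }
done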

  1. Activate Tier 1 core services:

    • Identity provider / directory services
    • Internal DNS
    • Core databases

    Verify with authentication test:

Terminal window
# Test LDAP/AD connectivity
ldapsearch -H ldaps://dr-dc.example.org -x -b "dc=example,dc=org" "(uid=testuser)"
# Test database connectivity
psql -h dr-db.example.org -U app_user -c "SELECT 1;"
  2. Activate Tier 2 authentication services:

    • Single sign-on
    • Multi-factor authentication
    • Certificate services

    Verify with end-to-end authentication:

Terminal window
# Attempt SSO login
curl -c cookies.txt -b cookies.txt -L https://sso.example.org/auth/test
  3. Activate Tier 3 communication services:

    • Email (verify mail flow)
    • Messaging platform
    • Video conferencing (if self-hosted)

    Verify mail flow:

Terminal window
# Send test email and verify delivery
echo "DR test" | mail -s "DR Mail Flow Test" dr-test@example.org
# Check mail queue
postqueue -p
  4. Activate Tier 4 core business applications:

    • Finance / ERP
    • CRM / Donor management
    • Grants management
    • HR / HCM

    For each application:

    • Start application services
    • Verify database connectivity
    • Test critical transaction (read and write)
    • Verify integration endpoints
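
    A lightweight smoke test confirms each application answers before handing it to users. A sketch; the URLs are placeholders for your own application endpoints:
# Check that each core business application responds over HTTPS
for url in https://erp.example.org https://crm.example.org https://grants.example.org; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  echo "$url -> HTTP $code"
done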
  5. Activate Tier 5 programme systems:

    • Case management
    • M&E platforms
    • Data collection (KoboToolbox, ODK, etc.)
    • Beneficiary registration

    Consider offline data: field data collection systems may hold submissions captured offline during the outage. Establish the process for incorporating this data after activation.

  6. Validate data integrity across all activated systems. Run integrity checks:

-- Check for orphaned records
SELECT COUNT(*) FROM transactions WHERE account_id NOT IN (SELECT id FROM accounts);
-- Verify recent data present
SELECT MAX(created_at) FROM transactions;
-- Should show timestamp close to replication cutover
-- Check referential integrity
SELECT COUNT(*) FROM cases WHERE beneficiary_id NOT IN (SELECT id FROM beneficiaries);
  7. Conduct user acceptance testing with key users from each department. Provide a test script covering:

    • Login and navigation
    • Read operations (viewing records)
    • Write operations (creating/updating records)
    • Critical workflows (e.g., payment processing, case creation)

    Document any failures or anomalies for remediation.

Decision point: If critical applications fail activation, determine whether to:

  • Continue with partial service (document unavailable systems)
  • Halt activation for remediation
  • Roll back to alternative recovery approach

Checkpoint: Before proceeding to Phase 5, confirm:

  • All tier 1-3 services operational
  • Critical business applications accessible
  • Data integrity validated
  • User acceptance testing passed
  • Known limitations documented

Phase 5: Operational transition

Objective: Establish sustainable DR operations for extended duration
Timeframe: Ongoing until failback

  1. Communicate service restoration to all staff with specific guidance:
Subject: Systems Restored - Action Required
IT services have been restored on disaster recovery infrastructure.
WHAT'S WORKING:
- Email and calendar: Normal operation
- [Application]: Normal operation
- [Application]: Normal operation
LIMITATIONS:
- [System]: [Specific limitation]
- Performance: Expect 10-20% slower response times
ACTION REQUIRED:
- VPN: [Instructions if changed]
- Bookmarks: Update to [new URLs if applicable]
FIELD OFFICES:
- [Specific instructions for field connectivity]
SUPPORT: Contact [helpdesk] for issues
NEXT UPDATE: [Time]
  2. Establish DR operations monitoring:

    • Enable alerting on DR infrastructure
    • Monitor capacity utilisation (DR may have less headroom)
    • Track replication status if bidirectional replication is configured
    • Monitor cost accumulation (DR operations typically cost 150-300% of normal)
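
    A quick capacity snapshot helps spot headroom problems early. A sketch combining the dr-status.sh helper used in Phase 1 with standard Kubernetes tooling (requires metrics-server):
# Replication and infrastructure status (same helper as the readiness check)
./dr-status.sh --health-check
# Node-level CPU and memory utilisation on the DR cluster
kubectl --context dr-cluster top nodes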
  3. Document the DR operations runbook for the current invocation:

    • Services running on DR
    • Known limitations
    • Monitoring dashboards
    • Escalation contacts
    • Shift handover procedures (for extended operations)
  4. Establish primary site monitoring for restoration:

    • Regular check-ins with facilities/provider
    • Criteria for declaring primary site ready
    • Preliminary failback timeline
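
    Automated reachability checks complement the manual check-ins. A sketch reusing the primary-health-check.sh helper that failback verification uses; the interval and log file name are illustrative:
# Re-run the primary health check every 30 minutes and keep a log of results
watch -n 1800 './primary-health-check.sh --full | tee -a primary-restoration.log'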
  5. Begin cost tracking for DR operations. Track incremental costs including:

    • Additional cloud compute (if scaling up)
    • Network egress costs
    • Staff overtime
    • Third-party support costs

    Typical DR cost multipliers:

    Resource | Normal monthly | DR monthly | Multiplier
    Compute | £8,000 | £18,000 | 2.25x
    Storage | £2,000 | £3,500 | 1.75x
    Network | £1,500 | £4,000 | 2.67x
    Support | £3,000 | £9,000 | 3.00x
    Total | £14,500 | £34,500 | 2.38x
  6. Plan for extended DR operations if primary restoration exceeds 1 week:

    • Staff rotation schedules
    • Capacity expansion if needed
    • Communication cadence adjustment (daily to weekly updates)
    • Budget reforecast

Checkpoint: DR operations established when:

  • Staff notified with specific instructions
  • Monitoring active on DR infrastructure
  • Operations runbook documented
  • Primary site monitoring established
  • Cost tracking initiated

Phase 6: Failback to primary

Objective: Return operations to restored primary site with minimal disruption
Timeframe: 4-8 hours (planned window)

Failback is not simply “DR invocation in reverse.” The primary site has been offline while DR accumulated production data. Failback requires reverse synchronisation, validation, and careful cutover to prevent data loss.

  +------------------+
  |  Verify primary  |   Confirm full restoration, all systems
  |   restoration    |   healthy, capacity adequate
  |  (Day -3 to -1)  |
  +--------+---------+
           |
           v
  +--------+---------+
  |    Establish     |   Sync DR production data back to primary
  |   reverse sync   |   (may take 12-48 hours depending on delta)
  |  (Day -2 to 0)   |
  +--------+---------+
           |
           v
  +--------+---------+
  |     Schedule     |   Communicate window, prepare staff
  |   maintenance    |
  |      window      |
  |     (Day -1)     |
  +--------+---------+
           |
           v
  +--------+---------+
  |     Execute      |   Stop DR writes, final sync, DNS cutover,
  |     failback     |   verify primary operations
  |     (Day 0)      |
  +--------+---------+
           |
           v
  +--------+---------+
  |   Re-establish   |   Primary is now production, configure
  |  DR replication  |   replication from primary to DR
  |    (Day 0-1)     |
  +------------------+

Figure 4: Failback procedure sequence

  1. Verify primary site full restoration:
Terminal window
# Run health checks against primary infrastructure
./primary-health-check.sh --full
# Verify all expected systems present
# Verify network connectivity
# Verify storage capacity
# Verify compute capacity

Obtain written confirmation from facilities/provider that the incident is fully resolved and recurrence risk is mitigated.

  2. Establish reverse synchronisation from DR to primary. This synchronises production changes made during DR operations back to primary:
Terminal window
# For PostgreSQL
# Configure primary as replica of DR temporarily
pg_basebackup -h dr-db.example.org -D /var/lib/postgresql/data -U replication -Fp -Xs -P
# For file storage
rsync -avz --progress dr:/data/ primary:/data/
# Monitor sync progress
watch -n 60 'rsync --dry-run --stats dr:/data/ primary:/data/ | grep "Total transferred"'

Expected sync duration depends on change volume during DR operations:

DR duration | Typical delta | Sync time
1 day | 50-100 GB | 2-4 hours
1 week | 200-500 GB | 8-24 hours
1 month | 1-2 TB | 24-72 hours
  3. Schedule and communicate maintenance window:
Subject: Planned Maintenance - Return to Primary Systems
WHEN: [Date] [Time] - [Time] ([X] hours)
IMPACT:
- All systems unavailable during maintenance
- [Specific system]: [Extended unavailability if applicable]
ACTION REQUIRED:
- Save all work before [time]
- Log out of all systems by [time]
AFTER MAINTENANCE:
- Systems will be available at [time]
- [Any post-maintenance instructions]
  4. Execute failback cutover during maintenance window:

    a. Announce maintenance start and verify users logged out:

Terminal window
# Check active sessions
./check-active-sessions.sh
# Force disconnect if necessary (after grace period)

    b. Stop application writes on DR:

Terminal window
# Put applications in read-only or maintenance mode
kubectl --context dr-cluster set env deployment/app MAINTENANCE_MODE=true

    c. Execute final synchronisation:

Terminal window
# Final database sync
pg_dump -h dr-db.example.org production | psql -h primary-db.example.org production
# Verify row counts match
psql -h dr-db.example.org -c "SELECT COUNT(*) FROM transactions;"
psql -h primary-db.example.org -c "SELECT COUNT(*) FROM transactions;"

    d. Update DNS to point to primary:

Terminal window
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
--change-batch file://primary-dns-changes.json
# Monitor propagation
watch -n 30 'dig +short app.example.org'

    e. Start applications on primary:

Terminal window
kubectl --context primary-cluster scale deployment --all --replicas=3

    f. Verify primary operations:

Terminal window
# Health checks
./primary-health-check.sh --full
# Authentication test (same check as Tier 2 activation)
curl -c cookies.txt -b cookies.txt -L https://sso.example.org/auth/test
# Application smoke tests
curl -I https://app.example.org   # expect responses from primary infrastructure
  5. Re-establish normal DR replication (primary to DR):
Terminal window
# Re-seed DR as a replica of the restored primary (PostgreSQL example, run on the DR database host)
pg_basebackup -h primary-db.example.org -D /var/lib/postgresql/data -U replication -Fp -Xs -P -R
# The -R flag writes standby.signal and connection settings so DR follows primary again
  6. Announce maintenance completion:
Subject: Maintenance Complete - Systems Available
Systems have been restored to primary infrastructure.
All services are now available at normal performance levels.
If you experience any issues, contact [helpdesk].
  7. Scale down or stop DR infrastructure:
Terminal window
# Reduce DR to standby capacity
kubectl --context dr-cluster scale deployment --all --replicas=0
# Or maintain warm standby
kubectl --context dr-cluster scale deployment --all --replicas=1

Decision point: If verification fails after DNS cutover to primary, immediately execute rollback by returning DNS to DR infrastructure. Do not proceed with degraded primary operations.
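
Rollback reuses the change batch that performed the original DR cutover, so no new DNS content needs to be written under pressure:
# Point DNS back at DR infrastructure
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
--change-batch file://dr-dns-changes.json
# Confirm resolution returns to the DR address
dig +short app.example.org @8.8.8.8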

Checkpoint: Failback complete when:

  • Primary site verified operational
  • Reverse sync complete with data validation
  • DNS pointing to primary
  • Applications operational on primary
  • Normal DR replication re-established
  • DR infrastructure scaled to standby

Phase 7: Post-incident review

Objective: Document lessons learned and improve DR capability
Timeframe: Within 2 weeks of failback completion

  1. Schedule post-incident review meeting within 5 working days of failback. Include:

    • All DR team members
    • Executive sponsor
    • Key affected stakeholders
  2. Prepare incident timeline documenting:

    • Detection time
    • Decision time
    • Each phase start/end time
    • Actual vs planned duration for each phase
    • Issues encountered
    • Workarounds implemented
  3. Conduct blameless review addressing:

    • What went well?
    • What could be improved?
    • What was confusing or unclear?
    • Were runbooks adequate?
    • Were tools adequate?
    • What would we do differently?
  4. Document findings and action items:

    Finding | Action | Owner | Due date
    Runbook step 7 unclear | Revise with specific commands | Technical Lead | +2 weeks
    Replication lag exceeded RPO | Review replication configuration | DBA | +1 week
    Staff notification delayed | Update contact list, test notification system | Operations Lead | +1 week
  5. Update DR documentation:

    • BCDR plan revisions
    • Runbook improvements
    • Contact list updates
    • Lessons learned log
  6. Schedule follow-up DR test within 3 months to validate improvements.

Communications

Stakeholder notification matrix

Stakeholder | Timing | Channel | Owner | Escalation
Executive leadership | Within 30 minutes | Phone call | DR Commander | Board chair if >24 hours
All staff | Within 1 hour | SMS + email | Communications Lead | None
Board of directors | Within 4 hours | Email | Chief Executive | Chair direct call
Key donors | Within 24 hours | Email | Donor relations | Executive call if requested
Implementing partners | Within 4 hours | Email + phone | Programme leads | Executive if critical
Regulatory bodies | Per requirements | Per requirements | Legal/Compliance | Executive
Media | Only if a proactive statement is needed | Press statement | Communications | Executive approval

Communication templates

Executive notification (30 minutes):

Subject: [URGENT] DR Site Invocation - [Organisation Name]
[Executive name],
We are invoking disaster recovery procedures due to [brief cause].
CURRENT STATUS:
- Primary site: [Unavailable/Compromised/Inaccessible]
- Estimated primary restoration: [Timeline or "Unknown"]
- DR activation: In progress, estimated completion [time]
IMPACT:
- [Services affected]
- Expected data loss: [RPO - e.g., "Up to 2 hours of data"]
- Business impact: [Brief description]
DECISIONS NEEDED:
- [Any executive decisions required]
NEXT UPDATE: [Time]
DR Commander: [Name]
Contact: [Phone]

Staff notification (1 hour):

Subject: IT Systems Update - DR Activation in Progress
All staff,
Our primary IT systems are currently unavailable due to [general cause - e.g., "facility issues"]. We are activating disaster recovery systems.
CURRENT STATUS:
- DR activation in progress
- Estimated service restoration: [Time]
WHAT TO DO NOW:
- Do NOT attempt to access systems until notified
- Do NOT contact IT unless urgent (high volume expected)
- Check your email at [time] for restoration announcement
URGENT MATTERS ONLY: [Emergency contact]
We will provide an update by [time].
IT Management

Service restoration notification:

Subject: IT Systems Restored - Action Required
All staff,
IT services have been restored on disaster recovery infrastructure.
SYSTEMS AVAILABLE:
- Email and calendar: Working normally
- [System]: Working normally
- [System]: Working with limitations (see below)
KNOWN LIMITATIONS:
- [System]: [Specific limitation]
- Performance: You may notice slightly slower response times
ACTION REQUIRED:
- [Specific instructions, e.g., VPN changes, bookmark updates]
FIELD OFFICES:
- [Specific instructions]
SUPPORT: [Contact] | [Hours]
Next update: [Date/time or "When there are significant changes"]
IT Management

Failback announcement:

Subject: Planned Maintenance - [Date] - System Migration to Primary
All staff,
Following our recent DR activation, primary systems have been restored. We will migrate back to primary infrastructure during a planned maintenance window.
MAINTENANCE WINDOW:
- Date: [Date]
- Time: [Start] to [End] ([Duration])
- Impact: All IT systems unavailable
BEFORE MAINTENANCE:
- Save all work by [time]
- Log out of all systems by [time]
- Ensure any urgent tasks are completed
AFTER MAINTENANCE:
- Systems available from [time]
- [Any post-maintenance instructions]
SUPPORT: [Contact] during maintenance for urgent issues only
Thank you for your patience during this period.
IT Management

BCDR plan template

The following template provides the structure for your Business Continuity and Disaster Recovery plan. Complete this template during planning (not during an incident) and review quarterly.


BCDR Plan: [Organisation Name]

Document control:

Version | Date | Author | Changes
1.0 | [Date] | [Author] | Initial version

Review schedule: Quarterly
Next review: [Date]
Plan owner: [Role]


Section 1: Scope and objectives

In scope:

  • [List systems, locations, and functions covered]

Out of scope:

  • [List exclusions]

Recovery objectives:

Tier | Systems | RTO | RPO
1 | [Core infrastructure, identity] | 1 hour | 15 minutes
2 | [Authentication, email] | 2 hours | 1 hour
3 | [Business applications] | 4 hours | 4 hours
4 | [Programme systems] | 8 hours | 4 hours
5 | [Non-critical systems] | 24 hours | 24 hours

Section 2: DR infrastructure

Primary site:

  • Location: [Address/data centre]
  • Provider: [If applicable]
  • Capacity: [Compute, storage, network]

DR site:

  • Location: [Address/data centre]
  • Provider: [If applicable]
  • Capacity: [Compute, storage, network]
  • Distance from primary: [km/miles]

DR architecture type: [Hot/Warm/Cold standby]

Replication configuration:

System | Method | Frequency | RPO achieved
[Database] | [Streaming/Snapshot] | [Continuous/Hourly] | [Minutes/Hours]
[File storage] | [Sync/Backup] | [Frequency] | [Minutes/Hours]
[Application state] | [Method] | [Frequency] | [Minutes/Hours]

Section 3: Team and contacts

DR team:

Role | Primary | Phone | Email | Backup
DR Commander | [Name] | [Phone] | [Email] | [Name]
Technical Lead | [Name] | [Phone] | [Email] | [Name]
Application Lead | [Name] | [Phone] | [Email] | [Name]
Communications Lead | [Name] | [Phone] | [Email] | [Name]
Operations Lead | [Name] | [Phone] | [Email] | [Name]

External contacts:

Vendor/Partner | Contact | Phone | Account number
[ISP - Primary] | [Name] | [Phone] | [Account]
[ISP - DR] | [Name] | [Phone] | [Account]
[Cloud provider] | [Support channel] | [Phone] | [Account]
[Data centre - Primary] | [Name] | [Phone] | [Account]
[Data centre - DR] | [Name] | [Phone] | [Account]

Emergency conference bridge:

  • Primary: [Details]
  • Backup: [Details]

Section 4: Activation criteria

DR invocation is authorised when:

  1. [Criterion 1 with specific threshold]
  2. [Criterion 2 with specific threshold]
  3. [Criterion 3 with specific threshold]

Authorisation required from: [Role]


Section 5: System inventory

System | Tier | Primary location | DR location | Dependencies | Recovery procedure
[System 1] | 1 | [Location] | [Location] | [List] | [Link to procedure]
[System 2] | 2 | [Location] | [Location] | [List] | [Link to procedure]

Section 6: Application dependency map

[Insert ASCII diagram of application dependencies]

Section 7: Network configuration

Primary site network:

  • External IP range: [Range]
  • Internal IP range: [Range]
  • DNS servers: [IPs]

DR site network:

  • External IP range: [Range]
  • Internal IP range: [Range]
  • DNS servers: [IPs]

DNS records requiring update:

Record | Type | Primary value | DR value | TTL
[app.example.org] | A | [IP] | [IP] | [Seconds]

Section 8: Testing record

Test date | Test type | Scope | Result | Findings | Remediation
[Date] | [Tabletop/Technical/Full] | [Scope] | [Pass/Partial/Fail] | [Summary] | [Actions]

Section 9: Revision history

Date | Change | Author | Approved by
[Date] | [Description] | [Name] | [Name]

See also