DR Site Invocation
Disaster recovery site invocation transfers organisational IT operations from an unavailable primary site to pre-established recovery infrastructure. This playbook governs full DR activation when component-level recovery proves insufficient and extended primary site unavailability requires sustained operations from alternate infrastructure. Invoke this playbook only after determining that targeted recovery through Cloud Failover, Infrastructure Recovery, or Backup Recovery playbooks cannot restore operations within acceptable timeframes.
Activation criteria
Invoke this playbook when the following conditions are met:
| Criterion | Threshold | Verification method |
|---|---|---|
| Primary site unavailability | Confirmed or expected to exceed 4 hours | Facility management confirmation, provider status |
| Critical service impact | 3 or more Tier 1 services affected | Service monitoring dashboard |
| Component recovery insufficient | Individual playbooks cannot restore within RTO | Technical lead assessment |
| Recovery timeline | Primary site restoration exceeds 24 hours | Facility or provider estimate |
| Data centre access | Physical access impossible or unsafe | Security/facilities confirmation |
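As a sanity check during the decision call, the quantitative criteria above can be encoded as a simple pre-invocation gate. This is an illustrative sketch, not part of the playbook tooling; the field names and assessment structure are assumptions:

```python
# Illustrative pre-invocation gate based on the activation criteria table.
# Field names are assumptions for this sketch; the data centre access
# criterion is assessed separately by security/facilities.
def dr_invocation_warranted(assessment: dict) -> bool:
    """Return True only when every quantitative criterion is met."""
    return (
        assessment["primary_outage_hours"] >= 4            # site unavailability
        and assessment["tier1_services_affected"] >= 3     # critical service impact
        and not assessment["component_recovery_feasible"]  # targeted playbooks insufficient
        and assessment["restoration_estimate_hours"] > 24  # primary restoration estimate
    )

example = {
    "primary_outage_hours": 6,
    "tier1_services_affected": 4,
    "component_recovery_feasible": False,
    "restoration_estimate_hours": 36,
}
print(dr_invocation_warranted(example))  # True: all thresholds met
```

Keeping the gate an AND of all criteria matches the playbook's intent that full DR invocation is a last resort.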
Activation authority
DR invocation requires explicit authorisation from the IT Director or designated alternate. Unauthorised invocation creates coordination failures and potential data integrity issues from premature failover.
Invocation triggers by scenario
Physical facility loss occurs when fire, flood, structural damage, or utility failure renders the primary data centre inoperable. Verification requires facility management confirmation that restoration exceeds 24 hours. Power failures extending beyond UPS and generator capacity (typically 4-8 hours of fuel) trigger evaluation.
Regional infrastructure failure affects multiple organisations simultaneously through widespread power grid failure, telecommunications outage, or natural disaster. Provider status pages and regional news confirm scope. When the incident affects connectivity to the primary site rather than the site itself, evaluate whether staff can operate remotely before invoking full DR.
Cyber incident requiring isolation applies when ransomware, advanced persistent threat, or other compromise requires complete primary environment isolation for investigation. The incident commander from the relevant security playbook authorises DR invocation to maintain operations while preserving the compromised environment as evidence.
Planned invocation for testing follows the DR Testing procedures rather than this playbook. Test invocations use controlled conditions and predetermined rollback points.
Roles
| Role | Responsibility | Primary assignee | Backup |
|---|---|---|---|
| DR Commander | Overall coordination, go/no-go decisions, stakeholder communication | IT Director | Deputy IT Director |
| Technical Lead | Infrastructure activation sequence, technical verification | Infrastructure Manager | Senior Systems Administrator |
| Application Lead | Application recovery sequencing, data validation | Applications Manager | Lead Developer |
| Communications Lead | Staff notification, external communications, status updates | Communications Manager | IT Director |
| Operations Lead | Staff logistics, access coordination, operational continuity | Operations Manager | HR Manager |
The DR Commander holds decision authority throughout invocation. Technical recommendations flow to the DR Commander who authorises each phase transition. The DR Commander may delegate phase-level authority to leads for their domains while retaining overall coordination.
```
+------------------------------------------------------------------+
|                       DR COMMAND STRUCTURE                       |
+------------------------------------------------------------------+
|                                                                  |
|                      +------------------+                        |
|                      |   DR Commander   |                        |
|                      |  (IT Director)   |                        |
|                      +--------+---------+                        |
|                               |                                  |
|         +---------------------+---------------------+            |
|         |                     |                     |            |
|         v                     v                     v            |
|  +------+-------+     +-------+------+     +--------+-------+    |
|  |  Technical   |     | Application  |     | Communications |    |
|  |     Lead     |     |     Lead     |     |      Lead      |    |
|  +--------------+     +--------------+     +----------------+    |
|         |                     |                     |            |
|         v                     v                     v            |
|  +------+--------+    +-------+------+     +--------+-------+    |
|  | Infrastructure|    |  App Teams   |     |   Operations   |    |
|  |     Team      |    |              |     |      Lead      |    |
|  +---------------+    +--------------+     +----------------+    |
|                                                                  |
+------------------------------------------------------------------+
```
Figure 1: DR command structure showing decision authority flow
Phase 1: Decision and authorisation
Objective: Confirm DR invocation is appropriate and obtain authorisation
Timeframe: 30-60 minutes
Assemble the DR decision team (DR Commander, Technical Lead, Application Lead) via the emergency communication channel. Use the predetermined conference bridge or messaging channel established in the BCDR plan. If primary communication tools are affected, fall back to mobile phones using the emergency contact list.
Collect situation assessment from each lead:
- Technical Lead: Primary site status, estimated restoration time, component recovery feasibility
- Application Lead: Affected services, data synchronisation status, RPO implications
- Operations Lead: Staff safety status, facility access status
Verify DR site readiness by confirming:
- Last successful replication timestamp (acceptable if within RPO, typically 1-4 hours)
- DR infrastructure health check (automated monitoring or manual verification)
- Network path availability to DR site
- Staff ability to access DR site (VPN, physical access if applicable)
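The replication-lag comparison in the first check can be sketched as follows; the timestamps and the 4-hour RPO are illustrative values, not prescribed ones:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the replication-lag check: the last successful replication
# timestamp must fall within the RPO window. Values are illustrative.
RPO = timedelta(hours=4)

def within_rpo(last_replication: datetime, now: datetime,
               rpo: timedelta = RPO) -> bool:
    """True when replication lag does not exceed the RPO."""
    return (now - last_replication) <= rpo

now = datetime(2024, 11, 16, 15, 10, tzinfo=timezone.utc)
last = datetime(2024, 11, 16, 14, 23, tzinfo=timezone.utc)
print(within_rpo(last, now))  # True: 47-minute lag, inside the 4-hour RPO
```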
Run the DR readiness check:
```shell
# Check replication lag
./dr-status.sh --check-replication
# Expected output: Replication lag: 47 minutes (within 4-hour RPO)

# Verify DR infrastructure
./dr-status.sh --health-check
# Expected output: All 12 critical systems responding

# Test network path
./dr-status.sh --network-test
# Expected output: DR site reachable, latency 23ms
```

Document the decision rationale including:
- Primary site status and estimated restoration
- Services affected and business impact
- Recovery options considered and rejected
- Replication status and data loss implications
Obtain DR Commander authorisation. The DR Commander reviews the assessment and provides explicit verbal authorisation: “DR invocation is authorised for [scenario]. Proceeding to Phase 2.” Record the authorisation timestamp.
Decision point: If DR site readiness verification fails (replication lag exceeds RPO, infrastructure unhealthy, or network unavailable), halt invocation and address readiness issues before proceeding. Partial DR invocation creates worse outcomes than delayed invocation with full readiness.
Checkpoint: Before proceeding to Phase 2, confirm:
- DR Commander has provided explicit authorisation
- DR site replication is within acceptable RPO
- DR infrastructure health check passed
- Decision rationale documented
- Timestamp recorded
Phase 2: Communication and coordination
Objective: Notify all stakeholders and prepare staff for transition
Timeframe: 30-45 minutes (runs parallel to Phase 3 preparation)
Notify executive leadership using the executive notification template below. The DR Commander or designated Communications Lead makes direct contact (phone call, not email) with:
- Chief Executive
- Chief Operating Officer
- Chief Financial Officer (cost implications)
- Board chair (if outage exceeds 24 hours or involves data loss)
Activate the staff notification cascade. Use the emergency notification system (SMS broadcast, emergency app, or phone tree) to reach all staff within 15 minutes:
[URGENT] IT DR ACTIVATION
Primary systems unavailable. DR site activating.
Check email/[channel] for instructions within 1 hour.
Do NOT attempt to access primary systems.
Questions: [emergency contact]

Notify critical vendors and partners:
- Internet service providers (both primary and DR site)
- Cloud service providers
- Managed service providers
- Key implementing partners who depend on shared systems
Establish the operational communication rhythm:
- Status updates every 2 hours during active recovery
- Dedicated channel for DR team coordination
- Separate channel for staff questions (staffed by Operations Lead)
Prepare detailed staff instructions for system access from DR infrastructure. Staff need specific guidance on:
- VPN configuration changes (if any)
- New URLs for applications (if different)
- Authentication changes (if DR uses different identity infrastructure)
- Expected service limitations during DR operations
Checkpoint: Before proceeding to Phase 3 execution, confirm:
- Executive leadership notified and acknowledged
- Staff notification broadcast sent
- Critical vendors notified
- Communication channels established
- Staff instruction document prepared
Phase 3: Infrastructure activation
Objective: Bring DR infrastructure to operational state
Timeframe: 1-4 hours depending on DR architecture
The activation sequence depends on your DR architecture. The three common models require different approaches:
Hot standby maintains running infrastructure with continuous replication. Activation involves DNS cutover and verification. Expected activation time: 15-60 minutes.
Warm standby maintains infrastructure in reduced-capacity state with periodic replication. Activation involves scaling resources, applying recent replication, and DNS cutover. Expected activation time: 1-2 hours.
Cold standby maintains infrastructure definitions and backup data only. Activation involves provisioning infrastructure, restoring from backups, and DNS cutover. Expected activation time: 4-8 hours.
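The three models can be captured as a small lookup that a runbook script might consult. The step names and duration ranges mirror the descriptions above; the structure itself is an illustrative sketch, not existing tooling:

```python
# Illustrative lookup of activation steps and expected durations per DR
# architecture, mirroring the three standby models described above.
ACTIVATION = {
    "hot": {
        "steps": ["verify sync status", "dns cutover", "verify access"],
        "expected_minutes": (15, 60),
    },
    "warm": {
        "steps": ["scale resources", "apply recent replication",
                  "dns cutover", "verify access"],
        "expected_minutes": (60, 120),
    },
    "cold": {
        "steps": ["provision infrastructure", "restore from backup",
                  "dns cutover", "verify access"],
        "expected_minutes": (240, 480),
    },
}

def activation_plan(architecture: str) -> list:
    """Ordered activation steps for the given standby model."""
    return ACTIVATION[architecture]["steps"]

print(activation_plan("warm"))
```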
```
+------------------------------------------------------------------+
|                    DR ACTIVATION DECISION TREE                   |
+------------------------------------------------------------------+
|                                                                  |
|                     +-------------------+                        |
|                     |  DR Architecture  |                        |
|                     |       Type?       |                        |
|                     +--------+----------+                        |
|                              |                                   |
|          +-------------------+-------------------+               |
|          |                   |                   |               |
|          v                   v                   v               |
|    +-----+-----+       +-----+-----+       +-----+-----+         |
|    |    HOT    |       |   WARM    |       |   COLD    |         |
|    |  STANDBY  |       |  STANDBY  |       |  STANDBY  |         |
|    +-----+-----+       +-----+-----+       +-----+-----+         |
|          |                   |                   |               |
|          v                   v                   v               |
|    +-----+-----+       +-----+-----+       +-----+-----+         |
|    |  Verify   |       |   Scale   |       | Provision |         |
|    |   sync    |       | resources |       |   infra   |         |
|    |  status   |       +-----+-----+       +-----+-----+         |
|    +-----+-----+             |                   |               |
|          |                   v                   v               |
|          |             +-----+-----+       +-----+-----+         |
|          |             |   Apply   |       |  Restore  |         |
|          |             |  recent   |       |   from    |         |
|          |             |   repl.   |       |  backup   |         |
|          |             +-----+-----+       +-----+-----+         |
|          |                   |                   |               |
|          +-------------------+-------------------+               |
|                              |                                   |
|                              v                                   |
|                     +--------+--------+                          |
|                     |   DNS cutover   |                          |
|                     +-----------------+                          |
|                              |                                   |
|                              v                                   |
|                     +--------+--------+                          |
|                     |  Verify access  |                          |
|                     +-----------------+                          |
|                                                                  |
+------------------------------------------------------------------+
```
Figure 2: DR activation sequence by architecture type
- Verify final replication state before cutover. Record the last successful replication timestamp as this defines your Recovery Point:
```shell
# For database replication
psql -h dr-db.example.org -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
# Record: Last replay: 2024-11-16 14:23:47 UTC

# For file replication
rsync --dry-run --stats primary:/data/ dr:/data/
# Record: Files requiring sync (should be minimal)

# For cloud replication
az site-recovery show-recovery-point --vault-name dr-vault
# Record: Latest recovery point timestamp
```

- Stop replication from primary (if applicable and safe). For active-passive configurations, stopping replication prevents corruption from a partially available primary:
```shell
# PostgreSQL streaming replication
psql -h dr-db.example.org -c "SELECT pg_promote();"

# Azure Site Recovery
az site-recovery planned-failover --direction PrimaryToRecovery
```

- Scale DR infrastructure to production capacity (warm standby only):
```shell
# Kubernetes
kubectl --context dr-cluster scale deployment --all --replicas=3

# Azure VM Scale Sets
az vmss scale --name dr-vmss --new-capacity 6

# AWS Auto Scaling
aws autoscaling set-desired-capacity --auto-scaling-group-name dr-asg --desired-capacity 6
```

- Provision infrastructure from definitions (cold standby only):
```shell
# Terraform
cd infrastructure/dr
terraform apply -var="environment=dr-active"

# Record provisioning start time for tracking
```

- Restore from backups (cold standby only). Follow the Backup Recovery playbook for detailed restore procedures. Critical sequence:
- Database servers first (longest restore time)
- Application servers after database availability confirmed
- File storage concurrent with application servers
Verify infrastructure health before DNS cutover:
```shell
# Run comprehensive health check
./dr-health-check.sh --full

# Expected output:
# Database: HEALTHY (connections: 47, replication: N/A - promoted)
# App servers: HEALTHY (6/6 responding)
# Load balancer: HEALTHY (backend pool: 6 healthy)
# Storage: HEALTHY (capacity: 67% used)
# Network: HEALTHY (latency: 12ms to users)
```

- Execute DNS cutover. Update DNS records to point to DR infrastructure:
```shell
# Update A records for primary services
# Primary: app.example.org -> 203.0.113.10 (primary)
# DR:      app.example.org -> 198.51.100.20 (DR site)

# Using your DNS provider CLI/API
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch file://dr-dns-changes.json

# Set a low TTL (300 seconds) initially for quick rollback capability
```

DNS propagation timing depends on the previous TTL settings. If the TTL was 3600 seconds (1 hour), expect up to 1 hour for full propagation. Monitor propagation:
```shell
# Check propagation from multiple locations
dig +short app.example.org @8.8.8.8
dig +short app.example.org @1.1.1.1
dig +short app.example.org @208.67.222.222
```

- Verify external accessibility:
```shell
# Test from outside the network
curl -I https://app.example.org
# Expected: HTTP/2 200, response from DR infrastructure

# Verify certificate validity
echo | openssl s_client -servername app.example.org -connect app.example.org:443 2>/dev/null | openssl x509 -noout -dates
```

Decision point: If health checks fail after infrastructure activation, assess whether to proceed with degraded capability or halt for remediation. Partial DR operation may be preferable to no operation, but document degraded services and communicate limitations to users.
Checkpoint: Before proceeding to Phase 4, confirm:
- Replication state recorded (defines RPO achieved)
- DR infrastructure scaled to production capacity
- All health checks passing
- DNS cutover complete
- External accessibility verified
Phase 4: Application activation and validation
Objective: Bring applications online and verify data integrity
Timeframe: 1-3 hours
Applications must activate in dependency order. The sequence below represents a typical organisation; adjust based on your application dependency map.
```
+------------------------------------------------------------------+
|                 APPLICATION ACTIVATION SEQUENCE                  |
+------------------------------------------------------------------+
|                                                                  |
|   +----------------+                                             |
|   | Tier 1: Core   |  Identity, DNS, Core Database               |
|   | (0-30 min)     |                                             |
|   +-------+--------+                                             |
|           |                                                      |
|           v                                                      |
|   +-------+--------+                                             |
|   | Tier 2: Auth   |  SSO, MFA, Directory Services               |
|   | (30-60 min)    |                                             |
|   +-------+--------+                                             |
|           |                                                      |
|           v                                                      |
|   +-------+--------+                                             |
|   | Tier 3: Comms  |  Email, Messaging, Video                    |
|   | (60-90 min)    |                                             |
|   +-------+--------+                                             |
|           |                                                      |
|           v                                                      |
|   +-------+--------+                                             |
|   | Tier 4: Core   |  ERP, CRM, Grants Management                |
|   | Business       |                                             |
|   | (90-150 min)   |                                             |
|   +-------+--------+                                             |
|           |                                                      |
|           v                                                      |
|   +-------+--------+                                             |
|   | Tier 5: Prog.  |  Case Management, M&E, Data Collection      |
|   | Systems        |                                             |
|   | (150-180 min)  |                                             |
|   +----------------+                                             |
|                                                                  |
+------------------------------------------------------------------+
```
Figure 3: Application activation sequence by dependency tier
Activate Tier 1 core services:
- Identity provider / directory services
- Internal DNS
- Core databases
Verify with authentication test:
```shell
# Test LDAP/AD connectivity
ldapsearch -H ldaps://dr-dc.example.org -x -b "dc=example,dc=org" "(uid=testuser)"

# Test database connectivity
psql -h dr-db.example.org -U app_user -c "SELECT 1;"
```

Activate Tier 2 authentication services:
- Single sign-on
- Multi-factor authentication
- Certificate services
Verify with end-to-end authentication:
```shell
# Attempt SSO login
curl -c cookies.txt -b cookies.txt -L https://sso.example.org/auth/test
```

Activate Tier 3 communication services:
- Email (verify mail flow)
- Messaging platform
- Video conferencing (if self-hosted)
Verify mail flow:
```shell
# Send test email and verify delivery
echo "DR test" | mail -s "DR Mail Flow Test" dr-test@example.org

# Check mail queue
postqueue -p
```

Activate Tier 4 core business applications:
- Finance / ERP
- CRM / Donor management
- Grants management
- HR / HCM
For each application:
- Start application services
- Verify database connectivity
- Test critical transaction (read and write)
- Verify integration endpoints
Activate Tier 5 programme systems:
- Case management
- M&E platforms
- Data collection (KoboToolbox, ODK, etc.)
- Beneficiary registration
Address offline data considerations: field data collection systems may hold data captured offline during the outage. Establish a process for incorporating this data after activation.
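One way to sketch that incorporation process is deduplication by submission identifier. The `uuid` field and record shapes below are hypothetical, though ODK-family tools do attach a unique instance ID to each submission:

```python
# Hypothetical sketch of reconciling offline submissions after activation:
# append only records whose submission identifier is not already present.
def merge_offline_submissions(server_records, offline_records):
    """Merge offline records into server records, skipping duplicates."""
    known = {record["uuid"] for record in server_records}
    merged = list(server_records)
    for record in offline_records:
        if record["uuid"] not in known:
            merged.append(record)
            known.add(record["uuid"])
    return merged

server = [{"uuid": "a1", "value": 10}]
offline = [{"uuid": "a1", "value": 10}, {"uuid": "b2", "value": 20}]
print(len(merge_offline_submissions(server, offline)))  # 2: duplicate a1 skipped
```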
Validate data integrity across all activated systems. Run integrity checks:
```sql
-- Check for orphaned records
SELECT COUNT(*) FROM transactions
WHERE account_id NOT IN (SELECT id FROM accounts);

-- Verify recent data present
SELECT MAX(created_at) FROM transactions;
-- Should show a timestamp close to the replication cutover

-- Check referential integrity
SELECT COUNT(*) FROM cases
WHERE beneficiary_id NOT IN (SELECT id FROM beneficiaries);
```

Conduct user acceptance testing with key users from each department. Provide a test script covering:
- Login and navigation
- Read operations (viewing records)
- Write operations (creating/updating records)
- Critical workflows (e.g., payment processing, case creation)
Document any failures or anomalies for remediation.
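A UAT round can be driven by a small checklist runner that records pass/fail per check without aborting mid-run. The checks below are stubs standing in for real login, read, and write probes against your applications:

```python
# Illustrative UAT checklist runner: executes named checks and records
# pass/fail; an exception or falsy return counts as a failure.
def run_uat(checks):
    """Run each check and collect a name -> passed mapping."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

checks = {
    "login": lambda: True,   # stub: SSO login succeeded
    "read": lambda: True,    # stub: record retrieved
    "write": lambda: 1 / 0,  # stub: simulated write failure
}
results = run_uat(checks)
print(results)  # {'login': True, 'read': True, 'write': False}
```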
Decision point: If critical applications fail activation, determine whether to:
- Continue with partial service (document unavailable systems)
- Halt activation for remediation
- Roll back to alternative recovery approach
Checkpoint: Before proceeding to Phase 5, confirm:
- All tier 1-3 services operational
- Critical business applications accessible
- Data integrity validated
- User acceptance testing passed
- Known limitations documented
Phase 5: Operational transition
Objective: Establish sustainable DR operations for extended duration
Timeframe: Ongoing until failback
- Communicate service restoration to all staff with specific guidance:
Subject: Systems Restored - Action Required

IT services have been restored on disaster recovery infrastructure.

WHAT'S WORKING:
- Email and calendar: Normal operation
- [Application]: Normal operation
- [Application]: Normal operation

LIMITATIONS:
- [System]: [Specific limitation]
- Performance: Expect 10-20% slower response times

ACTION REQUIRED:
- VPN: [Instructions if changed]
- Bookmarks: Update to [new URLs if applicable]

FIELD OFFICES:
- [Specific instructions for field connectivity]

SUPPORT: Contact [helpdesk] for issues
NEXT UPDATE: [Time]

Establish DR operations monitoring:
- Enable alerting on DR infrastructure
- Monitor capacity utilisation (DR may have less headroom)
- Track replication status if bidirectional replication is configured
- Monitor cost accumulation (DR operations typically cost 150-300% of normal)
Document the DR operations runbook for the current invocation:
- Services running on DR
- Known limitations
- Monitoring dashboards
- Escalation contacts
- Shift handover procedures (for extended operations)
Establish primary site monitoring for restoration:
- Regular check-ins with facilities/provider
- Criteria for declaring primary site ready
- Preliminary failback timeline
Begin cost tracking for DR operations. Track incremental costs including:
- Additional cloud compute (if scaling up)
- Network egress costs
- Staff overtime
- Third-party support costs
Typical DR cost multipliers:
| Resource | Normal monthly | DR monthly | Multiplier |
|---|---|---|---|
| Compute | £8,000 | £18,000 | 2.25x |
| Storage | £2,000 | £3,500 | 1.75x |
| Network | £1,500 | £4,000 | 2.67x |
| Support | £3,000 | £9,000 | 3.00x |
| Total | £14,500 | £34,500 | 2.38x |

Plan for extended DR operations if primary restoration exceeds 1 week:
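The multipliers above follow directly from dividing DR monthly cost by normal monthly cost, which a short script can verify (the figures are the table's illustrative examples, not benchmarks):

```python
# Verifies the cost multipliers in the table: DR monthly divided by
# normal monthly, rounded to two decimals. Figures are illustrative.
COSTS = {
    "Compute": (8000, 18000),
    "Storage": (2000, 3500),
    "Network": (1500, 4000),
    "Support": (3000, 9000),
}

def multiplier(normal, dr):
    return round(dr / normal, 2)

for resource, (normal, dr) in COSTS.items():
    print(resource, multiplier(normal, dr))

total_normal = sum(normal for normal, _ in COSTS.values())
total_dr = sum(dr for _, dr in COSTS.values())
print("Total", multiplier(total_normal, total_dr))  # 2.38
```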
- Staff rotation schedules
- Capacity expansion if needed
- Communication cadence adjustment (daily to weekly updates)
- Budget reforecast
Checkpoint: DR operations established when:
- Staff notified with specific instructions
- Monitoring active on DR infrastructure
- Operations runbook documented
- Primary site monitoring established
- Cost tracking initiated
Phase 6: Failback to primary
Objective: Return operations to restored primary site with minimal disruption
Timeframe: 4-8 hours (planned window)
Failback is not simply “DR invocation in reverse.” The primary site has been offline while DR accumulated production data. Failback requires reverse synchronisation, validation, and careful cutover to prevent data loss.
```
+------------------------------------------------------------------+
|                        FAILBACK SEQUENCE                         |
+------------------------------------------------------------------+
|                                                                  |
|   +------------------+                                           |
|   | Verify primary   |  Confirm full restoration, all systems    |
|   | restoration      |  healthy, capacity adequate               |
|   | (Day -3 to -1)   |                                           |
|   +--------+---------+                                           |
|            |                                                     |
|            v                                                     |
|   +--------+---------+                                           |
|   | Establish        |  Sync DR production data back to primary  |
|   | reverse sync     |  (may take 12-48 hours depending on       |
|   | (Day -2 to 0)    |  delta)                                   |
|   +--------+---------+                                           |
|            |                                                     |
|            v                                                     |
|   +--------+---------+                                           |
|   | Schedule         |  Communicate window, prepare staff        |
|   | maintenance      |                                           |
|   | window (Day -1)  |                                           |
|   +--------+---------+                                           |
|            |                                                     |
|            v                                                     |
|   +--------+---------+                                           |
|   | Execute failback |  Stop DR writes, final sync, DNS cutover, |
|   | (Day 0)          |  verify primary operations                |
|   +--------+---------+                                           |
|            |                                                     |
|            v                                                     |
|   +--------+---------+                                           |
|   | Re-establish DR  |  Primary is now production; configure     |
|   | replication      |  replication from primary to DR           |
|   | (Day 0-1)        |                                           |
|   +------------------+                                           |
|                                                                  |
+------------------------------------------------------------------+
```
Figure 4: Failback procedure sequence
- Verify primary site full restoration:
```shell
# Run health checks against primary infrastructure
./primary-health-check.sh --full

# Verify all expected systems present
# Verify network connectivity
# Verify storage capacity
# Verify compute capacity
```

Obtain written confirmation from facilities/provider that the incident is fully resolved and recurrence risk is mitigated.
- Establish reverse synchronisation from DR to primary. This synchronises production changes made during DR operations back to primary:
```shell
# For PostgreSQL
# Configure primary as a replica of DR temporarily
pg_basebackup -h dr-db.example.org -D /var/lib/postgresql/data -U replication -Fp -Xs -P

# For file storage
rsync -avz --progress dr:/data/ primary:/data/

# Monitor sync progress
watch -n 60 'rsync --dry-run --stats dr:/data/ primary:/data/ | grep "Total transferred"'
```

Expected sync duration depends on the change volume during DR operations:
| DR duration | Typical delta | Sync time |
|---|---|---|
| 1 day | 50-100 GB | 2-4 hours |
| 1 week | 200-500 GB | 8-24 hours |
| 1 month | 1-2 TB | 24-72 hours |
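These bands are consistent with a simple transfer-time estimate: time scales with the data delta and the effective link speed. The 100 Mbps link and 70% protocol efficiency below are illustrative assumptions:

```python
# Rough transfer-time estimate behind the sync-time bands above.
# Bandwidth and efficiency figures are illustrative assumptions.
def sync_hours(delta_gb, bandwidth_mbps=100, efficiency=0.7):
    """Approximate hours to transfer delta_gb over the given link."""
    megabits = delta_gb * 8 * 1000          # GB -> megabits (decimal approx.)
    seconds = megabits / (bandwidth_mbps * efficiency)
    return round(seconds / 3600, 1)

print(sync_hours(100))  # 3.2 hours - within the 2-4 hour band for a 1-day delta
print(sync_hours(500))  # 15.9 hours - within the 8-24 hour band for a 1-week delta
```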
- Schedule and communicate maintenance window:
Subject: Planned Maintenance - Return to Primary Systems

WHEN: [Date] [Time] - [Time] ([X] hours)

IMPACT:
- All systems unavailable during maintenance
- [Specific system]: [Extended unavailability if applicable]

ACTION REQUIRED:
- Save all work before [time]
- Log out of all systems by [time]

AFTER MAINTENANCE:
- Systems will be available at [time]
- [Any post-maintenance instructions]

- Execute failback cutover during the maintenance window:
a. Announce maintenance start and verify users logged out:
```shell
# Check active sessions
./check-active-sessions.sh
# Force disconnect if necessary (after a grace period)
```

b. Stop application writes on DR:
```shell
# Put applications in read-only or maintenance mode
kubectl --context dr-cluster set env deployment/app MAINTENANCE_MODE=true
```

c. Execute final synchronisation:
```shell
# Final database sync
pg_dump -h dr-db.example.org production | psql -h primary-db.example.org production

# Verify row counts match
psql -h dr-db.example.org -c "SELECT COUNT(*) FROM transactions;"
psql -h primary-db.example.org -c "SELECT COUNT(*) FROM transactions;"
```

d. Update DNS to point to primary:
```shell
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch file://primary-dns-changes.json

# Monitor propagation
watch -n 30 'dig +short app.example.org'
```

e. Start applications on primary:
```shell
kubectl --context primary-cluster scale deployment --all --replicas=3
```

f. Verify primary operations:
```shell
# Health checks
./primary-health-check.sh --full

# Authentication test
# Application smoke tests
```

- Re-establish normal DR replication (primary to DR):
```shell
# Configure DR as a replica of primary
# This returns to the normal DR configuration
```

- Announce maintenance completion:
Subject: Maintenance Complete - Systems Available

Systems have been restored to primary infrastructure. All services are now available at normal performance levels.

If you experience any issues, contact [helpdesk].

- Scale down or stop DR infrastructure:
```shell
# Reduce DR to standby capacity
kubectl --context dr-cluster scale deployment --all --replicas=0

# Or maintain a warm standby
kubectl --context dr-cluster scale deployment --all --replicas=1
```

Decision point: If verification fails after DNS cutover to primary, immediately execute rollback by returning DNS to DR infrastructure. Do not proceed with degraded primary operations.
Checkpoint: Failback complete when:
- Primary site verified operational
- Reverse sync complete with data validation
- DNS pointing to primary
- Applications operational on primary
- Normal DR replication re-established
- DR infrastructure scaled to standby
Phase 7: Post-incident review
Objective: Document lessons learned and improve DR capability
Timeframe: Within 2 weeks of failback completion
Schedule post-incident review meeting within 5 working days of failback. Include:
- All DR team members
- Executive sponsor
- Key affected stakeholders
Prepare incident timeline documenting:
- Detection time
- Decision time
- Each phase start/end time
- Actual vs planned duration for each phase
- Issues encountered
- Workarounds implemented
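The actual-versus-planned comparison can be tabulated mechanically for the review; the phase durations (in minutes) below are made-up examples, not targets:

```python
# Illustrative actual-versus-planned summary for the post-incident review.
planned = {"Phase 1": 60, "Phase 2": 45, "Phase 3": 240, "Phase 4": 180}
actual = {"Phase 1": 75, "Phase 2": 40, "Phase 3": 310, "Phase 4": 175}

def variance(planned, actual):
    """Minutes over (+) or under (-) plan for each phase."""
    return {phase: actual[phase] - planned[phase] for phase in planned}

print(variance(planned, actual))
```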
Conduct blameless review addressing:
- What went well?
- What could be improved?
- What was confusing or unclear?
- Were runbooks adequate?
- Were tools adequate?
- What would we do differently?
Document findings and action items:
| Finding | Action | Owner | Due date |
|---|---|---|---|
| Runbook step 7 unclear | Revise with specific commands | Technical Lead | +2 weeks |
| Replication lag exceeded RPO | Review replication configuration | DBA | +1 week |
| Staff notification delayed | Update contact list, test notification system | Operations Lead | +1 week |

Update DR documentation:
- BCDR plan revisions
- Runbook improvements
- Contact list updates
- Lessons learned log
Schedule follow-up DR test within 3 months to validate improvements.
Communications
Stakeholder notification matrix
| Stakeholder | Timing | Channel | Owner | Escalation |
|---|---|---|---|---|
| Executive leadership | Within 30 minutes | Phone call | DR Commander | Board chair if >24 hours |
| All staff | Within 1 hour | SMS + email | Communications Lead | None |
| Board of directors | Within 4 hours | | Chief Executive | Chair direct call |
| Key donors | Within 24 hours | | Donor relations | Executive call if requested |
| Implementing partners | Within 4 hours | Email + phone | Programme leads | Executive if critical |
| Regulatory bodies | Per requirements | Per requirements | Legal/Compliance | Executive |
| Media | Only if proactive needed | Press statement | Communications | Executive approval |
Communication templates
Executive notification (30 minutes):
Subject: [URGENT] DR Site Invocation - [Organisation Name]
[Executive name],
We are invoking disaster recovery procedures due to [brief cause].
CURRENT STATUS:
- Primary site: [Unavailable/Compromised/Inaccessible]
- Estimated primary restoration: [Timeline or "Unknown"]
- DR activation: In progress, estimated completion [time]

IMPACT:
- [Services affected]
- Expected data loss: [RPO - e.g., "Up to 2 hours of data"]
- Business impact: [Brief description]

DECISIONS NEEDED:
- [Any executive decisions required]

NEXT UPDATE: [Time]

DR Commander: [Name]
Contact: [Phone]

Staff notification (1 hour):
Subject: IT Systems Update - DR Activation in Progress
All staff,
Our primary IT systems are currently unavailable due to [general cause - e.g., "facility issues"]. We are activating disaster recovery systems.
CURRENT STATUS:
- DR activation in progress
- Estimated service restoration: [Time]

WHAT TO DO NOW:
- Do NOT attempt to access systems until notified
- Do NOT contact IT unless urgent (high volume expected)
- Check your email at [time] for the restoration announcement

URGENT MATTERS ONLY: [Emergency contact]

We will provide an update by [time].

IT Management

Service restoration notification:
Subject: IT Systems Restored - Action Required
All staff,
IT services have been restored on disaster recovery infrastructure.
SYSTEMS AVAILABLE:
- Email and calendar: Working normally
- [System]: Working normally
- [System]: Working with limitations (see below)

KNOWN LIMITATIONS:
- [System]: [Specific limitation]
- Performance: You may notice slightly slower response times

ACTION REQUIRED:
- [Specific instructions, e.g., VPN changes, bookmark updates]

FIELD OFFICES:
- [Specific instructions]

SUPPORT: [Contact] | [Hours]

Next update: [Date/time or "When there are significant changes"]

IT Management

Failback announcement:
Subject: Planned Maintenance - [Date] - System Migration to Primary
All staff,
Following our recent DR activation, primary systems have been restored. We will migrate back to primary infrastructure during a planned maintenance window.
MAINTENANCE WINDOW:
- Date: [Date]
- Time: [Start] to [End] ([Duration])
- Impact: All IT systems unavailable

BEFORE MAINTENANCE:
- Save all work by [time]
- Log out of all systems by [time]
- Ensure any urgent tasks are completed

AFTER MAINTENANCE:
- Systems available from [time]
- [Any post-maintenance instructions]

SUPPORT: [Contact] during maintenance for urgent issues only

Thank you for your patience during this period.

IT Management

BCDR plan template
The following template provides the structure for your Business Continuity and Disaster Recovery plan. Complete this template during planning (not during an incident) and review quarterly.
BCDR Plan: [Organisation Name]
Document control:
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | [Date] | [Author] | Initial version |
Review schedule: Quarterly Next review: [Date] Plan owner: [Role]
Section 1: Scope and objectives
In scope:
- [List systems, locations, and functions covered]
Out of scope:
- [List exclusions]
Recovery objectives:
| Tier | Systems | RTO | RPO |
|---|---|---|---|
| 1 | [Core infrastructure, identity] | 1 hour | 15 minutes |
| 2 | [Authentication, email] | 2 hours | 1 hour |
| 3 | [Business applications] | 4 hours | 4 hours |
| 4 | [Programme systems] | 8 hours | 4 hours |
| 5 | [Non-critical systems] | 24 hours | 24 hours |
Section 2: DR infrastructure
Primary site:
- Location: [Address/data centre]
- Provider: [If applicable]
- Capacity: [Compute, storage, network]
DR site:
- Location: [Address/data centre]
- Provider: [If applicable]
- Capacity: [Compute, storage, network]
- Distance from primary: [km/miles]
DR architecture type: [Hot/Warm/Cold standby]
Replication configuration:
| System | Method | Frequency | RPO achieved |
|---|---|---|---|
| [Database] | [Streaming/Snapshot] | [Continuous/Hourly] | [Minutes/Hours] |
| [File storage] | [Sync/Backup] | [Frequency] | [Minutes/Hours] |
| [Application state] | [Method] | [Frequency] | [Minutes/Hours] |
Section 3: Team and contacts
DR team:
| Role | Primary | Phone | Email | Backup |
|---|---|---|---|---|
| DR Commander | [Name] | [Phone] | [Email] | [Name] |
| Technical Lead | [Name] | [Phone] | [Email] | [Name] |
| Application Lead | [Name] | [Phone] | [Email] | [Name] |
| Communications Lead | [Name] | [Phone] | [Email] | [Name] |
| Operations Lead | [Name] | [Phone] | [Email] | [Name] |
External contacts:
| Vendor/Partner | Contact | Phone | Account number |
|---|---|---|---|
| [ISP - Primary] | [Name] | [Phone] | [Account] |
| [ISP - DR] | [Name] | [Phone] | [Account] |
| [Cloud provider] | [Support channel] | [Phone] | [Account] |
| [Data centre - Primary] | [Name] | [Phone] | [Account] |
| [Data centre - DR] | [Name] | [Phone] | [Account] |
Emergency conference bridge:
- Primary: [Details]
- Backup: [Details]
Section 4: Activation criteria
DR invocation is authorised when:
- [Criterion 1 with specific threshold]
- [Criterion 2 with specific threshold]
- [Criterion 3 with specific threshold]
Authorisation required from: [Role]
Section 5: System inventory
| System | Tier | Primary location | DR location | Dependencies | Recovery procedure |
|---|---|---|---|---|---|
| [System 1] | 1 | [Location] | [Location] | [List] | [Link to procedure] |
| [System 2] | 2 | [Location] | [Location] | [List] | [Link to procedure] |
Section 6: Application dependency map
[Insert ASCII diagram of application dependencies]

Section 7: Network configuration
Primary site network:
- External IP range: [Range]
- Internal IP range: [Range]
- DNS servers: [IPs]
DR site network:
- External IP range: [Range]
- Internal IP range: [Range]
- DNS servers: [IPs]
DNS records requiring update:
| Record | Type | Primary value | DR value | TTL |
|---|---|---|---|---|
| [app.example.org] | A | [IP] | [IP] | [Seconds] |
Section 8: Testing record
| Test date | Test type | Scope | Result | Findings | Remediation |
|---|---|---|---|---|---|
| [Date] | [Tabletop/Technical/Full] | [Scope] | [Pass/Partial/Fail] | [Summary] | [Actions] |
Section 9: Revision history
| Date | Change | Author | Approved by |
|---|---|---|---|
| [Date] | [Description] | [Name] | [Name] |
See also
- High Availability and Disaster Recovery - Architecture concepts and design principles
- DR Testing - Regular testing procedures and tabletop exercises
- Cloud Failover - Cloud-specific failover procedures
- Infrastructure Recovery - Component-level recovery
- Backup Recovery - Data restoration procedures
- Major Service Outage - Service outage response