
DR Testing

Disaster recovery testing validates that documented recovery procedures work as intended and that recovery time objectives (RTO) and recovery point objectives (RPO) are achievable under realistic conditions. Testing ranges from discussion-based tabletop exercises that cost nothing but participant time, through technical simulations that verify individual recovery procedures, to full failover tests that prove end-to-end recovery capability. Perform DR testing at least annually for critical systems, with tabletop exercises quarterly and technical validation after any significant infrastructure change.

Prerequisites

| Requirement | Detail |
|---|---|
| BCDR plan | Current business continuity and disaster recovery plan documenting recovery procedures |
| Test schedule | Approved annual test calendar with dates, scope, and participants |
| Participant availability | Confirmed availability of key personnel for test duration |
| Test environment | Isolated environment for technical tests (not production) |
| Success criteria | Documented RTO/RPO targets for systems under test |
| Management approval | Written authorisation for test scope, especially for full failover tests |
| Communication plan | Stakeholder notification procedures to prevent confusion with real incidents |

Verify BCDR plan currency before scheduling tests. Plans older than 12 months or predating significant infrastructure changes require review first:

Terminal window
# Check BCDR plan metadata
jq '.last_review, .infrastructure_baseline_date' /docs/bcdr/plan-metadata.json
# Expected output shows dates within 12 months
# "last_review": "2024-09-15"
# "infrastructure_baseline_date": "2024-08-01"

Confirm participant availability through calendar holds at least four weeks before tabletop exercises and eight weeks before technical tests. DR testing without key personnel invalidates results and wastes resources.

Test types

DR testing encompasses four distinct approaches, each serving different validation purposes. The progression from tabletop to full failover represents increasing realism, cost, and risk.

+--------------------------------------------------------------------+
|                      DR TEST TYPE PROGRESSION                      |
+--------------------------------------------------------------------+
|                                                                    |
|   TABLETOP              WALKTHROUGH           SIMULATION           |
|   (Discussion)          (Procedural)          (Technical)          |
|                                                                    |
|   +-------------+       +-------------+       +-------------+      |
|   |             |       |             |       |             |      |
|   | Scenario    |       | Step-by-    |       | Execute     |      |
|   | discussion  +------>| step review +------>| procedures  |      |
|   | No systems  |       | No systems  |       | Test env    |      |
|   |             |       |             |       |             |      |
|   +-------------+       +-------------+       +-------------+      |
|                                                                    |
|   Cost: Low             Cost: Low             Cost: Medium         |
|   Risk: None            Risk: None            Risk: Low            |
|   Duration: 2-4h        Duration: 4-8h        Duration: 1-2 days   |
|   Frequency: Quarterly  Frequency: Biannual   Frequency: Annual    |
|                                                                    |
+--------------------------------------------------------------------+
|                                                                    |
|                          FULL FAILOVER                             |
|                          (Production)                              |
|                                                                    |
|                          +-------------+                           |
|                          |             |                           |
|                          | Actual      |                           |
|                          | failover    |                           |
|                          | Production  |                           |
|                          |             |                           |
|                          +-------------+                           |
|                                                                    |
|                          Cost: High                                |
|                          Risk: Medium-High                         |
|                          Duration: 2-5 days                        |
|                          Frequency: Annual (critical systems)      |
|                                                                    |
+--------------------------------------------------------------------+

Tabletop exercises gather stakeholders to walk through a disaster scenario verbally, testing decision-making, communication chains, and plan comprehension without touching any systems. A facilitator presents an evolving scenario while participants describe their responses. Tabletops expose gaps in understanding, unclear responsibilities, and missing procedures at minimal cost.

Walkthrough tests extend tabletops by having participants physically trace through documented procedures step by step, identifying missing steps, outdated references, and impractical sequences. Walkthroughs validate documentation accuracy without executing actual recovery actions.

Simulation tests execute recovery procedures against test environments, validating technical capability without risking production systems. Simulations verify backup restoration times, application startup sequences, and data integrity. Results provide measured RTO/RPO achievement against test data.

Full failover tests perform actual disaster recovery by failing over production systems to secondary infrastructure. Full failovers provide the only definitive validation that recovery works under real conditions but carry service disruption risk and require extensive planning.

Procedure

Phase 1: Test planning

  1. Select test type and scope based on time since last test and recent changes:

    | Last test type | Time elapsed | Recommended next test |
    |---|---|---|
    | Full failover | < 6 months | Tabletop for new scenarios |
    | Full failover | 6-12 months | Simulation of changed systems |
    | Full failover | > 12 months | Full failover |
    | Simulation | < 3 months | Tabletop |
    | Simulation | 3-6 months | Simulation of different systems |
    | Simulation | > 6 months | Full failover if critical |
    | Tabletop only | Any | Simulation minimum |
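The selection table can be expressed as a small helper, useful when scripting test-calendar reminders. A sketch; the function name and output labels are illustrative, not part of any standard tooling:

```shell
# Map (last test type, months elapsed) to the recommended next test,
# mirroring the decision table above.
recommend_next_test() {
  local last="$1" months="$2"
  case "$last" in
    full)
      if   [ "$months" -lt 6 ];  then echo "tabletop (new scenarios)"
      elif [ "$months" -le 12 ]; then echo "simulation (changed systems)"
      else echo "full failover"; fi ;;
    simulation)
      if   [ "$months" -lt 3 ]; then echo "tabletop"
      elif [ "$months" -le 6 ]; then echo "simulation (different systems)"
      else echo "full failover (if critical)"; fi ;;
    tabletop)
      echo "simulation (minimum)" ;;
    *)
      echo "unknown last test type: $last" >&2; return 1 ;;
  esac
}

recommend_next_test full 14   # prints: full failover
```
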
  2. Define test scenario with specific parameters. Effective scenarios include:

    • Scenario trigger (ransomware detected at 14:00 Tuesday, datacentre fire overnight, cloud region outage during peak hours)
    • Systems affected (primary database server, entire Azure West Europe region, all on-premises infrastructure)
    • Data state (last backup 4 hours old, replication lag of 15 minutes, offline for 6 hours before detection)
    • Constraints (key staff on leave, weekend timing, concurrent programme delivery)

    Document scenario in test plan:

SCENARIO: Regional cloud outage
Trigger: Azure West Europe region becomes unavailable at 09:30
Monday due to provider infrastructure failure
Affected systems:
- Primary ERP instance (Azure West Europe)
- Document management (SharePoint, same region)
- Email (Exchange Online, multi-region but degraded)
Data state:
- Database replication to North Europe: 5-minute lag
- Document sync to North Europe: 30-minute lag
- Email: No data loss expected
Constraints:
- Finance month-end processing in progress
- IT Manager travelling (available by phone only)
- Donor site visit scheduled for 14:00
Success criteria:
- ERP operational in North Europe within 2 hours (RTO)
- Data loss under 15 minutes (RPO)
- User communication within 30 minutes
  3. Identify participants by role:

    | Role | Responsibility | Required for |
    |---|---|---|
    | Test coordinator | Overall facilitation, timekeeping, documentation | All tests |
    | Technical lead | System recovery execution, technical decisions | Simulation, Full |
    | Business owner | Priority decisions, user communication | All tests |
    | IT Manager | Resource allocation, escalation | All tests |
    | Communications lead | Stakeholder updates | Tabletop, Full |
    | Observer | Documentation, timing measurements | All tests |
    | External vendor | Vendor-specific recovery support | As needed |
  4. Schedule test with buffer time. Technical tests frequently exceed planned duration:

    | Test type | Planned duration | Schedule buffer | Total calendar block |
    |---|---|---|---|
    | Tabletop | 2 hours | 1 hour | 3 hours |
    | Walkthrough | 4 hours | 2 hours | 6 hours |
    | Simulation | 8 hours | 4 hours | 12 hours |
    | Full failover | 16 hours | 8 hours | 24 hours |
  5. Prepare test environment for simulation and full failover tests:

Terminal window
# For simulation tests: verify test environment isolation
# Test environment should not connect to production data sources
# Check network isolation
nmap -sn 10.0.0.0/24 # Production network (ping scan; -sn replaces deprecated -sP)
# Expected: No hosts reachable from test environment
# Verify test database is snapshot, not live replica
psql -h testdb.internal -c "SELECT pg_is_in_recovery();"
# Expected: f (false, indicating standalone not replica)
# Confirm test environment uses separate identity
az account show --query "{sub:name, tenant:tenantId}"
# Expected: Test subscription, not production
  6. Distribute pre-reading to participants 5 working days before test:
    • Current BCDR plan sections relevant to scenario
    • System documentation for affected systems
    • Contact lists and escalation procedures
    • Previous test reports for context

Phase 2: Tabletop exercise execution

  1. Open the exercise with ground rules (15 minutes):

    • This is a learning exercise, not a performance evaluation
    • No actual systems will be touched
    • Respond as you would in a real incident
    • Note uncertainties for later investigation rather than inventing answers
    • Facilitator will inject new information as scenario evolves
  2. Present initial scenario and gather immediate responses (30 minutes):

    Facilitator reads scenario trigger. Each participant states:

    • Their immediate actions in first 15 minutes
    • Who they would contact and how
    • What information they need
    • What decisions require escalation

    Document responses without judgement. Note gaps, conflicts, and uncertainties.

  3. Inject scenario developments (60-90 minutes):

    Advance scenario through 3-5 injects spaced 15-20 minutes apart. Each inject adds complexity:

INJECT 1 (T+30 min): Users report email working intermittently.
Finance team cannot access ERP for month-end processing.
CEO asks for status update.
INJECT 2 (T+60 min): Azure status page confirms regional outage,
estimated resolution 4-6 hours. Partner organisation calls asking
if shared data is affected. IT discovers DR runbook references
server names that changed 6 months ago.
INJECT 3 (T+90 min): North Europe failover initiated but database
restore shows corruption in last 3 backups. Must restore from
backup 18 hours old. Finance asks about data loss implications.
INJECT 4 (T+120 min): Primary region begins recovery. Must decide
whether to continue failover or wait for primary. Users complain
about conflicting instructions from IT and management.

After each inject, participants describe their responses, decisions, and communications.

  4. Conduct hotwash immediately after scenario conclusion (30 minutes):

    Three questions for each participant:

    • What worked well in our response?
    • What would you do differently?
    • What gaps or issues did we discover?

    Document all items raised without filtering. Categorisation comes later.

Phase 3: Technical test execution

For simulation and full failover tests, execute recovery procedures while measuring actual performance against objectives.

  1. Confirm pre-test checklist:
PRE-TEST VERIFICATION
[ ] Test environment isolated and ready
[ ] All participants present or dialled in
[ ] Communication channels established (separate from systems under test)
[ ] Monitoring and timing tools ready
[ ] Rollback procedures confirmed
[ ] Production notification sent (if full failover)
[ ] Management authorisation confirmed for scope
[ ] Observer assigned with stopwatch and checklist
  2. Execute failover sequence while observer records timings:
TIMING LOG TEMPLATE
Test: ERP Failover Simulation
Date: 2024-11-16
Start time: 09:00
| Step | Planned | Actual | Variance | Notes |
|------|---------|--------|----------|-------|
| Declare DR event | 00:00 | 09:00 | - | Test start |
| Assess impact | 00:15 | 09:18 | +3 min | |
| Notify stakeholders | 00:30 | 09:35 | +5 min | Email template missing |
| Begin failover | 00:45 | 09:52 | +7 min | Approval delayed |
| Database restore | 01:30 | 10:46 | +16 min | Larger than expected |
| App tier startup | 02:00 | 11:25 | +25 min | Config file error |
| Connectivity test | 02:15 | 11:42 | +27 min | |
| User validation | 02:30 | 12:05 | +35 min | |
| DR complete | 02:30 | 12:05 | +35 min | RTO: 3h05m vs 2h target |
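If the timing log is kept as CSV, the variance column can be computed rather than entered by hand. A sketch, assuming a 09:00 start (540 minutes past midnight) and columns `step,planned_offset,actual_clock`; the file layout is illustrative:

```shell
# Variance per step = (actual clock time - test start) - planned offset,
# all converted to minutes.
variance_report() {  # $1 = test start, in minutes since midnight
  awk -F, -v start="$1" '
    function mins(t,  p) { split(t, p, ":"); return p[1] * 60 + p[2] }
    NR > 1 { printf "%s: %+d min\n", $1, (mins($3) - start) - mins($2) }
  '
}

variance_report 540 <<'EOF'
step,planned_offset,actual_clock
Notify stakeholders,00:30,09:35
Begin failover,00:45,09:52
EOF
```

Run against the two sample rows, this reports the +5 and +7 minute variances shown in the example log.
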
  3. Validate data integrity after recovery:
Terminal window
# Compare record counts between backup and restored database
psql -h dr-db.internal -c "SELECT COUNT(*) FROM transactions
WHERE created_at < '2024-11-16 09:00:00';"
# Verify against pre-failover count (from test documentation)
# Expected: 147,832 (matches baseline)
# If mismatch, calculate data loss:
# (Baseline - Restored) / Baseline * 100 = % data loss
# Check most recent record timestamp
psql -h dr-db.internal -c "SELECT MAX(created_at) FROM transactions;"
# Compare to scenario data state to confirm RPO achievement
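The data-loss formula in the comments can be evaluated directly. The baseline below is the figure from the test documentation above; the restored count is a hypothetical mismatch for illustration:

```shell
baseline=147832   # pre-failover count from test documentation
restored=147790   # hypothetical restored count showing a small loss
awk -v b="$baseline" -v r="$restored" \
  'BEGIN { printf "Lost %d records (%.3f%% data loss)\n", b - r, (b - r) / b * 100 }'
# prints: Lost 42 records (0.028% data loss)
```
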
  4. Test critical business functions with business owners:

    Finance team validates:

    • Can access system
    • Can view recent transactions
    • Can create new transactions
    • Reports generate correctly
    • Data appears correct (spot check 5-10 known records)
  5. For full failover tests, operate from DR environment for minimum 4 hours to validate stability. Monitor for:

    • Performance degradation
    • Integration failures
    • Capacity issues
    • User-reported problems
  6. Execute failback procedures and confirm production restoration:

Terminal window
# Post-failback validation
# Compare production database to DR state
psql -h prod-db.internal -c "SELECT COUNT(*) FROM transactions;"
# Should equal or exceed DR count (new transactions during DR)
# Verify no replication lag
psql -h prod-db.internal -c "SELECT * FROM pg_stat_replication;"
# Expected: sent_lsn = write_lsn = flush_lsn

Phase 4: Findings documentation

  1. Compile raw findings within 48 hours while memory is fresh:
RAW FINDINGS LOG
Test: Q4 2024 ERP DR Simulation
Date: 2024-11-16
OBSERVATIONS (factual, no judgement):
1. Database restore took 58 minutes vs 45 minute estimate
2. Application configuration file referenced old server name
3. Notification email template not found in documented location
4. Finance business owner unavailable; deputy authorised transactions
5. Network team not included in initial notification
6. DR runbook page 12 references deprecated Azure portal UI
7. Test database 40% larger than sizing estimate
8. SSL certificate on DR load balancer expired 2024-10-01
9. Backup verification job last ran 2024-09-15
10. User acceptance took 20 minutes vs 15 minute estimate
  2. Classify findings by type and severity:

    | Finding | Type | Severity | RTO/RPO impact |
    |---|---|---|---|
    | Database restore time | Performance | Medium | +13 min to RTO |
    | Config file error | Documentation | High | +25 min to RTO |
    | Missing email template | Documentation | Low | +5 min to RTO |
    | Business owner absence | Process | Medium | Acceptable with deputy |
    | Network team notification | Process | Low | No direct impact |
    | Deprecated UI references | Documentation | Low | Confusion, no delay |
    | Database sizing | Planning | Medium | Future risk |
    | Expired certificate | Technical | Critical | Would have blocked failover |
    | Backup verification gap | Process | High | RPO risk |
    | User acceptance time | Estimate | Low | Minor variance |
  3. Calculate achieved vs target metrics:

METRICS SUMMARY
Recovery Time Objective (RTO)
Target: 2 hours
Achieved: 3 hours 5 minutes
Status: FAILED (exceeded by 54%)
Recovery Point Objective (RPO)
Target: 15 minutes data loss
Achieved: 8 minutes data loss
Status: PASSED
Communication SLA
Target: Initial notification within 30 minutes
Achieved: 35 minutes
Status: FAILED (exceeded by 17%)
User Validation
Target: Business owner sign-off within 30 minutes of recovery
Achieved: 20 minutes
Status: PASSED
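The overrun percentages in the summary come from a simple calculation over the raw minutes; a helper keeps the arithmetic repeatable (the figures are this test's targets and results):

```shell
# Percentage by which an achieved time exceeds its target.
overrun_pct() {  # $1 = target minutes, $2 = achieved minutes
  awk -v t="$1" -v a="$2" 'BEGIN { printf "%.0f\n", (a - t) / t * 100 }'
}

overrun_pct 120 185   # RTO: 2h target vs 3h05m achieved -> prints 54
overrun_pct 30 35     # Communication SLA: 30 min vs 35 min -> prints 17
```
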

Phase 5: Gap analysis and remediation

  1. Map findings to root causes:
GAP ANALYSIS MATRIX

| Finding | Root cause | Category |
|---|---|---|
| Config file error | No change control for DR configs | Process gap |
| Database restore slow | Growth not tracked for DR sizing | Capacity planning |
| Expired certificate | No DR certificate in renewal scope | Maintenance gap |
| Missing template | Documentation not version controlled | Documentation gap |
| Backup verification gap | Job disabled after false positives | Process gap |
  2. Prioritise remediations using a risk-based approach:

    | Remediation | Effort | Risk reduction | Priority |
    |---|---|---|---|
    | Renew DR certificates | Low (2 hours) | Critical (blocks recovery) | Immediate |
    | Re-enable backup verification | Low (1 hour) | High (RPO risk) | Immediate |
    | Update DR config management | Medium (2 days) | High (RTO impact) | 30 days |
    | Resize DR database | Medium (4 hours) | Medium (performance) | 30 days |
    | Version control DR docs | High (5 days) | Medium (efficiency) | 60 days |
    | Update runbook screenshots | Low (3 hours) | Low (confusion) | 90 days |
  3. Create remediation tickets with specific acceptance criteria:

TICKET: DR-2024-001
Title: Renew DR environment SSL certificates
Priority: Critical
Owner: Infrastructure Team
Due: 2024-11-23
Description:
DR testing revealed expired SSL certificate on DR load balancer
(expired 2024-10-01). Failover would have failed at user
connectivity step.
Acceptance criteria:
- DR load balancer certificate renewed (valid 12+ months)
- Certificate added to renewal monitoring
- DR certificate inventory documented
- Verification: curl -v https://dr-erp.internal shows valid cert
  4. Schedule remediation validation:

    After remediations complete, validate fixes before next scheduled test:

Terminal window
# Verify certificate remediation
echo | openssl s_client -servername dr-erp.internal \
-connect dr-erp.internal:443 2>/dev/null | \
openssl x509 -noout -dates
# Expected: notAfter at least 12 months future
# Verify backup verification job
cat /var/log/backup-verify/latest.log | tail -20
# Expected: Recent successful verification run
# Verify DR config matches production
diff /etc/app/config.prod.yaml /etc/app/config.dr.yaml
# Expected: Only environment-specific differences

Phase 6: Reporting

  1. Prepare executive summary within 5 working days:
DR TEST EXECUTIVE SUMMARY
Test: Q4 2024 ERP Disaster Recovery Simulation
Date: 16 November 2024
Classification: PARTIALLY SUCCESSFUL
Objectives tested:
- Failover to North Europe region
- Database restoration from replication
- Application recovery and user validation
Results:
- RTO: FAILED (3h05m vs 2h target, 54% over)
- RPO: PASSED (8 min vs 15 min target)
- Critical blocker found: Expired SSL certificate
Key findings requiring action:
1. [CRITICAL] DR certificates expired - remediated 2024-11-18
2. [HIGH] DR configuration management gap - remediation in progress
3. [HIGH] Backup verification disabled - remediated 2024-11-17
Recommendation:
Repeat simulation test in Q1 2025 after remediations complete
to validate RTO achievement before annual full failover.
Next scheduled test: Tabletop exercise, January 2025
  2. Present findings to leadership within 10 working days:

    Focus presentation on:

    • Pass/fail status against objectives
    • Business risk implications of gaps
    • Remediation status and timeline
    • Resource requirements for fixes
    • Recommended test frequency adjustments
  3. Update BCDR plan with lessons learned:

    Incorporate test findings into plan updates:

    • Corrected procedures based on walkthrough findings
    • Updated time estimates based on measured performance
    • New decision points identified during tabletop
    • Revised contact information and escalation paths
  4. Archive test documentation:

    Retain for compliance and future reference:

    • Test plan and scenario
    • Timing logs and observations
    • Findings and gap analysis
    • Remediation tracker
    • Executive summary
    • Participant list
Terminal window
# Archive test documentation
mkdir -p /archive/dr-tests/2024-Q4
cp -r /tests/dr-2024-11-16/* /archive/dr-tests/2024-Q4/
# Generate archive manifest (exclude the manifest itself so it does not
# record a self-hash that can never verify)
find /archive/dr-tests/2024-Q4 -type f ! -name manifest.sha256 \
-exec sha256sum {} \; > /archive/dr-tests/2024-Q4/manifest.sha256
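Later retrieval can confirm the archive is intact by verifying the manifest with `sha256sum -c`. This sketch round-trips the same manifest commands in a scratch directory so the check can be rehearsed safely anywhere:

```shell
# Build a manifest the same way as the archive step, then verify it.
dir=$(mktemp -d)
echo "test plan"  > "$dir/test-plan.md"
echo "timing log" > "$dir/timing-log.csv"
(
  cd "$dir"
  find . -type f ! -name manifest.sha256 -exec sha256sum {} \; > manifest.sha256
  sha256sum -c --quiet manifest.sha256 && echo "archive intact"
)
```

A zero exit from `sha256sum -c` means every archived file still matches its recorded hash; any tampered or truncated file is reported by name.
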

Tabletop exercise template

The following template provides structure for conducting tabletop exercises. Adapt scenario details to organisational context and current risk profile.


DR TABLETOP EXERCISE

Exercise title: [Descriptive name, e.g., “Ransomware affecting finance systems”]

Date: [Exercise date]

Duration: [Planned duration, typically 2-3 hours]

Facilitator: [Name and role]

Participants:

| Name | Role | Contact |
|---|---|---|
|  |  |  |

SCENARIO

Background: [2-3 sentences describing normal operating context before incident]

Trigger event: [Specific incident trigger with date, time, and initial indicators]

Affected systems: [List of systems, data, and services affected]

Current state: [Data backup status, replication state, known constraints]


INJECT 1 (Present at T+0)

[Initial scenario description, 3-4 sentences]

Discussion questions:

  • What are your immediate actions?
  • Who do you notify and how?
  • What information do you need?
  • What decisions require escalation?

INJECT 2 (Present at T+20 minutes)

[Scenario development adding complexity]

Discussion questions:

  • How does this change your response?
  • What new resources do you need?
  • Who else needs to be involved?
  • What are you communicating to stakeholders?

INJECT 3 (Present at T+40 minutes)

[Further development, typically introducing a complication or difficult decision]

Discussion questions:

  • What is your decision and rationale?
  • What are the trade-offs?
  • How do you communicate this decision?
  • What could go wrong with this approach?

INJECT 4 (Present at T+60 minutes)

[Resolution phase begins, new decisions required]

Discussion questions:

  • When do you declare recovery complete?
  • What validation do you require?
  • What post-incident actions are needed?
  • What would you do differently?

HOTWASH QUESTIONS

  1. What worked well in our response?
  2. What would you do differently?
  3. What gaps or issues did we discover?
  4. What changes to plans or procedures do we need?
  5. What training or resources are needed?

FINDINGS LOG

| Finding | Category | Severity | Owner | Action required |
|---|---|---|---|---|
|  |  |  |  |  |

EXERCISE EVALUATION

| Criterion | Rating (1-5) | Comments |
|---|---|---|
| Scenario realism |  |  |
| Participant engagement |  |  |
| Plan effectiveness |  |  |
| Communication clarity |  |  |
| Decision-making quality |  |  |

Overall assessment: [Summary paragraph]

Recommended next test: [Type and timeframe]


Verification

Confirm test completion and value delivery through these checks:

Test execution verification:

  • All planned scenario phases completed
  • Required participants present throughout
  • Timing measurements recorded (for technical tests)
  • Observer notes captured

Documentation verification:

Terminal window
# Verify required documents exist
ls -la /tests/dr-$(date +%Y-%m-%d)/
# Expected files:
# - test-plan.md
# - timing-log.csv (simulation/failover only)
# - findings-raw.md
# - gap-analysis.md
# - executive-summary.md
# - remediation-tracker.md
# Verify findings logged
wc -l /tests/dr-$(date +%Y-%m-%d)/findings-raw.md
# Expected: Minimum 10 findings for meaningful test
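The file checks above can be collected into a single pass/fail sweep. A sketch; the document names follow the layout used in this guide, and the directory argument is whichever test folder applies:

```shell
# Report any missing required documents for a given test folder.
check_dr_docs() {
  local dir="$1" missing=0 f
  for f in test-plan.md findings-raw.md gap-analysis.md \
           executive-summary.md remediation-tracker.md; do
    [ -f "$dir/$f" ] || { echo "MISSING: $f"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "all required documents present"
}

# Example:
# check_dr_docs "/tests/dr-$(date +%Y-%m-%d)"
```
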

Remediation verification:

  • All critical and high severity findings have assigned owners
  • Remediation tickets created in tracking system
  • Due dates set within policy timeframes (critical: 7 days, high: 30 days)
  • Validation criteria defined for each remediation

Reporting verification:

  • Executive summary delivered within 5 working days
  • Leadership briefing scheduled within 10 working days
  • BCDR plan update scheduled if material findings

Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| Key participant unavailable day of test | Insufficient lead time, calendar conflicts | Require 4-week minimum notice; identify and brief deputies; postpone if no qualified substitute |
| Test environment not isolated | Network configuration error, shared resources | Verify isolation before test start; use dedicated test subscription; postpone if isolation cannot be confirmed |
| Restore takes significantly longer than estimate | Database growth, backup method change, infrastructure undersized | Document actual time; update estimates; investigate cause post-test; resize DR infrastructure |
| Application fails to start after restore | Configuration drift, dependency changes, missing components | Document error details; troubleshoot without time pressure; include config validation in DR procedures |
| Cannot access DR environment | Expired credentials, network path blocked, DNS not configured | Verify DR environment access monthly; include access check in test prerequisites |
| Participants treat exercise as box-ticking | Poor scenario design, lack of engagement, unclear value | Use realistic scenarios from threat intelligence; include real-world injects; share findings that drove improvements |
| Test reveals fundamental plan inadequacy | Plan not maintained, infrastructure changed, assumptions invalid | Treat as critical finding; halt test if continuing would provide no value; schedule plan review before next test |
| Full failover causes production impact | Insufficient isolation, shared dependencies, unexpected coupling | Implement kill switch; document shared dependencies; increase isolation; delay full failover until resolved |
| Findings not remediated before next test | Competing priorities, unclear ownership, underestimated effort | Escalate to leadership; include remediation status in test report; consider reducing test scope until backlog cleared |
| Observer cannot keep pace with activity | Too few observers, unclear observation scope, poor documentation template | Assign multiple observers with divided focus; provide structured templates; record session for later review |
| Business owners cannot validate recovery | Insufficient test data, unfamiliar test environment, unclear validation criteria | Pre-populate test environment with recognisable sample data; brief business owners before test; define specific validation checks |
| Vendor support unavailable during test | Vendor not notified, support hours mismatch, contract limitations | Include vendor notification in test plan; verify support availability; consider vendor participation for critical systems |
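The monthly DR access check recommended for the "Cannot access DR environment" symptom can be scripted. A sketch; the hostname and port are placeholders for your own DR entry points, and the TCP probe relies on bash's /dev/tcp feature:

```shell
# Verify a DR endpoint resolves in DNS and accepts TCP connections.
check_dr_access() {
  local host="$1" port="${2:-443}"
  getent hosts "$host" > /dev/null \
    || { echo "DNS FAIL: $host does not resolve"; return 1; }
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" \
    || { echo "TCP FAIL: $host:$port unreachable"; return 1; }
  echo "OK: $host:$port reachable"
}

# Example (placeholder hostname):
# check_dr_access dr-erp.internal 443
```

Run from a scheduled job, a non-zero exit flags the failure mode before a real test (or incident) discovers it.
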

Full failover risks

Full failover tests carry inherent risk of production impact. Never conduct full failover without explicit management authorisation, verified rollback procedures, and communication to all affected stakeholders. Schedule full failovers during low-usage periods with extended maintenance windows.

Scheduling guidance

+----------------------------------------------------------------+
|                    ANNUAL DR TEST CALENDAR                     |
+----------------------------------------------------------------+
|                                                                |
|    JAN       FEB       MAR       APR       MAY       JUN       |
|     |         |         |         |         |         |        |
|     v                   v                             v        |
|   [TT]                [SIM]                         [TT]       |
|                                                                |
|    JUL       AUG       SEP       OCT       NOV       DEC       |
|     |         |         |         |         |         |        |
|                         v         v                            |
|                       [TT]     [FULL]                          |
|                                                                |
+----------------------------------------------------------------+
|                                                                |
|  TT   = Tabletop exercise (2-3 hours, quarterly)               |
|  SIM  = Simulation test (1-2 days, biannual)                   |
|  FULL = Full failover (2-5 days, annual)                       |
|                                                                |
|  Schedule full failover:                                       |
|  - After Q3 to allow Q4 remediation                            |
|  - Avoid year-end, audit periods, major programme delivery     |
|  - Align with maintenance windows                              |
|                                                                |
+----------------------------------------------------------------+

Adjust frequency based on:

  • Regulatory requirements (some mandate specific test frequencies)
  • System criticality (more critical systems warrant more frequent testing)
  • Change rate (rapidly changing systems need more frequent validation)
  • Previous test results (failed tests warrant accelerated retesting)
  • Risk tolerance (lower tolerance requires more frequent assurance)

See also