DR Testing
Disaster recovery testing validates that documented recovery procedures work as intended and that recovery time objectives (RTO) and recovery point objectives (RPO) are achievable under realistic conditions. Testing ranges from discussion-based tabletop exercises that cost nothing but participant time, through technical simulations that verify individual recovery procedures, to full failover tests that prove end-to-end recovery capability. Perform DR testing at least annually for critical systems, with tabletop exercises quarterly and technical validation after any significant infrastructure change.
Prerequisites
| Requirement | Detail |
|---|---|
| BCDR plan | Current business continuity and disaster recovery plan documenting recovery procedures |
| Test schedule | Approved annual test calendar with dates, scope, and participants |
| Participant availability | Confirmed availability of key personnel for test duration |
| Test environment | Isolated environment for technical tests (not production) |
| Success criteria | Documented RTO/RPO targets for systems under test |
| Management approval | Written authorisation for test scope, especially for full failover tests |
| Communication plan | Stakeholder notification procedures to prevent confusion with real incidents |
Verify BCDR plan currency before scheduling tests. Plans older than 12 months or predating significant infrastructure changes require review first:
```shell
# Check BCDR plan metadata
cat /docs/bcdr/plan-metadata.json | jq '.last_review, .infrastructure_baseline_date'

# Expected output shows dates within 12 months
# "last_review": "2024-09-15"
# "infrastructure_baseline_date": "2024-08-01"
```
Confirm participant availability through calendar holds at least four weeks before tabletop exercises and eight weeks before technical tests. DR testing without key personnel invalidates results and wastes resources.
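The 12-month currency rule can also be checked in a script rather than by eye. A minimal sketch, assuming GNU `date` and a hypothetical `plan_is_stale` helper (not part of any existing tooling):

```shell
# Flag a BCDR plan as stale when its last review is more than 365 days old.
# Dates are ISO (YYYY-MM-DD); GNU date assumed for the -d flag.
plan_is_stale() {
  last_review="$1"; today="$2"
  review_s=$(date -d "$last_review" +%s)
  today_s=$(date -d "$today" +%s)
  age_days=$(( (today_s - review_s) / 86400 ))
  if [ "$age_days" -gt 365 ]; then echo stale; else echo current; fi
}

plan_is_stale 2024-09-15 2024-11-16   # reviewed within 12 months -> current
plan_is_stale 2023-05-01 2024-11-16   # older than 12 months -> stale
```

A check like this can run as part of the scheduling workflow, so a stale plan blocks test scheduling automatically rather than relying on manual review.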
Test types
DR testing encompasses four distinct approaches, each serving different validation purposes. The progression from tabletop to full failover represents increasing realism, cost, and risk.
```
+------------------------------------------------------------------+
|                   DR TEST TYPE PROGRESSION                       |
+------------------------------------------------------------------+
|                                                                  |
|  TABLETOP              WALKTHROUGH            SIMULATION         |
|  (Discussion)          (Procedural)           (Technical)        |
|                                                                  |
|  +-------------+       +-------------+       +-------------+    |
|  |             |       |             |       |             |    |
|  | Scenario    |       | Step-by-    |       | Execute     |    |
|  | discussion  +------>| step review +------>| procedures  |    |
|  | No systems  |       | No systems  |       | Test env    |    |
|  |             |       |             |       |             |    |
|  +-------------+       +-------------+       +-------------+    |
|                                                                  |
|  Cost: Low             Cost: Low             Cost: Medium        |
|  Risk: None            Risk: None            Risk: Low           |
|  Duration: 2-4h        Duration: 4-8h        Duration: 1-2 days  |
|  Frequency: Quarterly  Frequency: Biannual   Frequency: Annual   |
|                                                                  |
+------------------------------------------------------------------+
|                                                                  |
|                       FULL FAILOVER                              |
|                       (Production)                               |
|                                                                  |
|                       +-------------+                            |
|                       |             |                            |
|                       | Actual      |                            |
|                       | failover    |                            |
|                       | Production  |                            |
|                       |             |                            |
|                       +-------------+                            |
|                                                                  |
|                       Cost: High                                 |
|                       Risk: Medium-High                          |
|                       Duration: 2-5 days                         |
|                       Frequency: Annual (critical systems)       |
|                                                                  |
+------------------------------------------------------------------+
```
Tabletop exercises gather stakeholders to walk through a disaster scenario verbally, testing decision-making, communication chains, and plan comprehension without touching any systems. A facilitator presents an evolving scenario while participants describe their responses. Tabletops expose gaps in understanding, unclear responsibilities, and missing procedures at minimal cost.
Walkthrough tests extend tabletops by having participants physically trace through documented procedures step by step, identifying missing steps, outdated references, and impractical sequences. Walkthroughs validate documentation accuracy without executing actual recovery actions.
Simulation tests execute recovery procedures against test environments, validating technical capability without risking production systems. Simulations verify backup restoration times, application startup sequences, and data integrity. Results provide measured RTO/RPO achievement against test data.
Full failover tests perform actual disaster recovery by failing over production systems to secondary infrastructure. Full failovers provide the only definitive validation that recovery works under real conditions but carry service disruption risk and require extensive planning.
Procedure
Phase 1: Test planning
Select test type and scope based on time since last test and recent changes:
| Last test type | Time elapsed | Recommended next test |
|---|---|---|
| Full failover | < 6 months | Tabletop for new scenarios |
| Full failover | 6-12 months | Simulation of changed systems |
| Full failover | > 12 months | Full failover |
| Simulation | < 3 months | Tabletop |
| Simulation | 3-6 months | Simulation of different systems |
| Simulation | > 6 months | Full failover if critical |
| Tabletop only | Any | Simulation minimum |

Define test scenario with specific parameters. Effective scenarios include:
- Scenario trigger (ransomware detected at 14:00 Tuesday, datacentre fire overnight, cloud region outage during peak hours)
- Systems affected (primary database server, entire Azure West Europe region, all on-premises infrastructure)
- Data state (last backup 4 hours old, replication lag of 15 minutes, offline for 6 hours before detection)
- Constraints (key staff on leave, weekend timing, concurrent programme delivery)
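The test-selection table above is simple enough to encode as a helper, which keeps selection consistent across teams. A sketch with a hypothetical `next_test` function (last test type and months elapsed as inputs):

```shell
# Encode the selection table: given the last test type (full | simulation |
# tabletop) and whole months elapsed, print the recommended next test.
next_test() {
  last="$1"; months="$2"
  case "$last" in
    full)
      if [ "$months" -lt 6 ]; then echo "Tabletop for new scenarios"
      elif [ "$months" -le 12 ]; then echo "Simulation of changed systems"
      else echo "Full failover"; fi ;;
    simulation)
      if [ "$months" -lt 3 ]; then echo "Tabletop"
      elif [ "$months" -le 6 ]; then echo "Simulation of different systems"
      else echo "Full failover if critical"; fi ;;
    *)  # tabletop-only history: escalate to at least a simulation
      echo "Simulation minimum" ;;
  esac
}

next_test full 8        # -> Simulation of changed systems
next_test tabletop 4    # -> Simulation minimum
```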
Document scenario in test plan:
```
SCENARIO: Regional cloud outage

Trigger: Azure West Europe region becomes unavailable at 09:30 Monday
due to provider infrastructure failure

Affected systems:
- Primary ERP instance (Azure West Europe)
- Document management (SharePoint, same region)
- Email (Exchange Online, multi-region but degraded)

Data state:
- Database replication to North Europe: 5-minute lag
- Document sync to North Europe: 30-minute lag
- Email: No data loss expected

Constraints:
- Finance month-end processing in progress
- IT Manager travelling (available by phone only)
- Donor site visit scheduled for 14:00

Success criteria:
- ERP operational in North Europe within 2 hours (RTO)
- Data loss under 15 minutes (RPO)
- User communication within 30 minutes
```
Identify participants by role:
Role Responsibility Required for Test coordinator Overall facilitation, timekeeping, documentation All tests Technical lead System recovery execution, technical decisions Simulation, Full Business owner Priority decisions, user communication All tests IT Manager Resource allocation, escalation All tests Communications lead Stakeholder updates Tabletop, Full Observer Documentation, timing measurements All tests External vendor Vendor-specific recovery support As needed Schedule test with buffer time. Technical tests frequently exceed planned duration:
| Test type | Planned duration | Schedule buffer | Total calendar block |
|---|---|---|---|
| Tabletop | 2 hours | 1 hour | 3 hours |
| Walkthrough | 4 hours | 2 hours | 6 hours |
| Simulation | 8 hours | 4 hours | 12 hours |
| Full failover | 16 hours | 8 hours | 24 hours |

Prepare test environment for simulation and full failover tests:
```shell
# For simulation tests: verify test environment isolation
# Test environment should not connect to production data sources

# Check network isolation
nmap -sP 10.0.0.0/24   # Production network
# Expected: No hosts reachable from test environment

# Verify test database is a snapshot, not a live replica
psql -h testdb.internal -c "SELECT pg_is_in_recovery();"
# Expected: f (false, indicating standalone not replica)

# Confirm test environment uses separate identity
az account show --query "{sub:name, tenant:tenantId}"
# Expected: Test subscription, not production
```
Distribute pre-reading to participants 5 working days before test:
- Current BCDR plan sections relevant to scenario
- System documentation for affected systems
- Contact lists and escalation procedures
- Previous test reports for context
Phase 2: Tabletop exercise execution
Open the exercise with ground rules (15 minutes):
- This is a learning exercise, not a performance evaluation
- No actual systems will be touched
- Respond as you would in a real incident
- Note uncertainties for later investigation rather than inventing answers
- Facilitator will inject new information as scenario evolves
Present initial scenario and gather immediate responses (30 minutes):
Facilitator reads scenario trigger. Each participant states:
- Their immediate actions in first 15 minutes
- Who they would contact and how
- What information they need
- What decisions require escalation
Document responses without judgement. Note gaps, conflicts, and uncertainties.
Inject scenario developments (60-90 minutes):
Advance scenario through 3-5 injects spaced 15-20 minutes apart. Each inject adds complexity:
INJECT 1 (T+30 min): Users report email working intermittently. Finance team cannot access ERP for month-end processing. CEO asks for status update.
INJECT 2 (T+60 min): Azure status page confirms regional outage, estimated resolution 4-6 hours. Partner organisation calls asking if shared data is affected. IT discovers DR runbook references server names that changed 6 months ago.
INJECT 3 (T+90 min): North Europe failover initiated but database restore shows corruption in last 3 backups. Must restore from backup 18 hours old. Finance asks about data loss implications.
INJECT 4 (T+120 min): Primary region begins recovery. Must decide whether to continue failover or wait for primary. Users complain about conflicting instructions from IT and management.

After each inject, participants describe their responses, decisions, and communications.
Conduct hotwash immediately after scenario conclusion (30 minutes):
Three questions for each participant:
- What worked well in our response?
- What would you do differently?
- What gaps or issues did we discover?
Document all items raised without filtering. Categorisation comes later.
Phase 3: Technical test execution
For simulation and full failover tests, execute recovery procedures while measuring actual performance against objectives.
- Confirm pre-test checklist:
```
PRE-TEST VERIFICATION

[ ] Test environment isolated and ready
[ ] All participants present or dialled in
[ ] Communication channels established (separate from systems under test)
[ ] Monitoring and timing tools ready
[ ] Rollback procedures confirmed
[ ] Production notification sent (if full failover)
[ ] Management authorisation confirmed for scope
[ ] Observer assigned with stopwatch and checklist
```
- Execute failover sequence while observer records timings:
TIMING LOG TEMPLATE

Test: ERP Failover Simulation
Date: 2024-11-16
Start time: 09:00

| Step | Planned | Actual | Variance | Notes |
|------|---------|--------|----------|-------|
| Declare DR event | 00:00 | 09:00 | - | Test start |
| Assess impact | 00:15 | 09:18 | +3 min | |
| Notify stakeholders | 00:30 | 09:35 | +5 min | Email template missing |
| Begin failover | 00:45 | 09:52 | +7 min | Approval delayed |
| Database restore | 01:30 | 10:58 | +28 min | Larger than expected |
| App tier startup | 02:00 | 11:35 | +35 min | Config file error |
| Connectivity test | 02:15 | 11:52 | +37 min | |
| User validation | 02:30 | 12:05 | +35 min | |
| DR complete | 02:30 | 12:05 | +35 min | RTO: 3h05m vs 2h target |

- Validate data integrity after recovery:
```shell
# Compare record counts between backup and restored database
psql -h dr-db.internal -c "SELECT COUNT(*) FROM transactions WHERE created_at < '2024-11-16 09:00:00';"

# Verify against pre-failover count (from test documentation)
# Expected: 147,832 (matches baseline)
# If mismatch, calculate data loss:
# (Baseline - Restored) / Baseline * 100 = % data loss

# Check most recent record timestamp
psql -h dr-db.internal -c "SELECT MAX(created_at) FROM transactions;"
# Compare to scenario data state to confirm RPO achievement
```
Test critical business functions with business owners:
Finance team validates:
- Can access system
- Can view recent transactions
- Can create new transactions
- Reports generate correctly
- Data appears correct (spot check 5-10 known records)
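The data-loss formula from the integrity check can be computed directly at the console rather than by hand. A sketch using `awk` for the floating-point division; the `restored` count is a hypothetical example value:

```shell
# Percentage data loss = (Baseline - Restored) / Baseline * 100
baseline=147832     # pre-failover count from test documentation
restored=147790     # hypothetical count returned by the restored database
awk -v b="$baseline" -v r="$restored" \
  'BEGIN { printf "data loss: %.3f%%\n", (b - r) / b * 100 }'
# -> data loss: 0.028%
```

Recording the computed percentage alongside the timestamp gap gives both halves of the RPO evidence: how many records were lost and how much time they represent.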
For full failover tests, operate from DR environment for minimum 4 hours to validate stability. Monitor for:
- Performance degradation
- Integration failures
- Capacity issues
- User-reported problems
Execute failback procedures and confirm production restoration:
```shell
# Post-failback validation
# Compare production database to DR state
psql -h prod-db.internal -c "SELECT COUNT(*) FROM transactions;"
# Should equal or exceed DR count (new transactions during DR)

# Verify no replication lag
psql -h prod-db.internal -c "SELECT * FROM pg_stat_replication;"
# Expected: sent_lsn = write_lsn = flush_lsn
```
Phase 4: Findings documentation
- Compile raw findings within 48 hours while memory is fresh:
RAW FINDINGS LOG
Test: Q4 2024 ERP DR Simulation
Date: 2024-11-16

OBSERVATIONS (factual, no judgement):
1. Database restore took 58 minutes vs 45 minute estimate
2. Application configuration file referenced old server name
3. Notification email template not found in documented location
4. Finance business owner unavailable; deputy authorised transactions
5. Network team not included in initial notification
6. DR runbook page 12 references deprecated Azure portal UI
7. Test database 40% larger than sizing estimate
8. SSL certificate on DR load balancer expired 2024-10-01
9. Backup verification job last ran 2024-09-15
10. User acceptance took 20 minutes vs 15 minute estimate

Classify findings by type and severity:
| Finding | Type | Severity | RTO/RPO impact |
|---|---|---|---|
| Database restore time | Performance | Medium | +13 min to RTO |
| Config file error | Documentation | High | +25 min to RTO |
| Missing email template | Documentation | Low | +5 min to RTO |
| Business owner absence | Process | Medium | Acceptable with deputy |
| Network team notification | Process | Low | No direct impact |
| Deprecated UI references | Documentation | Low | Confusion, no delay |
| Database sizing | Planning | Medium | Future risk |
| Expired certificate | Technical | Critical | Would have blocked failover |
| Backup verification gap | Process | High | RPO risk |
| User acceptance time | Estimate | Low | Minor variance |

Calculate achieved vs target metrics:
METRICS SUMMARY
Recovery Time Objective (RTO)
Target: 2 hours
Achieved: 3 hours 5 minutes
Status: FAILED (exceeded by 54%)

Recovery Point Objective (RPO)
Target: 15 minutes data loss
Achieved: 8 minutes data loss
Status: PASSED

Communication SLA
Target: Initial notification within 30 minutes
Achieved: 35 minutes
Status: FAILED (exceeded by 17%)

User Validation
Target: Business owner sign-off within 30 minutes of recovery
Achieved: 20 minutes
Status: PASSED

Phase 5: Gap analysis and remediation
- Map findings to root causes:
```
+-------------------------------------------------------------------+
|                     GAP ANALYSIS MATRIX                           |
+-------------------------------------------------------------------+
|                                                                   |
| Finding               Root Cause             Category             |
| ----------------------------------------------------------------- |
| Config file error     No change control      Process gap          |
|                       for DR configs                              |
|                                                                   |
| Database restore      Growth not tracked     Capacity planning    |
| slow                  for DR sizing                               |
|                                                                   |
| Expired certificate   No DR certificate      Maintenance gap      |
|                       in renewal scope                            |
|                                                                   |
| Missing template      Documentation not      Documentation gap    |
|                       version controlled                          |
|                                                                   |
| Backup verification   Job disabled after     Process gap          |
| gap                   false positives                             |
|                                                                   |
+-------------------------------------------------------------------+
```
Prioritise remediations using a risk-based approach:
| Remediation | Effort | Risk reduction | Priority |
|---|---|---|---|
| Renew DR certificates | Low (2 hours) | Critical (blocks recovery) | Immediate |
| Re-enable backup verification | Low (1 hour) | High (RPO risk) | Immediate |
| Update DR config management | Medium (2 days) | High (RTO impact) | 30 days |
| Resize DR database | Medium (4 hours) | Medium (performance) | 30 days |
| Version control DR docs | High (5 days) | Medium (efficiency) | 60 days |
| Update runbook screenshots | Low (3 hours) | Low (confusion) | 90 days |

Create remediation tickets with specific acceptance criteria:
```
TICKET: DR-2024-001
Title: Renew DR environment SSL certificates
Priority: Critical
Owner: Infrastructure Team
Due: 2024-11-23

Description: DR testing revealed expired SSL certificate on DR load
balancer (expired 2024-10-01). Failover would have failed at user
connectivity step.

Acceptance criteria:
- DR load balancer certificate renewed (valid 12+ months)
- Certificate added to renewal monitoring
- DR certificate inventory documented
- Verification: curl -v https://dr-erp.internal shows valid cert
```
Schedule remediation validation:
After remediations complete, validate fixes before next scheduled test:
```shell
# Verify certificate remediation
echo | openssl s_client -servername dr-erp.internal \
  -connect dr-erp.internal:443 2>/dev/null | \
  openssl x509 -noout -dates
# Expected: notAfter at least 12 months in the future

# Verify backup verification job
cat /var/log/backup-verify/latest.log | tail -20
# Expected: Recent successful verification run

# Verify DR config matches production
diff /etc/app/config.prod.yaml /etc/app/config.dr.yaml
# Expected: Only environment-specific differences
```
Phase 6: Reporting
- Prepare executive summary within 5 working days:
DR TEST EXECUTIVE SUMMARY
Test: Q4 2024 ERP Disaster Recovery Simulation
Date: 16 November 2024
Classification: PARTIALLY SUCCESSFUL

Objectives tested:
- Failover to North Europe region
- Database restoration from replication
- Application recovery and user validation

Results:
- RTO: FAILED (3h05m vs 2h target, 54% over)
- RPO: PASSED (8 min vs 15 min target)
- Critical blocker found: Expired SSL certificate

Key findings requiring action:
1. [CRITICAL] DR certificates expired - remediated 2024-11-18
2. [HIGH] DR configuration management gap - remediation in progress
3. [HIGH] Backup verification disabled - remediated 2024-11-17

Recommendation: Repeat simulation test in Q1 2025 after remediations complete to validate RTO achievement before annual full failover.

Next scheduled test: Tabletop exercise, January 2025

Present findings to leadership within 10 working days:
Focus presentation on:
- Pass/fail status against objectives
- Business risk implications of gaps
- Remediation status and timeline
- Resource requirements for fixes
- Recommended test frequency adjustments
Update BCDR plan with lessons learned:
Incorporate test findings into plan updates:
- Corrected procedures based on walkthrough findings
- Updated time estimates based on measured performance
- New decision points identified during tabletop
- Revised contact information and escalation paths
Archive test documentation:
Retain for compliance and future reference:
- Test plan and scenario
- Timing logs and observations
- Findings and gap analysis
- Remediation tracker
- Executive summary
- Participant list
```shell
# Archive test documentation
mkdir -p /archive/dr-tests/2024-Q4
cp -r /tests/dr-2024-11-16/* /archive/dr-tests/2024-Q4/

# Generate archive manifest
find /archive/dr-tests/2024-Q4 -type f -exec sha256sum {} \; > \
  /archive/dr-tests/2024-Q4/manifest.sha256
```
Tabletop exercise template
The following template provides structure for conducting tabletop exercises. Adapt scenario details to organisational context and current risk profile.
DR TABLETOP EXERCISE
Exercise title: [Descriptive name, e.g., “Ransomware affecting finance systems”]
Date: [Exercise date]
Duration: [Planned duration, typically 2-3 hours]
Facilitator: [Name and role]
Participants:
| Name | Role | Contact |
|---|---|---|
SCENARIO
Background: [2-3 sentences describing normal operating context before incident]
Trigger event: [Specific incident trigger with date, time, and initial indicators]
Affected systems: [List of systems, data, and services affected]
Current state: [Data backup status, replication state, known constraints]
INJECT 1 (Present at T+0)
[Initial scenario description, 3-4 sentences]
Discussion questions:
- What are your immediate actions?
- Who do you notify and how?
- What information do you need?
- What decisions require escalation?
INJECT 2 (Present at T+20 minutes)
[Scenario development adding complexity]
Discussion questions:
- How does this change your response?
- What new resources do you need?
- Who else needs to be involved?
- What are you communicating to stakeholders?
INJECT 3 (Present at T+40 minutes)
[Further development, typically introducing a complication or difficult decision]
Discussion questions:
- What is your decision and rationale?
- What are the trade-offs?
- How do you communicate this decision?
- What could go wrong with this approach?
INJECT 4 (Present at T+60 minutes)
[Resolution phase begins, new decisions required]
Discussion questions:
- When do you declare recovery complete?
- What validation do you require?
- What post-incident actions are needed?
- What would you do differently?
HOTWASH QUESTIONS
- What worked well in our response?
- What would you do differently?
- What gaps or issues did we discover?
- What changes to plans or procedures do we need?
- What training or resources are needed?
FINDINGS LOG
| Finding | Category | Severity | Owner | Action required |
|---|---|---|---|---|
EXERCISE EVALUATION
| Criterion | Rating (1-5) | Comments |
|---|---|---|
| Scenario realism | | |
| Participant engagement | | |
| Plan effectiveness | | |
| Communication clarity | | |
| Decision-making quality | | |
Overall assessment: [Summary paragraph]
Recommended next test: [Type and timeframe]
Verification
Confirm test completion and value delivery through these checks:
Test execution verification:
- All planned scenario phases completed
- Required participants present throughout
- Timing measurements recorded (for technical tests)
- Observer notes captured
Documentation verification:
```shell
# Verify required documents exist
ls -la /tests/dr-$(date +%Y-%m-%d)/
# Expected files:
# - test-plan.md
# - timing-log.csv (simulation/failover only)
# - findings-raw.md
# - gap-analysis.md
# - executive-summary.md
# - remediation-tracker.md

# Verify findings logged
wc -l /tests/dr-$(date +%Y-%m-%d)/findings-raw.md
# Expected: Minimum 10 findings for meaningful test
```
Remediation verification:
- All critical and high severity findings have assigned owners
- Remediation tickets created in tracking system
- Due dates set within policy timeframes (critical: 7 days, high: 30 days)
- Validation criteria defined for each remediation
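The due-date policy above (critical: 7 days, high: 30 days) can be applied mechanically when creating tickets. A sketch assuming GNU `date` and a hypothetical `due_date` helper; the 60-day fallback for lower severities is an assumption, not stated policy:

```shell
# Derive a remediation due date from finding severity and discovery date.
# Policy: critical = 7 days, high = 30 days; 60 days assumed otherwise.
due_date() {
  sev="$1"; found="$2"
  case "$sev" in
    critical) days=7 ;;
    high)     days=30 ;;
    *)        days=60 ;;
  esac
  date -d "$found + $days days" +%Y-%m-%d
}

due_date critical 2024-11-16   # -> 2024-11-23 (matches ticket DR-2024-001)
due_date high 2024-11-16       # -> 2024-12-16
```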
Reporting verification:
- Executive summary delivered within 5 working days
- Leadership briefing scheduled within 10 working days
- BCDR plan update scheduled if material findings
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Key participant unavailable day of test | Insufficient lead time, calendar conflicts | Require 4-week minimum notice; identify and brief deputies; postpone if no qualified substitute |
| Test environment not isolated | Network configuration error, shared resources | Verify isolation before test start; use dedicated test subscription; postpone if isolation cannot be confirmed |
| Restore takes significantly longer than estimate | Database growth, backup method change, infrastructure undersized | Document actual time; update estimates; investigate cause post-test; resize DR infrastructure |
| Application fails to start after restore | Configuration drift, dependency changes, missing components | Document error details; troubleshoot without time pressure; include config validation in DR procedures |
| Cannot access DR environment | Expired credentials, network path blocked, DNS not configured | Verify DR environment access monthly; include access check in test prerequisites |
| Participants treat exercise as box-ticking | Poor scenario design, lack of engagement, unclear value | Use realistic scenarios from threat intelligence; include real-world injects; share findings that drove improvements |
| Test reveals fundamental plan inadequacy | Plan not maintained, infrastructure changed, assumptions invalid | Treat as critical finding; halt test if continuing would provide no value; schedule plan review before next test |
| Full failover causes production impact | Insufficient isolation, shared dependencies, unexpected coupling | Implement kill switch; document shared dependencies; increase isolation; delay full failover until resolved |
| Findings not remediated before next test | Competing priorities, unclear ownership, underestimated effort | Escalate to leadership; include remediation status in test report; consider reducing test scope until backlog cleared |
| Observer cannot keep pace with activity | Too few observers, unclear observation scope, poor documentation template | Assign multiple observers with divided focus; provide structured templates; record session for later review |
| Business owners cannot validate recovery | Insufficient test data, unfamiliar test environment, unclear validation criteria | Pre-populate test environment with recognisable sample data; brief business owners before test; define specific validation checks |
| Vendor support unavailable during test | Vendor not notified, support hours mismatch, contract limitations | Include vendor notification in test plan; verify support availability; consider vendor participation for critical systems |
Full failover risks
Full failover tests carry inherent risk of production impact. Never conduct full failover without explicit management authorisation, verified rollback procedures, and communication to all affected stakeholders. Schedule full failovers during low-usage periods with extended maintenance windows.
Scheduling guidance
```
+---------------------------------------------------------------+
|                  ANNUAL DR TEST CALENDAR                      |
+---------------------------------------------------------------+
|                                                               |
|  JAN    FEB    MAR    APR    MAY    JUN                       |
|   |      |      |      |      |      |                        |
|   v      |      v      |      |      v                        |
|  [TT]    |    [SIM]    |      |    [TT]                       |
|                                                               |
|  JUL    AUG    SEP    OCT    NOV    DEC                       |
|   |      |      |      |      |      |                        |
|   |      |      v      v      |      |                        |
|   |      |    [TT]  [FULL]    |      |                        |
|                                                               |
+---------------------------------------------------------------+
|                                                               |
| TT   = Tabletop exercise (2-3 hours, quarterly)               |
| SIM  = Simulation test (1-2 days, biannual)                   |
| FULL = Full failover (2-5 days, annual)                       |
|                                                               |
| Schedule full failover:                                       |
| - After Q3 to allow Q4 remediation                            |
| - Avoid year-end, audit periods, major programme delivery     |
| - Align with maintenance windows                              |
|                                                               |
+---------------------------------------------------------------+
```
Adjust frequency based on:
- Regulatory requirements (some mandate specific test frequencies)
- System criticality (more critical systems warrant more frequent testing)
- Change rate (rapidly changing systems need more frequent validation)
- Previous test results (failed tests warrant accelerated retesting)
- Risk tolerance (lower tolerance requires more frequent assurance)
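These adjustment factors can be captured in a simple rule of thumb. The sketch below is illustrative only: the category names and the escalation order (criticality raises annual to biannual, high change rate raises to quarterly) are assumptions, not policy from any standard:

```shell
# Illustrative frequency suggestion: start from an annual baseline and
# escalate for criticality and change rate. Categories are assumptions.
suggest_frequency() {
  criticality="$1"; change_rate="$2"
  freq="annual"
  [ "$criticality" = "critical" ] && freq="biannual"
  [ "$change_rate" = "high" ] && freq="quarterly"
  echo "$freq"
}

suggest_frequency critical low    # -> biannual
suggest_frequency standard high   # -> quarterly
```

Regulatory mandates and previous test failures should override any such heuristic.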