Cloud Failover

Cloud failover transfers workloads from a failed or degraded primary environment to a secondary environment when the primary cannot meet service requirements. This playbook covers three failover types: region failover when an entire cloud region becomes unavailable, application failover when specific services fail while infrastructure remains healthy, and DNS failover when traffic steering is required without infrastructure changes. Execute this playbook when monitoring indicates service degradation beyond defined thresholds and automated recovery has not restored service within the configured grace period.

Activation criteria

Invoke this playbook when any of the following conditions persist for the specified duration after automated recovery attempts have been exhausted.

| Failover type | Condition | Threshold | Grace period |
|---|---|---|---|
| Region | Cloud provider status page confirms regional outage | Provider-declared outage | 0 minutes |
| Region | Multiple availability zones in region unreachable | 2+ zones unavailable | 15 minutes |
| Region | Cross-zone latency exceeds baseline | Latency > 500 ms sustained | 30 minutes |
| Application | Health check failures across all instances | 100% failure rate | 5 minutes |
| Application | Error rate exceeds threshold | > 50% 5xx responses | 10 minutes |
| Application | Response time degradation | p95 latency > 10× baseline | 15 minutes |
| DNS | Primary endpoint unreachable from multiple probe locations | 3+ probe failures | 3 minutes |
| DNS | Geographic routing failure | Region-specific failures | 5 minutes |
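The application-row thresholds above can be wired into a simple activation check. This is a minimal sketch with hypothetical monitoring inputs, not part of any existing tooling:

```shell
#!/usr/bin/env bash
# Minimal activation check for the application failover criteria:
# > 50% 5xx responses sustained beyond the 10-minute grace period.
# Inputs are hypothetical values read from monitoring.

should_activate() {
  local error_rate_pct=$1   # current share of 5xx responses, e.g. 62
  local persisted_min=$2    # minutes the condition has persisted
  local threshold_pct=50    # "> 50% 5xx responses"
  local grace_min=10        # "10 minutes" grace period
  [ "$error_rate_pct" -gt "$threshold_pct" ] && [ "$persisted_min" -ge "$grace_min" ]
}

should_activate 62 12 && echo "activate playbook" || echo "hold"
```

In practice the same comparison usually lives in the monitoring system's alert rules; the point is that both the threshold and the grace period must be satisfied before this playbook is invoked.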

Automated failover precedence

If automated failover is configured and functioning, allow it to complete before manual intervention. This playbook applies when automation fails, is not configured, or when human judgment is required for complex failure scenarios.

Roles

| Role | Responsibility | Typical assignee | Backup |
|---|---|---|---|
| Incident commander | Authorises failover, coordinates communication, makes go/no-go decisions | IT Manager or designated on-call lead | Senior infrastructure engineer |
| Technical lead | Executes failover procedures, validates success, troubleshoots issues | Cloud infrastructure engineer | Platform engineer |
| Application owner | Validates application functionality post-failover, approves service restoration | Application team lead | Senior developer |
| Communications lead | Stakeholder updates, status page management, user notifications | Service desk manager | IT Manager |

Phase 1: Assessment and decision

Objective: Confirm failure conditions, determine failover type, and obtain authorisation to proceed.

Timeframe: 5-15 minutes

  1. Verify the failure is genuine and not a monitoring false positive. Check the cloud provider’s status page directly.

    Compare provider status with your own monitoring. A provider-acknowledged outage confirms regional failure. If the provider reports healthy but your monitoring shows failure, the issue is likely application-level or network path-specific.

  2. Determine failure scope by running diagnostic checks from a location outside the affected region. For Azure:

# Check resource health from Azure CLI (run from unaffected region or local machine)
az resource list --resource-group production-rg --query "[].{name:name, health:properties.healthState}" -o table
# Check VM availability
az vm get-instance-view --resource-group production-rg --name web-vm-01 --query "instanceView.statuses[?code=='PowerState/running']"

For AWS:

# Check instance status
aws ec2 describe-instance-status --region eu-west-1 --query "InstanceStatuses[*].{ID:InstanceId,State:InstanceState.Name,Status:InstanceStatus.Status}"
# Check service health events
aws health describe-events --region us-east-1 --filter "eventTypeCategories=issue"
  3. Classify the failure type using the decision tree:

Service unavailable
└─ Provider confirms regional outage?
   ├─ Yes → REGION FAILOVER
   └─ No  → Multiple apps/services affected?
            ├─ No  → APPLICATION FAILOVER
            └─ Yes → Infrastructure healthy in region?
                     ├─ Yes → DNS FAILOVER
                     └─ No  → REGION FAILOVER
  4. Calculate the impact of failover versus waiting for recovery. Failover incurs costs and risks:

    • Data synchronisation lag: Check replication status to determine potential data loss window
    • Failover execution time: 5-30 minutes depending on type
    • DNS propagation: 5-60 minutes depending on TTL settings
    • Application warm-up: Variable by application

    If estimated recovery time from the provider is less than failover execution time plus propagation time, waiting may be preferable.

  5. Obtain authorisation from the incident commander. Present:

    • Confirmed failure type
    • Estimated data loss (replication lag)
    • Estimated time to failover completion
    • Estimated time if waiting for recovery
    • Business impact of continued outage
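The classification decision tree above reduces to three yes/no questions. As a sketch (function and argument names are illustrative, not from any tooling):

```shell
#!/usr/bin/env bash
# Decision tree from the assessment phase as a function.
# Arguments are yes/no answers to the three questions, in order:
#   1) provider confirms regional outage?
#   2) multiple apps/services affected?
#   3) infrastructure healthy in region?
classify_failover() {
  local provider_outage=$1 multi_service=$2 infra_healthy=$3
  if [ "$provider_outage" = yes ]; then
    echo "REGION"
  elif [ "$multi_service" = no ]; then
    echo "APPLICATION"
  elif [ "$infra_healthy" = yes ]; then
    echo "DNS"
  else
    echo "REGION"
  fi
}

classify_failover no yes yes   # infra healthy, many services affected -> DNS
```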

Decision point: The incident commander authorises failover or decides to wait for primary recovery. Document the decision and reasoning in the incident record.

Checkpoint: Before proceeding to Phase 2, confirm:

  • Failure type is determined (region, application, or DNS)
  • Failover is authorised by incident commander
  • Decision is documented with timestamp
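The failover-versus-wait comparison from the assessment phase is simple arithmetic. A sketch with illustrative durations, all in minutes:

```shell
#!/usr/bin/env bash
# If the provider's estimated recovery is shorter than failover execution
# plus DNS propagation, waiting may be preferable. Durations in minutes;
# the example values are illustrative.
recommend() {
  local est_recovery=$1 exec_time=$2 propagation=$3
  if [ "$est_recovery" -lt $((exec_time + propagation)) ]; then
    echo "wait"
  else
    echo "failover"
  fi
}

recommend 20 15 30   # provider says ~20 min; failover would take ~45 min total
```

This only captures the time dimension; the incident commander still weighs data loss and business impact alongside it.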

Phase 2: Pre-failover preparation

Objective: Verify secondary environment readiness and prepare for failover execution.

Timeframe: 5-20 minutes

  1. Verify the secondary environment is healthy and ready to receive traffic. Run health checks against the secondary region or standby instances:
# Azure - Check secondary region resources
az resource list --resource-group production-dr-rg --location northeurope --query "[].{name:name, provisioningState:provisioningState}" -o table
# AWS - Check standby instances in DR region
aws ec2 describe-instances --region eu-west-2 --filters "Name=tag:Environment,Values=dr" --query "Reservations[*].Instances[*].{ID:InstanceId,State:State.Name}"
  2. Check data replication status to determine the Recovery Point Objective (RPO) exposure. The replication lag indicates how much data could be lost.

    For Azure SQL with geo-replication:

-- Run on the primary if it is still reachable; replication_lag_sec is
-- reported there. On the secondary, check last_replication instead.
SELECT
    partner_server,
    partner_database,
    replication_state_desc,
    replication_lag_sec
FROM sys.dm_geo_replication_link_status;

For AWS RDS with read replicas:

aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=production-replica \
--start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Average

Record the replication lag. If lag exceeds acceptable RPO (typically defined in the BCDR plan), notify the incident commander before proceeding.

  3. For region failover, verify network connectivity to the secondary region from client locations. Run traceroute and latency tests from representative locations:
# Test connectivity to secondary load balancer
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://dr.example.org/health
  4. Scale the secondary environment if running in reduced capacity. Many DR configurations run secondary at reduced scale to minimise costs:
# Azure - Scale up App Service plan
az appservice plan update --name dr-plan --resource-group production-dr-rg --sku P2v2
# AWS - Update Auto Scaling group desired capacity
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name dr-web-asg \
--desired-capacity 4 \
--min-size 2

Wait for scaling to complete before proceeding. Monitor instance health as new capacity comes online.

  5. Prepare DNS changes but do not execute them yet. Identify the records that require modification:
# List current DNS records
dig +short production.example.org
dig +short production.example.org CNAME
# Note current values for rollback
# Production: 203.0.113.10 (eu-west-1)
# DR target: 203.0.113.20 (eu-west-2)
  6. Notify the application owner that failover is imminent. They should prepare for post-failover validation.
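The replication-lag reading taken earlier in this phase feeds a go/no-go comparison against the RPO from the BCDR plan. A minimal sketch (values are hypothetical):

```shell
#!/usr/bin/env bash
# Compare measured replication lag against the acceptable RPO.
# Both values in seconds; the example inputs are hypothetical.
rpo_check() {
  local lag_seconds=$1 rpo_seconds=$2
  if [ "$lag_seconds" -le "$rpo_seconds" ]; then
    echo "within RPO"
  else
    echo "exceeds RPO - escalate to incident commander"
  fi
}

rpo_check 45 300   # 45 s of lag against a 5-minute RPO
```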

Checkpoint: Before proceeding to Phase 3, confirm:

  • Secondary environment health verified
  • Replication lag recorded and acceptable
  • Secondary environment scaled to production capacity
  • DNS change prepared
  • Application owner notified
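The scale-up wait in this phase can be automated with a generic poll-until-healthy helper. A sketch in which the probe command and endpoint are illustrative:

```shell
#!/usr/bin/env bash
# Retry a health check until it succeeds or the timeout elapses.
# The check command (e.g. a curl probe) is passed as trailing arguments.
wait_for_healthy() {
  local timeout_s=$1 interval_s=$2; shift 2
  local elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@"; then return 0; fi       # probe succeeded
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 1                            # timed out
}

# Example: wait up to 300 s, probing every 10 s (endpoint is illustrative):
# wait_for_healthy 300 10 curl -fsS -o /dev/null https://dr.example.org/health
```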

Phase 3: Failover execution

Objective: Execute the failover and redirect traffic to the secondary environment.

Timeframe: 5-30 minutes depending on failover type

Execute the section matching the failover type determined in Phase 1.

Region failover

Region failover redirects all traffic from a failed region to a healthy secondary region. This is the most comprehensive failover type and affects all services deployed in the primary region.

  1. If using Azure Traffic Manager or AWS Route 53 health checks with automatic failover, verify the automatic failover has triggered. If not, force the failover:

    For Azure Traffic Manager:

# Disable the primary endpoint to force failover
az network traffic-manager endpoint update \
--resource-group dns-rg \
--profile-name production-tm \
--name primary-endpoint \
--type azureEndpoints \
--endpoint-status Disabled

For AWS Route 53:

# Update health check to force failover (set to always unhealthy)
aws route53 update-health-check \
--health-check-id abc123-health-check-id \
--inverted
  2. For database failover with Azure SQL geo-replication, initiate forced failover (accepts potential data loss):
# Forced failover - use when primary is unreachable
az sql db replica set-primary \
--resource-group production-dr-rg \
--server dr-sql-server \
--name production-db \
--allow-data-loss

For AWS RDS Multi-AZ, the failover is automatic. For cross-region read replica promotion:

# Promote read replica to standalone (irreversible)
aws rds promote-read-replica \
--db-instance-identifier production-replica \
--backup-retention-period 7

Database promotion is irreversible

Promoting a read replica breaks replication permanently. After promotion, you must reconfigure replication from the new primary. Ensure this action is authorised and documented.

  3. Update application configuration to point to the new database endpoint if not using DNS-based database endpoints:
# Update environment variable or configuration
# Azure App Service
az webapp config appsettings set \
--resource-group production-dr-rg \
--name dr-webapp \
--settings DATABASE_HOST=dr-sql-server.database.windows.net
# Restart application to pick up new configuration
az webapp restart --resource-group production-dr-rg --name dr-webapp
  4. Verify the secondary application instances are serving traffic correctly by making direct requests bypassing DNS:
# Direct request to secondary load balancer IP
curl -H "Host: production.example.org" https://203.0.113.20/health
# Verify response indicates healthy secondary
  5. Update public DNS to point to the secondary environment. The method depends on your DNS provider:

    For Cloudflare:

curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records/{record_id}" \
-H "Authorization: Bearer {api_token}" \
-H "Content-Type: application/json" \
--data '{"content":"203.0.113.20"}'

For Route 53 (if not using health-check-based failover):

aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "production.example.org",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.20"}]
}
}]
}'
  6. Monitor DNS propagation. Changes propagate based on TTL settings. If TTL was 300 seconds (5 minutes), most resolvers will pick up the change within 10 minutes. Check propagation from multiple locations:
# Check from different DNS resolvers
dig @8.8.8.8 production.example.org +short
dig @1.1.1.1 production.example.org +short
dig @208.67.222.222 production.example.org +short
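The resolver checks above can be looped until every resolver returns the DR address. A sketch in which the lookup is a separate function so it can be stubbed or swapped; the hostname and addresses mirror the examples above:

```shell
#!/usr/bin/env bash
# Return the A record for the production name from a given resolver.
# Kept as a function so tests or other tools can replace it.
lookup() { dig +short @"$1" production.example.org; }

# Succeed only when every listed resolver already returns the DR address.
propagated() {
  local target_ip=$1; shift
  local resolver
  for resolver in "$@"; do
    [ "$(lookup "$resolver")" = "$target_ip" ] || return 1
  done
  return 0
}

# Example usage:
# propagated 203.0.113.20 8.8.8.8 1.1.1.1 208.67.222.222 && echo "propagated"
```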

Application failover

Application failover redirects traffic for a specific application while leaving other regional infrastructure intact. Use this when a single application or service fails but the underlying infrastructure remains healthy.

  1. Identify the failing application components and their standby counterparts:
# List current application instances
kubectl get pods -n production -l app=web-frontend
# Check standby deployment status
kubectl get pods -n dr -l app=web-frontend
  2. Scale up the standby application deployment to match production capacity:
# Scale standby deployment
kubectl scale deployment web-frontend -n dr --replicas=4
# Wait for pods to be ready
kubectl rollout status deployment/web-frontend -n dr --timeout=300s
  3. Update the service routing to direct traffic to standby instances. For Kubernetes with service mesh:
# Update Istio VirtualService to route to DR
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: web-frontend
namespace: production
spec:
hosts:
- web-frontend
http:
- route:
- destination:
host: web-frontend.dr.svc.cluster.local
weight: 100
EOF

For load balancer-based routing:

# Remove primary backend from load balancer pool
az network lb address-pool address remove \
--resource-group production-rg \
--lb-name production-lb \
--pool-name backend-pool \
--name primary-backend
# Add DR backend to load balancer pool
az network lb address-pool address add \
--resource-group production-rg \
--lb-name production-lb \
--pool-name backend-pool \
--name dr-backend \
--ip-address 10.1.2.10
  4. Verify traffic is flowing to the standby application:
# Check request distribution
kubectl logs -n dr -l app=web-frontend --tail=10 | grep "GET /"
# Verify metrics show traffic on standby
curl -s http://standby-prometheus:9090/api/v1/query?query=http_requests_total | jq '.data.result'
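As an alternative to the immediate 100% cutover shown in the VirtualService example, some teams shift traffic in stages while watching error rates. A sketch of a staged weight schedule (percentages are illustrative):

```shell
#!/usr/bin/env bash
# Emit a staged (dr, primary) weight schedule instead of an immediate
# 100% cutover. Each pair would be applied by re-applying the
# VirtualService with updated weights, pausing to watch error rates.
canary_weights() {
  local w
  for w in 10 25 50 100; do
    echo "dr=${w} primary=$((100 - w))"
  done
}

canary_weights
```

Staged shifting trades speed for safety; during a hard outage the immediate cutover above is usually the right call.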

DNS failover

DNS failover redirects traffic at the DNS layer without modifying infrastructure. Use this when you need rapid traffic steering or when infrastructure is healthy but network paths are degraded.

  1. Verify the target endpoint is healthy before redirecting traffic:
# Health check against secondary endpoint
curl -w "%{http_code}" -o /dev/null -s https://secondary.example.org/health
# Expected: 200
  2. Update DNS records. For weighted or failover record sets, adjust weights or failover status:
# Route 53 - Switch primary/secondary failover
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "production.example.org",
"Type": "A",
"SetIdentifier": "secondary",
"Failover": "PRIMARY",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.20"}]
}
}]
}'

For simple record update:

# Update A record to point to secondary
# Record current value first for rollback
dig +short production.example.org > /tmp/dns-rollback-$(date +%s).txt
# Apply change via DNS provider API
  3. Flush DNS caches on critical systems if immediate propagation is required:
# Windows
ipconfig /flushdns
# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux (systemd-resolved; older releases use `systemd-resolve --flush-caches`)
sudo resolvectl flush-caches
  4. Monitor propagation using external DNS checking services. DNS changes follow this approximate timeline based on TTL:

| TTL | Approximate full propagation |
|---|---|
| 60 s | ~2-5 minutes |
| 300 s | ~10-15 minutes |
| 3600 s | ~60-90 minutes |
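A rough worst-case rule of thumb behind this timeline: old records can be cached for up to one full TTL after the change, plus a few minutes of resolver slack. As a sketch:

```shell
#!/usr/bin/env bash
# Worst-case propagation estimate in minutes: one full TTL (resolvers may
# have cached the old record just before the change) plus resolver slack.
propagation_estimate_min() {
  local ttl_s=$1 slack_min=${2:-5}   # slack default is an assumption
  echo $(( ttl_s / 60 + slack_min ))
}

propagation_estimate_min 300   # minutes until most resolvers should agree
```

This is why lowering TTL well ahead of a planned failover window pays off: with a 60-second TTL the whole cutover fits inside a few minutes.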

Checkpoint: Before proceeding to Phase 4, confirm:

  • Failover type executed successfully
  • Traffic is flowing to secondary environment
  • No error responses from secondary

Phase 4: Validation

Objective: Confirm the failover was successful and services are operating correctly.

Timeframe: 10-30 minutes

  1. Execute synthetic transactions against the production URL to verify end-to-end functionality:
# Health check
curl -w "\nHTTP Code: %{http_code}\nTotal Time: %{time_total}s\n" \
https://production.example.org/health
# Authentication flow (if applicable)
curl -X POST https://production.example.org/api/auth/test \
-H "Content-Type: application/json" \
-d '{"test": true}'
# Database connectivity (via application endpoint)
curl https://production.example.org/api/db-health
  2. Verify critical application functions with the application owner. Provide them access to run their validation checklist. Common validations include:

    • User authentication and authorisation
    • Data read operations (can users access their data?)
    • Data write operations (can users create/update records?)
    • Integration endpoints (are third-party integrations functional?)
    • Background job processing (are queues being processed?)
  3. Check monitoring dashboards for the secondary environment. Confirm:

    • Request rate matches expected traffic levels
    • Error rate is within normal bounds (typically < 1%)
    • Response times are acceptable (compare to baseline)
    • Resource utilisation is healthy (CPU < 80%, memory < 85%)
# Query Prometheus for error rate
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))/sum(rate(http_requests_total[5m]))" | jq '.data.result[0].value[1]'
# Should be < 0.01 (1%)
  4. Verify data integrity by checking recent records:
-- Check most recent records exist and are accessible
SELECT COUNT(*), MAX(created_at)
FROM transactions
WHERE created_at > NOW() - INTERVAL '1 hour';
-- Compare record counts with expected baseline
SELECT COUNT(*) FROM users;
  5. Test failback readiness by confirming you can still access the primary environment configuration (even if the environment itself is down):
# Verify access to primary region configuration
az account show
az group show --name production-rg 2>/dev/null || echo "Primary resource group unreachable - expected during outage"

Decision point: The application owner confirms the application is functioning correctly and users can perform their work.

Checkpoint: Before proceeding to Phase 5, confirm:

  • Synthetic transactions passing
  • Application owner validation complete
  • Monitoring shows healthy metrics
  • Data integrity verified
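The Prometheus error-rate query in this phase returns a fraction, so the 1% bound needs floating-point comparison, which plain shell arithmetic cannot do. A sketch using awk:

```shell
#!/usr/bin/env bash
# Gate on the error-rate fraction returned by the Prometheus query:
# succeed when the rate is below the 1% bound. awk handles the
# floating-point comparison that [ -lt ] cannot.
error_rate_ok() {
  awk -v r="$1" 'BEGIN { exit (r < 0.01 ? 0 : 1) }'
}

error_rate_ok 0.004 && echo "error rate OK" || echo "error rate too high"
```

The same helper can wrap the `curl ... | jq` pipeline shown earlier so validation can run unattended in a loop.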

Phase 5: Stabilisation and failback planning

Objective: Stabilise operations on the secondary environment and prepare for eventual failback to primary.

Timeframe: Ongoing until primary recovery

  1. Scale the secondary environment for sustained operation if it was initially sized for temporary use:
# Review current resource utilisation
kubectl top pods -n dr
# Increase resources if utilisation exceeds 70%
kubectl set resources deployment/web-frontend -n dr \
--requests=cpu=500m,memory=512Mi \
--limits=cpu=1000m,memory=1Gi
  2. Enable full monitoring and alerting for the secondary environment. Update monitoring targets:
# Prometheus scrape config update
- job_name: 'dr-web-frontend'
static_configs:
- targets: ['dr-web-frontend:8080']
relabel_configs:
- target_label: environment
replacement: dr-active
  3. Update status page and internal communication channels to reflect current state:

    • Status page: “Operating from disaster recovery environment”
    • Include expected performance characteristics if different from normal
    • Provide estimated time for return to primary (if known)
  4. Monitor the primary environment for recovery. Set up alerts for when primary becomes healthy:

# Check primary region health periodically (double quotes outside so the
# JMESPath filter's single quotes survive intact)
watch -n 60 "az vm get-instance-view --resource-group production-rg --name web-vm-01 --query \"instanceView.statuses[?code=='PowerState/running']\" 2>/dev/null && echo 'Primary recovering'"
  5. Document data loss and recovery actions. Calculate actual data loss:
-- Find the last transaction before failover
SELECT MAX(created_at) as last_transaction_before_failover
FROM transactions
WHERE created_at < '2024-01-15 10:30:00'; -- Failover timestamp
-- Compare with replication lag recorded in Phase 2
-- Actual data loss = failover timestamp - last replicated transaction
  6. Prepare the failback plan. Failback is not simply reversing the failover; it requires:

    • Confirming primary environment is fully recovered
    • Ensuring data written to secondary is replicated back to primary
    • Testing primary environment before redirecting traffic
    • Planning for a maintenance window if data reconciliation is required

    Document the failback plan with specific steps and schedule a failback window once the primary is confirmed stable for at least 4 hours.
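The 4-hour stability requirement can be expressed as a simple gate over health-check timestamps. A sketch (epoch seconds; the usage example is illustrative and assumes GNU date):

```shell
#!/usr/bin/env bash
# Failback gate: the primary must have been continuously healthy for at
# least 4 hours, i.e. the last unhealthy observation is >= 4 h old.
# Timestamps are Unix epoch seconds.
stable_for_failback() {
  local last_unhealthy=$1 now=$2
  local required=$((4 * 3600))
  [ $((now - last_unhealthy)) -ge "$required" ]
}

# Example (GNU date):
# stable_for_failback "$(date -d '5 hours ago' +%s)" "$(date +%s)" && echo "schedule failback"
```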

Checkpoint: Stabilisation complete when:

  • Secondary environment scaled appropriately
  • Full monitoring active
  • Communication updated
  • Primary recovery monitoring in place
  • Data loss documented
  • Failback plan drafted

Communications

Communicate with stakeholders throughout the failover process using the templates below.

| Stakeholder | Timing | Channel | Message owner | Template |
|---|---|---|---|---|
| IT leadership | Within 15 minutes of activation | Direct message or call | Incident commander | Initial notification |
| All staff | Within 30 minutes of failover completion | Email and intranet | Communications lead | Service notification |
| External users | Within 1 hour of failover completion | Status page | Communications lead | Status page update |
| Donors/partners | Within 4 hours if SLA-bound | Email | Communications lead | Partner notification |

Initial notification template

Subject: [INCIDENT] Service failover in progress - Initial notification
Service: [Service name]
Status: Failover in progress
Started: [Timestamp]
Summary:
We have detected [brief description of failure] affecting [services].
We are initiating failover to our disaster recovery environment.
Expected impact:
- Brief service interruption (estimated [X] minutes)
- Users may need to re-authenticate after failover
- [Any data loss window]
Current actions:
- Failover execution in progress
- Monitoring secondary environment
- Will provide update upon completion
Next update: [Time - within 30 minutes]
Incident commander: [Name]
Contact: [Phone/chat channel]

Service notification template

Subject: Service update - [Service name] operating from backup systems
Dear colleagues,
Following a technical issue with our primary systems, [service name] is
now operating from our disaster recovery environment.
What this means for you:
- The service is available and functioning normally
- You may notice [any performance differences]
- If you experience issues, please [action]
What we're doing:
- Monitoring service performance
- Working to restore primary systems
- Will notify you when we return to normal operations
If you have questions or experience problems, contact the service desk.
Thank you for your patience.

Status page update template

Title: Service operating from backup systems
[Timestamp] - RESOLVED (MONITORING)
[Service name] is now operating from our disaster recovery systems.
Users can access all functions normally.
We are monitoring the service and working to restore primary systems.
We will provide an update when we return to normal operations.
No user action is required.

Evidence preservation

Document the following throughout the failover process:

| Evidence type | When to capture | Retention |
|---|---|---|
| Provider status page screenshots | At activation and hourly during outage | 90 days |
| Monitoring dashboard exports | At activation, post-failover, post-validation | 90 days |
| Command history with timestamps | Throughout execution | 90 days |
| Replication lag measurements | Before failover | 90 days |
| DNS propagation checks | Post DNS change | 30 days |
| Validation test results | Phase 4 | 90 days |
| Communication sent | All phases | 1 year |
| Decision log | All phases | 1 year |

Create the incident record immediately after Phase 4 validation completes:

INCIDENT RECORD
Incident ID: [Auto-generated or manual]
Date/Time: [Start] to [End]
Duration: [Total]
Classification: Business continuity - Cloud failover
Failover type: [Region / Application / DNS]
Root cause: [Provider outage / Application failure / Network issue]
Timeline:
- [Time]: Failure detected
- [Time]: Failover authorised
- [Time]: Phase 2 complete - secondary verified
- [Time]: Failover executed
- [Time]: Validation complete
- [Time]: Stable operations confirmed
Data impact:
- Replication lag at failover: [X seconds/minutes]
- Estimated transactions affected: [Count]
- Data recovery actions required: [Yes/No - details]
Actions for follow-up:
- [ ] Failback execution (scheduled: [date])
- [ ] Post-incident review (scheduled: [date])
- [ ] Update runbook with lessons learned

Regional failover architecture reference

The following diagram illustrates a typical multi-region architecture with failover capability:

NORMAL OPERATION

DNS / CDN (Cloudflare, Route 53)
├─ PRIMARY REGION (Active) — 100% of traffic
│    Load Balancer → App Servers (4 instances) → Database (Primary)
└─ SECONDARY REGION (Standby) — 0% of traffic, receives replication only
     Load Balancer (scaled down) → App Servers (1 instance) → Database (Replica)

The primary database replicates continuously to the secondary replica.

DURING FAILOVER

DNS / CDN (redirecting)
├─ PRIMARY REGION (Failed) — 0% of traffic, unavailable
└─ SECONDARY REGION (Activating) — 100% of traffic, now serving production
     Load Balancer (scaling up) → App Servers (scaling to 4) → Database (promoting)

See also