Cloud Failover
Cloud failover transfers workloads from a failed or degraded primary environment to a secondary environment when the primary cannot meet service requirements. This playbook covers three failover types: region failover when an entire cloud region becomes unavailable, application failover when specific services fail while infrastructure remains healthy, and DNS failover when traffic steering is required without infrastructure changes. Execute this playbook when monitoring indicates service degradation beyond defined thresholds and automated recovery has not restored service within the configured grace period.
Activation criteria
Invoke this playbook when any of the following conditions persist for the specified duration after automated recovery attempts have been exhausted.
| Failover type | Condition | Threshold | Grace period |
|---|---|---|---|
| Region | Cloud provider status page confirms regional outage | Provider-declared outage | 0 minutes |
| Region | Multiple availability zones in region unreachable | 2+ zones unavailable | 15 minutes |
| Region | Cross-zone latency exceeds baseline | Latency > 500ms sustained | 30 minutes |
| Application | Health check failures across all instances | 100% failure rate | 5 minutes |
| Application | Error rate exceeds threshold | > 50% 5xx responses | 10 minutes |
| Application | Response time degradation | p95 latency > 10× baseline | 15 minutes |
| DNS | Primary endpoint unreachable from multiple probe locations | 3+ probe failures | 3 minutes |
| DNS | Geographic routing failure | Region-specific failures | 5 minutes |
Automated failover precedence
If automated failover is configured and functioning, allow it to complete before manual intervention. This playbook applies when automation fails or is not configured, or when human judgment is required for complex failure scenarios.
Roles
| Role | Responsibility | Typical assignee | Backup |
|---|---|---|---|
| Incident commander | Authorises failover, coordinates communication, makes go/no-go decisions | IT Manager or designated on-call lead | Senior infrastructure engineer |
| Technical lead | Executes failover procedures, validates success, troubleshoots issues | Cloud infrastructure engineer | Platform engineer |
| Application owner | Validates application functionality post-failover, approves service restoration | Application team lead | Senior developer |
| Communications lead | Stakeholder updates, status page management, user notification | Service desk manager | IT Manager |
Phase 1: Assessment and decision
Objective: Confirm failure conditions, determine failover type, and obtain authorisation to proceed.
Timeframe: 5-15 minutes
- Verify the failure is genuine and not a monitoring false positive. Check the cloud provider's status page directly at:
- Azure: https://status.azure.com
- AWS: https://health.aws.amazon.com
- Google Cloud: https://status.cloud.google.com
Compare provider status with your own monitoring. A provider-acknowledged outage confirms regional failure. If the provider reports healthy but your monitoring shows failure, the issue is likely application-level or network path-specific.
- Determine failure scope by running diagnostic checks from a location outside the affected region. For Azure:
```shell
# Check resource health from Azure CLI (run from unaffected region or local machine)
az resource list --resource-group production-rg \
  --query "[].{name:name, health:properties.healthState}" -o table

# Check VM availability
az vm get-instance-view --resource-group production-rg --name web-vm-01 \
  --query "instanceView.statuses[?code=='PowerState/running']"
```

For AWS:
```shell
# Check instance status
aws ec2 describe-instance-status --region eu-west-1 \
  --query "InstanceStatuses[*].{ID:InstanceId,State:InstanceState.Name,Status:InstanceStatus.Status}"

# Check service health events
aws health describe-events --region us-east-1 --filter "eventTypeCategories=issue"
```

- Classify the failure type using the decision tree:
```
             +----------------------+
             | Service unavailable  |
             +----------+-----------+
                        |
             +----------v-----------+
             |  Provider confirms   |
             |  regional outage?    |
             +----------+-----------+
                        |
       +----------------+----------------+
       |                                 |
      Yes                                No
       |                                 |
       v                                 v
+------+----------+          +-----------+----------+
| REGION FAILOVER |          |   Multiple apps/     |
+-----------------+          |  services affected?  |
                             +-----------+----------+
                                         |
                        +----------------+----------------+
                        |                                 |
                       Yes                                No
                        |                                 |
                        v                                 v
             +----------+-----------+       +------------+---------+
             |   Infrastructure     |       |     APPLICATION      |
             |  healthy in region?  |       |       FAILOVER       |
             +----------+-----------+       +----------------------+
                        |
       +----------------+----------------+
       |                                 |
      Yes                                No
       |                                 |
       v                                 v
+------+----------+          +-----------+----------+
|  DNS FAILOVER   |          |   REGION FAILOVER    |
+-----------------+          +----------------------+
```

- Calculate the impact of failover versus waiting for recovery. Failover incurs costs and risks:
- Data synchronisation lag: Check replication status to determine potential data loss window
- Failover execution time: 5-30 minutes depending on type
- DNS propagation: 5-60 minutes depending on TTL settings
- Application warm-up: Variable by application
If estimated recovery time from the provider is less than failover execution time plus propagation time, waiting may be preferable.
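That comparison can be sketched as simple shell arithmetic. This is illustrative only: the three timing values are assumptions, to be replaced with the estimates gathered during assessment.

```shell
# Wait-vs-failover comparison (illustrative values, in minutes)
provider_eta=45        # provider's estimated time to recovery
failover_exec=20       # estimated failover execution time
dns_propagation=10     # propagation time implied by the current TTL

failover_total=$((failover_exec + dns_propagation))

if [ "$provider_eta" -lt "$failover_total" ]; then
    echo "WAIT: provider recovery (${provider_eta}m) beats failover (${failover_total}m)"
else
    echo "FAIL OVER: failover (${failover_total}m) beats waiting (${provider_eta}m)"
fi
```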
- Obtain authorisation from the incident commander. Present:
- Confirmed failure type
- Estimated data loss (replication lag)
- Estimated time to failover completion
- Estimated time if waiting for recovery
- Business impact of continued outage
Decision point: The incident commander authorises failover or decides to wait for primary recovery. Document the decision and reasoning in the incident record.
Checkpoint: Before proceeding to Phase 2, confirm:
- Failure type is determined (region, application, or DNS)
- Failover is authorised by incident commander
- Decision is documented with timestamp
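The Phase 1 classification can also be expressed as a small helper, useful during drills or when annotating the incident record. This is a sketch: the function name and the yes/no argument convention are illustrative, not part of any provider tooling.

```shell
# Classify the failure type from the three decision-tree questions.
# Arguments: provider confirms regional outage? / multiple apps affected?
#            / infrastructure healthy in region?  (each "yes" or "no")
classify_failure() {
    provider_outage=$1
    multi_service=$2
    infra_healthy=$3

    if [ "$provider_outage" = "yes" ]; then
        echo "REGION FAILOVER"
    elif [ "$multi_service" = "no" ]; then
        echo "APPLICATION FAILOVER"
    elif [ "$infra_healthy" = "yes" ]; then
        echo "DNS FAILOVER"
    else
        echo "REGION FAILOVER"
    fi
}

classify_failure no yes yes   # prints "DNS FAILOVER"
```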
Phase 2: Pre-failover preparation
Objective: Verify secondary environment readiness and prepare for failover execution.
Timeframe: 5-20 minutes
- Verify the secondary environment is healthy and ready to receive traffic. Run health checks against the secondary region or standby instances:
```shell
# Azure - Check secondary region resources
az resource list --resource-group production-dr-rg --location northeurope \
  --query "[].{name:name, provisioningState:provisioningState}" -o table

# AWS - Check standby instances in DR region
aws ec2 describe-instances --region eu-west-2 \
  --filters "Name=tag:Environment,Values=dr" \
  --query "Reservations[*].Instances[*].{ID:InstanceId,State:State.Name}"
```

- Check data replication status to determine the Recovery Point Objective (RPO) exposure. The replication lag indicates how much data could be lost.
For Azure SQL with geo-replication:
```sql
-- Check the geo-replication link; replication_lag_sec is reported
-- on the primary database
SELECT partner_server,
       partner_database,
       replication_state_desc,
       replication_lag_sec
FROM sys.dm_geo_replication_link_status;
```

For AWS RDS with read replicas:
```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=production-replica \
  --start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average
```

Record the replication lag. If lag exceeds acceptable RPO (typically defined in the BCDR plan), notify the incident commander before proceeding.
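As a sketch, the recorded lag can be compared against the RPO threshold directly in shell. Both numbers below are illustrative: take the real RPO from the BCDR plan and the real lag from the queries above.

```shell
# Compare measured replication lag against the acceptable RPO
rpo_seconds=300            # acceptable RPO per BCDR plan (assumed)
replication_lag=42         # value recorded from the queries above (assumed)

if [ "$replication_lag" -gt "$rpo_seconds" ]; then
    echo "ESCALATE: lag ${replication_lag}s exceeds RPO ${rpo_seconds}s"
else
    echo "OK: lag ${replication_lag}s within RPO ${rpo_seconds}s"
fi
```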
- For region failover, verify network connectivity to the secondary region from client locations. Run traceroute and latency tests from representative locations:
```shell
# Test connectivity to secondary load balancer
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://dr.example.org/health
```

- Scale the secondary environment if running in reduced capacity. Many DR configurations run the secondary at reduced scale to minimise costs:
```shell
# Azure - Scale up App Service plan
az appservice plan update --name dr-plan --resource-group production-dr-rg --sku P2v2

# AWS - Update Auto Scaling group desired capacity
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-web-asg \
  --desired-capacity 4 \
  --min-size 2
```

Wait for scaling to complete before proceeding. Monitor instance health as new capacity comes online.
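"Wait for scaling" and similar wait-until-healthy steps can be wrapped in a generic polling helper such as the sketch below. The interval and timeout values, and the example health URL, are assumptions.

```shell
# Retry a command until it succeeds or the timeout (seconds) expires.
# Usage: wait_for <timeout> <interval> <command...>
wait_for() {
    timeout=$1; interval=$2; shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example: poll a DR health endpoint every 15s for up to 10 minutes
# wait_for 600 15 curl -fsS -o /dev/null https://dr.example.org/health
```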
- Prepare DNS changes but do not execute them yet. Identify the records that require modification:
```shell
# List current DNS records
dig +short production.example.org
dig +short production.example.org CNAME

# Note current values for rollback
# Production: 203.0.113.10 (eu-west-1)
# DR target:  203.0.113.20 (eu-west-2)
```

- Notify the application owner that failover is imminent. They should prepare for post-failover validation.
Checkpoint: Before proceeding to Phase 3, confirm:
- Secondary environment health verified
- Replication lag recorded and acceptable
- Secondary environment scaled to production capacity
- DNS change prepared
- Application owner notified
Phase 3: Failover execution
Objective: Execute the failover and redirect traffic to the secondary environment.
Timeframe: 5-30 minutes depending on failover type
Execute the section matching the failover type determined in Phase 1.
Region failover
Region failover redirects all traffic from a failed region to a healthy secondary region. This is the most comprehensive failover type and affects all services deployed in the primary region.
If using Azure Traffic Manager or AWS Route 53 health checks with automatic failover, verify the automatic failover has triggered. If not, force the failover:
For Azure Traffic Manager:
```shell
# Disable the primary endpoint to force failover
az network traffic-manager endpoint update \
  --resource-group dns-rg \
  --profile-name production-tm \
  --name primary-endpoint \
  --type azureEndpoints \
  --endpoint-status Disabled
```

For AWS Route 53:
```shell
# Update health check to force failover (set to always unhealthy)
aws route53 update-health-check \
  --health-check-id abc123-health-check-id \
  --inverted
```

- For database failover with Azure SQL geo-replication, initiate forced failover (accepts potential data loss):
```shell
# Forced failover - use when primary is unreachable
az sql db replica set-primary \
  --resource-group production-dr-rg \
  --server dr-sql-server \
  --name production-db \
  --allow-data-loss
```

For AWS RDS Multi-AZ, the failover is automatic. For cross-region read replica promotion:
```shell
# Promote read replica to standalone (irreversible)
aws rds promote-read-replica \
  --db-instance-identifier production-replica \
  --backup-retention-period 7
```

Database promotion is irreversible
Promoting a read replica breaks replication permanently. After promotion, you must reconfigure replication from the new primary. Ensure this action is authorised and documented.
- Update application configuration to point to the new database endpoint if not using DNS-based database endpoints:
```shell
# Update environment variable or configuration
# Azure App Service
az webapp config appsettings set \
  --resource-group production-dr-rg \
  --name dr-webapp \
  --settings DATABASE_HOST=dr-sql-server.database.windows.net

# Restart application to pick up new configuration
az webapp restart --resource-group production-dr-rg --name dr-webapp
```

- Verify the secondary application instances are serving traffic correctly by making direct requests that bypass DNS:
```shell
# Direct request to secondary load balancer IP (--resolve pins the
# hostname to the IP so TLS verification still matches the certificate)
curl --resolve production.example.org:443:203.0.113.20 \
  https://production.example.org/health

# Verify response indicates healthy secondary
```

- Update public DNS to point to the secondary environment. The method depends on your DNS provider:
For Cloudflare:
```shell
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records/{record_id}" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  --data '{"content":"203.0.113.20"}'
```

For Route 53 (if not using health-check-based failover):
```shell
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "production.example.org",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.20"}]
      }
    }]
  }'
```

- Monitor DNS propagation. Changes propagate based on TTL settings. If the TTL was 300 seconds (5 minutes), most resolvers will pick up the change within 10 minutes. Check propagation from multiple locations:
```shell
# Check from different DNS resolvers
dig @8.8.8.8 production.example.org +short
dig @1.1.1.1 production.example.org +short
dig @208.67.222.222 production.example.org +short
```

Application failover
Application failover redirects traffic for a specific application while leaving other regional infrastructure intact. Use this when a single application or service fails but the underlying infrastructure remains healthy.
- Identify the failing application components and their standby counterparts:
```shell
# List current application instances
kubectl get pods -n production -l app=web-frontend

# Check standby deployment status
kubectl get pods -n dr -l app=web-frontend
```

- Scale up the standby application deployment to match production capacity:
```shell
# Scale standby deployment
kubectl scale deployment web-frontend -n dr --replicas=4

# Wait for pods to be ready
kubectl rollout status deployment/web-frontend -n dr --timeout=300s
```

- Update the service routing to direct traffic to standby instances. For Kubernetes with a service mesh:
```shell
# Update Istio VirtualService to route to DR
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-frontend
  namespace: production
spec:
  hosts:
  - web-frontend
  http:
  - route:
    - destination:
        host: web-frontend.dr.svc.cluster.local
      weight: 100
EOF
```

For load balancer-based routing:
```shell
# Remove primary backend from load balancer pool
az network lb address-pool address remove \
  --resource-group production-rg \
  --lb-name production-lb \
  --pool-name backend-pool \
  --name primary-backend

# Add DR backend to load balancer pool
az network lb address-pool address add \
  --resource-group production-rg \
  --lb-name production-lb \
  --pool-name backend-pool \
  --name dr-backend \
  --ip-address 10.1.2.10
```

- Verify traffic is flowing to the standby application:
```shell
# Check request distribution
kubectl logs -n dr -l app=web-frontend --tail=10 | grep "GET /"

# Verify metrics show traffic on standby
curl -s http://standby-prometheus:9090/api/v1/query?query=http_requests_total | jq '.data.result'
```

DNS failover
DNS failover redirects traffic at the DNS layer without modifying infrastructure. Use this when you need rapid traffic steering or when infrastructure is healthy but network paths are degraded.
- Verify the target endpoint is healthy before redirecting traffic:
```shell
# Health check against secondary endpoint
curl -w "%{http_code}" -o /dev/null -s https://secondary.example.org/health
# Expected: 200
```

- Update DNS records. For weighted or failover record sets, adjust weights or failover status:
```shell
# Route 53 - Switch primary/secondary failover
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "production.example.org",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.20"}]
      }
    }]
  }'
```

For simple record update:
```shell
# Update A record to point to secondary
# Record current value first for rollback
dig +short production.example.org > /tmp/dns-rollback-$(date +%s).txt

# Apply change via DNS provider API
```

- Flush DNS caches on critical systems if immediate propagation is required:
```shell
# Windows
ipconfig /flushdns

# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder

# Linux (systemd-resolved; older systems: systemd-resolve --flush-caches)
sudo resolvectl flush-caches
```

- Monitor propagation using external DNS checking services. DNS changes follow this approximate timeline based on TTL:
```
+---------------------------------------------------------------------+
|                      DNS PROPAGATION TIMELINE                       |
+---------------------------------------------------------------------+
|                                                                     |
| TTL: 60s   |----|                  Full propagation: ~2-5 minutes   |
|                                                                     |
| TTL: 300s  |--------|              Full propagation: ~10-15 minutes |
|                                                                     |
| TTL: 3600s |--------------------------------|  Full: ~60-90 min     |
|                                                                     |
| 0    5    10   15   20   25   30     45     60     75    90 minutes |
+---------------------------------------------------------------------+
```

Checkpoint: Before proceeding to Phase 4, confirm:
- Failover type executed successfully
- Traffic is flowing to secondary environment
- No error responses from secondary
Phase 4: Validation
Objective: Confirm the failover was successful and services are operating correctly.
Timeframe: 10-30 minutes
- Execute synthetic transactions against the production URL to verify end-to-end functionality:
```shell
# Health check
curl -w "\nHTTP Code: %{http_code}\nTotal Time: %{time_total}s\n" \
  https://production.example.org/health

# Authentication flow (if applicable)
curl -X POST https://production.example.org/api/auth/test \
  -H "Content-Type: application/json" \
  -d '{"test": true}'

# Database connectivity (via application endpoint)
curl https://production.example.org/api/db-health
```

- Verify critical application functions with the application owner. Provide them access to run their validation checklist. Common validations include:
- User authentication and authorisation
- Data read operations (can users access their data?)
- Data write operations (can users create/update records?)
- Integration endpoints (are third-party integrations functional?)
- Background job processing (are queues being processed?)
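One lightweight way to track this checklist is to record each check as a `name,pass|fail` line and gate sign-off on the failure count. The sketch below uses illustrative data; the check names simply mirror the list above and are not part of any tooling.

```shell
# Validation results recorded by the application owner (illustrative)
results="auth,pass
reads,pass
writes,pass
integrations,fail
background-jobs,pass"

# Count failing checks and block sign-off if any remain
failures=$(printf '%s\n' "$results" | grep -c ',fail$')
echo "Failed checks: $failures"
if [ "$failures" -gt 0 ]; then
    echo "Do not sign off - investigate failing checks"
fi
```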
- Check monitoring dashboards for the secondary environment. Confirm:
- Request rate matches expected traffic levels
- Error rate is within normal bounds (typically < 1%)
- Response times are acceptable (compare to baseline)
- Resource utilisation is healthy (CPU < 80%, memory < 85%)
```shell
# Query Prometheus for error rate
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))/sum(rate(http_requests_total[5m]))" | jq '.data.result[0].value[1]'
# Should be < 0.01 (1%)
```

- Verify data integrity by checking recent records:
```sql
-- Check most recent records exist and are accessible
SELECT COUNT(*), MAX(created_at)
FROM transactions
WHERE created_at > NOW() - INTERVAL '1 hour';

-- Compare record counts with expected baseline
SELECT COUNT(*) FROM users;
```

- Test failback readiness by confirming you can still access the primary environment configuration (even if the environment itself is down):
```shell
# Verify access to primary region configuration
az account show
az group show --name production-rg 2>/dev/null || echo "Primary resource group unreachable - expected during outage"
```

Decision point: The application owner confirms the application is functioning correctly and users can perform their work.
Checkpoint: Before proceeding to Phase 5, confirm:
- Synthetic transactions passing
- Application owner validation complete
- Monitoring shows healthy metrics
- Data integrity verified
Phase 5: Stabilisation and failback planning
Objective: Stabilise operations on the secondary environment and prepare for eventual failback to primary.
Timeframe: Ongoing until primary recovery
- Scale the secondary environment for sustained operation if it was initially sized for temporary use:
```shell
# Review current resource utilisation
kubectl top pods -n dr

# Increase resources if utilisation exceeds 70%
kubectl set resources deployment/web-frontend -n dr \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=1000m,memory=1Gi
```

- Enable full monitoring and alerting for the secondary environment. Update monitoring targets:
```yaml
# Prometheus scrape config update
- job_name: 'dr-web-frontend'
  static_configs:
  - targets: ['dr-web-frontend:8080']
  relabel_configs:
  - target_label: environment
    replacement: dr-active
```

- Update the status page and internal communication channels to reflect the current state:
- Status page: “Operating from disaster recovery environment”
- Include expected performance characteristics if different from normal
- Provide estimated time for return to primary (if known)
- Monitor the primary environment for recovery. Set up alerts for when the primary becomes healthy:
```shell
# Check primary region health periodically
watch -n 60 'az vm get-instance-view --resource-group production-rg --name web-vm-01 --query "instanceView.statuses[?code==\"PowerState/running\"]" 2>/dev/null && echo "Primary recovering"'
```

- Document data loss and recovery actions. Calculate actual data loss:
```sql
-- Find the last transaction before failover
SELECT MAX(created_at) AS last_transaction_before_failover
FROM transactions
WHERE created_at < '2024-01-15 10:30:00'; -- Failover timestamp

-- Compare with the replication lag recorded in Phase 2
-- Actual data loss = failover timestamp - last replicated transaction
```

- Prepare the failback plan. Failback is not simply reversing the failover; it requires:
- Confirming primary environment is fully recovered
- Ensuring data written to secondary is replicated back to primary
- Testing primary environment before redirecting traffic
- Planning for a maintenance window if data reconciliation is required
Document the failback plan with specific steps and schedule a failback window once the primary is confirmed stable for at least 4 hours.
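The four-hour stability requirement can be checked with a small timestamp comparison. This is a sketch using GNU date; the "healthy since" observation time is an illustrative assumption, to be taken from monitoring in practice.

```shell
# Has the primary been healthy long enough to schedule failback?
required_stable=$((4 * 3600))                          # 4 hours in seconds
primary_healthy_since=$(date -u -d '5 hours ago' +%s)  # assumed observation
now=$(date -u +%s)

stable_for=$((now - primary_healthy_since))
if [ "$stable_for" -ge "$required_stable" ]; then
    echo "Primary stable for $((stable_for / 3600))h - failback window may be scheduled"
else
    echo "Primary stable for only $((stable_for / 60))m - keep waiting"
fi
```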
Checkpoint: Stabilisation complete when:
- Secondary environment scaled appropriately
- Full monitoring active
- Communication updated
- Primary recovery monitoring in place
- Data loss documented
- Failback plan drafted
Communications
Communicate with stakeholders throughout the failover process using the templates below.
| Stakeholder | Timing | Channel | Message owner | Template |
|---|---|---|---|---|
| IT leadership | Within 15 minutes of activation | Direct message or call | Incident commander | Initial notification |
| All staff | Within 30 minutes of failover completion | Email and intranet | Communications lead | Service notification |
| External users | Within 1 hour of failover completion | Status page | Communications lead | Status page update |
| Donors/partners | Within 4 hours if SLA-bound | Email | Communications lead | Partner notification |
Initial notification template
Subject: [INCIDENT] Service failover in progress - Initial notification
Service: [Service name]
Status: Failover in progress
Started: [Timestamp]

Summary: We have detected [brief description of failure] affecting [services]. We are initiating failover to our disaster recovery environment.

Expected impact:
- Brief service interruption (estimated [X] minutes)
- Users may need to re-authenticate after failover
- [Any data loss window]

Current actions:
- Failover execution in progress
- Monitoring secondary environment
- Will provide update upon completion

Next update: [Time - within 30 minutes]

Incident commander: [Name]
Contact: [Phone/chat channel]

Service notification template
Subject: Service update - [Service name] operating from backup systems
Dear colleagues,
Following a technical issue with our primary systems, [service name] is now operating from our disaster recovery environment.
What this means for you:
- The service is available and functioning normally
- You may notice [any performance differences]
- If you experience issues, please [action]

What we're doing:
- Monitoring service performance
- Working to restore primary systems
- Will notify you when we return to normal operations

If you have questions or experience problems, contact the service desk.

Thank you for your patience.

Status page update template
Title: Service operating from backup systems
[Timestamp] - RESOLVED (MONITORING)
[Service name] is now operating from our disaster recovery systems. Users can access all functions normally.

We are monitoring the service and working to restore primary systems. We will provide an update when we return to normal operations.

No user action is required.

Evidence preservation
Document the following throughout the failover process:
| Evidence type | When to capture | Retention |
|---|---|---|
| Provider status page screenshots | At activation and hourly during outage | 90 days |
| Monitoring dashboard exports | At activation, post-failover, post-validation | 90 days |
| Command history with timestamps | Throughout execution | 90 days |
| Replication lag measurements | Before failover | 90 days |
| DNS propagation checks | Post DNS change | 30 days |
| Validation test results | Phase 4 | 90 days |
| Communication sent | All phases | 1 year |
| Decision log | All phases | 1 year |
Create the incident record immediately after Phase 4 validation completes:
INCIDENT RECORD
Incident ID: [Auto-generated or manual]
Date/Time: [Start] to [End]
Duration: [Total]

Classification: Business continuity - Cloud failover
Failover type: [Region / Application / DNS]

Root cause: [Provider outage / Application failure / Network issue]

Timeline:
- [Time]: Failure detected
- [Time]: Failover authorised
- [Time]: Phase 2 complete - secondary verified
- [Time]: Failover executed
- [Time]: Validation complete
- [Time]: Stable operations confirmed

Data impact:
- Replication lag at failover: [X seconds/minutes]
- Estimated transactions affected: [Count]
- Data recovery actions required: [Yes/No - details]

Actions for follow-up:
- [ ] Failback execution (scheduled: [date])
- [ ] Post-incident review (scheduled: [date])
- [ ] Update runbook with lessons learned

Regional failover architecture reference
The following diagram illustrates a typical multi-region architecture with failover capability:
```
+------------------------------------------------------------------+
|                         NORMAL OPERATION                         |
+------------------------------------------------------------------+

                      +------------------+
                      |    DNS / CDN     |
                      |  (Cloudflare,    |
                      |    Route 53)     |
                      +--------+---------+
                               |
               +---------------+---------------+
               |                               |
               v                               v
  +------------+------------+     +------------+------------+
  |     PRIMARY REGION      |     |    SECONDARY REGION     |
  |        (Active)         |     |        (Standby)        |
  |                         |     |                         |
  |  +------------------+   |     |  +------------------+   |
  |  |  Load Balancer   |   |     |  |  Load Balancer   |   |
  |  +--------+---------+   |     |  |  (scaled down)   |   |
  |           |             |     |  +--------+---------+   |
  |  +--------v---------+   |     |           |             |
  |  |   App Servers    |   |     |  +--------v---------+   |
  |  |  (4 instances)   |   |     |  |   App Servers    |   |
  |  +--------+---------+   |     |  |  (1 instance)    |   |
  |           |             |     |  +--------+---------+   |
  |  +--------v---------+   |     |           |             |
  |  |     Database     |   |     |  +--------v---------+   |
  |  |    (Primary)     +---+---->+  |     Database     |   |
  |  +------------------+   |     |  |    (Replica)     |   |
  |                         |     |  +------------------+   |
  +-------------------------+     +-------------------------+
        100% traffic                      0% traffic
                                 (receives replication only)

+------------------------------------------------------------------+
|                         DURING FAILOVER                          |
+------------------------------------------------------------------+

                      +------------------+
                      |    DNS / CDN     |
                      |  (redirecting)   |
                      +--------+---------+
                               |
               +---------------+---------------+
               |                               |
               v                               v
  +------------+------------+     +------------+------------+
  |     PRIMARY REGION      |     |    SECONDARY REGION     |
  |        (Failed)         |     |      (Activating)       |
  |                         |     |                         |
  |  +------------------+   |     |  +------------------+   |
  |  |      XXXXXX      |   |     |  |  Load Balancer   |   |
  |  |   UNAVAILABLE    |   |     |  |  (scaling up)    |   |
  |  +------------------+   |     |  +--------+---------+   |
  |                         |     |           |             |
  |                         |     |  +--------v---------+   |
  |                         |     |  |   App Servers    |   |
  |                         |     |  |  (scaling to 4)  |   |
  |                         |     |  +--------+---------+   |
  |                         |     |           |             |
  |                         |     |  +--------v---------+   |
  |                         |     |  |     Database     |   |
  |                         |     |  |   (promoting)    |   |
  |                         |     |  +------------------+   |
  +-------------------------+     +-------------------------+
         0% traffic                      100% traffic
                                    (now serving production)
```

See also
- High Availability and Disaster Recovery for architecture concepts and design patterns
- DR Site Invocation for full disaster recovery activation including physical site failover
- Backup Recovery for data restoration procedures
- Major Service Outage for incident response coordination
- DR Testing for failover testing procedures
- Cloud Strategy and Platform Selection for multi-region architecture decisions