Cloud Failover
Cloud failover transfers workloads from a failed or degraded primary environment to a secondary environment when the primary cannot meet service requirements. This playbook covers three failover types: region failover when an entire cloud region becomes unavailable, application failover when specific services fail while infrastructure remains healthy, and DNS failover when traffic steering is required without infrastructure changes. Execute this playbook when monitoring indicates service degradation beyond defined thresholds and automated recovery has not restored service within the configured grace period.
Activation criteria
Invoke this playbook when any of the following conditions persist for the specified duration after automated recovery attempts have been exhausted.
| Failover type | Condition | Threshold | Grace period |
|---|---|---|---|
| Region | Cloud provider status page confirms regional outage | Provider-declared outage | 0 minutes |
| Region | Multiple availability zones in region unreachable | 2+ zones unavailable | 15 minutes |
| Region | Cross-zone latency exceeds baseline | Latency > 500ms sustained | 30 minutes |
| Application | Health check failures across all instances | 100% failure rate | 5 minutes |
| Application | Error rate exceeds threshold | > 50% 5xx responses | 10 minutes |
| Application | Response time degradation | p95 latency > 10× baseline | 15 minutes |
| DNS | Primary endpoint unreachable from multiple probe locations | 3+ probe failures | 3 minutes |
| DNS | Geographic routing failure | Region-specific failures | 5 minutes |
Automated failover precedence
If automated failover is configured and functioning, allow it to complete before manual intervention. This playbook applies when automation fails or is not configured, or when human judgment is required for complex failure scenarios.
Roles
| Role | Responsibility | Typical assignee | Backup |
|---|---|---|---|
| Incident commander | Authorises failover, coordinates communication, makes go/no-go decisions | IT Manager or designated on-call lead | Senior infrastructure engineer |
| Technical lead | Executes failover procedures, validates success, troubleshoots issues | Cloud infrastructure engineer | Platform engineer |
| Application owner | Validates application functionality post-failover, approves service restoration | Application team lead | Senior developer |
| Communications lead | Stakeholder updates, status page management, user notification | Service desk manager | IT Manager |
Phase 1: Assessment and decision
Objective: Confirm failure conditions, determine failover type, and obtain authorisation to proceed.
Timeframe: 5-15 minutes
- Verify the failure is genuine and not a monitoring false positive. Check the cloud provider's status page directly at:
- Azure: https://status.azure.com
- AWS: https://health.aws.amazon.com
- Google Cloud: https://status.cloud.google.com
Compare provider status with your own monitoring. A provider-acknowledged outage confirms regional failure. If the provider reports healthy but your monitoring shows failure, the issue is likely application-level or network path-specific.
- Determine failure scope by running diagnostic checks from a location outside the affected region. For Azure:
```shell
# Check resource health from Azure CLI (run from unaffected region or local machine)
az resource list --resource-group production-rg \
  --query "[].{name:name, health:properties.healthState}" -o table

# Check VM availability
az vm get-instance-view --resource-group production-rg --name web-vm-01 \
  --query "instanceView.statuses[?code=='PowerState/running']"
```

For AWS:
```shell
# Check instance status
aws ec2 describe-instance-status --region eu-west-1 \
  --query "InstanceStatuses[*].{ID:InstanceId,State:InstanceState.Name,Status:InstanceStatus.Status}"

# Check service health events
aws health describe-events --region us-east-1 --filter "eventTypeCategories=issue"
```

- Classify the failure type using the decision tree:
```
             +----------------------+
             | Service unavailable  |
             +----------+-----------+
                        |
             +----------v-----------+
             |  Provider confirms   |
             |  regional outage?    |
             +----------+-----------+
                        |
       +----------------+----------------+
       |                                 |
      Yes                                No
       |                                 |
       v                                 v
+------+----------+          +-----------+----------+
| REGION FAILOVER |          |   Multiple apps/     |
+-----------------+          |  services affected?  |
                             +-----------+----------+
                                         |
                        +----------------+----------------+
                        |                                 |
                       Yes                                No
                        |                                 |
                        v                                 v
             +----------+-----------+       +------------+---------+
             |   Infrastructure     |       |     APPLICATION      |
             |  healthy in region?  |       |       FAILOVER       |
             +----------+-----------+       +----------------------+
                        |
       +----------------+----------------+
       |                                 |
      Yes                                No
       |                                 |
       v                                 v
+------+----------+          +-----------+----------+
|  DNS FAILOVER   |          |   REGION FAILOVER    |
+-----------------+          +----------------------+
```

- Calculate the impact of failover versus waiting for recovery. Failover incurs costs and risks:
- Data synchronisation lag: Check replication status to determine potential data loss window
- Failover execution time: 5-30 minutes depending on type
- DNS propagation: 5-60 minutes depending on TTL settings
- Application warm-up: Variable by application
If estimated recovery time from the provider is less than failover execution time plus propagation time, waiting may be preferable.
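That comparison can be sketched as simple shell arithmetic. This is illustrative only: the three timing values are assumptions, to be replaced with the estimates gathered during assessment.

```shell
# Wait-vs-failover comparison (illustrative values, in minutes)
provider_eta=45        # provider's estimated time to recovery
failover_exec=20       # estimated failover execution time
dns_propagation=10     # propagation time implied by the current TTL

failover_total=$((failover_exec + dns_propagation))

if [ "$provider_eta" -lt "$failover_total" ]; then
    echo "WAIT: provider recovery (${provider_eta}m) beats failover (${failover_total}m)"
else
    echo "FAIL OVER: failover (${failover_total}m) beats waiting (${provider_eta}m)"
fi
```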
- Obtain authorisation from the incident commander. Present:
- Confirmed failure type
- Estimated data loss (replication lag)
- Estimated time to failover completion
- Estimated time if waiting for recovery
- Business impact of continued outage
Decision point: The incident commander authorises failover or decides to wait for primary recovery. Document the decision and reasoning in the incident record.
Checkpoint: Before proceeding to Phase 2, confirm:
- Failure type is determined (region, application, or DNS)
- Failover is authorised by incident commander
- Decision is documented with timestamp
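The Phase 1 classification can also be expressed as a small helper, useful during drills or when annotating the incident record. This is a sketch: the function name and the yes/no argument convention are illustrative, not part of any provider tooling.

```shell
# Classify the failure type from the three decision-tree questions.
# Arguments: provider confirms regional outage? / multiple apps affected?
#            / infrastructure healthy in region?  (each "yes" or "no")
classify_failure() {
    provider_outage=$1
    multi_service=$2
    infra_healthy=$3

    if [ "$provider_outage" = "yes" ]; then
        echo "REGION FAILOVER"
    elif [ "$multi_service" = "no" ]; then
        echo "APPLICATION FAILOVER"
    elif [ "$infra_healthy" = "yes" ]; then
        echo "DNS FAILOVER"
    else
        echo "REGION FAILOVER"
    fi
}

classify_failure no yes yes   # prints "DNS FAILOVER"
```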
Phase 2: Pre-failover preparation
Objective: Verify secondary environment readiness and prepare for failover execution.
Timeframe: 5-20 minutes
- Verify the secondary environment is healthy and ready to receive traffic. Run health checks against the secondary region or standby instances:
```shell
# Azure - Check secondary region resources
az resource list --resource-group production-dr-rg --location northeurope \
  --query "[].{name:name, provisioningState:provisioningState}" -o table

# AWS - Check standby instances in DR region
aws ec2 describe-instances --region eu-west-2 \
  --filters "Name=tag:Environment,Values=dr" \
  --query "Reservations[*].Instances[*].{ID:InstanceId,State:State.Name}"
```

- Check data replication status to determine the Recovery Point Objective (RPO) exposure. The replication lag indicates how much data could be lost.
For Azure SQL with geo-replication:
```sql
-- Check the geo-replication link; replication_lag_sec is reported
-- on the primary database
SELECT partner_server,
       partner_database,
       replication_state_desc,
       replication_lag_sec
FROM sys.dm_geo_replication_link_status;
```

For AWS RDS with read replicas:
```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=production-replica \
  --start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average
```

Record the replication lag. If lag exceeds acceptable RPO (typically defined in the BCDR plan), notify the incident commander before proceeding.
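As a sketch, the recorded lag can be compared against the RPO threshold directly in shell. Both numbers below are illustrative: take the real RPO from the BCDR plan and the real lag from the queries above.

```shell
# Compare measured replication lag against the acceptable RPO
rpo_seconds=300            # acceptable RPO per BCDR plan (assumed)
replication_lag=42         # value recorded from the queries above (assumed)

if [ "$replication_lag" -gt "$rpo_seconds" ]; then
    echo "ESCALATE: lag ${replication_lag}s exceeds RPO ${rpo_seconds}s"
else
    echo "OK: lag ${replication_lag}s within RPO ${rpo_seconds}s"
fi
```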
- For region failover, verify network connectivity to the secondary region from client locations. Run traceroute and latency tests from representative locations:
```shell
# Test connectivity to secondary load balancer
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://dr.example.org/health
```

- Scale the secondary environment if running in reduced capacity. Many DR configurations run the secondary at reduced scale to minimise costs:
```shell
# Azure - Scale up App Service plan
az appservice plan update --name dr-plan --resource-group production-dr-rg --sku P2v2

# AWS - Update Auto Scaling group desired capacity
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-web-asg \
  --desired-capacity 4 \
  --min-size 2
```

Wait for scaling to complete before proceeding. Monitor instance health as new capacity comes online.
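"Wait for scaling" and similar wait-until-healthy steps can be wrapped in a generic polling helper such as the sketch below. The interval and timeout values, and the example health URL, are assumptions.

```shell
# Retry a command until it succeeds or the timeout (seconds) expires.
# Usage: wait_for <timeout> <interval> <command...>
wait_for() {
    timeout=$1; interval=$2; shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example: poll a DR health endpoint every 15s for up to 10 minutes
# wait_for 600 15 curl -fsS -o /dev/null https://dr.example.org/health
```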
- Prepare DNS changes but do not execute them yet. Identify the records that require modification:
```shell
# List current DNS records
dig +short production.example.org
dig +short production.example.org CNAME

# Note current values for rollback
# Production: 203.0.113.10 (eu-west-1)
# DR target:  203.0.113.20 (eu-west-2)
```

- Notify the application owner that failover is imminent. They should prepare for post-failover validation.
Checkpoint: Before proceeding to Phase 3, confirm:
- Secondary environment health verified
- Replication lag recorded and acceptable
- Secondary environment scaled to production capacity
- DNS change prepared
- Application owner notified
Phase 3: Failover execution
Objective: Execute the failover and redirect traffic to the secondary environment.
Timeframe: 5-30 minutes depending on failover type
Execute the section matching the failover type determined in Phase 1.
Region failover
Region failover redirects all traffic from a failed region to a healthy secondary region. This is the most comprehensive failover type and affects all services deployed in the primary region.
If using Azure Traffic Manager or AWS Route 53 health checks with automatic failover, verify the automatic failover has triggered. If not, force the failover:
For Azure Traffic Manager:
```shell
# Disable the primary endpoint to force failover
az network traffic-manager endpoint update \
  --resource-group dns-rg \
  --profile-name production-tm \
  --name primary-endpoint \
  --type azureEndpoints \
  --endpoint-status Disabled
```

For AWS Route 53:
```shell
# Update health check to force failover (set to always unhealthy)
aws route53 update-health-check \
  --health-check-id abc123-health-check-id \
  --inverted
```

- For database failover with Azure SQL geo-replication, initiate forced failover (accepts potential data loss):
```shell
# Forced failover - use when primary is unreachable
az sql db replica set-primary \
  --resource-group production-dr-rg \
  --server dr-sql-server \
  --name production-db \
  --allow-data-loss
```

For AWS RDS Multi-AZ, the failover is automatic. For cross-region read replica promotion:
```shell
# Promote read replica to standalone (irreversible)
aws rds promote-read-replica \
  --db-instance-identifier production-replica \
  --backup-retention-period 7
```

Database promotion is irreversible
Promoting a read replica breaks replication permanently. After promotion, you must reconfigure replication from the new primary. Ensure this action is authorised and documented.
- Update application configuration to point to the new database endpoint if not using DNS-based database endpoints:
```shell
# Update environment variable or configuration
# Azure App Service
az webapp config appsettings set \
  --resource-group production-dr-rg \
  --name dr-webapp \
  --settings DATABASE_HOST=dr-sql-server.database.windows.net

# Restart application to pick up new configuration
az webapp restart --resource-group production-dr-rg --name dr-webapp
```

- Verify the secondary application instances are serving traffic correctly by making direct requests that bypass DNS:
```shell
# Direct request to secondary load balancer IP (--resolve pins the
# hostname to the IP so TLS verification still matches the certificate)
curl --resolve production.example.org:443:203.0.113.20 \
  https://production.example.org/health

# Verify response indicates healthy secondary
```

- Update public DNS to point to the secondary environment. The method depends on your DNS provider:
For Cloudflare:
```shell
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records/{record_id}" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  --data '{"content":"203.0.113.20"}'
```

For Route 53 (if not using health-check-based failover):
```shell
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "production.example.org",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.20"}]
      }
    }]
  }'
```

- Monitor DNS propagation. Changes propagate based on TTL settings. If the TTL was 300 seconds (5 minutes), most resolvers will pick up the change within 10 minutes. Check propagation from multiple locations:
```shell
# Check from different DNS resolvers
dig @8.8.8.8 production.example.org +short
dig @1.1.1.1 production.example.org +short
dig @208.67.222.222 production.example.org +short
```

Application failover
Application failover redirects traffic for a specific application while leaving other regional infrastructure intact. Use this when a single application or service fails but the underlying infrastructure remains healthy.
- Identify the failing application components and their standby counterparts:
```shell
# List current application instances
kubectl get pods -n production -l app=web-frontend

# Check standby deployment status
kubectl get pods -n dr -l app=web-frontend
```

- Scale up the standby application deployment to match production capacity:
```shell
# Scale standby deployment
kubectl scale deployment web-frontend -n dr --replicas=4

# Wait for pods to be ready
kubectl rollout status deployment/web-frontend -n dr --timeout=300s
```

- Update the service routing to direct traffic to standby instances. For Kubernetes with a service mesh:
```shell
# Update Istio VirtualService to route to DR
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-frontend
  namespace: production
spec:
  hosts:
  - web-frontend
  http:
  - route:
    - destination:
        host: web-frontend.dr.svc.cluster.local
      weight: 100
EOF
```

For load balancer-based routing:
```shell
# Remove primary backend from load balancer pool
az network lb address-pool address remove \
  --resource-group production-rg \
  --lb-name production-lb \
  --pool-name backend-pool \
  --name primary-backend

# Add DR backend to load balancer pool
az network lb address-pool address add \
  --resource-group production-rg \
  --lb-name production-lb \
  --pool-name backend-pool \
  --name dr-backend \
  --ip-address 10.1.2.10
```

- Verify traffic is flowing to the standby application:
```shell
# Check request distribution
kubectl logs -n dr -l app=web-frontend --tail=10 | grep "GET /"

# Verify metrics show traffic on standby
curl -s http://standby-prometheus:9090/api/v1/query?query=http_requests_total | jq '.data.result'
```

DNS failover
DNS failover redirects traffic at the DNS layer without modifying infrastructure. Use this when you need rapid traffic steering or when infrastructure is healthy but network paths are degraded.
- Verify the target endpoint is healthy before redirecting traffic:
```shell
# Health check against secondary endpoint
curl -w "%{http_code}" -o /dev/null -s https://secondary.example.org/health
# Expected: 200
```

- Update DNS records. For weighted or failover record sets, adjust weights or failover status:
```shell
# Route 53 - Switch primary/secondary failover
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "production.example.org",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.20"}]
      }
    }]
  }'
```

For simple record update:
```shell
# Update A record to point to secondary
# Record current value first for rollback
dig +short production.example.org > /tmp/dns-rollback-$(date +%s).txt

# Apply change via DNS provider API
```

- Flush DNS caches on critical systems if immediate propagation is required:
```shell
# Windows
ipconfig /flushdns

# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder

# Linux (systemd-resolved; older systems: systemd-resolve --flush-caches)
sudo resolvectl flush-caches
```

- Monitor propagation using external DNS checking services. DNS changes follow this approximate timeline based on TTL:
```
+---------------------------------------------------------------------+
|                      DNS PROPAGATION TIMELINE                       |
+---------------------------------------------------------------------+
|                                                                     |
| TTL: 60s   |----|                  Full propagation: ~2-5 minutes   |
|                                                                     |
| TTL: 300s  |--------|              Full propagation: ~10-15 minutes |
|                                                                     |
| TTL: 3600s |--------------------------------|  Full: ~60-90 min     |
|                                                                     |
| 0    5    10   15   20   25   30     45     60     75    90 minutes |
+---------------------------------------------------------------------+
```

Checkpoint: Before proceeding to Phase 4, confirm:
- Failover type executed successfully
- Traffic is flowing to secondary environment
- No error responses from secondary
Phase 4: Validation
Objective: Confirm the failover was successful and services are operating correctly.
Timeframe: 10-30 minutes
- Execute synthetic transactions against the production URL to verify end-to-end functionality:
```shell
# Health check
curl -w "\nHTTP Code: %{http_code}\nTotal Time: %{time_total}s\n" \
  https://production.example.org/health

# Authentication flow (if applicable)
curl -X POST https://production.example.org/api/auth/test \
  -H "Content-Type: application/json" \
  -d '{"test": true}'

# Database connectivity (via application endpoint)
curl https://production.example.org/api/db-health
```

- Verify critical application functions with the application owner. Provide them access to run their validation checklist. Common validations include:
- User authentication and authorisation
- Data read operations (can users access their data?)
- Data write operations (can users create/update records?)
- Integration endpoints (are third-party integrations functional?)
- Background job processing (are queues being processed?)
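One lightweight way to track this checklist is to record each check as a `name,pass|fail` line and gate sign-off on the failure count. The sketch below uses illustrative data; the check names simply mirror the list above and are not part of any tooling.

```shell
# Validation results recorded by the application owner (illustrative)
results="auth,pass
reads,pass
writes,pass
integrations,fail
background-jobs,pass"

# Count failing checks and block sign-off if any remain
failures=$(printf '%s\n' "$results" | grep -c ',fail$')
echo "Failed checks: $failures"
if [ "$failures" -gt 0 ]; then
    echo "Do not sign off - investigate failing checks"
fi
```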
- Check monitoring dashboards for the secondary environment. Confirm:
- Request rate matches expected traffic levels
- Error rate is within normal bounds (typically < 1%)
- Response times are acceptable (compare to baseline)
- Resource utilisation is healthy (CPU < 80%, memory < 85%)
```shell
# Query Prometheus for error rate
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))/sum(rate(http_requests_total[5m]))" | jq '.data.result[0].value[1]'
# Should be < 0.01 (1%)
```

- Verify data integrity by checking recent records:
```sql
-- Check most recent records exist and are accessible
SELECT COUNT(*), MAX(created_at)
FROM transactions
WHERE created_at > NOW() - INTERVAL '1 hour';

-- Compare record counts with expected baseline
SELECT COUNT(*) FROM users;
```

- Test failback readiness by confirming you can still access the primary environment configuration (even if the environment itself is down):
```shell
# Verify access to primary region configuration
az account show
az group show --name production-rg 2>/dev/null || echo "Primary resource group unreachable - expected during outage"
```

Decision point: The application owner confirms the application is functioning correctly and users can perform their work.
Checkpoint: Before proceeding to Phase 5, confirm:
- Synthetic transactions passing
- Application owner validation complete
- Monitoring shows healthy metrics
- Data integrity verified
Phase 5: Stabilisation and failback planning
Objective: Stabilise operations on the secondary environment and prepare for eventual failback to primary.
Timeframe: Ongoing until primary recovery
- Scale the secondary environment for sustained operation if it was initially sized for temporary use:
```shell
# Review current resource utilisation
kubectl top pods -n dr

# Increase resources if utilisation exceeds 70%
kubectl set resources deployment/web-frontend -n dr \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=1000m,memory=1Gi
```

- Enable full monitoring and alerting for the secondary environment. Update monitoring targets:
```yaml
# Prometheus scrape config update
- job_name: 'dr-web-frontend'
  static_configs:
  - targets: ['dr-web-frontend:8080']
  relabel_configs:
  - target_label: environment
    replacement: dr-active
```

- Update the status page and internal communication channels to reflect the current state:
- Status page: “Operating from disaster recovery environment”
- Include expected performance characteristics if different from normal
- Provide estimated time for return to primary (if known)
- Monitor the primary environment for recovery. Set up alerts for when the primary becomes healthy:
```shell
# Check primary region health periodically
watch -n 60 'az vm get-instance-view --resource-group production-rg --name web-vm-01 --query "instanceView.statuses[?code==\"PowerState/running\"]" 2>/dev/null && echo "Primary recovering"'
```

- Document data loss and recovery actions. Calculate actual data loss:
```sql
-- Find the last transaction before failover
SELECT MAX(created_at) AS last_transaction_before_failover
FROM transactions
WHERE created_at < '2024-01-15 10:30:00'; -- Failover timestamp

-- Compare with the replication lag recorded in Phase 2
-- Actual data loss = failover timestamp - last replicated transaction
```

- Prepare the failback plan. Failback is not simply reversing the failover; it requires:
- Confirming primary environment is fully recovered
- Ensuring data written to secondary is replicated back to primary
- Testing primary environment before redirecting traffic
- Planning for a maintenance window if data reconciliation is required
Document the failback plan with specific steps and schedule a failback window once the primary is confirmed stable for at least 4 hours.
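The four-hour stability requirement can be checked with a small timestamp comparison. This is a sketch using GNU date; the "healthy since" observation time is an illustrative assumption, to be taken from monitoring in practice.

```shell
# Has the primary been healthy long enough to schedule failback?
required_stable=$((4 * 3600))                          # 4 hours in seconds
primary_healthy_since=$(date -u -d '5 hours ago' +%s)  # assumed observation
now=$(date -u +%s)

stable_for=$((now - primary_healthy_since))
if [ "$stable_for" -ge "$required_stable" ]; then
    echo "Primary stable for $((stable_for / 3600))h - failback window may be scheduled"
else
    echo "Primary stable for only $((stable_for / 60))m - keep waiting"
fi
```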
Checkpoint: Stabilisation complete when:
- Secondary environment scaled appropriately
- Full monitoring active
- Communication updated
- Primary recovery monitoring in place
- Data loss documented
- Failback plan drafted
Communications
Communicate with stakeholders throughout the failover process using the templates below.
| Stakeholder | Timing | Channel | Message owner | Template |
|---|---|---|---|---|
| IT leadership | Within 15 minutes of activation | Direct message or call | Incident commander | Initial notification |
| All staff | Within 30 minutes of failover completion | Email and intranet | Communications lead | Service notification |
| External users | Within 1 hour of failover completion | Status page | Communications lead | Status page update |
| Donors/partners | Within 4 hours if SLA-bound | Email | Communications lead | Partner notification |
Initial notification template
Subject: [INCIDENT] Service failover in progress - Initial notification
Service: [Service name]
Status: Failover in progress
Started: [Timestamp]

Summary: We have detected [brief description of failure] affecting [services]. We are initiating failover to our disaster recovery environment.

Expected impact:
- Brief service interruption (estimated [X] minutes)
- Users may need to re-authenticate after failover
- [Any data loss window]

Current actions:
- Failover execution in progress
- Monitoring secondary environment
- Will provide update upon completion

Next update: [Time - within 30 minutes]

Incident commander: [Name]
Contact: [Phone/chat channel]

Service notification template
Subject: Service update - [Service name] operating from backup systems
Dear colleagues,
Following a technical issue with our primary systems, [service name] is now operating from our disaster recovery environment.
What this means for you:
- The service is available and functioning normally
- You may notice [any performance differences]
- If you experience issues, please [action]

What we're doing:
- Monitoring service performance
- Working to restore primary systems
- Will notify you when we return to normal operations

If you have questions or experience problems, contact the service desk.

Thank you for your patience.

Status page update template
Title: Service operating from backup systems
[Timestamp] - RESOLVED (MONITORING)
[Service name] is now operating from our disaster recovery systems. Users can access all functions normally.

We are monitoring the service and working to restore primary systems. We will provide an update when we return to normal operations.

No user action is required.

Evidence preservation
Document the following throughout the failover process:
| Evidence type | When to capture | Retention |
|---|---|---|
| Provider status page screenshots | At activation and hourly during outage | 90 days |
| Monitoring dashboard exports | At activation, post-failover, post-validation | 90 days |
| Command history with timestamps | Throughout execution | 90 days |
| Replication lag measurements | Before failover | 90 days |
| DNS propagation checks | Post DNS change | 30 days |
| Validation test results | Phase 4 | 90 days |
| Communication sent | All phases | 1 year |
| Decision log | All phases | 1 year |
Create the incident record immediately after Phase 4 validation completes:
INCIDENT RECORD
Incident ID: [Auto-generated or manual]
Date/Time: [Start] to [End]
Duration: [Total]

Classification: Business continuity - Cloud failover
Failover type: [Region / Application / DNS]

Root cause: [Provider outage / Application failure / Network issue]

Timeline:
- [Time]: Failure detected
- [Time]: Failover authorised
- [Time]: Phase 2 complete - secondary verified
- [Time]: Failover executed
- [Time]: Validation complete
- [Time]: Stable operations confirmed

Data impact:
- Replication lag at failover: [X seconds/minutes]
- Estimated transactions affected: [Count]
- Data recovery actions required: [Yes/No - details]

Actions for follow-up:
- [ ] Failback execution (scheduled: [date])
- [ ] Post-incident review (scheduled: [date])
- [ ] Update runbook with lessons learned

Regional failover architecture reference
The following diagram illustrates a typical multi-region architecture with failover capability:
```
+------------------------------------------------------------------+
|                         NORMAL OPERATION                         |
+------------------------------------------------------------------+

                      +------------------+
                      |    DNS / CDN     |
                      |  (Cloudflare,    |
                      |    Route 53)     |
                      +--------+---------+
                               |
               +---------------+---------------+
               |                               |
               v                               v
  +------------+------------+     +------------+------------+
  |     PRIMARY REGION      |     |    SECONDARY REGION     |
  |        (Active)         |     |        (Standby)        |
  |                         |     |                         |
  |  +------------------+   |     |  +------------------+   |
  |  |  Load Balancer   |   |     |  |  Load Balancer   |   |
  |  +--------+---------+   |     |  |  (scaled down)   |   |
  |           |             |     |  +--------+---------+   |
  |  +--------v---------+   |     |           |             |
  |  |   App Servers    |   |     |  +--------v---------+   |
  |  |  (4 instances)   |   |     |  |   App Servers    |   |
  |  +--------+---------+   |     |  |  (1 instance)    |   |
  |           |             |     |  +--------+---------+   |
  |  +--------v---------+   |     |           |             |
  |  |     Database     |   |     |  +--------v---------+   |
  |  |    (Primary)     +---+---->+  |     Database     |   |
  |  +------------------+   |     |  |    (Replica)     |   |
  |                         |     |  +------------------+   |
  +-------------------------+     +-------------------------+
        100% traffic                      0% traffic
                                 (receives replication only)

+------------------------------------------------------------------+
|                         DURING FAILOVER                          |
+------------------------------------------------------------------+

                      +------------------+
                      |    DNS / CDN     |
                      |  (redirecting)   |
                      +--------+---------+
                               |
               +---------------+---------------+
               |                               |
               v                               v
  +------------+------------+     +------------+------------+
  |     PRIMARY REGION      |     |    SECONDARY REGION     |
  |        (Failed)         |     |      (Activating)       |
  |                         |     |                         |
  |  +------------------+   |     |  +------------------+   |
  |  |      XXXXXX      |   |     |  |  Load Balancer   |   |
  |  |   UNAVAILABLE    |   |     |  |  (scaling up)    |   |
  |  +------------------+   |     |  +--------+---------+   |
  |                         |     |           |             |
  |                         |     |  +--------v---------+   |
  |                         |     |  |   App Servers    |   |
  |                         |     |  |  (scaling to 4)  |   |
  |                         |     |  +--------+---------+   |
  |                         |     |           |             |
  |                         |     |  +--------v---------+   |
  |                         |     |  |     Database     |   |
  |                         |     |  |   (promoting)    |   |
  |                         |     |  +------------------+   |
  +-------------------------+     +-------------------------+
         0% traffic                      100% traffic
                                    (now serving production)
```

See also
- High Availability and Disaster Recovery for architecture concepts and design patterns
- DR Site Invocation for full disaster recovery activation including physical site failover
- Backup Recovery for data restoration procedures
- Major Service Outage for incident response coordination
- DR Testing for failover testing procedures
- Cloud Strategy and Platform Selection for multi-region architecture decisions