Rollback Procedures
A rollback reverts a deployment to a known-good state when the deployed change causes service degradation, functional failure, or unacceptable risk. Rollback procedures execute pre-planned reversal steps defined before deployment, returning systems to their previous configuration within a target window that preserves service level commitments.
Rollback differs from disaster recovery in scope and trigger. Disaster recovery responds to infrastructure failure or data loss affecting entire systems. Rollback responds to change-induced problems where the underlying infrastructure remains functional but the deployed change introduces defects. The distinction matters because rollback assumes the pre-change state remains accessible and valid, while disaster recovery assumes that state may be compromised.
- Rollback: Reverting a system to its pre-deployment state by restoring previous code, configuration, or data to eliminate problems introduced by a change.
- Rollback plan: A documented procedure created before deployment that specifies the exact steps, responsible parties, decision criteria, and verification methods for reverting the change.
- Rollback window: The maximum time permitted for rollback execution, typically 15-60 minutes depending on service criticality and SLA commitments.
- Point of no return: The moment during deployment after which rollback becomes impossible or prohibitively complex, requiring forward-fix instead.
Prerequisites
Before executing any rollback, verify these conditions are met. Missing prerequisites transform a controlled rollback into an uncontrolled incident.
Rollback plan availability
Locate the rollback plan created during change planning. The plan exists as an attachment to the change request or in the release documentation. If no rollback plan exists, escalate to the change manager before proceeding. Never attempt rollback without a documented plan.
The rollback plan must contain:
- Pre-deployment backup identifiers and locations
- Exact commands or procedures for each rollback step
- Expected duration for each phase
- Verification criteria confirming successful rollback
- Communication requirements and escalation contacts
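The required plan contents can be checked mechanically before a change is approved. As a sketch (the field names below are illustrative, not a mandated schema), a plan record missing any required field should block approval:

```python
# Required rollback-plan fields, mirroring the list above.
REQUIRED_FIELDS = [
    "backup_identifiers",     # pre-deployment backup IDs and locations
    "rollback_steps",         # exact commands or procedures per step
    "phase_durations",        # expected duration for each phase
    "verification_criteria",  # criteria confirming successful rollback
    "escalation_contacts",    # communication requirements and contacts
]

def missing_plan_fields(plan: dict) -> list:
    """Return the required fields that are absent or empty in a plan record."""
    return [f for f in REQUIRED_FIELDS if not plan.get(f)]
```

A non-empty result means the plan is incomplete and the change should not proceed to deployment.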
Backup verification
Confirm that pre-deployment backups exist and are accessible. Execute verification commands before initiating rollback:
# Verify database backup exists and is readable
pg_restore --list /backups/pre-deploy/myapp-db-20240115-1430.dump | head -20

# Verify application artifact exists
ls -la /artifacts/releases/myapp-v2.3.1.tar.gz

# Verify configuration backup
cat /backups/configs/myapp/20240115-1430/app.conf | grep -c "^[^#]"

Expected output confirms file existence and content accessibility. Any "file not found" or permission errors halt rollback until resolved.
Authority confirmation
Rollback execution requires explicit authorisation. During business hours, obtain verbal or written approval from the change manager or service owner. Outside business hours, the on-call engineer holds delegated authority for emergency rollbacks when service impact exceeds defined thresholds.
Document the authorisation in the incident or change record:
Rollback authorised by: J. Smith (Change Manager)
Authorisation time: 2024-01-15 15:45 UTC
Reason: Transaction error rate exceeds 5% threshold (current: 12.3%)

Access and credentials
Verify you possess the access required for rollback execution:
| System | Required access | Verification command |
|---|---|---|
| Application servers | SSH with sudo | ssh appserver01 'sudo whoami' |
| Database servers | DBA role or restore permissions | psql -c "SELECT current_user, rolsuper FROM pg_roles WHERE rolname = current_user;" |
| Load balancer | Configuration modification | curl -u $LB_USER https://lb.example.org/api/v1/pools |
| DNS management | Zone edit permissions | Portal login verification |
| Container orchestrator | Deployment rollback role | kubectl auth can-i update deployments -n production |
Failed access verification requires credential escalation before proceeding.
Rollback decision criteria
Rollback decisions balance service restoration speed against rollback complexity and risk. Not every deployment problem warrants rollback. Minor issues with available workarounds may be better addressed through forward-fix while maintaining service.
Automatic rollback triggers
These conditions mandate immediate rollback without further assessment:
- Service availability drops below 95% and remains there for 5 minutes
- Error rate exceeds 10% of transactions for 3 consecutive minutes
- Data corruption detected in any form
- Security vulnerability introduced by the change
- Complete loss of critical business function
When automatic triggers fire, begin rollback immediately while notifying stakeholders in parallel. Do not wait for approval when these thresholds are breached.
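The automatic triggers can be expressed as a single predicate over current telemetry, suitable for wiring into alerting. A sketch using the thresholds above; the function and parameter names are assumptions, not an existing monitoring API:

```python
def automatic_rollback_required(
    availability_pct: float,        # current service availability
    availability_low_minutes: float,  # minutes availability has been below 95%
    error_rate_pct: float,          # current transaction error rate
    error_high_minutes: float,      # consecutive minutes error rate above 10%
    data_corruption: bool,
    security_vulnerability: bool,
    critical_function_lost: bool,
) -> bool:
    """True when any automatic trigger mandates immediate rollback."""
    if availability_pct < 95.0 and availability_low_minutes >= 5:
        return True
    if error_rate_pct > 10.0 and error_high_minutes >= 3:
        return True
    return data_corruption or security_vulnerability or critical_function_lost
```

A True result means rollback begins immediately, with stakeholder notification in parallel rather than as a precondition.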
Assessed rollback triggers
These conditions require assessment before deciding:
- Error rate between 2% and 10% sustained for 10 minutes
- Performance degradation exceeding 50% of baseline response time
- Partial functionality loss affecting non-critical features
- User-reported issues accumulating without automated detection
Assessment weighs rollback time against forward-fix time. If the deployment introduced a bug that engineering can patch within 30 minutes, and rollback requires 45 minutes, forward-fix is preferable. If rollback completes in 15 minutes but forward-fix requires investigation of unknown duration, rollback is preferable.
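That trade-off reduces to comparing expected restoration times, with a bias toward rollback when the fix time cannot yet be bounded. A minimal sketch (the function and its inputs are illustrative):

```python
from typing import Optional

def choose_remediation(rollback_minutes: float,
                       forward_fix_minutes: Optional[float]) -> str:
    """Pick the faster path to restored service.

    forward_fix_minutes is None when engineering cannot yet bound the
    fix time; an unbounded fix loses to any bounded rollback.
    """
    if forward_fix_minutes is None:
        return "rollback"
    return "forward-fix" if forward_fix_minutes < rollback_minutes else "rollback"
```

This reproduces the worked example: a 30-minute patch beats a 45-minute rollback, while a 15-minute rollback beats a fix of unknown duration.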
Decision flowchart
Problem detected post-deployment:

1. Service availability below 95%? Yes: immediate rollback. No: continue.
2. Error rate above 10%? Yes: immediate rollback. No: continue.
3. Data corruption detected? Yes: immediate rollback. No: continue.
4. Error rate 2-10% for 10+ minutes? Yes: assess rollback vs forward-fix. No: continue.
5. Performance degraded by more than 50%? Yes: assess rollback vs forward-fix. No: monitor and consider forward-fix.

Figure 1: Rollback decision tree based on service impact thresholds
Procedure
Rollback procedures vary by component type. Execute the procedures relevant to your deployment. A typical application deployment may require database rollback followed by application rollback. Infrastructure changes require infrastructure-specific procedures.
Communication initiation
Before executing technical rollback steps, establish communication channels. Stakeholders require notification that rollback is in progress, and the rollback team requires a coordination channel.
- Create or join the incident bridge call using the standard incident line:
Dial: +44 20 7946 0958
Conference ID: 234567#
Or join: https://meet.example.org/incident-bridge

- Post initial notification to the operations channel:

ROLLBACK IN PROGRESS
Change: CHG0012345 - Payment service v2.4.0 deployment
Trigger: Error rate 12.3% (threshold: 10%)
Started: 2024-01-15 15:47 UTC
Expected duration: 25 minutes
Bridge: https://meet.example.org/incident-bridge
Lead: @oncall-engineer

- Notify the service owner and change manager directly via the defined escalation path. Do not rely solely on channel notifications for critical stakeholders.
Database rollback
Database rollback restores the database to its pre-deployment state. This procedure applies when the deployment included schema changes, data migrations, or stored procedure modifications.
Point of no return
Database rollback becomes impossible once new transactions write data that depends on the new schema. If the application has been processing live transactions for more than your defined point-of-no-return window (typically 15-30 minutes), forward-fix may be the only option. Assess data dependencies before proceeding.
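The point-of-no-return check can be made explicit before touching the database: compare elapsed live-traffic time against the defined window. A sketch, assuming the deployment timestamp is available from the change record:

```python
from datetime import datetime, timedelta

def within_rollback_window(deployed_at: datetime,
                           now: datetime,
                           ponr_minutes: int = 30) -> bool:
    """True while database rollback is still considered safe.

    Past the window, live transactions have likely written data that
    depends on the new schema; treat forward-fix as the default and
    assess data dependencies before any restore.
    """
    return now - deployed_at <= timedelta(minutes=ponr_minutes)
```

The 30-minute default mirrors the upper bound of the typical window above; set it per service.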
- Stop application connections to prevent new transactions during rollback:
# On each application server
sudo systemctl stop myapp

# Verify no active connections remain
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'myapp' AND state = 'active';"

Expected output: count = 0. If connections remain, terminate them:

psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'myapp' AND pid <> pg_backend_pid();"

- Create a safety backup of the current (failed) state before rollback:

pg_dump -Fc myapp > /backups/emergency/myapp-pre-rollback-$(date +%Y%m%d-%H%M).dump

This backup preserves any data that arrived between deployment and rollback, enabling potential data recovery if needed later.
- Restore the pre-deployment database backup:
# Drop current database and recreate from backup
psql -c "DROP DATABASE myapp;"
psql -c "CREATE DATABASE myapp OWNER myapp_user;"
pg_restore -d myapp /backups/pre-deploy/myapp-db-20240115-1430.dump

For large databases where a full restore exceeds the rollback window, use point-in-time recovery if available. PITR is not a pg_restore option: restore the most recent base backup, set the recovery target in postgresql.conf, create a recovery.signal file, and restart PostgreSQL:

# postgresql.conf: PITR to a timestamp before deployment
# (restore_command path depends on your WAL archive location)
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:25:00'

- Verify database state matches pre-deployment baseline:
# Check schema version
psql -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"

# Expected: version matching pre-deployment (e.g., 20240110120000)

# Verify row counts on critical tables
psql -c "SELECT 'users' as tbl, count(*) FROM users UNION ALL SELECT 'transactions', count(*) FROM transactions;"

# Compare against pre-deployment baseline counts

Application rollback
Application rollback reverts deployed code to the previous version. The procedure varies by deployment mechanism.
Container orchestrator rollback (Kubernetes)
- Identify the current and target revision:
kubectl rollout history deployment/myapp -n production

Output shows revision history:

REVISION  CHANGE-CAUSE
3         Deploy v2.3.1 - CHG0012340
4         Deploy v2.4.0 - CHG0012345

Target revision is 3 (previous stable version).
- Execute rollback to the previous revision:
kubectl rollout undo deployment/myapp -n production --to-revision=3

Monitor rollout progress:

kubectl rollout status deployment/myapp -n production --timeout=300s

Expected output: deployment "myapp" successfully rolled out
- Verify pods are running the correct version:
kubectl get pods -n production -l app=myapp -o jsonpath='{.items[*].spec.containers[0].image}'

Expected output: registry.example.org/myapp:v2.3.1
Traditional server rollback
- Stop the application service on all servers:
# Execute on each application server or via configuration management
for server in app01 app02 app03; do
  ssh $server 'sudo systemctl stop myapp'
done

- Replace the deployed artifact with the previous version:

for server in app01 app02 app03; do
  ssh $server '
    sudo rm -rf /opt/myapp/current
    sudo ln -s /opt/myapp/releases/v2.3.1 /opt/myapp/current
  '
done

If using artifact deployment rather than symlinks:

for server in app01 app02 app03; do
  ssh $server '
    sudo tar -xzf /artifacts/releases/myapp-v2.3.1.tar.gz -C /opt/myapp/
  '
done

- Start the application service:

for server in app01 app02 app03; do
  ssh $server 'sudo systemctl start myapp'
done

- Verify application health on each server:

for server in app01 app02 app03; do
  curl -s http://$server:8080/health | jq '.status'
done

Expected output for each: "healthy"
Blue-green rollback
Blue-green deployments enable instant rollback through traffic switching. The previous version remains running on the inactive environment throughout deployment.
Blue-green rollback mechanism: instant recovery by switching traffic to the previous stable environment.

Before rollback (v2.4.0 is faulty):
  Load balancer -> BLUE  (v2.4.0, ACTIVE, errors)
                   GREEN (v2.3.1, STANDBY, stable)

After rollback (reverted to v2.3.1):
  Load balancer -> GREEN (v2.3.1, ACTIVE, live)
                   BLUE  (v2.4.0, STANDBY, fixing)

Figure 2: Blue-green traffic switch for instant rollback
- Verify the standby environment (green) health:
curl -s http://green.internal:8080/health | jq '.status'

Expected output: "healthy". If unhealthy, the standby environment requires investigation before traffic switch.
- Switch load balancer traffic to the standby environment:
For HAProxy:
# Update backend weight to shift traffic
echo "set server myapp-backend/green weight 100" | socat stdio /var/run/haproxy/admin.sock
echo "set server myapp-backend/blue weight 0" | socat stdio /var/run/haproxy/admin.sock

For AWS ALB:
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TARGET_GROUP_ARN

For Kubernetes with service mesh:
kubectl patch virtualservice myapp -n production --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: myapp-green
      weight: 100
    - destination:
        host: myapp-blue
      weight: 0
'

- Monitor traffic shift and error rates:
# Watch error rate during transition
watch -n 5 'curl -s http://monitoring.internal/api/v1/query?query=rate(http_requests_total{status=~"5.."}[1m])'

Error rate should drop to the pre-deployment baseline within 2-3 minutes of the traffic switch.
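Rather than eyeballing the watch output, the return-to-baseline check can be applied to sampled error rates. A sketch over already-collected samples (the sampling itself, e.g. the monitoring query above, is environment-specific):

```python
from typing import List, Optional

def minutes_to_baseline(samples: List[float],
                        baseline: float,
                        interval_seconds: int = 30) -> Optional[float]:
    """Return minutes until the error rate first returns to <= baseline.

    samples are error-rate readings taken every interval_seconds after
    the traffic switch. Returns None if the rate never recovers within
    the sampled window, which should trigger further investigation.
    """
    for i, rate in enumerate(samples):
        if rate <= baseline:
            return i * interval_seconds / 60
    return None
```

A result above roughly 3 minutes, or None, suggests the traffic switch did not take effect (see the troubleshooting section on downstream caching).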
Configuration rollback
Configuration rollback restores system or application configuration files to their pre-deployment state. This procedure applies when the deployment modified configuration without code changes.
- Identify configuration files modified by the deployment from the change record:
# List files in configuration backup
ls -la /backups/configs/myapp/20240115-1430/

Output:

app.conf
database.yml
nginx.conf

- Stop services that use the configuration:
sudo systemctl stop myapp nginx

- Restore each configuration file:
sudo cp /backups/configs/myapp/20240115-1430/app.conf /etc/myapp/app.conf
sudo cp /backups/configs/myapp/20240115-1430/database.yml /etc/myapp/database.yml
sudo cp /backups/configs/myapp/20240115-1430/nginx.conf /etc/nginx/nginx.conf

- Validate configuration syntax before restart:
# Application configuration validation
/opt/myapp/bin/myapp --validate-config /etc/myapp/app.conf

# Nginx configuration validation
sudo nginx -t

Expected output: validation passes with no errors.
- Restart services with restored configuration:
sudo systemctl start nginx myapp

Infrastructure rollback
Infrastructure rollback reverts changes to cloud resources, network configuration, or platform components. Infrastructure-as-code deployments enable declarative rollback.
Terraform rollback
- Identify the previous state version:
# List state versions in remote backend
terraform state list
terraform show -json | jq '.values.root_module.resources | length'

# For S3 backend, list state versions
aws s3api list-object-versions --bucket terraform-state-bucket --prefix myapp/terraform.tfstate

- Restore the previous state version:
# Download previous state
aws s3api get-object --bucket terraform-state-bucket \
  --key myapp/terraform.tfstate \
  --version-id "abc123previousversion" \
  terraform.tfstate.previous

# Replace current state
cp terraform.tfstate.previous terraform.tfstate
terraform state push terraform.tfstate

- Apply the previous configuration:
# Checkout previous infrastructure code version
git checkout v2.3.1 -- terraform/

# Plan to verify changes
terraform plan -out=rollback.plan

# Review plan output, then apply
terraform apply rollback.plan

Network configuration rollback
Restore network device configuration from backup:
For Cisco IOS devices:
configure replace flash:backup-20240115-1430.cfg force

For Juniper devices:
rollback 1
commit

- Verify network connectivity:
# Test critical paths
ping -c 5 gateway.internal
ping -c 5 database.internal
traceroute application.internal

- Verify routing tables:
# Compare current routes against expected baseline
ip route show | diff - /backups/network/routes-baseline.txt

Verification
After completing rollback procedures, verify that systems have returned to the expected pre-deployment state and service has been restored.
Service health verification
Confirm the application responds correctly:
# Health endpoint check
curl -s -w "\nHTTP Status: %{http_code}\n" https://myapp.example.org/health

# Expected output:
# {"status":"healthy","version":"v2.3.1","database":"connected"}
# HTTP Status: 200

Execute synthetic transactions to verify business functionality:
# Test critical user journey
curl -X POST https://myapp.example.org/api/v1/test-transaction \
  -H "Content-Type: application/json" \
  -d '{"test": true, "amount": 1.00}'

# Expected: HTTP 200 with transaction confirmation

Metrics verification
Confirm that metrics have returned to baseline:
# Query current error rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])" | jq '.data.result[0].value[1]'
# Expected: value below 0.01 (1% error rate)
# Query response time
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket[5m]))" | jq '.data.result[0].value[1]'

# Expected: value below baseline (e.g., 0.250 for 250ms p95)

Version verification
Confirm the correct version is deployed:
# Application version endpoint
curl -s https://myapp.example.org/version

# Expected: {"version":"v2.3.1","build":"20240110-1234"}

# Database schema version
psql -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"

# Expected: 20240110120000 (pre-deployment migration)

Rollback completion communication
Post rollback completion notification:
✅ ROLLBACK COMPLETE
Change: CHG0012345 - Payment service v2.4.0 deployment
Rollback completed: 2024-01-15 16:12 UTC
Duration: 25 minutes
Service status: Restored to v2.3.1
Error rate: 0.3% (baseline: 0.2%)
Response time: 180ms p95 (baseline: 175ms)
Incident record: INC0056789
Post-incident review: Scheduled 2024-01-17 10:00 UTC

Post-rollback activities
Rollback resolves the immediate service impact but requires follow-up actions to address the underlying deployment failure.
Incident record creation
Create an incident record if one does not exist:
Incident: INC0056789
Related change: CHG0012345
Summary: Payment service v2.4.0 deployment caused elevated error rate requiring rollback
Impact: 27 minutes of degraded service (12.3% error rate)
Resolution: Rolled back to v2.3.1
Root cause: Pending investigation (link to problem record)

Problem record linkage
Create or link to a problem record for root cause investigation:
Problem: PRB0003456
Related incident: INC0056789
Related change: CHG0012345
Summary: Payment service v2.4.0 deployment failure - root cause unknown
Status: Open - Assigned to development team

The problem record tracks investigation into why the deployment failed, preventing recurrence in subsequent deployment attempts.
Change record closure
Update the change record with rollback outcome:
Change: CHG0012345
Status: Failed - Rolled back
Rollback executed: 2024-01-15 15:47-16:12 UTC
Rollback reason: Error rate exceeded 10% threshold
Linked incident: INC0056789
Linked problem: PRB0003456
Post-implementation review: Required

Troubleshooting
Backup file not found or corrupted
Symptom: pg_restore: error: could not open input file: No such file or directory or pg_restore: error: invalid archive
Cause: Pre-deployment backup was not created, was moved, or was corrupted during storage.
Resolution: Check alternative backup locations. Query the backup catalogue for recent backups:
# List recent backups
ls -la /backups/*/myapp* | sort -k6,7

If no valid backup exists, rollback is not possible. Escalate to incident management and pursue forward-fix.
Database restore fails with active connections
Symptom: ERROR: database "myapp" is being accessed by other users
Cause: Application or connection pool maintains connections despite service stop.
Resolution: Force-terminate all connections:
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'myapp' AND pid <> pg_backend_pid();"

If connections persist, identify the source:

psql -c "SELECT pid, usename, application_name, client_addr FROM pg_stat_activity WHERE datname = 'myapp';"

Container rollback shows "revision not found"
Symptom: error: unable to find specified revision 3 in history
Cause: Deployment history was purged or revisionHistoryLimit is set too low.
Resolution: Check history limit:
kubectl get deployment myapp -n production -o jsonpath='{.spec.revisionHistoryLimit}'

If history is insufficient, deploy the previous version explicitly:

kubectl set image deployment/myapp -n production myapp=registry.example.org/myapp:v2.3.1

Load balancer traffic switch has no effect
Symptom: Traffic continues flowing to the failed deployment after load balancer configuration change.
Cause: DNS caching, CDN caching, or client-side connection persistence.
Resolution: Verify the load balancer configuration took effect:
# Check backend status
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep myapp

If configuration is correct but traffic persists, the issue is downstream caching:

# Purge CDN cache
curl -X POST https://api.cdn.example.org/purge -d '{"zone":"myapp.example.org"}'

# Wait for DNS TTL expiry (check current TTL)
dig +noall +answer myapp.example.org

Configuration file restore fails with permission denied
Symptom: cp: cannot create regular file '/etc/myapp/app.conf': Permission denied
Cause: Insufficient permissions or SELinux/AppArmor restrictions.
Resolution: Execute with appropriate privileges:
sudo cp /backups/configs/myapp/20240115-1430/app.conf /etc/myapp/app.conf

If sudo fails, check security context:

# SELinux context check
ls -Z /etc/myapp/app.conf
restorecon -v /etc/myapp/app.conf

Rollback completes but errors persist
Symptom: All rollback steps succeed, version is confirmed as previous, but error rate remains elevated.
Cause: The deployment was not the root cause, or rollback was incomplete (e.g., missing configuration file, cached data, external dependency).
Resolution: Investigate other potential causes:
# Check for configuration drift
diff /etc/myapp/app.conf /backups/configs/myapp/20240115-1430/app.conf

# Check external dependencies
curl -s https://payment-gateway.external.org/health

# Check for cached data issues
redis-cli KEYS "myapp:*" | head -20

If the original deployment was not the cause, create a new incident for the actual issue.
Terraform apply fails during infrastructure rollback
Symptom: Error: error creating resource: ConflictException: Resource already exists
Cause: State mismatch between Terraform state and actual infrastructure.
Resolution: Refresh state and retry:
terraform refresh
terraform plan -out=rollback.plan

If conflicts persist, manually import the conflicting resource or use terraform state rm to remove stale entries (with caution).
Network rollback causes connectivity loss
Symptom: SSH connection drops during network configuration rollback.
Cause: Rollback configuration removed the management network path.
Resolution: Access the device via out-of-band management (console, IPMI, iLO):
# Connect via serial console
screen /dev/ttyUSB0 9600

# Or via IPMI
ipmitool -I lanplus -H device-ipmi.internal -U admin sol activate

Restore connectivity configuration before completing rollback.
Rollback window exceeded
Symptom: Rollback procedures are taking longer than the defined window (e.g., 45 minutes vs 30-minute target).
Cause: Underestimated rollback complexity, unexpected issues, or resource constraints.
Resolution: Communicate timeline extension to stakeholders:
ROLLBACK EXTENSION
Original window: 30 minutes
New estimate: 60 minutes
Reason: Database restore taking longer than planned due to size
Current status: 65% complete

Assess whether continuing rollback or switching to forward-fix is more appropriate given the extended timeline.