Rollback Procedures

A rollback reverts a deployment to a known-good state when the deployed change causes service degradation, functional failure, or unacceptable risk. Rollback procedures execute pre-planned reversal steps defined before deployment, returning systems to their previous configuration within a target window that preserves service level commitments.

Rollback differs from disaster recovery in scope and trigger. Disaster recovery responds to infrastructure failure or data loss affecting entire systems. Rollback responds to change-induced problems where the underlying infrastructure remains functional but the deployed change introduces defects. The distinction matters because rollback assumes the pre-change state remains accessible and valid, while disaster recovery assumes that state may be compromised.

Rollback
Reverting a system to its pre-deployment state by restoring previous code, configuration, or data to eliminate problems introduced by a change.
Rollback plan
A documented procedure created before deployment that specifies the exact steps, responsible parties, decision criteria, and verification methods for reverting the change.
Rollback window
The maximum time permitted for rollback execution, typically 15-60 minutes depending on service criticality and SLA commitments.
Point of no return
The moment during deployment after which rollback becomes impossible or prohibitively complex, requiring forward-fix instead.

Prerequisites

Before executing any rollback, verify these conditions are met. Missing prerequisites transform a controlled rollback into an uncontrolled incident.

Rollback plan availability

Locate the rollback plan created during change planning. The plan exists as an attachment to the change request or in the release documentation. If no rollback plan exists, escalate to the change manager before proceeding. Never attempt rollback without a documented plan.

The rollback plan must contain:

  • Pre-deployment backup identifiers and locations
  • Exact commands or procedures for each rollback step
  • Expected duration for each phase
  • Verification criteria confirming successful rollback
  • Communication requirements and escalation contacts

Backup verification

Confirm that pre-deployment backups exist and are accessible. Execute verification commands before initiating rollback:

Terminal window
# Verify database backup exists and is readable
pg_restore --list /backups/pre-deploy/myapp-db-20240115-1430.dump | head -20
# Verify application artifact exists
ls -la /artifacts/releases/myapp-v2.3.1.tar.gz
# Verify configuration backup
cat /backups/configs/myapp/20240115-1430/app.conf | grep -c "^[^#]"

Expected output confirms file existence and content accessibility. Any “file not found” or permission errors halt rollback until resolved.
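These checks can be wrapped in a pre-flight script that refuses to continue while any artifact is missing or unreadable. This is a minimal sketch; the paths mirror the examples above and should be adjusted to your backup layout.

```shell
#!/bin/sh
# Pre-flight backup check: every listed artifact must exist and be readable.
check_backups() {
  missing=0
  for f in "$@"; do
    if [ -r "$f" ]; then
      echo "OK: $f"
    else
      echo "MISSING OR UNREADABLE: $f" >&2
      missing=1
    fi
  done
  return $missing
}

# Paths shown are the examples from this page; substitute your own.
check_backups \
  /backups/pre-deploy/myapp-db-20240115-1430.dump \
  /artifacts/releases/myapp-v2.3.1.tar.gz \
  /backups/configs/myapp/20240115-1430/app.conf \
  || echo "Halt rollback: resolve missing backups first" >&2
```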

Authority confirmation

Rollback execution requires explicit authorisation. During business hours, obtain verbal or written approval from the change manager or service owner. Outside business hours, the on-call engineer holds delegated authority for emergency rollbacks when service impact exceeds defined thresholds.

Document the authorisation in the incident or change record:

Rollback authorised by: J. Smith (Change Manager)
Authorisation time: 2024-01-15 15:45 UTC
Reason: Transaction error rate exceeds 5% threshold (current: 12.3%)

Access and credentials

Verify you possess the access required for rollback execution:

| System | Required access | Verification command |
| --- | --- | --- |
| Application servers | SSH with sudo | ssh appserver01 'sudo whoami' |
| Database servers | DBA role or restore permissions | psql -c "SELECT current_user, rolsuper FROM pg_roles WHERE rolname = current_user;" |
| Load balancer | Configuration modification | curl -u $LB_USER https://lb.example.org/api/v1/pools |
| DNS management | Zone edit permissions | Portal login verification |
| Container orchestrator | Deployment rollback role | kubectl auth can-i update deployments -n production |

Failed access verification requires credential escalation before proceeding.

Rollback decision criteria

Rollback decisions balance service restoration speed against rollback complexity and risk. Not every deployment problem warrants rollback. Minor issues with available workarounds may be better addressed through forward-fix while maintaining service.

Automatic rollback triggers

These conditions mandate immediate rollback without further assessment:

  • Service availability drops below 95% and remains there for 5 minutes
  • Error rate exceeds 10% of transactions for 3 consecutive minutes
  • Data corruption detected in any form
  • Security vulnerability introduced by the change
  • Complete loss of critical business function

When automatic triggers fire, begin rollback immediately while notifying stakeholders in parallel. Do not wait for approval when these thresholds are breached.
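The first two triggers are numeric and lend themselves to a scripted check. A sketch, assuming availability and error rate arrive as percentages from your monitoring system (the function name and inputs are illustrative; the thresholds mirror the list above):

```shell
#!/bin/sh
# Evaluate the numeric automatic-rollback triggers. Returns 0 (rollback)
# when a threshold is breached, 1 otherwise. Inputs are percentages.
should_rollback() {
  availability="$1"
  error_rate="$2"
  if awk "BEGIN{exit !($availability < 95)}"; then
    echo "IMMEDIATE ROLLBACK: availability ${availability}% < 95%"
    return 0
  fi
  if awk "BEGIN{exit !($error_rate > 10)}"; then
    echo "IMMEDIATE ROLLBACK: error rate ${error_rate}% > 10%"
    return 0
  fi
  echo "No automatic trigger breached; assess manually"
  return 1
}

should_rollback 99.9 12.3
```

The sustained-duration conditions (5 minutes of low availability, 3 minutes of elevated errors) are better expressed as alerting rules in the monitoring system itself rather than in a one-shot script.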

Assessed rollback triggers

These conditions require assessment before deciding:

  • Error rate between 2% and 10% sustained for 10 minutes
  • Performance degradation exceeding 50% of baseline response time
  • Partial functionality loss affecting non-critical features
  • User-reported issues accumulating without automated detection

Assessment weighs rollback time against forward-fix time. If the deployment introduced a bug that engineering can patch within 30 minutes, and rollback requires 45 minutes, forward-fix is preferable. If rollback completes in 15 minutes but forward-fix requires investigation of unknown duration, rollback is preferable.
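That comparison reduces to a few lines of shell. A hedged sketch, assuming both estimates are available in minutes (a large sentinel value stands in for an unknown forward-fix duration):

```shell
#!/bin/sh
# Pick the faster restoration path from two duration estimates (minutes).
choose_strategy() {
  rollback_min="$1"
  forwardfix_min="$2"   # pass a large sentinel (e.g. 9999) when unknown
  if [ "$forwardfix_min" -lt "$rollback_min" ]; then
    echo "forward-fix"
  else
    echo "rollback"
  fi
}

choose_strategy 45 30    # a 30-minute patch beats a 45-minute rollback
choose_strategy 15 9999  # unknown forward-fix duration favours rollback
```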

Decision flowchart

Problem detected post-deployment
        |
        v
Service availability below 95%? ---Yes--> IMMEDIATE ROLLBACK
        | No
        v
Error rate above 10%? -------------Yes--> IMMEDIATE ROLLBACK
        | No
        v
Data corruption detected? ---------Yes--> IMMEDIATE ROLLBACK
        | No
        v
Error rate 2-10% for 10+ minutes? -Yes--> ASSESS: rollback vs forward-fix
        | No
        v
Performance >50% degraded? --------Yes--> ASSESS: rollback vs forward-fix
        | No
        v
Monitor and consider forward-fix

Figure 1: Rollback decision tree based on service impact thresholds

Procedure

Rollback procedures vary by component type. Execute the procedures relevant to your deployment. A typical application deployment may require database rollback followed by application rollback. Infrastructure changes require infrastructure-specific procedures.

Communication initiation

Before executing technical rollback steps, establish communication channels. Stakeholders require notification that rollback is in progress, and the rollback team requires a coordination channel.

  1. Create or join the incident bridge call using the standard incident line:
Dial: +44 20 7946 0958
Conference ID: 234567#
Or join: https://meet.example.org/incident-bridge
  2. Post initial notification to the operations channel:
ROLLBACK IN PROGRESS
Change: CHG0012345 - Payment service v2.4.0 deployment
Trigger: Error rate 12.3% (threshold: 10%)
Started: 2024-01-15 15:47 UTC
Expected duration: 25 minutes
Bridge: https://meet.example.org/incident-bridge
Lead: @oncall-engineer
  3. Notify the service owner and change manager directly via the defined escalation path. Do not rely solely on channel notifications for critical stakeholders.

Database rollback

Database rollback restores the database to its pre-deployment state. This procedure applies when the deployment included schema changes, data migrations, or stored procedure modifications.

Point of no return

Database rollback becomes impossible once new transactions write data that depends on the new schema. If the application has been processing live transactions for more than your defined point-of-no-return window (typically 15-30 minutes), forward-fix may be the only option. Assess data dependencies before proceeding.
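One hedged way to gauge the window, assuming the deployment timestamp is recorded in the change record (the epoch value and 30-minute window below are illustrative):

```shell
#!/bin/sh
# How many minutes have elapsed since deployment? Compare against the
# point-of-no-return window (30 minutes assumed here).
minutes_since() {
  deploy_epoch="$1"
  now_epoch="$2"
  echo $(( (now_epoch - deploy_epoch) / 60 ))
}

DEPLOY_EPOCH=1705329000             # example: taken from the change record
elapsed=$(minutes_since "$DEPLOY_EPOCH" "$(date +%s)")
if [ "$elapsed" -gt 30 ]; then
  echo "WARNING: ${elapsed} minutes since deployment; assess data dependencies"
else
  echo "Within window: ${elapsed} minutes elapsed"
fi
```

Counting rows written since the deployment timestamp (for example, against a created_at column) gives a more direct measure of data dependency, but that check is schema-specific.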

  1. Stop application connections to prevent new transactions during rollback:
Terminal window
# On each application server
sudo systemctl stop myapp
# Verify no active connections remain
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'myapp' AND state = 'active';"

Expected output: count = 0. If connections remain, terminate them:

Terminal window
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'myapp' AND pid <> pg_backend_pid();"
  2. Create a safety backup of the current (failed) state before rollback:
Terminal window
pg_dump -Fc myapp > /backups/emergency/myapp-pre-rollback-$(date +%Y%m%d-%H%M).dump

This backup preserves any data that arrived between deployment and rollback, enabling potential data recovery if needed later.

  3. Restore the pre-deployment database backup:
Terminal window
# Drop current database and recreate from backup
psql -c "DROP DATABASE myapp;"
psql -c "CREATE DATABASE myapp OWNER myapp_user;"
pg_restore -d myapp /backups/pre-deploy/myapp-db-20240115-1430.dump

For large databases where a full restore exceeds the rollback window, use point-in-time recovery (PITR) if available. Note that pg_restore cannot target a timestamp; PITR replays WAL on top of a restored base backup:

Terminal window
# PostgreSQL PITR (v12+): restore the base backup into $PGDATA, then set
# the recovery target and create recovery.signal before starting the server
echo "recovery_target_time = '2024-01-15 14:25:00'" >> $PGDATA/postgresql.conf
touch $PGDATA/recovery.signal
sudo systemctl start postgresql
  4. Verify database state matches pre-deployment baseline:
Terminal window
# Check schema version
psql -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"
# Expected: version matching pre-deployment (e.g., 20240110120000)
# Verify row counts on critical tables
psql -c "SELECT 'users' as tbl, count(*) FROM users UNION ALL SELECT 'transactions', count(*) FROM transactions;"
# Compare against pre-deployment baseline counts
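The baseline comparison can be automated when pre-deployment counts were saved to a file. A minimal sketch, assuming a two-column "table count" format (the file paths are illustrative):

```shell
#!/bin/sh
# Compare post-restore row counts against the pre-deployment baseline.
# Both files use the format: "<table> <count>", one table per line.
compare_counts() {
  baseline="$1"
  current="$2"
  if diff -u "$baseline" "$current" >/dev/null 2>&1; then
    echo "Row counts match baseline"
  else
    echo "MISMATCH: investigate before reopening service" >&2
    diff -u "$baseline" "$current" >&2
    return 1
  fi
}

# Current counts would come from psql, e.g.:
#   psql -At -F' ' -c "SELECT 'users', count(*) FROM users;" > /tmp/current-counts
compare_counts /backups/pre-deploy/row-counts.txt /tmp/current-counts \
  || echo "Counts differ (or a counts file is missing); see stderr" >&2
```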

Application rollback

Application rollback reverts deployed code to the previous version. The procedure varies by deployment mechanism.

Container orchestrator rollback (Kubernetes)

  1. Identify the current and target revision:
Terminal window
kubectl rollout history deployment/myapp -n production

Output shows revision history:

REVISION  CHANGE-CAUSE
3         Deploy v2.3.1 - CHG0012340
4         Deploy v2.4.0 - CHG0012345

Target revision is 3 (previous stable version).

  2. Execute rollback to the previous revision:
Terminal window
kubectl rollout undo deployment/myapp -n production --to-revision=3

Monitor rollout progress:

Terminal window
kubectl rollout status deployment/myapp -n production --timeout=300s

Expected output: deployment "myapp" successfully rolled out

  3. Verify pods are running the correct version:
Terminal window
kubectl get pods -n production -l app=myapp -o jsonpath='{.items[*].spec.containers[0].image}'

Expected output: registry.example.org/myapp:v2.3.1

Traditional server rollback

  1. Stop the application service on all servers:
Terminal window
# Execute on each application server or via configuration management
for server in app01 app02 app03; do
  ssh $server 'sudo systemctl stop myapp'
done
  2. Replace the deployed artifact with the previous version:
Terminal window
for server in app01 app02 app03; do
  ssh $server '
    sudo rm -rf /opt/myapp/current
    sudo ln -s /opt/myapp/releases/v2.3.1 /opt/myapp/current
  '
done

If using artifact deployment rather than symlinks:

Terminal window
for server in app01 app02 app03; do
  ssh $server '
    sudo tar -xzf /artifacts/releases/myapp-v2.3.1.tar.gz -C /opt/myapp/
  '
done
  3. Start the application service:
Terminal window
for server in app01 app02 app03; do
  ssh $server 'sudo systemctl start myapp'
done
  4. Verify application health on each server:
Terminal window
for server in app01 app02 app03; do
  curl -s http://$server:8080/health | jq '.status'
done

Expected output for each: "healthy"

Blue-green rollback

Blue-green deployments enable instant rollback through traffic switching. The previous version remains running on the inactive environment throughout deployment.

+-------------------------------------------------------------+
| BLUE-GREEN ROLLBACK MECHANISM                               |
+-------------------------------------------------------------+
|                                                             |
| Concept: instant recovery by switching traffic to the       |
| previous stable environment.                                |
|                                                             |
|  1. BEFORE ROLLBACK              2. AFTER ROLLBACK          |
|     (v2.4.0 is faulty)              (Reverted to v2.3.1)    |
|                                                             |
|     +----------+                    +----------+            |
|     | Load Bal |                    | Load Bal |            |
|     +-----+----+                    +----+-----+            |
|           |           (Switch)           |                  |
|           v           Traffic            v                  |
|  +--------+ +--------+        +--------+ +--------+         |
|  | BLUE   | | GREEN  |        | BLUE   | | GREEN  |         |
|  | v2.4.0 | | v2.3.1 |        | v2.4.0 | | v2.3.1 |         |
|  +--------+ +--------+        +--------+ +--------+         |
|   ACTIVE     STANDBY           STANDBY    ACTIVE            |
|   (Errors)   (Stable)          (Fixing)   (Live)            |
|                                                             |
+-------------------------------------------------------------+

Figure 2: Blue-green traffic switch for instant rollback

  1. Verify the standby environment (green) health:
Terminal window
curl -s http://green.internal:8080/health | jq '.status'

Expected output: "healthy". If unhealthy, the standby environment requires investigation before traffic switch.

  2. Switch load balancer traffic to the standby environment:

    For HAProxy:

Terminal window
# Update backend weight to shift traffic
echo "set server myapp-backend/green weight 100" | socat stdio /var/run/haproxy/admin.sock
echo "set server myapp-backend/blue weight 0" | socat stdio /var/run/haproxy/admin.sock

For AWS ALB:

Terminal window
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TARGET_GROUP_ARN

For Kubernetes with service mesh:

Terminal window
kubectl patch virtualservice myapp -n production --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: myapp-green
      weight: 100
    - destination:
        host: myapp-blue
      weight: 0
'
  3. Monitor traffic shift and error rates:
Terminal window
# Watch error rate during transition
watch -n 5 'curl -s http://monitoring.internal/api/v1/query?query=rate(http_requests_total{status=~"5.."}[1m])'

Error rate should drop to pre-deployment baseline within 2-3 minutes of traffic switch.
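The 2-3 minute expectation can be enforced with a polling loop that escalates on timeout. A sketch, assuming a command that prints the current error rate as a bare number (the Prometheus query above, wrapped in a small script, would serve):

```shell
#!/bin/sh
# Poll until the error rate is at or below baseline, or give up.
wait_for_baseline() {
  query_cmd="$1"   # command printing the current error rate as a number
  baseline="$2"
  attempts="$3"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    rate=$($query_cmd)
    if awk "BEGIN{exit !($rate <= $baseline)}"; then
      echo "Error rate $rate at or below baseline $baseline"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$attempts" ]; then sleep 5; fi
  done
  echo "Error rate still $rate after $attempts checks; escalate" >&2
  return 1
}

# Demonstration with a fixed reading; in practice query_cmd hits monitoring.
wait_for_baseline "echo 0.005" 0.01 3
```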

Configuration rollback

Configuration rollback restores system or application configuration files to their pre-deployment state. This procedure applies when the deployment modified configuration without code changes.

  1. Identify configuration files modified by the deployment from the change record:
Terminal window
# List files in configuration backup
ls -la /backups/configs/myapp/20240115-1430/

Output:

app.conf
database.yml
nginx.conf
  2. Stop services that use the configuration:
Terminal window
sudo systemctl stop myapp nginx
  3. Restore each configuration file:
Terminal window
sudo cp /backups/configs/myapp/20240115-1430/app.conf /etc/myapp/app.conf
sudo cp /backups/configs/myapp/20240115-1430/database.yml /etc/myapp/database.yml
sudo cp /backups/configs/myapp/20240115-1430/nginx.conf /etc/nginx/nginx.conf
  4. Validate configuration syntax before restart:
Terminal window
# Application configuration validation
/opt/myapp/bin/myapp --validate-config /etc/myapp/app.conf
# Nginx configuration validation
sudo nginx -t

Expected output: validation passes with no errors.

  5. Restart services with restored configuration:
Terminal window
sudo systemctl start nginx myapp

Infrastructure rollback

Infrastructure rollback reverts changes to cloud resources, network configuration, or platform components. Infrastructure-as-code deployments enable declarative rollback.

Terraform rollback

  1. Identify the previous state version:
Terminal window
# Inspect the current state (resource list and count)
terraform state list
terraform show -json | jq '.values.root_module.resources | length'
# For S3 backend, list state versions
aws s3api list-object-versions --bucket terraform-state-bucket --prefix myapp/terraform.tfstate
  2. Restore the previous state version:
Terminal window
# Download previous state
aws s3api get-object --bucket terraform-state-bucket \
  --key myapp/terraform.tfstate \
  --version-id "abc123previousversion" \
  terraform.tfstate.previous
# Replace current state
cp terraform.tfstate.previous terraform.tfstate
terraform state push terraform.tfstate
  3. Apply the previous configuration:
Terminal window
# Checkout previous infrastructure code version
git checkout v2.3.1 -- terraform/
# Plan to verify changes
terraform plan -out=rollback.plan
# Review plan output, then apply
terraform apply rollback.plan

Network configuration rollback

  1. Restore network device configuration from backup:

    For Cisco IOS devices:

configure replace flash:backup-20240115-1430.cfg force

For Juniper devices:

rollback 1
commit
  2. Verify network connectivity:
Terminal window
# Test critical paths
ping -c 5 gateway.internal
ping -c 5 database.internal
traceroute application.internal
  3. Verify routing tables:
Terminal window
# Compare current routes against expected baseline
ip route show | diff - /backups/network/routes-baseline.txt

Verification

After completing rollback procedures, verify that systems have returned to the expected pre-deployment state and service has been restored.

Service health verification

Confirm the application responds correctly:

Terminal window
# Health endpoint check
curl -s -w "\nHTTP Status: %{http_code}\n" https://myapp.example.org/health
# Expected output:
# {"status":"healthy","version":"v2.3.1","database":"connected"}
# HTTP Status: 200

Execute synthetic transactions to verify business functionality:

Terminal window
# Test critical user journey
curl -X POST https://myapp.example.org/api/v1/test-transaction \
  -H "Content-Type: application/json" \
  -d '{"test": true, "amount": 1.00}'
# Expected: HTTP 200 with transaction confirmation

Metrics verification

Confirm that metrics have returned to baseline:

Terminal window
# Query current error rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])" | jq '.data.result[0].value[1]'
# Expected: value below 0.01 (1% error rate)
# Query response time
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket[5m]))" | jq '.data.result[0].value[1]'
# Expected: value below baseline (e.g., 0.250 for 250ms p95)

Version verification

Confirm the correct version is deployed:

Terminal window
# Application version endpoint
curl -s https://myapp.example.org/version
# Expected: {"version":"v2.3.1","build":"20240110-1234"}
# Database schema version
psql -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"
# Expected: 20240110120000 (pre-deployment migration)

Rollback completion communication

Post rollback completion notification:

✅ ROLLBACK COMPLETE
Change: CHG0012345 - Payment service v2.4.0 deployment
Rollback completed: 2024-01-15 16:12 UTC
Duration: 25 minutes
Service status: Restored to v2.3.1
Error rate: 0.3% (baseline: 0.2%)
Response time: 180ms p95 (baseline: 175ms)
Incident record: INC0056789
Post-incident review: Scheduled 2024-01-17 10:00 UTC

Post-rollback activities

Rollback resolves the immediate service impact but requires follow-up actions to address the underlying deployment failure.

Incident record creation

Create an incident record if one does not exist:

Incident: INC0056789
Related change: CHG0012345
Summary: Payment service v2.4.0 deployment caused elevated error rate requiring rollback
Impact: 27 minutes of degraded service (12.3% error rate)
Resolution: Rolled back to v2.3.1
Root cause: Pending investigation (link to problem record)

Problem record linkage

Create or link to a problem record for root cause investigation:

Problem: PRB0003456
Related incident: INC0056789
Related change: CHG0012345
Summary: Payment service v2.4.0 deployment failure - root cause unknown
Status: Open - Assigned to development team

The problem record tracks investigation into why the deployment failed, preventing recurrence in subsequent deployment attempts.

Change record closure

Update the change record with rollback outcome:

Change: CHG0012345
Status: Failed - Rolled back
Rollback executed: 2024-01-15 15:47-16:12 UTC
Rollback reason: Error rate exceeded 10% threshold
Linked incident: INC0056789
Linked problem: PRB0003456
Post-implementation review: Required

Troubleshooting

Backup file not found or corrupted

Symptom: pg_restore: error: could not open input file: No such file or directory or pg_restore: error: invalid archive

Cause: Pre-deployment backup was not created, was moved, or was corrupted during storage.

Resolution: Check alternative backup locations. Query the backup catalogue for recent backups:

Terminal window
# List recent backups
ls -la /backups/*/myapp* | sort -k6,7

If no valid backup exists, rollback is not possible. Escalate to incident management and pursue forward-fix.

Database restore fails with active connections

Symptom: ERROR: database "myapp" is being accessed by other users

Cause: Application or connection pool maintains connections despite service stop.

Resolution: Force-terminate all connections:

Terminal window
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'myapp' AND pid <> pg_backend_pid();"

If connections persist, identify the source:

Terminal window
psql -c "SELECT pid, usename, application_name, client_addr FROM pg_stat_activity WHERE datname = 'myapp';"

Container rollback shows “revision not found”

Symptom: error: unable to find specified revision 3 in history

Cause: Deployment history was purged or revisionHistoryLimit is set too low.

Resolution: Check history limit:

Terminal window
kubectl get deployment myapp -n production -o jsonpath='{.spec.revisionHistoryLimit}'

If history is insufficient, deploy the previous version explicitly:

Terminal window
kubectl set image deployment/myapp -n production myapp=registry.example.org/myapp:v2.3.1

Load balancer traffic switch has no effect

Symptom: Traffic continues flowing to the failed deployment after load balancer configuration change.

Cause: DNS caching, CDN caching, or client-side connection persistence.

Resolution: Verify the load balancer configuration took effect:

Terminal window
# Check backend status
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep myapp

If configuration is correct but traffic persists, the issue is downstream caching:

Terminal window
# Purge CDN cache
curl -X POST https://api.cdn.example.org/purge -d '{"zone":"myapp.example.org"}'
# Wait for DNS TTL expiry (check current TTL)
dig +noall +answer myapp.example.org

Configuration file restore fails with permission denied

Symptom: cp: cannot create regular file '/etc/myapp/app.conf': Permission denied

Cause: Insufficient permissions or SELinux/AppArmor restrictions.

Resolution: Execute with appropriate privileges:

Terminal window
sudo cp /backups/configs/myapp/20240115-1430/app.conf /etc/myapp/app.conf

If sudo fails, check security context:

Terminal window
# SELinux context check
ls -Z /etc/myapp/app.conf
sudo restorecon -v /etc/myapp/app.conf

Rollback completes but errors persist

Symptom: All rollback steps succeed, version is confirmed as previous, but error rate remains elevated.

Cause: The deployment was not the root cause, or rollback was incomplete (e.g., missing configuration file, cached data, external dependency).

Resolution: Investigate other potential causes:

Terminal window
# Check for configuration drift
diff /etc/myapp/app.conf /backups/configs/myapp/20240115-1430/app.conf
# Check external dependencies
curl -s https://payment-gateway.external.org/health
# Check for cached data issues
redis-cli KEYS "myapp:*" | head -20

If the original deployment was not the cause, create a new incident for the actual issue.

Terraform apply fails during infrastructure rollback

Symptom: Error: error creating resource: ConflictException: Resource already exists

Cause: State mismatch between Terraform state and actual infrastructure.

Resolution: Refresh state and retry:

Terminal window
terraform refresh
terraform plan -out=rollback.plan

If conflicts persist, manually import the conflicting resource or use terraform state rm to remove stale entries (with caution).

Network rollback causes connectivity loss

Symptom: SSH connection drops during network configuration rollback.

Cause: Rollback configuration removed the management network path.

Resolution: Access the device via out-of-band management (console, IPMI, iLO):

Terminal window
# Connect via serial console
screen /dev/ttyUSB0 9600
# Or via IPMI
ipmitool -I lanplus -H device-ipmi.internal -U admin sol activate

Restore connectivity configuration before completing rollback.

Rollback window exceeded

Symptom: Rollback procedures are taking longer than the defined window (e.g., 45 minutes vs 30-minute target).

Cause: Underestimated rollback complexity, unexpected issues, or resource constraints.

Resolution: Communicate timeline extension to stakeholders:

ROLLBACK EXTENSION
Original window: 30 minutes
New estimate: 60 minutes
Reason: Database restore taking longer than planned due to size
Current status: 65% complete

Assess whether continuing rollback or switching to forward-fix is more appropriate given the extended timeline.

See also