Rollback Procedures
A rollback reverts a deployment to a known-good state when the deployed change causes service degradation, functional failure, or unacceptable risk. Rollback procedures execute pre-planned reversal steps defined before deployment, returning systems to their previous configuration within a target window that preserves service level commitments.
Rollback differs from disaster recovery in scope and trigger. Disaster recovery responds to infrastructure failure or data loss affecting entire systems. Rollback responds to change-induced problems where the underlying infrastructure remains functional but the deployed change introduces defects. The distinction matters because rollback assumes the pre-change state remains accessible and valid, while disaster recovery assumes that state may be compromised.
- Rollback: Reverting a system to its pre-deployment state by restoring previous code, configuration, or data to eliminate problems introduced by a change.
- Rollback plan: A documented procedure created before deployment that specifies the exact steps, responsible parties, decision criteria, and verification methods for reverting the change.
- Rollback window: The maximum time permitted for rollback execution, typically 15-60 minutes depending on service criticality and SLA commitments.
- Point of no return: The moment during deployment after which rollback becomes impossible or prohibitively complex, requiring forward-fix instead.
Prerequisites
Before executing any rollback, verify these conditions are met. Missing prerequisites transform a controlled rollback into an uncontrolled incident.
Rollback plan availability
Locate the rollback plan created during change planning. The plan exists as an attachment to the change request or in the release documentation. If no rollback plan exists, escalate to the change manager before proceeding. Never attempt rollback without a documented plan.
The rollback plan must contain:
- Pre-deployment backup identifiers and locations
- Exact commands or procedures for each rollback step
- Expected duration for each phase
- Verification criteria confirming successful rollback
- Communication requirements and escalation contacts
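The required plan contents can be checked mechanically before a change is approved. As a sketch (the field names below are illustrative, not a mandated schema), a plan record missing any required field should block approval:

```python
# Required rollback-plan fields, mirroring the list above.
REQUIRED_FIELDS = [
    "backup_identifiers",     # pre-deployment backup IDs and locations
    "rollback_steps",         # exact commands or procedures per step
    "phase_durations",        # expected duration for each phase
    "verification_criteria",  # criteria confirming successful rollback
    "escalation_contacts",    # communication requirements and contacts
]

def missing_plan_fields(plan: dict) -> list:
    """Return the required fields that are absent or empty in a plan record."""
    return [f for f in REQUIRED_FIELDS if not plan.get(f)]
```

A non-empty result means the plan is incomplete and the change should not proceed to deployment.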
Backup verification
Confirm that pre-deployment backups exist and are accessible. Execute verification commands before initiating rollback:
# Verify database backup exists and is readable
pg_restore --list /backups/pre-deploy/myapp-db-20240115-1430.dump | head -20

# Verify application artifact exists
ls -la /artifacts/releases/myapp-v2.3.1.tar.gz

# Verify configuration backup
cat /backups/configs/myapp/20240115-1430/app.conf | grep -c "^[^#]"

Expected output confirms file existence and content accessibility. Any "file not found" or permission errors halt rollback until resolved.
Authority confirmation
Rollback execution requires explicit authorisation. During business hours, obtain verbal or written approval from the change manager or service owner. Outside business hours, the on-call engineer holds delegated authority for emergency rollbacks when service impact exceeds defined thresholds.
Document the authorisation in the incident or change record:
Rollback authorised by: J. Smith (Change Manager)
Authorisation time: 2024-01-15 15:45 UTC
Reason: Transaction error rate exceeds 5% threshold (current: 12.3%)

Access and credentials
Verify you possess the access required for rollback execution:
| System | Required access | Verification command |
|---|---|---|
| Application servers | SSH with sudo | ssh appserver01 'sudo whoami' |
| Database servers | DBA role or restore permissions | psql -c "SELECT current_user, rolsuper FROM pg_roles WHERE rolname = current_user;" |
| Load balancer | Configuration modification | curl -u $LB_USER https://lb.example.org/api/v1/pools |
| DNS management | Zone edit permissions | Portal login verification |
| Container orchestrator | Deployment rollback role | kubectl auth can-i update deployments -n production |
Failed access verification requires credential escalation before proceeding.
Rollback decision criteria
Rollback decisions balance service restoration speed against rollback complexity and risk. Not every deployment problem warrants rollback. Minor issues with available workarounds may be better addressed through forward-fix while maintaining service.
Automatic rollback triggers
These conditions mandate immediate rollback without further assessment:
- Service availability drops below 95% and remains there for 5 minutes
- Error rate exceeds 10% of transactions for 3 consecutive minutes
- Data corruption detected in any form
- Security vulnerability introduced by the change
- Complete loss of critical business function
When automatic triggers fire, begin rollback immediately while notifying stakeholders in parallel. Do not wait for approval when these thresholds are breached.
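The automatic triggers can be expressed as a single predicate over current telemetry, suitable for wiring into alerting. A sketch using the thresholds above; the function and parameter names are assumptions, not an existing monitoring API:

```python
def automatic_rollback_required(
    availability_pct: float,        # current service availability
    availability_low_minutes: float,  # minutes availability has been below 95%
    error_rate_pct: float,          # current transaction error rate
    error_high_minutes: float,      # consecutive minutes error rate above 10%
    data_corruption: bool,
    security_vulnerability: bool,
    critical_function_lost: bool,
) -> bool:
    """True when any automatic trigger mandates immediate rollback."""
    if availability_pct < 95.0 and availability_low_minutes >= 5:
        return True
    if error_rate_pct > 10.0 and error_high_minutes >= 3:
        return True
    return data_corruption or security_vulnerability or critical_function_lost
```

A True result means rollback begins immediately, with stakeholder notification in parallel rather than as a precondition.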
Assessed rollback triggers
These conditions require assessment before deciding:
- Error rate between 2% and 10% sustained for 10 minutes
- Performance degradation exceeding 50% of baseline response time
- Partial functionality loss affecting non-critical features
- User-reported issues accumulating without automated detection
Assessment weighs rollback time against forward-fix time. If the deployment introduced a bug that engineering can patch within 30 minutes, and rollback requires 45 minutes, forward-fix is preferable. If rollback completes in 15 minutes but forward-fix requires investigation of unknown duration, rollback is preferable.
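That trade-off reduces to comparing expected restoration times, with a bias toward rollback when the fix time cannot yet be bounded. A minimal sketch (the function and its inputs are illustrative):

```python
from typing import Optional

def choose_remediation(rollback_minutes: float,
                       forward_fix_minutes: Optional[float]) -> str:
    """Pick the faster path to restored service.

    forward_fix_minutes is None when engineering cannot yet bound the
    fix time; an unbounded fix loses to any bounded rollback.
    """
    if forward_fix_minutes is None:
        return "rollback"
    return "forward-fix" if forward_fix_minutes < rollback_minutes else "rollback"
```

This reproduces the worked example: a 30-minute patch beats a 45-minute rollback, while a 15-minute rollback beats a fix of unknown duration.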
Decision flowchart
Problem detected post-deployment:

1. Service availability below 95%? Yes: immediate rollback. No: continue.
2. Error rate above 10%? Yes: immediate rollback. No: continue.
3. Data corruption detected? Yes: immediate rollback. No: continue.
4. Error rate 2-10% for 10+ minutes? Yes: assess rollback vs forward-fix. No: continue.
5. Performance degraded by more than 50%? Yes: assess rollback vs forward-fix. No: monitor and consider forward-fix.

Figure 1: Rollback decision tree based on service impact thresholds
Procedure
Rollback procedures vary by component type. Execute the procedures relevant to your deployment. A typical application deployment may require database rollback followed by application rollback. Infrastructure changes require infrastructure-specific procedures.
Communication initiation
Before executing technical rollback steps, establish communication channels. Stakeholders require notification that rollback is in progress, and the rollback team requires a coordination channel.
- Create or join the incident bridge call using the standard incident line:
Dial: +44 20 7946 0958
Conference ID: 234567#
Or join: https://meet.example.org/incident-bridge

- Post initial notification to the operations channel:

ROLLBACK IN PROGRESS
Change: CHG0012345 - Payment service v2.4.0 deployment
Trigger: Error rate 12.3% (threshold: 10%)
Started: 2024-01-15 15:47 UTC
Expected duration: 25 minutes
Bridge: https://meet.example.org/incident-bridge
Lead: @oncall-engineer

- Notify the service owner and change manager directly via the defined escalation path. Do not rely solely on channel notifications for critical stakeholders.
Database rollback
Database rollback restores the database to its pre-deployment state. This procedure applies when the deployment included schema changes, data migrations, or stored procedure modifications.
Point of no return
Database rollback becomes impossible once new transactions write data that depends on the new schema. If the application has been processing live transactions for more than your defined point-of-no-return window (typically 15-30 minutes), forward-fix may be the only option. Assess data dependencies before proceeding.
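The point-of-no-return check can be made explicit before touching the database: compare elapsed live-traffic time against the defined window. A sketch, assuming the deployment timestamp is available from the change record:

```python
from datetime import datetime, timedelta

def within_rollback_window(deployed_at: datetime,
                           now: datetime,
                           ponr_minutes: int = 30) -> bool:
    """True while database rollback is still considered safe.

    Past the window, live transactions have likely written data that
    depends on the new schema; treat forward-fix as the default and
    assess data dependencies before any restore.
    """
    return now - deployed_at <= timedelta(minutes=ponr_minutes)
```

The 30-minute default mirrors the upper bound of the typical window above; set it per service.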
- Stop application connections to prevent new transactions during rollback:
# On each application server
sudo systemctl stop myapp

# Verify no active connections remain
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'myapp' AND state = 'active';"

Expected output: count = 0. If connections remain, terminate them:

psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'myapp' AND pid <> pg_backend_pid();"

- Create a safety backup of the current (failed) state before rollback:

pg_dump -Fc myapp > /backups/emergency/myapp-pre-rollback-$(date +%Y%m%d-%H%M).dump

This backup preserves any data that arrived between deployment and rollback, enabling potential data recovery if needed later.
- Restore the pre-deployment database backup:
# Drop current database and recreate from backup
psql -c "DROP DATABASE myapp;"
psql -c "CREATE DATABASE myapp OWNER myapp_user;"
pg_restore -d myapp /backups/pre-deploy/myapp-db-20240115-1430.dump

For large databases where a full restore exceeds the rollback window, use point-in-time recovery if available. PITR is not a pg_restore option: restore the most recent base backup, set the recovery target in postgresql.conf, create a recovery.signal file, and restart PostgreSQL:

# postgresql.conf: PITR to a timestamp before deployment
# (restore_command path depends on your WAL archive location)
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:25:00'

- Verify database state matches pre-deployment baseline:
# Check schema version
psql -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"

# Expected: version matching pre-deployment (e.g., 20240110120000)

# Verify row counts on critical tables
psql -c "SELECT 'users' as tbl, count(*) FROM users UNION ALL SELECT 'transactions', count(*) FROM transactions;"

# Compare against pre-deployment baseline counts

Application rollback
Application rollback reverts deployed code to the previous version. The procedure varies by deployment mechanism.
Container orchestrator rollback (Kubernetes)
- Identify the current and target revision:
kubectl rollout history deployment/myapp -n production

Output shows revision history:

REVISION  CHANGE-CAUSE
3         Deploy v2.3.1 - CHG0012340
4         Deploy v2.4.0 - CHG0012345

Target revision is 3 (previous stable version).
- Execute rollback to the previous revision:
kubectl rollout undo deployment/myapp -n production --to-revision=3

Monitor rollout progress:

kubectl rollout status deployment/myapp -n production --timeout=300s

Expected output: deployment "myapp" successfully rolled out
- Verify pods are running the correct version:
kubectl get pods -n production -l app=myapp -o jsonpath='{.items[*].spec.containers[0].image}'

Expected output: registry.example.org/myapp:v2.3.1
Traditional server rollback
- Stop the application service on all servers:
# Execute on each application server or via configuration management
for server in app01 app02 app03; do
  ssh $server 'sudo systemctl stop myapp'
done

- Replace the deployed artifact with the previous version:

for server in app01 app02 app03; do
  ssh $server '
    sudo rm -rf /opt/myapp/current
    sudo ln -s /opt/myapp/releases/v2.3.1 /opt/myapp/current
  '
done

If using artifact deployment rather than symlinks:

for server in app01 app02 app03; do
  ssh $server '
    sudo tar -xzf /artifacts/releases/myapp-v2.3.1.tar.gz -C /opt/myapp/
  '
done

- Start the application service:

for server in app01 app02 app03; do
  ssh $server 'sudo systemctl start myapp'
done

- Verify application health on each server:

for server in app01 app02 app03; do
  curl -s http://$server:8080/health | jq '.status'
done

Expected output for each: "healthy"
Blue-green rollback
Blue-green deployments enable instant rollback through traffic switching. The previous version remains running on the inactive environment throughout deployment.
Blue-green rollback mechanism: instant recovery by switching traffic to the previous stable environment.

Before rollback (v2.4.0 is faulty):
  Load balancer -> BLUE  (v2.4.0, ACTIVE, errors)
                   GREEN (v2.3.1, STANDBY, stable)

After rollback (reverted to v2.3.1):
  Load balancer -> GREEN (v2.3.1, ACTIVE, live)
                   BLUE  (v2.4.0, STANDBY, fixing)

Figure 2: Blue-green traffic switch for instant rollback
- Verify the standby environment (green) health:
curl -s http://green.internal:8080/health | jq '.status'

Expected output: "healthy". If unhealthy, the standby environment requires investigation before traffic switch.
- Switch load balancer traffic to the standby environment:
For HAProxy:
# Update backend weight to shift traffic
echo "set server myapp-backend/green weight 100" | socat stdio /var/run/haproxy/admin.sock
echo "set server myapp-backend/blue weight 0" | socat stdio /var/run/haproxy/admin.sock

For AWS ALB:
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TARGET_GROUP_ARN

For Kubernetes with service mesh:
kubectl patch virtualservice myapp -n production --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: myapp-green
      weight: 100
    - destination:
        host: myapp-blue
      weight: 0
'

- Monitor traffic shift and error rates:
# Watch error rate during transition
watch -n 5 'curl -s http://monitoring.internal/api/v1/query?query=rate(http_requests_total{status=~"5.."}[1m])'

Error rate should drop to the pre-deployment baseline within 2-3 minutes of the traffic switch.
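Rather than eyeballing the watch output, the return-to-baseline check can be applied to sampled error rates. A sketch over already-collected samples (the sampling itself, e.g. the monitoring query above, is environment-specific):

```python
from typing import List, Optional

def minutes_to_baseline(samples: List[float],
                        baseline: float,
                        interval_seconds: int = 30) -> Optional[float]:
    """Return minutes until the error rate first returns to <= baseline.

    samples are error-rate readings taken every interval_seconds after
    the traffic switch. Returns None if the rate never recovers within
    the sampled window, which should trigger further investigation.
    """
    for i, rate in enumerate(samples):
        if rate <= baseline:
            return i * interval_seconds / 60
    return None
```

A result above roughly 3 minutes, or None, suggests the traffic switch did not take effect (see the troubleshooting section on downstream caching).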
Configuration rollback
Configuration rollback restores system or application configuration files to their pre-deployment state. This procedure applies when the deployment modified configuration without code changes.
- Identify configuration files modified by the deployment from the change record:
# List files in configuration backup
ls -la /backups/configs/myapp/20240115-1430/

Output:

app.conf
database.yml
nginx.conf

- Stop services that use the configuration:
sudo systemctl stop myapp nginx

- Restore each configuration file:
sudo cp /backups/configs/myapp/20240115-1430/app.conf /etc/myapp/app.conf
sudo cp /backups/configs/myapp/20240115-1430/database.yml /etc/myapp/database.yml
sudo cp /backups/configs/myapp/20240115-1430/nginx.conf /etc/nginx/nginx.conf

- Validate configuration syntax before restart:
# Application configuration validation
/opt/myapp/bin/myapp --validate-config /etc/myapp/app.conf

# Nginx configuration validation
sudo nginx -t

Expected output: validation passes with no errors.
- Restart services with restored configuration:
sudo systemctl start nginx myapp

Infrastructure rollback
Infrastructure rollback reverts changes to cloud resources, network configuration, or platform components. Infrastructure-as-code deployments enable declarative rollback.
Terraform rollback
- Identify the previous state version:
# List state versions in remote backend
terraform state list
terraform show -json | jq '.values.root_module.resources | length'

# For S3 backend, list state versions
aws s3api list-object-versions --bucket terraform-state-bucket --prefix myapp/terraform.tfstate

- Restore the previous state version:
# Download previous state
aws s3api get-object --bucket terraform-state-bucket \
  --key myapp/terraform.tfstate \
  --version-id "abc123previousversion" \
  terraform.tfstate.previous

# Replace current state
cp terraform.tfstate.previous terraform.tfstate
terraform state push terraform.tfstate

- Apply the previous configuration:
# Checkout previous infrastructure code version
git checkout v2.3.1 -- terraform/

# Plan to verify changes
terraform plan -out=rollback.plan

# Review plan output, then apply
terraform apply rollback.plan

Network configuration rollback
Restore network device configuration from backup:
For Cisco IOS devices:
configure replace flash:backup-20240115-1430.cfg force

For Juniper devices:
rollback 1
commit

- Verify network connectivity:
# Test critical paths
ping -c 5 gateway.internal
ping -c 5 database.internal
traceroute application.internal

- Verify routing tables:
# Compare current routes against expected baseline
ip route show | diff - /backups/network/routes-baseline.txt

Verification
After completing rollback procedures, verify that systems have returned to the expected pre-deployment state and service has been restored.
Service health verification
Confirm the application responds correctly:
# Health endpoint check
curl -s -w "\nHTTP Status: %{http_code}\n" https://myapp.example.org/health

# Expected output:
# {"status":"healthy","version":"v2.3.1","database":"connected"}
# HTTP Status: 200

Execute synthetic transactions to verify business functionality:
# Test critical user journey
curl -X POST https://myapp.example.org/api/v1/test-transaction \
  -H "Content-Type: application/json" \
  -d '{"test": true, "amount": 1.00}'

# Expected: HTTP 200 with transaction confirmation

Metrics verification
Confirm that metrics have returned to baseline:
# Query current error rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])" | jq '.data.result[0].value[1]'
# Expected: value below 0.01 (1% error rate)
# Query response time
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket[5m]))" | jq '.data.result[0].value[1]'

# Expected: value below baseline (e.g., 0.250 for 250ms p95)

Version verification
Confirm the correct version is deployed:
# Application version endpoint
curl -s https://myapp.example.org/version

# Expected: {"version":"v2.3.1","build":"20240110-1234"}

# Database schema version
psql -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"

# Expected: 20240110120000 (pre-deployment migration)

Rollback completion communication
Post rollback completion notification:
✅ ROLLBACK COMPLETE
Change: CHG0012345 - Payment service v2.4.0 deployment
Rollback completed: 2024-01-15 16:12 UTC
Duration: 25 minutes
Service status: Restored to v2.3.1
Error rate: 0.3% (baseline: 0.2%)
Response time: 180ms p95 (baseline: 175ms)
Incident record: INC0056789
Post-incident review: Scheduled 2024-01-17 10:00 UTC

Post-rollback activities
Rollback resolves the immediate service impact but requires follow-up actions to address the underlying deployment failure.
Incident record creation
Create an incident record if one does not exist:
Incident: INC0056789
Related change: CHG0012345
Summary: Payment service v2.4.0 deployment caused elevated error rate requiring rollback
Impact: 27 minutes of degraded service (12.3% error rate)
Resolution: Rolled back to v2.3.1
Root cause: Pending investigation (link to problem record)

Problem record linkage
Create or link to a problem record for root cause investigation:
Problem: PRB0003456
Related incident: INC0056789
Related change: CHG0012345
Summary: Payment service v2.4.0 deployment failure - root cause unknown
Status: Open - Assigned to development team

The problem record tracks investigation into why the deployment failed, preventing recurrence in subsequent deployment attempts.
Change record closure
Update the change record with rollback outcome:
Change: CHG0012345
Status: Failed - Rolled back
Rollback executed: 2024-01-15 15:47-16:12 UTC
Rollback reason: Error rate exceeded 10% threshold
Linked incident: INC0056789
Linked problem: PRB0003456
Post-implementation review: Required

Troubleshooting
Backup file not found or corrupted
Symptom: pg_restore: error: could not open input file: No such file or directory or pg_restore: error: invalid archive
Cause: Pre-deployment backup was not created, was moved, or was corrupted during storage.
Resolution: Check alternative backup locations. Query the backup catalogue for recent backups:
# List recent backups
ls -la /backups/*/myapp* | sort -k6,7

If no valid backup exists, rollback is not possible. Escalate to incident management and pursue forward-fix.
Database restore fails with active connections
Symptom: ERROR: database "myapp" is being accessed by other users
Cause: Application or connection pool maintains connections despite service stop.
Resolution: Force-terminate all connections:
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'myapp' AND pid <> pg_backend_pid();"

If connections persist, identify the source:

psql -c "SELECT pid, usename, application_name, client_addr FROM pg_stat_activity WHERE datname = 'myapp';"

Container rollback shows "revision not found"
Symptom: error: unable to find specified revision 3 in history
Cause: Deployment history was purged or revisionHistoryLimit is set too low.
Resolution: Check history limit:
kubectl get deployment myapp -n production -o jsonpath='{.spec.revisionHistoryLimit}'

If history is insufficient, deploy the previous version explicitly:

kubectl set image deployment/myapp -n production myapp=registry.example.org/myapp:v2.3.1

Load balancer traffic switch has no effect
Symptom: Traffic continues flowing to the failed deployment after load balancer configuration change.
Cause: DNS caching, CDN caching, or client-side connection persistence.
Resolution: Verify the load balancer configuration took effect:
# Check backend status
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep myapp

If configuration is correct but traffic persists, the issue is downstream caching:

# Purge CDN cache
curl -X POST https://api.cdn.example.org/purge -d '{"zone":"myapp.example.org"}'

# Wait for DNS TTL expiry (check current TTL)
dig +noall +answer myapp.example.org

Configuration file restore fails with permission denied
Symptom: cp: cannot create regular file '/etc/myapp/app.conf': Permission denied
Cause: Insufficient permissions or SELinux/AppArmor restrictions.
Resolution: Execute with appropriate privileges:
sudo cp /backups/configs/myapp/20240115-1430/app.conf /etc/myapp/app.conf

If sudo fails, check security context:

# SELinux context check
ls -Z /etc/myapp/app.conf
restorecon -v /etc/myapp/app.conf

Rollback completes but errors persist
Symptom: All rollback steps succeed, version is confirmed as previous, but error rate remains elevated.
Cause: The deployment was not the root cause, or rollback was incomplete (e.g., missing configuration file, cached data, external dependency).
Resolution: Investigate other potential causes:
# Check for configuration drift
diff /etc/myapp/app.conf /backups/configs/myapp/20240115-1430/app.conf

# Check external dependencies
curl -s https://payment-gateway.external.org/health

# Check for cached data issues
redis-cli KEYS "myapp:*" | head -20

If the original deployment was not the cause, create a new incident for the actual issue.
Terraform apply fails during infrastructure rollback
Symptom: Error: error creating resource: ConflictException: Resource already exists
Cause: State mismatch between Terraform state and actual infrastructure.
Resolution: Refresh state and retry:
terraform refresh
terraform plan -out=rollback.plan

If conflicts persist, manually import the conflicting resource or use terraform state rm to remove stale entries (with caution).
Network rollback causes connectivity loss
Symptom: SSH connection drops during network configuration rollback.
Cause: Rollback configuration removed the management network path.
Resolution: Access the device via out-of-band management (console, IPMI, iLO):
# Connect via serial console
screen /dev/ttyUSB0 9600

# Or via IPMI
ipmitool -I lanplus -H device-ipmi.internal -U admin sol activate

Restore connectivity configuration before completing rollback.
Rollback window exceeded
Symptom: Rollback procedures are taking longer than the defined window (e.g., 45 minutes vs 30-minute target).
Cause: Underestimated rollback complexity, unexpected issues, or resource constraints.
Resolution: Communicate timeline extension to stakeholders:
ROLLBACK EXTENSION
Original window: 30 minutes
New estimate: 60 minutes
Reason: Database restore taking longer than planned due to size
Current status: 65% complete

Assess whether continuing rollback or switching to forward-fix is more appropriate given the extended timeline.