Backup Verification
Backup verification confirms that backup data exists, remains intact, and can be restored within required timeframes. Verification operates at three levels: integrity checking validates that backup files are uncorrupted, restore testing proves that data can be recovered to a usable state, and compliance documentation demonstrates verification activities to auditors and regulators. Without systematic verification, organisations discover backup failures only when attempting recovery during an incident.
Prerequisites
Before beginning backup verification activities, confirm the following requirements are in place.
| Requirement | Detail |
|---|---|
| Backup system access | Administrative access to backup platform (Restic, Veeam, Azure Backup, AWS Backup, or equivalent) |
| Test restore environment | Isolated infrastructure for restore testing that does not affect production systems |
| Storage capacity | Sufficient space for test restores: minimum 1.5x the size of largest backup set |
| Documentation access | Write access to backup verification log repository |
| Scheduling authority | Ability to schedule verification jobs during maintenance windows |
| Time allocation | 4 hours weekly for routine verification; 8 hours monthly for full restore tests |
Verify that backup jobs are completing successfully before beginning verification. A backup that fails to run produces nothing to verify:
```bash
# Check recent backup job status (Restic example)
restic -r /path/to/repo snapshots --latest 5

# Expected output shows recent snapshots with timestamps
# ID        Time                 Host        Tags  Paths
# a1b2c3d4  2024-11-15 02:00:01  prod-db-01        /var/lib/postgresql
# e5f6g7h8  2024-11-14 02:00:01  prod-db-01        /var/lib/postgresql
```

If no recent snapshots appear, investigate the backup job before proceeding with verification.
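A quick staleness gate can turn the manual check above into a pass/fail step. This sketch hard-codes the snapshot timestamp and the reference time so the arithmetic is reproducible; in practice, pull the timestamp with `restic snapshots --latest 1 --json` and `jq -r '.[0].time'`:

```bash
# Flag a backup as stale if the newest snapshot is older than 24 hours
LATEST_TIME="2024-11-15T02:00:01Z"   # illustrative; normally read from restic output
NOW="2024-11-15T10:00:00Z"           # illustrative fixed reference time

AGE=$(( $(date -d "$NOW" +%s) - $(date -d "$LATEST_TIME" +%s) ))
if [ "$AGE" -gt 86400 ]; then
    echo "STALE: newest snapshot is ${AGE}s old"
else
    echo "OK: newest snapshot is ${AGE}s old"
fi
```

Replace the fixed `NOW` with `$(date -u +%Y-%m-%dT%H:%M:%SZ)` to run this in a real pre-verification check.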
Procedure
Backup verification follows a tiered approach: automated integrity checks run continuously, scheduled restore tests occur weekly and monthly, and compliance evidence collection happens quarterly or as required by audit schedules.
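As a rough sketch, that cadence maps onto cron entries like these (the script paths, times, and the post-backup and evidence-collection script names are illustrative assumptions; later steps use systemd timers instead):

```
# Illustrative crontab for the three verification tiers
# m  h  dom mon      dow  command
30 3  *   *        *    /opt/backup/scripts/post-backup-verify.sh  # continuous integrity sampling
0  4  *   *        0    /opt/backup/scripts/full-verify.sh         # weekly full check
0  5  1   1,4,7,10 *    /opt/backup/scripts/collect-evidence.sh    # quarterly compliance evidence
```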
Configuring automated integrity verification
Automated verification catches corruption, storage failures, and incomplete backups without manual intervention. Configure these checks to run after each backup job completes.
- Enable backup integrity checking in your backup configuration. For Restic-based backups, add verification to the post-backup script:
```bash
#!/bin/bash
REPO="/mnt/backup/restic-repo"
BACKUP_PATH="/var/lib/postgresql"

# Run backup
restic -r "$REPO" backup "$BACKUP_PATH" --tag postgresql
BACKUP_EXIT=$?

# Verify backup integrity
if [ $BACKUP_EXIT -eq 0 ]; then
    restic -r "$REPO" check --read-data-subset=5%
    VERIFY_EXIT=$?
else
    VERIFY_EXIT=1
fi

# Report results
if [ $VERIFY_EXIT -ne 0 ]; then
    echo "CRITICAL: Backup verification failed" | \
        mail -s "Backup Alert: $(hostname)" backup-alerts@example.org
    exit 1
fi

echo "Backup and verification completed successfully"
exit 0
```

The `--read-data-subset=5%` flag verifies a random 5% sample of backup data on each run. Because the sample is random, repeated backup cycles build up broad statistical coverage of the repository (though random sampling never guarantees every blob is read) while keeping verification time under 30 minutes for repositories up to 2TB.
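How much of the repository a random 5% sample actually touches over repeated runs can be estimated directly: the fraction read at least once after N runs is 1 − 0.95^N. A self-contained sketch:

```bash
# Expected coverage when each verification run reads a random 5% subset:
# P(blob checked at least once after N runs) = 1 - (1 - 0.05)^N
awk 'BEGIN {
    s = 0.05
    for (n = 10; n <= 40; n += 10)
        printf "after %d runs: ~%d%% of data sampled at least once\n", n, int((1 - (1 - s)^n) * 100 + 0.5)
}'
```

Roughly two-thirds of the data gets sampled over twenty cycles, never all of it; for deterministic full coverage, restic also accepts the `--read-data-subset=n/t` form, which walks through t fixed slices of the repository.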
- Configure verification scheduling in your job scheduler. For systemd-based systems, create a timer that runs verification independently of backup jobs:
```ini
# backup-verify.timer
[Unit]
Description=Weekly full backup verification

[Timer]
OnCalendar=Sun 04:00
Persistent=true
RandomizedDelaySec=1800

[Install]
WantedBy=timers.target
```

```ini
# backup-verify.service
[Unit]
Description=Full backup repository verification

[Service]
Type=oneshot
ExecStart=/opt/backup/scripts/full-verify.sh
User=backup
StandardOutput=journal
StandardError=journal
```

Enable the timer:

```bash
sudo systemctl enable --now backup-verify.timer
```

- Create the full verification script that performs comprehensive integrity checking:
```bash
#!/bin/bash
REPO="/mnt/backup/restic-repo"
LOG_DIR="/var/log/backup-verify"
DATE=$(date +%Y%m%d)

mkdir -p "$LOG_DIR"

echo "Starting full verification at $(date)" | tee "$LOG_DIR/verify-$DATE.log"

# Full repository check (reads all data)
restic -r "$REPO" check --read-data 2>&1 | tee -a "$LOG_DIR/verify-$DATE.log"
CHECK_EXIT=${PIPESTATUS[0]}

# Verify snapshot consistency
restic -r "$REPO" snapshots --json | \
    jq -r '.[] | "\(.time) \(.hostname) \(.paths[])"' | \
    tee -a "$LOG_DIR/verify-$DATE.log"

# Check for stale locks
restic -r "$REPO" unlock 2>&1 | tee -a "$LOG_DIR/verify-$DATE.log"

echo "Verification completed with exit code $CHECK_EXIT at $(date)" | \
    tee -a "$LOG_DIR/verify-$DATE.log"

exit $CHECK_EXIT
```

- Configure alerting for verification failures. Create an alert rule that triggers when verification jobs fail or do not run:
```yaml
groups:
  - name: backup_verification
    rules:
      - alert: BackupVerificationFailed
        expr: backup_verify_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup verification failed on {{ $labels.instance }}"
          description: "Backup verification has been failing for more than 5 minutes."

      - alert: BackupVerificationMissing
        expr: time() - backup_verify_last_success > 604800
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No successful backup verification in 7 days"
          description: "Instance {{ $labels.instance }} has not had a successful verification in over 7 days."
```

For environments without Prometheus, configure email alerts through cron or your backup platform's native alerting.
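A minimal cron-based fallback, assuming only `find` and a working MTA, flags a stale verification log the same way the `BackupVerificationMissing` rule does. The sketch demonstrates the check against a temporary directory; point `LOG_DIR` at `/var/log/backup-verify` in production:

```bash
# Cron-friendly fallback: alert when no verification log is newer than 7 days
LOG_DIR=$(mktemp -d)                     # stand-in for /var/log/backup-verify
touch "$LOG_DIR/verify-20241115.log"     # stand-in for a recent verification log

LATEST=$(find "$LOG_DIR" -name 'verify-*.log' -mtime -7 | head -1)
if [ -z "$LATEST" ]; then
    echo "ALERT: no verification log newer than 7 days"
    # ... | mail -s "Backup verification stale: $(hostname)" backup-alerts@example.org
else
    echo "OK: recent verification log found"
fi
rm -rf "$LOG_DIR"
```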
Performing scheduled restore tests
Automated integrity checks confirm that backup data is uncorrupted, but only restore tests prove that data can actually be recovered. Schedule restore tests according to the criticality of each data type.
| Data criticality | Test frequency | Restore scope |
|---|---|---|
| Critical (databases, financial) | Weekly (every Sunday) | Sample files + full DB monthly |
| Important (documents, email) | Monthly (1st Sunday) | Representative sample |
| Standard (user files, archives) | Quarterly (Jan/Apr/Jul/Oct) | Random sample |

Figure 1: Restore test frequency aligned to data criticality
- Prepare your test restore environment. The environment must be isolated from production to prevent test data from contaminating live systems:
```bash
# Create isolated restore target directory
sudo mkdir -p /mnt/restore-test
sudo chown backup:backup /mnt/restore-test

# For database restores, prepare a test database instance
# (PostgreSQL example)
sudo -u postgres createdb restore_test_db
```

For cloud-based backup systems, provision a separate resource group or project for restore testing:

```bash
# Azure example
az group create --name rg-restore-test --location uksouth

# AWS example
aws ec2 create-vpc --cidr-block 10.99.0.0/16 \
    --tag-specifications 'ResourceType=vpc,Tags=[{Key=Purpose,Value=restore-test}]'
```

- Execute a file-level restore test. Select files from different time periods to verify both recent and older backups:
```bash
# Restore specific files from most recent backup
restic -r /mnt/backup/restic-repo restore latest \
    --target /mnt/restore-test \
    --include "/var/lib/postgresql/data/base" \
    --verify

# Restore from backup 30 days ago
SNAPSHOT_30D=$(restic -r /mnt/backup/restic-repo snapshots --json | \
    jq -r '[.[] | select(.time < (now - 2592000 | todate))] | .[0].id')

restic -r /mnt/backup/restic-repo restore "$SNAPSHOT_30D" \
    --target /mnt/restore-test-30d \
    --include "/etc" \
    --verify
```

The `--verify` flag compares restored file checksums against the backup repository, confirming data integrity through the entire restore chain.
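Beyond `--verify`, an independent spot check compares a restored file against the live original by checksum. The paths here are stand-ins created with `mktemp` so the sketch is self-contained; in practice compare, say, `/mnt/restore-test/etc/hosts` against `/etc/hosts`:

```bash
# Compare a restored file against the production original by checksum
SRC=$(mktemp)   # stand-in for the live file
DST=$(mktemp)   # stand-in for its restored copy
echo "127.0.0.1 localhost" | tee "$SRC" > "$DST"

if [ "$(sha256sum < "$SRC")" = "$(sha256sum < "$DST")" ]; then
    echo "MATCH: restored file is byte-identical"
else
    echo "MISMATCH: investigate the restore"
fi
rm -f "$SRC" "$DST"
```

A mismatch is only meaningful for files that should not have changed since the snapshot was taken, so pick stable configuration files for this check.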
- Execute a database restore test. Database restores require bringing the restored data online to confirm usability:
```bash
# PostgreSQL restore test
# 1. Restore the data directory
restic -r /mnt/backup/restic-repo restore latest \
    --target /mnt/restore-test/pg-data \
    --include "/var/lib/postgresql/14/main"

# 2. Start PostgreSQL against restored data
sudo -u postgres /usr/lib/postgresql/14/bin/pg_ctl \
    -D /mnt/restore-test/pg-data/var/lib/postgresql/14/main \
    -o "-p 5433" \
    start

# 3. Verify database accessibility and run consistency check
psql -h localhost -p 5433 -U postgres -c "SELECT count(*) FROM pg_tables;"
psql -h localhost -p 5433 -U postgres -d production_db \
    -c "SELECT schemaname, tablename FROM pg_tables WHERE schemaname = 'public';"

# 4. Run application-specific verification queries
psql -h localhost -p 5433 -U postgres -d production_db \
    -c "SELECT COUNT(*) as beneficiary_count FROM beneficiaries;"

# Expected: count matches or closely matches production
# Deviation greater than 1% indicates a potential backup issue
```
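The 1% threshold in step 4 can be scripted rather than eyeballed. A sketch with the counts hard-coded; in practice, capture them with `psql -t -c "SELECT count(*) FROM beneficiaries;"` against both the test and production instances:

```bash
# Compare restored vs production row counts against a 1% deviation threshold
PROD=15851        # illustrative production count
RESTORED=15847    # illustrative restored count

DELTA=$(awk -v p="$PROD" -v r="$RESTORED" \
    'BEGIN { d = (p - r) / p * 100; if (d < 0) d = -d; printf "%.2f", d }')
echo "deviation: ${DELTA}%"

if awk -v d="$DELTA" 'BEGIN { exit !(d > 1.0) }'; then
    echo "FAIL: deviation exceeds 1%"
else
    echo "OK: within threshold"
fi
```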
```bash
# 5. Shut down the test instance and clean up
sudo -u postgres /usr/lib/postgresql/14/bin/pg_ctl \
    -D /mnt/restore-test/pg-data/var/lib/postgresql/14/main \
    stop

rm -rf /mnt/restore-test/pg-data
```

- Document restore test results. Create a verification record for each test:
```bash
# Generate restore test report
cat > /var/log/backup-verify/restore-test-$(date +%Y%m%d).md << EOF
# Restore Test Report

Date: $(date +%Y-%m-%d)
Tester: $(whoami)
Backup Repository: /mnt/backup/restic-repo

## File Restore Test

- Snapshot ID: $(restic -r /mnt/backup/restic-repo snapshots --latest 1 --json | jq -r '.[0].id')
- Snapshot Date: $(restic -r /mnt/backup/restic-repo snapshots --latest 1 --json | jq -r '.[0].time')
- Files Restored: 847
- Verification: PASSED (checksums match)
- Restore Duration: 4m 23s

## Database Restore Test

- Database: production_db
- Table Count: 42 (matches production)
- Row Count Check: beneficiaries: 15,847 (production: 15,851, delta: 0.03%)
- Verification: PASSED
- Restore Duration: 12m 07s

## Issues Identified

None

## Sign-off

Test completed successfully. Backups verified recoverable.
EOF
```

- Measure and record restore performance. Recovery Time Objective (RTO) compliance depends on knowing actual restore speeds:
```bash
# Timed full restore test (run quarterly)
START=$(date +%s)

restic -r /mnt/backup/restic-repo restore latest \
    --target /mnt/restore-test/full \
    --verify

END=$(date +%s)
DURATION=$((END - START))
SIZE=$(du -sh /mnt/restore-test/full | cut -f1)

echo "Full restore: $SIZE in $DURATION seconds"
echo "Restore rate: $(echo "scale=2; $(du -sb /mnt/restore-test/full | cut -f1) / $DURATION / 1048576" | bc) MB/s"

# Compare against RTO
# If RTO is 4 hours (14400 seconds) and restore took 3600 seconds,
# you have 75% margin
```

Configuring verification for cloud backup services
Cloud backup services provide built-in verification capabilities that differ from self-managed backup tools. Configure these features to provide equivalent assurance.
- Enable Azure Backup verification features:
```bash
# Enable soft delete and cross-region restore for Recovery Services vault
az backup vault update \
    --resource-group rg-backup \
    --name vault-backup-prod \
    --soft-delete-state Enabled \
    --cross-region-restore Enabled

# Configure backup policy with verification
az backup policy create \
    --resource-group rg-backup \
    --vault-name vault-backup-prod \
    --name policy-daily-verified \
    --backup-management-type AzureIaasVM \
    --policy '{
      "schedulePolicy": {
        "schedulePolicyType": "SimpleSchedulePolicy",
        "scheduleRunFrequency": "Daily",
        "scheduleRunTimes": ["2024-01-01T02:00:00Z"]
      },
      "retentionPolicy": {
        "retentionPolicyType": "LongTermRetentionPolicy",
        "dailySchedule": {
          "retentionDuration": {"count": 30, "durationType": "Days"}
        }
      },
      "instantRpRetentionRangeInDays": 5
    }'
```

- Configure AWS Backup verification:
```bash
# Create backup plan with verification
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "daily-verified",
  "Rules": [{
    "RuleName": "daily-backup",
    "TargetBackupVaultName": "backup-vault-prod",
    "ScheduleExpression": "cron(0 2 * * ? *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 180,
    "Lifecycle": {
      "MoveToColdStorageAfterDays": 30,
      "DeleteAfterDays": 365
    },
    "EnableContinuousBackup": true
  }]
}'

# Create restore testing plan
aws backup create-restore-testing-plan --restore-testing-plan '{
  "RestoreTestingPlanName": "weekly-restore-test",
  "ScheduleExpression": "cron(0 4 ? * SUN *)",
  "StartWindowHours": 4,
  "RecoveryPointSelection": {
    "Algorithm": "LATEST_WITHIN_WINDOW",
    "IncludeVaults": ["arn:aws:backup:eu-west-1:123456789:backup-vault:backup-vault-prod"],
    "RecoveryPointTypes": ["CONTINUOUS", "SNAPSHOT"]
  }
}'
```

- Schedule and monitor cloud restore tests:
```bash
# AWS: Check restore testing job status
aws backup list-restore-testing-plans
aws backup list-restore-jobs --by-restore-testing-plan-arn \
    "arn:aws:backup:eu-west-1:123456789:restore-testing-plan:weekly-restore-test"

# Azure: Check restore job history
az backup job list \
    --resource-group rg-backup \
    --vault-name vault-backup-prod \
    --operation Restore \
    --output table
```

Establishing verification metrics and reporting
Verification activities generate data that demonstrates backup health over time and provides evidence for compliance requirements.
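The collection script in this section writes plain `name value` lines for node_exporter's textfile collector. One detail worth adopting (an addition of ours, not part of the script itself): write to a temporary file and rename, so the exporter never scrapes a half-written file. The metrics file path below is a `mktemp` stand-in:

```bash
# Atomic-write pattern for textfile-collector metrics
METRICS_FILE=$(mktemp)   # stand-in for /var/lib/prometheus/node-exporter/backup_metrics.prom
TMP="$METRICS_FILE.$$"

{
    echo "backup_verify_success 1"
    echo "backup_snapshot_count 30"
} > "$TMP"
mv "$TMP" "$METRICS_FILE"   # rename is atomic on the same filesystem

cat "$METRICS_FILE"
rm -f "$METRICS_FILE"
```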
- Define key verification metrics. Track these indicators to identify trends before they become failures:
```bash
# Create metrics collection script
cat > /opt/backup/scripts/collect-metrics.sh << 'EOF'
#!/bin/bash

REPO="/mnt/backup/restic-repo"
METRICS_FILE="/var/lib/prometheus/node-exporter/backup_metrics.prom"

# Metric: Last successful verification timestamp
LAST_VERIFY=$(stat -c %Y /var/log/backup-verify/verify-*.log 2>/dev/null | sort -rn | head -1)
echo "backup_verify_last_success ${LAST_VERIFY:-0}" > "$METRICS_FILE"

# Metric: Repository size
REPO_SIZE=$(restic -r "$REPO" stats --json 2>/dev/null | jq -r '.total_size // 0')
echo "backup_repository_size_bytes $REPO_SIZE" >> "$METRICS_FILE"

# Metric: Snapshot count
SNAPSHOT_COUNT=$(restic -r "$REPO" snapshots --json 2>/dev/null | jq -r 'length // 0')
echo "backup_snapshot_count $SNAPSHOT_COUNT" >> "$METRICS_FILE"

# Metric: Latest snapshot age (seconds)
LATEST_TIME=$(restic -r "$REPO" snapshots --latest 1 --json 2>/dev/null | \
    jq -r '.[0].time // "1970-01-01T00:00:00Z"')
LATEST_EPOCH=$(date -d "$LATEST_TIME" +%s 2>/dev/null || echo 0)
NOW_EPOCH=$(date +%s)
SNAPSHOT_AGE=$((NOW_EPOCH - LATEST_EPOCH))
echo "backup_latest_snapshot_age_seconds $SNAPSHOT_AGE" >> "$METRICS_FILE"

# Metric: Verification success (1=success, 0=failure)
if grep -q "exit code 0" /var/log/backup-verify/verify-$(date +%Y%m%d).log 2>/dev/null; then
    echo "backup_verify_success 1" >> "$METRICS_FILE"
else
    echo "backup_verify_success 0" >> "$METRICS_FILE"
fi
EOF

chmod +x /opt/backup/scripts/collect-metrics.sh
```

- Create a verification dashboard or report template:
```bash
# Generate weekly verification report
cat > /opt/backup/scripts/weekly-report.sh << 'EOF'
#!/bin/bash

REPORT_DIR="/var/log/backup-verify/reports"
mkdir -p "$REPORT_DIR"

WEEK=$(date +%Y-W%V)
REPORT="$REPORT_DIR/weekly-$WEEK.md"

cat > "$REPORT" << HEADER
# Backup Verification Weekly Report

Week: $WEEK
Generated: $(date +%Y-%m-%d)

## Summary

| Metric | Value | Status |
|--------|-------|--------|
HEADER

# Calculate metrics
VERIFY_COUNT=$(ls -1 /var/log/backup-verify/verify-*.log 2>/dev/null | \
    xargs -I {} sh -c 'date -d "$(basename {} .log | cut -d- -f2)" +%s' | \
    awk -v week_start=$(date -d "last sunday" +%s) '$1 >= week_start' | wc -l)

RESTORE_TESTS=$(grep -l "Restore Test Report" /var/log/backup-verify/*.md 2>/dev/null | \
    xargs -I {} sh -c 'date -d "$(basename {} .md | cut -d- -f3)" +%s' | \
    awk -v week_start=$(date -d "last sunday" +%s) '$1 >= week_start' | wc -l)

FAILURES=$(grep -l "FAILED\|exit code [1-9]" /var/log/backup-verify/verify-*.log 2>/dev/null | wc -l)

echo "| Integrity checks completed | $VERIFY_COUNT | $([ $VERIFY_COUNT -ge 7 ] && echo '✓' || echo '⚠') |" >> "$REPORT"
echo "| Restore tests completed | $RESTORE_TESTS | $([ $RESTORE_TESTS -ge 1 ] && echo '✓' || echo '⚠') |" >> "$REPORT"
echo "| Verification failures | $FAILURES | $([ $FAILURES -eq 0 ] && echo '✓' || echo '✗') |" >> "$REPORT"

cat >> "$REPORT" << FOOTER

## Restore Test Results

$(cat /var/log/backup-verify/restore-test-*.md 2>/dev/null | grep -A 20 "^## " | head -40)

## Issues Requiring Attention

$(grep -h "FAILED\|ERROR\|WARNING" /var/log/backup-verify/verify-*.log 2>/dev/null | sort -u | head -10)

---

*Report generated automatically. Review and archive for compliance.*
FOOTER

echo "Report generated: $REPORT"
EOF

chmod +x /opt/backup/scripts/weekly-report.sh
```

- Archive verification evidence for compliance. Retain verification records according to your data retention policy, typically matching or exceeding backup retention:
```bash
# Archive verification logs monthly
cat > /opt/backup/scripts/archive-verification.sh << 'EOF'
#!/bin/bash

ARCHIVE_DIR="/mnt/archive/backup-verification"
LOG_DIR="/var/log/backup-verify"
MONTH=$(date -d "last month" +%Y-%m)

mkdir -p "$ARCHIVE_DIR"

# Create compressed archive of the month's verification logs
tar -czf "$ARCHIVE_DIR/verification-$MONTH.tar.gz" \
    -C "$LOG_DIR" \
    $(find "$LOG_DIR" -name "*$MONTH*" -type f -printf "%f\n")

# Generate SHA256 checksum for archive integrity
sha256sum "$ARCHIVE_DIR/verification-$MONTH.tar.gz" > \
    "$ARCHIVE_DIR/verification-$MONTH.tar.gz.sha256"

# Remove archived logs from active directory (keep 3 months online)
find "$LOG_DIR" -name "*.log" -mtime +90 -delete
find "$LOG_DIR" -name "*.md" -mtime +90 -delete
EOF

chmod +x /opt/backup/scripts/archive-verification.sh
```

Verification
After configuring backup verification, confirm that the system operates correctly through these checks.
Check that automated verification runs on schedule:
```bash
# Verify timer is active and shows next run time
systemctl list-timers | grep backup-verify

# Expected output:
# Sun 2024-11-17 04:00:00 GMT  6 days left  Sun 2024-11-10 04:12:33 GMT  23h ago  backup-verify.timer  backup-verify.service

# Check recent verification job logs
journalctl -u backup-verify.service --since "1 week ago" | tail -20
```

Confirm alerting functions correctly by triggering a test alert:
```bash
# Temporarily create a failing verification to test alerting
echo "backup_verify_success 0" > /var/lib/prometheus/node-exporter/backup_metrics.prom

# Wait for alert to fire (check Prometheus/Alertmanager)
# Then restore normal metric
/opt/backup/scripts/collect-metrics.sh
```

Verify restore test documentation exists and contains required elements:
```bash
# Check for recent restore test reports
ls -la /var/log/backup-verify/restore-test-*.md

# Verify report contains required sections
grep -E "^## (File|Database) Restore Test" /var/log/backup-verify/restore-test-*.md
```

Confirm metrics are being collected and exported:
```bash
# Check Prometheus metrics endpoint (if using node_exporter textfile collector)
curl -s localhost:9100/metrics | grep backup_

# Expected output includes:
# backup_verify_last_success 1731234567
# backup_verify_success 1
# backup_repository_size_bytes 1234567890
# backup_snapshot_count 30
```

Work through the confirmation flow in order:

1. Automated checks: is the timer active? If not, enable it. Are jobs completing? If not, check the logs for errors.
2. Alerting: does a test alert fire? If not, check the alert configuration.
3. Documentation: do restore test reports exist? If not, run a manual restore test.

When all three stages pass, verification is complete.

Figure 2: Verification system confirmation workflow
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| `repository is locked` error during verification | Previous job did not complete cleanly, or concurrent access | Run `restic unlock` to clear stale locks. If the lock persists, check for running processes with `ps aux \| grep restic` and kill orphaned processes if safe. |
| Verification completes but reports 0 files checked | Incorrect repository path or empty repository | Verify the repository path with `restic -r /path/to/repo snapshots`. Check that backup jobs are writing to the expected location. |
| `check: data blob not found` errors | Repository corruption or incomplete backup | Run `restic check --read-data` to identify affected snapshots. May require restoring from an earlier snapshot or rebuilding from source. |
| Restore test fails with permission denied | Test restore environment not configured correctly | Verify restore target directory ownership: ls -la /mnt/restore-test. Ensure backup service account has write permissions. |
| Database restore succeeds but application reports missing data | Point-in-time recovery needed; snapshot taken during transaction | For transactional consistency, use database-native backup tools (pg_dump, mysqldump) rather than filesystem snapshots, or enable continuous archiving. |
| Restore test takes longer than RTO | Insufficient restore infrastructure, network bottleneck, or storage IOPS limits | Profile restore with time command. Check network throughput during restore: iftop. Consider faster restore targets or parallel restore streams. |
| Verification timer not running | Timer not enabled, or service file has errors | Check timer status: systemctl status backup-verify.timer. Verify service file syntax: systemd-analyze verify backup-verify.service. |
| Metrics not appearing in Prometheus | Textfile collector path incorrect, or script not executable | Verify collector path in node_exporter config. Check script permissions: ls -la /opt/backup/scripts/collect-metrics.sh. Run script manually to test. |
| Alert fires but no notification received | Alertmanager routing misconfigured, or receiver not set up | Test Alertmanager directly: amtool alert add alertname=test. Check receiver configuration in alertmanager.yml. |
| Verification passes but restore fails | Integrity check verifies structure, not application-level consistency | Always include actual restore tests, not just integrity checks. Database restores must include application verification queries. |
| Cloud backup verification shows success but restore quota exceeded | Cloud provider limits on restore operations or egress | Check provider quotas and limits. Azure: Recovery Services vault limits. AWS: Backup vault restore limits. Plan restore tests within quota. |
| Restore test database conflicts with production | Test environment not sufficiently isolated | Use separate ports, separate hosts, or containerised test environments. Never restore to production database server without explicit isolation. |
Verification is not backup
Successful verification confirms that existing backups are recoverable. It does not guarantee that backups contain the right data, cover all required systems, or meet retention requirements. Verification is one component of backup assurance alongside backup policy, coverage audits, and retention compliance checks.
See also
- Backup Systems for backup architecture and design concepts
- Backup Recovery for restore procedures during incidents
- Data Backup and Recovery for data-specific backup procedures
- Backup and Recovery Standard for policy requirements
- DR Testing for broader disaster recovery testing including backup restore tests