
Data Archiving

Data archiving transfers data from primary storage to long-term repositories where it remains accessible but at lower cost and with different performance characteristics than operational systems. You perform archiving to meet retention obligations, reduce storage costs on high-performance systems, and preserve historical records for future reference. Unlike backup, which creates copies for disaster recovery, archiving moves data to a different tier with the expectation that retrieval will be infrequent.

Archive
A collection of data removed from operational systems and stored in long-term repositories with metadata sufficient for future retrieval and interpretation.
Cold storage
Storage tier optimised for infrequently accessed data, offering lowest cost per gigabyte but higher retrieval latency and often retrieval fees.
Legal hold
A directive requiring preservation of specified data regardless of normal retention schedules, typically issued in anticipation of litigation or regulatory investigation.
Archive package
A self-contained unit combining data files, metadata, checksums, and documentation sufficient to interpret the archive contents without external dependencies.

Prerequisites

Before beginning archiving operations, confirm the following requirements are satisfied.

You need an approved retention policy that specifies which data categories require archiving, the retention periods for each category, and the conditions under which archived data may be disposed. Without this policy, you cannot determine what to archive or for how long. The retention policy should reference your data classification scheme so you can identify sensitive data requiring encryption in archives. Obtain the current retention schedule from your records management function or data governance team.

You need access to archive storage infrastructure with capacity for the data volume you intend to archive plus 20% headroom for packaging overhead. Calculate expected archive size by querying source systems for data volumes matching your archive criteria. For cloud storage, confirm you have credentials with write permissions to the target bucket or container and that network egress costs are budgeted.
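The headroom calculation is simple shell arithmetic; in the sketch below, `SOURCE_BYTES` is an illustrative figure standing in for the result of a size query against the source system:

```shell
# Hypothetical figure: SOURCE_BYTES would come from a size query against
# the source system (e.g. SUM(pg_column_size(t.*)) in PostgreSQL)
SOURCE_BYTES=13207024435                          # roughly 12.3 GB
HEADROOM_BYTES=$(( SOURCE_BYTES / 5 ))            # 20% packaging headroom
REQUIRED_BYTES=$(( SOURCE_BYTES + HEADROOM_BYTES ))
echo "Provision at least ${REQUIRED_BYTES} bytes of archive capacity"
```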

You need data classification labels applied to source data so you can determine encryption requirements and access restrictions for archived data. Unclassified data cannot be archived because you cannot verify appropriate handling. If source data lacks classification, complete classification before proceeding.

You need checksums or hash values for source data to verify archive integrity. If source systems do not provide checksums, you must generate them before extraction. SHA-256 provides sufficient collision resistance for integrity verification.

You need documentation of the source system schema, field definitions, and any encoding or compression applied to source data. Archives without schema documentation become uninterpretable when source systems are decommissioned. Export schema definitions and data dictionaries as part of archive preparation.

For database archives, you need credentials with SELECT permissions on tables to be archived and, if performing destructive archiving, DELETE permissions. Coordinate with database administrators to schedule archive operations during low-usage periods if source systems cannot tolerate concurrent archiving load.

For grant or project closeout archiving, you need confirmation from the programme manager that the project has completed final reporting and that no outstanding data requests exist. Premature archiving of project data can obstruct audit responses.

Procedure

Identifying archive candidates

You identify data for archiving by applying retention policy criteria to source systems. The selection criteria derive from your retention schedule and typically include age thresholds, project status, or explicit archive flags set by data owners.
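For date-based criteria, the cutoff can be computed outside the database as well; this sketch assumes GNU `date`:

```shell
# Compute the cutoff for a 24-month age threshold (GNU date assumed);
# records last modified before this date become archive candidates
CUTOFF=$(date -d "24 months ago" +%Y-%m-%d)
echo "Archive records last modified before: ${CUTOFF}"
```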

  1. Query the source system to identify records meeting archive criteria. For date-based selection, calculate the cutoff date from your retention policy. If your policy specifies archiving data older than 24 months from last modification, calculate the threshold:
-- PostgreSQL: identify records for archiving
SELECT
    id,
    created_at,
    modified_at,
    pg_size_pretty(pg_column_size(t.*)) AS record_size
FROM programme_data t
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = FALSE;

Record the count and total size of candidate records. For this example, assume the query returns 145,000 records totalling 12.3 GB.

  2. Verify no legal holds apply to candidate records. Query your legal hold register or check the legal hold flag on individual records. Any record under legal hold must be excluded regardless of age:
-- Verify no legal holds in candidate set
SELECT COUNT(*) AS held_records
FROM programme_data
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = TRUE;

If this returns any non-zero count, investigate each hold before proceeding. Do not archive records under legal hold.

  3. Classify the archive candidate set by data classification level. This determines encryption requirements and storage tier:
SELECT
    data_classification,
    COUNT(*) AS record_count,
    SUM(pg_column_size(t.*)) AS total_bytes
FROM programme_data t
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = FALSE
GROUP BY data_classification;

Records classified as CONFIDENTIAL or higher require encryption before transfer to archive storage. Records classified as INTERNAL or PUBLIC may be archived without encryption depending on your security policy.

  4. Obtain approval from the data owner for the identified archive set. Send the record counts, date ranges, and classification breakdown to the responsible data steward. Document their approval with date and any conditions they specify.

Preparing archive packages

Archive packages bundle data with metadata to ensure future interpretability. A well-constructed archive package contains everything needed to understand and use the archived data without access to the original source system.

  1. Create a working directory for archive assembly with subdirectories for data, metadata, and checksums:
ARCHIVE_ID="prog-data-$(date +%Y%m%d)-001"
mkdir -p /archive/staging/${ARCHIVE_ID}/{data,metadata,checksums}
cd /archive/staging/${ARCHIVE_ID}
  2. Export the data from the source system in a format that preserves data types and handles special characters correctly. For relational data, use the database’s native export format or a portable format like CSV with explicit quoting:
# PostgreSQL export with explicit format controls.
# Server-side COPY ... TO STDOUT is used here because psql's \COPY
# meta-command must fit on a single line; redirecting STDOUT still
# writes the file on the client.
psql -h dbserver -U archive_user -d programme_db -c "COPY (
    SELECT * FROM programme_data
    WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
      AND archived_at IS NULL
      AND legal_hold = FALSE
) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, ENCODING 'UTF8', NULL '')" \
    > /archive/staging/${ARCHIVE_ID}/data/programme_data.csv

For binary data or documents, export to the data subdirectory preserving original filenames or using systematic naming:

# Export documents with preserved directory structure
# (-At gives unaligned, tuples-only output so paths carry no padding)
rsync -av --files-from=<(
    psql -At -c "SELECT file_path FROM documents
                 WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'"
) /documents/ /archive/staging/${ARCHIVE_ID}/data/documents/
  3. Generate the schema documentation capturing table structure, field definitions, data types, and constraints:
# Export PostgreSQL schema
pg_dump -h dbserver -U archive_user -d programme_db \
    --schema-only \
    --table=programme_data \
    > metadata/schema.sql

# Generate human-readable data dictionary
psql -h dbserver -U archive_user -d programme_db -c "
    SELECT
        column_name,
        data_type,
        character_maximum_length,
        is_nullable,
        column_default
    FROM information_schema.columns
    WHERE table_name = 'programme_data'
    ORDER BY ordinal_position
" > metadata/data_dictionary.txt
  4. Create an archive manifest documenting the archive contents, source system, extraction date, and retention requirements:
cat > metadata/MANIFEST.json << EOF
{
  "archive_id": "${ARCHIVE_ID}",
  "created_at": "$(date -Iseconds)",
  "created_by": "$(whoami)",
  "source_system": "programme_db.example.org",
  "source_table": "programme_data",
  "record_count": 145000,
  "date_range": {
    "earliest_record": "2019-03-15",
    "latest_record": "2022-12-31"
  },
  "data_classification": "INTERNAL",
  "retention_policy": "PROG-RET-001",
  "retain_until": "2032-01-01",
  "extraction_criteria": "modified_at < 2023-01-01",
  "files": [
    {"name": "data/programme_data.csv", "format": "CSV/UTF-8", "rows": 145000},
    {"name": "metadata/schema.sql", "format": "PostgreSQL DDL"},
    {"name": "metadata/data_dictionary.txt", "format": "text"}
  ]
}
EOF
  5. Generate checksums for all files in the archive package to enable integrity verification:
find data metadata -type f -exec sha256sum {} \; > checksums/SHA256SUMS
# Generate a checksum of the checksum file for tamper detection.
# Note this is a plain hash, not a cryptographic signature; sign the
# file with GPG if you also need proof of origin.
sha256sum checksums/SHA256SUMS > checksums/SHA256SUMS.sig
  6. For data classified as CONFIDENTIAL or higher, encrypt the data files before packaging. Use AES-256 encryption with a key managed through your secrets management system:
# Encrypt data files (for CONFIDENTIAL classification)
gpg --symmetric --cipher-algo AES256 \
    --output data/programme_data.csv.gpg \
    data/programme_data.csv

# Remove unencrypted file after successful encryption
rm data/programme_data.csv

# Regenerate checksums so SHA256SUMS lists the encrypted file and no
# longer references the removed plaintext file (a stale entry would
# make later verification fail)
find data metadata -type f -exec sha256sum {} \; > checksums/SHA256SUMS

Store the encryption key identifier in the manifest. Do not store the key itself in the archive.

  7. Package the archive directory into a single archive file for transfer:
cd /archive/staging
tar -cvf ${ARCHIVE_ID}.tar ${ARCHIVE_ID}/

# Generate checksum of the complete package
sha256sum ${ARCHIVE_ID}.tar > ${ARCHIVE_ID}.tar.sha256

The 12.3 GB of source data yields a package of roughly the same size: tar adds only packaging overhead and applies no compression. CSV text compresses well, so adding gzip (tar -czvf, producing a .tar.gz) typically shrinks a package like this substantially, to roughly 4.1 GB in this example; if you compress, carry the compressed filename through the later steps.

Selecting storage tier

Archive storage tiers trade access speed against cost. You select the tier based on expected retrieval frequency and acceptable retrieval latency.

             +----------------------------------+
             | How often will this data be      |
             | retrieved in the next 5 years?   |
             +----------------------------------+
                               |
         +---------------------+---------------------+
         |                     |                     |
         v                     v                     v
+------------------+  +------------------+  +------------------+
| More than        |  | 1-10 times       |  | Likely never,    |
| 10 times         |  | (annual audits,  |  | compliance       |
|                  |  | occasional       |  | retention only   |
|                  |  | reference)       |  |                  |
+--------+---------+  +--------+---------+  +--------+---------+
         |                     |                     |
         v                     v                     v
+------------------+  +------------------+  +------------------+
| WARM TIER        |  | COLD TIER        |  | ARCHIVE TIER     |
|                  |  |                  |  |                  |
| - S3 Standard-IA |  | - S3 Glacier     |  | - S3 Glacier     |
| - Azure Cool     |  |   Instant        |  |   Deep Archive   |
| - GCS Nearline   |  | - Azure Cold     |  | - Azure Archive  |
|                  |  | - GCS Coldline   |  | - GCS Archive    |
| Retrieval: mins  |  | Retrieval: mins  |  | Retrieval: hours |
| Cost: ~$0.01/GB  |  | Cost: ~$0.004/GB |  | Cost: ~$0.001/GB |
+------------------+  +------------------+  +------------------+

Figure 1: Storage tier selection based on retrieval frequency

The cost figures represent approximate monthly storage costs; retrieval adds roughly $0.01-0.03 per GB for cold tiers and $0.02-0.05 per GB for archive tiers. A 12 GB archive stored for 10 years in the deep archive tier costs approximately $1.44 in storage (12 GB x $0.001/GB-month x 120 months) plus retrieval costs when accessed. The same archive in the warm tier costs approximately $14.40 over 10 years but with immediate access.
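The arithmetic behind these figures can be checked directly, using the approximate per-GB monthly rates from Figure 1 (illustrative rates, not quotes; real prices vary by region and provider):

```shell
# 10-year storage cost comparison for a 12 GB archive, using the
# approximate per-GB monthly rates from Figure 1
SIZE_GB=12
MONTHS=120
DEEP_COST=$(awk -v gb="$SIZE_GB" -v m="$MONTHS" 'BEGIN { printf "%.2f", gb * 0.001 * m }')
WARM_COST=$(awk -v gb="$SIZE_GB" -v m="$MONTHS" 'BEGIN { printf "%.2f", gb * 0.01 * m }')
echo "archive tier storage over 10 years: \$${DEEP_COST}"
echo "warm tier storage over 10 years:    \$${WARM_COST}"
```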

For the programme data archive in this example, annual audit access is expected, making cold tier appropriate. Configure the storage class during upload.

Transferring to archive storage

Transfer the packaged archive to its designated storage tier with integrity verification at each stage.

  1. Upload the archive package to cloud storage with the appropriate storage class:
# AWS S3 with Glacier Instant Retrieval
aws s3 cp ${ARCHIVE_ID}.tar \
    s3://org-archives/programme-data/${ARCHIVE_ID}.tar \
    --storage-class GLACIER_IR \
    --metadata "retention-until=2032-01-01,classification=INTERNAL"

# Upload checksum file to standard storage for quick verification
aws s3 cp ${ARCHIVE_ID}.tar.sha256 \
    s3://org-archives/programme-data/${ARCHIVE_ID}.tar.sha256

For Azure Blob Storage:

# Azure with Cool tier
az storage blob upload \
    --account-name orgarchives \
    --container-name programme-data \
    --name ${ARCHIVE_ID}.tar \
    --file ${ARCHIVE_ID}.tar \
    --tier Cool \
    --metadata "retention-until=2032-01-01" "classification=INTERNAL"
  2. Verify the upload completed successfully by comparing checksums:
# Get the ETag (MD5 for single-part uploads) from S3
aws s3api head-object \
    --bucket org-archives \
    --key programme-data/${ARCHIVE_ID}.tar \
    --query 'ETag' --output text

# For large files uploaded in parts, download and verify
aws s3 cp s3://org-archives/programme-data/${ARCHIVE_ID}.tar - | \
    sha256sum | cut -d' ' -f1

# Compare with local checksum
cat ${ARCHIVE_ID}.tar.sha256

The checksums must match exactly. Any discrepancy indicates transfer corruption requiring re-upload.

  3. Apply retention locks if your storage platform supports them. Retention locks prevent deletion until the retention period expires, protecting against accidental or malicious deletion:
# AWS S3 Object Lock (requires bucket with Object Lock enabled)
aws s3api put-object-retention \
    --bucket org-archives \
    --key programme-data/${ARCHIVE_ID}.tar \
    --retention '{"Mode":"GOVERNANCE","RetainUntilDate":"2032-01-01T00:00:00Z"}'

Governance mode allows users with specific permissions to override the lock; compliance mode prevents any deletion until expiry, including by the root account.

  4. Register the archive in your archive index or data catalogue. This registration enables future discovery and retrieval:
INSERT INTO archive_registry (
    archive_id,
    source_system,
    source_table,
    record_count,
    date_range_start,
    date_range_end,
    storage_location,
    storage_tier,
    checksum_sha256,
    retain_until,
    data_classification,
    created_at,
    created_by
) VALUES (
    'prog-data-20250101-001',
    'programme_db.example.org',
    'programme_data',
    145000,
    '2019-03-15',
    '2022-12-31',
    's3://org-archives/programme-data/prog-data-20250101-001.tar',
    'GLACIER_IR',
    'a3f2b8c9d4e5f6...',
    '2032-01-01',
    'INTERNAL',
    CURRENT_TIMESTAMP,
    'archive_operator'
);

Marking source data as archived

After successful transfer and verification, update source records to indicate archival status. This prevents duplicate archiving and enables source data deletion if your retention policy permits.

  1. Update the archived flag on source records within a transaction:
BEGIN;

UPDATE programme_data
SET
    archived_at = CURRENT_TIMESTAMP,
    archive_reference = 'prog-data-20250101-001'
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = FALSE;

-- Verify expected count was updated
-- Should match earlier candidate count: 145000
SELECT COUNT(*) FROM programme_data
WHERE archive_reference = 'prog-data-20250101-001';

COMMIT;
  2. If your retention policy permits deletion of archived source data, schedule the deletion for a grace period after archiving. A 30-day grace period allows discovery of archive problems before source data is removed:
-- Schedule deletion 30 days after archiving
INSERT INTO scheduled_deletions (
    table_name,
    selection_criteria,
    scheduled_for,
    archive_reference,
    approved_by
) VALUES (
    'programme_data',
    'archive_reference = ''prog-data-20250101-001''',
    CURRENT_DATE + INTERVAL '30 days',
    'prog-data-20250101-001',
    'data_steward_name'
);

Do not delete source data immediately after archiving. The grace period enables retrieval testing and discovery of any packaging errors.
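The earliest permissible deletion date is simple to derive outside the database too; this sketch assumes GNU `date` and uses the example archive date:

```shell
# Derive the earliest source-deletion date from the archive date plus
# the 30-day grace period (GNU date assumed; date is illustrative)
ARCHIVED_ON="2025-01-01"
DELETE_AFTER=$(date -d "${ARCHIVED_ON} + 30 days" +%Y-%m-%d)
echo "Source records deletable on or after ${DELETE_AFTER}"
```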

  3. Remove the staging copy of the archive package after confirming successful upload:
# Verify the cloud copy exists before deleting staging files;
# head-object exits non-zero if the object is missing
if aws s3api head-object \
        --bucket org-archives \
        --key programme-data/${ARCHIVE_ID}.tar > /dev/null 2>&1; then
    rm -rf /archive/staging/${ARCHIVE_ID}
    rm /archive/staging/${ARCHIVE_ID}.tar
    rm /archive/staging/${ARCHIVE_ID}.tar.sha256
    echo "Staging files removed"
else
    echo "ERROR: Cloud copy not verified, retaining staging files"
    exit 1
fi

Responding to legal holds

When a legal hold is issued, you must immediately suspend normal retention processing for affected data and preserve all copies, including archives.

  1. Upon receiving a legal hold notice, identify all data within scope by querying source systems and archive registries:
-- Flag active records under hold scope
UPDATE programme_data
SET legal_hold = TRUE,
    legal_hold_reference = 'LH-2025-003',
    legal_hold_date = CURRENT_DATE
WHERE project_id IN ('PRJ-2021-045', 'PRJ-2021-046')
   OR beneficiary_region = 'Eastern Province';

-- Identify archives containing potentially relevant data
SELECT archive_id, storage_location, date_range_start, date_range_end
FROM archive_registry
WHERE source_table = 'programme_data'
  AND date_range_end >= '2021-01-01';
  2. Disable any automated deletion or lifecycle policies that could affect held data:
# Suspend S3 lifecycle rule for affected archives.
# Caution: this call replaces the bucket's entire lifecycle
# configuration, so include every rule you intend to keep.
aws s3api put-bucket-lifecycle-configuration \
    --bucket org-archives \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "archive-expiry",
            "Status": "Disabled",
            "Filter": {"Prefix": "programme-data/"},
            "Expiration": {"Days": 3650}
        }]
    }'
  3. Document the hold in your legal hold register with scope, date issued, issuing authority, and affected data locations:
INSERT INTO legal_holds (
    hold_reference,
    issued_date,
    issued_by,
    matter_description,
    data_scope,
    affected_systems,
    affected_archives,
    status
) VALUES (
    'LH-2025-003',
    '2025-01-15',
    'General Counsel',
    'Employment dispute - Eastern Province operations',
    'All programme data for projects PRJ-2021-045, PRJ-2021-046 and Eastern Province beneficiary records',
    'programme_db',
    'prog-data-20230101-001, prog-data-20240101-001',
    'ACTIVE'
);
  4. When the legal hold is released, restore normal retention processing and update records:
-- Release hold on active records
UPDATE programme_data
SET legal_hold = FALSE,
    legal_hold_released = CURRENT_DATE
WHERE legal_hold_reference = 'LH-2025-003';

-- Update hold register
UPDATE legal_holds
SET status = 'RELEASED',
    released_date = CURRENT_DATE,
    released_by = 'General Counsel'
WHERE hold_reference = 'LH-2025-003';

Data that passed its retention date during the hold period becomes immediately eligible for disposal upon release.

Project and grant closeout archiving

Grant closeout archiving follows a compressed timeline with specific documentation requirements. Donor agreements typically require data preservation for 3-7 years after project end, with some requiring longer retention.
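Working out the retention end date is straightforward date arithmetic; this sketch assumes GNU `date` and uses the 7-year requirement from the USAID example below:

```shell
# Derive retain_until from the project end date and the donor's
# retention requirement (7 years here, matching the USAID example;
# GNU date assumed)
PROJECT_END="2024-12-31"
RETENTION_YEARS=7
RETAIN_UNTIL=$(date -d "${PROJECT_END} + ${RETENTION_YEARS} years" +%Y-%m-%d)
echo "retain_until=${RETAIN_UNTIL}"
```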

  1. Obtain the grant closeout checklist from the programme manager confirming all deliverables are submitted and final reports accepted. Do not begin archiving until programme confirms readiness.

  2. Query all project-related data across systems using the project or grant identifier:

-- Identify all data linked to closing grant
SELECT
    'programme_data' AS source_table,
    COUNT(*) AS records,
    MIN(created_at) AS earliest,
    MAX(modified_at) AS latest
FROM programme_data WHERE grant_id = 'GRANT-2020-123'
UNION ALL
SELECT
    'beneficiary_records',
    COUNT(*),
    MIN(created_at),
    MAX(modified_at)
FROM beneficiary_records WHERE grant_id = 'GRANT-2020-123'
UNION ALL
SELECT
    'financial_transactions',
    COUNT(*),
    MIN(created_at),
    MAX(modified_at)
FROM financial_transactions WHERE grant_id = 'GRANT-2020-123';
  3. Create a consolidated archive package containing all grant-related data with explicit donor retention requirements in the manifest:
GRANT_ID="GRANT-2020-123"
ARCHIVE_ID="grant-closeout-${GRANT_ID}-$(date +%Y%m%d)"

# Assemble the package in staging, as in the standard procedure
mkdir -p /archive/staging/${ARCHIVE_ID}/{data,metadata,checksums}
cd /archive/staging/${ARCHIVE_ID}

# Include donor retention requirements in manifest
cat > metadata/MANIFEST.json << EOF
{
  "archive_id": "${ARCHIVE_ID}",
  "archive_type": "grant_closeout",
  "grant_id": "${GRANT_ID}",
  "donor": "USAID",
  "project_end_date": "2024-12-31",
  "donor_retention_requirement": "7 years from project end",
  "retain_until": "2031-12-31",
  "created_at": "$(date -Iseconds)",
  ...
}
EOF
  4. Transfer the closeout archive to storage with retention lock matching donor requirements. USAID requires 7-year retention; FCDO typically requires 6 years; EU requires 5 years from final payment:
# Calculate retention date based on donor requirement
RETAIN_UNTIL="2031-12-31" # 7 years from 2024 project end

aws s3 cp ${ARCHIVE_ID}.tar \
    s3://org-archives/grant-closeouts/${ARCHIVE_ID}.tar \
    --storage-class GLACIER_IR

aws s3api put-object-retention \
    --bucket org-archives \
    --key grant-closeouts/${ARCHIVE_ID}.tar \
    --retention "{\"Mode\":\"COMPLIANCE\",\"RetainUntilDate\":\"${RETAIN_UNTIL}T00:00:00Z\"}"

Use COMPLIANCE mode for donor-mandated retention to prevent any override.

  5. Notify the programme manager and finance team that archiving is complete, providing the archive reference for inclusion in closeout documentation.

Verification

After completing archiving operations, verify the archive is complete, intact, and retrievable.

Test archive integrity by downloading and verifying checksums. Perform this test on every archive before marking source data for deletion:

# Download the archive and its checksum to a verification location
aws s3 cp s3://org-archives/programme-data/${ARCHIVE_ID}.tar \
    /archive/verify/${ARCHIVE_ID}.tar
aws s3 cp s3://org-archives/programme-data/${ARCHIVE_ID}.tar.sha256 \
    /archive/verify/${ARCHIVE_ID}.tar.sha256

# Verify package checksum
cd /archive/verify
sha256sum -c ${ARCHIVE_ID}.tar.sha256
# Expected output: prog-data-20250101-001.tar: OK

# Extract and verify internal checksums
tar -xf ${ARCHIVE_ID}.tar
cd ${ARCHIVE_ID}
sha256sum -c checksums/SHA256SUMS
# Expected output: all files show OK

If any checksum fails, the archive is corrupted. Do not proceed with source data deletion. Investigate the failure, regenerate the archive from source data, and re-upload.
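The failure mode is easy to demonstrate in isolation; this self-contained sketch uses a throwaway directory rather than a real archive:

```shell
# sha256sum -c exits non-zero when any listed file has changed, which
# is the signal to halt source-data deletion
workdir=$(mktemp -d)
cd "$workdir"
echo "original content" > file.txt
sha256sum file.txt > SHA256SUMS
echo "tampered content" > file.txt          # simulate corruption
if sha256sum -c --quiet SHA256SUMS > /dev/null 2>&1; then
    RESULT="ok"
else
    RESULT="corrupt"
fi
echo "integrity check: ${RESULT}"
```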

Test data retrievability by extracting sample records and comparing with source. Select 10-20 records at random and verify field values match:

# Extract specific records from archived CSV
head -1 data/programme_data.csv > /tmp/verify_sample.csv
grep -E "^(12345|67890|11111)," data/programme_data.csv >> /tmp/verify_sample.csv

# Compare with source (before deletion)
psql -h dbserver -U archive_user -d programme_db -c "
    SELECT * FROM programme_data
    WHERE id IN (12345, 67890, 11111)
" > /tmp/source_sample.txt

Visually compare the extracted records with source records. Any discrepancy indicates extraction or encoding problems requiring investigation.
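A mechanical comparison is less error-prone than eyeballing; this sketch uses inline toy data, where in practice the inputs would be the two sample files produced above:

```shell
# diff exits 0 when the samples match, non-zero on any discrepancy
workdir=$(mktemp -d)
printf 'id,amount\n12345,100\n' > "$workdir/archive_sample.csv"
printf 'id,amount\n12345,100\n' > "$workdir/source_sample.csv"
if diff -q "$workdir/archive_sample.csv" "$workdir/source_sample.csv" > /dev/null; then
    VERDICT="match"
else
    VERDICT="discrepancy"
fi
echo "sample comparison: ${VERDICT}"
```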

Verify metadata completeness by confirming the archive package contains all required documentation:

# Check required files exist (run from the extracted archive directory)
REQUIRED_FILES="metadata/MANIFEST.json metadata/schema.sql metadata/data_dictionary.txt checksums/SHA256SUMS"
for file in $REQUIRED_FILES; do
    if [ ! -f "$file" ]; then
        echo "MISSING: $file"
        exit 1
    fi
done
echo "All required metadata files present"
echo "All required metadata files present"

Verify archive registry entry exists and contains correct information:

SELECT
    archive_id,
    storage_location,
    checksum_sha256,
    retain_until
FROM archive_registry
WHERE archive_id = 'prog-data-20250101-001';

The query must return exactly one row with values matching the archive package.

Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| Upload fails with “Access Denied” | IAM policy missing write permission to target bucket or path | Verify IAM policy includes s3:PutObject for the target path; check bucket policy does not deny the operation |
| Checksum mismatch after upload | Network corruption during transfer or multipart upload reassembly error | Re-upload the file; for files over 5 GB, use --expected-size flag to ensure consistent multipart chunking |
| Archive extraction fails with “Unexpected EOF” | Incomplete upload or corrupted package | Verify source tar file integrity locally; re-upload if local file is intact; regenerate if local file corrupted |
| Cannot set retention lock | Bucket not configured for Object Lock or insufficient permissions | Object Lock must be enabled at bucket creation; cannot be added later; create new bucket with Object Lock enabled and migrate archives |
| Retrieval from Glacier takes longer than expected | Deep Archive tier selected instead of Instant Retrieval | Check storage class with head-object; for urgent retrieval, use Expedited retrieval option at higher cost |
| Encrypted archive cannot be decrypted | Key not recorded or key rotated | Verify key identifier in manifest matches available keys; restore key from backup if available; encrypted data is unrecoverable without key |
| Archive query returns no results | Archive not registered in catalogue or search criteria mismatch | Query archive_registry directly by known archive_id; verify registration completed; re-register if entry missing |
| “Insufficient storage” during packaging | Staging volume full | Extend staging volume or clean old staging files; archive packages require approximately 1.2x source data size during creation |
| CSV export contains corrupted characters | Encoding mismatch between source and export | Specify UTF-8 encoding explicitly in export command; verify source data encoding; convert if necessary |
| Legal hold records still being archived | Hold flag not checked in selection query | Add AND legal_hold = FALSE to archive candidate selection; update any records incorrectly archived |
| Lifecycle policy deleting archives early | Retention lock not applied or policy override | Verify retention lock status; reapply with COMPLIANCE mode; check IAM policies for lifecycle override permissions |
| Grant closeout archive rejected by donor | Missing required documentation or incorrect retention period | Review donor requirements; regenerate archive with complete documentation; adjust retention to match donor specification |

Automation

For organisations with regular archiving requirements, automate the identification and packaging steps while retaining manual approval for transfer and source deletion.

Create an archiving script that identifies candidates and prepares packages:

#!/bin/bash
# archive_prepare.sh - Identify and package archive candidates
set -euo pipefail

ARCHIVE_DATE=$(date +%Y%m%d)
STAGING_DIR="/archive/staging"
LOG_FILE="/var/log/archive/prepare-${ARCHIVE_DATE}.log"

# Connection details; psql reads these environment variables
export PGHOST=dbserver PGUSER=archive_user PGDATABASE=programme_db

mkdir -p "$(dirname "$LOG_FILE")"

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "$LOG_FILE"
}

# Identify candidates (-At: unaligned, tuples-only output)
log "Identifying archive candidates"
CANDIDATE_COUNT=$(psql -At -c "
    SELECT COUNT(*) FROM programme_data
    WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
      AND archived_at IS NULL
      AND legal_hold = FALSE
")

if [ "$CANDIDATE_COUNT" -eq 0 ]; then
    log "No candidates found, exiting"
    exit 0
fi
log "Found ${CANDIDATE_COUNT} candidates"

# Generate archive package
ARCHIVE_ID="prog-data-${ARCHIVE_DATE}-001"
log "Creating archive package: ${ARCHIVE_ID}"
mkdir -p "${STAGING_DIR}/${ARCHIVE_ID}"/{data,metadata,checksums}
cd "${STAGING_DIR}/${ARCHIVE_ID}"

# Export data (COPY ... TO STDOUT because psql's \COPY must fit on one line)
psql -c "COPY (SELECT * FROM programme_data
    WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
      AND archived_at IS NULL AND legal_hold = FALSE)
    TO STDOUT WITH (FORMAT CSV, HEADER TRUE)" > data/programme_data.csv

# Generate metadata and checksums
# ... (remaining steps from procedure)

log "Archive package ready for review: ${STAGING_DIR}/${ARCHIVE_ID}"
log "Manual approval required before transfer"

# Send notification
echo "Archive ${ARCHIVE_ID} ready for review" | \
    mail -s "Archive Ready for Approval" data-steward@example.org

Schedule the preparation script to run monthly:

# /etc/cron.d/archive-prepare
0 2 1 * * archive_user /opt/scripts/archive_prepare.sh

The script prepares archives but does not transfer them. A data steward reviews the prepared package and manually executes transfer after verification.

See also