
Data Archiving

Data archiving transfers data from primary storage to long-term repositories where it remains accessible but at lower cost and with different performance characteristics than operational systems. You perform archiving to meet retention obligations, reduce storage costs on high-performance systems, and preserve historical records for future reference. Unlike backup, which creates copies for disaster recovery, archiving moves data to a different tier with the expectation that retrieval will be infrequent.

Archive
A collection of data removed from operational systems and stored in long-term repositories with metadata sufficient for future retrieval and interpretation.
Cold storage
Storage tier optimised for infrequently accessed data, offering lowest cost per gigabyte but higher retrieval latency and often retrieval fees.
Legal hold
A directive requiring preservation of specified data regardless of normal retention schedules, typically issued in anticipation of litigation or regulatory investigation.
Archive package
A self-contained unit combining data files, metadata, checksums, and documentation sufficient to interpret the archive contents without external dependencies.

Prerequisites

Before beginning archiving operations, confirm the following requirements are satisfied.

You need an approved retention policy that specifies which data categories require archiving, the retention periods for each category, and the conditions under which archived data may be disposed. Without this policy, you cannot determine what to archive or for how long. The retention policy should reference your data classification scheme so you can identify sensitive data requiring encryption in archives. Obtain the current retention schedule from your records management function or data governance team.

You need access to archive storage infrastructure with capacity for the data volume you intend to archive plus 20% headroom for packaging overhead. Calculate expected archive size by querying source systems for data volumes matching your archive criteria. For cloud storage, confirm you have credentials with write permissions to the target bucket or container and that network egress costs are budgeted.
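The headroom calculation is simple shell arithmetic; in the sketch below, `SOURCE_BYTES` is an illustrative figure standing in for the result of a size query against the source system:

```shell
# Hypothetical figure: SOURCE_BYTES would come from a size query against
# the source system (e.g. SUM(pg_column_size(t.*)) in PostgreSQL)
SOURCE_BYTES=13207024435                          # roughly 12.3 GB
HEADROOM_BYTES=$(( SOURCE_BYTES / 5 ))            # 20% packaging headroom
REQUIRED_BYTES=$(( SOURCE_BYTES + HEADROOM_BYTES ))
echo "Provision at least ${REQUIRED_BYTES} bytes of archive capacity"
```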

You need data classification labels applied to source data so you can determine encryption requirements and access restrictions for archived data. Unclassified data cannot be archived because you cannot verify appropriate handling. If source data lacks classification, complete classification before proceeding.

You need checksums or hash values for source data to verify archive integrity. If source systems do not provide checksums, you must generate them before extraction. SHA-256 provides sufficient collision resistance for integrity verification.

You need documentation of the source system schema, field definitions, and any encoding or compression applied to source data. Archives without schema documentation become uninterpretable when source systems are decommissioned. Export schema definitions and data dictionaries as part of archive preparation.

For database archives, you need credentials with SELECT permissions on tables to be archived and, if performing destructive archiving, DELETE permissions. Coordinate with database administrators to schedule archive operations during low-usage periods if source systems cannot tolerate concurrent archiving load.

For grant or project closeout archiving, you need confirmation from the programme manager that the project has completed final reporting and that no outstanding data requests exist. Premature archiving of project data can obstruct audit responses.

Procedure

Identifying archive candidates

You identify data for archiving by applying retention policy criteria to source systems. The selection criteria derive from your retention schedule and typically include age thresholds, project status, or explicit archive flags set by data owners.
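For date-based criteria, the cutoff can be computed outside the database as well; this sketch assumes GNU `date`:

```shell
# Compute the cutoff for a 24-month age threshold (GNU date assumed);
# records last modified before this date become archive candidates
CUTOFF=$(date -d "24 months ago" +%Y-%m-%d)
echo "Archive records last modified before: ${CUTOFF}"
```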

  1. Query the source system to identify records meeting archive criteria. For date-based selection, calculate the cutoff date from your retention policy. If your policy specifies archiving data older than 24 months from last modification, calculate the threshold:
-- PostgreSQL: identify records for archiving
SELECT
    id,
    created_at,
    modified_at,
    pg_size_pretty(pg_column_size(t.*)) AS record_size
FROM programme_data t
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = FALSE;

Record the count and total size of candidate records. For this example, assume the query returns 145,000 records totalling 12.3 GB.

  2. Verify no legal holds apply to candidate records. Query your legal hold register or check the legal hold flag on individual records. Any record under legal hold must be excluded regardless of age:
-- Verify no legal holds in candidate set
SELECT COUNT(*) AS held_records
FROM programme_data
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = TRUE;

If this returns any non-zero count, investigate each hold before proceeding. Do not archive records under legal hold.

  3. Classify the archive candidate set by data classification level. This determines encryption requirements and storage tier:
SELECT
    data_classification,
    COUNT(*) AS record_count,
    SUM(pg_column_size(t.*)) AS total_bytes
FROM programme_data t
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = FALSE
GROUP BY data_classification;

Records classified as CONFIDENTIAL or higher require encryption before transfer to archive storage. Records classified as INTERNAL or PUBLIC may be archived without encryption depending on your security policy.

  4. Obtain approval from the data owner for the identified archive set. Send the record counts, date ranges, and classification breakdown to the responsible data steward. Document their approval with date and any conditions they specify.

Preparing archive packages

Archive packages bundle data with metadata to ensure future interpretability. A well-constructed archive package contains everything needed to understand and use the archived data without access to the original source system.

  1. Create a working directory for archive assembly with subdirectories for data, metadata, and checksums:
ARCHIVE_ID="prog-data-$(date +%Y%m%d)-001"
mkdir -p /archive/staging/${ARCHIVE_ID}/{data,metadata,checksums}
cd /archive/staging/${ARCHIVE_ID}
  2. Export the data from the source system in a format that preserves data types and handles special characters correctly. For relational data, use the database’s native export format or a portable format like CSV with explicit quoting:
# PostgreSQL export with explicit format controls.
# Server-side COPY ... TO STDOUT is used here because psql's \COPY
# meta-command must fit on a single line; redirecting STDOUT still
# writes the file on the client.
psql -h dbserver -U archive_user -d programme_db -c "COPY (
    SELECT * FROM programme_data
    WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
      AND archived_at IS NULL
      AND legal_hold = FALSE
) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, ENCODING 'UTF8', NULL '')" \
    > /archive/staging/${ARCHIVE_ID}/data/programme_data.csv

For binary data or documents, export to the data subdirectory preserving original filenames or using systematic naming:

# Export documents with preserved directory structure
# (-At gives unaligned, tuples-only output so paths carry no padding)
rsync -av --files-from=<(
    psql -At -c "SELECT file_path FROM documents
                 WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'"
) /documents/ /archive/staging/${ARCHIVE_ID}/data/documents/
  3. Generate the schema documentation capturing table structure, field definitions, data types, and constraints:
# Export PostgreSQL schema
pg_dump -h dbserver -U archive_user -d programme_db \
    --schema-only \
    --table=programme_data \
    > metadata/schema.sql

# Generate human-readable data dictionary
psql -h dbserver -U archive_user -d programme_db -c "
    SELECT
        column_name,
        data_type,
        character_maximum_length,
        is_nullable,
        column_default
    FROM information_schema.columns
    WHERE table_name = 'programme_data'
    ORDER BY ordinal_position
" > metadata/data_dictionary.txt
  4. Create an archive manifest documenting the archive contents, source system, extraction date, and retention requirements:
cat > metadata/MANIFEST.json << EOF
{
  "archive_id": "${ARCHIVE_ID}",
  "created_at": "$(date -Iseconds)",
  "created_by": "$(whoami)",
  "source_system": "programme_db.example.org",
  "source_table": "programme_data",
  "record_count": 145000,
  "date_range": {
    "earliest_record": "2019-03-15",
    "latest_record": "2022-12-31"
  },
  "data_classification": "INTERNAL",
  "retention_policy": "PROG-RET-001",
  "retain_until": "2032-01-01",
  "extraction_criteria": "modified_at < 2023-01-01",
  "files": [
    {"name": "data/programme_data.csv", "format": "CSV/UTF-8", "rows": 145000},
    {"name": "metadata/schema.sql", "format": "PostgreSQL DDL"},
    {"name": "metadata/data_dictionary.txt", "format": "text"}
  ]
}
EOF
  5. Generate checksums for all files in the archive package to enable integrity verification:
find data metadata -type f -exec sha256sum {} \; > checksums/SHA256SUMS
# Generate a checksum of the checksum file for tamper detection.
# Note this is a plain hash, not a cryptographic signature; sign the
# file with GPG if you also need proof of origin.
sha256sum checksums/SHA256SUMS > checksums/SHA256SUMS.sig
  6. For data classified as CONFIDENTIAL or higher, encrypt the data files before packaging. Use AES-256 encryption with a key managed through your secrets management system:
# Encrypt data files (for CONFIDENTIAL classification)
gpg --symmetric --cipher-algo AES256 \
    --output data/programme_data.csv.gpg \
    data/programme_data.csv

# Remove unencrypted file after successful encryption
rm data/programme_data.csv

# Regenerate checksums so SHA256SUMS lists the encrypted file and no
# longer references the removed plaintext file (a stale entry would
# make later verification fail)
find data metadata -type f -exec sha256sum {} \; > checksums/SHA256SUMS

Store the encryption key identifier in the manifest. Do not store the key itself in the archive.

  7. Package the archive directory into a single archive file for transfer:
cd /archive/staging
tar -cvf ${ARCHIVE_ID}.tar ${ARCHIVE_ID}/

# Generate checksum of the complete package
sha256sum ${ARCHIVE_ID}.tar > ${ARCHIVE_ID}.tar.sha256

The 12.3 GB of source data yields a package of roughly the same size: tar adds only packaging overhead and applies no compression. CSV text compresses well, so adding gzip (tar -czvf, producing a .tar.gz) typically shrinks a package like this substantially, to roughly 4.1 GB in this example; if you compress, carry the compressed filename through the later steps.

Selecting storage tier

Archive storage tiers trade access speed against cost. You select the tier based on expected retrieval frequency and acceptable retrieval latency.

             +----------------------------------+
             | How often will this data be      |
             | retrieved in the next 5 years?   |
             +----------------------------------+
                               |
         +---------------------+---------------------+
         |                     |                     |
         v                     v                     v
+------------------+  +------------------+  +------------------+
| More than        |  | 1-10 times       |  | Likely never,    |
| 10 times         |  | (annual audits,  |  | compliance       |
|                  |  | occasional       |  | retention only   |
|                  |  | reference)       |  |                  |
+--------+---------+  +--------+---------+  +--------+---------+
         |                     |                     |
         v                     v                     v
+------------------+  +------------------+  +------------------+
| WARM TIER        |  | COLD TIER        |  | ARCHIVE TIER     |
|                  |  |                  |  |                  |
| - S3 Standard-IA |  | - S3 Glacier     |  | - S3 Glacier     |
| - Azure Cool     |  |   Instant        |  |   Deep Archive   |
| - GCS Nearline   |  | - Azure Cold     |  | - Azure Archive  |
|                  |  | - GCS Coldline   |  | - GCS Archive    |
| Retrieval: mins  |  | Retrieval: mins  |  | Retrieval: hours |
| Cost: ~$0.01/GB  |  | Cost: ~$0.004/GB |  | Cost: ~$0.001/GB |
+------------------+  +------------------+  +------------------+

Figure 1: Storage tier selection based on retrieval frequency

The cost figures represent approximate monthly storage costs; retrieval adds roughly $0.01-0.03 per GB for cold tiers and $0.02-0.05 per GB for archive tiers. A 12 GB archive stored for 10 years in the deep archive tier costs approximately $1.44 in storage (12 GB x $0.001/GB-month x 120 months) plus retrieval costs when accessed. The same archive in the warm tier costs approximately $14.40 over 10 years but with immediate access.
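The arithmetic behind these figures can be checked directly, using the approximate per-GB monthly rates from Figure 1 (illustrative rates, not quotes; real prices vary by region and provider):

```shell
# 10-year storage cost comparison for a 12 GB archive, using the
# approximate per-GB monthly rates from Figure 1
SIZE_GB=12
MONTHS=120
DEEP_COST=$(awk -v gb="$SIZE_GB" -v m="$MONTHS" 'BEGIN { printf "%.2f", gb * 0.001 * m }')
WARM_COST=$(awk -v gb="$SIZE_GB" -v m="$MONTHS" 'BEGIN { printf "%.2f", gb * 0.01 * m }')
echo "archive tier storage over 10 years: \$${DEEP_COST}"
echo "warm tier storage over 10 years:    \$${WARM_COST}"
```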

For the programme data archive in this example, annual audit access is expected, making cold tier appropriate. Configure the storage class during upload.

Transferring to archive storage

Transfer the packaged archive to its designated storage tier with integrity verification at each stage.

  1. Upload the archive package to cloud storage with the appropriate storage class:
# AWS S3 with Glacier Instant Retrieval
aws s3 cp ${ARCHIVE_ID}.tar \
    s3://org-archives/programme-data/${ARCHIVE_ID}.tar \
    --storage-class GLACIER_IR \
    --metadata "retention-until=2032-01-01,classification=INTERNAL"

# Upload checksum file to standard storage for quick verification
aws s3 cp ${ARCHIVE_ID}.tar.sha256 \
    s3://org-archives/programme-data/${ARCHIVE_ID}.tar.sha256

For Azure Blob Storage:

# Azure with Cool tier
az storage blob upload \
    --account-name orgarchives \
    --container-name programme-data \
    --name ${ARCHIVE_ID}.tar \
    --file ${ARCHIVE_ID}.tar \
    --tier Cool \
    --metadata "retention-until=2032-01-01" "classification=INTERNAL"
  2. Verify the upload completed successfully by comparing checksums:
# Get the ETag (MD5 for single-part uploads) from S3
aws s3api head-object \
    --bucket org-archives \
    --key programme-data/${ARCHIVE_ID}.tar \
    --query 'ETag' --output text

# For large files uploaded in parts, download and verify
aws s3 cp s3://org-archives/programme-data/${ARCHIVE_ID}.tar - | \
    sha256sum | cut -d' ' -f1

# Compare with local checksum
cat ${ARCHIVE_ID}.tar.sha256

The checksums must match exactly. Any discrepancy indicates transfer corruption requiring re-upload.

  3. Apply retention locks if your storage platform supports them. Retention locks prevent deletion until the retention period expires, protecting against accidental or malicious deletion:
# AWS S3 Object Lock (requires bucket with Object Lock enabled)
aws s3api put-object-retention \
    --bucket org-archives \
    --key programme-data/${ARCHIVE_ID}.tar \
    --retention '{"Mode":"GOVERNANCE","RetainUntilDate":"2032-01-01T00:00:00Z"}'

Governance mode allows users with specific permissions to override the lock; compliance mode prevents any deletion until expiry, including by the root account.

  4. Register the archive in your archive index or data catalogue. This registration enables future discovery and retrieval:
INSERT INTO archive_registry (
    archive_id,
    source_system,
    source_table,
    record_count,
    date_range_start,
    date_range_end,
    storage_location,
    storage_tier,
    checksum_sha256,
    retain_until,
    data_classification,
    created_at,
    created_by
) VALUES (
    'prog-data-20250101-001',
    'programme_db.example.org',
    'programme_data',
    145000,
    '2019-03-15',
    '2022-12-31',
    's3://org-archives/programme-data/prog-data-20250101-001.tar',
    'GLACIER_IR',
    'a3f2b8c9d4e5f6...',
    '2032-01-01',
    'INTERNAL',
    CURRENT_TIMESTAMP,
    'archive_operator'
);

Marking source data as archived

After successful transfer and verification, update source records to indicate archival status. This prevents duplicate archiving and enables source data deletion if your retention policy permits.

  1. Update the archived flag on source records within a transaction:
BEGIN;

UPDATE programme_data
SET
    archived_at = CURRENT_TIMESTAMP,
    archive_reference = 'prog-data-20250101-001'
WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
  AND archived_at IS NULL
  AND legal_hold = FALSE;

-- Verify expected count was updated
-- Should match earlier candidate count: 145000
SELECT COUNT(*) FROM programme_data
WHERE archive_reference = 'prog-data-20250101-001';

COMMIT;
  2. If your retention policy permits deletion of archived source data, schedule the deletion for a grace period after archiving. A 30-day grace period allows discovery of archive problems before source data is removed:
-- Schedule deletion 30 days after archiving
INSERT INTO scheduled_deletions (
    table_name,
    selection_criteria,
    scheduled_for,
    archive_reference,
    approved_by
) VALUES (
    'programme_data',
    'archive_reference = ''prog-data-20250101-001''',
    CURRENT_DATE + INTERVAL '30 days',
    'prog-data-20250101-001',
    'data_steward_name'
);

Do not delete source data immediately after archiving. The grace period enables retrieval testing and discovery of any packaging errors.
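The earliest permissible deletion date is simple to derive outside the database too; this sketch assumes GNU `date` and uses the example archive date:

```shell
# Derive the earliest source-deletion date from the archive date plus
# the 30-day grace period (GNU date assumed; date is illustrative)
ARCHIVED_ON="2025-01-01"
DELETE_AFTER=$(date -d "${ARCHIVED_ON} + 30 days" +%Y-%m-%d)
echo "Source records deletable on or after ${DELETE_AFTER}"
```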

  3. Remove the staging copy of the archive package after confirming successful upload:
# Verify the cloud copy exists before deleting staging files;
# head-object exits non-zero if the object is missing
if aws s3api head-object \
        --bucket org-archives \
        --key programme-data/${ARCHIVE_ID}.tar > /dev/null 2>&1; then
    rm -rf /archive/staging/${ARCHIVE_ID}
    rm /archive/staging/${ARCHIVE_ID}.tar
    rm /archive/staging/${ARCHIVE_ID}.tar.sha256
    echo "Staging files removed"
else
    echo "ERROR: Cloud copy not verified, retaining staging files"
    exit 1
fi

Responding to legal holds

When a legal hold is issued, you must immediately suspend normal retention processing for affected data and preserve all copies, including archives.

  1. Upon receiving a legal hold notice, identify all data within scope by querying source systems and archive registries:
-- Flag active records under hold scope
UPDATE programme_data
SET legal_hold = TRUE,
    legal_hold_reference = 'LH-2025-003',
    legal_hold_date = CURRENT_DATE
WHERE project_id IN ('PRJ-2021-045', 'PRJ-2021-046')
   OR beneficiary_region = 'Eastern Province';

-- Identify archives containing potentially relevant data
SELECT archive_id, storage_location, date_range_start, date_range_end
FROM archive_registry
WHERE source_table = 'programme_data'
  AND date_range_end >= '2021-01-01';
  2. Disable any automated deletion or lifecycle policies that could affect held data:
# Suspend S3 lifecycle rule for affected archives.
# Caution: this call replaces the bucket's entire lifecycle
# configuration, so include every rule you intend to keep.
aws s3api put-bucket-lifecycle-configuration \
    --bucket org-archives \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "archive-expiry",
            "Status": "Disabled",
            "Filter": {"Prefix": "programme-data/"},
            "Expiration": {"Days": 3650}
        }]
    }'
  3. Document the hold in your legal hold register with scope, date issued, issuing authority, and affected data locations:
INSERT INTO legal_holds (
    hold_reference,
    issued_date,
    issued_by,
    matter_description,
    data_scope,
    affected_systems,
    affected_archives,
    status
) VALUES (
    'LH-2025-003',
    '2025-01-15',
    'General Counsel',
    'Employment dispute - Eastern Province operations',
    'All programme data for projects PRJ-2021-045, PRJ-2021-046 and Eastern Province beneficiary records',
    'programme_db',
    'prog-data-20230101-001, prog-data-20240101-001',
    'ACTIVE'
);
  4. When the legal hold is released, restore normal retention processing and update records:
-- Release hold on active records
UPDATE programme_data
SET legal_hold = FALSE,
    legal_hold_released = CURRENT_DATE
WHERE legal_hold_reference = 'LH-2025-003';

-- Update hold register
UPDATE legal_holds
SET status = 'RELEASED',
    released_date = CURRENT_DATE,
    released_by = 'General Counsel'
WHERE hold_reference = 'LH-2025-003';

Data that passed its retention date during the hold period becomes immediately eligible for disposal upon release.

Project and grant closeout archiving

Grant closeout archiving follows a compressed timeline with specific documentation requirements. Donor agreements typically require data preservation for 3-7 years after project end, with some requiring longer retention.
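Working out the retention end date is straightforward date arithmetic; this sketch assumes GNU `date` and uses the 7-year requirement from the USAID example below:

```shell
# Derive retain_until from the project end date and the donor's
# retention requirement (7 years here, matching the USAID example;
# GNU date assumed)
PROJECT_END="2024-12-31"
RETENTION_YEARS=7
RETAIN_UNTIL=$(date -d "${PROJECT_END} + ${RETENTION_YEARS} years" +%Y-%m-%d)
echo "retain_until=${RETAIN_UNTIL}"
```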

  1. Obtain the grant closeout checklist from the programme manager confirming all deliverables are submitted and final reports accepted. Do not begin archiving until programme confirms readiness.

  2. Query all project-related data across systems using the project or grant identifier:

-- Identify all data linked to closing grant
SELECT
    'programme_data' AS source_table,
    COUNT(*) AS records,
    MIN(created_at) AS earliest,
    MAX(modified_at) AS latest
FROM programme_data WHERE grant_id = 'GRANT-2020-123'
UNION ALL
SELECT
    'beneficiary_records',
    COUNT(*),
    MIN(created_at),
    MAX(modified_at)
FROM beneficiary_records WHERE grant_id = 'GRANT-2020-123'
UNION ALL
SELECT
    'financial_transactions',
    COUNT(*),
    MIN(created_at),
    MAX(modified_at)
FROM financial_transactions WHERE grant_id = 'GRANT-2020-123';
  3. Create a consolidated archive package containing all grant-related data with explicit donor retention requirements in the manifest:
GRANT_ID="GRANT-2020-123"
ARCHIVE_ID="grant-closeout-${GRANT_ID}-$(date +%Y%m%d)"

# Assemble the package in staging, as in the standard procedure
mkdir -p /archive/staging/${ARCHIVE_ID}/{data,metadata,checksums}
cd /archive/staging/${ARCHIVE_ID}

# Include donor retention requirements in manifest
cat > metadata/MANIFEST.json << EOF
{
  "archive_id": "${ARCHIVE_ID}",
  "archive_type": "grant_closeout",
  "grant_id": "${GRANT_ID}",
  "donor": "USAID",
  "project_end_date": "2024-12-31",
  "donor_retention_requirement": "7 years from project end",
  "retain_until": "2031-12-31",
  "created_at": "$(date -Iseconds)",
  ...
}
EOF
  4. Transfer the closeout archive to storage with retention lock matching donor requirements. USAID requires 7-year retention; FCDO typically requires 6 years; EU requires 5 years from final payment:
# Calculate retention date based on donor requirement
RETAIN_UNTIL="2031-12-31" # 7 years from 2024 project end

aws s3 cp ${ARCHIVE_ID}.tar \
    s3://org-archives/grant-closeouts/${ARCHIVE_ID}.tar \
    --storage-class GLACIER_IR

aws s3api put-object-retention \
    --bucket org-archives \
    --key grant-closeouts/${ARCHIVE_ID}.tar \
    --retention "{\"Mode\":\"COMPLIANCE\",\"RetainUntilDate\":\"${RETAIN_UNTIL}T00:00:00Z\"}"

Use COMPLIANCE mode for donor-mandated retention to prevent any override.

  5. Notify the programme manager and finance team that archiving is complete, providing the archive reference for inclusion in closeout documentation.

Verification

After completing archiving operations, verify the archive is complete, intact, and retrievable.

Test archive integrity by downloading and verifying checksums. Perform this test on every archive before marking source data for deletion:

# Download the archive and its checksum to a verification location
aws s3 cp s3://org-archives/programme-data/${ARCHIVE_ID}.tar \
    /archive/verify/${ARCHIVE_ID}.tar
aws s3 cp s3://org-archives/programme-data/${ARCHIVE_ID}.tar.sha256 \
    /archive/verify/${ARCHIVE_ID}.tar.sha256

# Verify package checksum
cd /archive/verify
sha256sum -c ${ARCHIVE_ID}.tar.sha256
# Expected output: prog-data-20250101-001.tar: OK

# Extract and verify internal checksums
tar -xf ${ARCHIVE_ID}.tar
cd ${ARCHIVE_ID}
sha256sum -c checksums/SHA256SUMS
# Expected output: all files show OK

If any checksum fails, the archive is corrupted. Do not proceed with source data deletion. Investigate the failure, regenerate the archive from source data, and re-upload.
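The failure mode is easy to demonstrate in isolation; this self-contained sketch uses a throwaway directory rather than a real archive:

```shell
# sha256sum -c exits non-zero when any listed file has changed, which
# is the signal to halt source-data deletion
workdir=$(mktemp -d)
cd "$workdir"
echo "original content" > file.txt
sha256sum file.txt > SHA256SUMS
echo "tampered content" > file.txt          # simulate corruption
if sha256sum -c --quiet SHA256SUMS > /dev/null 2>&1; then
    RESULT="ok"
else
    RESULT="corrupt"
fi
echo "integrity check: ${RESULT}"
```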

Test data retrievability by extracting sample records and comparing with source. Select 10-20 records at random and verify field values match:

# Extract specific records from archived CSV
head -1 data/programme_data.csv > /tmp/verify_sample.csv
grep -E "^(12345|67890|11111)," data/programme_data.csv >> /tmp/verify_sample.csv

# Compare with source (before deletion)
psql -h dbserver -U archive_user -d programme_db -c "
    SELECT * FROM programme_data
    WHERE id IN (12345, 67890, 11111)
" > /tmp/source_sample.txt

Visually compare the extracted records with source records. Any discrepancy indicates extraction or encoding problems requiring investigation.
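A mechanical comparison is less error-prone than eyeballing; this sketch uses inline toy data, where in practice the inputs would be the two sample files produced above:

```shell
# diff exits 0 when the samples match, non-zero on any discrepancy
workdir=$(mktemp -d)
printf 'id,amount\n12345,100\n' > "$workdir/archive_sample.csv"
printf 'id,amount\n12345,100\n' > "$workdir/source_sample.csv"
if diff -q "$workdir/archive_sample.csv" "$workdir/source_sample.csv" > /dev/null; then
    VERDICT="match"
else
    VERDICT="discrepancy"
fi
echo "sample comparison: ${VERDICT}"
```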

Verify metadata completeness by confirming the archive package contains all required documentation:

# Check required files exist (run from the extracted archive directory)
REQUIRED_FILES="metadata/MANIFEST.json metadata/schema.sql metadata/data_dictionary.txt checksums/SHA256SUMS"
for file in $REQUIRED_FILES; do
    if [ ! -f "$file" ]; then
        echo "MISSING: $file"
        exit 1
    fi
done
echo "All required metadata files present"
echo "All required metadata files present"

Verify archive registry entry exists and contains correct information:

SELECT
    archive_id,
    storage_location,
    checksum_sha256,
    retain_until
FROM archive_registry
WHERE archive_id = 'prog-data-20250101-001';

The query must return exactly one row with values matching the archive package.

Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| Upload fails with “Access Denied” | IAM policy missing write permission to target bucket or path | Verify IAM policy includes s3:PutObject for the target path; check bucket policy does not deny the operation |
| Checksum mismatch after upload | Network corruption during transfer or multipart upload reassembly error | Re-upload the file; for files over 5 GB, use --expected-size flag to ensure consistent multipart chunking |
| Archive extraction fails with “Unexpected EOF” | Incomplete upload or corrupted package | Verify source tar file integrity locally; re-upload if local file is intact; regenerate if local file corrupted |
| Cannot set retention lock | Bucket not configured for Object Lock or insufficient permissions | Object Lock must be enabled at bucket creation; cannot be added later; create new bucket with Object Lock enabled and migrate archives |
| Retrieval from Glacier takes longer than expected | Deep Archive tier selected instead of Instant Retrieval | Check storage class with head-object; for urgent retrieval, use Expedited retrieval option at higher cost |
| Encrypted archive cannot be decrypted | Key not recorded or key rotated | Verify key identifier in manifest matches available keys; restore key from backup if available; encrypted data is unrecoverable without key |
| Archive query returns no results | Archive not registered in catalogue or search criteria mismatch | Query archive_registry directly by known archive_id; verify registration completed; re-register if entry missing |
| “Insufficient storage” during packaging | Staging volume full | Extend staging volume or clean old staging files; archive packages require approximately 1.2x source data size during creation |
| CSV export contains corrupted characters | Encoding mismatch between source and export | Specify UTF-8 encoding explicitly in export command; verify source data encoding; convert if necessary |
| Legal hold records still being archived | Hold flag not checked in selection query | Add AND legal_hold = FALSE to archive candidate selection; update any records incorrectly archived |
| Lifecycle policy deleting archives early | Retention lock not applied or policy override | Verify retention lock status; reapply with COMPLIANCE mode; check IAM policies for lifecycle override permissions |
| Grant closeout archive rejected by donor | Missing required documentation or incorrect retention period | Review donor requirements; regenerate archive with complete documentation; adjust retention to match donor specification |

Automation

For organisations with regular archiving requirements, automate the identification and packaging steps while retaining manual approval for transfer and source deletion.

Create an archiving script that identifies candidates and prepares packages:

#!/bin/bash
# archive_prepare.sh - Identify and package archive candidates
set -euo pipefail

ARCHIVE_DATE=$(date +%Y%m%d)
STAGING_DIR="/archive/staging"
LOG_FILE="/var/log/archive/prepare-${ARCHIVE_DATE}.log"

# Connection details; psql reads these environment variables
export PGHOST=dbserver PGUSER=archive_user PGDATABASE=programme_db

mkdir -p "$(dirname "$LOG_FILE")"

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "$LOG_FILE"
}

# Identify candidates (-At: unaligned, tuples-only output)
log "Identifying archive candidates"
CANDIDATE_COUNT=$(psql -At -c "
    SELECT COUNT(*) FROM programme_data
    WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
      AND archived_at IS NULL
      AND legal_hold = FALSE
")

if [ "$CANDIDATE_COUNT" -eq 0 ]; then
    log "No candidates found, exiting"
    exit 0
fi
log "Found ${CANDIDATE_COUNT} candidates"

# Generate archive package
ARCHIVE_ID="prog-data-${ARCHIVE_DATE}-001"
log "Creating archive package: ${ARCHIVE_ID}"
mkdir -p "${STAGING_DIR}/${ARCHIVE_ID}"/{data,metadata,checksums}
cd "${STAGING_DIR}/${ARCHIVE_ID}"

# Export data (COPY ... TO STDOUT because psql's \COPY must fit on one line)
psql -c "COPY (SELECT * FROM programme_data
    WHERE modified_at < CURRENT_DATE - INTERVAL '24 months'
      AND archived_at IS NULL AND legal_hold = FALSE)
    TO STDOUT WITH (FORMAT CSV, HEADER TRUE)" > data/programme_data.csv

# Generate metadata and checksums
# ... (remaining steps from procedure)

log "Archive package ready for review: ${STAGING_DIR}/${ARCHIVE_ID}"
log "Manual approval required before transfer"

# Send notification
echo "Archive ${ARCHIVE_ID} ready for review" | \
    mail -s "Archive Ready for Approval" data-steward@example.org

Schedule the preparation script to run monthly:

# /etc/cron.d/archive-prepare
0 2 1 * * archive_user /opt/scripts/archive_prepare.sh

The script prepares archives but does not transfer them. A data steward reviews the prepared package and manually executes transfer after verification.

See also