Data Platform Architecture
A data platform is the integrated set of technologies that ingest, store, process, and serve an organisation’s data assets. The platform provides the foundation upon which all analytical and operational data workloads execute, determining what questions can be answered, how quickly insights emerge, and what resources are consumed in the process. For mission-driven organisations managing programme data, beneficiary information, financial records, and operational metrics, platform architecture decisions shape both immediate capabilities and long-term flexibility.
- Data Platform
- The complete technical infrastructure for data management, comprising storage systems, compute engines, integration mechanisms, and access interfaces operating as a coherent whole.
- Data Warehouse
- A central repository optimised for analytical queries, storing structured data organised according to dimensional or relational models with schema enforced at write time.
- Data Lake
- A storage repository holding raw data in native format until needed, accepting structured, semi-structured, and unstructured data with schema applied at read time.
- Lakehouse
- An architecture combining data lake storage economics with data warehouse transaction guarantees and query performance through table formats that add structure to lake storage.
- Data Mesh
- A decentralised architecture where domain teams own and publish data products, connected through federated governance and self-serve infrastructure rather than centralised platforms.
Platform Components
Every data platform comprises five functional layers that work together to move data from source systems to consumption by analysts, applications, and reports. The layers separate concerns while enabling data to flow through transformation and enrichment stages.
[Diagram: five stacked layers. Access layer (BI tools, SQL clients, APIs, ML notebooks); Semantic layer (metrics definitions, dimensions and hierarchies, business rules and calculations); Compute layer (batch processing, interactive query, stream processing, ML training); Storage layer (hot SSD/memory, warm HDD/object, cold archive); Ingestion layer (batch extract, streaming ingest, change capture, file upload).]
Figure 1: Data platform functional layers showing the flow from ingestion through storage, compute, semantics, to access
The ingestion layer receives data from source systems through batch extraction, streaming connectors, change data capture, or file uploads. Batch extraction pulls complete datasets or incremental changes on schedules ranging from hourly to weekly. Streaming ingestion handles continuous data flows with sub-second latency. Change data capture monitors source system transaction logs to detect modifications without querying the source directly. File upload mechanisms accept manual data submissions common in humanitarian contexts where partners provide spreadsheets or exports.
The storage layer persists data across tiers optimised for different access patterns and cost profiles. Hot storage uses solid-state drives or in-memory systems for data requiring sub-second query response. Warm storage uses object storage or hard drives for data accessed daily or weekly. Cold storage archives data accessed infrequently but retained for compliance or historical analysis. A beneficiary database spanning five years might keep current-year data in hot storage, previous two years in warm storage, and older records in cold archive.
The compute layer executes transformations and queries against stored data. Batch processing handles large-scale transformations running for minutes to hours. Interactive query engines serve analyst requests expecting responses within seconds. Stream processing transforms data continuously as it arrives. Machine learning workloads train models on historical data and score new records.
The semantic layer defines business meaning on top of technical storage. Metrics specify calculations like “active beneficiaries” as distinct individuals receiving services within the past 90 days. Dimensions establish hierarchies such as geography (country, region, district, location) or time (year, quarter, month, week). Business rules encode organisational logic like fiscal year boundaries or programme eligibility criteria.
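A semantic-layer metric can be pinned down as code rather than left to each analyst's interpretation. A minimal sketch in Python, assuming beneficiary service events are available as (beneficiary_id, service_date) pairs; the 90-day window follows the "active beneficiaries" definition above:

```python
from datetime import date, timedelta

def active_beneficiaries(service_records, as_of, window_days=90):
    """Count distinct individuals with at least one service event
    inside the trailing window, per the semantic-layer definition
    of "active beneficiaries"."""
    cutoff = as_of - timedelta(days=window_days)
    return len({bid for bid, service_date in service_records
                if cutoff < service_date <= as_of})

records = [
    ("B001", date(2024, 10, 1)),
    ("B001", date(2024, 11, 5)),   # same person, counted once
    ("B002", date(2024, 6, 1)),    # outside the 90-day window
    ("B003", date(2024, 9, 20)),
]
print(active_beneficiaries(records, as_of=date(2024, 11, 16)))  # 2
```

Encoding the window and the distinct-count rule in one place is what keeps "active beneficiaries" meaning the same thing in every dashboard that uses it.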
The access layer provides interfaces for humans and systems to consume data. BI tools connect for dashboard visualisation and report generation. SQL clients enable ad-hoc analytical queries. APIs expose data to applications programmatically. Notebooks support data science exploration and model development.
Architecture Patterns
Four dominant patterns structure how data platforms organise these components. Each pattern makes different tradeoffs between structure, flexibility, performance, and operational complexity. Organisations select patterns based on their data diversity, query requirements, technical capacity, and budget constraints.
Data Warehouse Pattern
The data warehouse pattern centralises analytical data in a structured repository with schema defined before data loading. All data conforms to dimensional or relational models, enabling consistent query semantics and predictable performance.
[Diagram: source systems (Finance system, Programme database, HR system, CRM platform) feed ETL processes (Extract, Transform, Load) into the data warehouse, which comprises a staging area (raw loads, validation, cleansing), an integration layer (conformed dimensions, history), and a presentation layer (star schemas, aggregates), serving BI and reporting.]
Figure 2: Data warehouse pattern with ETL processing feeding structured layers from staging through integration to presentation
Data enters through ETL (Extract, Transform, Load) processes that pull from source systems, apply transformations to conform data to warehouse models, and load into staging areas. The staging area holds raw extracts temporarily while validation and cleansing execute. The integration layer maintains conformed dimensions shared across subject areas and preserves historical changes. The presentation layer organises data into star schemas optimised for analytical queries.
Star schemas structure data around central fact tables recording measurements (beneficiaries served, funds disbursed, items distributed) surrounded by dimension tables providing context (who, what, where, when). A distribution fact table records each distribution event with foreign keys to beneficiary, location, date, item, and programme dimensions. Queries join facts to dimensions to answer questions like “How many hygiene kits were distributed in Northern Region during Q3?”
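The question at the end of that paragraph maps directly onto a fact-to-dimension join. A small illustration using Python's built-in sqlite3; the table and column names are hypothetical, not a prescribed schema:

```python
import sqlite3

# Illustrative star schema: a distribution fact table with
# location, item, and date dimensions (names are hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_distribution (
    location_id INTEGER, item_id INTEGER, date_id INTEGER,
    quantity INTEGER);
""")
con.executemany("INSERT INTO dim_location VALUES (?, ?)",
                [(1, "Northern"), (2, "Southern")])
con.executemany("INSERT INTO dim_item VALUES (?, ?)",
                [(1, "Hygiene kit"), (2, "Food parcel")])
con.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(1, "Q3"), (2, "Q4")])
con.executemany("INSERT INTO fact_distribution VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 120), (1, 1, 1, 80), (2, 1, 1, 50), (1, 2, 1, 30)])

# "How many hygiene kits were distributed in Northern Region during Q3?"
row = con.execute("""
    SELECT SUM(f.quantity)
    FROM fact_distribution f
    JOIN dim_location l ON f.location_id = l.location_id
    JOIN dim_item     i ON f.item_id     = i.item_id
    JOIN dim_date     d ON f.date_id     = d.date_id
    WHERE l.region = 'Northern' AND i.item_name = 'Hygiene kit'
      AND d.quarter = 'Q3'
""").fetchone()
print(row[0])  # 200
```

The filter conditions live entirely in the dimension tables; the fact table contributes only measurements, which is what makes the pattern predictable for query optimisers.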
The warehouse pattern excels when data is structured, query patterns are predictable, and data quality requirements are strict. Finance reporting, donor compliance reports, and operational dashboards benefit from warehouse architecture. The pattern struggles with unstructured data like documents and images, with rapidly changing schemas, and with exploratory analysis requiring access to raw data.
Warehouse technology options span self-hosted and cloud-managed services. PostgreSQL with appropriate indexing handles warehouses under 500 GB. Apache Hive provides warehouse semantics on Hadoop clusters for organisations with existing big data infrastructure. Cloud data warehouses like BigQuery, Snowflake, Redshift, or Synapse offer managed services with consumption-based pricing.
Data Lake Pattern
The data lake pattern stores data in raw format without upfront schema definition, deferring structure until query time. This schema-on-read approach accepts any data format and preserves original fidelity.
[Diagram: sources (databases, files, APIs, streams) flow through ingestion into data lake storage organised as three zones: raw zone (/raw/, immutable landing in original format, partitioned by source and date), curated zone (/curated/, cleansed, validated, standardised, schema applied, PII masked), and consumption zone (/consumption/, aggregated, domain-specific datasets modelled for use cases). Multiple engines query the lake: SQL engines (Athena, Presto), Spark processing, and ML workbenches.]
Figure 3: Data lake pattern showing zone-based organisation from raw through curated to consumption with multiple query engines
Lakes organise data into zones reflecting transformation state. The raw zone stores data exactly as received from sources, immutable and partitioned by source system and ingestion date. A CSV file from a partner organisation lands in /raw/partner_reports/2024/11/16/distribution_data.csv preserving the original content. The curated zone contains data that has been validated, cleansed, and standardised. Schemas are applied, personal data is masked or encrypted, and formats are converted to efficient columnar storage like Parquet. The consumption zone holds data aggregated and modelled for specific use cases, ready for direct query by analysts.
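The raw-zone convention above (partition by source system, then ingestion date) is easy to enforce in code so that every ingestion job lands files the same way. A minimal sketch, mirroring the path layout in the text:

```python
from datetime import date
from pathlib import PurePosixPath

def raw_zone_path(source: str, received: date, filename: str) -> str:
    """Build an immutable raw-zone key partitioned by source system
    and ingestion date, matching the /raw/<source>/<Y>/<m>/<d>/ layout."""
    return str(PurePosixPath("raw", source,
                             f"{received:%Y}", f"{received:%m}",
                             f"{received:%d}", filename))

print(raw_zone_path("partner_reports", date(2024, 11, 16),
                    "distribution_data.csv"))
# raw/partner_reports/2024/11/16/distribution_data.csv
```

Date-based partitioning keeps raw data append-only: a re-delivered file lands under a new ingestion date rather than overwriting the original.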
The lake pattern handles diverse data types that warehouses cannot accommodate. Survey responses, satellite imagery, sensor readings, audio recordings, and unstructured documents are all stored in lakes alongside structured operational extracts. Lakes support exploratory analysis where analysts query raw data before determining which transformations to apply.
Lakes risk becoming data swamps when governance is absent. Without metadata cataloguing, users cannot discover what data exists. Without quality monitoring, unreliable data proliferates. Without access controls, sensitive data becomes exposed. Effective lake implementations require investment in cataloguing (Apache Atlas, DataHub), quality profiling, and fine-grained access policies.
Object storage provides the foundation for most lake implementations. AWS S3, Google Cloud Storage, Azure Blob Storage, and MinIO (self-hosted) offer the scale, durability, and economics lakes require. Query engines including Trino, Presto, Spark SQL, and Athena execute SQL against lake storage without loading data into separate systems.
Lakehouse Pattern
The lakehouse pattern adds transactional guarantees and warehouse-like features to lake storage through table formats that manage metadata alongside data files. This combination enables warehouse query patterns directly on lake storage.
[Diagram: object storage (S3 / GCS / ADLS / MinIO) hosts a table format layer (Delta / Iceberg / Hudi) comprising a transaction log (ACID commits, time travel, rollback), metadata (schema, partitions, statistics), data files (Parquet), and delete vectors (row deletions). Consumers include BI query engines, batch ETL (Spark/dbt), and streaming ingestion.]
Figure 4: Lakehouse pattern with table format layer providing ACID transactions on object storage
Table formats including Delta Lake, Apache Iceberg, and Apache Hudi store transaction logs and metadata alongside Parquet data files. The transaction log records every modification as an atomic commit, enabling ACID properties (Atomicity, Consistency, Isolation, Durability) that prevent partial writes and ensure consistent reads. Time travel capabilities allow queries against historical table states: “Show beneficiary registrations as of 1 October” executes against the table snapshot from that date. Failed transformations roll back cleanly without corrupting table state.
Delete vectors track row-level deletions without rewriting entire files, enabling efficient updates and GDPR erasure compliance. Schema evolution handles column additions and type changes without breaking existing queries. Partition evolution allows changing partition schemes as query patterns shift.
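The commit-log idea behind these table formats is simple enough to sketch. This is a toy in-memory model, not Delta's or Iceberg's actual on-disk format: each commit atomically records the data files it adds and removes, and replaying any log prefix reconstructs a consistent snapshot, which is the basis of time travel:

```python
class TableLog:
    """Toy commit log for a lakehouse-style table. Each commit
    atomically records files added and removed; replaying commits
    0..n yields the table snapshot as of version n."""
    def __init__(self):
        self.commits = []  # list of (added, removed) file sets

    def commit(self, added=(), removed=()):
        self.commits.append((set(added), set(removed)))
        return len(self.commits) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Active data files as of a given version (default: latest)."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for added, removed in self.commits[:version + 1]:
            files |= added
            files -= removed
        return files

log = TableLog()
v0 = log.commit(added=["part-000.parquet"])
v1 = log.commit(added=["part-001.parquet"])
v2 = log.commit(added=["part-002.parquet"], removed=["part-000.parquet"])
print(sorted(log.snapshot()))    # current state after the rewrite commit
print(sorted(log.snapshot(v0)))  # "time travel" back to version 0
```

Because readers only ever see fully committed log entries, a failed transformation that never commits leaves the table state untouched, which is the clean rollback behaviour described above.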
The lakehouse pattern suits organisations wanting warehouse capabilities without separate warehouse infrastructure. A single storage layer serves both raw data preservation and structured analytical queries. Compute costs fall when BI tools query lakehouse tables directly rather than relying on data loaded into a dedicated warehouse.
Lakehouse implementations require more technical sophistication than managed warehouse services. Table format selection (Delta versus Iceberg versus Hudi) involves tradeoffs in ecosystem support, performance characteristics, and operational complexity. Query engine configuration affects performance significantly. Organisations with limited data engineering capacity benefit from managed lakehouse services like Databricks or cloud-native offerings that handle infrastructure operations.
Data Mesh Pattern
The data mesh pattern decentralises data ownership to domain teams who publish data as products. Central platform teams provide self-serve infrastructure rather than managing all data directly.
[Diagram: two example domains each own data products. Programmes domain: beneficiary and distribution data products; Finance domain: grant and expenditure data products; every product carries a schema, quality SLOs, documentation, and an access policy. Both domains sit on federated governance (global policies, interoperability standards, security baseline) and a self-serve platform (data product templates, compute provisioning, access management).]
Figure 5: Data mesh pattern with domain-owned data products governed by federated policies and supported by self-serve infrastructure
A data product packages data with everything consumers need: schema definitions, quality service level objectives, documentation, lineage, and access policies. The programmes domain publishes a beneficiary data product containing deduplicated, validated beneficiary records with documented freshness guarantees and approved access procedures. Consumers discover products through a central catalogue, understand quality and coverage through published metadata, and access data through standardised interfaces.
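The "everything consumers need" list above can be made concrete as a product descriptor. A minimal sketch; the field names are hypothetical and not a formal mesh standard:

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """Illustrative data-product descriptor: the metadata a mesh
    consumer needs alongside the data itself."""
    name: str
    domain: str
    schema: dict              # column name -> type
    freshness_slo_hours: int  # maximum acceptable staleness
    documentation_url: str
    access_policy: str

beneficiaries = DataProduct(
    name="beneficiaries",
    domain="programmes",
    schema={"beneficiary_id": "string", "registered_on": "date"},
    freshness_slo_hours=24,
    documentation_url="https://example.org/catalogue/beneficiaries",
    access_policy="restricted: programme staff and approved analysts",
)
print(beneficiaries.name, beneficiaries.freshness_slo_hours)
```

Publishing descriptors like this into a central catalogue is what lets consumers in other domains judge fitness for purpose before requesting access.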
Federated governance establishes organisation-wide standards while respecting domain autonomy. Global policies mandate encryption at rest, personal data handling, and retention requirements. Interoperability standards ensure data products can join across domains through shared identifiers and compatible formats. Security baselines specify minimum access controls all products must implement.
Self-serve infrastructure enables domain teams to build data products without platform team bottlenecks. Templates provision storage, compute, and catalogue entries through standardised interfaces. Domain teams select configurations and deploy products independently while the platform handles underlying infrastructure complexity.
The mesh pattern suits organisations with strong domain teams, diverse data ownership, and coordination challenges that make centralisation impractical. Large federations where country offices manage distinct programmes benefit from mesh principles. The pattern requires significant organisational maturity: domain teams must accept data ownership responsibility, governance must function across autonomous units, and platform capabilities must genuinely enable self-service.
Most mission-driven organisations lack the scale and technical capacity for full mesh implementation. However, mesh principles apply selectively. Treating data as a product with explicit quality guarantees and documentation improves data reliability regardless of architecture. Federated governance models accommodate regional autonomy common in international organisations. Self-serve capabilities reduce central team bottlenecks even in primarily centralised architectures.
Storage Architecture
Storage layer design balances query performance, storage costs, and data durability across access patterns that vary by orders of magnitude. The tier structure optimises each access pattern while enabling data movement between tiers as access frequency changes.
[Diagram: three tiers matched to access patterns. Hot (accessed multiple times per day): SSD or NVMe storage, in-memory caching, sub-second queries, $0.10-0.20/GB/month. Warm (daily to weekly access): object storage, columnar formats, seconds to minutes, $0.02-0.05/GB/month. Cold (rarely accessed): archive storage class, retrieval fees apply, minutes to hours, $0.004-0.01/GB/month. Age-based or access-based migration policies move data from hot to warm; retention-based archival policies move data from warm to cold.]
Figure 6: Storage tier architecture showing access patterns, tier characteristics, and migration policies
Hot storage serves data queried multiple times per day requiring sub-second response. Current operational datasets, active dashboards, and frequently accessed reference data reside in hot storage. Implementation uses database tables with appropriate indexes, in-memory caches like Redis, or SSD-backed storage. Monthly costs range from $0.10 to $0.20 per GB depending on performance requirements.
A beneficiary registration system with 200,000 active records consuming 15 GB hot storage costs approximately $2.25 per month for storage alone, with additional costs for compute and queries.
Warm storage holds data accessed daily to weekly where query response within seconds to minutes is acceptable. Historical operational data, completed project records, and analytical datasets for periodic reporting reside in warm storage. Implementation uses object storage (S3, GCS, Azure Blob) with data in columnar formats like Parquet that compress effectively and enable efficient analytical queries. Monthly costs range from $0.02 to $0.05 per GB.
An organisation retaining three years of programme data at 500 GB in warm storage pays approximately $15 per month.
Cold storage archives data accessed infrequently but retained for compliance, legal holds, or potential future analysis. Completed grant records, historical financial data beyond reporting requirements, and raw data preserved for reproducibility reside in cold storage. Implementation uses archive storage classes (S3 Glacier, Archive Storage) with retrieval fees and latency measured in minutes to hours. Monthly costs range from $0.004 to $0.01 per GB.
Seven years of archived financial records at 2 TB in cold storage costs approximately $12 per month.
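The three cost figures above are simple per-GB arithmetic. A quick sanity check, assuming mid-range unit prices from the tier descriptions ($0.15 hot, $0.03 warm, $0.006 cold per GB per month):

```python
def monthly_storage_cost(gb: float, price_per_gb: float) -> float:
    """Monthly storage cost at a flat per-GB rate. Compute,
    request, and retrieval fees are excluded."""
    return gb * price_per_gb

print(monthly_storage_cost(15, 0.15))     # hot:  15 GB beneficiary data
print(monthly_storage_cost(500, 0.03))    # warm: 3 years of programme data
print(monthly_storage_cost(2048, 0.006))  # cold: 7 years of archives (2 TB)
```

The gap between tiers is the point: the same 2 TB that costs roughly $12 per month in cold storage would cost around $300 per month if left in hot storage.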
Lifecycle policies automate tier transitions based on age or access frequency. A policy might specify: data older than 90 days moves from hot to warm storage; data older than two years moves from warm to cold; data exceeding retention period deletes automatically. Implementation uses cloud provider lifecycle rules or custom automation monitoring access patterns.
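The example policy in that paragraph can be expressed as a tiering function, which is useful whether it runs as custom automation or simply documents the cloud provider lifecycle rules. A minimal sketch of that specific policy:

```python
from datetime import date, timedelta

def storage_tier(last_modified: date, today: date,
                 retention_years: int = 7) -> str:
    """Apply the example lifecycle policy from the text:
    hot under 90 days, warm under 2 years, cold until the
    retention period expires, then delete."""
    age = today - last_modified
    if age > timedelta(days=365 * retention_years):
        return "delete"
    if age > timedelta(days=365 * 2):
        return "cold"
    if age > timedelta(days=90):
        return "warm"
    return "hot"

today = date(2024, 11, 16)
print(storage_tier(date(2024, 10, 1), today))  # hot
print(storage_tier(date(2024, 1, 1), today))   # warm
print(storage_tier(date(2021, 1, 1), today))   # cold
print(storage_tier(date(2016, 1, 1), today))   # delete
```

Checking the most destructive rule (deletion) first makes the policy safe to extend: adding a new intermediate tier cannot accidentally keep data past its retention period.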
Storage format selection affects both costs and query performance. Row-oriented formats like CSV store data record-by-record, efficient for transactional access but inefficient for analytical queries reading specific columns across millions of rows. Columnar formats like Parquet store data column-by-column, enabling queries to read only needed columns and achieving compression ratios of 5:1 to 10:1 compared to CSV. Converting 100 GB of CSV beneficiary data to Parquet yields approximately 15 GB with faster analytical query execution.
Compute Architecture
Compute resources transform and query data through engines optimised for different workload characteristics. Matching compute patterns to workload requirements prevents both under-provisioning that delays results and over-provisioning that wastes resources.
Batch compute processes large data volumes in scheduled jobs running minutes to hours. Daily aggregations, periodic report generation, and historical data transformations execute as batch workloads. Batch engines including Spark, dbt, and scheduled SQL execute transformations against data at rest. Resource allocation provisions sufficient capacity for job completion within required windows, then releases resources. A nightly transformation job processing the day’s programme data provisions compute at midnight, executes for 45 minutes, and terminates.
Interactive compute serves analyst queries expecting responses within seconds. Dashboard rendering, ad-hoc exploration, and self-service analytics require interactive compute. Query engines including Trino, Presto, Athena, and BigQuery optimise for rapid response through query planning, caching, and parallel execution. Resources provision persistently during business hours or scale dynamically with demand.
Streaming compute processes data continuously as it arrives with latency measured in milliseconds to seconds. Real-time alerting, live dashboards, and continuous data validation require streaming compute. Engines including Kafka Streams, Flink, and Spark Streaming maintain processing state and handle late-arriving data. Resources provision continuously with capacity for peak throughput.
Compute scaling strategies affect cost and responsiveness. Fixed provisioning allocates static resources sized for peak demand, simple to manage but wasteful during quiet periods. Autoscaling adjusts resources based on queue depth or query concurrency, efficient but complex to tune. Serverless models charge per query or per processing unit, eliminating idle costs but potentially expensive at scale.
An organisation running interactive queries during business hours and batch jobs overnight might provision fixed interactive compute from 07:00 to 19:00, autoscaling batch compute for nightly jobs, and serverless capacity for occasional off-hours queries.
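The fixed-versus-serverless choice comes down to a break-even query volume. A back-of-envelope model with hypothetical prices (a warehouse node at $2 per hour for a 12-hour business day, versus $0.05 per serverless query, over 30 days); real pricing varies widely by provider and workload:

```python
def fixed_cost(hours_provisioned: float, rate_per_hour: float) -> float:
    """Fixed provisioning: pay for every provisioned hour, busy or idle."""
    return hours_provisioned * rate_per_hour

def serverless_cost(queries: int, cost_per_query: float) -> float:
    """Serverless: pay per query, with no idle cost."""
    return queries * cost_per_query

monthly_fixed = fixed_cost(12 * 30, 2.0)   # 12 h/day x 30 days at $2/h
print(monthly_fixed)                        # 720.0

# Break-even monthly query volume: below this, serverless is cheaper.
print(monthly_fixed / 0.05)                 # 14400.0
```

Under these assumed prices an organisation running fewer than roughly 14,000 queries a month would pay less on serverless, which is why mixed strategies (fixed for predictable daytime load, serverless for off-hours) are common.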
Integration Mechanisms
Data flows between platform components and external systems through integration mechanisms that vary in latency, reliability, and operational complexity.
Batch integration moves data in discrete scheduled transfers. File-based exchanges export CSV or JSON files to shared storage for downstream pickup. Database replication copies table snapshots or incremental changes between databases. API polling retrieves current state from web services on schedules. Batch integration suits workloads tolerating staleness measured in hours and benefits from simplicity and predictability.
Change data capture streams database modifications as they occur. CDC tools monitor source database transaction logs and publish insert, update, and delete events to message queues or direct to consumers. Debezium (open source) supports PostgreSQL, MySQL, SQL Server, and MongoDB. CDC provides near-real-time data freshness without querying source systems, reducing load on operational databases.
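A CDC consumer typically applies change events to a downstream replica. A simplified sketch using a Debezium-style envelope (an `op` code with `before` and `after` row images); real Debezium events also carry source metadata, timestamps, and schema information:

```python
def apply_change_event(table: dict, event: dict) -> None:
    """Apply one simplified Debezium-style change event to an
    in-memory replica keyed by 'id'."""
    op = event["op"]
    if op in ("c", "r", "u"):            # create, snapshot read, update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                       # delete
        table.pop(event["before"]["id"], None)

replica = {}
events = [
    {"op": "c", "after": {"id": 1, "status": "registered"}},
    {"op": "u", "before": {"id": 1, "status": "registered"},
                "after":  {"id": 1, "status": "verified"}},
    {"op": "c", "after": {"id": 2, "status": "registered"}},
    {"op": "d", "before": {"id": 2, "status": "registered"}},
]
for e in events:
    apply_change_event(replica, e)
print(replica)  # only record 1 survives, with its updated status
```

Because events arrive in transaction-log order, replaying them in sequence reproduces the source table's state without ever querying the source directly.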
Event streaming moves data through persistent message brokers. Producers publish events to topics; consumers subscribe to topics and process events. Kafka and Pulsar provide durable, ordered event streams with retention enabling replay. Streaming suits high-volume, low-latency integration where systems must react to changes within seconds.
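The publish/subscribe-with-replay behaviour that paragraph attributes to Kafka and Pulsar can be modelled in a few lines. A toy in-memory topic, not a broker client; it exists only to show how per-consumer offsets make retention and replay work:

```python
class Topic:
    """Toy event topic: an append-only log with per-consumer
    offsets. Retained events can be replayed by resetting an
    offset, as with Kafka or Pulsar consumers."""
    def __init__(self):
        self.log = []
        self.offsets = {}   # consumer name -> next offset to read

    def publish(self, event):
        self.log.append(event)

    def poll(self, consumer):
        """Return this consumer's unread events and advance its offset."""
        start = self.offsets.get(consumer, 0)
        events = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return events

    def seek(self, consumer, offset=0):
        """Rewind a consumer to replay retained events."""
        self.offsets[consumer] = offset

t = Topic()
t.publish({"type": "distribution", "kits": 120})
t.publish({"type": "distribution", "kits": 80})
print(t.poll("dashboard"))       # both events on first read
print(t.poll("dashboard"))       # [] -- nothing new yet
t.seek("dashboard")              # replay from the beginning
print(len(t.poll("dashboard")))  # 2
```

Keeping offsets per consumer is what lets a new system (say, a quality monitor) join later and process the full retained history without disturbing existing consumers.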
API integration exposes data through request-response interfaces. REST APIs serve synchronous requests for current data. GraphQL provides flexible query capabilities. API integration suits application-to-application data access where consumers need specific subsets of current data.
Platform Evolution
Data platforms evolve through stages as organisational data maturity and technical capacity increase. Starting with appropriate scope and growing incrementally prevents both under-investment that limits capability and over-engineering that exceeds capacity to operate.
[Diagram: four sequential stages. Stage 1, Foundation: single reporting database (PostgreSQL), basic ETL scripts (Python, cron), spreadsheet analysis, file backup to cloud storage; $0-500/month, one person part-time. Stage 2, Structured Analytics: dedicated warehouse or lakehouse, orchestrated pipelines (Airflow, Dagster), BI platform (Metabase, Superset), basic data catalogue; $500-2,000/month, 1-2 people. Stage 3, Governed Platform: multi-zone data lake with governance, semantic layer with defined metrics, automated quality monitoring, self-service analytics; $2,000-10,000/month, 2-4 people. Stage 4, Advanced Platform: real-time streaming integration, machine learning capabilities, federated or mesh architecture, advanced lineage and impact analysis; $10,000+/month, 4+ people.]
Figure 7: Platform evolution stages showing investment requirements and capacity needs at each level
Stage 1: Foundation establishes basic reporting capability with minimal investment. A PostgreSQL database stores extracts from operational systems. Python scripts scheduled through cron or Windows Task Scheduler perform nightly extracts and transformations. Analysts query the database directly or export to spreadsheets for analysis. Cloud storage holds database backups and file exports.
This stage suits organisations with a single IT person managing technology alongside other duties, limited budget for data infrastructure, straightforward reporting requirements, and data volumes under 50 GB. Investment runs $0 to $500 per month using open source tools on modest infrastructure. One person dedicating 20% of time can maintain foundation-stage operations.
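The nightly extract at this stage can be a short script run from cron. The sketch below illustrates the pattern with an in-memory SQLite database so it is self-contained; a real foundation-stage deployment would connect to PostgreSQL (for example via psycopg2) instead, and the table and column names here are illustrative.

```python
# Minimal nightly-extract sketch: copy one day's records from an
# operational table into a reporting table. Suitable for a cron job
# such as: 0 2 * * * python nightly_extract.py
import sqlite3
from datetime import date

def nightly_extract(conn: sqlite3.Connection, as_of: date) -> int:
    """Copy one day's donations into the reporting table; return rows copied."""
    day = as_of.isoformat()
    conn.execute(
        "INSERT INTO reporting_donations (id, amount, donated_on) "
        "SELECT id, amount, donated_on FROM donations WHERE donated_on = ?",
        (day,),
    )
    conn.commit()
    # Count what actually landed, so the cron log reflects real work done.
    return conn.execute(
        "SELECT COUNT(*) FROM reporting_donations WHERE donated_on = ?", (day,)
    ).fetchone()[0]

# Illustrative run against an in-memory database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE donations (id INTEGER, amount REAL, donated_on TEXT);
    CREATE TABLE reporting_donations (id INTEGER, amount REAL, donated_on TEXT);
    INSERT INTO donations VALUES (1, 50.0, '2024-03-01'), (2, 20.0, '2024-03-02');
""")
copied = nightly_extract(conn, date(2024, 3, 1))
```

Keeping the extract idempotent per day (re-running it should not duplicate rows) is the main refinement worth adding before anything more sophisticated.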
Stage 2: Structured Analytics adds dedicated analytical infrastructure and self-service BI. A data warehouse (PostgreSQL with analytical configuration, DuckDB, or cloud warehouse) stores modelled data. Pipeline orchestration tools (Airflow, Dagster, Prefect) manage transformation dependencies and scheduling. BI platforms (Metabase, Superset, or commercial alternatives) enable analysts to build dashboards and explore data without writing SQL. A basic data catalogue documents available datasets.
This stage suits organisations with growing reporting complexity, multiple data consumers, data volumes from 50 GB to 500 GB, and budget for dedicated data infrastructure. Investment runs $500 to $2,000 per month. One to two people can maintain stage 2 operations.
Stage 3: Governed Platform implements comprehensive data governance with quality monitoring and self-service analytics. A multi-zone data lake holds raw, curated, and consumption data with clear policies governing each zone. A semantic layer defines metrics and dimensions consistently across the organisation. Automated quality monitoring alerts when data fails validation rules. Self-service capabilities enable analysts to create new reports without engineering support.
This stage suits organisations with diverse data sources, multiple teams consuming data, regulatory or donor requirements for data governance, and data volumes exceeding 500 GB. Investment runs $2,000 to $10,000 per month. Two to four people with data engineering skills maintain stage 3 operations.
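A quality-monitoring step of the kind Stage 3 automates can start as a small rule-based check run after each load, with failures routed to an alert channel. The rules, field names, and sample records below are illustrative, not a prescribed schema.

```python
# Sketch of an automated data-quality check run after each pipeline load.
# Each rule produces a human-readable failure message for alerting.
def run_quality_checks(rows: list[dict]) -> list[str]:
    """Return failure messages for records violating validation rules."""
    failures = []
    for i, row in enumerate(rows):
        # Completeness rule: every record must identify a beneficiary.
        if row.get("beneficiary_id") is None:
            failures.append(f"row {i}: missing beneficiary_id")
        # Validity rule: disbursement amounts cannot be negative.
        if row.get("amount", 0) < 0:
            failures.append(f"row {i}: negative amount {row['amount']}")
    return failures

sample = [
    {"beneficiary_id": "B-001", "amount": 120.0},
    {"beneficiary_id": None, "amount": 75.0},
    {"beneficiary_id": "B-003", "amount": -5.0},
]
issues = run_quality_checks(sample)  # two rule violations in this sample
```

Dedicated frameworks add scheduling, history, and reporting around the same core idea: declarative rules evaluated against each batch.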
Stage 4: Advanced Platform adds real-time capabilities, machine learning, and potentially federated architecture. Streaming integration enables sub-minute data freshness for operational dashboards. ML infrastructure supports model training, deployment, and monitoring. Federated or mesh patterns distribute ownership across domain teams. Advanced lineage tracks data from source through all transformations to consumption.
This stage suits large organisations with real-time operational requirements, data science teams, federated organisational structures, and complex compliance requirements. Investment exceeds $10,000 per month. Four or more specialists maintain stage 4 operations.
Evolution between stages occurs incrementally, typically over 12 to 24 months per transition. Rushing evolution creates technical debt and an operational burden exceeding organisational capacity. Delaying evolution after capability needs have emerged limits analytical potential and accumulates manual workarounds.
Technology Selection
Technology selection balances capability requirements, operational capacity, budget constraints, and strategic considerations including data sovereignty and vendor independence. No single technology choice suits all organisations; context determines appropriate selection.
Cloud versus Self-Hosted
Cloud-managed services reduce operational burden by handling infrastructure provisioning, scaling, patching, and availability. Teams focus on data work rather than server administration. Cloud costs scale with usage, avoiding capital expenditure for infrastructure. Cloud providers offer sophisticated capabilities (ML services, managed streaming, global distribution) impractical for individual organisations to build.
Self-hosted platforms provide complete control over data location and access. Data remains within organisational or regional infrastructure, addressing sovereignty requirements for sensitive beneficiary or protection data. Self-hosting avoids dependencies on US-headquartered cloud providers subject to CLOUD Act provisions. Costs become more predictable at scale, avoiding consumption-based pricing that can spike unexpectedly.
Hybrid approaches combine cloud economics for less sensitive workloads with self-hosted control for sensitive data. Aggregated, anonymised analytical data might reside in cloud warehouses while identifiable beneficiary records remain in self-hosted databases within appropriate jurisdictions.
Open Source Options
Open source technologies provide warehouse, lake, and lakehouse capabilities without licensing costs and with full source code access.
PostgreSQL serves as an analytical database for warehouse workloads under 500 GB with appropriate configuration (increased work_mem, parallel query settings, columnar storage extensions such as Citus). DuckDB provides embedded analytical query capability with excellent single-machine performance. ClickHouse handles analytical workloads requiring higher throughput than PostgreSQL.
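An analytical PostgreSQL configuration typically raises memory and parallelism settings from their conservative defaults. The fragment below shows settings of this kind; the values are illustrative and depend on available RAM and query concurrency, not recommendations for any particular deployment.

```ini
# Illustrative postgresql.conf adjustments for analytical workloads.
# Tune values to the server's memory and expected concurrent queries.
work_mem = 256MB                      # larger per-operation sorts and hash joins
maintenance_work_mem = 1GB            # faster index builds and vacuum
max_parallel_workers_per_gather = 4   # parallel scans for large aggregations
effective_cache_size = 12GB           # planner hint: most data fits in OS cache
random_page_cost = 1.1                # SSD-appropriate planner costing
```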
MinIO provides S3-compatible object storage for self-hosted data lakes. Apache Spark executes batch transformations across distributed data. Trino and Presto query lake storage with SQL semantics. Apache Kafka provides event streaming; Apache Pulsar offers similar capabilities with different operational characteristics.
Apache Airflow orchestrates batch pipelines through directed acyclic graphs defining task dependencies. Dagster and Prefect provide more modern orchestration with better testing and observability. dbt (data build tool) transforms data through SQL-based models with dependency management and documentation generation.
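The core idea behind these orchestrators, tasks arranged in a directed acyclic graph and executed in dependency order, can be shown with the standard library alone. The sketch below uses Python's graphlib rather than Airflow itself, and the task names are illustrative.

```python
# Orchestrators such as Airflow model a pipeline as a DAG of tasks and
# run each task only after its upstream dependencies complete. This
# stdlib sketch resolves the execution order for a small pipeline.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it runs.
pipeline = {
    "extract_crm": set(),
    "extract_finance": set(),
    "transform_donations": {"extract_crm", "extract_finance"},
    "build_dashboard_tables": {"transform_donations"},
}

# static_order() yields a valid execution order; a real orchestrator
# would also run independent tasks (the two extracts) in parallel.
run_order = list(TopologicalSorter(pipeline).static_order())
```

Airflow, Dagster, and Prefect add what the sketch omits: scheduling, retries, parallel execution, logging, and a UI for inspecting runs.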
Apache Superset and Metabase provide self-hosted BI platforms with dashboard building, query exploration, and access control. Apache Atlas and DataHub offer data cataloguing with lineage tracking.
Commercial Options
Commercial platforms reduce operational complexity through managed services, support agreements, and integrated tooling.
Snowflake, Databricks, and Google BigQuery provide cloud data platforms with warehouse and lakehouse capabilities, consumption-based pricing, and minimal infrastructure management. Microsoft Fabric integrates data platform capabilities with the Microsoft 365 ecosystem. All major cloud providers (AWS, Google Cloud, Azure) offer data platform services with varying pricing and capability tradeoffs.
Fivetran and Airbyte (open core) provide managed data integration with pre-built connectors to common sources. Tableau, Power BI, and Looker offer commercial BI platforms with advanced visualisation and enterprise features.
Commercial options suit organisations prioritising reduced operational burden over cost optimisation or vendor independence. Nonprofit programmes from Microsoft, Google, AWS, and Salesforce reduce costs for qualifying organisations.
Implementation Considerations
For Organisations with Limited IT Capacity
Start at Stage 1 with PostgreSQL, scheduled Python scripts, and cloud storage for backups. This configuration requires no specialised data engineering skills and runs on a single modest server or managed database instance. Export data to spreadsheets when analysts lack SQL skills.
Metabase’s hosted offering or Preset (hosted Superset) provides BI capability without self-hosting infrastructure. Google BigQuery’s free tier (1 TB queries per month) or DuckDB on analyst workstations enables analytical queries without warehouse infrastructure.
Prioritise operational stability over advanced features. A reliable nightly extract that analysts trust beats sophisticated real-time pipelines that frequently fail and require engineering intervention.
For Organisations with Established IT Functions
Evaluate current state before selecting architecture. Organisations with existing Hadoop infrastructure might extend with Hive or Spark rather than introducing new platforms. Microsoft-centric environments benefit from integration with Fabric or Synapse. Existing PostgreSQL expertise suggests building analytical capability there before introducing new database technologies.
Implement Stage 2 components incrementally: orchestration first to improve pipeline reliability, then BI platform for self-service, then data catalogue for discoverability. Each component provides standalone value while building toward comprehensive capability.
Consider lakehouse architecture (Delta Lake or Iceberg on object storage) to avoid maintaining separate lake and warehouse infrastructure. Modern lakehouse implementations provide warehouse query performance with lake storage economics.
For Organisations Operating in Field Contexts
Field operations require platform designs accommodating intermittent connectivity and limited bandwidth. Centralised cloud platforms work when reliable connectivity exists to headquarters or regional hubs. Distributed architectures with local data stores and periodic synchronisation suit disconnected field contexts.
Lightweight analytical databases (DuckDB, SQLite) run on field laptops and tablets, enabling local analysis without network dependency. Periodic synchronisation when connectivity is available keeps central platform updated.
Data compression and incremental synchronisation reduce bandwidth requirements. Synchronising only changed records rather than replicating the full dataset makes field connectivity viable for data operations.
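Incremental synchronisation is commonly implemented with a watermark: the device records the timestamp of its last successful sync and uploads only records modified since then. A minimal sketch, with illustrative record shapes and field names:

```python
# Watermark-based incremental sync sketch: select only records whose
# updated_at is newer than the last successful sync, then advance the
# watermark to the newest uploaded timestamp.
def changed_since(records: list[dict], last_sync: str) -> list[dict]:
    """Select records modified after the sync watermark.

    ISO 8601 timestamp strings compare correctly as plain strings.
    """
    return [r for r in records if r["updated_at"] > last_sync]

local_records = [
    {"id": 1, "updated_at": "2024-03-01T09:00"},
    {"id": 2, "updated_at": "2024-03-03T14:30"},
    {"id": 3, "updated_at": "2024-03-04T08:15"},
]

# Only records 2 and 3 changed since the last sync on 2 March.
to_upload = changed_since(local_records, "2024-03-02T00:00")
new_watermark = max(r["updated_at"] for r in to_upload)
```

The watermark should only advance after the central platform confirms receipt, so that a failed upload is retried on the next connection rather than silently lost.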
Data Sovereignty and Jurisdiction
Platform architecture must address data residency requirements for beneficiary and protection data. Cloud provider region selection determines where data is stored and processed. Major providers offer regions across multiple jurisdictions, but not all services are available in all regions.
US-headquartered providers (AWS, Google Cloud, Microsoft Azure) remain subject to US legal processes regardless of data storage location. Organisations with strict sovereignty requirements consider European providers (OVHcloud, Scaleway, Hetzner), regional providers, or self-hosted infrastructure.
Hybrid architectures can segregate data by sensitivity: less sensitive operational analytics in cloud platforms, sensitive beneficiary data in self-hosted or jurisdictionally appropriate infrastructure.