Data Pipeline Design
Data pipelines are automated workflows that extract data from source systems, apply transformations, and load results into destination systems. A pipeline encapsulates the logic for data movement as code, enabling version control, testing, and reliable execution at scale. For organisations operating programme management systems, financial platforms, and reporting tools, pipelines create the connective tissue that allows data to flow between systems without manual intervention.
- **Pipeline**: An automated sequence of data processing steps that executes on a defined schedule or trigger, moving data from sources to destinations while applying transformations.
- **DAG (Directed Acyclic Graph)**: A graph structure representing task dependencies, where edges indicate execution order and the absence of cycles guarantees a valid execution order exists.
- **Idempotency**: The property whereby executing a pipeline multiple times with the same input produces identical output, enabling safe retries without data duplication.
- **Backfill**: Re-executing a pipeline for historical time periods to populate missing data or apply updated transformation logic to past records.
Pipeline execution models
Pipelines execute according to three primary models that differ in latency, complexity, and resource consumption. The choice between models depends on how quickly downstream systems require data and what infrastructure exists to support processing.
Batch pipelines process accumulated data at scheduled intervals. A nightly batch pipeline extracting the previous day’s transactions runs once every 24 hours, processing all records created since the last execution. Batch processing suits workloads where data consumers tolerate latency measured in hours or days. The model simplifies infrastructure because compute resources spin up only during scheduled windows, then release. Most reporting, analytics, and regulatory submissions operate effectively on batch schedules.
Micro-batch pipelines process data in small, frequent increments. Rather than waiting 24 hours, a micro-batch pipeline might execute every 15 minutes, processing only records created in that window. This approach reduces latency to minutes while retaining the transactional semantics of batch processing. Micro-batch suits operational dashboards, near-real-time monitoring, and integration scenarios where hourly delays create friction but sub-second latency is unnecessary.
Streaming pipelines process records individually as they arrive, maintaining continuous computation. A streaming pipeline consuming events from a message queue processes each event within milliseconds of production. Streaming demands persistent compute resources and introduces complexity around state management, but enables use cases impossible with batch approaches: real-time alerting, live dashboards, and immediate data synchronisation.
| | Batch | Micro-batch | Streaming |
|---|---|---|---|
| Accumulation | 24 hours | 15 minutes | continuous |
| Buffer | file | file | queue (Kafka) |
| Trigger | 02:00 daily | every 15 min | per event |
| Latency | hours | minutes | milliseconds |
| Compute | ephemeral | ephemeral | persistent |
| Complexity | lowest | moderate | highest |

Figure 1: Pipeline execution models comparing batch, micro-batch, and streaming approaches
The distinction between models is not absolute. A single data platform commonly employs all three: batch pipelines for nightly warehouse loads, micro-batch for operational reporting, and streaming for real-time alerting. The architecture section of this page addresses how these models compose within a unified platform.
ETL and ELT patterns
Two fundamental patterns govern how pipelines structure the relationship between extraction, transformation, and loading. The patterns differ in where transformation occurs and what that placement implies for system design.
ETL (Extract-Transform-Load) applies transformations before loading data into the destination. A pipeline extracts beneficiary records from a registration system, transforms field names to match warehouse conventions, validates data quality, and then loads clean records into the warehouse. Transformation happens in a processing layer between source and destination, with only conformant data reaching the target system.
In the ETL pattern, the source system feeds a transformation layer (cleanse, validate, conform, aggregate), which loads only conformant data into the target system. Key characteristics:

- Data quality enforced before target load
- Target contains only transformed, valid data
- Processing capacity scales independently of the target

Figure 2: ETL pattern with transformation occurring before target load
ETL emerged when destination systems (data warehouses) had limited compute capacity and expensive storage. Transforming data before loading minimised warehouse resource consumption. The pattern remains appropriate when the destination system genuinely lacks transformation capability, when data quality must be enforced before arrival, or when the transformation layer provides capabilities absent in the destination.
ELT (Extract-Load-Transform) loads raw data first, then applies transformations within the destination system. A pipeline extracts beneficiary records, loads them unchanged into a staging area of the warehouse, and transformation queries within the warehouse produce clean output tables. The destination system performs all data manipulation.
In the ELT pattern, the source system loads a raw staging area inside the target system, and SQL transformations within the target produce the final tables. Key characteristics:

- Raw data preserved in staging
- Transformations execute in warehouse SQL
- Re-transformation possible without re-extraction

Figure 3: ELT pattern with transformation occurring within the target system
ELT leverages the compute power of modern cloud data warehouses and data lakes. Systems like Snowflake, BigQuery, and Databricks provide elastic compute that scales to transformation workloads, making separate transformation infrastructure redundant. ELT preserves raw data in staging tables, enabling re-transformation when business logic changes without re-extracting from sources. This raw data preservation also supports lineage tracking and audit requirements.
The choice between ETL and ELT depends on where compute capacity exists, whether raw data preservation matters, and how frequently transformation logic changes.
| Factor | ETL favoured | ELT favoured |
|---|---|---|
| Destination compute | Limited or fixed | Elastic, scalable |
| Raw data needs | No audit requirement | Must preserve source state |
| Transformation complexity | Simple, stable logic | Complex, evolving logic |
| Source system load | Can tolerate re-extraction | Extraction is expensive |
| Tool ecosystem | Dedicated ETL tools available | SQL-centric team skills |
Hybrid approaches combine both patterns. A pipeline might apply data quality validation during extraction (ETL behaviour) while deferring business logic transformations to the warehouse (ELT behaviour). The boundary between patterns is practical, not doctrinal.
Extraction patterns
Extraction retrieves data from source systems for pipeline processing. Three patterns address different source system characteristics and data volume considerations.
Full extraction retrieves all records from the source system on each pipeline execution. A pipeline extracting a reference table of 500 locations loads all 500 records every run, replacing the previous load. Full extraction suits small, slowly-changing datasets where the overhead of tracking changes exceeds the cost of complete retrieval. The pattern simplifies pipeline logic because no state management tracks what was previously extracted.
Full extraction becomes impractical as data volume grows. A table containing 10 million beneficiary records requires 10 million row transfers on every execution, consuming network bandwidth, source system resources, and processing time regardless of whether any records actually changed.
Incremental extraction retrieves only records created or modified since the last extraction. The pipeline tracks a high-water mark, typically a timestamp or sequence number, and each execution queries for records beyond that mark. A pipeline with a high-water mark of 2024-11-15 23:00:00 extracts only records where modified_date > '2024-11-15 23:00:00', updating the mark after successful completion.
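A minimal sketch of the high-water-mark logic in Python, assuming source rows are available as dictionaries with a `modified_date` field (the function and field names are illustrative, standing in for a real source-system query):

```python
from datetime import datetime

def incremental_extract(rows, high_water_mark):
    """Select rows modified after the stored high-water mark and
    compute the new mark to persist after a successful load."""
    changed = [r for r in rows if r["modified_date"] > high_water_mark]
    # Advance the mark only as far as the data actually extracted,
    # so a failed run can safely repeat the same window.
    new_mark = max((r["modified_date"] for r in changed), default=high_water_mark)
    return changed, new_mark

rows = [
    {"id": 1, "modified_date": datetime(2024, 11, 10)},
    {"id": 3, "modified_date": datetime(2024, 11, 16)},
    {"id": 4, "modified_date": datetime(2024, 11, 17)},
]
mark = datetime(2024, 11, 15, 23, 0)
changed, new_mark = incremental_extract(rows, mark)
# changed holds ids 3 and 4; new_mark advances to 2024-11-17
```

Persisting `new_mark` only after the load commits is what makes the pattern safe to retry.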
Figure 4: Incremental extraction using a high-water mark to identify changed records. With a stored mark of 2024-11-15 23:00, the pipeline extracts only rows where mod_dt exceeds the mark (2 of 4 rows in the example), then advances the mark to the maximum mod_dt extracted (2024-11-17).
Incremental extraction requires source systems to maintain reliable timestamps or sequence numbers. When source systems lack such fields or when timestamps prove unreliable (due to clock skew, bulk updates, or application bugs), incremental extraction produces incorrect results.
Change data capture (CDC) reads the source system’s transaction log to identify record changes at the database level. Rather than querying application tables, CDC tools consume the log stream that the database uses for replication, capturing inserts, updates, and deletes as they occur. CDC provides the lowest latency and most complete change information, including deletions that timestamp-based approaches miss.
CDC requires access to database transaction logs, which involves database configuration changes and elevated permissions. Not all database platforms support CDC extraction, and managed database services may not expose log access. The operational complexity of CDC exceeds incremental extraction but delivers superior accuracy for high-volume, change-intensive sources.
| Extraction pattern | Best suited for | Limitations |
|---|---|---|
| Full | Reference data under 100,000 rows | Impractical at scale |
| Incremental | Transactional data with reliable timestamps | Misses deletions, depends on source field quality |
| CDC | High-volume, change-intensive sources | Requires log access, operational complexity |
A practical approach starts with full extraction during initial development, shifts to incremental extraction as data volumes grow, and adopts CDC only for sources where accuracy requirements or volume justify the complexity.
Transformation patterns
Transformations convert extracted data into the structure and quality required by destination systems. Transformations decompose into stages that build progressively from raw input to final output.
Staging copies extracted data into the destination environment without structural changes. A staging table mirrors the source schema, preserving raw values exactly as extracted. Staging serves multiple purposes: it provides a checkpoint for restart after failure, preserves raw data for audit, and decouples subsequent transformation from extraction timing. Staging tables commonly include metadata columns recording extraction timestamp and source system identifier.
Cleansing addresses data quality issues that would otherwise propagate to downstream consumers. Cleansing operations include null handling (replacing null values with defaults or filtering null rows), format standardisation (converting date strings to date types, normalising phone number formats), and deduplication (identifying and removing duplicate records based on business keys). Cleansing rules derive from the data quality standards established in the organisation’s quality framework.
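These three cleansing operations can be sketched together in Python. The record fields, default values, and dedup rule below are illustrative assumptions, not a prescribed standard:

```python
from datetime import datetime

def cleanse(records):
    """Illustrative cleansing pass: default missing statuses, parse
    date strings into date objects, and deduplicate on the business
    key, keeping the most recently registered record."""
    cleaned = {}
    for r in records:
        r = dict(r)
        r["status"] = r.get("status") or "unknown"  # null handling
        if isinstance(r.get("registered"), str):    # format standardisation
            r["registered"] = datetime.strptime(r["registered"], "%Y-%m-%d").date()
        key = r["beneficiary_id"]                   # deduplication key
        if key not in cleaned or r["registered"] >= cleaned[key]["registered"]:
            cleaned[key] = r
    return list(cleaned.values())

out = cleanse([
    {"beneficiary_id": "B1", "status": None, "registered": "2024-01-05"},
    {"beneficiary_id": "B1", "status": "active", "registered": "2024-03-01"},
    {"beneficiary_id": "B2", "status": "active", "registered": "2024-02-10"},
])
# The duplicate B1 collapses to the later registration
```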
Validation applies business rules to identify records that fail quality criteria without necessarily removing them. A validation step flags records where beneficiary age exceeds 120 years or where geographic coordinates fall outside expected bounds. Validation differs from cleansing in that it identifies problems rather than fixing them, supporting review workflows where analysts assess flagged records.
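The flag-rather-than-fix distinction is easy to see in code. A hedged sketch, with hypothetical rule names and fields:

```python
def validate(records):
    """Attach validation flags rather than fixing or dropping rows;
    flagged records can feed a downstream review workflow."""
    rules = {
        "age_out_of_range": lambda r: not 0 <= r.get("age", 0) <= 120,
        "latitude_out_of_bounds": lambda r: not -90 <= r.get("latitude", 0) <= 90,
    }
    for r in records:
        r["validation_errors"] = [name for name, fails in rules.items() if fails(r)]
    return records

flagged = validate([
    {"id": 1, "age": 34, "latitude": 9.05},
    {"id": 2, "age": 147, "latitude": 9.06},  # fails the age rule
])
```

All records pass through; only the `validation_errors` field distinguishes clean rows from flagged ones.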
Enrichment augments records with data from reference sources. An enrichment step joins beneficiary records with location reference data to append administrative boundary names, or calls a geocoding service to convert addresses to coordinates. Enrichment introduces dependencies on reference data currency and external service availability.
Conforming aligns data to destination schema requirements. Conforming renames fields from source conventions to destination conventions, applies data type conversions, and restructures records to match target table schemas. A source system storing names as first_name and last_name fields conforms to a destination expecting a single full_name field through concatenation.
Aggregation summarises detail records into analytical structures. An aggregation step counts beneficiaries by location and programme, producing summary rows from transaction-level detail. Aggregation typically produces dimensional model outputs: fact tables containing measures and dimension tables containing attributes.
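A minimal sketch of the beneficiaries-by-location-and-programme aggregation, assuming detail rows as dictionaries (field names are illustrative):

```python
from collections import Counter

def aggregate(detail_rows):
    """Count beneficiaries by (location, programme), producing
    summary rows shaped like a minimal fact table."""
    counts = Counter((r["location"], r["programme"]) for r in detail_rows)
    return [
        {"location": loc, "programme": prog, "beneficiary_count": n}
        for (loc, prog), n in sorted(counts.items())
    ]

facts = aggregate([
    {"location": "North", "programme": "Nutrition"},
    {"location": "North", "programme": "Nutrition"},
    {"location": "South", "programme": "Health"},
])
```

In a warehouse, the same shape would come from a GROUP BY; the point is the collapse from transaction-level detail to one row per dimension combination.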
These stages execute in dependency order within a pipeline DAG. Cleansing depends on staging completion; enrichment depends on cleansing; aggregation depends on enrichment. The orchestration layer manages these dependencies.
Loading patterns
Loading writes transformed data to destination systems. Three patterns address different requirements for handling existing data and new arrivals.
Append adds new records to the destination without modifying existing records. Each pipeline execution inserts rows, growing the destination table monotonically. Append suits immutable event data where records represent facts that never change: transactions, log entries, sensor readings. An append load of 10,000 daily transactions adds 10,000 rows to the destination, leaving prior days’ records untouched.
Overwrite replaces the entire destination table with current pipeline output. A reference data pipeline loading 500 locations deletes all existing location records and inserts the current 500. Overwrite simplifies logic when the complete current state matters more than change history, and works effectively for small reference datasets. Overwrite becomes problematic when destination tables grow large or when downstream consumers query during the brief window when the table is empty or partially loaded.
Upsert (update-insert, also called merge) combines insert and update operations based on record keys. Records with keys matching existing destination rows update those rows; records with new keys insert as new rows. Upsert preserves existing data while incorporating changes, suitable for master data and slowly-changing dimensions.
```sql
-- Upsert pattern in SQL (PostgreSQL syntax)
INSERT INTO destination (id, name, status, updated_at)
SELECT id, name, status, updated_at FROM staging
ON CONFLICT (id) DO UPDATE SET
    name = EXCLUDED.name,
    status = EXCLUDED.status,
    updated_at = EXCLUDED.updated_at
WHERE destination.updated_at < EXCLUDED.updated_at;
```

Upsert requires a reliable business key to match source and destination records. When sources lack stable identifiers, upsert produces incorrect matches and data corruption. Upsert also does not handle deletions; records deleted from the source remain in the destination unless separate deletion logic addresses them.
Soft delete extends upsert to handle deletions by marking records as deleted rather than physically removing them. A boolean is_deleted flag or a deleted_at timestamp indicates deleted status. Downstream queries filter on the deletion flag, and historical queries can recover deleted records. Soft delete preserves audit trails and enables recovery from erroneous deletions.
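An in-memory sketch of upsert extended with soft delete, assuming the destination is keyed on `id` and a full snapshot arrives in staging (so any key absent from staging is a deletion). All names are illustrative:

```python
from datetime import datetime

def upsert_with_soft_delete(destination, staging, now):
    """Merge staging rows into destination keyed on id: update
    matches, insert new keys, and mark keys absent from staging
    as deleted instead of physically removing them."""
    staged_ids = {r["id"] for r in staging}
    for r in staging:
        row = dict(r)
        row["deleted_at"] = None            # arriving records are live
        destination[row["id"]] = row
    for key, row in destination.items():
        if key not in staged_ids and row["deleted_at"] is None:
            row["deleted_at"] = now         # soft delete: flag, do not drop
    return destination

dest = {1: {"id": 1, "name": "Location A", "deleted_at": None}}
out = upsert_with_soft_delete(dest, [{"id": 2, "name": "Location B"}], datetime(2024, 11, 17))
# Location A is marked deleted; Location B is inserted live
```

Downstream queries filter on `deleted_at IS NULL`, while historical queries can still see the flagged rows.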
Orchestration
Orchestration coordinates pipeline execution: scheduling runs, managing dependencies between tasks, handling failures, and providing visibility into execution state. An orchestration platform serves as the control plane for all pipeline activity.
Scheduling determines when pipelines execute. Cron-based schedules trigger pipelines at fixed times: 0 2 * * * executes daily at 02:00 UTC. Event-based triggers start pipelines when upstream events occur: a file landing in storage, a message arriving on a queue, or another pipeline completing. Hybrid approaches combine both, with scheduled runs handling normal operation and event triggers accelerating processing when data arrives early.
Schedule design considers source system patterns, destination SLAs, and resource contention. A pipeline loading warehouse data for morning reports must complete before business hours begin. If source data becomes available at 01:00 and transformation requires 45 minutes, scheduling the pipeline at 01:30 provides margin for source delays while meeting the 08:00 availability target.
DAG structure defines task dependencies within a pipeline. Each task represents a discrete unit of work: extract from one source, apply one transformation, load to one destination. Dependencies between tasks form a directed acyclic graph where edges indicate execution order.
Figure 5: Pipeline DAG showing task dependencies and parallel execution paths. Three extraction tasks (Beneficiary, Location, Programme) run in parallel from a Start node, feed a shared Stage & Cleanse task, then Enrich & Conform; the fact table and dimension table loads run in parallel before a final Notify Complete task.
DAG design maximises parallelism where dependencies permit. Independent extraction tasks execute concurrently, reducing total pipeline duration. The orchestrator manages parallel execution within configured resource limits.
Task configuration specifies execution parameters for each task: retry count and delay, timeout duration, resource allocation, and environment variables. A configuration specifying 3 retries with exponential backoff (delays of 60, 120, 240 seconds) provides resilience against transient failures without immediate operator intervention.
```yaml
# Example task configuration (Airflow syntax)
extract_beneficiary:
  operator: PythonOperator
  retries: 3
  retry_delay: 60          # seconds, doubles each retry
  execution_timeout: 1800  # 30 minutes
  pool: extraction_pool
  pool_slots: 1
```

Error handling
Pipelines encounter failures from source unavailability, data quality issues, transformation errors, and destination problems. Error handling strategies determine how pipelines respond to failures and what happens to problematic records.
Retry with backoff re-executes failed tasks after a delay, with increasing delays between successive attempts. Transient failures (network timeouts, temporary service unavailability) resolve without intervention given sufficient retry attempts. A task configured for 3 retries with exponential backoff waits 1 minute, then 2 minutes, then 4 minutes before failing permanently.
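A minimal sketch of retry with exponential backoff. The `sleep` parameter is injected so the schedule can be inspected without actually waiting; the flaky task below simulates a transient failure that clears on the third attempt:

```python
import time

def retry_with_backoff(task, retries=3, base_delay=60, sleep=time.sleep):
    """Run `task`, retrying failures with exponentially increasing
    delays (base_delay, then 2x, 4x, ...); re-raise once exhausted."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))

# A task that fails twice with a transient error, then succeeds:
calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

delays = []  # capture the backoff schedule instead of sleeping
result = retry_with_backoff(flaky_extract, retries=3, base_delay=60, sleep=delays.append)
# The recorded schedule is [60, 120]: two waits, then success
```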
Circuit breaker patterns prevent cascading failures by halting execution when error rates exceed thresholds. If a source system becomes unavailable, continued connection attempts waste resources and potentially overload the recovering source. A circuit breaker that opens after 5 consecutive failures pauses extraction attempts for a cooling period, then tests connectivity before resuming.
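A stripped-down circuit breaker sketch illustrating the consecutive-failure threshold; a production implementation would add the cooling period and half-open connectivity test described above:

```python
class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures;
    while open, fail fast instead of hitting the struggling source."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def call(self, task):
        if self.is_open:
            raise RuntimeError("circuit open: skipping call")
        try:
            result = task()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result

def failing_source():
    raise ConnectionError("source unavailable")

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(failing_source)
    except ConnectionError:
        pass
# After two consecutive failures the circuit is open and further
# calls raise immediately without touching the source.
```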
Dead letter handling routes records that fail transformation or validation to a separate location for investigation rather than halting the entire pipeline. A record with an unparseable date value writes to a dead letter table along with the error message, while remaining records continue processing. Dead letter handling prevents individual bad records from blocking pipeline completion.
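The unparseable-date scenario can be sketched directly. Bad records are captured with their error message while the rest of the batch proceeds (function and field names are illustrative):

```python
from datetime import datetime

def process_with_dead_letter(records, transform):
    """Apply `transform` to each record; records that raise are
    captured with the error message instead of halting the run."""
    output, dead_letter = [], []
    for r in records:
        try:
            output.append(transform(r))
        except Exception as exc:
            dead_letter.append({"record": r, "error": str(exc)})
    return output, dead_letter

def parse_event(r):
    return {**r, "event_date": datetime.strptime(r["event_date"], "%Y-%m-%d").date()}

good, bad = process_with_dead_letter(
    [{"id": 1, "event_date": "2024-11-16"}, {"id": 2, "event_date": "not-a-date"}],
    parse_event,
)
# Record 1 continues processing; record 2 lands in the dead letter
# list with its parse error attached.
```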
Figure 6: Error handling flow showing retry logic and dead letter routing. Records that pass validation continue processing; failing records retry with backoff when the error is retry-eligible, otherwise they write to the dead letter table, with errors logged and alerts raised when thresholds are exceeded.
Alerting notifies operators when failures require human intervention. Alert configuration balances noise (alerting on every transient failure) against missed problems (alerting only on catastrophic failures). Effective alerting triggers on patterns: 3 consecutive task failures, dead letter counts exceeding daily thresholds, or pipeline duration exceeding historical norms by more than 50%.
Idempotency and replayability
Well-designed pipelines produce identical results regardless of how many times they execute for the same input period. This property enables safe retry after failures, backfill of historical data, and testing in production-like conditions.
Partition-based loading achieves idempotency by treating each time period as an independent partition. A daily pipeline targeting a partition date=2024-11-16 deletes any existing data for that partition before loading new data. Re-executing the pipeline replaces rather than duplicates the partition content.
```sql
-- Partition-based idempotent load
DELETE FROM fact_beneficiary_activity
WHERE activity_date = '2024-11-16';

INSERT INTO fact_beneficiary_activity
SELECT * FROM staging_beneficiary_activity
WHERE activity_date = '2024-11-16';
```

Record-level idempotency uses upsert operations where records have unique identifiers. Each execution updates existing records and inserts new ones based on the identifier, producing consistent results across multiple runs.
Backfill re-executes pipelines for historical periods when transformation logic changes or when initial loads missed data. A pipeline designed for backfill accepts a date range parameter rather than always processing the current period. Executing with parameters start_date=2024-01-01, end_date=2024-06-30 reprocesses six months of data using current logic.
Backfill at scale requires attention to resource consumption. Reprocessing 180 days of data simultaneously overwhelms systems designed for single-day loads. Backfill strategies include sequential processing (one day at a time), throttled parallel processing (10 days concurrently), or prioritised processing (recent data first, historical data in background).
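The sequential chunking strategy amounts to splitting the date range into windows and running them one at a time. A minimal sketch (the chunk size is an assumed operational parameter):

```python
from datetime import date, timedelta

def backfill_windows(start, end, days_per_chunk=10):
    """Split a backfill range into sequential windows so each run
    stays close to the load a single scheduled execution handles."""
    windows = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days_per_chunk - 1), end)
        windows.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return windows

windows = backfill_windows(date(2024, 1, 1), date(2024, 1, 25), days_per_chunk=10)
# Three windows: Jan 1-10, Jan 11-20, Jan 21-25
```

Each window then feeds the pipeline's `start_date`/`end_date` parameters, throttled or prioritised as the strategy requires.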
Monitoring and observability
Pipeline observability provides visibility into execution state, performance trends, and data quality metrics. Monitoring enables proactive issue detection before failures impact downstream consumers.
Execution metrics track pipeline runs: success/failure status, duration, records processed, and resource consumption. Dashboards displaying these metrics reveal patterns: pipelines slowing over time (indicating scaling issues), increasing failure rates (indicating source instability), or growing record counts (indicating data volume trends).
Data quality metrics measure output characteristics: null percentages, duplicate rates, referential integrity violations, and value distribution anomalies. A sudden increase in null values for a normally-populated field indicates source system problems even when the pipeline itself succeeds.
Freshness monitoring tracks when destination data was last updated, alerting when data falls behind expected schedules. A warehouse table that should refresh by 06:00 daily triggers an alert if the last update timestamp remains at yesterday’s value past 07:00.
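One simple way to express this check is as a maximum-age rule: a daily table expected by 06:00 tolerates roughly 25 hours of age when checked shortly after the deadline. A hedged sketch with assumed timestamps:

```python
from datetime import datetime, timedelta

def is_stale(last_updated, now, max_age):
    """Flag data as stale when its age exceeds the allowed maximum."""
    return now - last_updated > max_age

# Checked at 07:05, the table still carries yesterday's 05:50 load,
# so its age (over 25 hours) breaches the allowance and alerts.
stale = is_stale(
    last_updated=datetime(2024, 11, 15, 5, 50),
    now=datetime(2024, 11, 16, 7, 5),
    max_age=timedelta(hours=25),
)
```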
Lineage integration connects pipeline observability with data lineage systems, recording which pipeline runs produced which data versions. When data quality issues emerge in reports, lineage traces backwards to the pipeline execution and source data that produced the problematic output.
Technology options
Pipeline tools span open source orchestrators, transformation frameworks, and commercial platforms. Selection depends on existing infrastructure, team skills, and operational capacity.
Open source orchestrators
Apache Airflow provides Python-based DAG authoring, a web interface for monitoring, and extensive operator libraries for connecting to data sources and destinations. Airflow requires infrastructure management (databases, web servers, schedulers, workers) but offers complete flexibility in pipeline design. Organisations with Python skills and infrastructure capacity find Airflow suitable for complex, custom pipelines.
Dagster emphasises software engineering practices: type checking, testing, and modular asset definitions. Dagster’s asset-centric model defines pipelines in terms of data assets produced rather than tasks executed. Organisations prioritising code quality and maintainability find Dagster’s abstractions valuable.
Prefect balances Airflow’s flexibility with reduced operational complexity. Prefect Cloud provides managed infrastructure, eliminating scheduler and database management. The hybrid model suits organisations wanting orchestration capabilities without infrastructure overhead.
Transformation frameworks
dbt (data build tool) transforms data within warehouses using SQL, implementing the ELT pattern. dbt manages dependency ordering, applies tests, and generates documentation from SQL files. For SQL-centric transformation workloads, dbt provides structure and best practices without requiring Python skills.
dbt operates within the ELT paradigm, assuming data already resides in a warehouse. Extraction and loading require separate tools; dbt addresses only the T in ELT.
Commercial platforms
Fivetran provides managed extraction pipelines with pre-built connectors to SaaS applications, databases, and file sources. Fivetran handles infrastructure, monitoring, and connector maintenance, reducing operational burden. The platform suits organisations with many SaaS data sources and limited engineering capacity for connector development.
Matillion combines extraction and transformation in a visual interface, targeting cloud data warehouses. The low-code approach suits teams without dedicated data engineering capacity while providing more flexibility than pure extraction tools.
| Tool | Category | Deployment | Best suited for |
|---|---|---|---|
| Apache Airflow | Orchestration | Self-hosted | Complex DAGs, Python teams |
| Dagster | Orchestration | Self-hosted or cloud | Asset-centric pipelines |
| Prefect | Orchestration | Cloud or hybrid | Managed orchestration |
| dbt | Transformation | Cloud or CLI | SQL-based warehouse transforms |
| Fivetran | Extraction | Cloud | SaaS source extraction |
| Matillion | ETL/ELT | Cloud | Visual pipeline development |
Implementation considerations
For organisations with limited IT capacity
Begin with managed services that minimise infrastructure overhead. A combination of Fivetran for extraction and dbt Cloud for transformation provides pipeline capability without orchestration infrastructure. This approach suits organisations with one or two data-focused staff who cannot dedicate time to infrastructure management.
Start with batch pipelines on daily schedules. Daily processing meets most reporting requirements while providing a 24-hour window to address failures before stakeholders notice. Move to more frequent processing only when genuine business requirements demand lower latency.
Use cloud-native scheduling (Azure Data Factory, AWS Step Functions, Google Cloud Composer) when pipelines remain simple and the organisation already operates on that cloud platform. Native tools integrate with existing authentication and monitoring without additional infrastructure.
For organisations with established data functions
Invest in orchestration infrastructure when pipeline complexity exceeds what managed tools handle gracefully. Airflow or Dagster deployed on Kubernetes provides scaling and isolation for diverse pipeline workloads. The infrastructure investment pays back through flexibility in pipeline design and reduced per-pipeline costs at scale.
Implement comprehensive testing: unit tests for transformation logic, integration tests for connector behaviour, and data quality tests for output validation. Treat pipeline code with the same rigour as application code, including code review, version control, and deployment automation.
Establish pipeline development standards covering naming conventions, configuration management, error handling patterns, and documentation requirements. Standards reduce cognitive load when engineers move between pipelines and improve maintainability.
For low-bandwidth and field contexts
Design pipelines tolerating intermittent connectivity. Store extraction results locally before transmission, enabling resumption after connection interruption. Compression reduces bandwidth consumption; a pipeline compressing CSV extracts to gzip format reduces transfer volume by 80% or more.
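A minimal sketch of the compression step using Python's standard gzip module; the sample CSV is synthetic and stands in for a real extract:

```python
import gzip

def compress_extract(csv_text):
    """Gzip a CSV extract before transmission; repetitive tabular
    text compresses to a small fraction of its raw size."""
    return gzip.compress(csv_text.encode("utf-8"))

raw = "id,location,count\n" + "\n".join(f"{i},Location A,10" for i in range(1000))
payload = compress_extract(raw)
# Transmit `payload`; the receiver restores the extract with
# gzip.decompress(payload).decode("utf-8").
```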
Prioritise data transmission by business value. Critical programme data transmits first; historical or reference data follows when bandwidth permits. Chunked transmission with checkpoint tracking enables progress across multiple connectivity windows.
Consider edge processing where field infrastructure permits. A local database receiving mobile data collection submissions can perform initial transformation and aggregation, transmitting summarised results that consume less bandwidth than raw submissions.