Data Lineage

Data lineage records where data originates, how it moves between systems, and what transformations alter its structure or content along the way. This recorded history creates a navigable map of data flow that serves four distinct purposes: understanding the upstream sources that feed any dataset, predicting which downstream systems a change will affect, demonstrating compliance with data protection regulations, and tracing quality issues to their root cause. For organisations managing beneficiary data across registration, case management, and reporting systems, lineage provides the visibility required to answer questions about data provenance that arise during audits, investigations, and subject access requests.

Data lineage
A recorded representation of data’s lifecycle showing origin, movement, and transformation from source to consumption. Lineage captures the relationships between data assets rather than the data values themselves.
Technical lineage
Column-level or table-level mappings derived from code, SQL, ETL jobs, or pipeline definitions. Technical lineage shows precisely which source columns contribute to which target columns through which transformation logic.
Business lineage
Higher-level relationships between business concepts, reports, and data domains. Business lineage connects a KPI on a dashboard to the business entities it represents without detailing every technical step.
Impact analysis
The use of lineage to identify all downstream assets affected by a proposed change to an upstream asset. Impact analysis answers: “If I modify this table, what reports break?”
Provenance
The complete origin history of a specific data value or record, including which systems created it and what transformations it underwent. Provenance is lineage applied to a particular data instance.

Technical and Business Lineage

Lineage exists at two levels of abstraction that serve different audiences and answer different questions. Technical lineage operates at the level of database columns, transformation functions, and pipeline stages. Business lineage operates at the level of reports, metrics, and business entities. Both representations describe the same underlying data flow, but they emphasise different aspects and require different capture mechanisms.

Technical lineage traces the exact path a data element takes from source to destination. When a beneficiary’s registration date appears in a quarterly report, technical lineage records that the registration_date column in the dim_beneficiary table derives from the created_at column in the beneficiary_staging table, which itself was extracted from the reg_timestamp field in the mobile data collection system’s API response. Each transformation step is documented: timezone conversion from UTC to local time, truncation from timestamp to date, and any business rules that modified the value.
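In code, the documented transformation chain (timezone conversion, then truncation to a date) is small. The sketch below assumes an East African programme timezone, which is illustrative; the column names come from the example above:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_local_registration_date(reg_timestamp_utc: str,
                               local_tz: str = "Africa/Nairobi") -> str:
    """Reproduce the transformation chain technical lineage records:
    parse the UTC timestamp (reg_timestamp -> created_at), convert to
    the programme's local timezone, truncate to a date
    (registration_date). The timezone is an illustrative assumption."""
    ts = datetime.fromisoformat(reg_timestamp_utc).replace(tzinfo=timezone.utc)
    local = ts.astimezone(ZoneInfo(local_tz))
    return local.date().isoformat()
```

Note that the conversion can shift the value across a date boundary, which is exactly the kind of detail technical lineage must capture and business lineage deliberately hides.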

+--------------------------------------------------------------------+
|                         TECHNICAL LINEAGE                          |
+--------------------------------------------------------------------+
|                                                                    |
|   Source System         Staging               Warehouse            |
|                                                                    |
|   +--------------+      +--------------+      +--------------+     |
|   | Mobile App   |      | staging.     |      | dim_         |     |
|   | API          |      | beneficiary  |      | beneficiary  |     |
|   |              |      |              |      |              |     |
|   | reg_timestamp+----->| created_at   +----->| registration |     |
|   | (UTC)        |      | (UTC)        |      | _date        |     |
|   |              |      |              |      | (local date) |     |
|   +--------------+      +------+-------+      +------+-------+     |
|                                |                     |             |
|                                v                     v             |
|                       +--------+--------+   +--------+--------+    |
|                       | Transform:      |   | Transform:      |    |
|                       | - Validate      |   | - TZ convert    |    |
|                       | - Deduplicate   |   | - Truncate      |    |
|                       | - Type cast     |   | - Format        |    |
|                       +-----------------+   +-----------------+    |
|                                                                    |
+--------------------------------------------------------------------+

Figure 1: Technical lineage showing column-level transformations from mobile API through staging to warehouse

Business lineage abstracts away transformation details to show how business concepts relate. The same registration date example, expressed as business lineage, might show that the “Beneficiary Registration Report” depends on the “Beneficiary” entity, which is sourced from the “Mobile Registration” process. Business users do not need to know about timezone conversions or staging tables; they need to understand that registration metrics come from the mobile data collection workflow and that changes to that workflow affect registration reporting.

+--------------------------------------------------------------------+
|                          BUSINESS LINEAGE                          |
+--------------------------------------------------------------------+
|                                                                    |
|  Business Process       Business Entity       Business Report      |
|                                                                    |
|  +----------------+     +----------------+    +----------------+   |
|  |                |     |                |    |                |   |
|  | Mobile         +---->| Beneficiary    +--->| Registration   |   |
|  | Registration   |     |                |    | Dashboard      |   |
|  |                |     +-------+--------+    |                |   |
|  +----------------+             |             +----------------+   |
|                                 |                                  |
|  +----------------+             v             +----------------+   |
|  |                |     +-------+--------+    |                |   |
|  | Case           +---->|                +--->| Quarterly      |   |
|  | Management     |     | Service        |    | Programme      |   |
|  |                |     | Delivery       |    | Report         |   |
|  +----------------+     +----------------+    +----------------+   |
|                                                                    |
+--------------------------------------------------------------------+

Figure 2: Business lineage connecting processes, entities, and reports without technical detail

The relationship between technical and business lineage is hierarchical. Business lineage nodes aggregate multiple technical lineage paths. The “Beneficiary” business entity encompasses dozens of database tables, and the connection between “Mobile Registration” and “Beneficiary” represents hundreds of column-level mappings. Effective lineage systems maintain both levels and allow navigation between them: a data steward viewing business lineage can drill into technical detail when investigating an issue, while an analyst viewing technical lineage can navigate up to understand business context.

Lineage Capture Mechanisms

Lineage information enters a lineage system through three distinct mechanisms, each with different coverage, accuracy, and operational overhead. The choice of mechanism depends on the systems being tracked, the granularity required, and the organisation’s capacity to maintain lineage infrastructure.

Manual documentation requires humans to record lineage relationships in a catalogue or lineage tool. An analyst documents that Report X uses Table Y, or a developer records that Pipeline A reads from System B and writes to System C. Manual documentation works for any system regardless of its technical capabilities, making it the only option for legacy systems, spreadsheets, and processes that exist outside automated pipelines. The accuracy of manual documentation depends entirely on human diligence. Documentation created during initial system implementation becomes outdated as systems evolve unless ongoing maintenance processes enforce updates. Manual documentation suits organisations with limited pipeline automation, where the volume of lineage relationships is manageable, and where the cost of automated capture exceeds the cost of manual maintenance.

Automated parsing extracts lineage from code, SQL, ETL tool configurations, and pipeline definitions without runtime observation. A parsing engine analyses a SQL statement to determine which source tables contribute to a target table, or examines an Airflow DAG definition to identify data dependencies between tasks. Parsing operates on artefacts that exist before execution: source code, configuration files, query text. This mechanism captures lineage at deployment time rather than runtime, meaning it reflects what the code says rather than what actually executes. Parsing handles conditional logic and dynamic SQL with varying accuracy depending on the parser’s sophistication. Simple column-level lineage from straightforward SQL parses reliably. Complex transformations involving user-defined functions, dynamic table names, or procedural code require parsers that understand the specific languages and frameworks involved.
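As a sketch of what deployment-time parsing does, the fragment below extracts source tables from a flat SQL statement with a regular expression. This is deliberately naive — production parsers handle subqueries, CTEs, quoted identifiers, and dialect differences — and the statement itself is illustrative:

```python
import re

def extract_source_tables(sql: str) -> set[str]:
    """Naive deployment-time lineage extraction: collect table names
    that follow FROM or JOIN in a flat statement. Illustrative only;
    real parsers build a full syntax tree of the dialect."""
    pattern = r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))

# Illustrative statement loosely based on the staging example in the text.
sql = """
INSERT INTO dim_beneficiary
SELECT s.created_at, h.household_id
FROM beneficiary_staging s
JOIN household_staging h ON s.hh_id = h.hh_id
"""
```

Even this crude approach yields dataset-level lineage (target table plus source tables); column-level lineage requires parsing the SELECT expressions as well.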

Runtime capture instruments pipeline execution to observe actual data flow as it occurs. When a pipeline runs, instrumentation records which datasets were read, which transformations executed, and which datasets were written. Runtime capture reflects actual execution paths, including conditional branches that parsing might miss. The OpenLineage specification standardises runtime lineage events, allowing different pipeline tools to emit lineage in a common format. Runtime capture requires integration with each pipeline tool, either through native support, plugins, or custom instrumentation. The operational overhead includes the performance impact of instrumentation, the infrastructure to receive and store lineage events, and the expertise to configure capture across heterogeneous pipeline technologies.
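A runtime lineage event can be sketched as follows. The field layout mirrors the OpenLineage run-event structure (eventType, eventTime, run, job, inputs, outputs, producer), but the namespaces and producer URL here are illustrative assumptions, not values taken from the specification:

```python
import json
import uuid
from datetime import datetime, timezone

def lineage_run_event(job_name: str, inputs: list[str], outputs: list[str]) -> str:
    """Build a minimal OpenLineage-style COMPLETE event for one pipeline
    run. Dataset and job namespaces are illustrative placeholders."""
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "pipelines", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.org/lineage-demo",
    }
    return json.dumps(event)
```

In a real deployment the pipeline tool's OpenLineage integration emits these events automatically at run start and completion; a collector such as Marquez receives them over HTTP and assembles the lineage graph.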

+--------------------------------------------------------------------+
|                    LINEAGE CAPTURE ARCHITECTURE                    |
+--------------------------------------------------------------------+
|                                                                    |
|                     +-------------------------+                    |
|                     |                         |                    |
|                     |      LINEAGE STORE      |                    |
|                     |    (Graph Database)     |                    |
|                     |                         |                    |
|                     +------------+------------+                    |
|                                  ^                                 |
|            +---------------------+---------------------+           |
|            |                     |                     |           |
|   +--------+--------+   +--------+--------+   +--------+--------+  |
|   |                 |   |                 |   |                 |  |
|   |  Manual Entry   |   |   Automated     |   |    Runtime      |  |
|   |      API        |   |    Parser       |   |   Collector     |  |
|   |                 |   |                 |   |                 |  |
|   +--------+--------+   +--------+--------+   +--------+--------+  |
|            ^                     ^                     ^           |
|            |                     |                     |           |
|   +--------+--------+   +--------+--------+   +--------+--------+  |
|   |  Catalogue UI   |   |   Code Repos    |   |   Pipeline      |  |
|   |  Spreadsheets   |   |   SQL Files     |   |   Execution     |  |
|   |  Documentation  |   |   ETL Configs   |   |   Events        |  |
|   +-----------------+   +-----------------+   +-----------------+  |
|                                                                    |
+--------------------------------------------------------------------+

Figure 3: Lineage capture architecture showing three mechanisms feeding a central lineage store

Most organisations use a combination of mechanisms. Automated parsing handles the bulk of pipeline lineage where tools support it. Runtime capture fills gaps for complex transformations and validates parsed lineage against actual execution. Manual documentation covers systems outside the parsing and runtime infrastructure: legacy databases, partner data exchanges, and analytical spreadsheets that exist beyond pipeline automation.

Lineage Granularity

Lineage relationships can be captured at different levels of detail, from coarse dataset-level connections to fine-grained record-level provenance. The appropriate granularity depends on the questions lineage must answer, the technical capability of capture mechanisms, and the storage and query performance requirements of the lineage system.

Dataset-level lineage records relationships between entire datasets: tables, files, API endpoints, reports. This granularity answers questions like “Which tables does this report depend on?” or “What downstream tables are affected if I drop this source table?” Dataset-level lineage requires minimal capture sophistication, as dependencies between datasets are straightforward to identify from pipeline configurations, SQL FROM clauses, and tool integrations. Storage requirements are modest because the number of datasets is orders of magnitude smaller than the number of columns or records. Dataset lineage is insufficient for detailed impact analysis when a change affects only specific columns within a table, or when debugging requires knowing which source column contributed to an incorrect value.

Column-level lineage records relationships between individual columns across datasets. This granularity answers questions like “Which source columns contribute to this calculated field?” or “If I change the data type of this column, which downstream columns are affected?” Column-level lineage enables precise impact analysis and supports data quality root cause investigation by tracing an incorrect value back through specific transformation steps. Capturing column-level lineage requires more sophisticated parsing that understands SQL expressions, transformation logic, and schema mappings. Storage requirements increase significantly, as a single table with 50 columns joining 10 source tables might generate hundreds of column-level relationships.

Record-level lineage tracks the provenance of individual data records, recording which source records contributed to each output record. This granularity answers questions like “Which mobile submissions contributed to this beneficiary’s record?” or “Can I demonstrate the origin of this specific data point for an audit?” Record-level lineage is computationally expensive to capture and store, as it requires tracking lineage during actual data processing rather than through static analysis. The storage footprint scales with data volume rather than schema size. Record-level lineage is reserved for high-value use cases: regulatory compliance requiring demonstrable provenance, audit trails for financial data, and protection data where accountability for individual records is mandatory.

+---------------------------------------------------------------------+
|                      LINEAGE GRANULARITY LEVELS                     |
+---------------------------------------------------------------------+
|                                                                     |
|  DATASET LEVEL                                                      |
|                                                                     |
|  +----------------+      +----------------+      +----------------+ |
|  | registration   +----->| beneficiary    +----->| quarterly      | |
|  | _raw           |      | _warehouse     |      | _report        | |
|  |                |      |                |      |                | |
|  +----------------+      +----------------+      +----------------+ |
|                                                                     |
|  Questions answered: "What tables feed this report?"                |
|                                                                     |
+---------------------------------------------------------------------+
|                                                                     |
|  COLUMN LEVEL                                                       |
|                                                                     |
|  +----------------+      +----------------+      +----------------+ |
|  | registration   |      | beneficiary    |      | quarterly      | |
|  | _raw           |      | _warehouse     |      | _report        | |
|  |                |      |                |      |                | |
|  | first_name --+ |      |                |      |                | |
|  | last_name ---+-+----->| full_name -----+----->| name           | |
|  | dob -----------+----->| dob -----------+----->| age            | |
|  | reg_date ------+----->| registered ----+----->| reg_month      | |
|  +----------------+      +----------------+      +----------------+ |
|                                                                     |
|  Questions answered: "Which source columns affect this field?"      |
|                                                                     |
+---------------------------------------------------------------------+
|                                                                     |
|  RECORD LEVEL                                                       |
|                                                                     |
|  +----------------+      +----------------+      +----------------+ |
|  | registration   |      | beneficiary    |      | quarterly      | |
|  | _raw           |      | _warehouse     |      | _report        | |
|  |                |      |                |      |                | |
|  | rec_001 -----+ |      |                |      |                | |
|  | rec_002 -----+-+----->| ben_042 -------+----->| row_7          | |
|  | rec_003 -------+----->| ben_043 -------+----->| row_8          | |
|  +----------------+      +----------------+      +----------------+ |
|                                                                     |
|  Questions answered: "Which source records created this output?"    |
|                                                                     |
+---------------------------------------------------------------------+

Figure 4: Lineage granularity levels showing dataset, column, and record tracking

The practical approach for most organisations is column-level lineage as the standard, with dataset-level as a fallback for systems where column parsing is unavailable, and record-level reserved for specific compliance or audit requirements on sensitive data.
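One way to make the three levels concrete is as progressively richer edge records, where finer granularities populate more fields. The schema below is an illustrative sketch, not a standard lineage model:

```python
from typing import NamedTuple, Optional

class LineageEdge(NamedTuple):
    """One lineage relationship. Dataset-level edges leave the column
    and record fields unset; column-level edges add columns; record-
    level provenance adds record identifiers. Illustrative schema."""
    src_dataset: str
    dst_dataset: str
    src_column: Optional[str] = None   # populated for column-level lineage
    dst_column: Optional[str] = None
    src_record: Optional[str] = None   # populated for record-level provenance
    dst_record: Optional[str] = None

# The three granularities from Figure 4, as edge records:
dataset_edge = LineageEdge("registration_raw", "beneficiary_warehouse")
column_edge = LineageEdge("registration_raw", "beneficiary_warehouse",
                          src_column="first_name", dst_column="full_name")
record_edge = LineageEdge("registration_raw", "beneficiary_warehouse",
                          src_record="rec_001", dst_record="ben_042")
```

The storage implications follow directly: dataset edges number in the hundreds, column edges in the thousands, and record edges scale with data volume.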

Impact Analysis

Impact analysis uses lineage to predict the consequences of a proposed change before implementing it. When a data engineer proposes modifying a column’s data type, renaming a table, or changing transformation logic, impact analysis traverses the lineage graph to identify every downstream asset that depends on the changed asset. This forward traversal from source to consumers answers the question: “What breaks if I change this?”

The mechanics of impact analysis involve querying the lineage graph with the changed asset as the starting node and traversing all outbound relationships recursively. For a proposed change to the beneficiary_staging.created_at column, the analysis traverses to dim_beneficiary.registration_date, then to every report, dashboard, and downstream table that references dim_beneficiary.registration_date. The traversal continues until reaching terminal nodes with no further downstream dependencies.
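The traversal can be sketched as a breadth-first search over an adjacency map. The column names mirror the worked example above; the dashboard entry is an illustrative addition:

```python
from collections import deque

def downstream_impact(lineage: dict[str, list[str]], start: str) -> set[str]:
    """Forward traversal: return every asset reachable from `start`
    through outbound lineage edges. Reversing the edge map gives the
    upstream (root-cause) traversal described later in this section."""
    affected, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected

# Edges mirror the worked example; the dashboard node is illustrative.
edges = {
    "beneficiary_staging.created_at": ["dim_beneficiary.registration_date"],
    "dim_beneficiary.registration_date": ["quarterly_report.reg_month",
                                          "registration_dashboard.reg_date"],
}
```

The visited set both deduplicates results and guards against cycles, which real lineage graphs can contain when pipelines write back to source systems.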

Impact analysis output takes several forms depending on the audience and purpose. For a data engineer assessing technical risk, the output is a list of affected columns and tables with their technical owners. For a change advisory board approving a release, the output is a summary of affected business reports and the stakeholders who consume them. For a data steward coordinating changes across teams, the output is a communication plan identifying whom to notify before the change proceeds.

The value of impact analysis depends directly on lineage completeness. Gaps in lineage coverage create blind spots where changes propagate without warning. A table with no captured downstream lineage appears safe to modify, but hidden dependencies in uncatalogued spreadsheets or partner integrations break silently. Organisations establish lineage coverage metrics to quantify the proportion of data assets with documented lineage, and they prioritise closing gaps in high-change areas where impact analysis is most critical.

Impact analysis also operates in reverse, tracing upstream from a problem to its source. When a report shows incorrect values, reverse impact analysis identifies all upstream assets that contribute to the problematic field. This upstream traversal narrows the investigation scope by eliminating sources that do not participate in the lineage path to the error. Combined with data quality metrics at each lineage node, root cause analysis can pinpoint where quality degradation occurred.

Regulatory Compliance Applications

Data protection regulations create specific requirements that lineage directly supports. The General Data Protection Regulation (GDPR) grants data subjects the right to know what personal data an organisation holds about them, where that data originated, and to whom it has been disclosed. Responding to a subject access request (SAR) requires tracing all systems containing a data subject’s information, which is precisely what lineage records.

When a data subject submits an access request, lineage enables systematic identification of every system containing their data. Starting from the initial collection point, lineage traversal follows the data subject’s information through staging, warehouses, reports, archives, and any downstream systems that received copies or derivatives. Without lineage, SAR response depends on institutional knowledge about system relationships, manual searches across databases, and hope that no systems were overlooked. With lineage, response follows a deterministic process: identify the data subject’s records in source systems, traverse lineage to all downstream locations, extract relevant data from each location.

The right to erasure (right to be forgotten) presents a more demanding lineage use case. Erasing a data subject’s information requires not only identifying all locations, but also understanding dependencies that might prevent erasure. If a beneficiary’s data contributed to aggregate statistics in a published report, the individual record might be erasable while the aggregate is not. Lineage clarifies these dependencies, enabling informed decisions about what can be erased, what must be anonymised, and what is retained under legitimate basis exceptions.

Record-level lineage is particularly valuable for GDPR compliance because it tracks individual data subjects rather than general data flows. A beneficiary’s registration record that was collected in 2019, migrated through three system upgrades, and contributed to 47 reports has a provenance history that record-level lineage captures. Demonstrating lawful processing basis for each step of that journey requires lineage documentation.

Donor compliance presents analogous requirements. Major institutional donors require documentation of data handling practices for grant-funded activities. USAID, FCDO, ECHO, and UN agencies each have data management requirements that include demonstrating how beneficiary data is collected, stored, processed, and protected. Lineage provides the documentation backbone for these requirements, showing the journey of programme data from field collection through reporting to donor submissions.

Lineage Storage and Query Architecture

Lineage relationships form a graph structure where nodes represent data assets and edges represent lineage connections. This structure maps naturally to graph databases, which store and query graph data more efficiently than relational databases for lineage use cases. The primary query patterns for lineage, traversing upstream and downstream paths of arbitrary depth, are native operations for graph databases but require recursive CTEs or multiple queries in relational systems.

A lineage graph contains several node types: datasets (tables, files, API endpoints), columns within datasets, transformations (ETL jobs, SQL queries, pipeline stages), and business assets (reports, dashboards, metrics). Edges connect these nodes with relationship types that indicate the nature of the lineage connection: “column A feeds column B”, “transformation T reads dataset D”, “report R displays metric M”.

+------------------------------------------------------------------+
|                     LINEAGE GRAPH STRUCTURE                      |
+------------------------------------------------------------------+
|                                                                  |
|  Node Types                        Edge Types                    |
|  +--------------------+            +--------------------+        |
|  | DATASET            |            | FEEDS              |        |
|  | - table            |            |   (column to       |        |
|  | - file             |            |    column)         |        |
|  | - api_endpoint     |            +--------------------+        |
|  +--------------------+            | READS              |        |
|  | COLUMN             |            |   (transform to    |        |
|  | - name             |            |    dataset)        |        |
|  | - data_type        |            +--------------------+        |
|  | - parent_dataset   |            | WRITES             |        |
|  +--------------------+            |   (transform to    |        |
|  | TRANSFORMATION     |            |    dataset)        |        |
|  | - job_name         |            +--------------------+        |
|  | - transform_type   |            | DISPLAYS           |        |
|  | - code_location    |            |   (report to       |        |
|  +--------------------+            |    column/metric)  |        |
|  | BUSINESS_ASSET     |            +--------------------+        |
|  | - report           |            | DERIVES            |        |
|  | - dashboard        |            |   (business entity |        |
|  | - metric           |            |    to dataset)     |        |
|  +--------------------+            +--------------------+        |
|                                                                  |
+------------------------------------------------------------------+
|                                                                  |
|  Example Graph Fragment                                          |
|                                                                  |
|  [registration_raw]                                              |
|          |                                                       |
|  [beneficiary_staging]                                           |
|    (first_name)----+                                             |
|    (last_name)-----+--->[concat_names]--->(full_name)            |
|                                                |                 |
|                                       [dim_beneficiary]          |
|                                                |                 |
|                                                v                 |
|                                       [quarterly_report]         |
|                                                                  |
+------------------------------------------------------------------+

Graph databases suitable for lineage storage include Neo4j (commercial with community edition), JanusGraph (open source, distributed), and Amazon Neptune (managed service). The choice depends on scale requirements, operational capacity, and existing infrastructure. Organisations processing fewer than 1 million lineage edges operate comfortably on Neo4j Community Edition. Larger deployments with billions of edges require distributed solutions like JanusGraph or managed services.

Query performance for lineage depends on index design and query patterns. Indices on asset identifiers (table names, column names, job names) enable fast lookups for the starting node of a traversal. Path queries that traverse multiple hops benefit from graph database optimisations that relational databases cannot match. A query to find all downstream consumers of a column, with no limit on path depth, executes in milliseconds on a graph database but requires complex recursive queries on relational storage.
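For comparison, the relational alternative mentioned above can be sketched with a recursive CTE. The example uses an in-memory SQLite database and an illustrative lineage_edges table; the asset names follow the document's worked example:

```python
import sqlite3

# The recursive CTE is the relational-database equivalent of a native
# graph traversal: find all downstream consumers of a starting asset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lineage_edges (src TEXT, dst TEXT);
INSERT INTO lineage_edges VALUES
  ('staging.created_at', 'dim_beneficiary.registration_date'),
  ('dim_beneficiary.registration_date', 'quarterly_report.reg_month');
""")
rows = conn.execute("""
WITH RECURSIVE downstream(asset) AS (
    SELECT dst FROM lineage_edges WHERE src = ?
    UNION
    SELECT e.dst FROM lineage_edges e JOIN downstream d ON e.src = d.asset
)
SELECT asset FROM downstream
""", ("staging.created_at",)).fetchall()
downstream_assets = {r[0] for r in rows}
```

The UNION (rather than UNION ALL) deduplicates visited assets, which also terminates the recursion if the edge table contains a cycle. At small scale this works fine; the graph-database advantage appears when edge counts and traversal depths grow.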

Lineage systems integrate with data catalogues to provide unified search and navigation. The catalogue stores descriptive metadata (business definitions, ownership, classification) while the lineage system stores relationship metadata. Users search the catalogue to find a dataset of interest, then navigate to the lineage view to understand its dependencies. Bidirectional integration allows lineage queries to return enriched results that include catalogue metadata alongside lineage paths.

Technology Options

Lineage capabilities exist in standalone lineage tools, data catalogue platforms, and as features within broader data platforms. The choice depends on existing infrastructure, integration requirements, and the sophistication of lineage capture needed.

Open Source Options

Apache Atlas provides data governance capabilities including lineage for the Hadoop ecosystem. Atlas captures lineage from Hive, Spark, Sqoop, and other Hadoop-adjacent tools through hooks that emit lineage events during execution. The lineage model supports dataset and column granularity. Atlas suits organisations with significant Hadoop infrastructure and open source preferences. Deployment requires Kafka, Solr, and HBase, creating operational overhead for organisations without existing expertise in these technologies.

OpenLineage is a specification rather than a platform, defining a standard format for lineage events that any tool can emit and any collector can receive. OpenLineage integration exists for Airflow, Spark, dbt, and Great Expectations. Organisations adopting OpenLineage implement a collector (Marquez is the reference implementation) and configure pipeline tools to emit OpenLineage events. The specification approach provides flexibility but requires integration work for each pipeline tool.

Marquez implements the OpenLineage specification with a lineage store, API, and basic web interface. Marquez collects OpenLineage events from integrated tools and provides lineage query capabilities. As a focused lineage solution, Marquez lacks the broader catalogue features of Apache Atlas but offers simpler deployment (PostgreSQL backend, Docker deployment) and lower operational overhead.

DataHub combines data catalogue and lineage capabilities in an open source platform that originated at LinkedIn and is proven in production there. DataHub ingests metadata from databases, BI tools, orchestrators, and data quality tools, capturing lineage during ingestion. The platform provides search, discovery, lineage visualisation, and governance features. DataHub’s architecture uses Kafka, Elasticsearch, MySQL, and Neo4j, requiring infrastructure investment but offering comprehensive capabilities.

Commercial Options

Alation provides an enterprise data catalogue with lineage, offering automated lineage extraction from SQL, BI tools, and ETL platforms. Alation’s query log analysis captures lineage from actual query execution, providing column-level accuracy for supported databases. The platform excels in SQL-based environments with strong Tableau, Power BI, and Looker integrations.

Collibra offers a data governance platform with lineage as one component alongside catalogue, glossary, and policy management. Collibra Lineage captures from databases, BI tools, and ETL through connectors and automated harvesting. The platform targets enterprise governance programmes requiring comprehensive policy enforcement alongside lineage.

Both commercial options price per data asset or user, with enterprise licensing exceeding $100,000 annually for medium-scale deployments. Nonprofit pricing programmes reduce costs where available but remain significant compared to open source alternatives.

Capability              Apache Atlas   OpenLineage + Marquez   DataHub        Alation         Collibra
----------------------  -------------  ----------------------  -------------  --------------  --------------
Dataset lineage         Yes            Yes                     Yes            Yes             Yes
Column lineage          Yes            Yes                     Yes            Yes             Yes
Record lineage          No             With custom work        No             No              Limited
SQL parsing             Hive, Spark    Via plugins             Multiple       Strong          Strong
BI tool integration     Limited        Via plugins             Yes            Strong          Strong
Catalogue features      Basic          No (Marquez)            Comprehensive  Comprehensive   Comprehensive
Deployment complexity   High           Medium                  Medium-High    Managed         Managed
Open source             Yes            Yes                     Yes            No              No

Implementation Considerations

For Organisations with Limited IT Capacity

Lineage implementation for small teams focuses on maximum value with minimum infrastructure. Dataset-level lineage provides meaningful impact analysis and compliance support without the complexity of column-level capture. Manual documentation of lineage relationships in a spreadsheet or simple catalogue tool costs nothing beyond time and delivers immediate value for audit and SAR response.

A minimum viable lineage implementation:

  1. Create a spreadsheet listing all data assets (tables, files, reports) with columns for upstream sources and downstream consumers
  2. Document the most critical lineage paths: registration data through to donor reports, protection data through to secure storage
  3. Review and update documentation quarterly or when systems change
  4. Use the documentation for impact analysis before making changes
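Before a dedicated tool is justified, the spreadsheet itself can answer impact questions with a few lines of standard-library code. The sheet layout assumed below (one row per asset, semicolon-separated downstream consumers) and the sample rows are illustrative:

```python
import csv
import io

# A spreadsheet export like the one described in step 1. In practice
# this string would be read from the exported CSV file.
sheet = """asset,downstream
registration_raw,beneficiary_warehouse
beneficiary_warehouse,quarterly_report;registration_dashboard
quarterly_report,
"""

def downstream_of(asset: str) -> list[str]:
    """Look up an asset's documented downstream consumers (step 4:
    impact analysis before making a change)."""
    for row in csv.DictReader(io.StringIO(sheet)):
        if row["asset"] == asset:
            return [d for d in row["downstream"].split(";") if d]
    return []
```

An empty result means either a terminal asset or a documentation gap, which is itself useful to surface during the quarterly review in step 3.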

This approach scales to approximately 100 data assets before maintenance burden exceeds the capacity of manual processes. Organisations approaching this threshold should evaluate lightweight tools like DataHub (self-hosted) or cloud-native options like AWS Glue Data Catalog (limited lineage but integrated with AWS infrastructure).

For Organisations with Established Data Functions

Organisations with dedicated data engineering capacity implement automated lineage capture integrated with pipeline tooling. The OpenLineage specification provides a standards-based approach that avoids vendor lock-in while enabling integration across heterogeneous pipelines.

An effective implementation sequence:

  1. Deploy Marquez or DataHub as the lineage store, depending on whether broader catalogue features are needed
  2. Enable OpenLineage integration in primary pipeline tools (Airflow, dbt, Spark)
  3. Configure SQL parsers for databases not covered by runtime capture
  4. Implement governance processes requiring lineage documentation for new pipelines
  5. Build impact analysis into change management workflows

Column-level lineage becomes feasible with automated capture, enabling precise impact analysis and detailed root cause investigation. The implementation investment pays back through reduced incident investigation time, faster change impact assessment, and systematic SAR response.

Lineage Coverage Metrics

Lineage value correlates with coverage completeness. Establish metrics to track lineage coverage and prioritise gap closure:

Asset coverage measures the percentage of data assets with documented lineage. Calculate separately for upstream lineage (sources documented) and downstream lineage (consumers documented). Target 90% coverage for critical data domains (beneficiary data, financial data, protection data) and 70% coverage for operational data.

Path completeness measures whether lineage paths are fully connected from source to consumption. A report with documented direct dependencies but missing intermediate lineage has incomplete path coverage. Track the percentage of terminal nodes (reports, dashboards) with complete lineage paths to original sources.

Freshness measures how current lineage documentation is relative to system changes. Automated capture maintains freshness automatically. Manual documentation requires review cycles aligned with change frequency, with staleness thresholds appropriate to the domain.
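Asset coverage is straightforward to compute from an asset inventory. The input shape below (asset name mapped to upstream/downstream documentation flags) is an illustrative sketch:

```python
def asset_coverage(assets: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Percentage of assets with documented upstream and downstream
    lineage, computed separately as the text recommends."""
    total = len(assets)
    return {
        "upstream_pct": 100 * sum(a["upstream"] for a in assets.values()) / total,
        "downstream_pct": 100 * sum(a["downstream"] for a in assets.values()) / total,
    }

# Illustrative inventory; in practice this comes from the catalogue.
inventory = {
    "registration_raw":      {"upstream": True,  "downstream": True},
    "beneficiary_warehouse": {"upstream": True,  "downstream": True},
    "legacy_export":         {"upstream": False, "downstream": True},
    "partner_feed":          {"upstream": False, "downstream": False},
}
```

Comparing the two percentages against the 90% / 70% targets per data domain turns coverage from an abstract goal into a reviewable metric.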

Integration with Data Quality

Lineage amplifies data quality management by connecting quality metrics to their sources. When quality monitoring detects anomalies in a warehouse table, lineage identifies which upstream sources and transformations contributed to the problematic data. Quality scores attached to lineage nodes enable proactive identification of at-risk downstream assets: if a source table’s completeness score drops below threshold, lineage identifies every downstream consumer that inherits the quality risk.

This integration requires lineage and quality systems to share a common asset model, either through direct integration or through a shared catalogue that both systems reference. DataHub provides this integration natively. Organisations using separate tools for lineage and quality establish identifier conventions that enable correlation between systems.

See Also