Data Cataloguing and Metadata
A data catalogue is a searchable inventory of an organisation’s data assets that enables staff to find, understand, and appropriately use data without requiring direct knowledge of where that data resides or how it is structured. The catalogue achieves this through systematic collection and organisation of metadata, which describes data assets in terms that both technical and non-technical users can interpret. For mission-driven organisations managing programme data across multiple systems, geographies, and partnerships, cataloguing transforms scattered information into a navigable resource that supports evidence-based decisions, reduces duplication, and ensures institutional knowledge survives staff transitions.
- Metadata: Data that describes other data. Includes technical attributes such as column names and data types, business context such as definitions and ownership, and operational details such as update frequency and access patterns.
- Data catalogue: A centralised, searchable repository of metadata that provides a unified view of organisational data assets regardless of their physical location or format.
- Business glossary: A controlled vocabulary of business terms with authoritative definitions, relationships, and ownership that ensures consistent interpretation across the organisation.
- Data asset: Any collection of data that has value to the organisation: databases, tables, files, APIs, reports, or datasets regardless of storage location.
- Metadata harvesting: Automated extraction of metadata from source systems through connectors that scan databases, files, and applications to populate the catalogue.
Metadata types and relationships
Metadata divides into three categories that serve different audiences and purposes. Technical metadata describes data structure and storage. Business metadata provides meaning and context. Operational metadata tracks how data behaves over time. Effective cataloguing requires all three types working together, as technical metadata alone cannot tell a programme manager whether a dataset contains the beneficiary information they need, and business metadata alone cannot help a developer build an integration.
Technical metadata originates from source systems and describes physical characteristics. A database table generates technical metadata including its schema name, column names, data types, primary and foreign keys, indexes, and constraints. A CSV file generates technical metadata including its path, size, delimiter, encoding, and header row configuration. Technical metadata answers questions about structure: what columns exist, what data types are permitted, how tables relate to each other.
Business metadata originates from human documentation and describes meaning. The same database table requires business metadata to explain what it represents in operational terms: this table stores beneficiary registration records from the 2024 flood response in Cox’s Bazar. Business metadata includes definitions that explain what each column means in programme terms, ownership that identifies who is responsible for the data, sensitivity classification that determines access restrictions, and context that explains how the data relates to organisational activities. Business metadata answers questions about purpose: what does this data mean, who owns it, can I use it for this analysis.
Operational metadata originates from monitoring systems and describes behaviour. The same database table accumulates operational metadata over time: it was last updated 3 hours ago, it contains 47,293 rows, it grows by approximately 200 rows per day, it is accessed by 12 unique users per week, the last ETL job completed successfully at 04:15 UTC. Operational metadata answers questions about reliability: is this data current, how much data exists, who else uses it.
SOURCE SYSTEMS           --generates-->   TECHNICAL METADATA
 (databases, files)                        - schema, tables, columns, types
                                           - keys, constraints, file paths, formats

DATA STEWARDS            --enriches--->   BUSINESS METADATA
 (document meaning,                        - definitions, ownership
  assign ownership,                        - classification, business context
  classify sensitivity)

MONITORING SYSTEMS       --combines--->   OPERATIONAL METADATA
 (track access, measure                    - last update time, row counts
  freshness, count usage)                  - access frequency, job status

                         all three --feeds-->  DATA CATALOGUE
                                               (unified, searchable view of all metadata)

Figure 1: Metadata types originate from different sources and combine in the catalogue
The relationships between metadata types create a complete picture of each data asset. Technical metadata from a PostgreSQL database indicates that beneficiary_registrations contains a column named hh_size with type integer. Business metadata from steward documentation explains that hh_size represents household size, defined as the count of individuals who share meals and sleeping arrangements, owned by the M&E team, and classified as programme-sensitive. Operational metadata from monitoring systems shows that the column contains values ranging from 1 to 23, with a mean of 5.2, updated daily via the KoboToolbox sync job that last ran successfully 6 hours ago. A programme analyst searching for household data can find this asset, understand what it means, assess whether it meets their needs, and identify whom to contact for access.
Data catalogue architecture
A data catalogue consists of four architectural layers: connectors that harvest metadata from source systems, a metadata repository that stores and indexes the harvested information, a search and discovery interface that enables users to find relevant assets, and governance features that manage access, quality, and lineage.
SOURCE SYSTEMS
 (Postgres DB, Kobo API, spreadsheets, files)
        |
        v
CONNECTOR LAYER
 (JDBC connector, REST API connector, file scanner)
        |
        v
METADATA REPOSITORY
 - Graph store:     entities, relations, properties
 - Search index:    full text, faceted, ranked
 - Versioning:      history, audit
 - Access control:  permissions, policies
        |
        v
USER INTERFACE LAYER
 - Search portal, browse explorer, glossary manager
 - Profile viewer, lineage viewer, admin console

Figure 2: Four-layer catalogue architecture from connectors to user interface
The connector layer bridges the catalogue to source systems. Each connector type handles a specific protocol or system type. JDBC connectors extract metadata from relational databases by querying system catalogues such as information_schema in PostgreSQL or sys.tables in SQL Server. REST API connectors retrieve metadata from web services and SaaS platforms by calling their metadata endpoints. File scanners traverse storage systems, parsing file structures and inferring schemas from content. Custom connectors handle proprietary systems or non-standard formats.
A connector for PostgreSQL executes queries against system tables to extract database structure:
-- Extract table metadata
SELECT schemaname,
       tablename,
       tableowner,
       pg_total_relation_size(schemaname || '.' || tablename) AS size_bytes
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
-- Extract column metadata
SELECT table_schema,
       table_name,
       column_name,
       data_type,
       character_maximum_length,
       is_nullable,
       column_default
FROM information_schema.columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema');

The metadata repository stores harvested information in structures optimised for different access patterns. A graph store represents entities and their relationships, enabling traversal queries such as finding all tables that contain beneficiary data or identifying which reports depend on a specific source table. A search index enables full-text queries across all metadata, supporting discovery by keyword, facet filtering by classification or owner, and relevance ranking. Versioning tracks changes to metadata over time, maintaining history for audit and enabling comparison of current state to previous states. Access control enforces permissions, ensuring users see only metadata for assets they are authorised to access.
The user interface layer presents metadata to different audiences. A search portal provides the primary discovery mechanism, accepting natural language queries and returning ranked results with previews. A browse explorer offers hierarchical navigation through systems, databases, schemas, and tables for users who prefer structured exploration. A glossary manager enables stewards to maintain business term definitions. A profile viewer displays detailed metadata for individual assets including statistics, sample values, and quality indicators. A lineage viewer visualises data flow upstream to sources and downstream to consumers. An admin console provides configuration, connector management, and governance controls.
Cataloguing workflow
Building a useful catalogue requires a systematic process that moves from broad inventory through progressive enrichment. The workflow proceeds in four stages: discovery identifies what data assets exist, profiling extracts technical metadata automatically, enrichment adds business context through human documentation, and maintenance keeps the catalogue current as systems evolve.
STAGE 1: DISCOVERY
  identify source systems -> configure connectors -> run initial harvesting -> review findings
  Output: inventory of data assets
        |
        v
STAGE 2: PROFILING
  execute schema scans -> analyse data samples -> detect patterns -> store profiles
  Output: technical metadata, statistics, quality indicators
        |
        v
STAGE 3: ENRICHMENT
  assign owners -> document definitions -> link to glossary terms -> classify sensitivity
  Output: business context, searchable descriptions
        |
        v
STAGE 4: MAINTENANCE
  schedule regular scans -> detect changes -> alert on anomalies -> review periodically
  Output: current, trusted catalogue

Figure 3: Four-stage workflow from discovery through ongoing maintenance
Discovery begins with an inventory of source systems. For a typical mission-driven organisation, this includes operational databases running programme applications, data collection platforms such as KoboToolbox or ODK, cloud storage containing spreadsheets and documents, business systems including finance and HR, and integration points with partners or donors. Each system requires appropriate connector configuration including connection credentials, scan scope, and harvesting schedule. Initial harvesting populates the catalogue with raw technical metadata: this database contains these schemas, which contain these tables, which contain these columns.
Profiling extends technical metadata with computed statistics. The catalogue samples data values to calculate distributions, detect patterns, and identify quality issues. For a beneficiary table, profiling might reveal that the phone_number column contains 94% valid phone numbers matching expected formats, 4% null values, and 2% malformed values; that registration_date ranges from 2024-01-15 to 2024-11-02 with no future dates; that household_size has mean 5.2, median 4, and maximum 23. These profiles enable users to assess fitness for purpose before requesting access.
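The statistics described above can be computed with very little machinery. The following pure-Python sketch profiles a numeric column and a pattern-matched string column; the sample values and the phone-number pattern are invented for illustration.

```python
import re
import statistics

# Minimal profiling sketch: compute the kinds of statistics described above
# from a sample of column values. Sample data and patterns are hypothetical.

def profile_numeric(values):
    present = [v for v in values if v is not None]
    return {
        "null_pct": round(100 * (len(values) - len(present)) / len(values), 1),
        "min": min(present),
        "max": max(present),
        "mean": round(statistics.mean(present), 1),
        "median": statistics.median(present),
    }

def profile_pattern(values, pattern):
    present = [v for v in values if v is not None]
    matches = sum(1 for v in present if re.fullmatch(pattern, v))
    return {
        "null_pct": round(100 * (len(values) - len(present)) / len(values), 1),
        "valid_pct": round(100 * matches / len(present), 1),
    }

household_size = [3, 4, None, 7, 4, 2, 23, 1]
phones = ["+8801712345678", "+8801898765432", None, "not-a-number"]

print(profile_numeric(household_size))
print(profile_pattern(phones, r"\+880\d{10}"))
```

In a real catalogue these profiles are computed over a sample (see the `sample_size` setting in the harvester configuration later in this section) rather than over full tables.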
Enrichment adds human-generated business context. Data stewards review automatically harvested assets and document their meaning, ownership, and usage guidance. Enrichment connects technical column names to business glossary terms, explaining that ben_id links to the glossary term “Beneficiary Identifier” which carries a precise definition, data quality rules, and sensitivity classification. Enrichment assigns ownership, identifying the programme team responsible for data accuracy and the technical team responsible for system availability. Enrichment provides usage notes, indicating that this dataset should not be used for aggregate reporting without accounting for duplicate registrations across response phases.
Maintenance keeps the catalogue aligned with reality as systems change. Scheduled harvesting detects new tables, modified columns, and deleted assets. Change detection compares current metadata to previous versions, flagging differences for review. Anomaly alerting notifies stewards when metrics shift unexpectedly, such as a table that normally grows by 200 rows daily suddenly adding 50,000 rows or a column that normally contains 2% nulls suddenly containing 40%. Periodic review cycles prompt stewards to verify that documentation remains accurate and ownership assignments remain current.
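The anomaly check described above reduces to comparing the latest change against the typical recent change. This is a hypothetical sketch, assuming daily row-count snapshots and a simple median-based threshold; real catalogues use more sophisticated detectors.

```python
# Hypothetical anomaly check: flag a table whose latest daily growth far
# exceeds its typical recent growth (threshold factor is an assumption).

def daily_growth_anomaly(row_counts, tolerance=5.0):
    """row_counts: daily snapshots, oldest first. Returns True when the
    latest day's growth exceeds `tolerance` times the median of earlier
    daily growths."""
    deltas = [b - a for a, b in zip(row_counts, row_counts[1:])]
    history, latest = deltas[:-1], deltas[-1]
    typical = sorted(history)[len(history) // 2]  # median of earlier deltas
    return typical > 0 and latest > tolerance * typical

# A table that normally grows ~200 rows/day suddenly adds 50,000 rows:
counts = [46000, 46210, 46400, 46605, 46800, 96800]
print(daily_growth_anomaly(counts))  # flags the jump
```

The same shape of check applies to null percentages or any other profiled metric: store the history, compare the newest value against it, and alert the steward on large deviations.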
Business glossary structure
A business glossary provides the authoritative vocabulary that enables consistent interpretation of data across the organisation. Without a glossary, the same term carries different meanings in different contexts: “beneficiary” might mean a registered individual in one system, a household in another, and anyone receiving any form of assistance in a third. The glossary resolves this ambiguity by establishing canonical definitions that all systems and reports reference.
DOMAIN: Programme Data
  +-- CATEGORY: Beneficiary --> terms ...
  +-- CATEGORY: Activity ----> terms ...

TERM: Beneficiary
  Definition:      An individual registered to receive direct assistance
                   from a programme intervention
  Synonyms:        Participant, Client, Programme Recipient
  Related terms:   Household, Indirect Beneficiary
  Owner:           M&E Director
  Classification:  Programme-sensitive
  Data quality rules:
    - Must have unique identifier
    - Must be linked to exactly one household
    - Registration date must not be in future
  Linked assets:
    - registration_db.beneficiaries.ben_id
    - kobo_flood_response.submissions.respondent_id
    - cva_platform.recipients.recipient_code

Figure 4: Glossary hierarchy from domain through category to term definitions
Glossary structure organises terms hierarchically. Domains represent major subject areas such as programme data, finance, human resources, or operations. Categories group related terms within domains: the programme data domain contains categories for beneficiaries, activities, indicators, and locations. Individual terms carry definitions, relationships, and governance attributes.
A complete glossary term entry includes several components. The definition states precisely what the term means in this organisational context, written to resolve ambiguity rather than to be comprehensive. Synonyms list alternative terms that refer to the same concept, enabling search to find the canonical term regardless of which variant a user enters. Related terms identify concepts that are distinct but connected, helping users navigate to adjacent concepts. The owner identifies the person or role responsible for maintaining the definition and resolving disputes about interpretation. Classification indicates sensitivity level, which flows through to linked data assets. Data quality rules specify constraints that valid instances must satisfy. Linked assets connect the term to physical data elements in catalogued systems, enabling users to find actual data that represents this concept.
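A term entry of this shape is easy to represent as plain data, and the synonym list earns its keep as soon as search resolves variants to the canonical term. The sketch below is illustrative; the structure and the `resolve_term` helper are assumptions, not the model of any specific glossary tool.

```python
# Illustrative glossary term entry as a plain structure, with a lookup that
# resolves synonyms to the canonical term.

GLOSSARY = {
    "Beneficiary": {
        "definition": ("An individual registered to receive direct assistance "
                       "from a programme intervention"),
        "synonyms": ["Participant", "Client", "Programme Recipient"],
        "related_terms": ["Household", "Indirect Beneficiary"],
        "owner": "M&E Director",
        "classification": "Programme-sensitive",
        "linked_assets": ["registration_db.beneficiaries.ben_id"],
    },
}

def resolve_term(query):
    """Return the canonical term for a query that may be a synonym."""
    q = query.strip().lower()
    for term, entry in GLOSSARY.items():
        if q == term.lower() or q in (s.lower() for s in entry["synonyms"]):
            return term
    return None

print(resolve_term("participant"))  # resolves the synonym to "Beneficiary"
```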
Building an initial glossary requires focused effort. An effective approach starts with terms that cause confusion: if programme and finance teams argue about numbers because they define “beneficiary” differently, that term needs a glossary entry. Core entity terms such as beneficiary, household, partner, project, and activity form a foundation. Measurement terms such as indicators, targets, and actuals require precise definitions to ensure consistent reporting. Classification terms used in data governance such as sensitivity levels and data categories need glossary entries that policies can reference.
Glossary governance determines how terms are proposed, approved, and maintained. A typical model designates domain stewards who own definitions within their area of expertise, with a data governance council resolving cross-domain conflicts and approving changes to widely used terms. Change requests follow a workflow: someone proposes a new term or definition change, the relevant steward reviews and approves or requests revision, and approved changes publish with an effective date. Version history preserves previous definitions, essential when historical reports used earlier interpretations.
Search and discovery patterns
The catalogue’s value lies in enabling users to find relevant data without knowing where to look. Effective discovery supports multiple search patterns: keyword search for users who know roughly what they want, faceted navigation for users exploring a domain, and recommendation for users who benefit from seeing related assets.
Keyword search accepts natural language queries and returns ranked results. A programme officer searching for “flood response beneficiary data Bangladesh” should find relevant datasets even if no single asset contains all those terms. The search engine matches query terms against asset names, descriptions, column names, glossary term definitions, and tags. Relevance ranking considers term frequency, field importance (matches in asset name rank higher than matches in column descriptions), and usage signals (frequently accessed assets rank higher for ambiguous queries).
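Field-weighted ranking can be sketched in a few lines. The weights and sample assets below are invented for illustration; production search engines add term-frequency normalisation and usage signals on top of this basic idea.

```python
# Illustrative relevance scoring: matches in the asset name count for more
# than matches in the description or column names. Weights are assumptions.

FIELD_WEIGHTS = {"name": 3.0, "description": 2.0, "columns": 1.0}

def score(asset, query_terms):
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        text = asset[field_name].lower()
        total += weight * sum(text.count(t) for t in query_terms)
    return total

assets = [
    {"name": "flood_response_beneficiaries",
     "description": "Beneficiary registrations, 2024 flood response, Bangladesh",
     "columns": "ben_id hh_size district"},
    {"name": "finance_ledger",
     "description": "General ledger entries",
     "columns": "account amount date"},
]

terms = ["flood", "beneficiar", "bangladesh"]
ranked = sorted(assets, key=lambda a: score(a, terms), reverse=True)
print(ranked[0]["name"])
```

Note the stemmed query term "beneficiar", which matches both "beneficiary" and "beneficiaries"; real engines perform this stemming automatically.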
Faceted navigation enables filtering and exploration. Available facets include system (filter to assets in a specific database or platform), domain (filter to programme data versus finance versus HR), owner (filter to assets owned by a specific team), classification (filter to assets at a specific sensitivity level), and freshness (filter to assets updated within a time period). Combining facets progressively narrows results: programme data, owned by M&E team, updated in last 30 days. Facet counts show how many assets match each option, guiding users toward productive paths.
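Progressive facet filtering and facet counts amount to filtering a record set and counting distinct values. This sketch uses invented sample records; the facet names mirror the ones listed above.

```python
from collections import Counter

# Sketch of progressive facet filtering with counts. Sample records invented.

assets = [
    {"name": "beneficiaries", "domain": "programme", "owner": "M&E",
     "days_since_update": 2},
    {"name": "indicators", "domain": "programme", "owner": "M&E",
     "days_since_update": 45},
    {"name": "ledger", "domain": "finance", "owner": "Finance",
     "days_since_update": 1},
]

def apply_facets(records, **facets):
    """Keep only records matching every selected facet value."""
    return [r for r in records if all(r[k] == v for k, v in facets.items())]

def facet_counts(records, facet):
    """How many records match each value of a facet."""
    return Counter(r[facet] for r in records)

# Programme data, owned by M&E, updated in the last 30 days:
narrowed = apply_facets(assets, domain="programme", owner="M&E")
fresh = [r for r in narrowed if r["days_since_update"] <= 30]
print(facet_counts(assets, "domain"))
print([r["name"] for r in fresh])
```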
Asset profiles present detailed metadata for individual items. A table profile displays technical metadata including schema, columns with types and statistics, keys and relationships, and row counts. Business metadata appears alongside: the description explaining what the table represents, the owner and their contact information, glossary terms linked to columns, and usage guidance. Operational metadata shows freshness indicators, access patterns, and related jobs or pipelines. Sample data provides concrete examples of actual values, helping users verify they have found the right asset. Related assets suggest other data that users examining this asset might also need.
Query patterns optimise for common discovery needs. “What data do we have about X?” searches for X across all metadata fields and returns assets ranked by relevance. “Where does column Y come from?” navigates lineage upstream from the column. “What happens if table Z changes?” navigates lineage downstream to dependent assets. “Who should I contact about this data?” displays ownership information with contact details. “Is this data suitable for external reporting?” checks classification and displays usage restrictions.
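The downstream-impact query ("what happens if table Z changes?") is a graph traversal over lineage edges. The sketch below assumes lineage is stored as a map from each asset to its direct consumers; the asset names are invented.

```python
# Sketch of downstream impact analysis over a lineage graph stored as
# edges from each asset to its direct consumers (graph is illustrative).

DOWNSTREAM = {
    "registration_db.beneficiaries": ["warehouse.dim_beneficiary"],
    "warehouse.dim_beneficiary": ["reports.monthly_summary",
                                  "dashboards.coverage"],
    "reports.monthly_summary": [],
    "dashboards.coverage": [],
}

def affected_by(asset):
    """All assets reachable downstream of `asset` (breadth-first)."""
    seen, queue = set(), list(DOWNSTREAM.get(asset, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.add(current)
            queue.extend(DOWNSTREAM.get(current, []))
    return sorted(seen)

print(affected_by("registration_db.beneficiaries"))
```

The upstream query ("where does column Y come from?") is the same traversal over the reversed edge map.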
Automated metadata harvesting
Manual cataloguing cannot scale. Organisations with dozens of databases, hundreds of spreadsheets, and thousands of files require automation to maintain comprehensive, current metadata. Harvesting automates the extraction of technical metadata, reducing manual effort to enrichment activities that require human judgment.
Harvester configuration specifies what to scan and how frequently. A database harvester connects using provided credentials, queries system catalogues to extract schema structure, optionally samples data to compute profiles, and writes results to the metadata repository. Configuration parameters control behaviour:
# PostgreSQL harvester configuration
harvester:
  type: postgresql
  connection:
    host: programme-db.internal.example.org
    port: 5432
    database: registration
    username: catalogue_reader
    # Password from secrets manager
    password_secret: catalogue/programme-db
  scope:
    schemas:
      include:
        - public
        - programme
      exclude:
        - pg_catalog
        - information_schema
    tables:
      exclude_patterns:
        - '*_backup'
        - '*_archive_*'
  profiling:
    enabled: true
    sample_size: 10000
    compute_statistics: true
    detect_patterns: true
  schedule:
    frequency: daily
    time: "03:00"
    timezone: "UTC"

Profiling configuration determines the depth of automated analysis. Basic profiling extracts schema structure only: tables, columns, types, constraints. Standard profiling adds statistical analysis: row counts, null percentages, distinct value counts, min/max/mean for numeric columns. Deep profiling detects patterns and relationships: inferred data types (columns containing phone numbers, email addresses, dates stored as strings), potential primary keys (columns with unique values), candidate foreign keys (columns whose values match another table’s key).
Incremental harvesting reduces processing time for subsequent runs. Rather than scanning entire databases on each execution, the harvester detects changes since the last run and processes only modified objects. Change detection uses database system views that track modification times, or compares current schema fingerprints against stored versions. An initial full harvest of a 500-table database might require 4 hours; subsequent incremental harvests complete in minutes by processing only the 12 tables that changed.
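Schema-fingerprint comparison can be implemented by hashing a canonical serialisation of each table's column definitions. This is a minimal sketch with invented table structures, not the mechanism of any particular catalogue product.

```python
import hashlib
import json

# Sketch of fingerprint-based change detection: hash each table's column
# definitions and compare against the previous harvest (data invented).

def fingerprint(columns):
    """Stable hash of a table's column definitions."""
    canonical = json.dumps(columns, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

previous = {
    "beneficiaries": fingerprint([{"name": "ben_id", "type": "text"}]),
    "activities": fingerprint([{"name": "act_id", "type": "text"}]),
}
current = {
    "beneficiaries": fingerprint([{"name": "ben_id", "type": "text"},
                                  {"name": "hh_size", "type": "integer"}]),
    "activities": fingerprint([{"name": "act_id", "type": "text"}]),
}

# Only tables whose fingerprint changed need re-harvesting:
changed = [t for t in current if previous.get(t) != current[t]]
print(changed)
```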
File harvesting handles unstructured and semi-structured sources. A file scanner traverses directory structures, identifies file types, and extracts available metadata. For spreadsheets, the scanner detects worksheets, header rows, and column structures. For delimited files, the scanner infers delimiters, encodings, and schemas. For documents, the scanner extracts properties such as author, creation date, and modification history. Configuration specifies paths to scan, file types to process, and exclusion patterns:
# File harvester configuration
harvester:
  type: filesystem
  paths:
    - path: /data/programme-exports
      recursive: true
      file_types:
        - csv
        - xlsx
        - json
      exclude_patterns:
        - '*.tmp'
        - '~*'
    - path: /data/reports
      recursive: false
      file_types:
        - xlsx
  schema_inference:
    sample_rows: 1000
    header_detection: auto
    type_inference: true
  schedule:
    frequency: weekly
    day: sunday
    time: "02:00"

API harvesting extracts metadata from web services and SaaS platforms. Many data collection platforms expose metadata through APIs: KoboToolbox provides project and form structure through its REST API; Salesforce exposes object and field definitions through its metadata API. API harvesters authenticate using provided credentials, retrieve metadata through platform-specific endpoints, and transform responses into the catalogue’s internal format.
Harvesting schedules balance freshness against system load. High-change systems such as operational databases benefit from daily harvesting. Stable reference data might harvest weekly. Large-scale file systems might harvest incrementally, processing changes continuously rather than in scheduled batches. Harvesting runs during low-usage periods to minimise impact on source systems.
Technology options
Data catalogue technology ranges from lightweight tools suitable for single-person IT departments to enterprise platforms designed for large, distributed organisations. Selection criteria include scale (number of assets to catalogue), complexity (variety of source system types), usage (number and sophistication of catalogue users), and operational capacity (ability to deploy and maintain infrastructure).
Apache Atlas is an open-source catalogue developed within the Hadoop ecosystem but applicable to broader data landscapes. Atlas provides a type system for modelling data assets, a graph-based metadata store, REST APIs for integration, and a web interface for search and browse. Atlas excels at lineage tracking, representing data flow through ETL processes. Deployment requires Java runtime and either embedded or external Apache Kafka and Apache Solr instances. Atlas suits organisations with existing Hadoop infrastructure or strong Java operations skills, handling catalogues of 10,000 to 500,000 assets. Configuration manages type definitions:
{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [
    {
      "name": "Sensitive",
      "description": "Contains sensitive data requiring access control",
      "attributeDefs": [
        { "name": "level", "typeName": "string", "isOptional": false }
      ]
    }
  ],
  "entityDefs": [
    {
      "name": "programme_dataset",
      "superTypes": ["DataSet"],
      "attributeDefs": [
        { "name": "programme_code", "typeName": "string", "isOptional": true },
        { "name": "data_steward", "typeName": "string", "isOptional": false }
      ]
    }
  ]
}

DataHub is an open-source metadata platform originally developed at LinkedIn. DataHub provides a modern architecture with a React frontend, GraphQL API, and pluggable backend storage. Pre-built connectors support common databases, data warehouses, BI tools, and orchestrators. DataHub emphasises extensibility, enabling custom metadata models without modifying core code. Deployment uses Docker Compose for development or Kubernetes for production. DataHub suits organisations seeking a modern interface with active development community, handling catalogues from 1,000 to 1,000,000 assets. A minimal deployment starts with:
# Clone DataHub repository
git clone https://github.com/datahub-project/datahub.git
cd datahub/docker/quickstart

# Start services (requires Docker Compose)
./quickstart.sh

# Access UI at http://localhost:9002
# Default credentials: datahub / datahub

Amundsen is an open-source catalogue originally developed at Lyft. Amundsen focuses on search and discovery, providing an intuitive interface for finding data assets. The architecture separates frontend, search, and metadata services, enabling independent scaling. Amundsen integrates with Apache Atlas or AWS Neptune for metadata storage. Deployment complexity sits between Atlas and DataHub. Amundsen suits organisations prioritising search experience over comprehensive governance features.
Alation is a commercial catalogue known for its machine learning-driven automation and collaborative features. Alation automatically discovers and classifies data, suggests descriptions based on usage patterns, and enables inline conversations about data assets. Pricing follows an enterprise model based on data source connections and user counts. Alation offers a nonprofit programme with discounted pricing. Deployment is typically cloud-hosted, reducing operational burden. Alation suits organisations with budget for commercial tools seeking reduced implementation effort.
Collibra is a commercial platform emphasising data governance alongside cataloguing. Collibra provides policy management, workflow automation, and regulatory compliance features beyond basic cataloguing. The platform handles complex governance requirements such as GDPR data subject request tracking. Pricing is enterprise-scale. Collibra offers impact pricing for nonprofit organisations. Collibra suits large organisations with formal governance requirements and dedicated data management staff.
For organisations with limited IT capacity, a pragmatic approach starts with inventory documentation in structured spreadsheets, graduating to DataHub when asset counts exceed what spreadsheets can manage (roughly 500 assets). DataHub’s Docker deployment enables a single IT person to operate a functional catalogue, with the option to migrate to Kubernetes as scale demands.
Catalogue governance
A catalogue without governance degrades into an unreliable repository that users distrust and abandon. Governance establishes who maintains metadata, how updates flow, and what quality standards apply.
Ownership assignment identifies responsibility for each catalogued asset. The data owner has decision authority over the asset: who may access it, what it may be used for, and how it should be classified. The data steward maintains metadata quality: ensuring descriptions are accurate, glossary linkages are correct, and documentation remains current. For many organisations, owners and stewards are the same person; the distinction matters when ownership is executive (a programme director owns beneficiary data) while stewardship is operational (an M&E analyst maintains the catalogue entries).
Update workflows determine how metadata changes flow through approval. Self-service updates enable asset owners and stewards to modify metadata directly, appropriate for routine changes such as updating descriptions or adding tags. Controlled updates require approval before publication, appropriate for changes to glossary definitions or classification levels that affect multiple assets. Automated updates from harvesters typically publish immediately for technical metadata but flag business metadata conflicts for human review.
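The routing described above can be sketched as a dispatch on which metadata field is changing and where the change originates. The field groupings and return values here are illustrative assumptions, not a specific catalogue's API:

```python
# Route a proposed metadata change to self-service publication,
# an approval queue, or human review. Field groupings are illustrative.
SELF_SERVICE_FIELDS = {"description", "tags", "usage_notes"}
APPROVAL_FIELDS = {"glossary_term", "classification"}

def route_update(field: str, source: str = "manual") -> str:
    """Return how a proposed metadata update should be handled."""
    if source == "harvester":
        # Harvested technical metadata publishes immediately;
        # harvested business metadata conflicts are flagged for review.
        return "review" if field in APPROVAL_FIELDS else "publish"
    if field in SELF_SERVICE_FIELDS:
        return "publish"          # owners and stewards edit directly
    if field in APPROVAL_FIELDS:
        return "approval_queue"   # affects multiple assets; needs sign-off
    return "review"               # unknown field: err on the side of review
```

For example, `route_update("description")` publishes directly, while `route_update("classification")` lands in the approval queue.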
Quality standards define what constitutes acceptable catalogue entries. Minimum standards require that every catalogued asset has a description of at least 50 characters, an assigned owner, a sensitivity classification, and current technical metadata no more than 7 days old. Higher standards apply to critical assets: complete glossary term linkages for all columns containing business data, documented usage guidance, and verified accuracy sign-off from the owner within the past 90 days.
Review cycles prompt periodic validation. Quarterly reviews ask stewards to verify their assets’ metadata remains accurate. Annual reviews confirm ownership assignments remain valid, particularly important given staff turnover in mission-driven organisations. Stale content detection flags assets with metadata unchanged for extended periods, prompting verification that the data itself still exists and the documentation still applies.
Metrics track catalogue health. Coverage measures what percentage of known data assets appear in the catalogue with minimum documentation. Freshness measures how current harvested metadata is across the catalogue. Engagement measures how frequently users search, browse, and access asset details. Documentation quality measures completeness of business metadata: what percentage of assets have descriptions, owners, and glossary linkages.
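Two of these metrics can be computed directly from the catalogue's own contents. A sketch, assuming assets are exported as dictionaries with the hypothetical keys shown:

```python
# Compute coverage and glossary-linkage metrics from an asset export.
# Keys are illustrative; adapt to your catalogue's export format.
def catalogue_metrics(known_assets: int, catalogued: list[dict]) -> dict:
    documented = [a for a in catalogued
                  if a.get("description") and a.get("owner")]
    linked = [a for a in catalogued if a.get("glossary_terms")]
    return {
        # share of all known assets catalogued with minimum documentation
        "coverage_pct": round(100 * len(documented) / known_assets, 1),
        # share of catalogued assets linked to glossary terms
        "glossary_linkage_pct": round(100 * len(linked) / len(catalogued), 1),
    }
```

Freshness and engagement typically come from harvester timestamps and catalogue access logs rather than the asset export itself.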
Implementation considerations
For organisations with limited IT capacity
Start with a scope that delivers value without overwhelming available capacity. Catalogue the 10 to 20 most important data sources first: the programme database, the primary data collection platform, the finance system, and shared file storage containing critical documents. Use DataHub’s Docker Compose deployment, which a single IT person can install in a day and operate with a few hours per month.
Prioritise technical metadata automation over business metadata completeness. Getting all assets into the catalogue with accurate schemas matters more than perfecting descriptions. Programme staff can add business context gradually as they use the catalogue, distributing documentation effort across the organisation.
Integrate catalogue links into existing workflows. When someone asks “where can I find beneficiary data?”, answer by sharing a catalogue link rather than database credentials. When creating reports, reference catalogue assets in documentation. This integration drives adoption without requiring a separate launch initiative.
Minimal viable catalogue:
- DataHub on Docker Compose (4GB RAM server)
- PostgreSQL connector for primary programme database
- File scanner for shared storage
- 3 to 5 glossary terms for core entities
- Implementation time: 2 to 3 days; ongoing: 4 to 8 hours monthly
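The PostgreSQL connector in the list above is configured through a DataHub ingestion recipe. A minimal sketch, assuming a local quickstart deployment; the host, database name, and credentials are placeholders to replace with your own:

```yaml
# recipe.yml — run with: datahub ingest -c recipe.yml
source:
  type: postgres
  config:
    host_port: "localhost:5432"       # placeholder
    database: programme_db            # placeholder
    username: catalogue_reader        # a read-only account is recommended
    password: "${POSTGRES_PASSWORD}"  # injected from the environment
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"   # DataHub GMS endpoint
```

Scheduling this command (for example via cron) keeps the harvested technical metadata within the freshness standard without manual effort.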
For organisations with established data functions
Comprehensive cataloguing across all data sources requires investment in connector configuration, steward capacity, and governance processes. Deploy DataHub on Kubernetes for production reliability, with multiple connector instances scanning dozens of sources on scheduled cycles.
Establish stewardship as an explicit responsibility. Each domain needs an identified steward with allocated time for catalogue maintenance. Quarterly steward meetings review coverage gaps, documentation quality, and usage patterns. Annual certification cycles verify all critical assets have current, accurate metadata.
Integrate the catalogue with other data management tools. Feed lineage from orchestration tools such as Airflow into the catalogue. Connect quality monitoring to display current quality scores alongside asset metadata. Enable single sign-on so catalogue access respects existing identity management. Embed catalogue search in the intranet or collaboration platform so users discover it naturally.
Full catalogue implementation:
- DataHub on Kubernetes (3-node cluster)
- 15 to 30 source connectors
- Business glossary with 100+ terms
- Designated stewards for each domain
- Implementation time: 3 to 6 months; ongoing: 0.5 to 1 FTE equivalent across stewards
For federated organisations
When country offices or programmes operate autonomously with their own systems, cataloguing must balance local control with global visibility. A federated model maintains local catalogues at the country or programme level while aggregating selected metadata to a global catalogue for cross-organisation discovery.
Local catalogues contain detailed metadata for local systems, maintained by local staff who understand the context. The global catalogue contains summary metadata that enables someone at headquarters to discover that the Kenya programme has beneficiary registration data without exposing every column definition. Federation queries route from global to local when users need detail beyond what the summary provides.
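Federation routing can be sketched as a lookup that answers summary queries from the global catalogue and delegates detail requests to the owning local catalogue. The registry structures here are a hypothetical illustration, not a real federation API:

```python
# Global catalogue holds summaries; detail requests route to local catalogues.
GLOBAL_SUMMARIES = {
    "kenya.beneficiary_registrations": {
        "owner_office": "kenya",
        "summary": "Beneficiary registration data for the Kenya programme.",
    },
}

LOCAL_CATALOGUES = {
    "kenya": {
        "kenya.beneficiary_registrations": {
            "columns": ["beneficiary_id", "registration_date", "location"],
        },
    },
}

def lookup(asset: str, detail: bool = False) -> dict:
    """Serve summaries globally; route detail requests to the local level."""
    entry = GLOBAL_SUMMARIES[asset]
    if not detail:
        return entry  # summary answered from the global catalogue
    office = entry["owner_office"]
    # In a real deployment this would be an API call to the local catalogue.
    return LOCAL_CATALOGUES[office][asset]
```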
Data sharing agreements govern what metadata flows between levels. Sensitive programmes might share only that a dataset exists without revealing structure or content. Standard programmes share schema and descriptions. Open data initiatives share comprehensive metadata including sample values.