Data Catalogue and Governance

Data catalogue and governance platforms provide centralised metadata management, enabling organisations to discover, understand, and trust their data assets. These platforms maintain inventories of data assets across databases, data lakes, warehouses, and applications while tracking relationships, lineage, ownership, and quality. The category encompasses metadata ingestion, search and discovery, business glossary management, data lineage visualisation, and governance workflow automation.

This page covers platforms where metadata management and governance are the primary function. Adjacent tools with overlapping capabilities exist: data quality platforms (covered in Data Quality Tools), data integration platforms with built-in cataloguing, and database-native metadata features. The platforms assessed here provide standalone or primary-purpose cataloguing with governance capabilities that extend across heterogeneous data environments.

Assessment methodology

Tool assessments are based on official vendor documentation, published API references, release notes, and technical specifications as of 2026-01-25. Feature availability varies by product tier, deployment model, and region. Verify current capabilities directly with vendors during procurement. Community-reported information is excluded; only documented features are assessed.

Requirements taxonomy

This taxonomy defines evaluation criteria for data catalogue and governance platforms. Requirements are organised by functional area and weighted by typical priority for mission-driven organisations operating across multiple data systems.

Functional requirements

Core capabilities defining what the platform must do to support metadata management and governance.

Metadata ingestion and connectors

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F1.1 | Database connector breadth | Pre-built connectors for relational databases including PostgreSQL, MySQL, SQL Server, Oracle, and cloud-native databases such as Snowflake, BigQuery, Redshift, and Databricks | Full: 30+ native connectors covering major databases. Partial: 15-29 connectors or gaps in common systems. Minimal: under 15 connectors. | Review connector documentation; verify specific systems required | Essential
F1.2 | BI and reporting tool connectors | Ingestion from business intelligence platforms including Tableau, Power BI, Looker, Metabase, and Superset | Full: native connectors for 5+ major BI tools with dashboard and report metadata. Partial: 2-4 tools. Minimal: single tool or none. | Review BI connector documentation; check metadata depth captured | Important
F1.3 | Data pipeline tool connectors | Integration with orchestration and ETL tools including Airflow, dbt, Fivetran, and Spark | Full: native connectors capturing pipeline metadata and lineage. Partial: limited pipeline coverage. Minimal: manual entry only. | Review pipeline connector documentation; verify lineage capture | Important
F1.4 | Custom connector framework | SDK or framework for building connectors to unsupported systems | Full: documented SDK with examples, community connector repository. Partial: API-only ingestion. None: no extensibility mechanism. | Review developer documentation; check connector development guides | Important
F1.5 | Incremental metadata ingestion | Ability to ingest only changed metadata rather than full scans | Full: change detection for all connector types, configurable schedules. Partial: incremental for some connectors. None: full scan only. | Review ingestion documentation; check scheduling options | Important
F1.6 | Schema change detection | Automatic detection and notification of schema changes in connected sources (a change-detection sketch follows this table) | Full: automated detection, change history, notifications. Partial: detection without alerting. None: manual discovery only. | Review change detection documentation | Desirable
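
Incremental ingestion (F1.5) and schema change detection (F1.6) typically rest on the same mechanism: fingerprint each source object's metadata and re-process only objects whose fingerprint has changed since the last scan. A minimal, platform-neutral sketch of the idea (not any vendor's implementation):

```python
import hashlib
import json

def schema_fingerprint(columns: list[dict]) -> str:
    """Hash a table's column definitions so changes are cheap to detect."""
    canonical = json.dumps(sorted(columns, key=lambda c: c["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_changes(previous: dict[str, str], current: dict[str, list[dict]]) -> dict[str, str]:
    """Return only the tables whose fingerprints differ from the last scan."""
    changed = {}
    for table, columns in current.items():
        digest = schema_fingerprint(columns)
        if previous.get(table) != digest:
            changed[table] = digest
    return changed

# Example: one column type widened, one table newly appeared.
previous = {"public.users": schema_fingerprint([{"name": "id", "type": "int"}])}
current = {
    "public.users": [{"name": "id", "type": "bigint"}],  # changed -> re-ingest
    "public.orders": [{"name": "id", "type": "int"}],    # new -> ingest
}
print(detect_changes(previous, current))
```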

Search and discovery

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F2.1 | Full-text search | Search across asset names, descriptions, column names, tags, and documentation | Full: relevance ranking, type filtering, faceted search, search suggestions. Partial: basic keyword matching. Minimal: exact match only. | Test search functionality in trial; review search documentation | Essential
F2.2 | Faceted filtering | Ability to filter search results by asset type, owner, domain, tags, certification status | Full: 8+ filter dimensions, combinable filters, saved filter sets. Partial: 4-7 dimensions. Minimal: under 4 dimensions. | Review search interface documentation; test filtering capabilities | Essential
F2.3 | Asset popularity and usage signals | Display of asset usage patterns such as query frequency, user access counts, and downstream dependencies | Full: usage metrics visible in search ranking and asset pages. Partial: limited metrics. None: no usage signals. | Review usage analytics documentation | Important
F2.4 | Saved searches and collections | Ability to save search queries and curate asset collections for reuse | Full: personal and shared collections, scheduled search alerts. Partial: personal saves only. None: no persistence. | Review collection and bookmark documentation | Desirable
F2.5 | Natural language search | Support for conversational queries beyond keyword matching | Full: NLP processing with intent recognition. Partial: basic synonym handling. None: keyword only. | Review AI/ML search documentation; test with natural queries | Desirable
F2.6 | Cross-asset search | Unified search across tables, columns, dashboards, pipelines, and glossary terms | Full: single search box with type-specific results. Partial: separate searches per type. None: siloed search. | Review search scope documentation | Essential

Data lineage

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F3.1 | Column-level lineage | Tracking of data flow at column granularity, not just table level | Full: automatic column-level extraction for supported sources. Partial: manual column mapping. None: table level only. | Review lineage documentation; check supported sources | Essential
F3.2 | Cross-system lineage | Lineage tracking across different data systems (e.g., database to warehouse to BI) | Full: automatic stitching across heterogeneous systems. Partial: manual linking required. None: single-system only. | Review multi-system lineage documentation | Essential
F3.3 | Lineage visualisation | Interactive graph visualisation of upstream and downstream dependencies | Full: expandable graph with filtering, impact analysis highlighting, export options. Partial: static diagrams. None: list view only. | Review lineage UI documentation; test visualisation | Important
F3.4 | Manual lineage editing | Ability to manually define or correct lineage relationships | Full: UI and API for manual lineage, version tracking. Partial: UI only. None: no manual editing. | Review lineage editing documentation | Important
F3.5 | SQL parsing for lineage | Automatic extraction of lineage from SQL queries and transformations | Full: parsing of complex SQL including CTEs, subqueries, unions. Partial: simple query parsing. None: no SQL parsing. | Review SQL lineage documentation; check dialect support | Important
F3.6 | dbt model lineage | Native integration with dbt for model dependencies and documentation | Full: automatic dbt manifest ingestion, model-level lineage. Partial: manual import. None: no dbt support. | Review dbt integration documentation | Context-dependent

Business glossary and terminology

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F4.1 | Glossary term management | Creation and maintenance of business term definitions with approval workflows | Full: hierarchical terms, versioning, approval workflows, term relationships. Partial: flat term list. Minimal: no glossary feature. | Review glossary documentation; check term management features | Essential
F4.2 | Term-to-asset linking | Association of glossary terms with data assets (tables, columns, reports) | Full: bulk linking, automatic suggestions, bidirectional navigation. Partial: manual one-by-one linking. None: no linking. | Review term linking documentation | Essential
F4.3 | Glossary import and export | Bulk import and export of glossary content | Full: multiple formats (CSV, Excel, JSON), relationship preservation on import. Partial: single format. None: manual entry only. | Review import/export documentation; check format support | Important
F4.4 | Controlled vocabulary enforcement | Ability to restrict tagging and annotation to approved glossary terms | Full: validation against glossary, restricted free-text. Partial: suggestions only. None: no enforcement. | Review tagging policy documentation | Desirable
F4.5 | Multi-domain glossaries | Support for separate glossaries per business domain with cross-references | Full: domain-specific glossaries, cross-domain linking, domain-based permissions. Partial: single glossary with domain tags. None: single flat glossary. | Review domain and glossary documentation | Important

Data classification and tagging

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F5.1 | Tag management | Hierarchical tagging system with controlled vocabularies | Full: nested tags, tag governance, usage tracking. Partial: flat tags. Minimal: no tagging. | Review tag management documentation | Essential
F5.2 | Automated data classification | Automatic detection of sensitive data types (PII, financial, health) | Full: pattern-based and ML classification, custom classifiers, confidence scoring. Partial: basic pattern matching. None: manual only. | Review classification documentation; check classifier types | Essential
F5.3 | Classification propagation | Automatic propagation of classifications through lineage | Full: downstream propagation with override controls. Partial: suggestions only. None: no propagation. | Review propagation documentation | Important
F5.4 | Custom classification rules | Ability to define organisation-specific classification patterns | Full: regex, dictionary, and ML-based custom rules. Partial: limited customisation. None: built-in only. | Review custom classification documentation | Important
F5.5 | Data domain assignment | Organisation of assets into business domains | Full: hierarchical domains, domain owners, cross-domain relationships. Partial: single-level domains. None: no domain concept. | Review domain documentation | Important

Data quality integration

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F6.1 | Native data quality checks | Built-in data quality rule definition and execution (a sketch of these checks follows this table) | Full: completeness, uniqueness, validity checks with scheduling. Partial: limited check types. None: no native quality. | Review data quality documentation | Important
F6.2 | External quality tool integration | Ingestion of quality scores from tools like Great Expectations or Monte Carlo | Full: native integrations with score display. Partial: API-based ingestion. None: no integration. | Review quality integration documentation | Important
F6.3 | Quality score visibility | Display of data quality scores alongside asset metadata | Full: quality metrics on asset pages, quality-based filtering. Partial: separate quality view. None: no visibility. | Review quality display documentation | Important
F6.4 | Quality alerting | Notifications when quality thresholds are breached | Full: configurable thresholds, multiple notification channels. Partial: fixed thresholds. None: no alerting. | Review alerting documentation | Desirable
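
The check types named in F6.1 have precise definitions even though rule engines differ between platforms. A platform-neutral sketch of two of them, assuming non-empty column samples:

```python
from dataclasses import dataclass

@dataclass
class QualityResult:
    check: str
    passed: bool
    observed: float

def completeness(values: list, threshold: float = 0.95) -> QualityResult:
    """Share of non-null values must meet the threshold."""
    ratio = sum(v is not None for v in values) / len(values)
    return QualityResult("completeness", ratio >= threshold, ratio)

def uniqueness(values: list) -> QualityResult:
    """Every non-null value must appear exactly once."""
    non_null = [v for v in values if v is not None]
    ratio = len(set(non_null)) / len(non_null)
    return QualityResult("uniqueness", ratio == 1.0, ratio)

column = ["a", "b", "b", None]
print(completeness(column))  # observed 0.75 -> fails the 0.95 default
print(uniqueness(column))    # observed 0.67 -> fails
```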

Collaboration and documentation

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F7.1 | Asset documentation | Rich-text documentation attached to assets with formatting and links | Full: Markdown support, embedded images, version history. Partial: plain text only. None: no documentation field. | Review documentation features | Essential
F7.2 | Ownership and stewardship | Assignment of owners and stewards to assets with contact visibility | Full: multiple owner types, ownership inheritance, accountability tracking. Partial: single owner field. None: no ownership. | Review ownership documentation | Essential
F7.3 | Commenting and discussion | Threaded discussions on assets for questions and clarifications | Full: threaded comments, mentions, notifications, resolution tracking. Partial: flat comments. None: no comments. | Review collaboration documentation | Important
F7.4 | Request and feedback workflows | Workflows for requesting access, asking questions, or suggesting edits | Full: configurable request types, routing, SLA tracking. Partial: basic request forms. None: no workflows. | Review request workflow documentation | Desirable
F7.5 | Announcements and news | Ability to publish data-related announcements to users | Full: targeted announcements, acknowledgment tracking. Partial: global announcements. None: no announcement feature. | Review announcement documentation | Desirable

Technical requirements

Infrastructure, architecture, and deployment considerations.

Deployment and hosting

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
T1.1 | Self-hosted deployment | Ability to deploy on organisation-controlled infrastructure | Full: complete feature parity, documented deployment, ongoing support for self-hosted. Partial: available but with feature gaps. None: SaaS only. | Review deployment documentation; compare feature matrices | Important
T1.2 | Container deployment | Official Docker images and Kubernetes Helm charts | Full: maintained official images, Helm charts, documented orchestration. Partial: community images. None: no container support. | Check Docker Hub and artifact registries; review Helm chart documentation | Important
T1.3 | Cloud-agnostic deployment | Ability to deploy on AWS, Azure, GCP, or on-premises equivalently | Full: documented deployment for 3+ clouds and on-premises. Partial: single cloud focus. None: vendor-locked. | Review multi-cloud deployment documentation | Important
T1.4 | High availability architecture | Documented HA deployment eliminating single points of failure | Full: HA architecture documentation, automatic failover, tested recovery. Partial: manual failover. None: single instance only. | Review HA documentation; check clustering support | Context-dependent
T1.5 | Managed service option | Vendor-operated SaaS deployment reducing operational overhead | Full: fully managed with SLA, regional options, data residency controls. Partial: shared infrastructure. None: self-hosted only. | Review SaaS documentation; check regional availability | Context-dependent

Scalability and performance

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
T2.1 | Metadata volume capacity | Documented capacity for number of assets, columns, relationships | Full: published limits with sizing guidance (millions of assets). Partial: general capacity claims. None: undocumented. | Review sizing documentation; check published limits | Important
T2.2 | Search performance | Search response times at scale | Full: documented query latency targets, index optimisation guidance. Partial: general performance claims. None: no performance data. | Review performance documentation; test at scale if possible | Important
T2.3 | Ingestion throughput | Rate of metadata ingestion supported | Full: documented throughput limits, parallel ingestion support. Partial: serial ingestion. None: undocumented. | Review ingestion performance documentation | Important
T2.4 | Horizontal scaling | Ability to scale by adding nodes | Full: documented horizontal scaling for all components. Partial: selective scaling. None: vertical only. | Review scaling architecture documentation | Context-dependent

Integration architecture

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
T3.1 | REST API completeness | Comprehensive API covering all platform functions | Full: 90%+ feature coverage via API, versioned, documented. Partial: limited API coverage. Minimal: no API or undocumented. | Review API documentation; compare to UI capabilities | Essential
T3.2 | GraphQL API | GraphQL endpoint for flexible metadata queries | Full: complete GraphQL schema, documented queries. Partial: limited GraphQL. None: REST only. | Review GraphQL documentation | Desirable
T3.3 | Python SDK | Official Python SDK for programmatic access | Full: maintained SDK, pip installable, comprehensive examples. Partial: basic SDK. None: raw API only. | Review SDK documentation; check PyPI package | Important
T3.4 | Event streaming | Publication of metadata change events for downstream consumption | Full: Kafka or equivalent streaming, documented event schema. Partial: polling-based. None: no event streaming. | Review event streaming documentation | Important
T3.5 | Webhook support | Configurable webhooks for event notifications (a receiver sketch follows this table) | Full: event-specific webhooks, retry logic, payload customisation. Partial: limited events. None: no webhooks. | Review webhook documentation | Important
T3.6 | OpenMetadata standards | Compliance with OpenMetadata or similar open standards | Full: native OpenMetadata API compliance. Partial: export compatibility. None: proprietary only. | Review standards compliance documentation | Desirable
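
Webhook support (T3.5) means the platform POSTs event payloads to an endpoint your team operates. A minimal receiver sketch; the header name, signing scheme, and event shape are illustrative assumptions, since each platform defines its own:

```python
import hashlib
import hmac
from flask import Flask, abort, request

app = Flask(__name__)
SHARED_SECRET = b"replace-me"  # agreed with the catalogue platform

@app.post("/metadata-events")
def metadata_event():
    # Verify an HMAC signature before trusting the payload; the header name
    # and digest scheme here are placeholders, not any vendor's contract.
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(SHARED_SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)
    event = request.get_json()
    # Route on event type, e.g. trigger a downstream sync on schema changes.
    if event.get("eventType") == "schemaChanged":
        print("schema change for", event.get("entityId"))
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```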

Security requirements

Security controls and data protection capabilities.

Authentication and access control

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
S1.1 | SSO integration | Single sign-on via SAML 2.0 or OIDC | Full: SAML and OIDC support, multiple IdP support. Partial: single protocol. None: local auth only. | Review SSO documentation; check protocol support | Essential
S1.2 | Role-based access control | Granular permissions based on user roles | Full: custom roles, asset-level permissions, policy inheritance. Partial: fixed role set. Minimal: admin/user only. | Review RBAC documentation; check permission granularity | Essential
S1.3 | Attribute-based access control | Access decisions based on asset attributes such as domain, classification, and tags (a toy policy evaluator follows this table) | Full: policy engine with attribute conditions. Partial: limited attributes. None: role-only. | Review ABAC documentation | Important
S1.4 | Row and column-level security | Restriction of metadata visibility by data sensitivity | Full: column masking, row filtering based on user attributes. Partial: asset-level only. None: full visibility. | Review fine-grained access documentation | Context-dependent
S1.5 | API authentication | Secure API access methods | Full: OAuth 2.0, API keys, service accounts with rotation. Partial: single method. None: unauthenticated. | Review API authentication documentation | Essential
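
Attribute-based access control (S1.3) grants access by evaluating policies over asset attributes rather than static role grants. A toy policy evaluator, not modelled on any assessed platform's engine:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    domain: str
    classification: str
    tags: set = field(default_factory=set)

@dataclass
class User:
    domains: set
    clearance: int

# Rank classifications so clearance comparisons are ordinal.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def can_view(user: User, asset: Asset) -> bool:
    """Attribute-based rule: domain membership AND sufficient clearance."""
    return asset.domain in user.domains and user.clearance >= LEVELS[asset.classification]

analyst = User(domains={"finance"}, clearance=LEVELS["internal"])
print(can_view(analyst, Asset("finance", "internal")))      # True
print(can_view(analyst, Asset("finance", "confidential")))  # False: clearance too low
print(can_view(analyst, Asset("hr", "public")))             # False: wrong domain
```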

Data protection

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
S2.1 | Encryption at rest | Encryption of stored metadata | Full: AES-256 or equivalent, customer-managed keys option. Partial: vendor-managed keys only. None: unencrypted. | Review encryption documentation; check key management | Essential
S2.2 | Encryption in transit | TLS for all network communications | Full: TLS 1.2+ enforced, certificate management. Partial: optional TLS. None: unencrypted allowed. | Review transport security documentation | Essential
S2.3 | Audit logging | Comprehensive logging of user actions and system events | Full: tamper-evident logs, configurable retention, export capability. Partial: limited logging. None: no audit trail. | Review audit logging documentation | Essential
S2.4 | Data masking in samples | Masking of sensitive data in sample previews | Full: configurable masking rules, automatic PII detection. Partial: all-or-nothing masking. None: no masking. | Review sample data documentation | Important
S2.5 | Data residency controls | Control over geographic location of stored data | Full: regional deployment options, documented data flows. Partial: limited regions. None: single region, undisclosed. | Review data residency documentation | Context-dependent

Compliance and certification

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
S3.1 | SOC 2 certification | SOC 2 Type II compliance for SaaS deployments | Full: current SOC 2 Type II report available. Partial: Type I only. None: no SOC 2. | Request SOC 2 report; verify currency | Important
S3.2 | GDPR compliance features | Features supporting GDPR compliance (data subject rights, consent tracking) | Full: documented GDPR features, DPA available. Partial: basic privacy features. None: no GDPR support. | Review GDPR documentation; request DPA | Essential
S3.3 | ISO 27001 certification | Information security management certification | Full: current ISO 27001 certificate. Partial: in progress. None: no certification. | Request certificate; verify currency | Desirable
S3.4 | HIPAA compliance | Compliance features for healthcare data | Full: BAA available, HIPAA-specific documentation. Partial: general controls. None: no HIPAA support. | Review HIPAA documentation; request BAA | Context-dependent

Operational requirements

Administration, maintenance, and support capabilities.

Administration and configuration

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
O1.1 | Web-based administration | Browser-based interface for system configuration | Full: complete admin UI, no command-line required. Partial: some tasks require CLI. None: CLI only. | Review admin documentation; test admin interface | Important
O1.2 | Configuration as code | Ability to manage configuration through version-controlled files | Full: full config in YAML/JSON, GitOps compatible. Partial: partial config export. None: UI only. | Review configuration documentation | Desirable
O1.3 | Multi-tenancy | Support for separate tenants within single deployment | Full: tenant isolation, per-tenant configuration. Partial: logical separation only. None: single tenant. | Review multi-tenancy documentation | Context-dependent
O1.4 | Bulk operations | Administrative bulk actions (user management, asset operations) | Full: bulk via UI and API, import/export. Partial: API bulk only. None: individual operations only. | Review bulk operation documentation | Important

Monitoring and observability

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
O2.1 | Health monitoring | System health dashboards and endpoints | Full: health endpoints, component status, performance metrics. Partial: basic health check. None: no monitoring. | Review monitoring documentation | Important
O2.2 | Metrics export | Export of platform metrics to monitoring systems | Full: Prometheus, DataDog, or equivalent integration. Partial: custom metrics only. None: no export. | Review metrics documentation; check integrations | Desirable
O2.3 | Alerting integration | Integration with alerting systems for operational issues | Full: native alerting plus PagerDuty, Slack, email. Partial: email only. None: no alerting. | Review alerting documentation | Desirable

Backup and recovery

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
O3.1 | Backup procedures | Documented backup processes for metadata store | Full: automated backup, point-in-time recovery, documented procedures. Partial: manual backup. None: no documentation. | Review backup documentation | Essential
O3.2 | Disaster recovery | Documented DR procedures with RTO/RPO targets | Full: DR runbooks, tested recovery, documented targets. Partial: basic DR guidance. None: no DR documentation. | Review DR documentation | Important
O3.3 | Data export | Full export of all metadata for migration or backup | Full: complete export in standard formats. Partial: partial export. None: no export capability. | Review export documentation; test export completeness | Essential

Commercial requirements

Pricing, licensing, and vendor considerations.

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
C1.1 | Transparent pricing | Published pricing or pricing model transparency | Full: public pricing, calculator available. Partial: pricing on request. None: undisclosed. | Review pricing page; request quote | Important
C1.2 | Nonprofit discount | Reduced pricing for registered nonprofits | Full: documented nonprofit programme with significant discount. Partial: case-by-case. None: standard pricing only. | Review nonprofit programme documentation | Important
C1.3 | Free tier or open source | Availability of no-cost option for evaluation or small deployments | Full: feature-complete open source or unlimited free tier. Partial: limited free tier. None: paid only. | Review licensing and free tier documentation | Important
C1.4 | Contract flexibility | Flexible contract terms (monthly, annual, multi-year) | Full: multiple term options without penalties. Partial: annual minimum. None: multi-year lock-in. | Review contract documentation; request terms | Desirable
C1.5 | Data portability | Ability to export all data if leaving the platform | Full: complete export, documented migration paths. Partial: limited export. None: vendor lock-in. | Review export and migration documentation | Essential

Comparison matrices

Comparison matrices use the following rating scale:

Symbol | Meaning
● | Full support as documented
◐ | Partial support with limitations (see notes)
○ | Minimal or basic support
✕ | Not supported
- | Not applicable
$ | Requires paid tier
E | Enterprise edition only
β | Beta or preview feature

Tool overview

Attribute | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Type | FOSS | FOSS | FOSS | FOSS | Commercial | Commercial
Licence | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary
Current version | 1.11.4 | 1.3.0 | 4.3.0 | 2.4.0 | 2025.08 | Continuous
First release | 2021 | 2020 | 2019 | 2015 | 2008 | 2020
Primary maintainer | Collate (commercial) | Acryl Data (commercial) | LF AI Foundation | Apache Foundation | Collibra Inc. | Microsoft
Managed service | Collate Cloud | DataHub Cloud | None | None | Collibra Cloud | Azure Purview
Deployment model | Self-hosted, SaaS | Self-hosted, SaaS | Self-hosted | Self-hosted | SaaS, self-hosted | SaaS

Functional capability matrix

Metadata ingestion

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Database connectors | ● (70+) | ● (60+) | ◐ (20+) | ◐ (15+) | ● (100+) | ● (90+)
BI tool connectors | | | | | |
Pipeline connectors | | | | | |
Custom connector SDK | | | | | |
Incremental ingestion | | | | | |
Schema change detection | | | | | |

Assessment notes:

  • OpenMetadata provides the broadest FOSS connector library with 70+ connectors documented in version 1.11.
  • DataHub’s connector count is comparable, with strong coverage across cloud warehouses and BI tools.
  • Amundsen connectors require more configuration; the “databuilder” library provides extraction but with less turnkey setup.
  • Apache Atlas connectors are primarily Hadoop-ecosystem focused (Hive, HBase, Kafka) with limited cloud coverage.
  • Commercial platforms offer the widest connector ranges but include proprietary systems less relevant to many mission-driven organisations.

Search and discovery

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Full-text search | | | | | |
Faceted filtering | | | | | |
Usage popularity signals | | | | | |
Saved searches | | | | | |
Natural language search | ●$ | ●$ | | | |
Cross-asset search | | | | | |

Assessment notes:

  • All platforms provide competent search; differentiation is in advanced features.
  • Natural language search requires AI features available in commercial tiers of OpenMetadata (Collate) and DataHub (Acryl).
  • Amundsen pioneered PageRank-style popularity ranking in FOSS catalogues.
  • Apache Atlas search is functional but the UI is dated compared to modern alternatives.

Data lineage

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Column-level lineage | | | | | |
Cross-system lineage | | | | | |
Lineage visualisation | | | | | |
Manual lineage editing | | | | | |
SQL parsing | | | | | |
dbt integration | | | | | |

Assessment notes:

  • OpenMetadata and DataHub provide comparable column-level lineage with automatic SQL parsing (a parsing sketch follows these notes).
  • Amundsen lineage is table-level by default; column-level requires custom implementation.
  • Apache Atlas lineage works well within Hadoop ecosystem but cross-system stitching requires manual effort.
  • Commercial platforms offer more sophisticated lineage impact analysis and business lineage layers.
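
To illustrate what SQL parsing for lineage (F3.5) involves, the sketch below uses the open source sqlglot parser to pull source tables out of a query containing a CTE. None of the assessed platforms necessarily uses sqlglot, and column-level lineage requires substantially deeper analysis than this table-level pass:

```python
# pip install sqlglot -- used here purely to illustrate the technique.
import sqlglot
from sqlglot import exp

sql = """
WITH recent AS (
    SELECT user_id, amount FROM raw.orders WHERE created_at > '2025-01-01'
)
SELECT u.region, SUM(r.amount) AS total
FROM recent r
JOIN raw.users u ON u.id = r.user_id
GROUP BY u.region
"""

tree = sqlglot.parse_one(sql)
# Physical source tables carry a schema qualifier; the CTE alias does not.
sources = sorted({f"{t.db}.{t.name}" for t in tree.find_all(exp.Table) if t.db})
print(sources)  # ['raw.orders', 'raw.users']
```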

Business glossary

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Glossary term management | | | | | |
Term-to-asset linking | | | | | |
Glossary import/export | | | | | |
Vocabulary enforcement | | | | | |
Multi-domain glossaries | | | | | |
Approval workflows | | ●$ | | | |

Assessment notes:

  • OpenMetadata’s glossary module is comprehensive with approval workflows in open source.
  • DataHub’s glossary workflows are available in open source but advanced governance in DataHub Cloud.
  • Amundsen lacks native glossary functionality; requires external glossary integration.
  • Collibra’s business glossary is industry-leading with extensive workflow capabilities.

Data classification

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Tag management | | | | | |
Automated classification | | ●$ | | | |
Classification propagation | | | | | |
Custom classification rules | | | | | |
Data domain organisation | | | | | |

Assessment notes:

  • OpenMetadata includes PII detection in the open source version.
  • DataHub’s advanced classification requires the paid cloud tier.
  • Amundsen supports tags but lacks automated classification capabilities.
  • Microsoft Purview’s classification integrates with Microsoft Information Protection labels.
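
The pattern-based tier of automated classification (F5.2, F5.4) amounts to running regular expressions or dictionaries over sampled values and labelling columns that exceed a hit-rate threshold. An illustrative sketch; real classifiers add ML scoring and per-label confidence:

```python
import re

# Illustrative patterns only; production classifiers combine patterns,
# dictionaries, and ML models with confidence thresholds.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def classify_column(samples: list[str], min_hit_rate: float = 0.8) -> list[str]:
    """Label a column with every PII type matching most sampled values."""
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.search(s)) for s in samples)
        if samples and hits / len(samples) >= min_hit_rate:
            labels.append(label)
    return labels

print(classify_column(["ana@example.org", "bo@example.org"]))  # ['email']
```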

Technical capability matrix

Deployment options

Option | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Docker | | | | | - | -
Kubernetes Helm | | | | | | -
Self-hosted | | | | | ●E |
Managed SaaS | ● (Collate) | ● (Acryl) | | | |
Air-gapped | | | | | ●E |

Assessment notes:

  • FOSS platforms provide full self-hosted deployment flexibility.
  • Collibra self-hosted requires enterprise licensing and is typically hybrid with cloud components.
  • Microsoft Purview is Azure-native with no self-hosted option.

Infrastructure requirements (self-hosted)

Component | OpenMetadata | DataHub | Amundsen | Apache Atlas
Metadata store | MySQL or PostgreSQL | MySQL or PostgreSQL | PostgreSQL or Neo4j | HBase or JanusGraph
Search engine | Elasticsearch or OpenSearch | Elasticsearch | Elasticsearch | Solr
Message queue | - | Kafka | - | Kafka
Minimum RAM | 8 GB | 16 GB | 8 GB | 16 GB
Minimum CPU | 4 cores | 4 cores | 4 cores | 4 cores

Assessment notes:

  • DataHub’s Kafka dependency adds infrastructure complexity but enables event streaming.
  • Apache Atlas’s HBase requirement makes it heavier than alternatives for small deployments.
  • OpenMetadata has the lightest footprint among feature-complete options.

API capabilities

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
REST API | | | | | |
GraphQL API | ◐β | | | | |
Python SDK | | | | | |
Java SDK | | | | | |
Event streaming | ●β | | | | |
Webhooks | | | | | |

Assessment notes:

  • DataHub has the most mature GraphQL API among FOSS options (an example query follows these notes).
  • OpenMetadata’s Python SDK follows the “ingestion framework” pattern and is well documented.
  • Amundsen APIs are functional but less comprehensive than newer platforms.
  • Apache Atlas integrates with Kafka for Atlas Notifications.
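
To make the API comparison concrete, the example below shows the shape of a GraphQL search query against a DataHub-style endpoint over plain HTTP. The endpoint path and field names follow DataHub's public documentation in spirit, but treat them as assumptions and verify against the GraphQL schema of the version you deploy:

```python
import requests

GRAPHQL_URL = "http://localhost:9002/api/graphql"  # DataHub frontend default

query = """
query findDatasets($text: String!) {
  search(input: {type: DATASET, query: $text, start: 0, count: 5}) {
    searchResults { entity { urn } }
  }
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={"query": query, "variables": {"text": "orders"}},
    headers={"Authorization": "Bearer <personal-access-token>"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["search"]["searchResults"]:
    print(result["entity"]["urn"])
```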

Security capability matrix

Authentication methods

Method | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
SAML 2.0 | | | | | |
OIDC | | | | | |
LDAP | | | | | | -
Local auth | | | | | | -
Service accounts | | | | | |

Access control

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Role-based access | | | | | |
Attribute-based access | | | | | |
Asset-level permissions | | | | | |
Column-level masking | | | | | |

Assessment notes:

  • Apache Atlas has mature ABAC through integration with Apache Ranger.
  • DataHub’s policy framework supports attribute-based conditions.
  • OpenMetadata permissions are role-based with team-level scoping.
  • Commercial platforms offer finer-grained access control options.

Certifications and compliance

Certification | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
SOC 2 Type II | ● (Collate) | ● (Acryl) | - | - | ● | ●
ISO 27001 | ● (Collate) | | - | - | ● | ●
GDPR features | | | | | |
HIPAA capable | ◐ (Collate) | ◐ (Acryl) | - | - | |
FedRAMP | | | - | - | |

Assessment notes:

  • Certifications apply to managed service offerings; self-hosted inherits customer infrastructure controls.
  • Open source platforms can achieve compliance in customer environments but lack turnkey certification.
  • FedRAMP certification is available only on commercial platforms for US government requirements.

Commercial comparison matrix

Pricing models

Platform | Model | Free tier | Entry point | Enterprise
OpenMetadata | Open source + SaaS | Full FOSS | Collate: contact for pricing | Collate Enterprise
DataHub | Open source + SaaS | Full FOSS | DataHub Cloud: contact for pricing | DataHub Cloud Enterprise
Amundsen | Open source | Full FOSS | Self-hosted only | No commercial offering
Apache Atlas | Open source | Full FOSS | Self-hosted only | No commercial offering
Collibra | SaaS subscription | None | Contact for pricing | Per-user + platform fee
Microsoft Purview | Azure consumption | Limited free | Pay-as-you-go from $0.10/asset | Enterprise Agreement

Assessment notes:

  • OpenMetadata and DataHub offer full-featured open source with optional managed services.
  • Amundsen and Apache Atlas have no commercial backing; support is community-based.
  • Collibra pricing is enterprise-grade; expect $100,000+ annually for meaningful deployments.
  • Microsoft Purview consumption pricing varies significantly based on asset count and scan frequency.

Nonprofit programmes

Platform | Programme | Discount | Requirements
OpenMetadata (Collate) | Contact sales | Case-by-case | Registered nonprofit
DataHub (Acryl) | Contact sales | Case-by-case | Registered nonprofit
Collibra | Collibra for Nonprofits | Undisclosed | 501(c)(3) or equivalent
Microsoft Purview | Microsoft Nonprofits | Up to 75% on Azure credits | Registered nonprofit via TechSoup

Assessment notes:

  • FOSS options require no discount; full functionality is free.
  • Microsoft nonprofit pricing is the most transparent, offered through the TechSoup/Microsoft Nonprofits programme.
  • Enterprise vendors negotiate nonprofit pricing case-by-case; budget for 30-50% of list price.

Individual tool assessments

OpenMetadata

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 1.11.4 (December 2025)
Repository | github.com/open-metadata/OpenMetadata
Documentation | docs.open-metadata.org
Commercial offering | Collate (managed service)

Overview

OpenMetadata is a unified metadata platform providing data discovery, data quality, observability, and governance through a central metadata repository. The project emerged from Uber’s Databook and was open-sourced in 2021 by Collate, a company founded by former Uber data infrastructure engineers. Development follows a rapid release cadence with major versions approximately every 6-8 weeks.

The architecture centres on a metadata repository storing entities (tables, databases, dashboards, pipelines) and relationships using a MySQL or PostgreSQL backend with Elasticsearch for search. The platform distinguishes itself through comprehensive FOSS functionality; features like data quality, lineage, and glossary workflows are available without commercial licensing. The commercial Collate offering adds AI-powered features, managed infrastructure, and enterprise support.

OpenMetadata’s connector philosophy emphasises “no-code” ingestion where metadata extraction runs via configuration rather than custom coding. The platform supports 70+ connectors spanning databases, data warehouses, BI tools, and pipeline orchestrators with consistent metadata models across sources.
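
A sketch of what configuration-driven ingestion looks like, expressed here as a Python dict mirroring the documented YAML workflow shape (source, sink, workflowConfig) consumed by the `metadata ingest` CLI. Field names should be verified against the connector documentation for your version; all values are placeholders:

```python
# Mirrors the structure of OpenMetadata's YAML workflow files, which are
# normally run with `metadata ingest -c workflow.yaml`.
workflow = {
    "source": {
        "type": "postgres",
        "serviceName": "analytics_pg",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "hostPort": "db.example.internal:5432",
                "username": "catalogue_reader",
                "authType": {"password": "<secret>"},
                "database": "analytics",
            }
        },
        "sourceConfig": {"config": {"type": "DatabaseMetadata"}},
    },
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "openmetadata",
            "securityConfig": {"jwtToken": "<token>"},
        }
    },
}
```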

Strengths

Comprehensive FOSS feature set: Unlike competitors that reserve governance workflows for commercial tiers, OpenMetadata includes glossary approval workflows, data quality rules, and role-based access control in the open source version. Organisations can implement full catalogue governance without licensing costs.

Modern, intuitive interface: The React-based UI provides responsive search, inline editing, and streamlined navigation. The interface design reflects contemporary SaaS standards rather than traditional enterprise software patterns, reducing training overhead.

Active development trajectory: The project’s rapid release cadence (version 1.10 in October 2025, 1.11 in December 2025) demonstrates ongoing investment. Feature parity with commercial alternatives has improved substantially in recent releases.

Lightweight deployment: Minimum requirements of 8 GB RAM and 4 CPU cores with MySQL/PostgreSQL and Elasticsearch make OpenMetadata deployable on modest infrastructure. The Docker Compose quickstart enables evaluation in under 10 minutes.

Limitations

Limited ABAC capabilities: Access control is primarily role-based with team scoping. Attribute-based policies (e.g., access based on classification level) require workarounds through team structures rather than native policy expressions.

Event streaming still maturing: While Kafka integration exists for change events, the streaming capabilities are less mature than DataHub’s event-driven architecture. Organisations requiring real-time metadata synchronisation should evaluate carefully.

Managed service geographic coverage: Collate Cloud regions are limited compared to hyperscaler-native options. Organisations with strict data residency requirements outside US and EU should verify regional availability.

No on-premises commercial support: The managed Collate service is cloud-only. Organisations requiring vendor support for on-premises deployments must rely on community support or third-party consultants.

Deployment considerations

Self-hosted requirements:

  • MySQL 8.0+ or PostgreSQL 12+ for metadata storage
  • Elasticsearch 7.x or OpenSearch 2.x for search
  • Airflow 2.x for scheduled ingestion (optional; can use standalone CLI)
  • 8 GB RAM minimum; 16 GB recommended for production
  • Helm chart available for Kubernetes deployment

Operational overhead: Moderate. Requires Elasticsearch cluster management and MySQL/PostgreSQL administration. Upgrade path is well-documented with database migration scripts.

Integration capabilities

Integration type | Coverage
Databases | PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, Databricks, Athena, Presto, Trino, Oracle, SQL Server, and 50+ others
BI tools | Tableau, Looker, Metabase, Superset, Power BI, Mode, Redash
Pipelines | Airflow, Dagster, dbt, Fivetran, NiFi, Flink
Storage | S3, GCS, ADLS
Messaging | Kafka

Organisational fit

Best suited for:

  • Organisations prioritising open source with no licensing dependency
  • Teams with PostgreSQL/MySQL and Elasticsearch operational expertise
  • Deployments requiring data quality and governance in a single platform
  • Environments where rapid feature evolution is valued over stability guarantees

Less suitable for:

  • Organisations requiring attribute-based access control policies
  • Deployments without container orchestration capabilities
  • Teams needing FedRAMP or similar government certifications
  • Environments requiring real-time streaming metadata updates

DataHub

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 1.3.0 (October 2025)
Repository | github.com/datahub-project/datahub
Documentation | datahubproject.io/docs
Commercial offering | DataHub Cloud (Acryl Data)

Overview

DataHub is an event-driven metadata platform originally developed at LinkedIn and open-sourced in 2020. The architecture fundamentally differs from competitors through its use of Kafka for metadata change events, enabling real-time streaming integrations and event-driven workflows. Acryl Data, founded by DataHub’s LinkedIn creators, provides the commercial DataHub Cloud service.

The platform models metadata as a graph with typed entities and relationships stored in a MySQL or PostgreSQL backend, graph views in Elasticsearch, and change streams via Kafka. This architecture supports the “metadata as a service” pattern where metadata changes propagate to downstream consumers in near-real-time.

DataHub reached version 1.0 in March 2025 after five years of development, signalling maturity and API stability commitments. The project maintains active development with quarterly minor releases and strong enterprise adoption (Netflix, Visa, Slack, Pinterest are documented users).

Strengths

Event-driven architecture: Kafka-based metadata change events enable real-time integrations, streaming analytics, and event-driven workflows. Organisations already operating Kafka infrastructure gain natural integration points.

Mature GraphQL API: DataHub’s GraphQL interface provides flexible, efficient queries for custom integrations. The API is well-documented with comprehensive schema coverage, making it suitable for building custom experiences.

Strong enterprise adoption: Documented deployments at scale (Netflix, Visa, Airtel) provide confidence in production readiness. The project benefits from contributions and bug reports from demanding environments.

Comprehensive SDK options: Both Python and Java SDKs receive active development, enabling programmatic metadata management in either ecosystem. The SDKs abstract API complexity while preserving flexibility.
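
A short example of the Python SDK's documented REST emitter pattern, upserting a single aspect onto a dataset. Module paths and class names follow the acryl-datahub package as documented, but verify them against the SDK version in use:

```python
# pip install acryl-datahub
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point at the GMS endpoint of a running DataHub instance.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = builder.make_dataset_urn(
    platform="postgres", name="analytics.public.orders", env="PROD"
)

# Upsert one aspect: human-readable properties for the dataset.
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(description="Orders fact table, loaded nightly."),
)
emitter.emit(mcp)
```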

Limitations

Infrastructure complexity: The Kafka dependency increases deployment complexity compared to non-streaming alternatives. Organisations without existing Kafka expertise face additional operational burden.

Higher resource requirements: Baseline deployment requires 16 GB RAM and includes multiple services (GMS, frontend, Kafka, Elasticsearch, MySQL). Small organisations may find the footprint disproportionate.

UI lags the API: Some capabilities are reachable only through the GraphQL API and are not exposed in the UI. Technical users are better served than business users in some workflows.

Commercial features for governance: Advanced governance features including automated classification and some workflow capabilities require DataHub Cloud licensing rather than the open source version.

Deployment considerations

Self-hosted requirements:

  • MySQL 5.7+ or PostgreSQL 12+ for metadata storage
  • Elasticsearch 7.x for search and graph views
  • Kafka 2.x for metadata change events
  • 16 GB RAM minimum; 32 GB recommended for production
  • Helm chart available; Docker Compose for evaluation

Operational overhead: High. Kafka cluster management adds significant operational complexity. Organisations should have Kafka operational expertise or consider the managed DataHub Cloud service.

Integration capabilities

Integration type | Coverage
Databases | Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, MySQL, Hive, Presto, Trino, Oracle, SQL Server, and 50+ others
BI tools | Tableau, Looker, Power BI, Superset, Metabase, Mode, Sigma
Pipelines | Airflow, dbt, Dagster, Prefect, Fivetran, Spark
Storage | S3, GCS, ADLS
Messaging | Kafka, Pulsar

Organisational fit

Best suited for:

  • Organisations with existing Kafka infrastructure and expertise
  • Deployments requiring real-time metadata streaming and events
  • Teams building custom metadata applications via API
  • Environments with strong engineering capacity for platform operation

Less suitable for:

  • Small organisations without dedicated data platform teams
  • Deployments prioritising operational simplicity over streaming capabilities
  • Teams primarily needing business glossary and governance workflows
  • Environments without Kafka expertise or willingness to acquire it

Amundsen

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 4.3.0 (July 2025)
Repository | github.com/amundsen-io/amundsen
Documentation | amundsen.io
Commercial offering | None

Overview

Amundsen is a data discovery and metadata engine developed at Lyft and open-sourced in 2019 under the LF AI & Data Foundation. The project pioneered PageRank-style search ranking based on usage patterns, surfacing frequently queried tables above less-used alternatives. Amundsen follows a microservices architecture with separate frontend, search, and metadata services.

Development pace has slowed compared to OpenMetadata and DataHub, with community contributions driving most recent changes. The project lacks commercial backing, meaning support relies entirely on community resources. However, the architecture’s modularity enables organisations to adopt components selectively and integrate with existing infrastructure.

Amundsen’s primary strength lies in discovery; the platform excels at helping users find relevant data through intelligent search ranking. Governance capabilities (glossary, classification, quality) are minimal compared to newer alternatives.

Strengths

Proven discovery algorithms: Amundsen’s search ranking incorporates usage signals effectively, surfacing popular and frequently-queried tables. Organisations with large table counts benefit from intelligent relevance ranking.

Modular architecture: Separate services for frontend, search, and metadata enable selective adoption and integration with existing systems. Organisations can replace individual components (e.g., swap Neo4j for Neptune) without full platform replacement.

Lightweight for discovery: For organisations primarily needing data discovery without governance workflows, Amundsen provides focused functionality without feature bloat.

LF AI Foundation governance: Foundation membership provides neutral governance and reduces single-vendor dependency risks, though it also limits commercial investment.

Limitations

Minimal governance features: No native glossary, approval workflows, or automated classification. Organisations requiring governance workflows must integrate external tools or choose alternative platforms.

Slower development pace: Release frequency and feature additions lag behind commercially-backed alternatives. Major capability gaps may persist longer than with OpenMetadata or DataHub.

Table-level lineage only: Native lineage is table-level; column-level lineage requires custom implementation. This limitation is significant for impact analysis use cases.

No managed service option: Organisations must self-host with no vendor support option. Community support via Slack is available but response times and depth vary.

Legacy technology choices: The supported Python 3.8-3.10 and Node.js 10-12 ranges are dated. Dependency updates may require careful testing.

Deployment considerations

Self-hosted requirements:

  • PostgreSQL or Neo4j for metadata storage
  • Elasticsearch for search
  • Python 3.8-3.10, Node.js 10-12
  • 8 GB RAM minimum
  • Docker Compose available; Helm charts community-maintained

Operational overhead: Moderate. Simpler than DataHub (no Kafka) but requires Neo4j expertise if using graph backend. Documentation assumes significant user self-sufficiency.

Integration capabilities

Integration type | Coverage
Databases | Hive, Redshift, PostgreSQL, Snowflake, BigQuery, Athena, Presto, MySQL (via databuilder extractors; the pattern is sketched below)
BI tools | Tableau, Superset, Mode (limited)
Pipelines | Airflow (via DAG extractors)
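
Amundsen's databuilder library wires extract, load, and publish stages into batch jobs. The sketch below illustrates only that pattern; the class and method names are illustrative stand-ins, not databuilder's actual API, which is documented in the amundsen-io repositories:

```python
# Illustrative stand-ins for databuilder's extractor -> loader -> publisher
# pipeline; not the library's real classes.
class Extractor:
    def extract(self):
        # A real extractor would query a source system's information schema.
        yield {"schema": "public", "table": "orders", "columns": ["id", "amount"]}

class Loader:
    def load(self, record):
        # A real loader stages records as CSV files for the publisher.
        print("staging record:", record)

class Publisher:
    def publish(self):
        # A real publisher pushes staged records to Neo4j and Elasticsearch.
        print("publishing staged records to the metadata and search backends")

def run_job(extractor: Extractor, loader: Loader, publisher: Publisher) -> None:
    for record in extractor.extract():
        loader.load(record)
    publisher.publish()

run_job(Extractor(), Loader(), Publisher())
```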

Organisational fit

Best suited for:

  • Organisations primarily needing data discovery without governance
  • Environments with Neo4j or graph database expertise
  • Teams comfortable with significant self-service and customisation
  • Deployments prioritising simplicity over comprehensive features

Less suitable for:

  • Organisations requiring business glossary and governance workflows
  • Teams needing column-level lineage and impact analysis
  • Deployments without engineering capacity for custom integration work
  • Environments expecting vendor or commercial support

Apache Atlas

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 2.4.0 (January 2025)
Repository | github.com/apache/atlas
Documentation | atlas.apache.org
Commercial offering | None (Hadoop vendor distributions include Atlas)

Overview

Apache Atlas is the original open source data governance and metadata management framework for the Hadoop ecosystem, first released in 2015 as an Apache incubator project. The platform provides metadata services, classification, and lineage tracking with deep integration into Hadoop components including Hive, HBase, Kafka, and Sqoop.

Atlas architecture uses JanusGraph or HBase for metadata storage and Solr for search, reflecting its Hadoop-native heritage. The platform excels within Hadoop environments but shows its age when applied to modern cloud data warehouses and SaaS tools. Integration with Apache Ranger provides attribute-based access control enforcement.

Development continues under the Apache Foundation with moderate community activity. Atlas 2.4.0 in January 2025 demonstrates ongoing maintenance, though feature velocity is lower than commercially-backed alternatives.

Strengths

Hadoop ecosystem integration: Native hooks for Hive, HBase, Kafka, Sqoop, Storm, and Falcon provide automatic lineage capture within Hadoop environments. Organisations with significant Hadoop investment benefit from seamless integration.

Apache Ranger integration: Combined with Ranger, Atlas enables attribute-based access control where classifications and tags drive data access policies. This integration is unique among open source catalogues.

Mature and stable: Nearly a decade of production use provides confidence in stability for core use cases. The type system and API are well-established.
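
As an illustration of that established API, the sketch below issues a basic search against Atlas's v2 REST endpoint. The path and payload follow the Atlas REST documentation, but verify them, along with the default HTTP basic-auth setup, against your deployment:

```python
import requests

ATLAS = "http://atlas.example.internal:21000"  # Atlas's default web port

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/search/basic",
    json={"typeName": "hive_table", "query": "orders", "limit": 10},
    auth=("admin", "<password>"),
    timeout=30,
)
resp.raise_for_status()
# Basic search returns matching entities with their GUIDs and attributes.
for entity in resp.json().get("entities", []):
    print(entity["guid"], entity["attributes"].get("qualifiedName"))
```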

Foundation governance: Apache Foundation stewardship ensures neutral governance and long-term project continuity independent of commercial interests.

Limitations

Dated user interface: The web UI reflects 2015-era design patterns. User experience lags significantly behind modern catalogue interfaces, increasing training requirements and reducing adoption.

Hadoop-centric architecture: HBase or JanusGraph requirements assume Hadoop-style infrastructure. Organisations without existing Hadoop infrastructure face significant deployment overhead.

Limited cloud warehouse support: Connectors for Snowflake, BigQuery, and Databricks are community-contributed with varying quality. Cloud-native organisations will find gaps.

Minimal BI and pipeline coverage: Dashboard and pipeline metadata support is limited compared to newer platforms. Comprehensive cataloguing requires supplemental tools.

No managed service: Self-hosted only with no commercial support option. Hadoop distribution vendors (Cloudera, including the legacy Hortonworks line) bundle Atlas, but standalone enterprise support is unavailable.

Deployment considerations

Self-hosted requirements:

  • HBase or JanusGraph for metadata graph storage
  • Solr for search
  • Kafka for hook messaging
  • Zookeeper for coordination
  • 16 GB RAM minimum; 32 GB recommended
  • Java 8+ runtime

Operational overhead: High. Requires HBase/JanusGraph operational expertise plus Solr and Kafka management. Deployment complexity exceeds all other options in this category.

Integration capabilities

Integration type | Coverage
Databases | Hive, HBase, Oracle, SQL Server, MySQL, PostgreSQL, Cassandra, Couchbase
BI tools | Limited (custom integration required)
Pipelines | Sqoop, Storm, Falcon, Spark (via hooks)
Messaging | Kafka

Organisational fit

Best suited for:

  • Organisations with significant Hadoop/HBase infrastructure investment
  • Environments requiring Ranger-based access control integration
  • Deployments where stability and maturity outweigh UI modernisation needs
  • Teams with Java/Hadoop operational expertise

Less suitable for:

  • Cloud-native organisations without Hadoop infrastructure
  • Teams prioritising user experience and adoption
  • Deployments requiring modern BI tool and pipeline integration
  • Organisations without Java/Hadoop operational capabilities

Collibra

Attribute | Value
Type | Commercial
Pricing model | Subscription (per-user + platform)
Current version | 2025.08 (continuous release)
Documentation | productresources.collibra.com
API documentation | developer.collibra.com
Deployment | SaaS (primary), on-premises (enterprise)

Overview

Collibra Data Intelligence Platform is an enterprise data governance and catalogue solution founded in 2008, making it one of the longest-established vendors in the category. The platform emphasises business-user accessibility, governance workflows, and data stewardship alongside technical metadata management.

Collibra’s architecture centres on a knowledge graph storing business and technical metadata with extensive workflow automation for governance processes. The platform targets enterprise buyers with comprehensive feature sets, professional services, and global support infrastructure.

The product follows a continuous release model with monthly updates. Collibra’s market position is enterprise-focused with pricing reflecting that segment; most deployments exceed $100,000 annually.
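
Assets in that knowledge graph are reachable through the Core REST API documented at developer.collibra.com. A hedged sketch of listing assets by name; the resource path and parameter names are assumptions to check against your instance's API version:

```python
import requests

BASE = "https://your-instance.collibra.com"

session = requests.Session()
session.auth = ("svc_catalogue", "<password>")  # or a bearer token header

resp = session.get(
    f"{BASE}/rest/2.0/assets",
    params={"name": "orders", "limit": 10},
    timeout=30,
)
resp.raise_for_status()
# Paged responses wrap matches in a results list.
for asset in resp.json().get("results", []):
    print(asset["id"], asset["name"])
```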

Strengths

Industry-leading business glossary: Collibra’s business glossary and stewardship capabilities are best-in-class. Approval workflows, term relationships, and certification processes are more sophisticated than alternatives.

Extensive governance workflows: Configurable workflows for data certification, access requests, issue management, and stewardship tasks provide enterprise-grade governance automation.

Broadest connector library: 100+ connectors covering databases, BI tools, cloud platforms, and enterprise applications. Most organisations find pre-built connectors for their stack.

Professional services ecosystem: Global system integrator partnerships, professional services, and training programmes support enterprise deployments. Organisations with limited internal data management expertise benefit from implementation support.

Limitations

Enterprise pricing: Entry-level deployments start at $100,000+ annually with costs scaling significantly for larger user counts and data volumes. Budget-constrained organisations will find Collibra inaccessible.

Complexity: Feature breadth creates complexity. Implementation timelines of 6-12 months are common for enterprise deployments with extensive configuration requirements.

SaaS preference: While on-premises deployment exists, Collibra strongly prefers cloud deployment. Self-hosted customers may experience feature delays and reduced support priority.

Vendor lock-in concerns: Proprietary data models and workflows create switching costs. Data export capabilities exist but migration to alternatives requires significant effort.

Deployment considerations

SaaS deployment:

  • Multi-tenant cloud hosted by Collibra
  • Regional options (US, EU, APAC)
  • SOC 2 Type II, ISO 27001 certified
  • 99.9% uptime SLA

Self-hosted (enterprise):

  • Kubernetes-based deployment
  • Customer-managed infrastructure
  • Collibra Edge for hybrid connectivity
  • Requires enterprise licensing tier

Operational overhead: Low for SaaS (vendor-managed). Self-hosted requires dedicated infrastructure team and Collibra-specific expertise.

Integration capabilities

Integration type | Coverage
Databases | Snowflake, BigQuery, Redshift, Databricks, Azure Synapse, Oracle, SQL Server, PostgreSQL, MySQL, Teradata, and 70+ others
BI tools | Tableau, Power BI, Looker, Qlik, MicroStrategy, SAP BusinessObjects
Pipelines | Informatica, Talend, dbt, Airflow, Azure Data Factory
ERP/CRM | SAP, Salesforce, Workday
Cloud platforms | AWS, Azure, GCP native services

Organisational fit

Best suited for:

  • Large enterprises with substantial data governance budgets
  • Organisations prioritising business glossary and stewardship workflows
  • Deployments requiring professional services and implementation support
  • Environments needing broadest connector coverage

Less suitable for:

  • Budget-constrained organisations under $100,000 annual budget
  • Small organisations without dedicated data governance teams
  • Deployments prioritising self-service implementation
  • Technical teams preferring open source foundations

Microsoft Purview

Attribute | Value
Type | Commercial
Pricing model | Azure consumption-based
Current version | Continuous release (Unified Catalog GA 2025)
Documentation | learn.microsoft.com/purview
API documentation | learn.microsoft.com/rest/api/purview
Deployment | Azure SaaS only

Overview

Microsoft Purview is Microsoft’s unified data governance service combining data cataloguing, classification, and compliance capabilities within the Azure ecosystem. The platform evolved from Azure Purview (2020) with significant expansion in 2024-2025 to become Microsoft Purview with broader scope including data security, risk, and compliance features.

Purview’s architecture integrates with Microsoft’s Data Map for metadata storage, Microsoft Graph for relationships, and Azure services for compute. The platform leverages Microsoft’s information protection labels enabling unified classification across Microsoft 365, Azure data services, and third-party sources.

Purview uses consumption-based pricing where costs scale with asset count, scan frequency, and data classification volume. This model suits variable workloads but requires monitoring to avoid unexpected costs.

Strengths

Microsoft ecosystem integration: Native integration with Azure Synapse, Azure SQL, Power BI, Microsoft 365, and Fabric provides seamless metadata capture for Microsoft-centric environments.

Unified classification: Integration with Microsoft Information Protection enables consistent sensitivity labels across data catalogue, SharePoint, Teams, and email. Organisations already using Microsoft classification benefit from extension to data assets.

Consumption-based pricing: Pay-per-use model enables starting small and scaling with data volume. Organisations uncertain about scope can begin with limited assets and expand.

Data security integration: Purview combines cataloguing with data loss prevention, insider risk management, and compliance features unavailable in pure catalogue products.

Limitations

Azure lock-in: Purview is Azure-native with no self-hosted or alternative cloud option. Organisations avoiding Azure dependency cannot use Purview.

Multi-cloud limitations: While Purview scans non-Azure sources (AWS, GCP), integration depth is inferior to Azure-native sources. Multi-cloud organisations may find inconsistent capabilities.

Complex pricing: Consumption pricing across multiple meters (data map, scans, insights) requires careful monitoring. Organisations report difficulty predicting costs.

Feature maturity: The Unified Catalog (GA November 2025) is newer than alternatives. Some features remain in preview, with implications for stability and completeness.

API maturity: APIs are evolving with some capabilities in public preview. Organisations requiring stable programmatic access should evaluate current coverage against requirements.

Deployment considerations

Deployment:

  • Azure subscription required
  • Single-tenant instance per Azure tenant
  • Automatically regional based on Azure subscription
  • No self-hosted option

Operational overhead: Low for Azure-native deployments. Microsoft manages infrastructure. Scanning configuration and classification rules require governance team attention.
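
As an illustration of that scanning surface, the sketch below triggers a Purview scan run through the scanning REST API. This is a minimal sketch, not a definitive implementation: the account, data source, and scan names are placeholders, and the api-version shown may have been superseded; verify the path and version against the current API reference before use.

```python
# Minimal sketch: trigger a Microsoft Purview scan run via the scanning REST API.
# Assumptions: azure-identity and requests are installed; ACCOUNT, DATA_SOURCE,
# and SCAN are hypothetical names; the api-version may be outdated.
import uuid

import requests
from azure.identity import DefaultAzureCredential

ACCOUNT = "example-purview"          # hypothetical Purview account name
DATA_SOURCE = "AzureSqlDatabase-1"   # hypothetical registered data source
SCAN = "Scan-Weekly"                 # hypothetical scan definition

# Acquire a token for the Purview data plane.
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default")

run_id = str(uuid.uuid4())
url = (
    f"https://{ACCOUNT}.purview.azure.com/scan"
    f"/datasources/{DATA_SOURCE}/scans/{SCAN}/runs/{run_id}"
)
resp = requests.put(
    url,
    params={"api-version": "2022-02-01-preview"},  # check for a newer GA version
    headers={"Authorization": f"Bearer {token.token}"},
)
resp.raise_for_status()
print(resp.json())  # scan run metadata, including status
```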

Integration capabilities

| Integration type | Coverage |
| --- | --- |
| Azure services | Azure SQL, Synapse, Data Lake, Blob Storage, Databricks, Cosmos DB, Fabric (native) |
| AWS | S3, RDS, Redshift, Glue (via scanner) |
| GCP | BigQuery, Cloud Storage (via scanner) |
| Databases | SQL Server, Oracle, PostgreSQL, MySQL, SAP HANA, Teradata, Snowflake |
| BI tools | Power BI (native), Tableau, Looker (limited) |
| Pipelines | Azure Data Factory (native), dbt |

Organisational fit

Best suited for:

  • Organisations committed to Microsoft Azure ecosystem
  • Deployments requiring unified data and document classification
  • Environments already using Microsoft Information Protection
  • Teams preferring consumption pricing over committed spend

Less suitable for:

  • Organisations avoiding cloud vendor lock-in
  • Multi-cloud deployments with significant non-Azure workloads
  • Teams requiring predictable fixed costs
  • Environments needing self-hosted deployment options

Selection guidance

Decision framework

What is your primary deployment preference?

  • Self-hosted (FOSS preferred)
      • Existing Kafka infrastructure → DataHub
      • No Kafka → OpenMetadata
  • Managed service
      • Annual spend above $100,000 → Collibra or DataHub Cloud
      • Annual spend below $100,000 → OpenMetadata (Collate) or DataHub Cloud
  • Azure-native required → Microsoft Purview

Recommendations by context

Organisations with minimal IT capacity

Recommended: OpenMetadata with Collate Cloud or DataHub with DataHub Cloud

Managed services eliminate infrastructure operational burden while providing full catalogue functionality. Both offer straightforward onboarding with guided setup wizards. Collate and DataHub Cloud pricing is negotiable for smaller organisations; request nonprofit or startup pricing.

Alternative: Microsoft Purview (if Azure-committed)

For organisations already invested in Azure, Purview’s consumption model enables starting small. Native Microsoft integrations reduce connector configuration effort.

Avoid: Self-hosted deployments requiring Kafka, HBase, or complex orchestration. Apache Atlas and self-hosted DataHub require infrastructure expertise unavailable in minimal IT contexts.

Organisations with established IT capacity

Recommended: OpenMetadata (self-hosted) or DataHub (self-hosted)

Self-hosted FOSS deployments provide maximum control, no licensing costs, and full feature access. OpenMetadata offers lighter infrastructure requirements; DataHub suits organisations with existing Kafka expertise.

Selection criteria:

  • Choose OpenMetadata if Kafka infrastructure is unavailable and operational simplicity is valued
  • Choose DataHub if real-time metadata streaming and event-driven architecture are requirements
  • Choose DataHub if GraphQL API access is a priority for custom integrations (see the sketch below)
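
A minimal sketch of that GraphQL access is shown below, assuming a default self-hosted DataHub deployment at localhost:8080 and a personal access token if authentication is enabled; the field names follow the public schema but should be checked against the current API documentation.

```python
# Minimal sketch: search datasets through DataHub's GraphQL API.
# Assumptions: default self-hosted endpoint; the token placeholder must be replaced.
import requests

QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        ... on Dataset { name }
      }
    }
  }
}
"""

resp = requests.post(
    "http://localhost:8080/api/graphql",
    json={
        "query": QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": "orders", "start": 0, "count": 5}
        },
    },
    headers={"Authorization": "Bearer <personal-access-token>"},  # if auth enabled
)
resp.raise_for_status()
print(resp.json()["data"]["search"]["total"])  # number of matching datasets
```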

Alternative: Collibra (if budget allows)

Organisations with governance maturity requiring sophisticated stewardship workflows and business glossary management may justify Collibra investment. Evaluate whether FOSS alternatives meet workflow requirements before committing enterprise spend.

Organisations with Hadoop infrastructure

Recommended: Apache Atlas (if governance integration with Ranger is required) or DataHub (for modernisation path)

Apache Atlas integrates natively with Hadoop ecosystem and Apache Ranger for access control. Organisations with significant HBase and Hive investment benefit from seamless lineage capture.

DataHub provides a modernisation path, ingesting Hive and Hadoop metadata while offering a superior UI and broader connector support for hybrid environments.

Organisations prioritising data sovereignty

Recommended: OpenMetadata or DataHub (self-hosted)

Self-hosted FOSS deployments keep all metadata on organisation-controlled infrastructure with no data transmission to external services. Both platforms support air-gapped deployment for high-security environments.

Avoid: SaaS deployments where data residency cannot be guaranteed or verified. Evaluate managed service data processing locations carefully if considering cloud options.

Migration paths

| From | To | Complexity | Approach | Timeline |
| --- | --- | --- | --- | --- |
| Amundsen | OpenMetadata | Medium | Export via Amundsen API; import using OpenMetadata bulk loader | 2-4 weeks |
| Amundsen | DataHub | Medium | Export via Amundsen API; import using DataHub Python SDK | 2-4 weeks |
| Apache Atlas | OpenMetadata | Medium | OpenMetadata provides Atlas connector for metadata import | 2-4 weeks |
| Apache Atlas | DataHub | Medium | DataHub provides Atlas source connector | 2-4 weeks |
| OpenMetadata | DataHub | Low-Medium | Export via OpenMetadata API; transform to DataHub model | 2-3 weeks |
| DataHub | OpenMetadata | Low-Medium | Export via DataHub API; transform to OpenMetadata model | 2-3 weeks |
| Any FOSS | Collibra | High | Collibra professional services engagement typical; custom migration scripts | 2-4 months |
| Collibra | Any FOSS | High | Export via Collibra API; significant model transformation required | 2-4 months |
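
To illustrate the import side of these migrations, the sketch below re-emits one exported record into DataHub using the DataHub Python SDK (the acryl-datahub package). The exported record shape and GMS endpoint are placeholder assumptions; a real migration would map the source catalogue's full model, not a single description field.

```python
# Minimal sketch: push one migrated dataset record into DataHub via its Python SDK.
# Assumptions: acryl-datahub is installed; the exported dict shape is hypothetical.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS

# One record exported from the source catalogue (hypothetical shape).
exported = {"platform": "hive", "name": "sales.orders", "description": "Daily orders"}

mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(
        platform=exported["platform"], name=exported["name"], env="PROD"
    ),
    aspect=DatasetPropertiesClass(description=exported["description"]),
)
emitter.emit(mcp)  # writes the dataset properties aspect to DataHub
```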

Resources and references

Official documentation

Open source platforms

| Tool | Documentation | API reference | GitHub repository |
| --- | --- | --- | --- |
| OpenMetadata | docs.open-metadata.org | docs.open-metadata.org/swagger | github.com/open-metadata/OpenMetadata |
| DataHub | datahubproject.io/docs | datahubproject.io/docs/api | github.com/datahub-project/datahub |
| Amundsen | amundsen.io | github.com/amundsen-io/amundsen | github.com/amundsen-io/amundsen |
| Apache Atlas | atlas.apache.org | atlas.apache.org/api | github.com/apache/atlas |

Commercial platforms

| Tool | Documentation | API reference | Developer portal |
| --- | --- | --- | --- |
| Collibra | productresources.collibra.com | developer.collibra.com/api | developer.collibra.com |
| Microsoft Purview | learn.microsoft.com/purview | learn.microsoft.com/rest/api/purview | learn.microsoft.com/purview/developer |

Relevant standards

| Standard | Description | URL |
| --- | --- | --- |
| Open Metadata and Governance (OMAG) | Egeria project metadata interoperability standards | egeria-project.org |
| Apache Atlas REST API | De facto standard for metadata exchange in Hadoop ecosystems | atlas.apache.org/api/v2 |
| W3C DCAT | Data Catalog Vocabulary for describing datasets | w3.org/TR/vocab-dcat-3 |
| ISO 11179 | Metadata registries standard | iso.org/standard/78916.html |
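
For orientation on W3C DCAT, the sketch below builds a small DCAT dataset description with the rdflib library; the dataset URI and property values are illustrative only.

```python
# Minimal sketch: describe a dataset with W3C DCAT using rdflib.
# Assumptions: rdflib is installed; the URI and literals are illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
dataset = URIRef("https://example.org/dataset/orders")  # hypothetical dataset URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Orders")))
g.add((dataset, DCTERMS.description, Literal("Daily order transactions")))
g.add((dataset, DCAT.keyword, Literal("sales")))

print(g.serialize(format="turtle"))  # emits the description as Turtle
```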

See also

Internal documentation relevant to data catalogue selection and implementation: