Data Catalogue and Governance

Data catalogue and governance platforms provide centralised metadata management, enabling organisations to discover, understand, and trust their data assets. These platforms maintain inventories of data assets across databases, data lakes, warehouses, and applications while tracking relationships, lineage, ownership, and quality. The category encompasses metadata ingestion, search and discovery, business glossary management, data lineage visualisation, and governance workflow automation.

This page covers platforms where metadata management and governance are the primary function. Adjacent tools with overlapping capabilities exist: data quality platforms (covered in Data Quality Tools), data integration platforms with built-in cataloguing, and database-native metadata features. The platforms assessed here provide standalone or primary-purpose cataloguing with governance capabilities that extend across heterogeneous data environments.

Assessment methodology

Tool assessments are based on official vendor documentation, published API references, release notes, and technical specifications as of 2026-01-25. Feature availability varies by product tier, deployment model, and region. Verify current capabilities directly with vendors during procurement. Community-reported information is excluded; only documented features are assessed.

Requirements taxonomy

This taxonomy defines evaluation criteria for data catalogue and governance platforms. Requirements are organised by functional area and weighted by typical priority for mission-driven organisations operating across multiple data systems.

Functional requirements

Core capabilities defining what the platform must do to support metadata management and governance.

Metadata ingestion and connectors

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F1.1 | Database connector breadth | Pre-built connectors for relational databases including PostgreSQL, MySQL, SQL Server, Oracle, and cloud-native databases such as Snowflake, BigQuery, Redshift, and Databricks | Full: 30+ native connectors covering major databases. Partial: 15-29 connectors or gaps in common systems. Minimal: under 15 connectors. | Review connector documentation; verify specific systems required | Essential
F1.2 | BI and reporting tool connectors | Ingestion from business intelligence platforms including Tableau, Power BI, Looker, Metabase, and Superset | Full: native connectors for 5+ major BI tools with dashboard and report metadata. Partial: 2-4 tools. Minimal: single tool or none. | Review BI connector documentation; check metadata depth captured | Important
F1.3 | Data pipeline tool connectors | Integration with orchestration and ETL tools including Airflow, dbt, Fivetran, and Spark | Full: native connectors capturing pipeline metadata and lineage. Partial: limited pipeline coverage. Minimal: manual entry only. | Review pipeline connector documentation; verify lineage capture | Important
F1.4 | Custom connector framework | SDK or framework for building connectors to unsupported systems | Full: documented SDK with examples, community connector repository. Partial: API-only ingestion. None: no extensibility mechanism. | Review developer documentation; check connector development guides | Important
F1.5 | Incremental metadata ingestion | Ability to ingest only changed metadata rather than full scans | Full: change detection for all connector types, configurable schedules. Partial: incremental for some connectors. None: full scan only. | Review ingestion documentation; check scheduling options | Important
F1.6 | Schema change detection | Automatic detection and notification of schema changes in connected sources (a change-detection sketch follows this table) | Full: automated detection, change history, notifications. Partial: detection without alerting. None: manual discovery only. | Review change detection documentation | Desirable
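
Incremental ingestion (F1.5) and schema change detection (F1.6) typically rest on the same mechanism: fingerprint each source object's metadata and re-process only objects whose fingerprint has changed since the last scan. A minimal, platform-neutral sketch of the idea (not any vendor's implementation):

```python
import hashlib
import json

def schema_fingerprint(columns: list[dict]) -> str:
    """Hash a table's column definitions so changes are cheap to detect."""
    canonical = json.dumps(sorted(columns, key=lambda c: c["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_changes(previous: dict[str, str], current: dict[str, list[dict]]) -> dict[str, str]:
    """Return only the tables whose fingerprints differ from the last scan."""
    changed = {}
    for table, columns in current.items():
        digest = schema_fingerprint(columns)
        if previous.get(table) != digest:
            changed[table] = digest
    return changed

# Example: one column type widened, one table newly appeared.
previous = {"public.users": schema_fingerprint([{"name": "id", "type": "int"}])}
current = {
    "public.users": [{"name": "id", "type": "bigint"}],  # changed -> re-ingest
    "public.orders": [{"name": "id", "type": "int"}],    # new -> ingest
}
print(detect_changes(previous, current))
```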

Search and discovery

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F2.1 | Full-text search | Search across asset names, descriptions, column names, tags, and documentation | Full: relevance ranking, type filtering, faceted search, search suggestions. Partial: basic keyword matching. Minimal: exact match only. | Test search functionality in trial; review search documentation | Essential
F2.2 | Faceted filtering | Ability to filter search results by asset type, owner, domain, tags, certification status | Full: 8+ filter dimensions, combinable filters, saved filter sets. Partial: 4-7 dimensions. Minimal: under 4 dimensions. | Review search interface documentation; test filtering capabilities | Essential
F2.3 | Asset popularity and usage signals | Display of asset usage patterns such as query frequency, user access counts, and downstream dependencies | Full: usage metrics visible in search ranking and asset pages. Partial: limited metrics. None: no usage signals. | Review usage analytics documentation | Important
F2.4 | Saved searches and collections | Ability to save search queries and curate asset collections for reuse | Full: personal and shared collections, scheduled search alerts. Partial: personal saves only. None: no persistence. | Review collection and bookmark documentation | Desirable
F2.5 | Natural language search | Support for conversational queries beyond keyword matching | Full: NLP processing with intent recognition. Partial: basic synonym handling. None: keyword only. | Review AI/ML search documentation; test with natural queries | Desirable
F2.6 | Cross-asset search | Unified search across tables, columns, dashboards, pipelines, and glossary terms | Full: single search box with type-specific results. Partial: separate searches per type. None: siloed search. | Review search scope documentation | Essential

Data lineage

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F3.1 | Column-level lineage | Tracking of data flow at column granularity, not just table level | Full: automatic column-level extraction for supported sources. Partial: manual column mapping. None: table level only. | Review lineage documentation; check supported sources | Essential
F3.2 | Cross-system lineage | Lineage tracking across different data systems (e.g., database to warehouse to BI) | Full: automatic stitching across heterogeneous systems. Partial: manual linking required. None: single-system only. | Review multi-system lineage documentation | Essential
F3.3 | Lineage visualisation | Interactive graph visualisation of upstream and downstream dependencies | Full: expandable graph with filtering, impact analysis highlighting, export options. Partial: static diagrams. None: list view only. | Review lineage UI documentation; test visualisation | Important
F3.4 | Manual lineage editing | Ability to manually define or correct lineage relationships | Full: UI and API for manual lineage, version tracking. Partial: UI only. None: no manual editing. | Review lineage editing documentation | Important
F3.5 | SQL parsing for lineage | Automatic extraction of lineage from SQL queries and transformations | Full: parsing of complex SQL including CTEs, subqueries, unions. Partial: simple query parsing. None: no SQL parsing. | Review SQL lineage documentation; check dialect support | Important
F3.6 | dbt model lineage | Native integration with dbt for model dependencies and documentation | Full: automatic dbt manifest ingestion, model-level lineage. Partial: manual import. None: no dbt support. | Review dbt integration documentation | Context-dependent

Business glossary and terminology

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F4.1 | Glossary term management | Creation and maintenance of business term definitions with approval workflows | Full: hierarchical terms, versioning, approval workflows, term relationships. Partial: flat term list. Minimal: no glossary feature. | Review glossary documentation; check term management features | Essential
F4.2 | Term-to-asset linking | Association of glossary terms with data assets (tables, columns, reports) | Full: bulk linking, automatic suggestions, bidirectional navigation. Partial: manual one-by-one linking. None: no linking. | Review term linking documentation | Essential
F4.3 | Glossary import and export | Bulk import and export of glossary content | Full: multiple formats (CSV, Excel, JSON), relationship preservation on import. Partial: single format. None: manual entry only. | Review import/export documentation; check format support | Important
F4.4 | Controlled vocabulary enforcement | Ability to restrict tagging and annotation to approved glossary terms | Full: validation against glossary, restricted free-text. Partial: suggestions only. None: no enforcement. | Review tagging policy documentation | Desirable
F4.5 | Multi-domain glossaries | Support for separate glossaries per business domain with cross-references | Full: domain-specific glossaries, cross-domain linking, domain-based permissions. Partial: single glossary with domain tags. None: single flat glossary. | Review domain and glossary documentation | Important

Data classification and tagging

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F5.1 | Tag management | Hierarchical tagging system with controlled vocabularies | Full: nested tags, tag governance, usage tracking. Partial: flat tags. Minimal: no tagging. | Review tag management documentation | Essential
F5.2 | Automated data classification | Automatic detection of sensitive data types (PII, financial, health) | Full: pattern-based and ML classification, custom classifiers, confidence scoring. Partial: basic pattern matching. None: manual only. | Review classification documentation; check classifier types | Essential
F5.3 | Classification propagation | Automatic propagation of classifications through lineage | Full: downstream propagation with override controls. Partial: suggestions only. None: no propagation. | Review propagation documentation | Important
F5.4 | Custom classification rules | Ability to define organisation-specific classification patterns | Full: regex, dictionary, and ML-based custom rules. Partial: limited customisation. None: built-in only. | Review custom classification documentation | Important
F5.5 | Data domain assignment | Organisation of assets into business domains | Full: hierarchical domains, domain owners, cross-domain relationships. Partial: single-level domains. None: no domain concept. | Review domain documentation | Important

Data quality integration

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F6.1 | Native data quality checks | Built-in data quality rule definition and execution (a sketch of these checks follows this table) | Full: completeness, uniqueness, validity checks with scheduling. Partial: limited check types. None: no native quality. | Review data quality documentation | Important
F6.2 | External quality tool integration | Ingestion of quality scores from tools like Great Expectations or Monte Carlo | Full: native integrations with score display. Partial: API-based ingestion. None: no integration. | Review quality integration documentation | Important
F6.3 | Quality score visibility | Display of data quality scores alongside asset metadata | Full: quality metrics on asset pages, quality-based filtering. Partial: separate quality view. None: no visibility. | Review quality display documentation | Important
F6.4 | Quality alerting | Notifications when quality thresholds are breached | Full: configurable thresholds, multiple notification channels. Partial: fixed thresholds. None: no alerting. | Review alerting documentation | Desirable
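
The check types named in F6.1 have precise definitions even though rule engines differ between platforms. A platform-neutral sketch of two of them, assuming non-empty column samples:

```python
from dataclasses import dataclass

@dataclass
class QualityResult:
    check: str
    passed: bool
    observed: float

def completeness(values: list, threshold: float = 0.95) -> QualityResult:
    """Share of non-null values must meet the threshold."""
    ratio = sum(v is not None for v in values) / len(values)
    return QualityResult("completeness", ratio >= threshold, ratio)

def uniqueness(values: list) -> QualityResult:
    """Every non-null value must appear exactly once."""
    non_null = [v for v in values if v is not None]
    ratio = len(set(non_null)) / len(non_null)
    return QualityResult("uniqueness", ratio == 1.0, ratio)

column = ["a", "b", "b", None]
print(completeness(column))  # observed 0.75 -> fails the 0.95 default
print(uniqueness(column))    # observed 0.67 -> fails
```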

Collaboration and documentation

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
F7.1 | Asset documentation | Rich-text documentation attached to assets with formatting and links | Full: Markdown support, embedded images, version history. Partial: plain text only. None: no documentation field. | Review documentation features | Essential
F7.2 | Ownership and stewardship | Assignment of owners and stewards to assets with contact visibility | Full: multiple owner types, ownership inheritance, accountability tracking. Partial: single owner field. None: no ownership. | Review ownership documentation | Essential
F7.3 | Commenting and discussion | Threaded discussions on assets for questions and clarifications | Full: threaded comments, mentions, notifications, resolution tracking. Partial: flat comments. None: no comments. | Review collaboration documentation | Important
F7.4 | Request and feedback workflows | Workflows for requesting access, asking questions, or suggesting edits | Full: configurable request types, routing, SLA tracking. Partial: basic request forms. None: no workflows. | Review request workflow documentation | Desirable
F7.5 | Announcements and news | Ability to publish data-related announcements to users | Full: targeted announcements, acknowledgment tracking. Partial: global announcements. None: no announcement feature. | Review announcement documentation | Desirable

Technical requirements

Infrastructure, architecture, and deployment considerations.

Deployment and hosting

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
T1.1 | Self-hosted deployment | Ability to deploy on organisation-controlled infrastructure | Full: complete feature parity, documented deployment, ongoing support for self-hosted. Partial: available but with feature gaps. None: SaaS only. | Review deployment documentation; compare feature matrices | Important
T1.2 | Container deployment | Official Docker images and Kubernetes Helm charts | Full: maintained official images, Helm charts, documented orchestration. Partial: community images. None: no container support. | Check Docker Hub and artifact registries; review Helm chart documentation | Important
T1.3 | Cloud-agnostic deployment | Ability to deploy on AWS, Azure, GCP, or on-premises equivalently | Full: documented deployment for 3+ clouds and on-premises. Partial: single cloud focus. None: vendor-locked. | Review multi-cloud deployment documentation | Important
T1.4 | High availability architecture | Documented HA deployment eliminating single points of failure | Full: HA architecture documentation, automatic failover, tested recovery. Partial: manual failover. None: single instance only. | Review HA documentation; check clustering support | Context-dependent
T1.5 | Managed service option | Vendor-operated SaaS deployment reducing operational overhead | Full: fully managed with SLA, regional options, data residency controls. Partial: shared infrastructure. None: self-hosted only. | Review SaaS documentation; check regional availability | Context-dependent

Scalability and performance

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
T2.1 | Metadata volume capacity | Documented capacity for number of assets, columns, relationships | Full: published limits with sizing guidance (millions of assets). Partial: general capacity claims. None: undocumented. | Review sizing documentation; check published limits | Important
T2.2 | Search performance | Search response times at scale | Full: documented query latency targets, index optimisation guidance. Partial: general performance claims. None: no performance data. | Review performance documentation; test at scale if possible | Important
T2.3 | Ingestion throughput | Rate of metadata ingestion supported | Full: documented throughput limits, parallel ingestion support. Partial: serial ingestion. None: undocumented. | Review ingestion performance documentation | Important
T2.4 | Horizontal scaling | Ability to scale by adding nodes | Full: documented horizontal scaling for all components. Partial: selective scaling. None: vertical only. | Review scaling architecture documentation | Context-dependent

Integration architecture

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
T3.1 | REST API completeness | Comprehensive API covering all platform functions | Full: 90%+ feature coverage via API, versioned, documented. Partial: limited API coverage. Minimal: no API or undocumented. | Review API documentation; compare to UI capabilities | Essential
T3.2 | GraphQL API | GraphQL endpoint for flexible metadata queries | Full: complete GraphQL schema, documented queries. Partial: limited GraphQL. None: REST only. | Review GraphQL documentation | Desirable
T3.3 | Python SDK | Official Python SDK for programmatic access | Full: maintained SDK, pip installable, comprehensive examples. Partial: basic SDK. None: raw API only. | Review SDK documentation; check PyPI package | Important
T3.4 | Event streaming | Publication of metadata change events for downstream consumption | Full: Kafka or equivalent streaming, documented event schema. Partial: polling-based. None: no event streaming. | Review event streaming documentation | Important
T3.5 | Webhook support | Configurable webhooks for event notifications (a receiver sketch follows this table) | Full: event-specific webhooks, retry logic, payload customisation. Partial: limited events. None: no webhooks. | Review webhook documentation | Important
T3.6 | OpenMetadata standards | Compliance with OpenMetadata or similar open standards | Full: native OpenMetadata API compliance. Partial: export compatibility. None: proprietary only. | Review standards compliance documentation | Desirable
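
Webhook support (T3.5) means the platform POSTs event payloads to an endpoint your team operates. A minimal receiver sketch; the header name, signing scheme, and event shape are illustrative assumptions, since each platform defines its own:

```python
import hashlib
import hmac
from flask import Flask, abort, request

app = Flask(__name__)
SHARED_SECRET = b"replace-me"  # agreed with the catalogue platform

@app.post("/metadata-events")
def metadata_event():
    # Verify an HMAC signature before trusting the payload; the header name
    # and digest scheme here are placeholders, not any vendor's contract.
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(SHARED_SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)
    event = request.get_json()
    # Route on event type, e.g. trigger a downstream sync on schema changes.
    if event.get("eventType") == "schemaChanged":
        print("schema change for", event.get("entityId"))
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```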

Security requirements

Security controls and data protection capabilities.

Authentication and access control

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
S1.1 | SSO integration | Single sign-on via SAML 2.0 or OIDC | Full: SAML and OIDC support, multiple IdP support. Partial: single protocol. None: local auth only. | Review SSO documentation; check protocol support | Essential
S1.2 | Role-based access control | Granular permissions based on user roles | Full: custom roles, asset-level permissions, policy inheritance. Partial: fixed role set. Minimal: admin/user only. | Review RBAC documentation; check permission granularity | Essential
S1.3 | Attribute-based access control | Access decisions based on asset attributes such as domain, classification, and tags (a toy policy evaluator follows this table) | Full: policy engine with attribute conditions. Partial: limited attributes. None: role-only. | Review ABAC documentation | Important
S1.4 | Row and column-level security | Restriction of metadata visibility by data sensitivity | Full: column masking, row filtering based on user attributes. Partial: asset-level only. None: full visibility. | Review fine-grained access documentation | Context-dependent
S1.5 | API authentication | Secure API access methods | Full: OAuth 2.0, API keys, service accounts with rotation. Partial: single method. None: unauthenticated. | Review API authentication documentation | Essential
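
Attribute-based access control (S1.3) grants access by evaluating policies over asset attributes rather than static role grants. A toy policy evaluator, not modelled on any assessed platform's engine:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    domain: str
    classification: str
    tags: set = field(default_factory=set)

@dataclass
class User:
    domains: set
    clearance: int

# Rank classifications so clearance comparisons are ordinal.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def can_view(user: User, asset: Asset) -> bool:
    """Attribute-based rule: domain membership AND sufficient clearance."""
    return asset.domain in user.domains and user.clearance >= LEVELS[asset.classification]

analyst = User(domains={"finance"}, clearance=LEVELS["internal"])
print(can_view(analyst, Asset("finance", "internal")))      # True
print(can_view(analyst, Asset("finance", "confidential")))  # False: clearance too low
print(can_view(analyst, Asset("hr", "public")))             # False: wrong domain
```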

Data protection

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
S2.1 | Encryption at rest | Encryption of stored metadata | Full: AES-256 or equivalent, customer-managed keys option. Partial: vendor-managed keys only. None: unencrypted. | Review encryption documentation; check key management | Essential
S2.2 | Encryption in transit | TLS for all network communications | Full: TLS 1.2+ enforced, certificate management. Partial: optional TLS. None: unencrypted allowed. | Review transport security documentation | Essential
S2.3 | Audit logging | Comprehensive logging of user actions and system events | Full: tamper-evident logs, configurable retention, export capability. Partial: limited logging. None: no audit trail. | Review audit logging documentation | Essential
S2.4 | Data masking in samples | Masking of sensitive data in sample previews | Full: configurable masking rules, automatic PII detection. Partial: all-or-nothing masking. None: no masking. | Review sample data documentation | Important
S2.5 | Data residency controls | Control over geographic location of stored data | Full: regional deployment options, documented data flows. Partial: limited regions. None: single region, undisclosed. | Review data residency documentation | Context-dependent

Compliance and certification

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
S3.1 | SOC 2 certification | SOC 2 Type II compliance for SaaS deployments | Full: current SOC 2 Type II report available. Partial: Type I only. None: no SOC 2. | Request SOC 2 report; verify currency | Important
S3.2 | GDPR compliance features | Features supporting GDPR compliance (data subject rights, consent tracking) | Full: documented GDPR features, DPA available. Partial: basic privacy features. None: no GDPR support. | Review GDPR documentation; request DPA | Essential
S3.3 | ISO 27001 certification | Information security management certification | Full: current ISO 27001 certificate. Partial: in progress. None: no certification. | Request certificate; verify currency | Desirable
S3.4 | HIPAA compliance | Compliance features for healthcare data | Full: BAA available, HIPAA-specific documentation. Partial: general controls. None: no HIPAA support. | Review HIPAA documentation; request BAA | Context-dependent

Operational requirements

Administration, maintenance, and support capabilities.

Administration and configuration

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
O1.1 | Web-based administration | Browser-based interface for system configuration | Full: complete admin UI, no command-line required. Partial: some tasks require CLI. None: CLI only. | Review admin documentation; test admin interface | Important
O1.2 | Configuration as code | Ability to manage configuration through version-controlled files | Full: full config in YAML/JSON, GitOps compatible. Partial: partial config export. None: UI only. | Review configuration documentation | Desirable
O1.3 | Multi-tenancy | Support for separate tenants within single deployment | Full: tenant isolation, per-tenant configuration. Partial: logical separation only. None: single tenant. | Review multi-tenancy documentation | Context-dependent
O1.4 | Bulk operations | Administrative bulk actions (user management, asset operations) | Full: bulk via UI and API, import/export. Partial: API bulk only. None: individual operations only. | Review bulk operation documentation | Important

Monitoring and observability

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
O2.1 | Health monitoring | System health dashboards and endpoints | Full: health endpoints, component status, performance metrics. Partial: basic health check. None: no monitoring. | Review monitoring documentation | Important
O2.2 | Metrics export | Export of platform metrics to monitoring systems | Full: Prometheus, DataDog, or equivalent integration. Partial: custom metrics only. None: no export. | Review metrics documentation; check integrations | Desirable
O2.3 | Alerting integration | Integration with alerting systems for operational issues | Full: native alerting plus PagerDuty, Slack, email. Partial: email only. None: no alerting. | Review alerting documentation | Desirable

Backup and recovery

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
O3.1 | Backup procedures | Documented backup processes for metadata store | Full: automated backup, point-in-time recovery, documented procedures. Partial: manual backup. None: no documentation. | Review backup documentation | Essential
O3.2 | Disaster recovery | Documented DR procedures with RTO/RPO targets | Full: DR runbooks, tested recovery, documented targets. Partial: basic DR guidance. None: no DR documentation. | Review DR documentation | Important
O3.3 | Data export | Full export of all metadata for migration or backup | Full: complete export in standard formats. Partial: partial export. None: no export capability. | Review export documentation; test export completeness | Essential

Commercial requirements

Pricing, licensing, and vendor considerations.

ID | Requirement | Description | Assessment criteria | Verification method | Typical priority
C1.1 | Transparent pricing | Published pricing or pricing model transparency | Full: public pricing, calculator available. Partial: pricing on request. None: undisclosed. | Review pricing page; request quote | Important
C1.2 | Nonprofit discount | Reduced pricing for registered nonprofits | Full: documented nonprofit programme with significant discount. Partial: case-by-case. None: standard pricing only. | Review nonprofit programme documentation | Important
C1.3 | Free tier or open source | Availability of no-cost option for evaluation or small deployments | Full: feature-complete open source or unlimited free tier. Partial: limited free tier. None: paid only. | Review licensing and free tier documentation | Important
C1.4 | Contract flexibility | Flexible contract terms (monthly, annual, multi-year) | Full: multiple term options without penalties. Partial: annual minimum. None: multi-year lock-in. | Review contract documentation; request terms | Desirable
C1.5 | Data portability | Ability to export all data if leaving the platform | Full: complete export, documented migration paths. Partial: limited export. None: vendor lock-in. | Review export and migration documentation | Essential

Comparison matrices

Comparison matrices use the following rating scale:

Symbol | Meaning
● | Full support as documented
◐ | Partial support with limitations (see notes)
○ | Minimal or basic support
✕ | Not supported
- | Not applicable
$ | Requires paid tier
E | Enterprise edition only
β | Beta or preview feature

Tool overview

Attribute | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Type | FOSS | FOSS | FOSS | FOSS | Commercial | Commercial
Licence | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary
Current version | 1.11.4 | 1.3.0 | 4.3.0 | 2.4.0 | 2025.08 | Continuous
First release | 2021 | 2020 | 2019 | 2015 | 2008 | 2020
Primary maintainer | Collate (commercial) | Acryl Data (commercial) | LF AI Foundation | Apache Foundation | Collibra Inc. | Microsoft
Managed service | Collate Cloud | DataHub Cloud | None | None | Collibra Cloud | Azure Purview
Deployment model | Self-hosted, SaaS | Self-hosted, SaaS | Self-hosted | Self-hosted | SaaS, self-hosted | SaaS

Functional capability matrix

Metadata ingestion

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Database connectors | ● (70+) | ● (60+) | ◐ (20+) | ◐ (15+) | ● (100+) | ● (90+)
BI tool connectors | | | | | |
Pipeline connectors | | | | | |
Custom connector SDK | | | | | |
Incremental ingestion | | | | | |
Schema change detection | | | | | |

Assessment notes:

  • OpenMetadata provides the broadest FOSS connector library with 70+ connectors documented in version 1.11.
  • DataHub’s connector count is comparable, with strong coverage across cloud warehouses and BI tools.
  • Amundsen connectors require more configuration; the “databuilder” library provides extraction but with less turnkey setup.
  • Apache Atlas connectors are primarily Hadoop-ecosystem focused (Hive, HBase, Kafka) with limited cloud coverage.
  • Commercial platforms offer the widest connector ranges but include proprietary systems less relevant to many mission-driven organisations.

Search and discovery

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Full-text search | | | | | |
Faceted filtering | | | | | |
Usage popularity signals | | | | | |
Saved searches | | | | | |
Natural language search | ●$ | ●$ | | | |
Cross-asset search | | | | | |

Assessment notes:

  • All platforms provide competent search; differentiation is in advanced features.
  • Natural language search requires AI features available in commercial tiers of OpenMetadata (Collate) and DataHub (Acryl).
  • Amundsen pioneered PageRank-style popularity ranking in FOSS catalogues.
  • Apache Atlas search is functional but the UI is dated compared to modern alternatives.

Data lineage

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Column-level lineage | | | | | |
Cross-system lineage | | | | | |
Lineage visualisation | | | | | |
Manual lineage editing | | | | | |
SQL parsing | | | | | |
dbt integration | | | | | |

Assessment notes:

  • OpenMetadata and DataHub provide comparable column-level lineage with automatic SQL parsing (a parsing sketch follows these notes).
  • Amundsen lineage is table-level by default; column-level requires custom implementation.
  • Apache Atlas lineage works well within Hadoop ecosystem but cross-system stitching requires manual effort.
  • Commercial platforms offer more sophisticated lineage impact analysis and business lineage layers.
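
To illustrate what SQL parsing for lineage (F3.5) involves, the sketch below uses the open source sqlglot parser to pull source tables out of a query containing a CTE. None of the assessed platforms necessarily uses sqlglot, and column-level lineage requires substantially deeper analysis than this table-level pass:

```python
# pip install sqlglot -- used here purely to illustrate the technique.
import sqlglot
from sqlglot import exp

sql = """
WITH recent AS (
    SELECT user_id, amount FROM raw.orders WHERE created_at > '2025-01-01'
)
SELECT u.region, SUM(r.amount) AS total
FROM recent r
JOIN raw.users u ON u.id = r.user_id
GROUP BY u.region
"""

tree = sqlglot.parse_one(sql)
# Physical source tables carry a schema qualifier; the CTE alias does not.
sources = sorted({f"{t.db}.{t.name}" for t in tree.find_all(exp.Table) if t.db})
print(sources)  # ['raw.orders', 'raw.users']
```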

Business glossary

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Glossary term management | | | | | |
Term-to-asset linking | | | | | |
Glossary import/export | | | | | |
Vocabulary enforcement | | | | | |
Multi-domain glossaries | | | | | |
Approval workflows | | ●$ | | | |

Assessment notes:

  • OpenMetadata’s glossary module is comprehensive with approval workflows in open source.
  • DataHub’s glossary workflows are available in open source but advanced governance in DataHub Cloud.
  • Amundsen lacks native glossary functionality; requires external glossary integration.
  • Collibra’s business glossary is industry-leading with extensive workflow capabilities.

Data classification

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Tag management | | | | | |
Automated classification | | ●$ | | | |
Classification propagation | | | | | |
Custom classification rules | | | | | |
Data domain organisation | | | | | |

Assessment notes:

  • OpenMetadata includes PII detection in the open source version.
  • DataHub’s advanced classification requires the paid cloud tier.
  • Amundsen supports tags but lacks automated classification capabilities.
  • Microsoft Purview’s classification integrates with Microsoft Information Protection labels.
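
The pattern-based tier of automated classification (F5.2, F5.4) amounts to running regular expressions or dictionaries over sampled values and labelling columns that exceed a hit-rate threshold. An illustrative sketch; real classifiers add ML scoring and per-label confidence:

```python
import re

# Illustrative patterns only; production classifiers combine patterns,
# dictionaries, and ML models with confidence thresholds.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def classify_column(samples: list[str], min_hit_rate: float = 0.8) -> list[str]:
    """Label a column with every PII type matching most sampled values."""
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.search(s)) for s in samples)
        if samples and hits / len(samples) >= min_hit_rate:
            labels.append(label)
    return labels

print(classify_column(["ana@example.org", "bo@example.org"]))  # ['email']
```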

Technical capability matrix

Deployment options

Option | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Docker | | | | | - | -
Kubernetes Helm | | | | | | -
Self-hosted | | | | | ●E |
Managed SaaS | ● (Collate) | ● (Acryl) | | | |
Air-gapped | | | | | ●E |

Assessment notes:

  • FOSS platforms provide full self-hosted deployment flexibility.
  • Collibra self-hosted requires enterprise licensing and is typically hybrid with cloud components.
  • Microsoft Purview is Azure-native with no self-hosted option.

Infrastructure requirements (self-hosted)

Component | OpenMetadata | DataHub | Amundsen | Apache Atlas
Metadata store | MySQL or PostgreSQL | MySQL or PostgreSQL | PostgreSQL or Neo4j | HBase or JanusGraph
Search engine | Elasticsearch or OpenSearch | Elasticsearch | Elasticsearch | Solr
Message queue | - | Kafka | - | Kafka
Minimum RAM | 8 GB | 16 GB | 8 GB | 16 GB
Minimum CPU | 4 cores | 4 cores | 4 cores | 4 cores

Assessment notes:

  • DataHub’s Kafka dependency adds infrastructure complexity but enables event streaming.
  • Apache Atlas’s HBase requirement makes it heavier than alternatives for small deployments.
  • OpenMetadata has the lightest footprint among feature-complete options.

API capabilities

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
REST API | | | | | |
GraphQL API | ◐β | | | | |
Python SDK | | | | | |
Java SDK | | | | | |
Event streaming | ●β | | | | |
Webhooks | | | | | |

Assessment notes:

  • DataHub has the most mature GraphQL API among FOSS options (an example query follows these notes).
  • OpenMetadata’s Python SDK follows the “ingestion framework” pattern and is well documented.
  • Amundsen APIs are functional but less comprehensive than newer platforms.
  • Apache Atlas integrates with Kafka for Atlas Notifications.
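
To make the API comparison concrete, the example below shows the shape of a GraphQL search query against a DataHub-style endpoint over plain HTTP. The endpoint path and field names follow DataHub's public documentation in spirit, but treat them as assumptions and verify against the GraphQL schema of the version you deploy:

```python
import requests

GRAPHQL_URL = "http://localhost:9002/api/graphql"  # DataHub frontend default

query = """
query findDatasets($text: String!) {
  search(input: {type: DATASET, query: $text, start: 0, count: 5}) {
    searchResults { entity { urn } }
  }
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={"query": query, "variables": {"text": "orders"}},
    headers={"Authorization": "Bearer <personal-access-token>"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["search"]["searchResults"]:
    print(result["entity"]["urn"])
```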

Security capability matrix

Authentication methods

Method | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
SAML 2.0 | | | | | |
OIDC | | | | | |
LDAP | | | | | | -
Local auth | | | | | | -
Service accounts | | | | | |

Access control

Capability | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
Role-based access | | | | | |
Attribute-based access | | | | | |
Asset-level permissions | | | | | |
Column-level masking | | | | | |

Assessment notes:

  • Apache Atlas has mature ABAC through integration with Apache Ranger.
  • DataHub’s policy framework supports attribute-based conditions.
  • OpenMetadata permissions are role-based with team-level scoping.
  • Commercial platforms offer finer-grained access control options.

Certifications and compliance

Certification | OpenMetadata | DataHub | Amundsen | Apache Atlas | Collibra | Microsoft Purview
SOC 2 Type II | ● (Collate) | ● (Acryl) | - | - | ● | ●
ISO 27001 | ● (Collate) | | - | - | ● | ●
GDPR features | | | | | |
HIPAA capable | ◐ (Collate) | ◐ (Acryl) | - | - | |
FedRAMP | | | - | - | |

Assessment notes:

  • Certifications apply to managed service offerings; self-hosted inherits customer infrastructure controls.
  • Open source platforms can achieve compliance in customer environments but lack turnkey certification.
  • FedRAMP certification is available only on commercial platforms for US government requirements.

Commercial comparison matrix

Pricing models

Platform | Model | Free tier | Entry point | Enterprise
OpenMetadata | Open source + SaaS | Full FOSS | Collate: contact for pricing | Collate Enterprise
DataHub | Open source + SaaS | Full FOSS | DataHub Cloud: contact for pricing | DataHub Cloud Enterprise
Amundsen | Open source | Full FOSS | Self-hosted only | No commercial offering
Apache Atlas | Open source | Full FOSS | Self-hosted only | No commercial offering
Collibra | SaaS subscription | None | Contact for pricing | Per-user + platform fee
Microsoft Purview | Azure consumption | Limited free | Pay-as-you-go from $0.10/asset | Enterprise Agreement

Assessment notes:

  • OpenMetadata and DataHub offer full-featured open source with optional managed services.
  • Amundsen and Apache Atlas have no commercial backing; support is community-based.
  • Collibra pricing is enterprise-grade; expect $100,000+ annually for meaningful deployments.
  • Microsoft Purview consumption pricing varies significantly based on asset count and scan frequency.

Nonprofit programmes

Platform | Programme | Discount | Requirements
OpenMetadata (Collate) | Contact sales | Case-by-case | Registered nonprofit
DataHub (Acryl) | Contact sales | Case-by-case | Registered nonprofit
Collibra | Collibra for Nonprofits | Undisclosed | 501(c)(3) or equivalent
Microsoft Purview | Microsoft Nonprofits | Up to 75% on Azure credits | Registered nonprofit via TechSoup

Assessment notes:

  • FOSS options require no discount; full functionality is free.
  • Microsoft nonprofit pricing is the most transparent, offered through the TechSoup/Microsoft Nonprofits programme.
  • Enterprise vendors negotiate nonprofit pricing case-by-case; budget for 30-50% of list price.

Individual tool assessments

OpenMetadata

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 1.11.4 (December 2025)
Repository | github.com/open-metadata/OpenMetadata
Documentation | docs.open-metadata.org
Commercial offering | Collate (managed service)

Overview

OpenMetadata is a unified metadata platform providing data discovery, data quality, observability, and governance through a central metadata repository. The project emerged from Uber’s Databook and was open-sourced in 2021 by Collate, a company founded by former Uber data infrastructure engineers. Development follows a rapid release cadence with major versions approximately every 6-8 weeks.

The architecture centres on a metadata repository storing entities (tables, databases, dashboards, pipelines) and relationships using a MySQL or PostgreSQL backend with Elasticsearch for search. The platform distinguishes itself through comprehensive FOSS functionality; features like data quality, lineage, and glossary workflows are available without commercial licensing. The commercial Collate offering adds AI-powered features, managed infrastructure, and enterprise support.

OpenMetadata’s connector philosophy emphasises “no-code” ingestion where metadata extraction runs via configuration rather than custom coding. The platform supports 70+ connectors spanning databases, data warehouses, BI tools, and pipeline orchestrators with consistent metadata models across sources.
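
A sketch of what configuration-driven ingestion looks like, expressed here as a Python dict mirroring the documented YAML workflow shape (source, sink, workflowConfig) consumed by the `metadata ingest` CLI. Field names should be verified against the connector documentation for your version; all values are placeholders:

```python
# Mirrors the structure of OpenMetadata's YAML workflow files, which are
# normally run with `metadata ingest -c workflow.yaml`.
workflow = {
    "source": {
        "type": "postgres",
        "serviceName": "analytics_pg",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "hostPort": "db.example.internal:5432",
                "username": "catalogue_reader",
                "authType": {"password": "<secret>"},
                "database": "analytics",
            }
        },
        "sourceConfig": {"config": {"type": "DatabaseMetadata"}},
    },
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "openmetadata",
            "securityConfig": {"jwtToken": "<token>"},
        }
    },
}
```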

Strengths

Comprehensive FOSS feature set: Unlike competitors that reserve governance workflows for commercial tiers, OpenMetadata includes glossary approval workflows, data quality rules, and role-based access control in the open source version. Organisations can implement full catalogue governance without licensing costs.

Modern, intuitive interface: The React-based UI provides responsive search, inline editing, and streamlined navigation. The interface design reflects contemporary SaaS standards rather than traditional enterprise software patterns, reducing training overhead.

Active development trajectory: The project’s rapid release cadence (version 1.10 in October 2025, 1.11 in December 2025) demonstrates ongoing investment. Feature parity with commercial alternatives has improved substantially in recent releases.

Lightweight deployment: Minimum requirements of 8 GB RAM and 4 CPU cores with MySQL/PostgreSQL and Elasticsearch make OpenMetadata deployable on modest infrastructure. The Docker Compose quickstart enables evaluation in under 10 minutes.

Limitations

Limited ABAC capabilities: Access control is primarily role-based with team scoping. Attribute-based policies (e.g., access based on classification level) require workarounds through team structures rather than native policy expressions.

Event streaming still maturing: While Kafka integration exists for change events, the streaming capabilities are less mature than DataHub’s event-driven architecture. Organisations requiring real-time metadata synchronisation should evaluate carefully.

Managed service geographic coverage: Collate Cloud regions are limited compared to hyperscaler-native options. Organisations with strict data residency requirements outside US and EU should verify regional availability.

No on-premises commercial support: The managed Collate service is cloud-only. Organisations requiring vendor support for on-premises deployments must rely on community support or third-party consultants.

Deployment considerations

Self-hosted requirements:

  • MySQL 8.0+ or PostgreSQL 12+ for metadata storage
  • Elasticsearch 7.x or OpenSearch 2.x for search
  • Airflow 2.x for scheduled ingestion (optional; can use standalone CLI)
  • 8 GB RAM minimum; 16 GB recommended for production
  • Helm chart available for Kubernetes deployment

Operational overhead: Moderate. Requires Elasticsearch cluster management and MySQL/PostgreSQL administration. Upgrade path is well-documented with database migration scripts.

Integration capabilities

Integration type | Coverage
Databases | PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, Databricks, Athena, Presto, Trino, Oracle, SQL Server, and 50+ others
BI tools | Tableau, Looker, Metabase, Superset, Power BI, Mode, Redash
Pipelines | Airflow, Dagster, dbt, Fivetran, NiFi, Flink
Storage | S3, GCS, ADLS
Messaging | Kafka

Organisational fit

Best suited for:

  • Organisations prioritising open source with no licensing dependency
  • Teams with PostgreSQL/MySQL and Elasticsearch operational expertise
  • Deployments requiring data quality and governance in a single platform
  • Environments where rapid feature evolution is valued over stability guarantees

Less suitable for:

  • Organisations requiring attribute-based access control policies
  • Deployments without container orchestration capabilities
  • Teams needing FedRAMP or similar government certifications
  • Environments requiring real-time streaming metadata updates

DataHub

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 1.3.0 (October 2025)
Repository | github.com/datahub-project/datahub
Documentation | datahubproject.io/docs
Commercial offering | DataHub Cloud (Acryl Data)

Overview

DataHub is an event-driven metadata platform originally developed at LinkedIn and open-sourced in 2020. The architecture fundamentally differs from competitors through its use of Kafka for metadata change events, enabling real-time streaming integrations and event-driven workflows. Acryl Data, founded by DataHub’s LinkedIn creators, provides the commercial DataHub Cloud service.

The platform models metadata as a graph with typed entities and relationships stored in a MySQL or PostgreSQL backend, graph views in Elasticsearch, and change streams via Kafka. This architecture supports the “metadata as a service” pattern where metadata changes propagate to downstream consumers in near-real-time.

DataHub reached version 1.0 in March 2025 after five years of development, signalling maturity and API stability commitments. The project maintains active development with quarterly minor releases and strong enterprise adoption (Netflix, Visa, Slack, Pinterest are documented users).

Strengths

Event-driven architecture: Kafka-based metadata change events enable real-time integrations, streaming analytics, and event-driven workflows. Organisations already operating Kafka infrastructure gain natural integration points.

Mature GraphQL API: DataHub’s GraphQL interface provides flexible, efficient queries for custom integrations. The API is well-documented with comprehensive schema coverage, making it suitable for building custom experiences.

Strong enterprise adoption: Documented deployments at scale (Netflix, Visa, Airtel) provide confidence in production readiness. The project benefits from contributions and bug reports from demanding environments.

Comprehensive SDK options: Both Python and Java SDKs receive active development, enabling programmatic metadata management in either ecosystem. The SDKs abstract API complexity while preserving flexibility.
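
A short example of the Python SDK's documented REST emitter pattern, upserting a single aspect onto a dataset. Module paths and class names follow the acryl-datahub package as documented, but verify them against the SDK version in use:

```python
# pip install acryl-datahub
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point at the GMS endpoint of a running DataHub instance.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = builder.make_dataset_urn(
    platform="postgres", name="analytics.public.orders", env="PROD"
)

# Upsert one aspect: human-readable properties for the dataset.
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(description="Orders fact table, loaded nightly."),
)
emitter.emit(mcp)
```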

Limitations

Infrastructure complexity: The Kafka dependency increases deployment complexity compared to non-streaming alternatives. Organisations without existing Kafka expertise face additional operational burden.

Higher resource requirements: Baseline deployment requires 16 GB RAM and includes multiple services (GMS, frontend, Kafka, Elasticsearch, MySQL). Small organisations may find the footprint disproportionate.

UI lags the API: Some capabilities are reachable only through the GraphQL API and are not exposed in the UI. Technical users are better served than business users in some workflows.

Commercial features for governance: Advanced governance features including automated classification and some workflow capabilities require DataHub Cloud licensing rather than the open source version.

Deployment considerations

Self-hosted requirements:

  • MySQL 5.7+ or PostgreSQL 12+ for metadata storage
  • Elasticsearch 7.x for search and graph views
  • Kafka 2.x for metadata change events
  • 16 GB RAM minimum; 32 GB recommended for production
  • Helm chart available; Docker Compose for evaluation

Operational overhead: High. Kafka cluster management adds significant operational complexity. Organisations should have Kafka operational expertise or consider the managed DataHub Cloud service.

Integration capabilities

Integration type | Coverage
Databases | Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, MySQL, Hive, Presto, Trino, Oracle, SQL Server, and 50+ others
BI tools | Tableau, Looker, Power BI, Superset, Metabase, Mode, Sigma
Pipelines | Airflow, dbt, Dagster, Prefect, Fivetran, Spark
Storage | S3, GCS, ADLS
Messaging | Kafka, Pulsar

Organisational fit

Best suited for:

  • Organisations with existing Kafka infrastructure and expertise
  • Deployments requiring real-time metadata streaming and events
  • Teams building custom metadata applications via API
  • Environments with strong engineering capacity for platform operation

Less suitable for:

  • Small organisations without dedicated data platform teams
  • Deployments prioritising operational simplicity over streaming capabilities
  • Teams primarily needing business glossary and governance workflows
  • Environments without Kafka expertise or willingness to acquire it

Amundsen

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 4.3.0 (July 2025)
Repository | github.com/amundsen-io/amundsen
Documentation | amundsen.io
Commercial offering | None

Overview

Amundsen is a data discovery and metadata engine developed at Lyft and open-sourced in 2019 under the LF AI & Data Foundation. The project pioneered PageRank-style search ranking based on usage patterns, surfacing frequently queried tables above less-used alternatives. Amundsen follows a microservices architecture with separate frontend, search, and metadata services.

Development pace has slowed compared to OpenMetadata and DataHub, with community contributions driving most recent changes. The project lacks commercial backing, meaning support relies entirely on community resources. However, the architecture’s modularity enables organisations to adopt components selectively and integrate with existing infrastructure.

Amundsen’s primary strength lies in discovery; the platform excels at helping users find relevant data through intelligent search ranking. Governance capabilities (glossary, classification, quality) are minimal compared to newer alternatives.

Strengths

Proven discovery algorithms: Amundsen’s search ranking incorporates usage signals effectively, surfacing popular and frequently-queried tables. Organisations with large table counts benefit from intelligent relevance ranking.

Modular architecture: Separate services for frontend, search, and metadata enable selective adoption and integration with existing systems. Organisations can replace individual components (e.g., swap Neo4j for Neptune) without full platform replacement.

Lightweight for discovery: For organisations primarily needing data discovery without governance workflows, Amundsen provides focused functionality without feature bloat.

LF AI Foundation governance: Foundation membership provides neutral governance and reduces single-vendor dependency risks, though it also limits commercial investment.

Limitations

Minimal governance features: No native glossary, approval workflows, or automated classification. Organisations requiring governance workflows must integrate external tools or choose alternative platforms.

Slower development pace: Release frequency and feature additions lag behind commercially-backed alternatives. Major capability gaps may persist longer than with OpenMetadata or DataHub.

Table-level lineage only: Native lineage is table-level; column-level lineage requires custom implementation. This limitation is significant for impact analysis use cases.

No managed service option: Organisations must self-host with no vendor support option. Community support via Slack is available but response times and depth vary.

Legacy technology choices: The supported Python 3.8-3.10 and Node.js 10-12 ranges are dated. Dependency updates may require careful testing.

Deployment considerations

Self-hosted requirements:

  • PostgreSQL or Neo4j for metadata storage
  • Elasticsearch for search
  • Python 3.8-3.10, Node.js 10-12
  • 8 GB RAM minimum
  • Docker Compose available; Helm charts community-maintained

Operational overhead: Moderate. Simpler than DataHub (no Kafka) but requires Neo4j expertise if using graph backend. Documentation assumes significant user self-sufficiency.

Integration capabilities

Integration type | Coverage
Databases | Hive, Redshift, PostgreSQL, Snowflake, BigQuery, Athena, Presto, MySQL (via databuilder extractors; the pattern is sketched below)
BI tools | Tableau, Superset, Mode (limited)
Pipelines | Airflow (via DAG extractors)
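
Amundsen's databuilder library wires extract, load, and publish stages into batch jobs. The sketch below illustrates only that pattern; the class and method names are illustrative stand-ins, not databuilder's actual API, which is documented in the amundsen-io repositories:

```python
# Illustrative stand-ins for databuilder's extractor -> loader -> publisher
# pipeline; not the library's real classes.
class Extractor:
    def extract(self):
        # A real extractor would query a source system's information schema.
        yield {"schema": "public", "table": "orders", "columns": ["id", "amount"]}

class Loader:
    def load(self, record):
        # A real loader stages records as CSV files for the publisher.
        print("staging record:", record)

class Publisher:
    def publish(self):
        # A real publisher pushes staged records to Neo4j and Elasticsearch.
        print("publishing staged records to the metadata and search backends")

def run_job(extractor: Extractor, loader: Loader, publisher: Publisher) -> None:
    for record in extractor.extract():
        loader.load(record)
    publisher.publish()

run_job(Extractor(), Loader(), Publisher())
```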

Organisational fit

Best suited for:

  • Organisations primarily needing data discovery without governance
  • Environments with Neo4j or graph database expertise
  • Teams comfortable with significant self-service and customisation
  • Deployments prioritising simplicity over comprehensive features

Less suitable for:

  • Organisations requiring business glossary and governance workflows
  • Teams needing column-level lineage and impact analysis
  • Deployments without engineering capacity for custom integration work
  • Environments expecting vendor or commercial support

Apache Atlas

Attribute | Value
Type | Open source
Licence | Apache 2.0
Current version | 2.4.0 (January 2025)
Repository | github.com/apache/atlas
Documentation | atlas.apache.org
Commercial offering | None (Hadoop vendor distributions include Atlas)

Overview

Apache Atlas is the original open source data governance and metadata management framework for the Hadoop ecosystem, first released in 2015 as an Apache incubator project. The platform provides metadata services, classification, and lineage tracking with deep integration into Hadoop components including Hive, HBase, Kafka, and Sqoop.

Atlas architecture uses JanusGraph or HBase for metadata storage and Solr for search, reflecting its Hadoop-native heritage. The platform excels within Hadoop environments but shows its age when applied to modern cloud data warehouses and SaaS tools. Integration with Apache Ranger provides attribute-based access control enforcement.

Development continues under the Apache Foundation with moderate community activity. Atlas 2.4.0 in January 2025 demonstrates ongoing maintenance, though feature velocity is lower than commercially-backed alternatives.

Strengths

Hadoop ecosystem integration: Native hooks for Hive, HBase, Kafka, Sqoop, Storm, and Falcon provide automatic lineage capture within Hadoop environments. Organisations with significant Hadoop investment benefit from seamless integration.

Apache Ranger integration: Combined with Ranger, Atlas enables attribute-based access control where classifications and tags drive data access policies. This integration is unique among open source catalogues.

Mature and stable: Nearly a decade of production use provides confidence in stability for core use cases. The type system and API are well-established.
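
As an illustration of that established API, the sketch below issues a basic search against Atlas's v2 REST endpoint. The path and payload follow the Atlas REST documentation, but verify them, along with the default HTTP basic-auth setup, against your deployment:

```python
import requests

ATLAS = "http://atlas.example.internal:21000"  # Atlas's default web port

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/search/basic",
    json={"typeName": "hive_table", "query": "orders", "limit": 10},
    auth=("admin", "<password>"),
    timeout=30,
)
resp.raise_for_status()
# Basic search returns matching entities with their GUIDs and attributes.
for entity in resp.json().get("entities", []):
    print(entity["guid"], entity["attributes"].get("qualifiedName"))
```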

Foundation governance: Apache Foundation stewardship ensures neutral governance and long-term project continuity independent of commercial interests.

Limitations

Dated user interface: The web UI reflects 2015-era design patterns. User experience lags significantly behind modern catalogue interfaces, increasing training requirements and reducing adoption.

Hadoop-centric architecture: HBase or JanusGraph requirements assume Hadoop-style infrastructure. Organisations without existing Hadoop infrastructure face significant deployment overhead.

Limited cloud warehouse support: Connectors for Snowflake, BigQuery, and Databricks are community-contributed with varying quality. Cloud-native organisations will find gaps.

Minimal BI and pipeline coverage: Dashboard and pipeline metadata support is limited compared to newer platforms. Comprehensive cataloguing requires supplemental tools.

No managed service: Self-hosted only with no commercial support option. Hadoop distribution vendors (Cloudera, including the legacy Hortonworks line) bundle Atlas, but standalone enterprise support is unavailable.

Deployment considerations

Self-hosted requirements:

  • HBase or JanusGraph for metadata graph storage
  • Solr for search
  • Kafka for hook messaging
  • Zookeeper for coordination
  • 16 GB RAM minimum; 32 GB recommended
  • Java 8+ runtime

Operational overhead: High. Requires HBase/JanusGraph operational expertise plus Solr and Kafka management. Deployment complexity exceeds all other options in this category.

Integration capabilities

Integration type | Coverage
Databases | Hive, HBase, Oracle, SQL Server, MySQL, PostgreSQL, Cassandra, Couchbase
BI tools | Limited (custom integration required)
Pipelines | Sqoop, Storm, Falcon, Spark (via hooks)
Messaging | Kafka

Organisational fit

Best suited for:

  • Organisations with significant Hadoop/HBase infrastructure investment
  • Environments requiring Ranger-based access control integration
  • Deployments where stability and maturity outweigh UI modernisation needs
  • Teams with Java/Hadoop operational expertise

Less suitable for:

  • Cloud-native organisations without Hadoop infrastructure
  • Teams prioritising user experience and adoption
  • Deployments requiring modern BI tool and pipeline integration
  • Organisations without Java/Hadoop operational capabilities

Collibra

Attribute | Value
Type | Commercial
Pricing model | Subscription (per-user + platform)
Current version | 2025.08 (continuous release)
Documentation | productresources.collibra.com
API documentation | developer.collibra.com
Deployment | SaaS (primary), on-premises (enterprise)

Overview

Collibra Data Intelligence Platform is an enterprise data governance and catalogue solution founded in 2008, making it one of the longest-established vendors in the category. The platform emphasises business-user accessibility, governance workflows, and data stewardship alongside technical metadata management.

Collibra’s architecture centres on a knowledge graph storing business and technical metadata with extensive workflow automation for governance processes. The platform targets enterprise buyers with comprehensive feature sets, professional services, and global support infrastructure.

The product follows a continuous release model with monthly updates. Collibra’s market position is enterprise-focused with pricing reflecting that segment; most deployments exceed $100,000 annually.
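
Assets in that knowledge graph are reachable through the Core REST API documented at developer.collibra.com. A hedged sketch of listing assets by name; the resource path and parameter names are assumptions to check against your instance's API version:

```python
import requests

BASE = "https://your-instance.collibra.com"

session = requests.Session()
session.auth = ("svc_catalogue", "<password>")  # or a bearer token header

resp = session.get(
    f"{BASE}/rest/2.0/assets",
    params={"name": "orders", "limit": 10},
    timeout=30,
)
resp.raise_for_status()
# Paged responses wrap matches in a results list.
for asset in resp.json().get("results", []):
    print(asset["id"], asset["name"])
```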

Strengths

Industry-leading business glossary: Collibra’s business glossary and stewardship capabilities are best-in-class. Approval workflows, term relationships, and certification processes are more sophisticated than alternatives.

Extensive governance workflows: Configurable workflows for data certification, access requests, issue management, and stewardship tasks provide enterprise-grade governance automation.

Broadest connector library: 100+ connectors covering databases, BI tools, cloud platforms, and enterprise applications. Most organisations find pre-built connectors for their stack.

Professional services ecosystem: Global system integrator partnerships, professional services, and training programmes support enterprise deployments. Organisations with limited internal data management expertise benefit from implementation support.

Limitations

Enterprise pricing: Entry-level deployments start at $100,000+ annually with costs scaling significantly for larger user counts and data volumes. Budget-constrained organisations will find Collibra inaccessible.

Complexity: Feature breadth creates complexity. Implementation timelines of 6-12 months are common for enterprise deployments with extensive configuration requirements.

SaaS preference: While on-premises deployment exists, Collibra strongly prefers cloud deployment. Self-hosted customers may experience feature delays and reduced support priority.

Vendor lock-in concerns: Proprietary data models and workflows create switching costs. Data export capabilities exist but migration to alternatives requires significant effort.

Deployment considerations

SaaS deployment:

  • Multi-tenant cloud hosted by Collibra
  • Regional options (US, EU, APAC)
  • SOC 2 Type II, ISO 27001 certified
  • 99.9% uptime SLA

Self-hosted (enterprise):

  • Kubernetes-based deployment
  • Customer-managed infrastructure
  • Collibra Edge for hybrid connectivity
  • Requires enterprise licensing tier

Operational overhead: Low for SaaS (vendor-managed). Self-hosted requires dedicated infrastructure team and Collibra-specific expertise.

Integration capabilities

Integration type | Coverage
Databases | Snowflake, BigQuery, Redshift, Databricks, Azure Synapse, Oracle, SQL Server, PostgreSQL, MySQL, Teradata, and 70+ others
BI tools | Tableau, Power BI, Looker, Qlik, MicroStrategy, SAP BusinessObjects
Pipelines | Informatica, Talend, dbt, Airflow, Azure Data Factory
ERP/CRM | SAP, Salesforce, Workday
Cloud platforms | AWS, Azure, GCP native services

Organisational fit

Best suited for:

  • Large enterprises with substantial data governance budgets
  • Organisations prioritising business glossary and stewardship workflows
  • Deployments requiring professional services and implementation support
  • Environments needing broadest connector coverage

Less suitable for:

  • Budget-constrained organisations under $100,000 annual budget
  • Small organisations without dedicated data governance teams
  • Deployments prioritising self-service implementation
  • Technical teams preferring open source foundations

Microsoft Purview

Attribute | Value
Type | Commercial
Pricing model | Azure consumption-based
Current version | Continuous release (Unified Catalog GA 2025)
Documentation | learn.microsoft.com/purview
API documentation | learn.microsoft.com/rest/api/purview
Deployment | Azure SaaS only

Overview

Microsoft Purview is Microsoft’s unified data governance service combining data cataloguing, classification, and compliance capabilities within the Azure ecosystem. The platform evolved from Azure Purview (2020) with significant expansion in 2024-2025 to become Microsoft Purview with broader scope including data security, risk, and compliance features.

Purview’s architecture integrates with Microsoft’s Data Map for metadata storage, Microsoft Graph for relationships, and Azure services for compute. The platform leverages Microsoft’s information protection labels enabling unified classification across Microsoft 365, Azure data services, and third-party sources.

Purview uses consumption-based pricing where costs scale with asset count, scan frequency, and data classification volume. This model suits variable workloads but requires monitoring to avoid unexpected costs.

Strengths

Microsoft ecosystem integration: Native integration with Azure Synapse, Azure SQL, Power BI, Microsoft 365, and Fabric provides seamless metadata capture for Microsoft-centric environments.

Unified classification: Integration with Microsoft Information Protection enables consistent sensitivity labels across data catalogue, SharePoint, Teams, and email. Organisations already using Microsoft classification benefit from extension to data assets.

Consumption-based pricing: Pay-per-use model enables starting small and scaling with data volume. Organisations uncertain about scope can begin with limited assets and expand.

Data security integration: Purview combines cataloguing with data loss prevention, insider risk management, and compliance features unavailable in pure catalogue products.

Limitations

Azure lock-in: Purview is Azure-native with no self-hosted or alternative cloud option. Organisations avoiding Azure dependency cannot use Purview.

Multi-cloud limitations: While Purview scans non-Azure sources (AWS, GCP), integration depth is inferior to Azure-native sources. Multi-cloud organisations may find inconsistent capabilities.

Complex pricing: Consumption pricing across multiple meters (data map, scans, insights) requires careful monitoring. Organisations report difficulty predicting costs.

Feature maturity: The Unified Catalog (GA November 2025) is newer than alternatives. Some features remain in preview, with implications for stability and completeness.

API maturity: APIs are evolving with some capabilities in public preview. Organisations requiring stable programmatic access should evaluate current coverage against requirements.

Deployment considerations

Deployment:

  • Azure subscription required
  • Single-tenant instance per Azure tenant
  • Automatically regional based on Azure subscription
  • No self-hosted option

Operational overhead: Low for Azure-native deployments. Microsoft manages infrastructure. Scanning configuration and classification rules require governance team attention.
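
As an illustration of that scanning surface, the sketch below triggers a Purview scan run through the scanning REST API. This is a minimal sketch, not a definitive implementation: the account, data source, and scan names are placeholders, and the api-version shown may have been superseded; verify the path and version against the current API reference before use.

```python
# Minimal sketch: trigger a Microsoft Purview scan run via the scanning REST API.
# Assumptions: azure-identity and requests are installed; ACCOUNT, DATA_SOURCE,
# and SCAN are hypothetical names; the api-version may be outdated.
import uuid

import requests
from azure.identity import DefaultAzureCredential

ACCOUNT = "example-purview"          # hypothetical Purview account name
DATA_SOURCE = "AzureSqlDatabase-1"   # hypothetical registered data source
SCAN = "Scan-Weekly"                 # hypothetical scan definition

# Acquire a token for the Purview data plane.
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default")

run_id = str(uuid.uuid4())
url = (
    f"https://{ACCOUNT}.purview.azure.com/scan"
    f"/datasources/{DATA_SOURCE}/scans/{SCAN}/runs/{run_id}"
)
resp = requests.put(
    url,
    params={"api-version": "2022-02-01-preview"},  # check for a newer GA version
    headers={"Authorization": f"Bearer {token.token}"},
)
resp.raise_for_status()
print(resp.json())  # scan run metadata, including status
```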

Integration capabilities

| Integration type | Coverage |
| --- | --- |
| Azure services | Azure SQL, Synapse, Data Lake, Blob Storage, Databricks, Cosmos DB, Fabric (native) |
| AWS | S3, RDS, Redshift, Glue (via scanner) |
| GCP | BigQuery, Cloud Storage (via scanner) |
| Databases | SQL Server, Oracle, PostgreSQL, MySQL, SAP HANA, Teradata, Snowflake |
| BI tools | Power BI (native), Tableau, Looker (limited) |
| Pipelines | Azure Data Factory (native), dbt |

Organisational fit

Best suited for:

  • Organisations committed to Microsoft Azure ecosystem
  • Deployments requiring unified data and document classification
  • Environments already using Microsoft Information Protection
  • Teams preferring consumption pricing over committed spend

Less suitable for:

  • Organisations avoiding cloud vendor lock-in
  • Multi-cloud deployments with significant non-Azure workloads
  • Teams requiring predictable fixed costs
  • Environments needing self-hosted deployment options

Selection guidance

Decision framework

What is your primary deployment preference?

  • Self-hosted (FOSS preferred)
      • Existing Kafka infrastructure → DataHub
      • No Kafka → OpenMetadata
  • Managed service
      • Annual spend above $100,000 → Collibra or DataHub Cloud
      • Annual spend below $100,000 → OpenMetadata (Collate) or DataHub Cloud
  • Azure-native required → Microsoft Purview

Recommendations by context

Organisations with minimal IT capacity

Recommended: OpenMetadata with Collate Cloud or DataHub with DataHub Cloud

Managed services eliminate infrastructure operational burden while providing full catalogue functionality. Both offer straightforward onboarding with guided setup wizards. Collate and DataHub Cloud pricing is negotiable for smaller organisations; request nonprofit or startup pricing.

Alternative: Microsoft Purview (if Azure-committed)

For organisations already invested in Azure, Purview’s consumption model enables starting small. Native Microsoft integrations reduce connector configuration effort.

Avoid: Self-hosted deployments requiring Kafka, HBase, or complex orchestration. Apache Atlas and self-hosted DataHub require infrastructure expertise unavailable in minimal IT contexts.

Organisations with established IT capacity

Recommended: OpenMetadata (self-hosted) or DataHub (self-hosted)

Self-hosted FOSS deployments provide maximum control, no licensing costs, and full feature access. OpenMetadata offers lighter infrastructure requirements; DataHub suits organisations with existing Kafka expertise.

Selection criteria:

  • Choose OpenMetadata if Kafka infrastructure is unavailable and operational simplicity is valued
  • Choose DataHub if real-time metadata streaming and event-driven architecture are requirements
  • Choose DataHub if GraphQL API access is a priority for custom integrations (see the sketch below)
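
A minimal sketch of that GraphQL access is shown below, assuming a default self-hosted DataHub deployment at localhost:8080 and a personal access token if authentication is enabled; the field names follow the public schema but should be checked against the current API documentation.

```python
# Minimal sketch: search datasets through DataHub's GraphQL API.
# Assumptions: default self-hosted endpoint; the token placeholder must be replaced.
import requests

QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        ... on Dataset { name }
      }
    }
  }
}
"""

resp = requests.post(
    "http://localhost:8080/api/graphql",
    json={
        "query": QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": "orders", "start": 0, "count": 5}
        },
    },
    headers={"Authorization": "Bearer <personal-access-token>"},  # if auth enabled
)
resp.raise_for_status()
print(resp.json()["data"]["search"]["total"])  # number of matching datasets
```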

Alternative: Collibra (if budget allows)

Organisations with governance maturity requiring sophisticated stewardship workflows and business glossary management may justify Collibra investment. Evaluate whether FOSS alternatives meet workflow requirements before committing enterprise spend.

Organisations with Hadoop infrastructure

Recommended: Apache Atlas (if governance integration with Ranger is required) or DataHub (for modernisation path)

Apache Atlas integrates natively with Hadoop ecosystem and Apache Ranger for access control. Organisations with significant HBase and Hive investment benefit from seamless lineage capture.

DataHub provides a modernisation path, ingesting Hive and Hadoop metadata while offering a superior UI and broader connector support for hybrid environments.

Organisations prioritising data sovereignty

Recommended: OpenMetadata or DataHub (self-hosted)

Self-hosted FOSS deployments keep all metadata on organisation-controlled infrastructure with no data transmission to external services. Both platforms support air-gapped deployment for high-security environments.

Avoid: SaaS deployments where data residency cannot be guaranteed or verified. Evaluate managed service data processing locations carefully if considering cloud options.

Migration paths

| From | To | Complexity | Approach | Timeline |
| --- | --- | --- | --- | --- |
| Amundsen | OpenMetadata | Medium | Export via Amundsen API; import using OpenMetadata bulk loader | 2-4 weeks |
| Amundsen | DataHub | Medium | Export via Amundsen API; import using DataHub Python SDK | 2-4 weeks |
| Apache Atlas | OpenMetadata | Medium | OpenMetadata provides Atlas connector for metadata import | 2-4 weeks |
| Apache Atlas | DataHub | Medium | DataHub provides Atlas source connector | 2-4 weeks |
| OpenMetadata | DataHub | Low-Medium | Export via OpenMetadata API; transform to DataHub model | 2-3 weeks |
| DataHub | OpenMetadata | Low-Medium | Export via DataHub API; transform to OpenMetadata model | 2-3 weeks |
| Any FOSS | Collibra | High | Collibra professional services engagement typical; custom migration scripts | 2-4 months |
| Collibra | Any FOSS | High | Export via Collibra API; significant model transformation required | 2-4 months |
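
To illustrate the import side of these migrations, the sketch below re-emits one exported record into DataHub using the DataHub Python SDK (the acryl-datahub package). The exported record shape and GMS endpoint are placeholder assumptions; a real migration would map the source catalogue's full model, not a single description field.

```python
# Minimal sketch: push one migrated dataset record into DataHub via its Python SDK.
# Assumptions: acryl-datahub is installed; the exported dict shape is hypothetical.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS

# One record exported from the source catalogue (hypothetical shape).
exported = {"platform": "hive", "name": "sales.orders", "description": "Daily orders"}

mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(
        platform=exported["platform"], name=exported["name"], env="PROD"
    ),
    aspect=DatasetPropertiesClass(description=exported["description"]),
)
emitter.emit(mcp)  # writes the dataset properties aspect to DataHub
```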

Resources and references

Official documentation

Open source platforms

| Tool | Documentation | API reference | GitHub repository |
| --- | --- | --- | --- |
| OpenMetadata | docs.open-metadata.org | docs.open-metadata.org/swagger | github.com/open-metadata/OpenMetadata |
| DataHub | datahubproject.io/docs | datahubproject.io/docs/api | github.com/datahub-project/datahub |
| Amundsen | amundsen.io | github.com/amundsen-io/amundsen | github.com/amundsen-io/amundsen |
| Apache Atlas | atlas.apache.org | atlas.apache.org/api | github.com/apache/atlas |

Commercial platforms

| Tool | Documentation | API reference | Developer portal |
| --- | --- | --- | --- |
| Collibra | productresources.collibra.com | developer.collibra.com/api | developer.collibra.com |
| Microsoft Purview | learn.microsoft.com/purview | learn.microsoft.com/rest/api/purview | learn.microsoft.com/purview/developer |

Relevant standards

| Standard | Description | URL |
| --- | --- | --- |
| Open Metadata and Governance (OMAG) | Egeria project metadata interoperability standards | egeria-project.org |
| Apache Atlas REST API | De facto standard for metadata exchange in Hadoop ecosystems | atlas.apache.org/api/v2 |
| W3C DCAT | Data Catalog Vocabulary for describing datasets | w3.org/TR/vocab-dcat-3 |
| ISO 11179 | Metadata registries standard | iso.org/standard/78916.html |
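
For orientation on W3C DCAT, the sketch below builds a small DCAT dataset description with the rdflib library; the dataset URI and property values are illustrative only.

```python
# Minimal sketch: describe a dataset with W3C DCAT using rdflib.
# Assumptions: rdflib is installed; the URI and literals are illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
dataset = URIRef("https://example.org/dataset/orders")  # hypothetical dataset URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Orders")))
g.add((dataset, DCTERMS.description, Literal("Daily order transactions")))
g.add((dataset, DCAT.keyword, Literal("sales")))

print(g.serialize(format="turtle"))  # emits the description as Turtle
```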

See also

Internal documentation relevant to data catalogue selection and implementation: