Master Data Management
Master data management (MDM) creates and maintains authoritative, consistent versions of core business entities that span multiple operational systems. For mission-driven organisations, these entities include beneficiaries, geographic locations, implementing partners, donors, projects, and organisational units. Without MDM, the same beneficiary appears in the case management system, the distribution platform, and the M&E database as three separate records with inconsistent names, identifiers, and demographic data. MDM resolves this fragmentation by establishing a single authoritative record and synchronising it across systems.
- Master data: Core business entities that are referenced across multiple systems and transactions. Master data changes infrequently compared to transactional data and provides context for transactions. A beneficiary record is master data; a distribution event recording that beneficiary receiving assistance is transactional data.
- Golden record: The authoritative, trusted version of a master data entity created by consolidating and reconciling data from multiple sources. The golden record represents the organisation’s single version of truth for that entity.
- Survivorship: Rules that determine which attribute values survive into the golden record when multiple source systems provide conflicting values for the same attribute. Survivorship rules encode trust hierarchies among systems.
- Match: The determination that two or more records represent the same real-world entity despite differences in how they are recorded. Matching uses deterministic rules, probabilistic scoring, or machine learning.
- Merge: The combination of matched records into a single golden record by applying survivorship rules to resolve attribute conflicts.
- Hub: The central repository that stores golden records and manages their relationships. The hub may be a dedicated MDM platform or a designated system of record.
Master data domains
Master data domains are categories of entities that share governance requirements and lifecycle characteristics. Each domain requires specific matching logic, survivorship rules, and stewardship practices suited to its nature.
Person domains encompass beneficiaries, staff, volunteers, and contacts. Beneficiary data presents the greatest complexity due to name transliteration variations, absent formal identification, and the ethical imperative to avoid re-registration burden. A beneficiary named “محمد أحمد” in Arabic may appear as “Mohammed Ahmed”, “Muhammad Ahmad”, or “Mohamed Ahmet” across systems depending on transliteration conventions and data entry variations. Matching must account for these variations while avoiding false positives that would merge distinct individuals.
Organisation domains include implementing partners, donors, government entities, and vendors. Organisations present challenges around hierarchies (a country office versus its headquarters), name variations (acronyms versus full names), and lifecycle events (mergers, name changes). An organisation recorded as “Save the Children UK” in one system and “SC UK” in another requires matching rules that recognise common abbreviation patterns.
Location domains cover administrative boundaries, facilities, distribution points, and project sites. Location master data must align with authoritative geographic references such as Common Operational Datasets (CODs) for humanitarian contexts. A village recorded with local spelling variations must resolve to the correct P-code in the national gazetteer.
Reference domains include currencies, languages, sectors, and other classification schemes. These domains have simpler structures but require alignment with external standards (ISO currency codes, cluster classifications) and consistent application across systems.
Project domains encompass programmes, projects, grants, and activities. Project hierarchies vary by donor and internal structure, requiring flexible modelling that supports multiple reporting views.
| Domain | Example entities | Primary challenges | External standards |
|---|---|---|---|
| Person | Beneficiaries, staff, contacts | Name variations, absent ID, duplicates | None universal |
| Organisation | Partners, donors, vendors | Hierarchies, abbreviations, mergers | IATI organisation IDs |
| Location | Admin boundaries, facilities | Local naming, boundary changes | P-codes, ISO 3166 |
| Reference | Currencies, sectors | Standard alignment | ISO 4217, cluster codes |
| Project | Programmes, grants | Hierarchy variations | IATI activity IDs |
MDM architectural styles
Four architectural styles govern how master data flows between the MDM hub and operational systems. The choice depends on existing system landscapes, technical capacity, and tolerance for implementation complexity.
Registry style
The registry style maintains cross-references between records across systems without storing master data attributes centrally. Each system retains its own version of the data; the registry stores only identifiers and links. When the case management system needs to know that its beneficiary record 4523 corresponds to distribution system record BEN-7891, the registry provides that mapping.
Figure 1: Registry style maintains cross-references without centralised attributes. The diagram shows Case Management (beneficiary ID 4523, name “M. Ahmed”, DOB 1985-03-15), Distribution (beneficiary ID BEN-7891, name “Mohamed”, DOB 1985-03), and the M&E Platform (participant ID P-00234, name “Mohammed A”), each connected to the MDM registry, which holds only the cross-references 4523 <-> BEN-7891 and 4523 <-> P-00234.
The registry style minimises implementation complexity and respects system autonomy. Each system continues to manage its own data without modification. The registry adds value through linkage, enabling queries that span systems (which beneficiaries in the case management system also received distributions?). The limitation is that data inconsistencies persist across systems. The case management system records date of birth as “1985-03-15” while the distribution system stores only “1985-03”, and both values remain in use. The registry cannot resolve this conflict because it stores no attributes.
Registry style suits organisations with established systems that cannot be modified, limited technical capacity for integration, or political constraints where system owners refuse to accept externally managed data. Implementation typically requires 2-4 weeks for basic cross-referencing, extending to 2-3 months with automated matching.
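The cross-reference lookup described above can be illustrated with a minimal in-memory sketch. The class name `CrossReferenceRegistry`, its methods, and the system labels are hypothetical choices for illustration; a real registry would persist links in a database and record who asserted each link.

```python
from collections import defaultdict


class CrossReferenceRegistry:
    """Stores only identifiers and links, never master data attributes."""

    def __init__(self):
        self._entity_of = {}              # (system, local_id) -> registry key
        self._members = defaultdict(set)  # registry key -> {(system, local_id)}
        self._next_key = 1

    def link(self, a, b):
        """Record that two (system, local_id) pairs refer to the same entity."""
        key_a, key_b = self._entity_of.get(a), self._entity_of.get(b)
        if key_a is None and key_b is None:
            key = self._next_key
            self._next_key += 1
        elif key_a is not None and key_b is not None and key_a != key_b:
            # Both records already belong to groups: fold b's group into a's.
            for ref in self._members.pop(key_b):
                self._entity_of[ref] = key_a
                self._members[key_a].add(ref)
            key = key_a
        else:
            key = key_a if key_a is not None else key_b
        for ref in (a, b):
            self._entity_of[ref] = key
            self._members[key].add(ref)

    def lookup(self, system, local_id, target_system):
        """Return the IDs in target_system linked to the given record."""
        key = self._entity_of.get((system, local_id))
        if key is None:
            return []
        return sorted(i for s, i in self._members[key] if s == target_system)


registry = CrossReferenceRegistry()
registry.link(("case_mgmt", "4523"), ("distribution", "BEN-7891"))
registry.link(("case_mgmt", "4523"), ("me_platform", "P-00234"))
print(registry.lookup("distribution", "BEN-7891", "me_platform"))
```

Because the registry holds no names or dates of birth, it can answer "which records are the same entity?" but not "which date of birth is right?", which is the limitation noted above.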
Consolidation style
The consolidation style copies data from source systems into the MDM hub, matches records, creates golden records, but does not write data back to source systems. The hub becomes the authoritative source for analytical and reporting purposes while operational systems continue unchanged.
Figure 2: Consolidation style creates golden records for analytics without modifying sources. The diagram shows Case Mgmt and Distribution (both sources) extracted into the MDM hub staging area (Case: 4523, M. Ahmed, 1985-03-15, Aleppo; Dist: BEN-7891, Mohamed, 1985-03, Halab). Match and merge then produce golden record GR-001 (Mohammed Ahmed, 1985-03-15, Aleppo; sources 4523 and BEN-7891), which feeds reporting and analytics.
The consolidation style delivers the analytical benefits of consistent master data without requiring integration work on operational systems. Reporting dashboards, donor reports, and programme analytics draw from golden records with consistent entity identification. The limitation is that operational staff continue working with inconsistent data. A case worker sees “M. Ahmed” while the distribution team sees “Mohamed”, and neither benefits from the consolidated view.
Consolidation suits organisations prioritising reporting accuracy, those unable to modify operational systems, or those seeking incremental MDM adoption. Consolidated views can demonstrate value before attempting full integration. Implementation requires 1-3 months for a single domain, extending to 6-12 months for comprehensive coverage across domains.
Coexistence style
The coexistence style extends consolidation by writing golden record attributes back to source systems. After matching and merging creates a golden record, changed attributes propagate to sources. Systems retain their own records but receive updates from the hub.
Figure 3: Coexistence style synchronises golden record updates to source systems. The diagram shows Case Mgmt and Distribution extracted into the MDM hub, where data passes from staging through match and merge into golden records. A publish-updates step then sends changes back to both Case Mgmt and Distribution, each of which ends with the standardised value (“Updated: Ahmed”).
Coexistence improves data consistency across operational systems while respecting their autonomy. Each system continues to create and manage its own records, but receives corrections and standardisations from the hub. The complexity lies in bidirectional synchronisation: changes in source systems must flow to the hub, and golden record updates must flow back without creating loops or overwriting legitimate local changes.
Conflict resolution becomes critical. If a case worker corrects a beneficiary’s phone number in the case management system while the hub simultaneously pushes a standardised name spelling, both changes must survive. Coexistence implementations require clear rules about which attributes flow in which direction, typically reserving certain attributes (name, demographics) for hub management while leaving operational attributes (case status, distribution history) under system control.
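One way to express such direction rules is an attribute ownership map consulted on every update. The sketch below is illustrative only: the `ATTRIBUTE_OWNER` map, the function names, and the decision to treat `phone` as source-owned are assumptions, not rules prescribed by any MDM product.

```python
# Hypothetical ownership map: which side is allowed to write each attribute.
ATTRIBUTE_OWNER = {
    "family_name": "hub",
    "given_name": "hub",
    "birth_date": "hub",
    "phone": "source",        # assumption: corrected locally by case workers
    "case_status": "source",
}


def apply_hub_update(local_record, hub_update):
    """Copy only hub-owned attributes from hub_update onto a copy of local_record."""
    merged = dict(local_record)
    for attr, value in hub_update.items():
        if ATTRIBUTE_OWNER.get(attr) == "hub":
            merged[attr] = value
    return merged


def changes_for_hub(before, after):
    """Return source-owned attributes that changed locally and should flow to the hub."""
    return {
        attr: after[attr]
        for attr in after
        if ATTRIBUTE_OWNER.get(attr) == "source" and before.get(attr) != after[attr]
    }


local = {"family_name": "Ahmad", "given_name": "Mohamed",
         "phone": "+963-912-345-678", "case_status": "open"}
# A case worker corrects the phone number locally...
edited = {**local, "phone": "+963-912-345-999"}
# ...while the hub pushes a standardised spelling (and a stale phone value).
hub_update = {"family_name": "Ahmed", "given_name": "Mohammed",
              "phone": "+963-912-345-678"}

result = apply_hub_update(edited, hub_update)
outbound = changes_for_hub(local, edited)
print(result)
print(outbound)
```

Both changes survive: the hub's spelling replaces the local name, and the locally corrected phone number is kept and queued for the hub. Attributes missing from the map are ignored in both directions, which is a deliberately conservative default for a sketch.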
Implementation requires 3-6 months for basic coexistence, extending to 12-18 months for complex environments with many systems and bidirectional attribute flows.
Centralised style
The centralised style designates the MDM hub as the sole authority for master data creation and maintenance. Operational systems cannot create master records; they must reference records that exist in the hub. All changes to master data occur in the hub and propagate to consuming systems.
Figure 4: Centralised style makes the hub the sole authority for master data. The diagram shows the MDM hub (sole authority: create new entity, modify attributes, manage lifecycle) propagating to three consumers, Case Mgmt, Distribution, and the M&E Platform, each of which references hub records and holds master data as read-only.
Centralised MDM eliminates duplication at the source and keeps attributes consistent. A beneficiary registered in the hub immediately becomes available to all consuming systems with identical attributes. Cross-system matching and merging are no longer needed because each entity is created once, in the hub; the hub still has to check for an existing record at creation time so that the same person is not registered twice. This style delivers the strongest data quality outcomes but demands the highest implementation effort.
The centralised style requires all operational systems to integrate with the hub for master data operations. Systems must be modified to look up existing records before creating transactions and to accept hub-managed identifiers. This integration work is substantial, particularly for commercial off-the-shelf systems with limited extensibility.
Centralised MDM fits organisations building new system landscapes, those with strong integration capabilities, or those managing high-sensitivity data where duplicate records create significant operational or protection risks. Implementation requires 6-18 months depending on system landscape complexity.
| Style | Data authority | Duplication eliminated | Integration effort | When to use |
|---|---|---|---|---|
| Registry | Distributed | No | Low | Legacy systems, political constraints |
| Consolidation | Hub (read) | For analytics only | Medium | Reporting focus, incremental adoption |
| Coexistence | Shared | Partially | High | Gradual consistency improvement |
| Centralised | Hub (read/write) | Yes | Very high | New builds, high-sensitivity data |
Golden record creation
Golden record creation transforms multiple source records representing the same entity into a single authoritative record. The process comprises three stages: candidate identification, matching, and merging.
Candidate identification
Candidate identification selects records that might represent the same entity without performing exhaustive comparisons. Comparing every record against every other record in a database of 100,000 beneficiaries would require roughly 5 billion pairwise comparisons (100,000 × 99,999 / 2). Blocking reduces this to manageable numbers by grouping records that share common characteristics.
A blocking key groups records by shared attributes likely to identify the same entity. For beneficiary matching, blocking by first three characters of family name and birth year creates groups small enough for detailed comparison while unlikely to separate true matches. Records for “Ahmed, born 1985” and “Ahmad, born 1985” fall into the same block for comparison, while “Ahmed, born 1985” and “Hassan, born 1990” remain in separate blocks and are never compared.
Effective blocking balances group size against separation risk. Overly specific blocking (full family name plus full birth date) creates tiny groups but separates records with minor variations that should match. Overly broad blocking (birth year only) creates large groups requiring excessive comparisons. Multiple blocking passes with different keys catch matches that any single key would miss.
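The blocking approach above can be sketched in a few lines. The record layout, the function names, and the second blocking key (given-name prefix plus birth year) are assumptions for illustration; record D with the spelling “Achmed” is a hypothetical example of a variant the first key would miss.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


def family_key(record):
    """First three letters of the family name plus birth year."""
    return (record["family_name"][:3].lower(), record["birth_year"])


def given_key(record):
    """First three letters of the given name plus birth year (second pass)."""
    return (record["given_name"][:3].lower(), record["birth_year"])


def candidate_pairs(records, key_funcs):
    """Union of within-block ID pairs over one or more blocking passes."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for rec in records:
            blocks[key_func(rec)].append(rec["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": "A", "family_name": "Ahmed", "given_name": "Mohammed", "birth_year": 1985},
    {"id": "B", "family_name": "Ahmad", "given_name": "Mohamed", "birth_year": 1985},
    {"id": "C", "family_name": "Hassan", "given_name": "Ali", "birth_year": 1990},
    {"id": "D", "family_name": "Achmed", "given_name": "Mohammed", "birth_year": 1985},
]

print(pair_count(100_000))                                  # exhaustive comparison cost
print(candidate_pairs(records, [family_key]))               # single pass
print(candidate_pairs(records, [family_key, given_key]))    # two passes
```

With the family-name key alone, only A and B are compared. Adding the second pass brings D into comparison with both, while C is still never compared with anyone.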
Matching
Matching determines whether candidate records represent the same entity. Deterministic matching applies exact rules: if national ID numbers match exactly, the records match. Probabilistic matching calculates similarity scores across multiple attributes and accepts matches above a threshold.
Deterministic rules work well when reliable identifiers exist. Two beneficiary records sharing the same UNHCR individual ID are the same person, regardless of name spelling differences. The rule is simple and produces no false positives when the identifier is trustworthy.
Probabilistic matching handles the common case where reliable identifiers are absent. A probabilistic matcher compares name similarity (using algorithms such as Jaro-Winkler, which tolerates character transpositions and rewards shared prefixes; phonetic similarity needs a separate encoding such as Soundex or Double Metaphone), date of birth similarity (accounting for missing day/month components), and location proximity. Each comparison produces a score; the scores are combined using weights that reflect each attribute’s reliability.
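For readers who want to see what a name comparison involves, here is a plain-Python sketch of Jaro-Winkler similarity. It covers transpositions and the shared-prefix bonus only; the function names are illustrative, and a production matcher would more likely use a tested library implementation. The scores it produces will not exactly equal the illustrative similarities used in the worked example in this section.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s1_matched = [c for c, m in zip(s1, matched1) if m]
    s2_matched = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(s1_matched, s2_matched)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


for left, right in [("ahmed", "ahmad"), ("mohammed", "mohamed"), ("ahmed", "hassan")]:
    print(left, right, round(jaro_winkler(left, right), 3))
```

Inputs are lowercased before comparison here because the function is case-sensitive; normalising case, whitespace, and diacritics first is a common preprocessing step.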
Consider matching two beneficiary records:
| Attribute | Record A | Record B | Similarity | Weight | Contribution |
|---|---|---|---|---|---|
| Family name | Ahmed | Ahmad | 0.92 | 0.30 | 0.276 |
| Given name | Mohammed | Mohamed | 0.95 | 0.25 | 0.238 |
| Birth year | 1985 | 1985 | 1.00 | 0.20 | 0.200 |
| Birth month | 03 | 03 | 1.00 | 0.10 | 0.100 |
| Location | Aleppo | Halab | 0.85 | 0.15 | 0.128 |
| Total | | | | 1.00 | 0.942 |
With a match threshold of 0.85, this pair scores 0.942 and qualifies as a match. The location comparison recognised “Aleppo” and “Halab” as the same city (the English name versus a transliteration of the Arabic name) using a location alias lookup.
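The weighted combination in the table can be reproduced with a short sketch. The similarities and weights are taken from the table; the dictionary and function names are illustrative. Summing the unrounded contributions gives approximately 0.941, and the table's 0.942 appears to come from adding the rounded per-attribute contributions. Either way the pair clears the 0.85 threshold.

```python
WEIGHTS = {
    "family_name": 0.30,
    "given_name": 0.25,
    "birth_year": 0.20,
    "birth_month": 0.10,
    "location": 0.15,
}

SIMILARITIES = {
    "family_name": 0.92,   # Ahmed vs Ahmad
    "given_name": 0.95,    # Mohammed vs Mohamed
    "birth_year": 1.00,
    "birth_month": 1.00,
    "location": 0.85,      # Aleppo vs Halab via alias lookup
}

MATCH_THRESHOLD = 0.85


def weighted_score(similarities, weights):
    """Sum of similarity x weight; weights are expected to total 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(similarities[attr] * weight for attr, weight in weights.items())


score = weighted_score(SIMILARITIES, WEIGHTS)
is_match = score >= MATCH_THRESHOLD
print(round(score, 3), is_match)
```

Requiring the weights to sum to 1.0 keeps scores on the same 0-1 scale as the threshold, so a threshold tuned for one weight set remains meaningful when weights are adjusted.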
Machine learning matchers train on confirmed match/non-match pairs to learn patterns that distinguish true matches from coincidental similarities. Training requires labelled data: pairs of records with human-verified match status. The model learns to weight attributes and recognise patterns specific to the organisation’s data characteristics. Given enough representative labelled data, ML matching can deliver higher accuracy than hand-weighted probabilistic matching, but it requires more implementation effort and ongoing model maintenance.
Merge and survivorship
Merging combines matched records into a golden record by applying survivorship rules that select attribute values from among sources. Survivorship rules encode trust hierarchies, recency preferences, and completeness criteria.
Source trust assigns priority to systems based on data quality reputation. If the case management system is known to collect beneficiary names carefully during intake while the distribution system records names hastily during busy distributions, case management data survives over distribution data for name attributes.
Recency prefers the most recently updated value on the assumption that newer data reflects corrections or current status. Phone numbers and addresses benefit from recency rules because they change over time.
Completeness prefers values with more information. A birth date of “1985-03-15” survives over “1985-03” and “1985” because it is the most precise.
Aggregation combines values rather than selecting one. For address data, aggregation might retain both a primary address and a previous address from different sources. For identifiers, aggregation retains all known IDs from all sources.
Consider golden record creation for a beneficiary appearing in three systems:
| Attribute | Case Mgmt (trust: 1) | Distribution (trust: 3) | M&E (trust: 2) | Survivorship rule | Golden value |
|---|---|---|---|---|---|
| Family name | Ahmed | Ahmad | Ahmed | Source trust | Ahmed |
| Given name | Mohammed | Mohamed | Mohammed | Source trust | Mohammed |
| Birth date | 1985-03-15 | 1985-03 | 1985-03-15 | Completeness | 1985-03-15 |
| Phone | +963-912-345-678 | +963-912-345-999 | null | Recency (Dist newer) | +963-912-345-999 |
| National ID | null | 28501234567 | null | Any non-null | 28501234567 |
The golden record for this beneficiary uses the family name and given name from case management (trust rank 1, the most trusted source), the birth date from a source that records full precision, the phone number from distribution (most recently updated), and the national ID from distribution (the only source with a value).
Figure 5: Golden record creation through matching and survivorship. The diagram shows three source records (Case Mgmt ID 4523, Distribution ID BEN-7891, M&E Platform ID P-00234) entering the matching engine (blocking: family name prefix + birth year; scoring: 0.942, above the 0.85 threshold), then the survivorship engine (name: trust priority, Case Mgmt; DOB: completeness, most precise; phone: recency, Distribution; ID: aggregation, any non-null). The output is golden record GR-001: family name Ahmed, given name Mohammed, birth date 1985-03-15, phone +963-912-345-999, national ID 28501234567, sources [4523, BEN-7891, P-00234].
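The survivorship rules applied to the three-source example can be sketched as small functions selected per attribute. The attribute values and trust ranks come from the table in this section; the `updated` dates are hypothetical (only their order matters, with distribution the most recent), and the function and dictionary names are illustrative.

```python
from datetime import date

# Lower trust_rank means more trusted, as in the table's trust column.
SOURCES = {
    "case_mgmt": {
        "trust_rank": 1, "updated": date(2024, 1, 10),  # hypothetical date
        "values": {"family_name": "Ahmed", "given_name": "Mohammed",
                   "birth_date": "1985-03-15", "phone": "+963-912-345-678",
                   "national_id": None},
    },
    "distribution": {
        "trust_rank": 3, "updated": date(2024, 3, 2),   # hypothetical date
        "values": {"family_name": "Ahmad", "given_name": "Mohamed",
                   "birth_date": "1985-03", "phone": "+963-912-345-999",
                   "national_id": "28501234567"},
    },
    "me_platform": {
        "trust_rank": 2, "updated": date(2024, 2, 5),   # hypothetical date
        "values": {"family_name": "Ahmed", "given_name": "Mohammed",
                   "birth_date": "1985-03-15", "phone": None,
                   "national_id": None},
    },
}


def by_trust(candidates):
    """Value from the most trusted source that has one."""
    return min(candidates, key=lambda c: c["trust_rank"])["value"]


def by_recency(candidates):
    """Value from the most recently updated source."""
    return max(candidates, key=lambda c: c["updated"])["value"]


def by_completeness(candidates):
    """Longest ISO date string (most precision); ties go to the more trusted source."""
    return min(candidates, key=lambda c: (-len(c["value"]), c["trust_rank"]))["value"]


RULES = {
    "family_name": by_trust,
    "given_name": by_trust,
    "birth_date": by_completeness,
    "phone": by_recency,
    "national_id": by_trust,  # "any non-null": nulls are filtered out before the rule runs
}


def build_golden_record(sources, rules):
    golden = {}
    for attr, rule in rules.items():
        candidates = [
            {"value": src["values"][attr], "trust_rank": src["trust_rank"],
             "updated": src["updated"]}
            for src in sources.values()
            if src["values"].get(attr) is not None
        ]
        golden[attr] = rule(candidates) if candidates else None
    return golden


golden = build_golden_record(SOURCES, RULES)
print(golden)
```

Keeping each rule as a separate function makes the per-attribute choice explicit and lets a change to one rule be reviewed and tested in isolation.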
Synchronisation patterns
Synchronisation keeps the MDM hub and operational systems aligned as data changes. Three patterns address different latency and complexity requirements.
Batch synchronisation extracts, transforms, and loads data on a schedule. A nightly batch job extracts new and changed records from source systems, performs matching against existing golden records, creates or updates golden records, and publishes changes to consuming systems. Batch synchronisation suits organisations with limited integration infrastructure, tolerance for day-old data, and straightforward data volumes.
A batch schedule for beneficiary MDM might run at 02:00 UTC when system load is minimal. The job extracts 500 new registrations from case management, 200 modified records from distribution, matches them against 50,000 existing golden records, identifies 15 potential duplicates requiring steward review, creates 480 new golden records, updates 150 existing records, and publishes all changes to the data warehouse by 05:00.
Event-driven synchronisation processes changes as they occur. When a case worker updates a beneficiary’s phone number, an event triggers immediate propagation to the hub. The hub validates the change, updates the golden record, and publishes the change to other systems within seconds.
Event-driven synchronisation requires integration infrastructure capable of reliable message delivery, typically Apache Kafka, RabbitMQ, or cloud equivalents. Systems must emit change events and consume update events. The complexity is higher than batch synchronisation, but propagation delay falls from a day to near-real-time.
Hybrid synchronisation combines patterns for different data types. High-priority changes (new registrations, identity corrections) flow through event-driven channels for immediate consistency. Lower-priority changes (address updates, preference changes) accumulate for batch processing. The hybrid approach balances implementation effort against business requirements for data freshness.
Figure 6: Hybrid synchronisation combining event-driven and batch patterns. The diagram shows Case Management sending new registrations to the MDM hub as events and address updates as a nightly batch. Inside the hub, an event processor (real-time) and a batch processor (nightly) both update the golden records. New golden records are published as events to Distribution (real-time), and other updates are delivered in batch to the Data Warehouse (nightly).
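The routing decision at the heart of the hybrid pattern can be sketched without any messaging infrastructure. In the sketch below, `HybridRouter`, the change-type names, and the `publish_now` callback are hypothetical; in practice the immediate path would publish to a message broker and the batch path would be drained by a scheduled job.

```python
from collections import deque

# Assumption: change types considered urgent enough for the event-driven path.
EVENT_DRIVEN_TYPES = {"new_registration", "identity_correction"}


class HybridRouter:
    """Sends urgent changes immediately and queues the rest for batch processing."""

    def __init__(self, publish_now):
        self._publish_now = publish_now  # callable invoked immediately for urgent changes
        self._batch = deque()

    def submit(self, change):
        if change["type"] in EVENT_DRIVEN_TYPES:
            self._publish_now(change)
        else:
            self._batch.append(change)

    def run_batch(self):
        """Return and clear all queued changes, e.g. from a nightly job."""
        drained = list(self._batch)
        self._batch.clear()
        return drained


published = []
router = HybridRouter(publish_now=published.append)
router.submit({"type": "new_registration", "id": "4523"})
router.submit({"type": "address_update", "id": "4523"})

print(published)           # delivered immediately
print(router.run_batch())  # delivered by the next batch run
```

Keeping the list of urgent change types in one place makes the freshness trade-off an explicit, reviewable configuration decision.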
Data stewardship for master data
Data stewards perform operational governance of master data, resolving issues that automated processes cannot handle. Stewardship activities include potential match review, merge undo, data correction, and hierarchy management.
Potential match review addresses cases where matching scores fall in an uncertain range. A match score of 0.75 against a threshold of 0.85 might represent a true match obscured by poor data quality, or two distinct people who happen to share similar details. Stewards review the records, examine additional context (case notes, distribution history), and make authoritative match/non-match decisions. These decisions can then serve as labelled examples for training matching models and refining matching rules.
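The uncertain range is usually defined by a second, lower threshold. The sketch below assumes a review band from 0.70 up to the 0.85 match threshold; the lower bound and the names used are illustrative assumptions rather than values from this document.

```python
MATCH_THRESHOLD = 0.85    # automatic match at or above this score
REVIEW_THRESHOLD = 0.70   # assumption: lower bound of the steward review band


def classify(score):
    """Sort a pair score into automatic match, steward review, or non-match."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "non_match"


for s in (0.942, 0.75, 0.40):
    print(s, classify(s))
```

Widening the review band reduces automatic errors but increases steward workload, so the lower bound is best set against available stewardship capacity.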
Merge undo corrects incorrect merges. When two distinct individuals are mistakenly merged, the golden record conflates their information. Undoing a merge requires separating the golden record back into source records, determining which attributes belong to which individual, and updating all consuming systems. Merge undo is operationally complex; minimising false positive matches through careful threshold setting and steward review reduces the need for it.
Data correction updates golden record attributes when source data is incorrect and cannot be fixed at the source. A steward might correct a misspelled name in the golden record when the source system is decommissioned or inaccessible. Corrections flow to active consuming systems through normal synchronisation.
Hierarchy management maintains relationships between entities in hierarchical domains. Organisational hierarchies require steward decisions about where new partners fit, how mergers and splits affect the structure, and which relationships are current versus historical.
Stewardship workload varies by data quality and matching effectiveness. An organisation with strong upstream data quality and well-tuned matching rules might generate 50 stewardship tasks per week. An organisation with inconsistent data entry and conservative matching thresholds might generate 500 tasks per week. Stewardship capacity must align with expected workload; backlogs of unreviewed potential matches defeat the purpose of MDM.
MDM governance
MDM governance defines authority over master data: who can create entities, who approves changes, who sets standards, and who resolves disputes between systems claiming different truths.
Domain ownership assigns accountability for each master data domain to a business function. The programmes team owns beneficiary master data because they bear the consequences of beneficiary data problems. Finance owns vendor master data. Partnership owns implementing partner data. Domain owners set standards for their domains, approve data model changes, and arbitrate disputes.
Stewardship assignment delegates operational governance to individuals with appropriate access and expertise. A programme manager in each country office might serve as beneficiary data steward for that country’s records. Central stewards handle cross-country issues and maintain global standards. Stewardship cannot be an afterthought; it requires defined role expectations, allocated time, and training.
Change governance controls modifications to matching rules, survivorship rules, and data models. Changes that affect how records match or how golden records form can have cascading effects. A seemingly minor adjustment to a matching threshold might suddenly merge 1,000 previously separate records or split 500 existing golden records. Change governance requires impact analysis, testing, and approval before production changes.
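A simple form of impact analysis is to re-evaluate stored pair scores under the proposed threshold before changing production. The sketch below is a simplification that looks at pairs only (real merges form clusters, so effects can cascade further); the function name and all pair IDs other than 4523 and BEN-7891 are hypothetical.

```python
def threshold_impact(scored_pairs, current, proposed):
    """List pairs whose match status would change if the threshold moved."""
    newly_merged = [p for p, s in scored_pairs.items() if proposed <= s < current]
    newly_split = [p for p, s in scored_pairs.items() if current <= s < proposed]
    return {"newly_merged": sorted(newly_merged), "newly_split": sorted(newly_split)}


# Hypothetical stored scores from previous matching runs.
scored_pairs = {
    ("4523", "BEN-7891"): 0.942,
    ("1001", "BEN-2002"): 0.83,
    ("1003", "BEN-2005"): 0.86,
    ("1004", "BEN-2009"): 0.60,
}

print(threshold_impact(scored_pairs, current=0.85, proposed=0.80))
print(threshold_impact(scored_pairs, current=0.85, proposed=0.90))
```

Running this against the full set of stored scores gives approvers a concrete count of affected records to weigh before the change is deployed.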
Quality standards set expectations for data accuracy, completeness, and timeliness within each domain. Beneficiary records might require 95% name completeness, 80% date of birth completeness, and 99% unique identification within 24 hours of registration. Standards provide targets for data entry, thresholds for stewardship escalation, and metrics for governance reporting.
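Completeness targets like these are straightforward to measure. The sketch below checks the name and date-of-birth targets from the paragraph above against a small hypothetical sample; the field names, function names, and sample records are assumptions for illustration.

```python
# Targets taken from the example standards above.
TARGETS = {"name": 0.95, "birth_date": 0.80}


def completeness(records, attr):
    """Share of records with a non-empty value for attr."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(attr) not in (None, ""))
    return filled / len(records)


def quality_report(records, targets):
    """Compare measured completeness with the target for each attribute."""
    report = {}
    for attr, target in targets.items():
        rate = completeness(records, attr)
        report[attr] = {"rate": rate, "target": target, "meets_target": rate >= target}
    return report


# Hypothetical sample: 10 records, all named, 7 with a birth date.
sample = [{"name": f"Person {i}", "birth_date": "1985-03-15" if i < 7 else None}
          for i in range(10)]

report = quality_report(sample, TARGETS)
for attr, result in report.items():
    print(attr, result)
```

A report of this shape can feed both governance dashboards and the escalation thresholds that trigger stewardship tasks.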
MDM for beneficiary data
Beneficiary master data presents distinctive challenges that merit specific consideration. Beneficiaries often lack formal identification documents, making matching dependent on attributes that vary (names, locations) rather than authoritative identifiers (national ID, passport). Ethical obligations require minimising registration burden while maximising data protection. Programme requirements demand accurate deduplication (to ensure fair distribution) without false positive merges (which conflate distinct individuals’ assistance records).
Identity confidence levels classify beneficiary records by the strength of their identity verification. A beneficiary verified against a national ID card at registration has high confidence; a beneficiary registered from a third-party list with only name and approximate age has low confidence. Matching thresholds can adjust by confidence level: high-confidence records match only at stringent thresholds while low-confidence records match at lower thresholds where false negatives are more costly than false positives.
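Confidence-dependent thresholds can be expressed as a lookup. In the sketch below, the threshold values other than 0.85 are invented for illustration, and applying the stricter of the two records' thresholds to a pair is one possible design choice, on the reasoning that merging into a verified identity deserves more caution.

```python
# Assumed thresholds per identity confidence level (illustrative values).
THRESHOLD_BY_CONFIDENCE = {"high": 0.92, "medium": 0.85, "low": 0.80}


def pair_threshold(confidence_a, confidence_b):
    """Use the stricter (higher) threshold of the two records being compared."""
    return max(THRESHOLD_BY_CONFIDENCE[confidence_a],
               THRESHOLD_BY_CONFIDENCE[confidence_b])


def is_match(score, confidence_a, confidence_b):
    return score >= pair_threshold(confidence_a, confidence_b)


print(is_match(0.88, "low", "low"))    # both records weakly verified
print(is_match(0.88, "high", "low"))   # one record verified against an ID card
```

The same score of 0.88 therefore merges two low-confidence records but is held back when one side has a verified identity.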
Household relationships link individual beneficiaries into household units that share assistance or have related eligibility. MDM must model both individual identity and household membership, recognising that household composition changes over time (births, deaths, separations). A household golden record aggregates individual golden records with relationship metadata.
Cross-programme consent tracks whether beneficiaries have consented to data sharing across programmes. One beneficiary might consent to sharing between nutrition and health programmes but not with livelihoods programmes. The golden record must reflect these consent boundaries, and synchronisation must respect them, potentially providing different views of the golden record to different consuming systems.
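Consent-aware views can be sketched as a filter applied when a consuming system requests a golden record. The field name `consented_programmes`, the function `view_for`, and the list of shareable attributes are hypothetical; a real implementation would also record when and how consent was given.

```python
GOLDEN_RECORD = {
    "id": "GR-001",
    "family_name": "Ahmed",
    "given_name": "Mohammed",
    "birth_date": "1985-03-15",
    # Assumption: programmes this beneficiary has agreed may share their data.
    "consented_programmes": {"nutrition", "health"},
}


def view_for(golden, programme,
             shared_attrs=("family_name", "given_name", "birth_date")):
    """Return the attributes a programme may see, or None without consent."""
    if programme not in golden["consented_programmes"]:
        return None
    view = {"id": golden["id"]}
    view.update({a: golden[a] for a in shared_attrs if a in golden})
    return view


print(view_for(GOLDEN_RECORD, "health"))
print(view_for(GOLDEN_RECORD, "livelihoods"))
```

Applying the filter in the hub, rather than trusting each consuming system to enforce it, keeps consent boundaries in one auditable place.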
Protection classification identifies beneficiaries whose data requires elevated protection. Survivors of gender-based violence, children at risk, and persons with security concerns may have records that should not appear in standard distributions or reports. MDM must support access controls that hide sensitive records from systems or users without appropriate clearance while maintaining the golden record for authorised purposes.
Technology options
MDM platforms range from purpose-built commercial products to open source frameworks to custom-built solutions using general-purpose data tools.
Open source options
Talend Open Studio for MDM provides matching, merging, and stewardship workflow capabilities in an open source package. It integrates with Talend’s broader data integration platform. The open source version covers core MDM functionality; advanced features (workflow management, metadata management) require commercial licensing. Talend suits organisations with existing Talend investments or those seeking a path from open source to commercial support. Note that Talend (now part of Qlik) retired its Open Studio products in January 2024, so confirm current availability and licensing before planning around the open source edition.
Apache Atlas provides metadata management and lineage tracking that supports MDM implementations built on other tools. Atlas does not provide matching or merging natively but can govern master data assets managed elsewhere. Organisations building custom MDM solutions on data platforms like Apache Spark can use Atlas for governance.
RecordLinkage (R) and dedupe (Python) are matching and deduplication libraries for organisations building custom solutions. They implement probabilistic matching algorithms without prescribing architecture or workflow, which suits organisations with data engineering capacity that want to embed matching within existing pipelines rather than deploy a separate MDM platform.
Commercial options with nonprofit programmes
Informatica MDM offers comprehensive MDM capabilities including matching, survivorship, hierarchy management, and stewardship workflow. Informatica provides nonprofit pricing through TechSoup in some regions. The platform is enterprise-grade with corresponding implementation complexity; it suits large organisations with dedicated data management teams.
Reltio provides cloud-native MDM with modern architecture suited to organisations without on-premises infrastructure investments. Reltio’s graph-based data model handles complex relationships well. Nonprofit pricing availability varies by region and should be verified directly.
Microsoft Dynamics 365 includes MDM capabilities for organisations already invested in the Microsoft ecosystem. Integration with Power Platform enables citizen developer involvement in stewardship workflows. Microsoft offers substantial nonprofit discounts, which makes this option accessible to organisations already using Microsoft 365.
Custom solutions
Organisations with strong data engineering capabilities can build MDM solutions using general-purpose tools. A combination of Apache Spark for matching at scale, PostgreSQL for golden record storage, Apache Airflow for orchestration, and a custom stewardship application provides full MDM functionality without platform licensing costs. Custom solutions demand ongoing engineering investment but offer complete flexibility and avoid vendor lock-in.
| Option | Matching | Stewardship UI | Best for |
|---|---|---|---|
| Talend Open Studio | Built-in | Basic | Existing Talend users |
| Apache Atlas | None | Metadata only | Governance layer |
| dedupe (Python) | Library | None (build custom) | Custom solutions |
| Informatica MDM | Built-in | Comprehensive | Large enterprises |
| Reltio | Built-in | Modern | Cloud-native orgs |
| Custom build | Build with libraries | Build custom | Strong engineering teams |
Implementation considerations
For organisations with limited IT capacity
MDM implementation without dedicated data management staff requires pragmatic scoping. Start with a single domain (for example, beneficiary data across two systems) rather than comprehensive MDM. Use the registry style, which adds value through cross-referencing without complex integration. A simple matching implementation using dedupe or RecordLinkage, with manual stewardship in spreadsheets, provides meaningful deduplication without platform investment.
Minimum viable MDM for a single IT person:
- Scope: Beneficiary domain, case management + distribution systems only
- Style: Registry with manual cross-reference maintenance
- Matching: Monthly batch using Python/dedupe, 2 hours of steward review
- Technology: PostgreSQL for cross-references, Python scripts for matching
- Timeline: 4-6 weeks implementation, 4 hours/month ongoing
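The monthly batch in the list above can be sketched as follows. To stay self-contained, this sketch substitutes the standard library for the tools named in the list: `difflib` stands in for dedupe's probabilistic matching and an in-memory SQLite database stands in for PostgreSQL. The field names, score weights, and thresholds are assumptions.

```python
import sqlite3
from difflib import SequenceMatcher

AUTO_MATCH, REVIEW = 0.90, 0.70  # hypothetical thresholds


def normalise(name: str) -> str:
    return " ".join(name.lower().split())


def score(a: dict, b: dict) -> float:
    """Crude similarity: weighted name similarity plus exact birth-year agreement."""
    name_sim = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    same_year = 1.0 if a["birth_year"] == b["birth_year"] else 0.0
    return 0.8 * name_sim + 0.2 * same_year


def run_batch(case_records, dist_records, conn) -> None:
    """Compare every pair and store cross-references for matches and review items."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS xref ("
        "case_id TEXT, dist_id TEXT, score REAL, status TEXT, "
        "PRIMARY KEY (case_id, dist_id))"
    )
    for c in case_records:
        for d in dist_records:  # all pairs; larger datasets would need blocking
            s = score(c, d)
            if s >= AUTO_MATCH:
                status = "matched"
            elif s >= REVIEW:
                status = "needs_review"  # goes to the steward's review list
            else:
                continue
            # SQLite upsert syntax; PostgreSQL would use INSERT ... ON CONFLICT.
            conn.execute(
                "INSERT OR REPLACE INTO xref VALUES (?, ?, ?, ?)",
                (c["id"], d["id"], round(s, 3), status),
            )
    conn.commit()


case_records = [{"id": "CM-1", "name": "Amina Yusuf", "birth_year": 1990}]
dist_records = [
    {"id": "DS-9", "name": "amina  yusuf", "birth_year": 1990},
    {"id": "DS-10", "name": "Amina Yusef", "birth_year": 1991},
    {"id": "DS-11", "name": "John Okello", "birth_year": 1975},
]

conn = sqlite3.connect(":memory:")
run_batch(case_records, dist_records, conn)
for row in conn.execute("SELECT case_id, dist_id, status FROM xref ORDER BY dist_id"):
    print(row)
```

Rows with status `needs_review` are the pairs a steward would confirm or reject during the monthly review; the cross-reference table is the registry itself, and neither source system is modified.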
For organisations with established IT functions
Established IT teams can implement coexistence or centralised MDM with formal governance. Multi-domain MDM covering beneficiaries, partners, locations, and projects delivers comprehensive data consistency. Automated matching with machine learning improves accuracy over time. Stewardship workflows with defined SLAs ensure timely issue resolution.
Investment scales with ambition. A two-domain coexistence implementation (beneficiaries and partners) with an open source platform requires 4-6 months and 0.5 FTE ongoing. A comprehensive five-domain centralised implementation with a commercial platform requires 12-18 months and 1-2 FTE ongoing.
For organisations with federated structures
Federated organisations face MDM challenges when country offices operate autonomous systems. A global MDM hub can consolidate data for headquarters reporting while respecting country office system autonomy. The consolidation style works well here: country systems remain unchanged while the hub provides global visibility.
Cross-office deduplication identifies beneficiaries registered in multiple countries (refugees who move, programmes that span borders). This requires matching across country datasets while respecting data sovereignty constraints that may prevent centralising raw data. Federated matching approaches perform matching computations locally and share only match results, not underlying data.
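One possible way to approximate this is for each office to compute keyed hashes of normalised identifying fields locally and exchange only those tokens with local record IDs. The sketch below is an illustrative assumption rather than a prescribed design: the field names, the shared key, and the choice of HMAC-SHA256 are hypothetical, and exact-token comparison finds only records whose normalised values agree exactly.

```python
import hashlib
import hmac
import unicodedata
from typing import Dict, List, Tuple


def token(record: dict, shared_key: bytes) -> str:
    """Keyed hash of normalised name and birth date; raw values stay in the office."""
    name = unicodedata.normalize("NFKD", record["name"]).casefold()
    name = " ".join(name.split())
    payload = f"{name}|{record['birth_date']}".encode("utf-8")
    return hmac.new(shared_key, payload, hashlib.sha256).hexdigest()


def local_tokens(records: List[dict], shared_key: bytes) -> Dict[str, str]:
    """Run inside each country office: maps token to local record ID."""
    return {token(r, shared_key): r["id"] for r in records}


def cross_office_matches(a: Dict[str, str], b: Dict[str, str]) -> List[Tuple[str, str]]:
    """Run wherever tokens are compared: only tokens and local IDs are visible."""
    return sorted((a[t], b[t]) for t in a.keys() & b.keys())


key = b"hypothetical-key-agreed-between-offices"
office_a = [{"id": "KE-1", "name": "Amina Yusuf", "birth_date": "1990-04-12"}]
office_b = [
    {"id": "UG-7", "name": "AMINA  YUSUF", "birth_date": "1990-04-12"},
    {"id": "UG-8", "name": "John Okello", "birth_date": "1975-01-30"},
]

print(cross_office_matches(local_tokens(office_a, key), local_tokens(office_b, key)))
# [('KE-1', 'UG-7')]
```

Only the pair of local IDs leaves the comparison step; names and birth dates are never exchanged. Tokens derived from personal data should still be handled as sensitive, and tolerance for spelling variation would require a more sophisticated privacy-preserving encoding than this exact-match sketch.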