Data Modelling
Data modelling is the practice of creating abstract representations of data structures that capture both the meaning of data and the rules governing its use. A data model serves as the blueprint for how information is organised, related, and constrained within a system. The discipline spans three distinct levels of abstraction: conceptual models that express business meaning independent of technology, logical models that define precise structures and relationships, and physical models that specify how data is stored in particular database systems.
Effective data models bridge the gap between business understanding and technical implementation. A programme manager describing beneficiary households, their needs assessments, and the assistance they receive is articulating a conceptual model. The database administrator implementing tables with foreign keys and indexes is working with a physical model. The logical model sits between them, precise enough for technical work yet free from implementation details. This separation allows business requirements to remain stable even as underlying technology changes.
- Conceptual model
- A high-level representation of business entities and their relationships, expressed in terms stakeholders understand. Contains no technical implementation details. Used for requirements gathering and communication with non-technical audiences.
- Logical model
- A technology-independent specification of entities, attributes, relationships, and constraints. Defines data types, cardinality, and business rules without reference to specific database platforms.
- Physical model
- The implementation-specific design for a particular database system, including table structures, column definitions, indexes, partitioning schemes, and storage parameters.
- Entity
- A distinct object or concept about which data is stored. In programme contexts: beneficiaries, households, activities, distributions, locations.
- Attribute
- A property or characteristic of an entity. A beneficiary entity might have attributes for name, date of birth, and registration date.
- Relationship
- An association between entities. A household contains beneficiaries; a distribution serves a beneficiary; an activity occurs at a location.
- Cardinality
- The numerical nature of a relationship: one-to-one, one-to-many, or many-to-many. A household contains many beneficiaries; each beneficiary belongs to exactly one household.
- Normalisation
- The process of organising data to reduce redundancy and improve integrity by dividing large tables into smaller, related tables.
- Denormalisation
- The deliberate introduction of redundancy to improve query performance, typically for analytical workloads.
Modelling levels
The three modelling levels form a refinement chain, each adding detail while preserving the semantics established at higher levels. Movement between levels is not strictly linear; insights from physical modelling constraints often feed back into logical design decisions.
CONCEPTUAL MODEL
  Audience: business stakeholders, programme managers
  Content: major entities, key relationships, business rules
  Detail: low; focuses on "what", not "how"

    Beneficiary ----- Household ----- Location
         |
         | receives
         v
    Assistance ----- Activity

        | Refinement
        v

LOGICAL MODEL
  Audience: data architects, analysts, developers
  Content: all entities, attributes, data types, constraints
  Detail: high; complete specification

    Beneficiary                          Household
      beneficiary_id: UUID                household_id: UUID
      household_id: UUID (FK) -------->   head_of_household: UUID
      given_name: String(100)             address: String(500)
      family_name: String(100)            location_id: UUID (FK)
      date_of_birth: Date                 registration_date: Date
      gender: Enum                        status: Enum
      vulnerability_score: Int

        | Implementation
        v

PHYSICAL MODEL
  Audience: database administrators, operations
  Content: tables, indexes, partitions, storage settings
  Detail: implementation-specific

    CREATE TABLE beneficiary (
        beneficiary_id       UUID PRIMARY KEY,
        household_id         UUID NOT NULL REFERENCES household,
        given_name           VARCHAR(100) NOT NULL,
        family_name          VARCHAR(100),
        date_of_birth        DATE,
        gender               SMALLINT CHECK (gender IN (1, 2, 3)),
        vulnerability_score  SMALLINT DEFAULT 0,
        created_at           TIMESTAMPTZ DEFAULT NOW(),
        updated_at           TIMESTAMPTZ DEFAULT NOW()
    );
    CREATE INDEX idx_beneficiary_household
        ON beneficiary (household_id);
    CREATE INDEX idx_household_location
        ON household (location_id) WHERE status = 'active';

Figure 1: Three modelling levels showing progressive refinement from business concepts to implementation
The conceptual model captures that beneficiaries belong to households and receive assistance through activities. It deliberately omits technical details: stakeholders need not know whether household membership is implemented via a foreign key or a junction table. The logical model adds precision: beneficiary_id is a UUID, vulnerability_score is an integer, gender uses an enumerated type. The physical model commits to PostgreSQL-specific syntax, adds indexes for query patterns, and specifies storage behaviours.
Maintaining traceability between levels enables impact analysis. When a business requirement changes at the conceptual level (households can now have multiple locations), the logical model changes follow systematically (location_id moves from household to a junction table), and physical changes propagate from there (new table, new indexes, migration script).
Entity-relationship modelling
Entity-relationship (ER) modelling represents data as entities connected by relationships, with attributes describing entity properties. The approach dominates operational database design where transactional integrity and update efficiency matter most.
ER NOTATION REFERENCE

  Entities: a rectangle represents a distinct object.

  Relationship cardinality markers:
    1       one (mandatory)
    0..1    zero or one (optional)
    *       many (zero or more)
    1..*    one or more (mandatory many)

  Cardinality examples:
    Country 1 ----- 1 Capital         one-to-one
    Household 1 ----- * Beneficiary   one-to-many
    Activity * ----- * Location       many-to-many (needs a junction table)

  Attributes: listed inside the entity box as attribute_name: type.

  Keys: the primary key is underlined or marked PK; a foreign key is
  marked FK, with an arrow to the referenced entity.

Figure 2: Entity-relationship notation conventions
Normalisation
Normalisation organises data to eliminate redundancy and ensure consistency. The process applies successive normal forms, each addressing specific anomaly types. First normal form (1NF) requires atomic values: a beneficiary cannot have a multi-valued “phone_numbers” attribute; instead, a separate phone_number table relates to beneficiary. Second normal form (2NF) eliminates partial dependencies: every non-key attribute must depend on the entire primary key, not just part of a composite key. Third normal form (3NF) removes transitive dependencies: non-key attributes must depend directly on the key, not on other non-key attributes.
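The 1NF step can be sketched directly: a multi-valued phone_numbers attribute (hypothetical field names) is pushed into a child table whose rows key back to the beneficiary.

```python
# A record violating 1NF: phone_numbers holds several values in one field.
beneficiary = {"beneficiary_id": "BEN-001", "name": "Amina Hassan",
               "phone_numbers": ["+254700000001", "+254700000002"]}

def to_first_normal_form(record):
    """Return a single-valued beneficiary row plus a child phone table."""
    parent = {k: v for k, v in record.items() if k != "phone_numbers"}
    children = [{"beneficiary_id": record["beneficiary_id"], "phone_number": n}
                for n in record["phone_numbers"]]
    return parent, children

parent, phones = to_first_normal_form(beneficiary)
# parent -> one row with atomic values only
# phones -> one row per number, each relating back to BEN-001
```

After the split, adding a third phone number is an insert into the child table rather than a schema change or a delimited string.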
Consider a denormalised distribution record:
distribution_id | beneficiary_id | beneficiary_name | household_id | location_name | item | quantity | date

This structure violates 2NF because beneficiary_name depends only on beneficiary_id, not the full key. It violates 3NF because location_name depends on household_id (transitively through household’s location). Normalising yields:
distribution (distribution_id, beneficiary_id, item, quantity, date)
beneficiary (beneficiary_id, beneficiary_name, household_id)
household (household_id, location_id)
location (location_id, location_name)

The normalised design stores each fact once. Updating a location name requires changing one row rather than every distribution record referencing that location. Insert anomalies disappear: a new location can exist before any household moves there. Delete anomalies resolve: removing all distributions for a beneficiary does not lose the beneficiary’s information.
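The update-anomaly argument can be demonstrated concretely. In this sketch, in-memory dicts stand in for the normalised tables; renaming a location touches one row, and every distribution sees the change through the join path:

```python
# Normalised tables as dicts keyed by primary key (hypothetical data).
household = {"HH-500": {"location_id": "LOC-01"}}
location = {"LOC-01": {"location_name": "Dadaab"}}
beneficiary = {"BEN-001": {"household_id": "HH-500"}}
distribution = [{"distribution_id": "D-1", "beneficiary_id": "BEN-001"},
                {"distribution_id": "D-2", "beneficiary_id": "BEN-001"}]

def location_of(dist):
    """Follow distribution -> beneficiary -> household -> location."""
    hh = beneficiary[dist["beneficiary_id"]]["household_id"]
    return location[household[hh]["location_id"]]["location_name"]

# One row changes; both distributions reflect it through the joins.
location["LOC-01"]["location_name"] = "Dadaab Camp 3"
```

In the denormalised layout above, the same rename would require touching every distribution row that mentions the location, with the risk of missing some.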
Higher normal forms exist (BCNF, 4NF, 5NF) addressing subtler dependency types. Most operational systems target 3NF as the practical balance between integrity and complexity.
Operational model example
A programme management operational model demonstrates ER principles applied to humanitarian data:
PROGRAMME OPERATIONAL MODEL (entities and relationships)

  Donor 1 ----- * Project                 (projects receive funding from donors)
  Project 1 ----- * Activity              (projects contain activities)
  Activity * ----- 1 Location             (activities occur at locations)
  Admin Level 1 ----- * Location          (locations roll up through administrative levels)
  Activity 1 ----- * Distribution         (activities produce distributions)
  Location 1 ----- * Household            (households exist at locations)
  Household 1 ----- * Beneficiary (member)
  Distribution * ----- 1 Beneficiary (receiver)
  Beneficiary 1 ----- * Biometric Record
  Beneficiary 1 ----- * Needs Assessment

Figure 3: Operational model for programme management showing entity relationships
The model captures that projects receive funding from donors, projects contain activities, activities occur at locations, and activities produce distributions. Households exist at locations and contain beneficiary members. Distributions go to specific beneficiaries (who may or may not be the household head). Each beneficiary can have biometric records for deduplication and needs assessments for targeting.
Key constraints embedded in this model: a beneficiary belongs to exactly one household (preventing duplicate registrations across households), distributions always link to both an activity and a beneficiary (ensuring traceability), and locations form a hierarchy through administrative levels (enabling aggregation).
Dimensional modelling
Dimensional modelling organises data for analytical queries rather than transactional updates. Where normalised models optimise for write consistency, dimensional models optimise for read performance and analytical flexibility. The approach structures data into facts surrounded by dimensions.
A fact table contains quantitative measurements of business events: distribution quantities, assessment scores, expenditure amounts. Facts are typically numeric and additive, enabling aggregation across dimensions. A dimension table contains descriptive attributes providing context for facts: who received the distribution, where it occurred, when, what item, under which project.
The star schema arranges dimensions around a central fact table, each dimension connecting directly to the fact. The snowflake schema normalises dimensions into sub-dimensions, reducing redundancy at the cost of query complexity.
STAR SCHEMA: DISTRIBUTION FACTS

  fact_distribution (centre)
    distribution_key
    date_key (FK), beneficiary_key (FK), location_key (FK),
    activity_key (FK), item_key (FK)
    measures: quantity, unit_value, total_value

  dim_beneficiary: beneficiary_key, beneficiary_id, name, gender,
    age_group, vulnerability, household_size
  dim_date: date_key, full_date, day, month, quarter, year,
    fiscal_year, is_holiday
  dim_location: location_key, location_name, admin_level_1,
    admin_level_2, admin_level_3, country, region, coordinates
  dim_item: item_key, item_code, item_name, category,
    unit_of_measure, unit_cost
  dim_activity: activity_key, activity_name, project_name, donor,
    sector, modality

Figure 4: Star schema for distribution analytics with five dimensions surrounding a central fact table
This star schema supports queries such as “total value distributed by donor and quarter” or “beneficiaries reached by location and item category” through straightforward joins. Each dimension includes denormalised attributes: dim_location contains admin_level_1 through admin_level_3 directly rather than referencing separate admin_level tables. This denormalisation speeds queries by reducing join depth.
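A query such as “total value distributed by donor and quarter” reduces to key lookups into the dimensions plus a grouped sum. A minimal Python sketch, with in-memory dicts standing in for tables and hypothetical surrogate keys and rows:

```python
from collections import defaultdict

# Dimension tables keyed by surrogate key (illustrative rows only).
dim_date = {1: {"quarter": "2024-Q1"}, 2: {"quarter": "2024-Q2"}}
dim_activity = {10: {"donor": "ECHO"}, 11: {"donor": "BHA"}}

# Fact rows reference dimensions by surrogate key.
fact_distribution = [
    {"date_key": 1, "activity_key": 10, "total_value": 120.0},
    {"date_key": 1, "activity_key": 11, "total_value": 80.0},
    {"date_key": 2, "activity_key": 10, "total_value": 50.0},
]

# Group by (donor, quarter): resolve each foreign key, then sum the measure.
totals = defaultdict(float)
for row in fact_distribution:
    key = (dim_activity[row["activity_key"]]["donor"],
           dim_date[row["date_key"]]["quarter"])
    totals[key] += row["total_value"]
# totals[("ECHO", "2024-Q1")] -> 120.0
```

In SQL this is the same shape: join the fact to two dimensions and GROUP BY donor, quarter.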
Slowly changing dimensions
Dimensions change over time: a beneficiary moves to a different location, a project changes donors, an item is reclassified. Slowly changing dimension (SCD) techniques preserve historical accuracy while enabling current-state queries.
SLOWLY CHANGING DIMENSION PATTERNS

  Type 1: overwrite
    Before: BEN001 | Nairobi | current
    After:  BEN001 | Mombasa | current
    History lost. Use when history is irrelevant or for corrections.

  Type 2: add row
    Row 1: BEN001 | Nairobi | 2023-01-01 | 2024-06-30 | N
    Row 2: BEN001 | Mombasa | 2024-07-01 | 9999-12-31 | Y
    Full history preserved. A current flag or date range identifies the
    active record. Use when historical analysis is required.

  Type 3: add column
    BEN001 | Mombasa (current) | Nairobi (previous)
    Limited history (typically one prior value). Use for simple
    before/after comparisons.

  Type 6: hybrid (1 + 2 + 3)
    Row 1: BEN001 | Nairobi | Mombasa | 2023-01-01 | 2024-06-30
    Row 2: BEN001 | Mombasa | Mombasa | 2024-07-01 | 9999-12-31
    Current value in every row, plus full history as rows, plus a
    previous-value column. Maximum flexibility, highest storage cost.

Figure 5: Slowly changing dimension types with trade-offs
Type 2 SCD dominates humanitarian analytics where historical accuracy affects reporting and accountability. Distributions must associate with the beneficiary’s location at distribution time, not their current location. A beneficiary receiving food in Nairobi in March and later relocating to Mombasa should not appear in Mombasa’s March distribution reports.
The implementation uses surrogate keys (beneficiary_key) distinct from business keys (beneficiary_id). The fact table references surrogate keys, preserving the exact dimensional context at transaction time. Queries for current state filter on the is_current flag; historical queries join on date ranges.
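A Type 2 change amounts to closing the current row and appending a new version with a fresh surrogate key. The sketch below is an illustrative in-memory version, not a production ETL pattern; the column names follow the figures above:

```python
from datetime import date

# Dimension with one current row (hypothetical data).
dim_beneficiary = [
    {"beneficiary_key": 1, "beneficiary_id": "BEN-001",
     "location": "Nairobi", "valid_from": date(2023, 1, 1),
     "valid_to": date(9999, 12, 31), "is_current": True},
]

def scd2_update(dim, business_id, new_location, change_date):
    """Close the current row for business_id and append a new version."""
    current = next(r for r in dim
                   if r["beneficiary_id"] == business_id and r["is_current"])
    if current["location"] == new_location:
        return  # nothing changed; no new version needed
    current["valid_to"] = change_date
    current["is_current"] = False
    dim.append({
        # New surrogate key: facts recorded from now on reference this row.
        "beneficiary_key": max(r["beneficiary_key"] for r in dim) + 1,
        "beneficiary_id": business_id, "location": new_location,
        "valid_from": change_date, "valid_to": date(9999, 12, 31),
        "is_current": True,
    })

scd2_update(dim_beneficiary, "BEN-001", "Mombasa", date(2024, 7, 1))
# Two rows now: Nairobi (closed) and Mombasa (current).
```

Facts written before July 2024 keep pointing at surrogate key 1 (Nairobi), which is exactly what preserves the Nairobi distribution reports.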
Grain and aggregation
The grain of a fact table defines what each row represents: one distribution to one beneficiary on one date, or one distribution per beneficiary per month, or one distribution per household per activity. Declaring grain precisely prevents ambiguity in metrics.
Consider a fact table at daily beneficiary grain containing 100,000 distributions. Aggregating to monthly location grain might yield 500 rows. Pre-aggregating common rollups into aggregate fact tables accelerates dashboards while the base fact table preserves detail for drill-down.
Additive facts (quantity, value) aggregate freely across all dimensions. Semi-additive facts (balance, inventory level) aggregate across some dimensions but not others: summing month-end balances across locations makes sense; summing across months does not. Non-additive facts (ratios, percentages) require aggregating components first: averaging vulnerability scores directly produces incorrect results if population sizes differ.
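The non-additive case is easy to get wrong, so a worked example helps (hypothetical numbers): averaging per-location vulnerability scores directly ignores population size, while aggregating the components first weights correctly.

```python
# Each group carries the components of the ratio, not the ratio itself.
groups = [
    {"location": "A", "beneficiaries": 900, "vulnerability_sum": 4500},  # mean 5.0
    {"location": "B", "beneficiaries": 100, "vulnerability_sum": 900},   # mean 9.0
]

# Wrong: average of the per-group averages treats both groups as equal weight.
naive = sum(g["vulnerability_sum"] / g["beneficiaries"] for g in groups) / len(groups)

# Right: sum the components across groups, then take the ratio once.
correct = (sum(g["vulnerability_sum"] for g in groups)
           / sum(g["beneficiaries"] for g in groups))

# naive   -> 7.0
# correct -> 5.4
```

The 1.6-point gap comes entirely from location B, which holds 10% of the population but half the weight in the naive average.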
NoSQL modelling patterns
Document databases, key-value stores, wide-column stores, and graph databases require modelling approaches distinct from relational design. The absence of joins shifts complexity into data structure and access pattern design.
Document modelling
Document databases (MongoDB, CouchDB) store nested structures, enabling denormalised designs that embed related data within parent documents. The modelling question becomes: embed or reference?
Embedding suits data accessed together, with bounded growth, and owned by the parent. A beneficiary document embedding their needs assessment responses works well: assessments belong to one beneficiary, the array size is bounded (tens of assessments over years), and retrieving a beneficiary typically requires their assessment history.
Referencing suits data with independent lifecycles, unbounded growth, or many-to-many relationships. Distribution records should reference beneficiaries rather than embedding within them: distributions grow unboundedly, distributions have independent queries (all distributions in a location), and embedding would duplicate beneficiary data across thousands of distribution documents.
// Embedded pattern: Beneficiary with assessments
{
  "_id": "BEN-2024-00001",
  "given_name": "Amina",
  "family_name": "Hassan",
  "household_id": "HH-2024-00500",
  "assessments": [
    {
      "date": "2024-01-15",
      "type": "vulnerability",
      "score": 7,
      "responses": { "food_security": 3, "shelter": 2, "health": 2 }
    },
    {
      "date": "2024-06-20",
      "type": "vulnerability",
      "score": 5,
      "responses": { "food_security": 2, "shelter": 1, "health": 2 }
    }
  ]
}
// Referenced pattern: Distribution referencing beneficiary
{
  "_id": "DIST-2024-00001",
  "beneficiary_id": "BEN-2024-00001",
  "activity_id": "ACT-2024-00010",
  "location_id": "LOC-KE-047-001",
  "date": "2024-03-15",
  "items": [
    { "code": "FOOD-001", "quantity": 25, "unit": "kg" },
    { "code": "NFI-005", "quantity": 2, "unit": "pcs" }
  ]
}

Wide-column modelling
Wide-column stores (Apache Cassandra, HBase) organise data by row key and column families, optimised for distributed write-heavy workloads. The modelling principle is query-driven: design tables around access patterns, accepting data duplication across tables serving different queries.
A time-series access pattern (all distributions for a location in a date range) suggests a table keyed by (location_id, date) with distribution details in columns. A beneficiary-centric pattern (all distributions for a beneficiary) suggests a separate table keyed by (beneficiary_id, date). The same distribution data appears in both tables, written twice but enabling efficient reads for both patterns.
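The dual-write pattern can be sketched with two dicts standing in for the two tables (hypothetical keys and rows); each write lands in both, and each read is a single partition-key lookup with no join:

```python
# Two "tables", each keyed for a different access pattern.
by_location = {}     # partition key: (location_id, date)
by_beneficiary = {}  # partition key: (beneficiary_id, date)

def record_distribution(dist):
    """Write the same distribution to both query-specific tables."""
    by_location.setdefault(
        (dist["location_id"], dist["date"]), []).append(dist)
    by_beneficiary.setdefault(
        (dist["beneficiary_id"], dist["date"]), []).append(dist)

record_distribution({"distribution_id": "D-1", "location_id": "LOC-01",
                     "beneficiary_id": "BEN-001", "date": "2024-03-15"})

# Either question is now a direct key lookup:
#   by_location[("LOC-01", "2024-03-15")]
#   by_beneficiary[("BEN-001", "2024-03-15")]
```

The storage cost of duplication is the deliberate trade for read efficiency; in Cassandra terms, each dict corresponds to a table with its own partition key.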
Graph modelling
Graph databases (Neo4j, Amazon Neptune) model entities as nodes and relationships as edges, both carrying properties. The approach excels for relationship-heavy queries: referral pathways between services, household relationship networks, supply chain tracing.
Modelling decisions concern what becomes a node versus a relationship property. Service referrals benefit from referral as a relationship between service provider nodes, carrying properties like referral date and status. The query “find all beneficiaries referred through more than three services” traverses the graph efficiently without joins.
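The referral query can be sketched over a plain edge list; a real graph database would traverse stored edges, but the shape of the answer is the same. Service names and identifiers below are hypothetical:

```python
from collections import Counter

# Referral edges between service nodes, each tagged with the beneficiary
# it concerns (a real edge would also carry date and status properties).
referrals = [  # (from_service, to_service, beneficiary_id)
    ("health_post", "clinic", "BEN-001"),
    ("clinic", "hospital", "BEN-001"),
    ("hospital", "protection_desk", "BEN-001"),
    ("health_post", "clinic", "BEN-002"),
]

# For a referral chain, n edges touch n + 1 services.
hops = Counter(beneficiary for _, _, beneficiary in referrals)
referred_through_more_than_three = {b for b, n in hops.items() if n + 1 > 3}
# BEN-001 has 3 referral hops, so 4 services -> included
```

In Cypher this would be a variable-length path match; the point is that no join tables are needed to follow the chain.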
Modelling for programme data
Mission-driven organisations share common data domains: beneficiaries and households, locations with administrative hierarchies, activities and interventions, outputs and outcomes. Modelling these domains consistently enables interoperability and standard reporting.
Beneficiary and household
The beneficiary-household relationship carries nuance. Individual-level data supports deduplication, targeting accuracy, and rights-based approaches. Household-level data reflects resource sharing, vulnerability dynamics, and distribution efficiency. Models must support both perspectives.
A common pattern establishes household as the registration unit with beneficiaries as members. The household has a head (for communication and collection), a location, and household-level attributes (size, composition). Beneficiaries have individual attributes (demographics, vulnerability indicators) and a role within the household. Distributions can target either level: food baskets to households, vaccinations to individuals.
This structure requires clear rules about what constitutes a household (co-residence, economic unit, self-identification), how household changes propagate (splitting, merging, relocation), and which beneficiary serves as household proxy when individual data is unavailable.
Location hierarchies
Administrative boundaries form the primary spatial organisation: country, region, district, sub-district, community. P-codes (place codes) from Common Operational Datasets provide standard identifiers enabling interoperability across organisations.
The location model must handle hierarchy navigation (aggregate from community to district), temporal validity (boundaries change over time), and multiple hierarchies (administrative, operational, health system). A location table with self-referential parent relationship supports arbitrary depth; separate columns for each administrative level enable efficient querying at known levels.
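The self-referential pattern can be sketched in a few lines; ancestor_at and the codes below are illustrative, not a standard P-code API:

```python
# Location table with self-referential parent links (hypothetical codes).
locations = {
    "KE":        {"name": "Kenya",   "level": "country",   "parent": None},
    "KE-047":    {"name": "Garissa", "level": "district",  "parent": "KE"},
    "KE-047-01": {"name": "Dadaab",  "level": "community", "parent": "KE-047"},
    "KE-047-02": {"name": "Fafi",    "level": "community", "parent": "KE-047"},
}

def ancestor_at(code, level):
    """Walk parent links up to the ancestor at the requested level."""
    while code is not None and locations[code]["level"] != level:
        code = locations[code]["parent"]
    return code

# Aggregate community-level counts up to district level.
counts = {"KE-047-01": 1200, "KE-047-02": 800}
district_totals = {}
for code, n in counts.items():
    district = ancestor_at(code, "district")
    district_totals[district] = district_totals.get(district, 0) + n
# district_totals -> {"KE-047": 2000}
```

In SQL the same walk is a recursive CTE over the parent column; the per-level columns mentioned above trade that recursion for a wider table.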
Coordinates (latitude/longitude) on locations enable mapping and spatial queries but require decisions about precision, datum, and what point represents a polygon (administrative areas have centroids, not point locations).
Activity and output tracking
Activities group related actions under projects funded by donors. The activity carries modality (cash, in-kind, service), sector (WASH, health, protection), and implementation period. Outputs are the countable products of activities: distributions made, trainings conducted, facilities constructed.
The model links activities to locations (where implemented), beneficiaries (who is served), and outputs (what is produced). This triple linkage enables 4W/5W reporting: Who did What, Where, When, and for Whom.
Output metrics require careful modelling. Counting unique beneficiaries across activities demands deduplication at reporting time or maintaining running counts with proper handling of re-registrations and household changes. Pre-aggregated output tables accelerate dashboards but require clear refresh strategies.
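The distinction between output rows and unique reach is a one-line set operation once identifiers are reliable (hypothetical rows):

```python
# Output rows: one per beneficiary per activity.
outputs = [
    {"activity_id": "ACT-01", "beneficiary_id": "BEN-001"},
    {"activity_id": "ACT-02", "beneficiary_id": "BEN-001"},  # same person again
    {"activity_id": "ACT-02", "beneficiary_id": "BEN-002"},
]

service_contacts = len(outputs)                              # counts rows: 3
unique_reach = len({o["beneficiary_id"] for o in outputs})   # deduplicates: 2
```

Reporting "3 beneficiaries reached" here would double-count BEN-001; both figures are useful, but they answer different questions and must be labelled as such.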
Naming conventions and standards
Consistent naming reduces confusion, enables automation, and signals professional data management.
| Element | Convention | Example |
|---|---|---|
| Table names | Lowercase, snake_case, plural | beneficiaries, distribution_items |
| Column names | Lowercase, snake_case, singular | given_name, distribution_date |
| Primary keys | table_singular_id | beneficiary_id, household_id |
| Foreign keys | referenced_table_id | household_id in beneficiaries table |
| Boolean columns | is_ or has_ prefix | is_active, has_consent |
| Date columns | _date or _at suffix | registration_date, created_at |
| Timestamp columns | _at suffix with timezone | updated_at |
| Status columns | status or state suffix | registration_status |
| Dimension tables | dim_ prefix | dim_beneficiary, dim_location |
| Fact tables | fact_ prefix | fact_distribution |
| Junction tables | both table names | beneficiary_activities |
Reserved words (status, date, time, user) cause problems across database systems. Prefix or suffix them: registration_status, activity_date, created_by_user.
Column comments and table comments document purpose, valid values, and business rules directly in the database catalog. These comments propagate to data catalogues and documentation automatically.
Technology options
Data modelling tools range from simple diagramming to full lifecycle management with versioning, collaboration, and forward/reverse engineering.
Open source and free tools
dbdiagram.io provides browser-based modelling with a domain-specific language for defining tables and relationships. The syntax is quick to learn (table_name { column_name type }) and exports to SQL, PDF, and image formats. Suitable for logical modelling and documentation; it lacks reverse engineering from existing databases.
SchemaSpy generates documentation from existing databases, producing HTML reports with entity-relationship diagrams, column details, and relationship maps. Useful for understanding legacy systems; does not support forward engineering or active modelling.
pgModeler offers PostgreSQL-specific modelling with full forward and reverse engineering. The desktop application handles large schemas and generates DDL with PostgreSQL-specific features (partial indexes, table inheritance). Limited to PostgreSQL; steeper learning curve.
DBeaver (Community Edition) includes basic ER diagram generation from connected databases. The diagramming is secondary to its database administration focus but sufficient for documentation and simple modelling tasks.
Commercial options
ERwin provides enterprise data modelling across database platforms with versioning, collaboration, and integration with data catalogues. The high license cost positions it for large IT departments with dedicated data architecture functions.
ER/Studio offers similar enterprise capabilities with particularly strong reverse engineering and model comparison features. Useful for organisations managing models across multiple database platforms and versions.
Lucidchart and draw.io provide general diagramming with ER notation support. Accessible for occasional modelling needs but lack database-specific features like DDL generation.
For organisations with limited data architecture capacity, dbdiagram.io or draw.io suffices for documentation and communication. Organisations managing multiple databases benefit from pgModeler (if PostgreSQL-focused) or SchemaSpy for automated documentation.
Implementation considerations
For organisations with limited IT capacity
Start with documentation of existing structures rather than formal modelling tools. Capture current tables in a spreadsheet: table name, column names, data types, relationships, and purpose. This inventory enables impact analysis and supports new staff onboarding without tool investment.
Use naming conventions from the outset for new development. Retrofitting conventions to existing databases creates migration overhead; applying them to new tables costs nothing and accumulates value over time.
Adopt star schemas for reporting databases even without a formal data warehouse. A simple fact table with dimension foreign keys, built as views or materialised queries over operational data, dramatically improves report performance compared to querying normalised structures directly.
Document the conceptual model in terms programme staff use. A diagram showing beneficiaries, households, activities, and distributions communicates across technical and programme teams. Tools like draw.io (free) produce adequate diagrams without specialised modelling software.
For organisations with established data functions
Invest in logical modelling discipline before physical optimisation. A well-documented logical model survives technology migrations; physical optimisations require rework when platforms change.
Implement SCD Type 2 for dimensions requiring historical analysis. The additional complexity in ETL processes pays dividends in reporting accuracy and audit capability. Configure surrogate key generation and effective dating systematically rather than ad-hoc per dimension.
Establish model review processes before production deployment. A data architect or senior developer reviewing models catches normalisation violations, naming inconsistencies, and missing constraints before they become technical debt.
Integrate modelling with metadata management. Models documented in data catalogues (Apache Atlas, DataHub) enable discovery, lineage tracking, and impact analysis. Manual synchronisation between modelling tools and catalogues accumulates drift; investigate integration capabilities.
Field deployment considerations
Offline data collection tools (KoboToolbox, ODK) generate nested data structures that require flattening for relational storage. Design import pipelines that preserve the relationship between parent forms and repeating groups while mapping to normalised tables.
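The flattening step can be sketched as follows; the field names are hypothetical, not KoboToolbox’s or ODK’s actual export schema. One submission yields a parent row plus child rows carrying the parent’s identifier:

```python
# A nested submission: parent fields plus a repeating group of members.
submission = {
    "submission_id": "SUB-001",
    "household_size": 4,
    "members": [  # repeating group
        {"name": "Amina", "age": 34},
        {"name": "Khalid", "age": 9},
    ],
}

def flatten(sub):
    """Split one nested submission into household and member table rows."""
    parent = {k: v for k, v in sub.items() if k != "members"}
    # Each child row carries the parent's identifier, preserving the link.
    children = [dict(m, submission_id=sub["submission_id"])
                for m in sub["members"]]
    return parent, children

household_row, member_rows = flatten(submission)
```

The same shape generalises to multiple repeating groups: one child table per group, each keyed back to the submission identifier.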
Synchronisation between field systems and central databases requires attention to primary key generation. UUID-based keys enable offline generation without collision; integer sequences require centralised assignment or partition schemes.
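The offline-generation property is simple to demonstrate with Python’s standard uuid module; version-4 UUIDs are random, so devices need no coordination and collisions are negligible in practice:

```python
import uuid

# Each device generates its own keys with no central sequence.
offline_rows = [{"beneficiary_id": str(uuid.uuid4()), "device": "tablet-07"}
                for _ in range(3)]

# All keys are distinct without any server round-trip.
distinct_keys = {r["beneficiary_id"] for r in offline_rows}
```

The cost is a wider key (16 bytes versus 4 or 8 for an integer) and non-sequential index inserts, which is usually an acceptable trade for conflict-free synchronisation.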
Limited bandwidth constrains model complexity for replicated databases. Highly normalised models requiring many table synchronisations strain connectivity more than denormalised designs. Evaluate the join requirement at the field level: if field applications need only local data, synchronise the minimum tables required.
Protection data modelling
Models containing protection data (GBV, child protection, trafficking) require restricted visibility columns or separate schemas with distinct access controls. Standard beneficiary models should not include protection identifiers that could link individuals to sensitive cases. See Protection Data Classification for classification requirements.