
Chapter 4: Entity Resolution and Master Data

Chapter Introduction

Every chapter before this one has assumed that you know who is who — that “Patient C1” in one source is the same person as “Patient C1” in another, that “BNP Paribas SA” in a market-data feed is the same legal entity as “BNP PARIBAS” in a regulatory filing, that the power plant “Bayswater Unit 4” in EIA Form 860 is the same generator as plant ID 1378 in PJM’s queue. In the real world this assumption is wrong roughly all of the time. Source systems disagree on identifiers, names get spelled five ways, identifiers get reused, mergers re-key entities silently, and the same human or asset shows up under different IDs depending on which system recorded the event.

Entity resolution is the discipline of joining records that refer to the same real-world entity even when their identifiers don’t match. Master Data Management (MDM) is the operational practice of keeping a canonical, governed registry of the resolved entities so that downstream systems can stop solving the problem repeatedly. Together they are the silent prerequisite for every analytic in the book. Skip them and every subsequent statistical inference is contaminated by entity-noise that the model interprets as signal.

This chapter spends most of its time on the four domain case studies the program centres on:

  • Macroeconomics — matching countries, regions, and statistical agencies across IMF, World Bank, BIS, OECD, UN, and central-bank identifiers; the SDMX agency-codes ontology.
  • Quantitative trading — matching corporate entities across ticker, CUSIP, ISIN, SEDOL, FIGI, RIC, Bloomberg ticker, and LEI; the Global LEI System (GLEIS) and the OpenFIGI registry.
  • Healthcare — matching patients across electronic health records, claims, and registries when MRNs differ; HL7 FHIR’s Patient.identifier model and probabilistic linkage at population-health scale.
  • Energy — matching generating units, substations, and balancing-authority assets across EIA, FERC, NERC, and ISO/RTO queues; ISO 32400 / NAESB asset identifiers.

The chapter teaches three methodological families — deterministic matching, probabilistic record linkage (Fellegi–Sunter), and modern ML-based entity resolution — and one operational discipline (MDM with stewardship, lineage, and adjudication). The browser-runnable examples use synthetic data designed to exercise the same matching difficulties that the real domain datasets present.


Table of Contents

  1. The Entity-Resolution Problem
  2. Deterministic Matching — Rule-Based Joins
  3. Fuzzy String Matching and Token Methods
  4. The Fellegi–Sunter Probabilistic Model
  5. ML-Based Entity Resolution
  6. Master Data Management — the Operational Layer
  7. Case Study: Quantitative Trading — CUSIP, ISIN, FIGI, LEI
  8. Case Study: Healthcare — Patient Matching at Scale
  9. Case Study: Macroeconomics — Country and Agency Codes
  10. Case Study: Energy — Generator and Substation Matching

The Entity-Resolution Problem

Given a stream of records that mention real-world entities, decide for each new record which previously-known entity it refers to — or whether it refers to a new entity that has not been seen before. Mathematically, given two records \(r_1, r_2\) and a similarity function \(s(r_1, r_2)\), the decision is a binary one: match (same entity) or non-match. In practice we add a third class — possible match — that routes to a human reviewer.

Three failure modes that recur in every domain:

  • False matches — declaring two distinct entities the same. In healthcare, this can put one patient’s chart in front of another patient’s doctor. In trading, it can route an order to the wrong instrument. In energy, it can mis-assign generation to the wrong dispatch queue.
  • False non-matches — failing to recognise that two records refer to the same entity. These split the data across “shadow” identities, so analytics undercount events per entity and overestimate the size of the distinct population.
  • Identifier drift — the right identifier changes over time. A merger re-keys a legal entity; a hospital migrates EHR systems and re-issues MRNs; a country re-codes its statistical reporting after a new agency takes over.

The economic damage of an entity-resolution failure is almost always larger than the cost of doing entity resolution properly. The two together — false matches and false non-matches — silently corrupt every downstream statistic. This is why every serious organisation that has matured past the demo stage runs a dedicated MDM function.

Deterministic Matching — Rule-Based Joins

The simplest entity-resolution approach: declare a match when one or more fields are equal. Use it when at least one authoritative, unique, stable identifier exists. Examples where deterministic matching is enough:

  • LEI (Legal Entity Identifier) for legal entities post-2013.
  • ISIN for securities globally.
  • ISO 3166-1 alpha-3 country codes for countries.
  • LOINC codes for laboratory tests.

The discipline of deterministic matching is not in the SQL JOIN — it is in being sure the identifier is actually unique, authoritative, and stable across all sources you are joining. Three common silent failures:

  • Reuse: a customer_id that was retired and later reissued to a different customer.
  • Leading-zero corruption: leading zeros stripped by Excel during a CSV round-trip turn 0012345 into 12345, causing both false matches against an unrelated 12345 and false non-matches against the true 0012345.
  • Type coercion: a numeric identifier silently joined to a string identifier of the same value.
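The leading-zero and type-coercion failures can be reproduced in a few lines (identifier values illustrative):

```python
# Identifiers as issued are strings; an Excel/CSV round-trip strips the zeros.
ids_source_a = {"0012345", "0067890"}
ids_source_b = {"12345", "67890"}

# False non-matches: the exact-match join finds nothing.
print(ids_source_a & ids_source_b)                        # set()

# The naive "fix" of casting to int creates the opposite hazard:
# an unrelated bare 12345 now collides with the true 0012345.
print({int(x) for x in ids_source_a} & {12345, 99999})    # {12345}
```

Both failures are silent: the join runs without error, and only the row counts hint that something is wrong.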

Where this fails is also obvious: any record that lacks the authoritative identifier (a legal entity without an LEI, say) cannot be deterministically matched. In practice 20–40% of records in raw data lack an authoritative identifier on first ingest, and the residual must be matched with probabilistic methods.

Fuzzy String Matching and Token Methods

When the keys disagree only in formatting, fuzzy string distance recovers the match. Three measures dominate:

  • Levenshtein distance — minimum edit operations (insertions, deletions, substitutions) to transform one string into another. Sensitive to typos.
  • Jaro–Winkler — gives more weight to a common prefix; the default for human-name matching.
  • Token-set / Jaccard — represent each string as a set of tokens (words or n-grams) and compute set-overlap. Robust to word reordering (“New York City” vs. “City of New York”).
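Two of these measures fit in a toy pure-Python sketch (the strings and the word-level tokenisation are illustrative; exact scores depend on how you tokenise):

```python
import re

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    # token-set overlap: lowercase word tokens, punctuation dropped
    ta = set(re.findall(r"[a-z0-9']+", a.lower()))
    tb = set(re.findall(r"[a-z0-9']+", b.lower()))
    return len(ta & tb) / len(ta | tb)

a, b = "People's Republic of China", "China, People's Republic of"
print(levenshtein(a, b))   # large edit distance: looks like a non-match
print(jaccard(a, b))       # 1.0: identical token sets, a near-certain match
```

A usage note: swap the tokeniser for character n-grams and the same `jaccard` function becomes robust to typos inside words as well as to reordering.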

Notice the behaviour on a reordered country name such as “People’s Republic of China” vs. “China, People’s Republic of”: the token sets overlap almost completely, so the Jaccard score signals a near-certain match, while the raw Levenshtein distance is large enough to suggest a non-match, even though both strings name the same country. The lesson: in real entity resolution you compute multiple similarity measures and combine them (next section).

Production-grade implementations of these measures live in the rapidfuzz package (C++-backed and very fast); toy pure-Python versions are enough to demonstrate the conceptual core.

The Fellegi–Sunter Probabilistic Model

The 1969 paper by Ivan Fellegi and Alan Sunter formalises probabilistic record linkage and is still the basis of every government-scale entity-resolution system (US Census, NHS, every national statistical institute, every healthcare population analytics platform).

For each pair \((r_1, r_2)\) and each comparable field \(f\), define:

  • \(m_f = P(\text{field } f \text{ agrees} \mid \text{the pair is a true match})\)
  • \(u_f = P(\text{field } f \text{ agrees} \mid \text{the pair is a non-match})\)

The log-likelihood weight for the pair is \(W = \sum_{f \in A} \log \frac{m_f}{u_f} + \sum_{f \in D} \log \frac{1 - m_f}{1 - u_f}\), where \(A\) is the set of fields on which the pair agrees and \(D\) the set on which it disagrees (the disagreement terms are negative, since \(m_f > u_f\) implies \(1 - m_f < 1 - u_f\)). Two thresholds on \(W\) split pairs into match, possible match (sent to human review), and non-match.

The clean property: the per-field weights are learned from data. Fields that almost-never agree by chance (exact_dob, exact_address) carry high positive weight on agreement; fields that often agree by chance (gender == M) carry low weight.
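The scoring step is a minimal sketch; the \(m\)/\(u\) values and thresholds below are illustrative hand-set assumptions, where a real system estimates them from labelled pairs or via EM:

```python
import math

# assumed m/u probabilities per field (illustrative, not estimated from data)
M = {"last_name": 0.95, "dob": 0.97, "postcode": 0.90, "gender": 0.98}
U = {"last_name": 0.01, "dob": 0.001, "postcode": 0.05, "gender": 0.50}

def fs_score(a, b):
    """Sum of log2 likelihood-ratio weights over comparable fields."""
    w = 0.0
    for f in M:
        if f not in a or f not in b:
            continue                                    # missing field: no evidence
        if a[f] == b[f]:
            w += math.log2(M[f] / U[f])                 # agreement: positive weight
        else:
            w += math.log2((1 - M[f]) / (1 - U[f]))     # disagreement: negative weight
    return w

def classify(score, upper=8.0, lower=0.0):
    # two thresholds route each pair one of three ways (thresholds illustrative)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"

r1 = {"last_name": "smith", "dob": "1980-01-01", "postcode": "10001", "gender": "F"}
r2 = {"last_name": "smith", "dob": "1980-01-01", "postcode": "10001", "gender": "F"}
print(classify(fs_score(r1, r2)))   # match
```

Note that agreement on `dob` contributes nearly ten bits here while agreement on `gender` contributes under one, which is exactly the weighting behaviour described above.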

A few hundred lines of code, no ML training, and you have a production-grade probabilistic record linker. The Fellegi–Sunter approach has been deployed continuously at the US Census Bureau for over fifty years; the algorithm is unchanged because it works.

Module reference — production record linkage
  • splink (UK Ministry of Justice, open-source) — modern Fellegi–Sunter on DuckDB / Spark. The current standard for population-scale record linkage. Used by the UK NHS, the Office for National Statistics, the Australian Bureau of Statistics.
  • recordlinkage — pure-Python implementation; good for prototyping.
  • dedupe — Python library with active-learning labelling support.
  • zingg — JVM-based, Spark-native; used in financial-services MDM.

ML-Based Entity Resolution

The contemporary ER stack adds a learned model on top of the Fellegi–Sunter weights. Two practical approaches:

  • Train a classifier on labelled pairs. Construct similarity features for each candidate pair, label a few thousand pairs manually, train a gradient-boosted classifier. Outperforms hand-tuned Fellegi–Sunter when the matching signal is non-linear.
  • Embedding-based blocking + match scoring. Embed each record’s text into a vector; ANN-retrieve candidate pairs (Chapter 8 of Volume I); score the top-k pairs with the classifier above. This is how Google, Amazon, and every major fraud-detection vendor scales ER to billions of records.

The cost-saving move is the blocking step. A naive all-pairs comparison on \(n\) records is \(O(n^2)\); for \(n = 10^9\) this is intractable. Blocking restricts the candidate pairs to records that already share a coarse signal (e.g., same postal code, or top-10 ANN neighbours under an embedding). Modern ER systems achieve million-fold reductions in pair-volume without losing recall.
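A blocking sketch on synthetic records; the blocking key here (postcode plus the first two letters of the surname) is one common choice, assumed for illustration:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith", "postcode": "10001"},
    {"id": 2, "surname": "Smyth", "postcode": "10001"},
    {"id": 3, "surname": "Jones", "postcode": "94105"},
    {"id": 4, "surname": "Jonas", "postcode": "94105"},
    {"id": 5, "surname": "Brown", "postcode": "60601"},
]

def block_key(r):
    # coarse shared signal: same postcode and same surname prefix
    return (r["postcode"], r["surname"][:2].lower())

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)

# compare only within blocks, never across them
candidates = [pair for grp in blocks.values() for pair in combinations(grp, 2)]

n = len(records)
print(len(candidates), "candidate pairs vs", n * (n - 1) // 2, "all-pairs")  # 2 vs 10
```

Even on five records the pair volume drops fivefold; on billions of records the same idea, with ANN neighbours as the block, is what makes ER tractable at all.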

The classifier replaces hand-tuned weights with an automatic procedure that handles non-linear interactions. Production teams often run both Fellegi–Sunter and a classifier and ensemble the scores — F-S provides interpretable per-field weights for the audit trail, and the classifier captures patterns F-S misses.

Master Data Management — the Operational Layer

Entity resolution is the algorithm. Master Data Management is the operational system that keeps the resolved entities canonical, governed, and consumable. A working MDM stack has these components:

  • Golden record — for each resolved entity, a single canonical row with the best-known values for each attribute, derived from one or more source records.
  • Cross-walk — a table mapping every source identifier to the canonical entity ID. Bidirectional, versioned, audit-logged.
  • Stewardship workflow — humans adjudicate the “possible match” pile and override system decisions. The UI is typically a triage queue.
  • Survivorship rules — when source A and source B disagree on an attribute (one says name = "BNP Paribas SA", the other says "BNP PARIBAS"), the rule decides which value wins (most-trusted source? most-recent value? business-rule override?).
  • Versioned identifiers — the canonical entity ID is stable; the attribute values move through slowly-changing-dimension history (SCD Type 2; Volume II Chapter 3).
  • Outbound publishing — downstream systems subscribe to changes in the golden record (Kafka topic, webhook, or pull-API).

Production MDM platforms include Informatica MDM, Reltio, Stibo Systems, Profisee, and the MDM modules inside Salesforce, SAP, and Oracle. Open-source: OpenMDM, Talend Data Stewardship, and DIY solutions built on splink + Dagster + Postgres.

A simple MDM bookkeeping structure in pure pandas:
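One possible sketch (the source names and identifiers are illustrative):

```python
import pandas as pd

# source records for one resolved entity, keyed by source system
src = pd.DataFrame({
    "source":    ["marketdata", "filings", "crm"],
    "source_id": ["MD-001", "F-778", "C-42"],
    "entity_id": ["ENT-1", "ENT-1", "ENT-1"],   # output of the ER step
    "name":      ["BNP PARIBAS", "BNP Paribas SA", "BNP Paribas"],
})

# cross-walk: every source identifier maps to the canonical entity ID
crosswalk = src[["source", "source_id", "entity_id"]]

# golden record with a longest-string survivorship rule for the name
golden = (src.groupby("entity_id")["name"]
             .agg(lambda names: max(names, key=len))
             .reset_index())
print(golden)   # ENT-1 | BNP Paribas SA
```

A real MDM layer adds versioning, audit columns, and per-attribute rules; the two tables above (cross-walk plus golden record) are the irreducible core.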

Notice the survivorship rule: when three sources spell the entity name three different ways, the longest one wins. This is one of many possible rules (most-recent, most-trusted source, manual override) and choosing the right one is part of the MDM design.

A common objection after the first resolution run is that the distinct-customer count drops (say, by 12%) and stakeholders suspect a bug. Probably the opposite — it is evidence that MDM is working. The 12% drop is the duplicate-customer rate that existed across the seven sources before resolution. Always validate this with a sample: take 50 random “merges” produced by the system, manually verify them, and report the precision. If most merges are correct, the count drop is a feature, not a bug.

Case Study: Quantitative Trading — CUSIP, ISIN, FIGI, LEI

Every security has multiple identifiers, each issued by a different organisation for a different purpose:

  • CUSIP (Committee on Uniform Securities Identification Procedures) — 9 characters, US/Canada securities only.
  • ISIN (International Securities Identification Number) — 12 characters, globally unique, built on the CUSIP for US issues.
  • SEDOL (Stock Exchange Daily Official List) — 7 characters, UK-issued.
  • FIGI (Financial Instrument Global Identifier) — 12 characters, OpenFIGI registry, free to use; the modern open standard.
  • RIC (Reuters Instrument Code) — LSEG-proprietary, often the working identifier in trader UIs.
  • Bloomberg ticker — IBM US Equity, 1299 HK Equity; the Bloomberg-proprietary form.
  • LEI (Legal Entity Identifier) — 20 characters, identifies the issuer, not the instrument.

Real production matching at a quant fund:
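A sketch of the three-level lookup; every identifier value below is deliberately fake, standing in for real CUSIP/FIGI/LEI values:

```python
# Three levels: Issuer (LEI) <- Instrument (FIGI/ISIN) <- Listing (ticker on a venue).
INSTRUMENTS = {
    "FIGI-0001": {"isin": "US0000000001", "issuer_lei": "LEI-AAAA"},
}
LISTINGS = {                      # (ticker, exchange MIC) -> instrument
    ("EXMP", "XNAS"): "FIGI-0001",
    ("EXM",  "XLON"): "FIGI-0001",   # same instrument, different venue and ticker
}
ALIASES = {                       # proprietary identifiers kept as aliases
    ("CUSIP", "000000000"): "FIGI-0001",
    ("RIC",   "EXMP.OQ"):   "FIGI-0001",
}

def issuer_for_ticker(ticker, mic):
    """Walk Listing -> Instrument -> Issuer; joining at the wrong level skips a hop."""
    figi = LISTINGS.get((ticker, mic))
    return INSTRUMENTS[figi]["issuer_lei"] if figi else None

print(issuer_for_ticker("EXMP", "XNAS"))   # LEI-AAAA
```

Joining a venue-level ticker directly to an issuer-level LEI, without passing through the instrument, is exactly the wrong-level join the text warns about: one issuer can have many instruments, and one instrument many listings.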

The lesson: a real quant ontology has at least three levels — Instrument (ISIN/FIGI), Listing (ticker on a venue), Issuer (LEI). Each has its own identifier; the matching is hierarchical and the wrong-level join produces wrong answers.

Practical defaults at modern quant funds:

  • OpenFIGI (free API, Bloomberg-maintained) for the instrument identity.
  • GLEIF (Global LEI Foundation, free) for the issuer identity.
  • The proprietary identifiers (CUSIP, SEDOL, RIC, Bloomberg ticker) are kept as aliases linked to the FIGI/LEI canonical pair.

Case Study: Healthcare — Patient Matching at Scale

A typical US hospital network has 5–20 source EHRs (Epic in one hospital, Cerner in another, custom systems in a third), patient registries (cancer, diabetes, transplant), claims data from multiple payers, and pharmacy records. The same patient appears in dozens of records; only a small fraction share a stable identifier. The patient-matching problem is one of the largest engineering investments at every modern health system.

The HL7 FHIR standard’s Patient.identifier slot accommodates this directly: a patient has multiple identifiers, each tagged with the system that issued it:
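A sketch of the shape, written as a Python dict mirroring the FHIR JSON; the hospital system URIs and values are illustrative, while `http://hl7.org/fhir/sid/us-ssn` is the HL7-defined system URI for US SSNs:

```python
# One patient, three identifiers, each tagged with the issuing system.
patient = {
    "resourceType": "Patient",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "identifier": [
        {"system": "http://hospital-a.example.org/mrn", "value": "A-448812"},
        {"system": "http://hospital-b.example.org/mrn", "value": "E0093321"},
        {"system": "http://hl7.org/fhir/sid/us-ssn",    "value": "000-00-1234"},
    ],
}

def identifiers_for(patient, system):
    # pull every identifier issued by one system
    return [i["value"] for i in patient["identifier"] if i["system"] == system]

print(identifiers_for(patient, "http://hospital-a.example.org/mrn"))   # ['A-448812']
```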

The matching is performed offline by an EMPI (Enterprise Master Patient Index) — software that maintains the canonical PT-NNN IDs and the cross-walk to every source identifier. Production EMPIs: Verato, NextGate, IBM Initiate, and the EMPI modules inside Epic and Cerner themselves.

The standard EMPI feature set:

  • Probabilistic matching on first/last/DOB/SSN-last-4/address using Fellegi–Sunter.
  • ML enhancement on hard cases.
  • Stewardship queue for human adjudication of “possible match” pairs.
  • HIPAA-compliant audit log of every merge and unmerge.
  • API surface so the rest of the hospital IT estate can ask “what’s the canonical ID for this MRN?” in real time.

Match rates of 95–98% are typical at well-run health systems; the residual 2–5% requires manual review and creates real safety hazards if mishandled (a record erroneously merged onto the wrong patient can be life-threatening).

Case Study: Macroeconomics — Country and Agency Codes

A macro analyst pulling data from IMF, World Bank, BIS, OECD, UN, and central banks faces the same problem in a more genteel form: each agency uses a different country code, and the codes evolve when countries join, leave, split, or merge.

The main coding systems:

  • ISO 3166-1 alpha-2 (US, CN, HK) and alpha-3 (USA, CHN, HKG).
  • IMF country codes (numeric, e.g., 111 = United States).
  • World Bank country codes (3-letter, mostly aligned with ISO alpha-3 but with idiosyncrasies — WLD for World, EUU for European Union).
  • UN M49 (numeric, used by UN Statistics Division).
  • BIS reporting codes — country + sector + counterparty type.
  • SDMX agency-codes — the metadata-exchange standard maintained by BIS/ECB/OECD/IMF/Eurostat/UN/World Bank; the canonical cross-walk machinery.
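A minimal cross-walk over a few of these coding systems can be sketched as follows; the two rows are illustrative (verify codes against the issuing agencies before relying on them):

```python
ROWS = [
    # iso2, iso3, un_m49, imf, name
    ("US", "USA", 840, 111, "United States"),
    ("CN", "CHN", 156, 924, "China"),
]

# one lookup per coding system, all pointing at a canonical iso3
LOOKUP = {}
for iso2, iso3, m49, imf, name in ROWS:
    LOOKUP[("iso2", iso2)] = iso3
    LOOKUP[("m49", m49)] = iso3
    LOOKUP[("imf", imf)] = iso3

def canonical(scheme, code):
    # resolve any known code to the canonical ISO alpha-3, or None
    return LOOKUP.get((scheme, code))

print(canonical("imf", 111), canonical("m49", 156))   # USA CHN
```

In practice this table has a few hundred rows and a handful of extra columns (World Bank aggregates like WLD and EUU need their own rows, since they have no ISO code).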

The historical complications are real and consequential:

  • Yugoslavia (YU until 1992) split into multiple successor states with new codes; backfilling time series requires conscious decisions about which successor “inherits” the historical series.
  • Czechoslovakia (CS) split into Czech Republic (CZ) and Slovakia (SK) in 1993; ISO later reassigned CS to Serbia and Montenegro (2003–2006), a textbook case of identifier reuse.
  • South Sudan (SS) emerged from Sudan (SD) in 2011.
  • Hong Kong (HK) is reported separately from China in most series but as a sub-aggregate of China in others; the right treatment depends on which series you’re using.
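One way to make this history explicit is to store validity intervals and successor links rather than bare codes. A sketch, with intervals simplified to years and only a few of the cases above encoded:

```python
CODE_HISTORY = {
    "CS": {"entity": "Czechoslovakia", "valid": (None, 1992), "successors": ["CZ", "SK"]},
    "CZ": {"entity": "Czech Republic", "valid": (1993, None), "successors": []},
    "SK": {"entity": "Slovakia",       "valid": (1993, None), "successors": []},
    "SD": {"entity": "Sudan",          "valid": (None, None), "successors": ["SS"]},
    "SS": {"entity": "South Sudan",    "valid": (2011, None), "successors": []},
}

def active_codes(code, year):
    """Resolve a code to whichever code(s) were valid in the given year."""
    rec = CODE_HISTORY[code]
    start, end = rec["valid"]
    if (start is None or year >= start) and (end is None or year <= end):
        return [code]
    # code not valid that year: follow successor links recursively
    return [c for s in rec["successors"] for c in active_codes(s, year)]

print(active_codes("CS", 2000))   # ['CZ', 'SK']
```

The backfill decision the text describes (which successor inherits a historical series) becomes an explicit attribute of the successor link instead of an undocumented analyst choice.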

The structural lesson is that a country is a type of entity with multiple identifiers (alphabetic codes, numeric codes, member-of-aggregate links) and a history. Treating “country code” as a single column instead of a small ontology corrupts every multi-source macro analysis. The SDMX standard is the formal model behind this; in practice most macro shops build a thin canonical layer on top of SDMX and stop there.

Case Study: Energy — Generator and Substation Matching

A US utility-scale generation analyst sees the same plant in EIA Form 860 (annual filing), EIA Form 923 (monthly generation data), FERC Form 714 (planning), the relevant ISO/RTO’s queue (PJM, ERCOT, CAISO, MISO, NYISO, ISO-NE), NERC’s GADS (generator availability data), and emissions data from EPA’s CEMS. Each system uses a different plant identifier; the matching is non-trivial and high-stakes for any analysis that combines emissions, generation, capacity, and market data.

The dominant identifiers:

  • EIA Plant Code — primary US Department of Energy identifier (e.g., 60001 for Bayswater).
  • EIA Generator Unit ID — sub-plant identifier; a plant has multiple units.
  • EIA ORISPL Code — older identifier still used in EPA CEMS data; mostly but not always equal to the EIA Plant Code.
  • ISO/RTO Queue ID — the regional market operator’s identifier, used in interconnection studies and market-clearing.
  • NERC GADS Unit Code — used in reliability/availability reporting.
  • DUNS / LEI — for the operating company (separate entity).

The energy-domain trick is that the plant and the generator unit are both objects in the ontology, with the unit being a sub-component of the plant, and external identifiers attaching at different levels (EIA Plant at the plant level, EIA Unit at the generator-unit level, NERC GADS Unit at the generator-unit level, ISO Queue at the project-or-unit level depending on the ISO’s convention). Modelling this correctly is the difference between a credible emissions analysis and one that double-counts or misses entire facilities.
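The two-level attachment can be sketched directly; the identifier values below are illustrative:

```python
# Identifiers attach at the level they actually identify: plant vs generating unit.
PLANT = {
    "canonical_id": "PLANT-0001",
    "identifiers": {"eia_plant": "60001", "orispl": "60001"},   # plant-level
    "units": [
        {"canonical_id": "UNIT-0001-1",
         "identifiers": {"eia_unit": "1", "gads_unit": "G-0001", "iso_queue": "AB1-001"}},
        {"canonical_id": "UNIT-0001-2",
         "identifiers": {"eia_unit": "2", "gads_unit": "G-0002", "iso_queue": "AB1-002"}},
    ],
}

def resolve(plant, scheme, value):
    """Return (level, canonical_id) for an external identifier, or None."""
    if plant["identifiers"].get(scheme) == value:
        return ("plant", plant["canonical_id"])
    for u in plant["units"]:
        if u["identifiers"].get(scheme) == value:
            return ("unit", u["canonical_id"])
    return None

print(resolve(PLANT, "gads_unit", "G-0002"))   # ('unit', 'UNIT-0001-2')
```

An emissions analysis that joins CEMS (plant-level ORISPL) to GADS (unit-level) without walking this hierarchy either double-counts units or drops them, which is the failure mode the paragraph above describes.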

Chapter Wrap-up

Entity resolution and master data management are the silent prerequisite to every analytic. The three methodological families — deterministic matching, Fellegi–Sunter probabilistic linkage, ML-based ER — each have their place. The operational layer (MDM, stewardship, survivorship, lineage) is what turns ER from a one-off project into an ongoing capability.

The four domain case studies share the same structure: each has multiple identifier systems, each requires modelling at multiple levels (instrument/issuer in finance; patient/episode in healthcare; country/aggregate in macro; plant/unit in energy), and each has historical idiosyncrasies that have to be captured explicitly. The cross-walk is the most-reused asset in any data platform — invest in it once, save hundreds of analyst-hours per year.

Chapter 5 takes up the question of which published ontologies you should adopt before inventing your own — FIBO, SNOMED CT, GICS, GS1, the energy-sector CIM, and the SDMX agency-codes that anchor the macro examples above.


Prof. Xuhu Wan · HKUST ISOM · Domain Modelling in Python