Chapter 10: Trust, Lineage, and Governance
Chapter Introduction
A modelled domain that nobody trusts is operationally worthless. Trust is engineered, not assumed; it is the product of explicit lineage (where every value came from), comprehensive audit trails (every action and decision recorded), formal governance (who is allowed to do what), and proactive compliance with the regulatory frameworks that govern your domain. This final chapter covers the discipline that determines whether the work of the previous nine chapters can operate in the real world.
The chapter is built around three layers that every regulated organisation must implement on top of the ontology:
- Lineage layer — for every property, every action, every prediction, a complete, machine-readable trace of the upstream inputs and the transformations that produced it. Used for impact analysis, reproducibility, regulatory replies, and post-incident forensics.
- Audit layer — append-only record of every action, every change, every access, signed and time-stamped, retained for the period the regulator demands.
- Governance layer — who is allowed to see what, who is allowed to do what, how that authority is granted, how it is reviewed, how it is revoked.
On top of these three layers sits regulatory compliance — the body of specific rules the organisation must follow: the EU AI Act, the US Federal Reserve’s SR 11-7 guidance for model risk, the FDA’s AI/ML SaMD framework, HIPAA in US healthcare, the FFIEC handbook for financial institutions, NERC CIP for grid operators. Each prescribes specific controls; together they shape every architectural decision in this book.
The four domain case studies are unusually concrete here because every word is enforceable: the EU AI Act’s Annex IV technical documentation, HIPAA’s audit-log retention requirements, NERC CIP-008 incident reporting, MAR’s trading-record retention obligations. The chapter closes the program with the discipline that turns a working ontology into a defensible, deployable system.
Table of Contents
- Why Trust Is Engineered
- Lineage — the OpenLineage Standard
- Audit Trails and Tamper-Evidence
- Access Control — Row, Attribute, and Action Level
- The EU AI Act and Its Annex IV Documentation
- Regulatory Frameworks by Domain
- Putting It Together — A Governance Architecture
Why Trust Is Engineered
Trust in an operational system is a function of three observable properties:
- Reproducibility — given the same inputs and the same code, the same outputs are produced. Anyone with access to the inputs and code can verify a result independently.
- Traceability — for every value the system produces, a complete record of upstream sources and transformations is available on demand.
- Accountability — for every action taken, a specific human or service identity is recorded, with the authority under which it acted.
None of these properties happens by default. They are explicit engineering choices, layered into the ontology and its operational stack. The cost of building them in is modest if planned from the start, and forbidding if retrofitted after the system is already in production. The right time to invest is when the ontology is first deployed; the wrong time is the morning after the regulator’s first information request.
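One concrete way to make reproducibility checkable is to record a canonical fingerprint of the inputs, code version, and output alongside every result. A minimal sketch (function and field names are illustrative, not a standard):

```python
import hashlib
import json

def result_fingerprint(inputs: dict, code_version: str, output) -> str:
    """Deterministic fingerprint of (inputs, code version, output).

    Anyone re-running the same code on the same inputs can recompute
    this hash and verify the recorded result independently.
    """
    payload = json.dumps(
        {"inputs": inputs, "code_version": code_version, "output": output},
        sort_keys=True,          # canonical key order -> stable hash
        separators=(",", ":"),   # canonical whitespace
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two independent parties computing the same result get the same fingerprint.
a = result_fingerprint({"rate": 0.05}, "model@9f3c2e1", 1234.56)
b = result_fingerprint({"rate": 0.05}, "model@9f3c2e1", 1234.56)
assert a == b and len(a) == 64
```

Storing the fingerprint in the audit log makes reproducibility a property a third party can test, not a claim they have to take on trust.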
The financial industry’s version of this argument is the Basel Committee’s BCBS 239 principles (issued 2013 as “Principles for Effective Risk Data Aggregation and Risk Reporting”). The 14 principles read like a textbook for the practices in this chapter: governance, data architecture, accuracy, completeness, timeliness, adaptability, audit, supervision. Every regulator in every domain has issued a parallel document; the substance is the same.
Lineage — the OpenLineage Standard
Lineage is the machine-readable record of what produced what. Three levels matter:
- Dataset-level lineage — table A was produced from tables B and C via SQL job X at time T.
- Field-level lineage — column `foo` in table A is a computed expression involving columns from B and C.
- Row-level lineage — a specific row in table A was produced by specific rows in B and C through specific transformations.
The dominant open standard is OpenLineage (Linux Foundation, 2021), supported by Apache Airflow, Dagster, Marquez, Apache Spark, dbt, and Apache Flink. The protocol is straightforward: every job emits a JSON event when it starts and finishes, declaring its inputs, outputs, and (optionally) the transformations applied.
The downstream consumer of these events is Marquez (the OpenLineage reference implementation), or any of the commercial lineage products (DataHub, OpenMetadata, Atlan, Collibra, Alation). They build a global graph of “what depends on what” that the analyst can query: “if I delete this raw table, what downstream tables / reports / models break?”
Field-level lineage typically requires SQL parsing — modern lineage products extract it automatically from the SQL submitted to the warehouse. Row-level lineage is more expensive (every output row carries a pointer back to its inputs) and is enabled only where it is required by regulation.
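The event shape a job emits can be sketched as follows (the namespace, job name, and producer URL are illustrative; the authoritative schema is the OpenLineage specification):

```python
import json
import uuid
from datetime import datetime, timezone

def lineage_event(event_type, job_name, run_id, inputs, outputs):
    """Build a minimal OpenLineage-style run event (START or COMPLETE)."""
    return {
        "eventType": event_type,                       # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},                      # same id for START and COMPLETE
        "job": {"namespace": "analytics", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/my-pipeline",  # who emitted the event
    }

run_id = str(uuid.uuid4())
start = lineage_event("START", "build_table_a", run_id, ["b", "c"], ["a"])
done = lineage_event("COMPLETE", "build_table_a", run_id, ["b", "c"], ["a"])
# In production these would be POSTed to a collector such as Marquez.
print(json.dumps(done, indent=2))
```

Pairing START and COMPLETE events under one run id is what lets the consumer distinguish a finished job from one that died mid-flight.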
Audit Trails and Tamper-Evidence
The audit layer is conceptually simple — every action emits a record — but the integrity of the record is what makes it operationally valuable. Three patterns:
- Append-only storage. The audit table accepts INSERTs only; UPDATEs and DELETEs are blocked by the database role. PostgreSQL: revoke `UPDATE` and `DELETE` from the application role. Use a separate read-only role for reporting.
- Hash-chained records. Each new audit row includes a hash of the previous row. Any retroactive tampering invalidates every subsequent hash. The same pattern underpins every blockchain (so the technique is well-tested), and banks used it long before blockchain existed (see SWIFT KYC Registry, FINRA Trade Reporting).
- Independent attestation. Periodically (daily / hourly), a separate system takes a Merkle-tree hash of the audit log and stores it in tamper-resistant storage (write-once cloud object storage, a separate cluster, or — for the highest-stakes systems — a notary service).
The three properties together — append-only, hash-chained, externally attested — make the audit log defensible in court and to regulators. The cost is modest; the value when an incident occurs is enormous.
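The hash-chaining pattern can be sketched in a few lines (an in-memory illustration; the record fields are hypothetical, and a real implementation would live in an append-only database table):

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first record

class AuditLog:
    """Append-only, hash-chained audit log (in-memory sketch)."""

    def __init__(self):
        self._records = []

    def append(self, actor: str, action: str, detail: dict) -> dict:
        prev = self._records[-1]["hash"] if self._records else GENESIS
        body = {
            "actor": actor,
            "action": action,
            "detail": detail,
            "ts": datetime.now(timezone.utc).isoformat(),
            "prev": prev,
        }
        # The hash covers the record body *and* the previous hash, so
        # editing any earlier record invalidates every later one.
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        record = {**body, "hash": digest}
        self._records.append(record)
        return record

    def verify(self) -> bool:
        prev = GENESIS
        for r in self._records:
            body = {k: v for k, v in r.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if r["prev"] != prev or recomputed != r["hash"]:
                return False
            prev = r["hash"]
        return True

log = AuditLog()
log.append("svc.loans", "ApproveLoan", {"loan_id": "L-1", "amount": 5000})
log.append("svc.loans", "RejectLoan", {"loan_id": "L-2"})
assert log.verify()
log._records[0]["detail"]["amount"] = 1  # retroactive tampering...
assert not log.verify()                  # ...is detected
```

The independent attestation step then reduces the whole chain to a single hash stored outside the system, so even wholesale replacement of the log is detectable.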
For very high-stakes use cases (clinical-trial primary endpoints, central-bank reserves accounting, settlement records at central counterparties), the audit log is sometimes physically replicated across legal entities. Two parties hold the same chain; any divergence is investigated.
Access Control — Row, Attribute, and Action Level
Modern access control on an ontology operates at three levels.
- Row-level security (RLS). “User A can see records where `country = US`; user B can see records where `country in (FR, DE)`.” Implemented as a query filter automatically appended to every read. Postgres supports RLS natively; most modern databases now do too.
- Attribute-level security (ALS). “Analysts in the EU may see customer records, but the `ssn` and `dob` columns are masked.” Implemented via column-level encryption plus decryption based on the requesting role’s attributes.
- Action-level authorisation. “Only users in the `LOAN_APPROVE` role may invoke the `ApproveLoan` action, and only up to their individual authority limit.” Implemented in the action’s pre-conditions (Chapter 8) and the action’s invocation surface.
The unifying framework is Attribute-Based Access Control (ABAC) — every access decision is a function of:
- The actor’s attributes (role, department, jurisdiction, clearance level).
- The resource’s attributes (record’s country, sensitivity, classification).
- The action’s attributes (read vs. write, batch vs. interactive).
- The context (time of day, location of request, recent incidents).
ABAC is more flexible than the older Role-Based Access Control (RBAC) and is the de facto standard for modern data platforms. Production implementations: AWS IAM with policy conditions, Open Policy Agent (OPA), Snowflake’s MASKING POLICY, Databricks’s Unity Catalog access policies, Apache Ranger.
The same ABAC pattern, scaled out, is what every modern fintech, healthcare, and government data platform implements. The discipline is to keep the policy declarative (defined as data, not as code) so that legal and compliance teams can review it without reading source code.
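A minimal sketch of a declarative ABAC evaluator over the four facets listed above (policy ids and attribute keys are illustrative; a production system would delegate this to OPA, Ranger, or the platform’s native policy engine):

```python
# Policies are data, not code, so legal and compliance can review them directly.
# Each rule lists the attribute values that must all match for PERMIT;
# anything not matched by any rule is denied (default-deny).
POLICIES = [
    {
        "id": "eu-analyst-read",
        "actor": {"role": "analyst", "jurisdiction": "EU"},
        "resource": {"classification": "customer"},
        "action": {"mode": "read"},
        "context": {},  # no contextual constraint for this rule
        "effect": "PERMIT",
    },
]

def _matches(conditions: dict, attributes: dict) -> bool:
    """True when every condition key/value is present in the attributes."""
    return all(attributes.get(k) == v for k, v in conditions.items())

def decide(actor: dict, resource: dict, action: dict, context: dict) -> str:
    """Default-deny ABAC decision over the four attribute facets."""
    for policy in POLICIES:
        if all(
            _matches(policy[facet], facts)
            for facet, facts in [
                ("actor", actor),
                ("resource", resource),
                ("action", action),
                ("context", context),
            ]
        ):
            return policy["effect"]
    return "DENY"

permit = decide(
    {"role": "analyst", "jurisdiction": "EU"},
    {"classification": "customer"},
    {"mode": "read"},
    {"channel": "interactive"},
)
deny = decide(
    {"role": "analyst", "jurisdiction": "US"},
    {"classification": "customer"},
    {"mode": "read"},
    {"channel": "interactive"},
)
assert (permit, deny) == ("PERMIT", "DENY")
```

Because the policy table is plain data, it can be versioned, diffed in a pull request, and reviewed by non-engineers, which is exactly the discipline the paragraph above calls for.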
The EU AI Act and Its Annex IV Documentation
The EU AI Act (Regulation 2024/1689, in force August 2024, with high-risk-system provisions phasing in through August 2027) is the most binding AI-specific regulation in force globally. Any organisation deploying AI in the EU market — irrespective of where the organisation is headquartered — must comply. High-risk categories include credit-scoring, employment decisions, education-admission decisions, biometric identification, critical-infrastructure operation, and many medical-device applications.
For high-risk systems, Annex IV of the regulation prescribes the technical documentation that must be maintained. The Annex IV table of contents reads like a checklist for every chapter of this book:
- General description of the AI system and its intended purpose.
- Detailed description of elements (training data, processes, third-party components).
- Detailed information on the monitoring, functioning, and control of the system.
- Description of risk-management measures.
- Description of changes to the system.
- Description of standards applied.
- Test and validation procedures (the discipline of Volume I Chapter 10 plus this chapter).
- Lists of relevant scientific literature and harmonised standards.
Every one of these sections is a deliverable. Production teams under the EU AI Act maintain it continuously — not as a one-time submission. The lineage layer, the audit layer, and the governance layer all feed into the Annex IV documentation directly. A team that built these layers from the start can produce Annex IV documentation in days; a team that didn’t may need months.
Three concrete operational consequences:
- Risk-management system required (Article 9). A formal, documented process to identify, evaluate, and mitigate risks across the system’s lifecycle.
- Data and data governance required (Article 10). Training data, validation data, test data must be representative, relevant, and free of errors. Document the procedure.
- Transparency and provision of information to deployers required (Article 13). Users of the AI system must receive clear information about its capabilities and limitations.
The other major framework names a different but overlapping discipline: NIST AI Risk Management Framework (US, voluntary but widely adopted by federal agencies and large enterprises) organises around four functions — Govern, Map, Measure, Manage — with similar substantive requirements.
Regulatory Frameworks by Domain
The four domain case studies each have their own regulatory regime layered on top of the cross-cutting AI rules.
Quantitative trading:
- MiFID II / MiFIR (EU, 2018) — transaction reporting, best-execution evidence, market-abuse surveillance.
- MAR / MAD II (EU) — market-abuse regulation, requires comprehensive trading-record retention and surveillance.
- SEC Rule 17a-4 (US) — broker-dealer record retention (WORM storage or, since the 2022 amendments, an audit-trail alternative).
- CFTC Part 23 — swap-dealer record-keeping and reporting.
- PRIIPs / KIDs (EU) — product-disclosure requirements that affect any model used in retail product design.
Healthcare:
- HIPAA (US) — patient privacy, breach notification, technical safeguards. Audit logs must be retained six years.
- GDPR (EU) — applies to personal data, including patient data. Lawful basis, consent, data minimisation, right of access.
- FDA AI/ML SaMD Action Plan (US, 2021 onwards) — Software as a Medical Device that uses AI/ML is subject to evolving FDA oversight.
- EU MDR — Medical Device Regulation; AI-enabled medical devices.
Macroeconomics / official statistics:
- UN Fundamental Principles of Official Statistics (1994, updated) — the global standard for national statistical agencies.
- EU Statistical Law 223/2009 — binds Eurostat and EU member-state agencies.
- IMF Special Data Dissemination Standard Plus (SDDS Plus) — the gold-standard data-publication discipline; ~30 economies are signatories.
- OECD Quality Framework for Statistical Activities — quality criteria for statistical outputs.
Energy:
- NERC CIP-002 through CIP-014 (US) — Critical Infrastructure Protection. Mandates physical and cyber-security controls for bulk-electric-system assets. CIP-008 is incident-reporting; CIP-013 is supply-chain cyber-security risk management.
- FERC Order 2222 (US) — interconnection of distributed-energy resources; data-exchange obligations.
- EU Clean Energy Package — data-exchange and consumer-data-portability rules.
- NIS2 Directive (EU) — cyber-security for “essential and important entities,” including energy.
Every working practitioner in one of these domains has spent a week of their career reading one of these documents end-to-end. The textbook obligation is to know the shape of the regulatory landscape; the practitioner obligation is to consult the specific text when designing a system in that domain.
Putting It Together — A Governance Architecture
A unified governance architecture for an ontology-driven analytics system has the following components, sequenced by what is built first:
┌────────────────────────────────────────────────────────────────────┐
│ GOVERNANCE LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ ABAC engine │ │ Approval │ │ Stewardship workflow │ │
│ │ (OPA / Ranger│ │ workflow │ │ (changes to ontology, │ │
│ │ / IAM) │ │ for actions │ │ schema, definitions) │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
├────────────────────────────────────────────────────────────────────┤
│ AUDIT & LINEAGE LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ Action audit │ │ Data lineage │ │ Model registry │ │
│ │ (append-only,│ │ (OpenLineage,│ │ (versioned models, │ │
│ │ hash-chained│ │ Marquez) │ │ inputs, evaluation) │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
├────────────────────────────────────────────────────────────────────┤
│ ONTOLOGY LAYER │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Object types, properties, links, actions (Chapters 2, 8) │ │
│ │ Built in PostgreSQL + AGE + pgvector / Neo4j / rdflib (Ch 6) │ │
│ │ Queried via SQL / Cypher / SPARQL / GraphQL (Ch 7) │ │
│ └──────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────┤
│ DATA LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ Ingestion │ │ Entity │ │ SCD2 / event-sourced │ │
│ │ + MDM (Ch 4) │ │ resolution │ │ stores (Ch 3) │ │
│ │ │ │ (Ch 4) │ │ │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
The discipline that keeps this architecture coherent:
- Define every artefact once. Schema definitions, object types, actions, lineage events — each lives in version control, is reviewed via pull request, and is deployed through an automated pipeline.
- Test changes against historical data. A schema change that breaks downstream lineage must fail CI. Modern data-platform tools (Datafold, Recce, Great Expectations, dbt’s data tests) build this in.
- Maintain a change log the regulator can read. Every change to the ontology, every model retraining, every action-definition update is recorded with a human-readable reason. The change log is itself an ontology — versioned, queryable, replayable.
A common refrain is that “compliance slows engineering.” It is true that adding compliance retroactively slows engineering. Adding it from the start often speeds engineering, because the same artefacts (lineage events, audit logs, governed schemas) accelerate debugging, simplify incident response, and enable cross-team collaboration. The teams that ship most aggressively in regulated industries are the ones that automated their compliance discipline; the teams that ship least are the ones that defer it.
Consider the worked example this architecture exists to serve: a regulator asks how a specific model score was produced. The layers answer in turn: (1) the audit log identifies the specific model invocation, its actor, timestamp, and the input feature vector used; (2) the lineage layer traces each input feature back through the job that computed it to the upstream raw data tables and the SQL transformations applied; (3) the model registry identifies the exact model version that produced the score, its training-data snapshot, and its validation report. Format: a JSON document containing an OpenLineage-conformant lineage graph, plus the model-card excerpt for that version, plus the audit-log entry for the decision, plus a human-readable summary. The regulator will want the JSON for forensic verification and the summary for narrative review. Producing all of this in under a day is realistic if the three layers were built in from the start; producing it in under a week is essentially impossible if they were not.
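The assembly of such a response is mechanically simple once the three layers exist — a sketch, with field names that are illustrative rather than any regulatory schema:

```python
import json

def evidence_package(audit_entry, lineage_graph, model_card, summary):
    """Assemble the regulator-facing evidence for one model decision."""
    return {
        "audit": audit_entry,      # layer 1: who, when, which inputs
        "lineage": lineage_graph,  # layer 2: OpenLineage graph of feature provenance
        "model": model_card,       # layer 3: version, training snapshot, validation
        "summary": summary,        # human-readable narrative for review
    }

pkg = evidence_package(
    audit_entry={"actor": "svc.scoring", "ts": "2025-01-15T09:30:00Z"},
    lineage_graph={"events": []},  # would hold the real OpenLineage events
    model_card={"version": "3.2.1", "training_snapshot": "2024-12-01"},
    summary="Score produced by model v3.2.1 from the features listed above.",
)
print(json.dumps(pkg, indent=2))  # the JSON document handed to the regulator
```

The hard part is never this function; it is having trustworthy values to pass into it, which is what the preceding ten chapters build.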
Book Wrap-up
Ten chapters now cover the practitioner’s discipline of ontology-driven analytics, end to end:
- Chapters 1–3 — what domain modelling is, the four primitives (object types, properties, links, actions), the temporal discipline (grain, SCDs, point-in-time correctness).
- Chapters 4–5 — entity resolution and master data; the published-ontology standards (FIBO, SNOMED CT, GICS, GS1, SDMX, CIM) that you adopt rather than reinvent.
- Chapters 6–7 — building the ontology in Python (NetworkX / rdflib / Neo4j / Apache AGE / pgvector) and querying it (SQL / SPARQL / Cypher / GraphQL).
- Chapters 8–9 — the operational layer (actions, events, workflows) and the AI layer (GraphRAG, GNNs, KG embeddings) that sit on top of the ontology.
- Chapter 10 — the trust, lineage, governance, and regulatory discipline that determines whether anything you built is actually deployable.
The two-book pair — Learning Statistics in Python (Volume I, the methods) and Domain Modelling in Python (this volume, the practitioner’s discipline) — is the curriculum a modern analytics master’s program ought to teach. The methods half is well-known, well-taught, and largely commoditised. The practitioner half — domain modelling, entity resolution, published ontologies, lineage and governance — is the half that has migrated out of bespoke craft into platform engineering over the past five years, and it is the half that hires now most desperately need.
The single most important habit a graduate of this program should leave with is the ontology-first instinct: when handed a problem, the first reflex is not “what method should I apply?” but “what objects, properties, links, and actions describe this domain?” Once the modelling is right, the methods of Volume I almost always work. When the modelling is wrong, no method saves you.
Statistics is the discipline of finding signal in noise. Domain modelling is the discipline of structuring the world so that the signal can be found. Together they are the practitioner’s craft.
← Chapter 9 · Contents · Cover