Chapter 1: Why Domain Modelling
Chapter Introduction
Most undergraduate data-science training begins with the same scene: a pandas.read_csv(...) call. A flat table appears, and the student starts applying methods to it — filter, group, regress, plot. This scene is the source of an enormous amount of operational dysfunction in working analytics teams. Real-world data does not arrive as a single table. It arrives as fragmented records from a dozen source systems describing entities — customers, accounts, transactions, patients, vessels, machines — that don’t agree on identifiers, don’t agree on definitions, change over time, and are connected to each other in ways the source systems didn’t bother to record.
The practitioner’s first job is not to apply a method. It is to model the domain — to say explicitly what entities exist, what properties they have, how they are connected, and how they change. Only after the domain model is in place do the statistical methods of Volume I have anything well-defined to operate on. This chapter exists to make that hierarchy explicit and to set the agenda for the rest of the book.
The shift in worldview is real and observable across the industry. Palantir Technologies — among the most valuable analytics companies of the 2020s — sells essentially this idea as software. Microsoft Fabric, Databricks Unity Catalog, Snowflake’s Horizon, AWS DataZone, and Atlan all converged in 2023–2025 on the same architectural principle: the ontology is the platform, the data is what you attach to it, the methods are what operate on the attached data. This book teaches the discipline that any of those platforms expects you to bring.
The chapter does three things. First, it draws the distinction between a schema, a data model, and an ontology, three terms that are routinely confused. Second, it walks through the PLTR / Foundry worldview as the contemporary canonical example, and explains why the open-source alternatives — PostgreSQL + pgvector + NetworkX + DuckDB — implement the same architecture at a different price point. Third, it is honest about when domain modelling is not the right answer: there are problems for which a flat table is genuinely the natural representation, and pretending otherwise wastes everyone’s time.
Table of Contents
- The Three Layers of Representation
- From Rows to Objects — the Conceptual Shift
- The PLTR / Foundry Worldview
- The Open-Source Alternative Stack
- When You Don’t Need an Ontology — Anti-Patterns
- Setting Up the Working Stack for This Book
The Three Layers of Representation
When practitioners argue about “data architecture” they are often arguing about three different things using overlapping words. Distinguishing them up front saves entire careers’ worth of confusion.
1. The schema layer. A schema describes the physical shape of stored data: column names, data types, primary keys, foreign keys, indexes. A schema lives inside a single database or file system. PostgreSQL DDL, Avro schemas, Parquet column specifications, BigQuery table definitions, dbt models — all schemas. A schema answers the question “how is this stored?”
2. The data model layer. A data model describes the conceptual entities the schema represents and how they relate. ER diagrams, UML class diagrams, Kimball’s dimensional model (facts and dimensions), Inmon’s third-normal-form enterprise data model — all data models. A data model can be implemented by many different schemas (Postgres, BigQuery, Snowflake) without changing. A data model answers the question “what is being described, and how do the descriptions fit together?”
3. The ontology layer. An ontology is a data model with semantics — it carries explicit definitions of what each object type means in the business domain, what the allowed relationships are, what the allowed states and transitions are, and what operations (actions, processes) are defined over the entities. Ontologies are operational specifications, not just descriptive ones. An ontology answers the question “how does the business reason and act about these entities?”
A schema without a data model is a mess of tables nobody can join. A data model without an ontology is a passive descriptor that doesn’t tell you what to do. An ontology without a schema is a PowerPoint diagram nobody can query. You need all three, in the right order.
The practical implication: when joining a project, the first three questions to ask, in order, are (a) “what is the ontology — what are the object types and what do they represent?” (b) “what data model implements it?” (c) “what schemas / tables hold the data?” Most teams get those questions backward, and most analytics projects fail at the upstream layer they didn’t know they needed.
From Rows to Objects — the Conceptual Shift
Consider a healthcare claims dataset. The naïve approach loads a single CSV (the file and column names below are illustrative):
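```python
import pandas as pd

# One flat table of claims; columns assumed for illustration:
# claim_id | patient_name | provider_id | icd10_code | claim_date | amount
claims = pd.read_csv("claims.csv")
claims["claim_date"] = pd.to_datetime(claims["claim_date"])
print(claims.head())
```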
The flat-table thinker sees columns: filter by patient, group by provider, sum the amount. This view hides almost every interesting question.
The object thinker sees a graph of entities:
- A `Patient` is the same person across multiple `claim_id` rows. (John Smith and Jon Smith are probably the same — entity resolution will come up in Chapter 4.)
- A `Provider` is an entity with its own properties (specialty, location, network membership).
- A `Diagnosis` (ICD-10 code) is an entity in a medical ontology (SNOMED CT or ICD-10 hierarchy) that links to other diagnoses, treatments, and outcomes.
- A `Claim` is an event entity that links Patient → Provider → Diagnosis at a moment in time, with a monetary property.
- The `claim_date` is not just a column — it’s the temporal axis along which the Patient’s diagnoses, providers, and out-of-pocket spending all evolve.
The object-graph view immediately suggests questions the flat-table view was hiding:
- “Which patients have multiple providers for the same diagnosis?” — needs `Patient` and `Diagnosis` as first-class entities.
- “What is each provider’s diagnosis mix?” — needs `Provider` as an entity with a derived `diagnosis_mix` property.
- “How long after a wellness exam (Z00.0) do patients return for their next claim?” — needs `Claim` as a temporal entity with the right ordering.
- “Has this patient’s average per-claim spend been rising?” — needs `Patient` as a time-evolving entity with a derived `avg_claim_amount(month)` property.
Each of these questions is a function over the ontology. Each of them is awkward, slow, or buggy when implemented on the flat table. None of them are interesting questions until the entity layer exists.
The same shift applies in retail. Given a point-of-sale table of transaction line items, the question is not about transactions — it’s about products and baskets. The right first move is to materialise Product and Basket as object types in an ontology, then derive the per-basket product set, then run market-basket analysis on the derived property, as sketched below. Skipping straight to groupby works for one ad-hoc question and creates technical debt for every subsequent one.
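A minimal sketch, assuming an illustrative line-items table:

```python
import pandas as pd

# Illustrative point-of-sale line items: one row per product scanned.
tx = pd.DataFrame({
    "basket_id":  ["b1", "b1", "b2", "b2", "b2", "b3"],
    "product_id": ["milk", "bread", "milk", "eggs", "bread", "milk"],
})

# Materialise Basket as an object type with a derived product_set property.
baskets = tx.groupby("basket_id")["product_id"].apply(frozenset).rename("product_set")

# Market-basket analysis now runs over the derived property, not the raw rows:
support = baskets.apply(lambda s: {"milk", "bread"} <= s).mean()
print(baskets.to_dict())
print(f"support({{milk, bread}}) = {support:.2f}")
```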
The PLTR / Foundry Worldview
Palantir’s product is the most influential commercial implementation of ontology-first analytics, and its vocabulary has become the lingua franca for practitioners. The Foundry ontology has four primitives:
- Object Types — the nouns of the business. `Patient`, `Account`, `Aircraft`, `Vessel`, `Shipment`, `Loan`, `Customer`, `Equipment`. Each has a stable identifier and a defined set of properties.
- Properties — typed attributes. Some properties are primitive (a name, a birthdate, a number) and some are derived (a running average, a sector roll-up, an outcome prediction). The derivation logic is part of the ontology, not bolted on afterward.
- Links — typed relationships between object types. `Patient` —has-claim→ `Claim`, `Claim` —billed-by→ `Provider`, `Provider` —affiliated-with→ `Hospital`. Links carry their own properties (start date, status, weight).
- Actions — operations that change the state of the ontology. `Approve loan`, `Schedule procedure`, `Allocate inventory`, `Submit prior authorisation`. Actions are first-class — they have parameters, side effects, and audit trails (sketched after this list). This is what separates an ontology from a static knowledge graph.
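What “first-class” means for an Action fits in a dozen lines of plain Python. A minimal sketch, assuming a dict-backed toy ontology and a global audit log; the `ApproveLoan` name and log shape are illustrative, not Foundry’s API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

AUDIT_LOG = []  # every action invocation lands here

@dataclass
class ApproveLoan:
    """An Action: typed parameters, a side effect on the ontology, an audit trail."""
    loan_id: str
    approver: str

    def execute(self, objects):
        objects[self.loan_id]["status"] = "approved"      # side effect
        AUDIT_LOG.append({                                # audit trail
            "action": "ApproveLoan",
            "params": {"loan_id": self.loan_id, "approver": self.approver},
            "at": datetime.now(timezone.utc).isoformat(),
        })

loans = {"loan-7": {"status": "pending"}}
ApproveLoan(loan_id="loan-7", approver="analyst-3").execute(loans)
print(loans["loan-7"]["status"], len(AUDIT_LOG))  # approved 1
```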
Two additional concepts complete the architecture:
- Functions — read-only computations over the ontology. A risk score for a `Loan`, a sentiment label for a `News Article`, a clustering label for a `Customer`. Functions are how statistical methods (everything in Volume I) attach to the ontology.
- Applications / Workshop — user-facing screens that let humans see, decide on, and act on the ontology. The decision loop closes here.
The killer feature of this architecture is lineage: every derived property, every action’s outcome, and every model’s prediction is automatically tracked back to the upstream objects and actions that produced it. When a regulator asks “why was this loan denied?” the system can produce a complete answer chain — input properties, action history, function calls, model outputs — with no extra engineering.
Four payoffs follow from this design:

- No data swamp. Every column in every table has a defined home in an `Object Type` or a `Link`.
- Reusable features. A property defined once is available everywhere — to the dashboard, the model, the action, the audit report.
- Causality at the platform level. Actions are explicit, so “what happens when we intervene” is a first-class query.
- Regulator-ready by construction. Lineage and definitions are the artefacts the EU AI Act and US OCC actually want.
This is not unique to Palantir — Microsoft Fabric’s OneLake, Databricks Unity Catalog’s semantic layer, Snowflake Horizon’s data graph, and AWS DataZone’s business glossary all implement variations on the same architecture. The vocabulary is what makes Foundry the canonical reference; the idea is platform-independent and is what this book teaches.
The Open-Source Alternative Stack
You do not need Palantir to do ontology-first analytics. The same architecture is buildable on free, mature open-source components — and for most teaching and prototype purposes, the open-source stack is the right choice.
The recommended teaching stack for this book:
| Concern | Open-source tool | Foundry equivalent |
|---|---|---|
| Storage of object properties | PostgreSQL (or DuckDB for analytics) | Foundry datasets |
| In-memory graph operations | NetworkX (small graphs), igraph (medium) | Foundry’s object-type service |
| Triple store / formal ontology | rdflib, Apache Jena, GraphDB | Foundry ontology service |
| Property graph database | Neo4j Community, Memgraph, Apache AGE (on Postgres) | Foundry’s object graph |
| Vector embeddings on objects | pgvector extension for PostgreSQL | Foundry’s AIP semantic layer |
| Workflow / action execution | Dagster, Prefect, Apache Airflow | Foundry Pipelines / AIP Logic |
| Lineage tracking | OpenLineage, Marquez | Foundry’s automatic lineage |
| Data-quality contracts | Great Expectations, Soda, dbt tests | Foundry’s data health monitors |
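To make one row of the table concrete, here is the triple-store layer in rdflib — a sketch with an illustrative namespace and toy triples, not a full OWL ontology:

```python
from rdflib import Graph, Literal, Namespace, RDF

# Illustrative namespace: the same claims domain, expressed as formal triples.
EX = Namespace("http://example.org/claims#")
g = Graph()
g.bind("ex", EX)
g.add((EX["pat-1"], RDF.type, EX.Patient))
g.add((EX["clm-1"], RDF.type, EX.Claim))
g.add((EX["pat-1"], EX.hasClaim, EX["clm-1"]))
g.add((EX["clm-1"], EX.amount, Literal(120.0)))

# SPARQL: every claim (and its amount) reachable from a Patient.
q = """
PREFIX ex: <http://example.org/claims#>
SELECT ?claim ?amt WHERE {
    ?p a ex:Patient ; ex:hasClaim ?claim .
    ?claim ex:amount ?amt .
}
"""
for row in g.query(q):
    print(row.claim, row.amt)
```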
Production-scale firms using this open-source stack include Lyft (Amundsen + Marquez), Stripe (custom Postgres ontology), Spotify (graph-backed Show/Episode/Listener ontology), and most of the modern data-platform startups. The pattern is the same as Foundry’s; the price is “your team’s engineering hours” instead of “a Palantir contract.”
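The whole pattern fits in one cell. A minimal sketch using NetworkX as the in-memory object graph; the schema dict, helper functions, and object IDs are illustrative choices, not a fixed API:

```python
import networkx as nx

# Schema specification: object types and the allowed link types between them.
SCHEMA = {
    "object_types": {"Patient", "Provider", "Claim"},
    "link_types": {"has_claim": ("Patient", "Claim"),
                   "billed_by": ("Claim", "Provider")},
}

G = nx.MultiDiGraph()  # the object graph

def add_object(obj_id, obj_type, **props):
    """Typed object instance, validated against the schema."""
    assert obj_type in SCHEMA["object_types"], f"unknown type: {obj_type}"
    G.add_node(obj_id, obj_type=obj_type, **props)

def add_link(src, dst, link_type):
    """Typed link, endpoint types validated against the schema."""
    src_t, dst_t = SCHEMA["link_types"][link_type]
    assert (G.nodes[src]["obj_type"], G.nodes[dst]["obj_type"]) == (src_t, dst_t)
    G.add_edge(src, dst, link_type=link_type)

add_object("pat-1", "Patient", name="John Smith")
add_object("prov-1", "Provider", specialty="Family Medicine")
add_object("clm-1", "Claim", amount=120.0, date="2024-01-15")
add_object("clm-2", "Claim", amount=80.0, date="2024-03-02")
add_link("pat-1", "clm-1", "has_claim")
add_link("pat-1", "clm-2", "has_claim")
add_link("clm-1", "prov-1", "billed_by")

def avg_claim_amount(patient_id):
    """Function over the ontology: a derived Patient property."""
    amounts = [G.nodes[c]["amount"] for c in G.successors(patient_id)
               if G.nodes[c]["obj_type"] == "Claim"]
    return sum(amounts) / len(amounts)

print(avg_claim_amount("pat-1"))  # 100.0
```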
The example above is roughly 30 lines of Python — but it has every component a real ontology needs: typed object instances, typed links, a schema specification, and a function over the ontology. Production systems scale this pattern to millions of objects and billions of links; the conceptual primitives don’t change.
When You Don’t Need an Ontology — Anti-Patterns
Ontology-first analytics is genuinely the right answer for most enterprise problems, but it is also possible to over-engineer. Reach for a flat-table workflow, not an ontology, when:
- The problem is genuinely one table. A Kaggle competition where the input is a single CSV. A one-off pricing-curve fit. A peer-reviewed empirical paper on a published dataset. In all of these the entities are simple and the table is the natural representation.
- The data is read-once. If you will load the data once, run one analysis, write a report, and never see it again, a domain model is wasted effort.
- The team is one person. Ontologies pay off because they make shared meaning explicit across multiple people. A solo analyst can hold the implicit model in their head.
- The grain is the question. If the question is literally “what is the mean of this column?” no entity layer is needed.
Over-engineering is real and costly. The symptoms: an ontology document longer than the code, three weeks spent debating object-type naming before the first analysis ships, a “data steward” role added to a six-person team. If you’re seeing those signs you’ve overshot.
The rule of thumb: the domain model should be the smallest object representation that makes the analysis problem natural to express. If the flat table is natural, use it. If you find yourself reaching for self-joins, complicated groupby chains, or “did this column mean X or Y when it was loaded?” debates, an ontology will pay for itself.
Setting Up the Working Stack for This Book
The chapters of this book run in Pyodide (browser-side Python) for any example small enough. For larger work or production exercises, the recommended local stack is:
```bash
# core
python >= 3.11
pip install pandas numpy duckdb networkx rdflib scikit-learn

# graph & ontology
pip install rdflib-sqlalchemy SPARQLWrapper
# (for property-graph work, install Neo4j Community locally
#  and `pip install neo4j` for the driver)

# vector / embedding
pip install sentence-transformers
# (Postgres-only sites: install the pgvector extension)

# workflow execution
pip install dagster dagit
# or
pip install prefect

# data quality
pip install great-expectations
```
The Pyodide cells in this book require none of this — they install nothing on your machine. The local stack is only needed when you start building beyond the toy examples — typically by Chapter 6.
A single test of “the stack is healthy enough” is the cell below: it builds a tiny ontology, queries it with NetworkX traversals, and derives a property using a function. If this runs in your browser, you have everything you need to follow the next two chapters.
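A minimal version of that cell, with illustrative object types and IDs:

```python
import networkx as nx

G = nx.MultiDiGraph()
# Tiny ontology: two patients, two providers, three claims.
G.add_node("pat-1", obj_type="Patient", name="Ada")
G.add_node("pat-2", obj_type="Patient", name="Grace")
G.add_node("prov-1", obj_type="Provider", specialty="Cardiology")
G.add_node("prov-2", obj_type="Provider", specialty="Radiology")
for cid, pat, prov, amt in [("clm-1", "pat-1", "prov-1", 200.0),
                            ("clm-2", "pat-1", "prov-2", 50.0),
                            ("clm-3", "pat-2", "prov-1", 125.0)]:
    G.add_node(cid, obj_type="Claim", amount=amt)
    G.add_edge(pat, cid, link_type="has_claim")
    G.add_edge(cid, prov, link_type="billed_by")

# Traversal: which providers has a patient seen? (Patient → Claim → Provider)
def providers_of(patient_id):
    return {prov for claim in G.successors(patient_id)
                 for prov in G.successors(claim)}

# Derived property: total spend per patient.
def total_spend(patient_id):
    return sum(G.nodes[c]["amount"] for c in G.successors(patient_id))

print(providers_of("pat-1"))  # {'prov-1', 'prov-2'}
print(total_spend("pat-1"))   # 250.0
```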
Chapter Wrap-up
Three things to carry forward:
- The ontology comes first. Methods operate on a domain model; without one the methods produce numbers without meaning. The book’s working order is: model → instantiate → derive → analyse → decide.
- The Foundry vocabulary is the practitioner standard. Object Type, Property, Link, and Action are the four primitives; Function and Application round out the four-plus-two. Every working data team uses these words, on whatever platform they happen to deploy.
- Open-source is fully sufficient for learning. PostgreSQL + NetworkX + DuckDB + Dagster covers everything Foundry does for the scale of work this book exercises. Foundry, Fabric, Unity Catalog, Snowflake Horizon are realisations of the same architecture at production scale.
In Chapter 2 we open the toolbox and learn to actually build the four primitives — object types, properties, links, actions — on real example data from three different verticals.