Chapter 6: Building the Ontology in Python
Chapter Introduction
The previous chapters argued for ontology-first analytics and walked through the published ontology stacks of four domains. This chapter is where the modelling becomes code. We move from which ontology to how to instantiate it — load object instances, declare typed links, attach properties, persist the result so other code can query it, and scale from a teaching-size hundred-object example to a production-size hundred-million-object deployment.
Python has four mature options for representing an ontology, each with its own sweet spot:
- NetworkX — pure-Python graph library; comfortable with graphs up to a few million nodes in memory; the right choice for prototyping, teaching, and any in-RAM analysis.
- rdflib + RDF/OWL — the formal, semantic web stack; SPARQL queries; standards-grade interop. Slower; the right choice when you need formal ontology semantics (subsumption reasoning, OWL inference).
- Neo4j and Apache AGE — property graphs at production scale. Cypher queries. ~10⁹ nodes; the default for ontology-backed applications.
- PostgreSQL + pgvector + JSONB — pragmatic relational storage that doubles as a property graph and a vector store. Most-deployed in industry because the operations team already runs Postgres.
The right tool is rarely a single one; production stacks routinely combine three or four. A typical pattern: NetworkX in the research notebook, Neo4j or AGE in the application, pgvector for any embedding-driven lookups, rdflib for the FIBO / SNOMED interop layer.
The chapter walks each option end to end, then closes with a case-study pattern that uses each tool in a different layer of a single working pipeline. The four domain case studies recur throughout to keep the discussion concrete.
Table of Contents
- NetworkX — the In-Memory Workhorse
- rdflib and RDF / OWL — the Formal Semantic Stack
- Neo4j and Apache AGE — Property Graphs at Scale
- PostgreSQL + pgvector + JSONB — the Pragmatic Hybrid
- Choosing the Right Layer
- A Layered Stack — All Four Working Together
NetworkX — the In-Memory Workhorse
NetworkX is the Swiss-army knife. Pure Python, zero infrastructure, in-memory only. The data structures are dictionaries; every property is JSON-serialisable; the algorithms cover everything from BFS to graph isomorphism. Nearly every quant fund, hospital research team, and macro shop runs NetworkX somewhere in its prototype-to-research pipeline.
Object types as attributed nodes; links as attributed edges
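A minimal sketch of such a cell, with illustrative country and indicator data:

```python
import networkx as nx

# A directed multigraph: several observation edges can link the same
# (country, indicator) pair at different dates.
G = nx.MultiDiGraph()

# Object types as attributed nodes
G.add_node("AUS", type="Country", name="Australia")
G.add_node("USA", type="Country", name="United States")
G.add_node("CPI_YOY", type="Indicator", name="CPI, year-on-year %", unit="pct")

# Observation events as attributed edges: (country)-[HAS_OBSERVATION]->(indicator)
G.add_edge("AUS", "CPI_YOY", rel="HAS_OBSERVATION", date="2024-03-31", value=3.6)
G.add_edge("AUS", "CPI_YOY", rel="HAS_OBSERVATION", date="2024-06-30", value=3.8)
G.add_edge("USA", "CPI_YOY", rel="HAS_OBSERVATION", date="2024-06-30", value=3.0)

def latest_value(g, country, indicator):
    """Derived property: the most recent observation on a (country, indicator) link."""
    obs = [d for _, v, d in g.out_edges(country, data=True)
           if v == indicator and d.get("rel") == "HAS_OBSERVATION"]
    return max(obs, key=lambda d: d["date"])["value"] if obs else None

print(latest_value(G, "AUS", "CPI_YOY"))  # 3.8 (ISO date strings sort lexicographically)
```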
That cell — under 40 lines — is a functional macro ontology: country and indicator object types, observation events linking them, a derived “latest value” function. The same pattern scales linearly to ~50 indicators × 250 countries × 30 years of monthly data — 4.5 million observations — in roughly 4 GB of RAM. Past that you persist (next sections).
Built-in graph algorithms
These algorithms — PageRank, shortest path, community detection, betweenness — are how an ontology-aware analyst answers questions like “which provider is the central referrer in this network?” or “which generating unit is the most pivotal in this grid topology?”
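The pattern in code — a toy provider-referral network (provider IDs are invented) run through those algorithms:

```python
import networkx as nx

# Toy provider-referral network; an edge means "refers patients to"
R = nx.DiGraph()
R.add_edges_from([
    ("GP-1", "CARD-1"), ("GP-2", "CARD-1"), ("GP-3", "CARD-1"),
    ("GP-1", "ORTH-1"), ("CARD-1", "IMG-1"), ("ORTH-1", "IMG-1"),
])

pr = nx.pagerank(R)                 # influence via incoming referrals
bc = nx.betweenness_centrality(R)   # brokers sitting on referral paths

print("most central node:", max(pr, key=pr.get))
print("key broker:", max(bc, key=bc.get))
print("referral path:", nx.shortest_path(R, "GP-2", "IMG-1"))
```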
When NetworkX runs out
NetworkX is single-machine and single-threaded. Past ~10 million nodes it slows down enough that an interactive analyst’s experience degrades. Past ~50 million nodes it requires a persisted backend. Past ~500 million the right tool is a distributed graph engine (Spark GraphFrames, Pregel-style systems, or a managed service).
rdflib and RDF / OWL — the Formal Semantic Stack
When you need formal semantic reasoning — automatic subsumption (“a CommonStock is-a Equity is-a Security”), OWL property constraints, SPARQL across federated graphs, or compatibility with the W3C Linked Data ecosystem — rdflib is the standard Python library. The data model is triples: (subject, predicate, object). Everything is a triple; queries are SPARQL.
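A minimal sketch in rdflib — the namespace, classes, and instance are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/fin#")
g = Graph()

# The subsumption chain from the text: CommonStock is-a Equity is-a Security
g.add((EX.Equity, RDFS.subClassOf, EX.Security))
g.add((EX.CommonStock, RDFS.subClassOf, EX.Equity))

# An instance — note that every property is its own triple
g.add((EX.AAPL, RDF.type, EX.CommonStock))
g.add((EX.AAPL, EX.ticker, Literal("AAPL")))

# A SPARQL property path walks the class hierarchy transitively
q = "SELECT ?s WHERE { ?s rdf:type/rdfs:subClassOf* ex:Security . }"
for row in g.query(q, initNs={"ex": EX, "rdf": RDF, "rdfs": RDFS}):
    print(row.s)  # http://example.org/fin#AAPL
```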
The rdflib advantage:
- Every public ontology (FIBO, SNOMED CT, ICD-10, SDMX, GICS-OWL versions, CIM-OWL versions) ships as RDF/OWL files. You can `Graph().parse("fibo-securities.ttl")` and start querying with no transformation.
- W3C standards apply: URIs as identifiers, SPARQL as the query language, OWL as the constraint language.
- Federated SPARQL lets you query across graphs hosted at different endpoints.
The rdflib disadvantage:
- Slower than NetworkX for in-memory work.
- The triple paradigm is initially unfamiliar — every property of an object is a separate triple, so a 10-property object is 10 triples.
- Tooling is less mature than the property-graph stack.
Production rdflib usage adds: persistent stores (BerkeleyDB, SQLAlchemy back-end, GraphDB, Apache Jena Fuseki), SPARQL endpoints, reasoners that infer triples (OWL RL or OWL DL semantics), and validators (SHACL — Shapes Constraint Language — for data-quality rules).
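To make the reasoning step concrete, a sketch with the separately installed `owlrl` package, which materialises RDFS / OWL RL inferences directly into an rdflib graph:

```python
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl  # pip install owlrl

EX = Namespace("http://example.org/fin#")
g = Graph()
g.add((EX.Equity, RDFS.subClassOf, EX.Security))
g.add((EX.AAPL, RDF.type, EX.Equity))

# Expand the graph with everything RDFS semantics entails
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.AAPL, RDF.type, EX.Security) in g)  # True — inferred, never asserted
```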
When to use it:
- You consume or publish public ontologies that ship as RDF (FIBO, SNOMED CT subsets, SDMX agency-codes, GICS-OWL).
- Your data is naturally graph-shaped and needs formal semantics (subsumption, equivalence, property restrictions).
- You operate inside a Linked Data ecosystem (W3C, EU’s Data Spaces, certain government open-data programmes).
- You need cross-organisation federated queries.
When not to use it:
- Performance-critical applications. Property graphs (Neo4j, AGE) are typically 10-100× faster.
- Pure prototyping. NetworkX is easier.
- Pure relational analytics. Postgres + JSONB is more familiar.
Neo4j and Apache AGE — Property Graphs at Scale
A property graph is a graph where nodes and edges both carry typed labels and arbitrary key/value properties. The model is friendlier than RDF for most engineers — fewer indirections, no triple-explosion, native support for “an edge has properties.”
The two production options:
- Neo4j — the standalone graph database. Cypher query language. Mature; on-premise or managed (AuraDB). Used by every social-network analytics team, fraud-detection vendor, supply-chain risk team, and recommendation engine you have heard of.
- Apache AGE — a PostgreSQL extension that adds property-graph capability to an existing Postgres database. Cypher and SQL in the same query. Used at firms that want a graph but don’t want to operate a separate database.
Modelling in Neo4j / Cypher
Cypher’s syntax is intentionally pattern-like — you draw the pattern you want to match. The classic example:
// Create a small generator → substation topology
CREATE
(g:Generator {name: 'Bayswater-1', capacity_MW: 660}),
(s1:Substation {name: 'Bayswater-SUB', voltage_kV: 220}),
(s2:Substation {name: 'Sydney-SUB', voltage_kV: 220}),
(l:ACLineSegment {capacity_MW: 900, length_km: 145}),
(g)-[:FEEDS]->(s1),
(s1)-[:CONNECTED_VIA]->(l),
(l)-[:CONNECTED_TO]->(s2);
// Find every substation reachable from Bayswater-1 within 3 hops
MATCH (g:Generator {name: 'Bayswater-1'})-[:FEEDS|CONNECTED_VIA|CONNECTED_TO*1..3]-(s:Substation)
RETURN DISTINCT s.name;
Cypher is read top-to-bottom as a pattern; [:REL_TYPE*1..3] means “follow one to three hops along that relationship type.” The Neo4j query planner turns the pattern into an efficient traversal.
The Cypher language is also the de-facto standard for Apache AGE and was adopted as the basis of the ISO GQL standard (the SQL-equivalent for graph databases, ratified in 2024). Learning Cypher is the path of least regret in the property-graph space.
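From Python, the official `neo4j` driver runs the same traversal; the connection details below are placeholders for a local instance:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (g:Generator {name: $name})-[:FEEDS|CONNECTED_VIA|CONNECTED_TO*1..3]-(s:Substation)
RETURN DISTINCT s.name AS name
"""

with driver.session() as session:
    for record in session.run(query, name="Bayswater-1"):
        print(record["name"])  # Bayswater-SUB, Sydney-SUB

driver.close()
```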
Modelling in AGE / Postgres
AGE runs Cypher inside Postgres. The same query above:
-- assumes the graph was created once: SELECT create_graph('grid_graph');
SELECT * FROM cypher('grid_graph', $$
CREATE
(g:Generator {name: 'Bayswater-1', capacity_MW: 660}),
(s:Substation {name: 'Bayswater-SUB'}),
(g)-[:FEEDS]->(s)
$$) AS (result agtype);

The huge advantage of AGE for an enterprise team: the graph lives in the same Postgres instance as the relational tables, the BI dashboards, the audit log, and the user-management table. One backup, one ops team, one ACL model.
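From Python the graph is reachable over an ordinary Postgres connection — a sketch with `psycopg2`, assuming the `grid_graph` graph created above (the connection string is a placeholder):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# Load AGE into this session and expose its catalog
cur.execute("LOAD 'age';")
cur.execute('SET search_path = ag_catalog, "$user", public;')

# The same Cypher, executed from Python; results come back as agtype
cur.execute("""
    SELECT * FROM cypher('grid_graph', $$
        MATCH (g:Generator)-[:FEEDS]->(s:Substation)
        RETURN g.name, s.name
    $$) AS (generator agtype, substation agtype);
""")
print(cur.fetchall())
```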
When to use a property graph (Neo4j or AGE):
- Performance is a priority.
- Your engineers know SQL but not SPARQL.
- You need ad-hoc property additions without schema migration.
- You want to mix relational and graph data freely.
- You need to deploy on a managed cloud service (Neo4j AuraDB, AWS Neptune via openCypher, Azure Cosmos DB for Apache Gremlin).
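For contrast, the kind of equity-listing lookup this buys you, first as a NetworkX prototype (ticker, MIC, and LEI values are illustrative):

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("XNAS", type="Venue", mic="XNAS")
G.add_node("AAPL", type="Equity", ticker="AAPL")
G.add_node("APPLE_INC", type="Issuer", lei="HWUPKR0MPOU8FGXBT394")
G.add_edge("AAPL", "XNAS", rel="LISTED_ON")
G.add_edge("AAPL", "APPLE_INC", rel="ISSUED_BY")

# "Which equities list on XNAS, and who issues them?"
for eq, venue, d in G.edges(data=True):
    if d["rel"] == "LISTED_ON" and G.nodes[venue].get("mic") == "XNAS":
        issuer = next(v for _, v, e in G.out_edges(eq, data=True)
                      if e["rel"] == "ISSUED_BY")
        print(G.nodes[eq]["ticker"], G.nodes[issuer]["lei"])
```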
The same query in Cypher: `MATCH (s:Equity)-[:LISTED_ON]->(v:Venue {mic: 'XNAS'}), (s)-[:ISSUED_BY]->(i) RETURN s.ticker, i.lei`. Cleaner and faster on a real graph database; the NetworkX version above proves the concept.
PostgreSQL + pgvector + JSONB — the Pragmatic Hybrid
The default operational stack at most enterprises already runs PostgreSQL. Three reasons it is often the best place to put your ontology too:
- JSONB columns let you store arbitrary nested property bags without schema migration; perfect for the heterogeneous attributes object types accumulate over time.
- The `pgvector` extension adds first-class vector embeddings with ANN indexing — the same vector capability discussed in Volume I Chapter 8 — without leaving Postgres.
- Apache AGE (above) layers property graphs into the same database.
- Mature: backups, replication, security, audit, monitoring all just work.
A practical pattern:
-- Object types as tables with JSONB property columns
CREATE TABLE securities (
security_id TEXT PRIMARY KEY,
ticker TEXT,
figi TEXT UNIQUE,
isin TEXT,
lei TEXT REFERENCES legal_entities(lei),
properties JSONB NOT NULL DEFAULT '{}',
embedding VECTOR(384), -- pgvector
valid_from DATE NOT NULL,
valid_to DATE
);
-- Indexes for fast lookup
CREATE INDEX ix_sec_lei ON securities(lei);
CREATE INDEX ix_sec_props ON securities USING gin(properties); -- query into JSONB
CREATE INDEX ix_sec_embed ON securities USING hnsw(embedding vector_cosine_ops); -- ANN

The same table holds the canonical security record, every typed property (some in columns for speed, some in JSONB for flexibility), the SCD2 validity dates (Chapter 3), the FIBO-aligned LEI link, and a 384-d transformer embedding ready for similarity search.
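And the pay-off in one query — a sketch mixing a JSONB property filter, the SCD2 validity predicate, and a pgvector similarity ranking (the connection string and query embedding are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

vec = "[" + ",".join(["0.0"] * 384) + "]"  # stand-in for a real 384-d embedding
cur.execute("""
    SELECT ticker,
           properties->>'sector'    AS sector,    -- JSONB property
           embedding <=> %s::vector AS cos_dist   -- pgvector cosine distance
    FROM securities
    WHERE properties @> '{"asset_class": "equity"}'  -- JSONB containment
      AND valid_to IS NULL                           -- current SCD2 row
    ORDER BY embedding <=> %s::vector
    LIMIT 10;
""", (vec, vec))
print(cur.fetchall())
```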
The Postgres approach is the de-facto industry default: the large majority of ontology systems in production at firms outside the largest Internet companies sit on PostgreSQL with JSONB + pgvector + (optionally) AGE. Specialised graph databases are reserved for cases where the graph traversals genuinely outweigh the operational simplicity of Postgres.
Choosing the Right Layer
| Need | Right tool |
|---|---|
| Prototype, teaching, < 1M nodes | NetworkX |
| Consume / publish public OWL ontologies | rdflib + Apache Jena / GraphDB |
| Formal reasoning (subsumption, OWL DL) | rdflib + a reasoner |
| Large native graph, many traversals | Neo4j or AWS Neptune |
| Graph + SQL in one place | Apache AGE on Postgres |
| Operational simplicity, JSONB flexibility | Postgres + JSONB + pgvector |
| Vector similarity + relational + graph | Postgres + pgvector + AGE |
| Distributed, > 1B nodes | Spark GraphFrames / TigerGraph / managed service |
Most production stacks combine 2–4 of these. Common production patterns by domain:
- Quant trading: Postgres + AGE + pgvector for the working store; rdflib for FIBO interop with regulators; NetworkX in research notebooks; Spark for historical-data jobs.
- Healthcare: FHIR server (HAPI-FHIR, often Postgres-backed) for the operational layer; Neo4j for clinical-pathway analytics; rdflib for SNOMED CT subset interop.
- Macroeconomic: NetworkX for analyst notebooks; Postgres for the SDMX-shaped observation store; rdflib for federated SPARQL across IMF / BIS / OECD endpoints.
- Energy: Neo4j or AGE for the CIM-style topology; Postgres for SCADA observations and EIA cross-walks; NetworkX for contingency-analysis prototypes.
A Layered Stack — All Four Working Together
A worked end-to-end pattern for a representative system — say, a hospital research analytics platform — uses every tool in this chapter:
┌─────────────────────────────────────────────────────────────┐
│ Research notebooks (analysts) ──> NetworkX │
│ │
│ Operational EHR API ──> HAPI-FHIR │
│ │ │
│ ▼ │
│ Operational data store ──> PostgreSQL │
│ + JSONB │
│ + pgvector │
│ + Apache AGE │
│ │ │
│ ▼ │
│ Standard-ontology interop layer ──> rdflib + Jena │
│ (SNOMED CT, ICD-10 subsets) SPARQL endpoint │
│ │
│ Large-graph analytics ──> Neo4j Community │
│ (patient-referral networks, etc.) │
└─────────────────────────────────────────────────────────────┘
Each layer plays to its strengths: FHIR for clinical interop; Postgres for operational reliability; rdflib for standards-grade interchange; Neo4j for serious graph analytics; NetworkX for the analyst’s daily work. Building all of this at once is a year-plus engineering investment; building it piecemeal as need demands is the realistic path.
Does a four-engineer fintech already running Postgres need Neo4j? Almost certainly not yet. The right call is to add Apache AGE to their existing Postgres — it gives them Cypher and property-graph semantics without operating a second database. Neo4j becomes the right choice only when the graph workload genuinely outgrows what AGE on Postgres can handle (typically beyond ~50 million richly-connected nodes with frequent multi-hop queries) or when they need Neo4j-specific features (the Graph Data Science library, Bloom visualisation, AuraDB managed hosting). For a 4-engineer fintech, the operational cost of a separate graph DB exceeds the marginal capability for years.
Chapter Wrap-up
Four tools, four sweet spots:
- NetworkX — prototypes, teaching, in-memory work.
- rdflib + Jena / GraphDB — formal semantic web stack; OWL reasoning; FIBO/SNOMED/SDMX interop.
- Neo4j / Apache AGE — production property graphs; Cypher; the GQL future.
- PostgreSQL + JSONB + pgvector — pragmatic hybrid storage; the default operational layer.
Production stacks combine 2–4 of these depending on scale, interop needs, and engineering budget. The decision table above is enough to defend a choice in a design review.
Chapter 7 takes up the question of how to query the ontology you have just built — SPARQL for the RDF stack, Cypher for property graphs, GraphQL for the API layer, and SQL for the relational projections. Same question, four idioms; choosing the right one is part of the craft.