
Chapter 6: Building the Ontology in Python

Chapter Introduction

The previous chapters argued for ontology-first analytics and walked through the published ontology stacks of four domains. This chapter is where the modelling becomes code. We move from which ontology to how to instantiate it — load object instances, declare typed links, attach properties, persist the result so other code can query it, and scale from a teaching-size hundred-object example to a production-size hundred-million-object deployment.

Python has four mature options for representing an ontology, each with its own sweet spot:

  • NetworkX — pure-Python graph library; ~100,000 nodes in memory; the right choice for prototyping, teaching, and any in-RAM analysis.
  • rdflib + RDF/OWL — the formal, semantic web stack; SPARQL queries; standards-grade interop. Slower; the right choice when you need formal ontology semantics (subsumption reasoning, OWL inference).
  • Neo4j and Apache AGE — property graphs at production scale. Cypher queries. ~10⁹ nodes; the default for ontology-backed applications.
  • PostgreSQL + pgvector + JSONB — pragmatic relational storage that doubles as a property graph and a vector store. Most-deployed in industry because the operations team already runs Postgres.

The right tool is rarely a single one; production stacks routinely combine three or four. A typical pattern: NetworkX in the research notebook, Neo4j or AGE in the application, pgvector for any embedding-driven lookups, rdflib for the FIBO / SNOMED interop layer.

The chapter walks each option end to end, then closes with a case-study pattern that uses each tool in a different layer of a single working pipeline. The four domain case studies recur throughout to keep the discussion concrete.


Table of Contents

  1. NetworkX — the In-Memory Workhorse
  2. rdflib and RDF / OWL — the Formal Semantic Stack
  3. Neo4j and Apache AGE — Property Graphs at Scale
  4. PostgreSQL + pgvector + JSONB — the Pragmatic Hybrid
  5. Choosing the Right Layer
  6. A Layered Stack — All Four Working Together

NetworkX — the In-Memory Workhorse

NetworkX is the Swiss-army knife. Pure Python, zero infrastructure, in-memory only. The data structures are dictionaries; every property is JSON-serialisable; the algorithms cover everything from BFS to graph-isomorphism. Production code at every quant fund, hospital research team, and macro shop runs NetworkX somewhere in the prototype-to-research pipeline.

Object types as attributed nodes; links as attributed edges
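A sketch of such a cell, assuming networkx is installed (country codes, indicator names, and values are illustrative):

```python
import networkx as nx

G = nx.MultiDiGraph()  # directed, parallel edges allowed

# Object types as attributed nodes
G.add_node("AUS", type="Country", name="Australia")
G.add_node("USA", type="Country", name="United States")
G.add_node("CPI_YOY", type="Indicator", unit="% y/y", frequency="M")

# Observation events as attributed edges: country -> indicator
G.add_edge("AUS", "CPI_YOY", key="2024-01", type="HAS_OBSERVATION",
           period="2024-01", value=3.4)
G.add_edge("AUS", "CPI_YOY", key="2024-02", type="HAS_OBSERVATION",
           period="2024-02", value=3.1)
G.add_edge("USA", "CPI_YOY", key="2024-02", type="HAS_OBSERVATION",
           period="2024-02", value=3.2)

def latest_value(g, country, indicator):
    """Derived property: the most recent observation linking two objects."""
    obs = [d for _, v, d in g.edges(country, data=True)
           if v == indicator and d.get("type") == "HAS_OBSERVATION"]
    return max(obs, key=lambda d: d["period"])["value"] if obs else None

latest_value(G, "AUS", "CPI_YOY")   # → 3.1
```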

That cell — under 40 lines — is a functional macro ontology: country and indicator object types, observation events linking them, a derived “latest value” function. The same pattern scales linearly to ~50 indicators × 250 countries × 30 years of monthly data — 4.5 million observations — in maybe 4 GB of RAM. Past that you persist (next sections).

Built-in graph algorithms

These algorithms — PageRank, shortest path, community detection, betweenness — are how an ontology-aware analyst answers questions like “which provider is the central referrer in this network?” or “which generating unit is the most-pivotal in this grid topology?”
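On a toy referral network (provider names are illustrative), those questions become one-liners:

```python
import networkx as nx

# Toy referral network: edges point from referring to receiving provider
R = nx.DiGraph()
R.add_edges_from([
    ("gp_smith", "cardio_lee"), ("gp_jones", "cardio_lee"),
    ("gp_patel", "cardio_lee"), ("gp_smith", "ortho_kim"),
    ("cardio_lee", "surgeon_wu"),
])

pr = nx.pagerank(R)                         # influence score per provider
btw = nx.betweenness_centrality(R)          # who brokers the referral flow
hub = max(btw, key=btw.get)                 # 'cardio_lee' here
path = nx.shortest_path(R, "gp_smith", "surgeon_wu")
```

On this graph every GP-to-surgeon path runs through cardio_lee, so betweenness singles it out as the central referrer.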

When NetworkX runs out

NetworkX is single-machine and single-threaded. Past ~10 million nodes it slows down enough that an interactive analyst’s experience degrades. Past ~50 million nodes it requires a persisted backend. Past ~500 million the right tool is a distributed graph engine (Spark GraphFrames, Pregel-style systems, or a managed service).

rdflib and RDF / OWL — the Formal Semantic Stack

When you need formal semantic reasoning — automatic subsumption (“a CommonStock is-a Equity is-a Security”), OWL property constraints, SPARQL across federated graphs, or compatibility with the W3C Linked Data ecosystem — rdflib is the standard Python library. The data model is triples: (subject, predicate, object). Everything is a triple; queries are SPARQL.

The rdflib advantage:

  • Every public ontology (FIBO, SNOMED CT, ICD-10, SDMX, GICS-OWL versions, CIM-OWL versions) ships as RDF/OWL files. You can Graph().parse("fibo-securities.ttl") and start querying with no transformation.
  • W3C standards apply: URIs as identifiers, SPARQL as the query language, OWL as the constraint language.
  • Federated SPARQL lets you query across graphs hosted at different endpoints.

The rdflib disadvantage:

  • Slower than NetworkX for in-memory work.
  • The triple paradigm is initially unfamiliar — every property of an object is a separate triple, so a 10-property object is 10 triples.
  • Tooling is less mature than the property-graph stack.

Production rdflib usage adds: persistent stores (BerkeleyDB, SQLAlchemy back-end, GraphDB, Apache Jena Fuseki), SPARQL endpoints, reasoners that infer triples (OWL RL or OWL DL semantics), and validators (SHACL — Shapes Constraint Language — for data-quality rules).

When rdflib is the right answer
  • You consume or publish public ontologies that ship as RDF (FIBO, SNOMED CT subsets, SDMX agency-codes, GICS-OWL).
  • Your data is naturally graph-shaped and needs formal semantics (subsumption, equivalence, property restrictions).
  • You operate inside a Linked Data ecosystem (W3C, EU’s Data Spaces, certain government open-data programmes).
  • You need cross-organisation federated queries.

When not to use it
  • Performance-critical applications. Property graphs (Neo4j, AGE) are typically 10–100× faster.
  • Pure prototyping. NetworkX is easier.
  • Pure relational analytics. Postgres + JSONB is more familiar.

Neo4j and Apache AGE — Property Graphs at Scale

A property graph is a graph where nodes and edges both carry typed labels and arbitrary key/value properties. The model is friendlier than RDF for most engineers — fewer indirections, no triple-explosion, native support for “an edge has properties.”
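NetworkX is itself a property graph in miniature, which makes the contrast with triples easy to see: the listing below stores a labelled, multi-property edge as one object rather than a reified bundle of triples (identifiers and dates are illustrative):

```python
import networkx as nx

# Property-graph model in miniature: typed labels plus arbitrary
# key/value properties on both nodes and edges
pg = nx.MultiDiGraph()
pg.add_node("sec:AAPL", label="Equity", ticker="AAPL", currency="USD")
pg.add_node("venue:XNAS", label="Venue", mic="XNAS")
pg.add_edge("sec:AAPL", "venue:XNAS", key="listing-1",
            label="LISTED_ON", since="1980-12-12", lot_size=100)

edge = pg["sec:AAPL"]["venue:XNAS"]["listing-1"]
print(edge["label"], edge["since"])   # the edge itself carries properties
```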

The two production options:

  • Neo4j — the standalone graph database. Cypher query language. Mature; on-premise or managed (AuraDB). Used by every social-network analytics team, fraud-detection vendor, supply-chain risk team, and recommendation engine you have heard of.
  • Apache AGE — a PostgreSQL extension that adds property-graph capability to an existing Postgres database. Cypher and SQL in the same query. Used at firms that want a graph but don’t want to operate a separate database.

Modelling in Neo4j / Cypher

Cypher’s syntax is intentionally pattern-like — you draw the pattern you want to match. The classic example:

// Create a small generator → substation topology
CREATE
  (g:Generator {name: 'Bayswater-1', capacity_MW: 660}),
  (s1:Substation {name: 'Bayswater-SUB', voltage_kV: 220}),
  (s2:Substation {name: 'Sydney-SUB',    voltage_kV: 220}),
  (l:ACLineSegment {capacity_MW: 900, length_km: 145}),
  (g)-[:FEEDS]->(s1),
  (s1)-[:CONNECTED_VIA]->(l),
  (l)-[:CONNECTED_TO]->(s2);

// Find every substation reachable from Bayswater-1 within 3 hops
MATCH (g:Generator {name: 'Bayswater-1'})-[:FEEDS|CONNECTED_VIA|CONNECTED_TO*1..3]-(s:Substation)
RETURN DISTINCT s.name;

Cypher is read top-to-bottom as a pattern; [:REL_TYPE*1..3] means “follow one to three hops along that relationship type.” The Neo4j query planner turns the pattern into an efficient traversal.

Cypher is also the language of Apache AGE (via the openCypher specification) and was adopted as the basis of the ISO GQL standard (the SQL-equivalent for graph databases, ratified in 2024). Learning Cypher is the path of least regret in the property-graph space.

Modelling in AGE / Postgres

AGE runs Cypher inside Postgres, wrapped in a cypher() table function. A trimmed version of the CREATE statement above:

SELECT * FROM cypher('grid_graph', $$
  CREATE
    (g:Generator {name: 'Bayswater-1', capacity_MW: 660}),
    (s:Substation {name: 'Bayswater-SUB'}),
    (g)-[:FEEDS]->(s)
$$) AS (result agtype);

The huge advantage of AGE for an enterprise team: the graph lives in the same Postgres instance as the relational tables, the BI dashboards, the audit log, and the user-management table. One backup, one ops team, one ACL model.

When property graphs beat RDF
  • Performance is a priority.
  • Your engineers know SQL but not SPARQL.
  • You need ad-hoc property additions without schema migration.
  • You want to mix relational and graph data freely.
  • You need to deploy on a managed cloud service (Neo4j AuraDB, AWS Neptune via openCypher, Azure Cosmos DB for Apache Gremlin).

The equity-listing lookup in Cypher:

MATCH (s:Equity)-[:LISTED_ON]->(v:Venue {mic: 'XNAS'}),
      (s)-[:ISSUED_BY]->(i)
RETURN s.ticker, i.lei;

Cleaner and faster on a real graph database; a NetworkX version of the same traversal is fine for proving the concept.

PostgreSQL + pgvector + JSONB — the Pragmatic Hybrid

The default operational stack at most enterprises already runs PostgreSQL. Four reasons it is often the best place to put your ontology too:

  • JSONB columns let you store arbitrary nested property bags without schema migration; perfect for the heterogeneous attributes object types accumulate over time.
  • pgvector extension adds first-class vector embeddings with ANN indexing — the same vector capability discussed in Volume I Chapter 8 — without leaving Postgres.
  • Apache AGE (above) layers property graphs into the same database.
  • Mature: backups, replication, security, audit, monitoring all just work.

A practical pattern:

-- Object types as tables with JSONB property columns
CREATE TABLE securities (
    security_id TEXT PRIMARY KEY,
    ticker      TEXT,
    figi        TEXT UNIQUE,
    isin        TEXT,
    lei         TEXT REFERENCES legal_entities(lei),
    properties  JSONB NOT NULL DEFAULT '{}',
    embedding   VECTOR(384),                  -- pgvector
    valid_from  DATE NOT NULL,
    valid_to    DATE
);

-- Indexes for fast lookup
CREATE INDEX ix_sec_lei      ON securities(lei);
CREATE INDEX ix_sec_props    ON securities USING gin(properties);   -- query into JSONB
CREATE INDEX ix_sec_embed    ON securities USING hnsw(embedding vector_cosine_ops);  -- ANN

The same table holds the canonical security record, every typed property (some in columns for speed, some in JSONB for flexibility), the SCD2 validity dates (Chapter 3), the FIBO-aligned LEI link, and a 384-d transformer embedding ready for similarity search.

The Postgres approach is the de-facto industry default: roughly 80% of the ontology systems in production at firms outside the largest Internet companies sit on PostgreSQL with JSONB + pgvector + (optionally) AGE. Specialised graph databases are reserved for cases where the graph traversals genuinely outweigh the operational simplicity of Postgres.

Choosing the Right Layer

Need                                          Right tool
----                                          ----------
Prototype, teaching, < 1M nodes               NetworkX
Consume / publish public OWL ontologies       rdflib + Apache Jena / GraphDB
Formal reasoning (subsumption, OWL DL)        rdflib + a reasoner
Large native graph, many traversals           Neo4j or AWS Neptune
Graph + SQL in one place                      Apache AGE on Postgres
Operational simplicity, JSONB flexibility     Postgres + JSONB + pgvector
Vector similarity + relational + graph        Postgres + pgvector + AGE
Distributed, > 1B nodes                       Spark GraphFrames / TigerGraph / managed service
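The table reads naturally as a small decision function. A toy encoding (cut-offs and labels are illustrative, not prescriptive):

```python
def pick_layer(n_nodes: int, needs=frozenset()) -> str:
    """Toy encoding of the decision table above (illustrative cut-offs)."""
    if {"owl_reasoning", "public_owl"} & set(needs):
        return "rdflib (+ Jena / GraphDB)"
    if n_nodes > 1_000_000_000:
        return "distributed: Spark GraphFrames / TigerGraph"
    if "heavy_traversals" in needs:
        return "Neo4j (or AWS Neptune)"
    if {"sql", "vectors", "graph+sql"} & set(needs):
        return "Postgres + JSONB + pgvector (+ AGE)"
    return "NetworkX"

pick_layer(100_000)                           # → 'NetworkX'
pick_layer(50_000_000, {"heavy_traversals"})  # → 'Neo4j (or AWS Neptune)'
```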

Most production stacks combine 2–4 of these. Common production patterns by domain:

  • Quant trading: Postgres + AGE + pgvector for the working store; rdflib for FIBO interop with regulators; NetworkX in research notebooks; Spark for historical-data jobs.
  • Healthcare: FHIR server (HAPI-FHIR, often Postgres-backed) for the operational layer; Neo4j for clinical-pathway analytics; rdflib for SNOMED CT subset interop.
  • Macroeconomic: NetworkX for analyst notebooks; Postgres for the SDMX-shaped observation store; rdflib for federated SPARQL across IMF / BIS / OECD endpoints.
  • Energy: Neo4j or AGE for the CIM-style topology; Postgres for SCADA observations and EIA cross-walks; NetworkX for contingency-analysis prototypes.

A Layered Stack — All Four Working Together

A worked end-to-end pattern for a representative system — say, a hospital research analytics platform — uses every tool in this chapter:

┌──────────────────────────────────────────────────────────────┐
│  Research notebooks (analysts)         ──>  NetworkX         │
│                                                              │
│  Operational EHR API                   ──>  HAPI-FHIR        │
│                                              │               │
│                                              ▼               │
│  Operational data store                ──>  PostgreSQL       │
│                                             + JSONB          │
│                                             + pgvector       │
│                                             + Apache AGE     │
│                                              │               │
│                                              ▼               │
│  Standard-ontology interop layer       ──>  rdflib + Jena    │
│  (SNOMED CT, ICD-10 subsets)                SPARQL endpoint  │
│                                                              │
│  Large-graph analytics                 ──>  Neo4j Community  │
│  (patient-referral networks, etc.)                           │
└──────────────────────────────────────────────────────────────┘

Each layer plays to its strengths: FHIR for clinical interop; Postgres for operational reliability; rdflib for standards-grade interchange; Neo4j for serious graph analytics; NetworkX for the analyst’s daily work. Building all of this at once is a year-plus engineering investment; building it piecemeal as need demands is the realistic path.

Should a small team already running Postgres stand up Neo4j as well? Almost certainly not yet. The right call is to add Apache AGE to their existing Postgres — it gives them Cypher and property-graph semantics without operating a second database. Neo4j becomes the right choice only when the graph workload genuinely outgrows what AGE on Postgres can handle (typically beyond ~50 million richly-connected nodes with frequent multi-hop queries) or when they need Neo4j-specific features (the Graph Data Science library, Bloom visualisation, AuraDB managed hosting). For a 4-engineer fintech, the operational cost of a separate graph DB exceeds the marginal capability for years.

Chapter Wrap-up

Four tools, four sweet spots:

  • NetworkX — prototypes, teaching, in-memory work.
  • rdflib + Jena / GraphDB — formal semantic web stack; OWL reasoning; FIBO/SNOMED/SDMX interop.
  • Neo4j / Apache AGE — production property graphs; Cypher; the GQL future.
  • PostgreSQL + JSONB + pgvector — pragmatic hybrid storage; the default operational layer.

Production stacks combine 2–4 of these depending on scale, interop needs, and engineering budget. The decision tree above is enough to defend a choice in a design review.

Chapter 7 takes up the question of how to query the ontology you have just built — SPARQL for the RDF stack, Cypher for property graphs, GraphQL for the API layer, and SQL for the relational projections. Same question, four idioms; choosing the right one is part of the craft.


Prof. Xuhu Wan · HKUST ISOM · Domain Modelling in Python