Chapter 3: Grain, Time, and Point-in-Time Correctness
Chapter Introduction
Chapter 2 modelled the domain as if it were frozen — object types, properties, links, actions defined at a single instant. Real domains are not frozen. A customer’s address changes; a loan moves from current to delinquent; a product price is updated; a patient’s diagnosis is revised after new lab results. These changes are not bugs to be papered over — they are the operational story of the business. Capturing them faithfully without breaking auditability is the single hardest practical skill in data modelling, and it is the subject of this chapter.
Three concepts run through the entire discussion:
- Grain — the unit of analysis. Is one row per customer? Per customer-per-month? Per claim-line? Choosing the right grain determines what every downstream model can and cannot say.
- Slowly-Changing Dimensions (SCD) — the canonical patterns for representing change to an entity’s properties over time, due to Ralph Kimball. SCD1 overwrites, SCD2 versions, SCD6 hybridises. Each pattern is right for some problems and wrong for others.
- Point-in-time (PIT) correctness — the discipline of ensuring that any feature value used at training or scoring time was actually available at that point in calendar time. Violations are look-ahead bias, and they are the most common silent killer of production analytics — invisible in code review, fatal in production.
These three sit beneath every other modelling decision you will make in this book. They are also the concepts most often glossed over in introductory data-engineering courses, with the consequence that practitioners learn them by blowing up a production model and being forced to investigate. This chapter exists to prevent that blow-up.
Table of Contents
- What Is Grain, and Why It Decides Everything
- Slowly-Changing Dimensions — SCD1, SCD2, SCD6
- Event Sourcing — When the History Is the Truth
- Point-in-Time Correctness — the Silent Killer
- As-Of Joins — the Right Way to Build a Feature
- Bitemporal Modelling — Valid Time vs. System Time
- Operational Patterns and Tools
What Is Grain, and Why It Decides Everything
The grain of a dataset is the answer to “what does one row represent?” The question sounds trivial. It is not. Choosing the wrong grain is the single most expensive modelling mistake on a real project, because it propagates into every downstream feature, every model, and every report — and cannot be fixed by joining more data later.
Consider a retail business. Possible grains for “sales data”:
- Order grain — one row per order. Properties: order date, total amount, customer, payment method.
- Order-line grain — one row per product within an order. Properties: order_id, product_id, quantity, line amount.
- Customer-day grain — one row per customer per calendar day. Properties: total spend that day, orders count, return count.
- Customer-month grain — one row per customer per calendar month.
- Customer-lifetime grain — one row per customer, aggregating all-time properties.
Each grain answers different questions naturally and others awkwardly:
| Question | Natural at grain… | Awkward / impossible at… |
|---|---|---|
| “What products are bought together?” | order-line | customer-month |
| “Which day of the week sees most orders?” | order | customer-lifetime |
| “What is each customer’s lifetime value?” | customer-lifetime (or aggregated) | order-line raw |
| “How does spending change month-to-month?” | customer-month | order |
| “Which payment methods generate refunds?” | order (with refund flag) | customer-month |
The discipline:
- State the grain explicitly in the ontology canvas (Chapter 2 §❸). Every dataset has one grain; if your “dataset” has rows at different grains you have two datasets disguised as one.
- Match the grain to the question. If the question is “which customers churn?” the customer-month grain is natural. Loading the order-line data and grouping every time is wasteful; storing pre-aggregated customer-month rows is correct.
- Never discard the finest grain lightly. Going from order-line to order is a deterministic aggregation; going from order to order-line would mean recovering information that was thrown away, which is impossible. Keep the finest-grain dataset that the question and storage budget allow.
The lesson: each grain throws away something the level below preserves. Choose the grain that preserves what your downstream needs and no more. Aggregating up is cheap; recovering detail is impossible.
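To make the asymmetry concrete, here is a minimal pandas sketch with hypothetical columns: order-line rows roll up deterministically to customer-month, but nothing can reverse the trip.

```python
import pandas as pd

# Order-line grain: one row per product within an order (hypothetical data).
order_lines = pd.DataFrame({
    "order_id":    [1, 1, 2, 3],
    "product_id":  ["p1", "p2", "p1", "p3"],
    "customer_id": ["A", "A", "A", "B"],
    "order_date":  pd.to_datetime(["2024-01-05", "2024-01-05",
                                   "2024-02-10", "2024-01-20"]),
    "line_amount": [20.0, 35.0, 15.0, 50.0],
})

# Rolling up to customer-month grain is cheap and deterministic...
customer_month = (
    order_lines
    .assign(month=order_lines["order_date"].dt.to_period("M"))
    .groupby(["customer_id", "month"], as_index=False)
    .agg(total_spend=("line_amount", "sum"),
         order_count=("order_id", "nunique"))
)

# ...but nothing computed on customer_month can recover which products
# were bought together: that detail exists only at order-line grain.
```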
Slowly-Changing Dimensions — SCD1, SCD2, SCD6
Ralph Kimball’s SCD patterns (from his *The Data Warehouse Toolkit*, first published 1996 and still definitive) are the canonical solutions for representing change in entity attributes. Several numbered patterns exist; three matter most in practice.
SCD1 — overwrite
The simplest pattern. When an attribute changes, the new value replaces the old. No history is kept.
When SCD1 is correct: corrections (typos, data-quality fixes), and attributes whose history you genuinely don’t care about (an internal “favourite colour” preference, a re-keyed system identifier). When SCD1 is wrong: any attribute against which you might need to reproduce a historical decision or report. If a regulator asks “what was this customer’s address when the loan was approved?”, SCD1 cannot answer.
SCD2 — versioned with valid_from / valid_to
Every attribute change creates a new row. Each row carries a validity period.
When SCD2 is right: any attribute whose history matters for reasoning, reporting, or auditing. Customer address (for KYC), product price (for revenue restatement), patient diagnosis (for clinical care), policy terms (for claims adjudication). SCD2 is the default for any attribute on any business-critical dimension.
The cost: more rows, more storage, more join complexity. The benefit: every historical question is answerable from a single dimension table without auxiliary archives.
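A minimal sketch of the SCD2 mechanics in pandas, with hypothetical column names: a change closes the open version and appends a new one.

```python
import pandas as pd

HIGH_DATE = pd.Timestamp("9999-12-31")   # sentinel meaning "still current"

# Hypothetical SCD2 dimension: one row per version of each customer.
dim = pd.DataFrame({
    "customer_id": ["A"],
    "address":     ["Hong Kong"],
    "valid_from":  [pd.Timestamp("2023-01-01")],
    "valid_to":    [HIGH_DATE],
})

def scd2_change(dim, customer_id, attr, new_value, change_date):
    """Close the open row for this customer and append a new version."""
    open_mask = (dim["customer_id"] == customer_id) & (dim["valid_to"] == HIGH_DATE)
    current = dim.loc[open_mask].iloc[0]
    if current[attr] == new_value:
        return dim                                  # no change, no new version
    dim = dim.copy()
    dim.loc[open_mask, "valid_to"] = change_date    # close the old version
    new_row = current.copy()
    new_row[attr] = new_value
    new_row["valid_from"] = change_date
    new_row["valid_to"] = HIGH_DATE
    return pd.concat([dim, new_row.to_frame().T], ignore_index=True)

dim = scd2_change(dim, "A", "address", "Singapore", pd.Timestamp("2024-05-01"))
# dim now holds two rows: Hong Kong (2023-01-01 -> 2024-05-01) and
# Singapore (2024-05-01 -> 9999-12-31).
```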
SCD6 — current + history hybrid
A hybrid of SCD1, SCD2, and SCD3 (hence “6”, from 1+2+3): each row carries the versioned history plus a denormalised “current” copy of the most-recent attributes. Two columns per attribute: `email_at_time` (historical) and `email_current` (always the latest). Useful when downstream consumers frequently ask both “what was this customer’s email when this happened?” and “what is their email right now?” and you want to avoid an extra join. Slightly more storage; meaningfully faster queries.
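Continuing the SCD2 sketch above, the “current” columns can be derived with a single self-join; `address_current` here is a hypothetical name:

```python
# Broadcast the open version's value onto every historical row.
current = dim.loc[dim["valid_to"] == HIGH_DATE, ["customer_id", "address"]]
dim6 = dim.merge(current.rename(columns={"address": "address_current"}),
                 on="customer_id", how="left")
# Every historical row now carries both address (as of that version)
# and address_current (always the latest), so no extra join at query time.
```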
Implementation realities
- Tools like dbt snapshots, Foundry’s incremental builds, and Databricks’ `MERGE INTO` automate SCD2 maintenance: feed them a snapshot at time \(t\), and they detect changes and append new versioned rows.
- The classic mistake: deleting an SCD2 row to “fix” the history. Don’t. Add a new row with the corrected values and a `correction` flag; preserve the audit trail.
- The other classic mistake: applying SCD2 to every attribute regardless of business need, producing dimension tables with 10× the rows they need. Pick SCD2 deliberately, attribute by attribute.
Event Sourcing — When the History Is the Truth
SCD2 stores the current and historical state of an entity. Event sourcing stores the list of events that produced the state. The current state is computed by replaying events. Two flavours of the same idea coexist in real systems:
- State-oriented (SCD2 dimensions, snapshot-style fact tables). Easy to query, harder to extend.
- Event-oriented (append-only log of `CustomerCreated`, `AddressChanged`, `OrderPlaced` events). Harder to query, trivially extensible, perfect audit trail.
Most modern systems use both: an event log as the source of truth, with state-oriented projections derived from it. Kafka with ksqlDB (or the wider Confluent platform), Apache Pulsar, and EventStoreDB are the usual production tools.
Event sourcing’s killer feature: any past state is reconstructable by replay. The audit question “what did the system know about this customer when this decision was made?” has a single, correct, automatic answer. The cost: every new query starts from “fold this event log,” which requires good projection infrastructure.
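A minimal sketch of the fold, with hypothetical event shapes: state is never stored, only computed by replaying the log up to a chosen timestamp.

```python
from datetime import date

# Append-only event log (hypothetical event shapes).
events = [
    {"ts": date(2023, 1, 1), "type": "CustomerCreated", "id": "A", "address": "Hong Kong"},
    {"ts": date(2024, 5, 1), "type": "AddressChanged",  "id": "A", "address": "Singapore"},
    {"ts": date(2024, 6, 3), "type": "OrderPlaced",     "id": "A", "amount": 120.0},
]

def replay(events, customer_id, as_of):
    """Fold the log into this customer's state as of a given date."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["id"] != customer_id or e["ts"] > as_of:
            continue
        if e["type"] == "CustomerCreated":
            state = {"id": e["id"], "address": e["address"], "orders": 0}
        elif e["type"] == "AddressChanged":
            state["address"] = e["address"]
        elif e["type"] == "OrderPlaced":
            state["orders"] += 1
    return state

replay(events, "A", date(2024, 5, 15))
# {'id': 'A', 'address': 'Singapore', 'orders': 0} -- the June order is invisible.
```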
Point-in-Time Correctness — the Silent Killer
The most common, most damaging, and most invisible bug in real-world analytics: training a model on features that were not actually available at the time they pretend to be. The failure mode is called look-ahead bias or information leakage, and it routinely turns a 70%-accurate-in-backtest model into a useless production model.
The mechanism is mundane. You join a “feature snapshot” to a “label snapshot” on `customer_id`, but the feature snapshot was generated today, with knowledge that didn’t exist when the label was created. A `current_balance` feature joined to a 2024-Q1 default label encodes 2026 knowledge into 2024 data. The model learns the leakage and looks brilliant; in production, where the feature genuinely has only real-time information, the model fails.
The bug above is so common that every major data-platform vendor has a section in their documentation specifically warning against it. The fix is a point-in-time-correct join — for each label row, fetch the feature value as it would have been at that label’s observation date.
The single line that does the work is `pd.merge_asof(..., direction="backward")`: a temporal left join that respects causality (a sketch follows below). Spark, DuckDB, Snowflake, BigQuery, and Polars all have an equivalent (several engines spell it `ASOF JOIN`; Foundry’s UI calls this an “Effective As-Of” join).
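A minimal sketch of the point-in-time-correct join, with hypothetical frames: each label row picks up the latest feature value at or before its observation date, never after.

```python
import pandas as pd

# Feature snapshots: one row per (customer, snapshot date), hypothetical values.
features = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "snapshot_ts": pd.to_datetime(["2024-01-31", "2024-03-31", "2024-02-29"]),
    "balance":     [1_000.0, 250.0, 4_000.0],
}).sort_values("snapshot_ts")

# Labels: one row per (customer, observation date).
labels = pd.DataFrame({
    "customer_id":    ["A", "B"],
    "observation_ts": pd.to_datetime(["2024-02-15", "2024-03-10"]),
    "defaulted":      [0, 1],
}).sort_values("observation_ts")

# For each label, the most recent snapshot at or before the observation date.
train = pd.merge_asof(
    labels, features,
    left_on="observation_ts", right_on="snapshot_ts",
    by="customer_id", direction="backward",
)
# Customer A's 2024-02-15 label sees the 1000.0 balance from 2024-01-31,
# not the 250.0 from 2024-03-31 -- that value did not exist yet.
```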
Make this idiom the default reflex. Whenever you build a feature for a labelled observation, ask: would this feature value have been available in production at the label’s observation timestamp? If you cannot defend “yes” from the data architecture, you have a leak.
A subtler variant of the same leak: a `tenure_months` column computed today, joined to a churn label that fired in March 2025. The analyst is joining “tenure as of today”, which is post-churn, to a March-2025 label. The model learns “longer tenure → less churn” partly because customers who didn’t churn have continued to accrue tenure since the label, while churned customers’ tenure froze. Fix: replace `tenure_months` with `tenure_at_observation_months = observation_date − account_open_date`, computed at the label’s observation date rather than at refresh time.
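A minimal sketch of the fix, assuming hypothetical `account_open_date` and `observation_date` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "account_open_date": pd.to_datetime(["2020-06-01", "2024-01-01"]),
    "observation_date":  pd.to_datetime(["2025-03-31", "2025-03-31"]),  # label dates
})

# Leaky: tenure measured at feature-refresh time, after the label fired.
df["tenure_months"] = (pd.Timestamp.now() - df["account_open_date"]).dt.days // 30

# Point-in-time correct: tenure frozen at the label's observation date.
df["tenure_at_observation_months"] = (
    (df["observation_date"] - df["account_open_date"]).dt.days // 30
)
```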
As-Of Joins — the Right Way to Build a Feature
`merge_asof` is the most important data-engineering primitive in this entire book. It deserves its own section because the cases for using it extend well beyond the obvious feature-engineering one.
Common as-of join patterns
- Snapshot to event: at each event, fetch the most-recent snapshot of an entity (current credit score at the time of a loan application).
- Event to event: at each “high-priority” event (e.g., a fraud alert), fetch the most-recent “low-priority” event (e.g., the last transaction before the alert).
- Reference data to fact: at each transaction, fetch the FX rate or interest rate that was in effect at that exact timestamp.
- Cross-entity rolling features: at each customer event, fetch the most-recent state of their related entities (their employer’s stock price, their region’s macroeconomic indicator).
Take the reference-data pattern above: without the as-of join, every trade would be priced at “the rate now”, wrong by potentially many basis points and wrong in a way the model would learn to exploit. The same idiom values every transaction in every multi-currency book in every bank in the world.
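A minimal sketch of that pattern with `merge_asof`, using hypothetical rates and trades:

```python
import pandas as pd

# FX rate fixings as published (hypothetical values).
rates = pd.DataFrame({
    "ts":     pd.to_datetime(["2024-06-03 09:00", "2024-06-03 12:00"]),
    "eurusd": [1.0850, 1.0872],
}).sort_values("ts")

# Trades to be valued in USD.
trades = pd.DataFrame({
    "ts":         pd.to_datetime(["2024-06-03 10:30", "2024-06-03 12:05"]),
    "eur_amount": [1_000_000.0, 500_000.0],
}).sort_values("ts")

# Each trade picks up the last rate published at or before its timestamp,
# never a rate from its future.
priced = pd.merge_asof(trades, rates, on="ts", direction="backward")
priced["usd_amount"] = priced["eur_amount"] * priced["eurusd"]
```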
Bitemporal Modelling — Valid Time vs. System Time
There is a second time dimension hiding in many real systems: the difference between when an event was true in the world and when the system learned about it. Bitemporal modelling makes both explicit.
- Valid time (a.k.a. effective time, business time) — when the fact is true in the world. “The customer’s address changed from Hong Kong to Singapore on 2024-05-01.”
- System time (a.k.a. transaction time, ingestion time) — when the system recorded the fact. “We received the address-change notification on 2024-05-15.”
Most operational systems care only about valid time. Audit, compliance, and reconciliation require both: “What did the system think the customer’s address was on 2024-05-10?” The answer is Hong Kong — even though the truth, retroactively, is Singapore. The system at 2024-05-10 did not know yet.
Bitemporal storage keeps both intervals, valid and system, on every row.
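A minimal pandas sketch, with hypothetical column names, reconstructing the address example above:

```python
import pandas as pd

HIGH = pd.Timestamp("9999-12-31")

# Row 0: what the system believed from 2023-01-02 until the change arrived.
# Rows 1-2: recorded on 2024-05-15, when the change (effective 2024-05-01) landed.
address = pd.DataFrame({
    "customer_id": ["A", "A", "A"],
    "address":     ["Hong Kong", "Hong Kong", "Singapore"],
    "valid_from":  pd.to_datetime(["2023-01-01", "2023-01-01", "2024-05-01"]),
    "valid_to":    [HIGH, pd.Timestamp("2024-05-01"), HIGH],
    "system_from": pd.to_datetime(["2023-01-02", "2024-05-15", "2024-05-15"]),
    "system_to":   [pd.Timestamp("2024-05-15"), HIGH, HIGH],
})

def as_known_at(df, valid_at, system_at):
    """What did the system believe at system_at about the world at valid_at?"""
    return df[(df["valid_from"] <= valid_at) & (valid_at < df["valid_to"])
              & (df["system_from"] <= system_at) & (system_at < df["system_to"])]

# What did the system think on 2024-05-10?  -> Hong Kong (it didn't know yet)
as_known_at(address, pd.Timestamp("2024-05-10"), pd.Timestamp("2024-05-10"))

# What do we now know was true on 2024-05-10?  -> Singapore
as_known_at(address, pd.Timestamp("2024-05-10"), pd.Timestamp("2024-06-01"))
```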
Bitemporal modelling is standard practice in:
- Banking — regulatory reporting must reconstruct exactly what was reported at each historical filing date.
- Insurance — claims processing depends on what coverage was believed to be in force at the loss date, even after re-issues.
- Healthcare — clinical decisions must be auditable against what the chart said at the moment of the decision, not the corrected version.
- Trading — P&L attribution requires both “as-of-trade-date” and “as-of-reporting-date” views.
The complexity is real and the storage cost is real (rows multiply). The benefit is that the answer to every “what did we know when?” question exists in a single table without auxiliary archives.
Operational Patterns and Tools
The patterns of this chapter map to specific tools in production:
| Concept | Open-source tool | Foundry / Fabric equivalent |
|---|---|---|
| SCD2 maintenance | dbt snapshots, `MERGE INTO` (Spark / Delta / Iceberg) | Foundry incremental pipelines |
| Event sourcing | Kafka + Schema Registry, Apache Pulsar, EventStoreDB | Foundry event datasets |
| As-of joins | `pandas.merge_asof`, DuckDB `ASOF JOIN`, Polars `join_asof`, Snowflake `ASOF JOIN`, Spark window spec with `lag` | Foundry’s effective-as-of join builder |
| Bitemporal storage | XTDB, Datomic, Postgres + system-period extensions | Foundry’s bitemporal datasets |
| Lineage of point-in-time features | Feast feature store, Tecton, Hopsworks | Foundry Pipelines |
| Replay / time travel | Delta Lake `VERSION AS OF`, Iceberg snapshots, Snowflake Time Travel | Foundry dataset versions |
Three operational rules that prevent the most painful surprises:
- Always store the source timestamp. When records arrive from an upstream system, capture both the source-emitted timestamp and the ingestion timestamp. The pair is your only protection against late-arriving data corrupting historical features.
- Treat the dataset version as a feature input. If your model depends on `customer_dim_v3.2`, and an upstream change moves it to `v3.3` with subtly different SCD2 semantics, your model has effectively changed. Record the version in the lineage.
- Test the as-of join. A common failure: the developer wrote `direction="nearest"` when they meant `"backward"`. The build succeeds, the model trains, the bug ships. Write a unit test on a tiny known dataset that fails on the wrong join direction; a sketch follows this list.
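A minimal pytest-style sketch of such a test: on this tiny dataset, `direction="nearest"` would grab the later (future) snapshot and the assertion would fail.

```python
import pandas as pd

def test_feature_join_respects_causality():
    feats = pd.DataFrame({
        "ts": pd.to_datetime(["2024-01-01", "2024-03-01"]),
        "x":  [1.0, 2.0],
    })
    labels = pd.DataFrame({"ts": pd.to_datetime(["2024-02-20"])})

    joined = pd.merge_asof(labels, feats, on="ts", direction="backward")

    # 2024-02-20 is *nearer* to the 2024-03-01 snapshot, so a wrong
    # direction="nearest" would leak the future value 2.0;
    # "backward" must return the 2024-01-01 value 1.0.
    assert joined.loc[0, "x"] == 1.0
```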
Chapter Wrap-up
Time is the dimension where most data-modelling efforts quietly break. The patterns in this chapter are the corrective discipline:
- Grain — declare it, match it to the question, never silently mix grains.
- SCD1 / SCD2 / SCD6 — pick the right pattern per attribute; don’t default to SCD2 for everything, don’t default to SCD1 for anything that matters historically.
- Event sourcing — when the audit trail of what happened is more important than the state of what is, store the events and project the state from them.
- Point-in-time correctness — the single most-violated discipline in production analytics. Every feature must be answerable to “would this value have been available at the label’s timestamp?” If the answer is “no,” the model is leaking.
- As-of joins — `merge_asof`, `ASOF JOIN`, `join_asof` — the SQL/pandas primitives that operationalise point-in-time correctness.
- Bitemporal modelling — valid time + system time, for any domain where “what did the system know when?” is a real question.
Chapters 4 and 5 build on these foundations. Chapter 4 takes up the messy problem that comes before you can model anything: when records in your sources don’t agree on who is who. Chapter 5 walks through the standard published ontologies (FIBO, SNOMED CT, GICS, GS1) and the trade-offs of adopting vs. extending them.