lenatriestounderstand

Chapter 5 of 10

The Hindsight Corpus: Time in LLM Pretraining Data

Created May 13, 2026 Updated May 13, 2026

Ask a current model what mainstream economists thought about the prospects of a banking crisis in January 2008. The answer comes back lucid, structured, roughly correct — as a retrospective. It names the dissenters who'd been flagging derivatives risk, gestures at the housing bubble, tone-matches a knowledgeable observer who already knows what happened. What rarely comes out cleanly is the modal view of the field at that moment: that growth was slowing but the system was structurally fine. The model knows what happened. It also knows what was written about January 2008 by people who knew what happened.

This matters whenever we ask an LLM to answer not "what do we now know about 2008?" but "what could a reasonable person have believed in January 2008?" The first is knowledge retrieval and current models handle it well. The second is reconstruction of a contemporaneous mindset, and it is where the structure of the corpus starts to bite.

It is worth saying this clearly upfront, because the failure mode looks superficially like a familiar one. This is not the usual cutoff problem. The usual cutoff problem is that the model does not know the future. The hindsight corpus problem is that the model knows too much future about the past. The two need different fixes, and the obvious remediations for the first — RAG over recent sources, retraining on newer data — do nothing for the second.

The standard framing — trained on text written before date T — invites you to imagine a snapshot of human knowledge as of T. That picture is wrong in at least four independent ways, and they compose. Each one would distort the temporal worldview a model inherits; together they produce a corpus that systematically knows the future of its own past. The framing I'll use here is that all four are selection mechanisms on what makes it into the training mass — volume, curation, editing, preservation — and each acts differently on different parts of the timeline.

What "trained on text before T" actually is

A pretraining corpus is not a sample of "writing produced before T". It's a sample of writing accessible and retained at the moment of corpus construction, filtered by everything that decides accessibility and retention. Three things follow that aren't usually made explicit.

First, the corpus is sampled at construction time, not at writing time. A 2024 web crawl gets you the 2024 version of every still-extant URL, regardless of when the underlying text was first produced.

Second, much of the corpus has no reliable authoring date attached. HTTP headers, on-page metadata, and content heuristics all give different and unreliable answers. Crawl date is precise but is not authoring date.

Third, "before T" is a statement about the cutoff of ingestion, not about the contents. A document from 2008 that was last revised in 2023 enters the corpus as a 2023 document, semantically.

Once these three are spelled out the failure modes stop looking surprising.

Four distortions

Distortion | Mechanism | What it affects
Volumetric skew | More text produced (and retained) per recent year | Fluency, density, register coverage of old periods
Retrospective rewriting | Curated sources are present-tense snapshots edited over time | Pre-event states, contemporaneous mindsets
Timestamp unreliability | Most documents carry no usable authoring date | Any attempt to condition or filter by time
Survival bias | Old text is in the corpus only if curated, archived, copied | Representativeness of any pre-internet-era subset

Volumetric skew

The web has more text per year for recent years, by a wide margin. Common Crawl gets most of its mass from recent snapshots of the live web. Every source that grows year over year (arXiv, GitHub, Reddit, Stack Exchange) contributes more tokens for last year than for fifteen years ago, often by an order of magnitude or more. Even with deduplication and quality filtering, the resulting corpus is recency-weighted in tokens.

This isn't a flaw to fix in the abstract. It's a property the model inherits as a prior. Generations about older periods are noticeably less fluent and more error-prone partly because the model has had less practice writing in their register. A 1992 voice is harder to hit than a 2022 voice not because 1992 is intrinsically harder but because the model saw far fewer tokens written in it.

Retrospective rewriting

This is the more interesting distortion and Wikipedia is the canonical case. A Wikipedia article about a 2008 event was likely created shortly after the event with information available then, then revised hundreds of times since — incorporating later consequences, retrospective analyses, subsequent reframings. A training pipeline that crawled Wikipedia in 2024 captured only the latest revision. The model sees a 2008 event through its 2024 description. The 2024 description contains everything the 2008-as-it-was-lived account does not yet contain.

The same dynamic, to varying degree, acts on most curated content: textbooks, encyclopedias, "best of" lists, academic survey papers, news retrospectives, edited blog posts, GitHub READMEs and code, documentation sites. Anything written about a past event by someone who is now in the future of that event has been quietly contaminated by what they now know.

The result: most of the long tail of the model's temporal coverage is not contemporaneous text. It's present-tense snapshots of the past, narrated by people who already know what comes next.

Timestamp unreliability

Even when we want to weight or filter by document age, we mostly can't.

  • HTTP Last-Modified headers reflect server behavior, not authorial intent
  • HTML <time> and schema.org metadata are sparse and frequently wrong
  • Content-based date inference works for some long documents and fails on most short ones
  • Crawl date (when the document was archived) is precise but is not authoring date
  • Revisions of the same document share a URL; the version-at-crawl is one slice of an evolving body

So even when authoring date is decision-relevant, the corpus typically doesn't expose it cleanly. The documents that do come with reliable dates — academic papers, dated press releases, some blog post archives — are a non-representative slice biased toward formal publishing.
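One way to live with these signals is to keep all of them and resolve an authoring date by a documented trust order, recording which signal won. The sketch below is a minimal illustration under my own assumptions: the signal names and the trust ordering are hypothetical, not a standard, and a real pipeline would calibrate them per source.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical trust order, most trusted first. An illustrative assumption only.
TRUST_ORDER = [
    "publisher_metadata",
    "html_time_tag",
    "content_inference",
    "http_last_modified",
]

@dataclass
class DateEstimate:
    value: Optional[date]
    source: str  # name of the signal that produced the estimate, or "none"

def resolve_authoring_date(signals: dict, crawl_date: date) -> DateEstimate:
    """Pick the most trusted signal that is not later than the crawl date."""
    for source in TRUST_ORDER:
        candidate = signals.get(source)
        # A document cannot have been authored after it was crawled.
        if candidate is not None and candidate <= crawl_date:
            return DateEstimate(candidate, source)
    # Crawl date is only an upper bound, so report "unknown" instead of substituting it.
    return DateEstimate(None, "none")
```

The design choice worth noticing is the fallback: when no signal survives, the function returns an explicit unknown rather than the crawl date, so downstream filters cannot mistake archive time for authoring time.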

Survival bias

The pre-internet portion of any web-derived corpus is whatever was scanned, transcribed, or re-uploaded. The same filter applies to the early web itself: the 1995 web inside a 2024 crawl is not the 1995 web; it's the subset of 1995 content that someone deemed worth preserving in the intervening three decades. The criteria for preservation are not flat across topics, registers, or viewpoints.

This is structurally the same problem as survival bias in any historical corpus, and it has the same character: the missing material is exactly the material no one bothered to preserve, which is correlated with how mainstream, mundane, or low-status it was at the time.

Failure modes

Once the four distortions are explicit, several model behaviors stop being puzzling. None of these are exotic — they show up wherever temporal accuracy actually matters.

Anachronistic framing

The model retrojects later concepts onto earlier periods. Pre-2020 discussions of pandemic preparedness get COVID-shaped vocabulary; pre-2008 financial commentary gets GFC-shaped framing; pre-LLM AI writeups acquire transformer-flavored phrasing. The vocabulary isn't load-bearing — it's a tell that the conditioning is retrospective.

Detection: check whether terms introduced after period T appear in the model's analysis of T. Crude term-by-year baselines work well enough to flag this.
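A crude version of that check fits in a few lines. This is a sketch, not a validated detector: the `FIRST_USE_YEAR` table is a hand-typed placeholder with approximate years, and a real baseline would be estimated from a dated corpus.

```python
import re

# Illustrative placeholder baseline: term -> approximate year it entered use.
# These entries are assumptions for the example, not measured values.
FIRST_USE_YEAR = {
    "covid-19": 2020,
    "global financial crisis": 2008,
}

def flag_anachronisms(text, period_year, first_use=FIRST_USE_YEAR):
    """Return the terms in `text` whose first-use year is later than `period_year`."""
    lowered = text.lower()
    hits = []
    for term, year in first_use.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if year > period_year and re.search(pattern, lowered):
            hits.append(term)
    return sorted(hits)
```

Run over a model's analysis of 2006, for example, any hit is a sign the answer was written from after the fact, whatever the prompt said about the date.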

Hindsight calibration on forecasting

The model's confidence about historical outcomes is too high in a specific way: it knows what happened. Retrospective forecasting evaluations — predict event X, resolved in 2022, given context available in 2020 — are systematically inflated whenever the resolution date precedes the training cutoff, because the answers are in the training data.

Detection: split forecasting questions by resolution date around the plausible training cutoff and compare. The gap is often dramatic. A model that looks like a strong forecaster pre-cutoff can be near-random post-cutoff, which means most of the apparent skill is recall.
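The split-and-compare step can be sketched as follows, using the Brier score as the accuracy measure (my choice for the example; any proper scoring rule would do). The `Forecast` record and the assumed cutoff date are hypothetical inputs you would supply.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Forecast:
    resolved_on: date
    prob_yes: float  # probability the model assigned to "yes"
    outcome: int     # 1 if the question resolved yes, else 0

def brier(forecasts):
    """Mean squared error between probability and outcome; lower is better."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((f.prob_yes - f.outcome) ** 2 for f in forecasts) / len(forecasts)

def split_by_cutoff(forecasts, assumed_cutoff):
    """Partition questions by whether they resolved before or after the cutoff."""
    pre = [f for f in forecasts if f.resolved_on <= assumed_cutoff]
    post = [f for f in forecasts if f.resolved_on > assumed_cutoff]
    return pre, post

def recall_gap(forecasts, assumed_cutoff):
    """Post-cutoff Brier minus pre-cutoff Brier; a large positive gap suggests recall."""
    pre, post = split_by_cutoff(forecasts, assumed_cutoff)
    return brier(post) - brier(pre)
```

A forecaster that always answers 0.5 scores 0.25, so a post-cutoff score near that value alongside a much lower pre-cutoff score is the pattern described above.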

Mainstream-of-the-time absence

The model can usually surface what an iconoclast or dissenter said about a topic at time T, because that's exactly what gets cited in retrospect once they turned out to be right. The boring, modal, since-superseded mainstream view of the period — the one most professionals actually held — is much harder to recover. Curation amplifies prescience and erases consensus.

Detection: ask the model to reconstruct the modal view of a community at a specific past moment, then check whether what comes back is the modal view or a curated set of memorable minority positions.

Density collapse for older periods

Generation about older events is more error-prone, more hallucinated, more generic. Names, dates, and specifics degrade as you go back. This is the direct consequence of volumetric skew plus survival bias: less text, less varied text, less practice generating in that register.

Detection: fact-extraction or citation tasks on parallel topics across decades. Accuracy typically drops with depth into the past, and the drop is steeper than people anticipate.
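Once the parallel tasks are graded, the comparison is just accuracy bucketed by the decade of the event. A minimal sketch, assuming results arrive as hypothetical `(event_year, is_correct)` pairs:

```python
from collections import defaultdict

def accuracy_by_decade(results):
    """results: iterable of (event_year, is_correct). Returns {decade: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # decade -> [correct, total]
    for year, is_correct in results:
        decade = (year // 10) * 10
        totals[decade][0] += int(is_correct)
        totals[decade][1] += 1
    return {decade: c / n for decade, (c, n) in sorted(totals.items())}
```

Plotting the resulting dictionary against decade gives the slope of the drop directly; the interesting part is comparing that slope across topics held parallel.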

Concept anachronism

A concept that didn't exist in vocabulary form at time T is applied as if it had. Algorithmic bias applied to 1980s expert systems. Prompt engineering applied to GPT-2-era work. Burnout applied to 1950s working life. The model knows the concept and the era separately but doesn't track when the concept entered ordinary usage.

Detection: concept-introduction-date baselines. Check whether the model uses a term about a period before that term was in standard use within the relevant community.

Three probes for hindsight contamination

Three small experiments surface the failure modes above. Each holds the question constant and varies what the model is told about time. The first two need nothing but a prompt; the third requires a retrieval layer. Together they isolate which fix actually moves the conditioning and which only moves the surface.

Probe A — naive contemporaneous query.

As of January 2008, what did mainstream economists believe about systemic banking risk?

The expected failure: the answer comes back in the voice of a post-2009 retrospective. It foregrounds the dissenters, names credit instruments using post-crisis vocabulary, and treats the eventual collapse as already implicit in the contemporary evidence. The date in the prompt does not override the conditioning.

Probe B — explicit cutoff instruction.

Same question, prefixed with: Do not use any information from after January 2008.

This is the obvious fix, and it helps less than people expect. The model usually softens its tone and removes the most overtly post-2008 terminology, but the framing stays retrospective. There are no separate post-2008 weights to mask out: everything the model learned, from every period, is entangled in the same parameters, so the instruction can only shape what gets said, not what the model is conditioned on. What to look for: residual anachronisms in vocabulary or framing that survive the instruction, and how confident the model remains about a "consensus" it cannot actually reconstruct.

Probe C — contemporaneous retrieval.

Same question, constrained to a retrieval set of documents dated before January 2008 — archived FT, Economist, IMF/BIS commentary from late 2007.

This is the only one of the three that actually shifts the conditioning, because the answer is grounded in dated text rather than in the model's prior. The contrast between (B) and (C) is the most useful diagnostic in the set: it measures how much of the "as of T" capability comes from the model's internal time-conditioning (very little) versus the retrieval layer's curation (most of it).

What I look for across the three: lexical anachronisms (post-event terms appearing in pre-event analysis), consensus-confidence asymmetry (too-strong claims about what the mainstream view was, when the contemporary record was more diffuse), and named-dissenter inflation (Roubini-style warnings reported as if widely heeded, because that is what subsequent writing emphasizes). The gradient A → B → C is what the rest of the note is trying to explain.
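For concreteness, the three probes can be assembled from one question and one dated document set. This is a sketch of the prompt construction only: the `Doc` record, the prompt wording for probe C, and the date-prefix format are my assumptions, and the model call itself is left out.

```python
from dataclasses import dataclass
from datetime import date

QUESTION = ("As of January 2008, what did mainstream economists believe "
            "about systemic banking risk?")
CUTOFF_INSTRUCTION = "Do not use any information from after January 2008."

@dataclass
class Doc:
    published: date
    text: str

def build_probes(question, cutoff, docs):
    """Return prompts A, B, C; only C changes what the model is conditioned on."""
    # Keep only documents dated strictly before the cutoff.
    dated = [d for d in docs if d.published < cutoff]
    context = "\n\n".join(f"[{d.published.isoformat()}] {d.text}" for d in dated)
    return {
        "A": question,
        "B": f"{CUTOFF_INSTRUCTION}\n\n{question}",
        "C": f"Answer using only the sources below.\n\n{context}\n\n{question}",
    }
```

Calling `build_probes(QUESTION, date(2008, 1, 1), docs)` keeps the question identical across all three prompts, which is what makes the A, B, C comparison interpretable.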

What's missing from datasheets

The pretraining literature has gotten meaningfully better at reporting data composition — domain mix, source breakdown, dedup strategy, quality filters. Temporal structure is still mostly absent from these reports. The fields I'd want on every corpus card, in roughly priority order:

  • Per-source distribution of authoring dates (not crawl dates), with uncertainty
  • Per-source revision policy: is the included document the latest version of a mutable text, or a version-at-time?
  • Per-source temporal coverage gaps and known holes
  • Token-level weighting by document age, if any was applied
  • Crawl-to-cutoff lag distribution

I'm not aware of any major open corpus that reports all of this well, and proprietary corpora report it less. The result is that the temporal worldview a model inherits is essentially undocumented at the corpus level, which makes downstream claims about what the model "knows when" hard to interpret.
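To make the wish list concrete, here is what a per-source temporal card might look like as a serializable record. The field names are hypothetical, and every value in the example is a placeholder for a made-up source, not a measurement of any real corpus.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SourceTemporalCard:
    source: str
    authoring_year_token_share: dict  # year (as string) -> estimated share of tokens
    authoring_date_noise_model: str   # how uncertain those dates are, and why
    revision_policy: str              # "latest" or "version_at_time"
    known_gaps: list = field(default_factory=list)
    age_weighting: str = "none"       # any token-level weighting by document age
    crawl_to_cutoff_lag_days: dict = field(default_factory=dict)  # quantile -> days

# Placeholder values only; none of these numbers describe a real source.
example = SourceTemporalCard(
    source="example-encyclopedia",
    authoring_year_token_share={"2008": 0.02, "2023": 0.31},
    authoring_date_noise_model="page creation date only; later edits undated",
    revision_policy="latest",
    known_gaps=["no revisions retained before the crawl"],
    crawl_to_cutoff_lag_days={"p50": 90, "p90": 400},
)
card_json = json.dumps(asdict(example), indent=2)
```

Even a card this small would answer the question that matters most for the hindsight problem: whether a mutable source was ingested as its latest version or as a version-at-time.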

A practical consequence: when current evaluations probe temporal reasoning or historical knowledge, we usually don't know whether we're measuring the model's reasoning ability or the temporal slice of its corpus that happened to oversample examples like ours. Without temporal datasheets, ablations are hard to design and harder to interpret.

What time-aware pretraining would actually require

Pieces of this exist in the literature on temporal language models, time-aware QA, and continual pretraining for knowledge updating. None of it composes into a standard stack at frontier scale, as far as I can tell from outside. The components I'd want as a coherent layer:

1. Per-document date annotation at construction time. Best-effort timestamping with quality signals. Noisy dates are usable if the noise model is documented. A separate authoring-date and revision-date field, where the latter is available.

2. Time-conditioned training. Prepend a date token or embedding to each document during training. At inference, conditioning on a target date lets the model represent "as of date D, this is what the available text says". Existing temporal LM work suggests this is tractable; scaling it and proving it on standard benchmarks is the open part.

3. Temporal holdouts in evaluation. Holdout splits by date, not random. Train on pre-T, evaluate on post-T, and treat that gap as the relevant generalization target. Random splits leak the future into training; this is the eval-side mirror of the corpus-side problem and it inflates every benchmark that doesn't control for it.

4. Revision-aware ingestion for mutable sources. For Wikipedia, arXiv, GitHub, and anywhere else revision history is available, train on version-at-time rather than version-at-crawl, at least for a portion of the data. This is expensive: it multiplies a source by the depth of its revision history. It's the only mechanism I can think of that would actually let a model represent contemporaneous views rather than retrospective ones.

5. Stratified sampling by epoch. Counter volumetric recency skew with deliberate upsampling of older material, if the goal is balanced temporal coverage. Often it isn't — for most applications recency is the feature. But if you want a model that handles older periods evenly, you have to fight for it explicitly.
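Components 2 through 5 are each small at the data-handling level, which is part of why their absence is a cost question rather than a difficulty question. The sketch below shows one possible shape for each. Everything in it is an assumption for illustration: the tag format, the record layouts, and the use of exponent-smoothed sampling (a technique borrowed from multilingual data mixing, not something the temporal literature prescribes).

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

def date_tag(authored: Optional[date]) -> str:
    """Component 2: a year-month control tag; coarse on purpose, since dates are noisy."""
    if authored is None:
        return "<|date:unknown|>"
    return f"<|date:{authored.year:04d}-{authored.month:02d}|>"

def with_date_prefix(text: str, authored: Optional[date]) -> str:
    """Prepend the date tag so the model can learn to condition on it."""
    return f"{date_tag(authored)}\n{text}"

def temporal_split(records, cutoff: date):
    """Component 3: records are (authored_date_or_None, payload) pairs.
    Undated items go to neither side, since they might leak the future."""
    train, test, undated = [], [], []
    for authored, payload in records:
        if authored is None:
            undated.append(payload)
        elif authored < cutoff:
            train.append(payload)
        else:
            test.append(payload)
    return train, test, undated

def version_at(revisions, as_of: date):
    """Component 4: revisions are (revision_date, text) pairs.
    Returns the latest text dated on or before `as_of`, or None."""
    ordered = sorted(revisions, key=lambda r: r[0])
    idx = bisect_right([d for d, _ in ordered], as_of)
    return ordered[idx - 1][1] if idx > 0 else None

def epoch_sampling_weights(tokens_per_epoch: dict, alpha: float = 0.5) -> dict:
    """Component 5: sample epoch i with probability proportional to n_i ** alpha.
    alpha=1 keeps the natural recency skew; alpha=0 samples epochs uniformly."""
    scaled = {e: n ** alpha for e, n in tokens_per_epoch.items() if n > 0}
    total = sum(scaled.values())
    return {e: s / total for e, s in scaled.items()}
```

The expensive part is not visible here: `version_at` presumes the revision history was retained, and `date_tag` presumes an authoring date that, for most of the web, has to be estimated first.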

None of these ideas is new. The reason they aren't standard at frontier scale is, I think, that the cost is real and the benefit on standard benchmarks is small. The distortions described above don't show up on MMLU. They show up on specific applications (historical analysis, forecasting, anachronism-sensitive writing, period reconstruction) that aren't the dominant eval target.

Where this matters in practice

A short list of applications where the hindsight corpus actually bites:

  • Forecasting evaluations. Apparent forecasting skill on resolved questions is partly recall. Strict temporal holdouts are non-negotiable; without them, comparison across models is noise.
  • Historical reasoning. Simulating what a reasonable observer would have thought at time T is brittle, because the model was trained on the post-T consensus about T, not the pre-T discourse.
  • Period-correct writing and analysis. Maintaining voice, vocabulary, and concept set appropriate to an earlier moment is harder than it looks; concept anachronism leaks through.
  • Pre-event state reconstruction. What did the field think on date D about phenomenon X, before event E occurred? This is exactly the question retrospective sources are worst at, and exactly what the model has read most of.
  • Anachronism-sensitive domains. Legal history, history of science, medical history, intellectual history — fields where retrojecting modern frames onto historical actors is a research error and the model does it by default.

For general-purpose use, most of this is invisible. For these applications the hindsight corpus silently corrupts the answer, in a way that doesn't look like an error because the answer is otherwise fluent and internally consistent.

Where I've landed

A pretraining distribution is the model's prior. The community has gotten relatively careful about its domain breakdown, language mix, and quality filtering. It has been much less careful about temporal shape, partly because the shape is hard to measure and partly because the consequences are subtle on aggregate benchmarks.

The reframe that's been useful for me: a frontier LLM doesn't know what was true at time T. It knows what people writing recently said was true at time T, weighted by how much they wrote and how much of it survived to the moment of corpus construction. For most uses, the distance between those two is small enough to ignore. For some uses — and they happen to include several of the applications people are most excited about — it's the whole problem.