Chapter 4 of 10

The Physics of Hallucination

Created May 8, 2026 · Updated May 8, 2026

Practical systems handle hallucination through retrieval augmentation, citation requirements, structured outputs with provenance, abstention training, claim-level verifier models, and other engineering layers. Those layers work, but they work around something — they constrain the system from outside rather than fix what is happening inside.

This note is about the inside. It is not a guide to mitigations; it is a mechanistic account of why a transformer trained as a next-token predictor systematically learns to express unsupported claims with the surface form of justified knowledge. The interesting question is not "why does the model output false tokens" but "why does the architecture not separate, at the level of its own computation, claims it has support for from claims it has merely templated."

Large language models do not have a primitive operation called "tell the truth." They have a different primitive: given a context, produce a probability distribution over the next token. Everything else (answering questions, citing sources, refusing unsafe requests, admitting uncertainty) is behavior learned on top of that primitive.

A hallucination is not just a random false sentence. It is an epistemic error: the model emits a claim in a form that pragmatically signals knowledge while the internal or external support for that claim is insufficient. The model behaves as if it knows. This note tries to answer, mechanistically, why.

The training objective: likelihood, not truth

Pretraining minimizes cross-entropy on the next token. The signal that propagates through gradients is "this token follows that context in the training distribution," not "this token is true in the world." Truth, when it appears, is a distributional regularity: in a corpus of mostly-correct text, the most likely continuation of "The capital of France is" is "Paris" because that pairing is overrepresented. The model has learned to put probability mass where the world's text puts it.
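To make the objective concrete, here is a minimal sketch of the per-token cross-entropy loss. The three-word vocabulary and the probabilities are made-up illustrative numbers, not measurements from any model, and `next_token_loss` is a hypothetical helper name.

```python
import math

def next_token_loss(predicted_probs, observed_token):
    """Cross-entropy at one position: -log p(observed token | context)."""
    return -math.log(predicted_probs[observed_token])

# Hypothetical model distribution for the context "The capital of France is".
predicted = {"Paris": 0.90, "Lyon": 0.07, "Berlin": 0.03}

# The loss depends only on which token the corpus actually contained.
loss_if_corpus_says_paris = next_token_loss(predicted, "Paris")
loss_if_corpus_says_lyon = next_token_loss(predicted, "Lyon")

print(f"corpus says Paris: {loss_if_corpus_says_paris:.3f}")
print(f"corpus says Lyon:  {loss_if_corpus_says_lyon:.3f}")
```

Note that the function takes no argument describing whether the observed token is true. If the corpus had said "Lyon", the gradient would push probability toward "Lyon" in exactly the same way.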

The objective gives the model no direct opportunity to learn the predicate "this claim is unsupported." Unsupported claims appear in the training corpus too (speculation, fiction, advertising, casual assertions whose hedges were stripped along the way), and they look textually like supported ones. The gradient cannot separate them without an additional signal. Pretraining therefore rewards distributional plausibility, while factuality requires epistemic grounding, and the two overlap only partially.

This is one reason hallucinations are not merely a downstream bug: the base objective does not directly optimize truthfulness as distinct from distributional plausibility. Later mechanisms — RLHF, instruction tuning, retrieval augmentation, process supervision, verifier models — can be understood as attempts to inject missing epistemic pressure into a system whose base objective did not require it.

Knowledge is not a database; it is distributed activation

A common simplification is to imagine a transformer as a compressed encyclopedia. It is closer to a system of distributed representations: facts emerge from patterns of activation across many parameters and many layers, not from records sitting in named slots.

When the model "knows" that Marie Curie was born in Warsaw, that knowledge is not stored as

entity     = "Marie Curie"
attribute  = "birthplace"
value      = "Warsaw"

It is stored as a learned correspondence between feature activations triggered by the entity Marie Curie and feature activations associated with the attribute birthplace, such that when both fire simultaneously in the right layers, the residual stream tilts toward the token Warsaw at the output projection. A useful framing from mechanistic interpretability views MLP layers as something like associative key-value memories: keys are activation patterns over the previous layer, and values are contributions added to the residual stream. Targeted-intervention work has localized specific factual associations to specific MLP modules cleanly enough that surgical edits at those sites can change individual recalled facts.
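The key-value framing can be sketched in a few lines. This is a toy under stated assumptions: the two input features, the single key with its bias, and the two-entry output space (a "Warsaw" logit and a "Paris" logit) are all hypothetical, and real MLP layers have thousands of overlapping keys.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mlp_memory(x, keys, biases, values):
    """Toy key-value view of an MLP layer.

    Each key matches a pattern in x; the ReLU'd match strength scales the
    value vector that gets added to the output (the residual update)."""
    out = [0.0] * len(values[0])
    for key, bias, value in zip(keys, biases, values):
        strength = max(0.0, dot(key, x) + bias)   # how strongly this key fires
        out = [o + strength * v for o, v in zip(out, value)]
    return out

# Hypothetical input features: [is_marie_curie, asks_birthplace].
keys = [[1.0, 1.0]]      # one key that needs both features...
biases = [-1.0]          # ...because the bias cancels a single feature alone
values = [[2.0, 0.0]]    # contribution to hypothetical [Warsaw, Paris] logits

full = mlp_memory([1.0, 1.0], keys, biases, values)      # entity strongly recognized
partial = mlp_memory([0.5, 1.0], keys, biases, values)   # entity weakly recognized
absent = mlp_memory([0.0, 1.0], keys, biases, values)    # entity not recognized

print(full, partial, absent)
```

The memory never reports "not found". A weakly recognized entity simply produces a smaller push in the same direction, with nothing attached that says how much to trust it.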

Distributed representation is what gives the model generalization: it can paraphrase, infer, transfer. But it also means there is no clean confidence flag attached to each fact. Some associations are strong; others are weak; some are entangled with semantically nearby facts; some exist only as topical regularities ("things people say about X"). The picture is further complicated by superposition: the model encodes more features than it has dimensions, so individual neurons are polysemantic and participate in many features at once.

Hallucination concentrates in the weak-and-entangled regime — not the regime of total ignorance, where the model has often learned an abstention pattern, but the regime of partial activation: enough signal to produce a plausible answer, not enough to produce the right one.

strong activation     →  correct answer
no activation         →  possible abstention
partial activation    →  plausible hallucination

The last row is the architectural attractor.

The residual stream as a competition arena

By the time the model generates its answer token, the residual stream at that position contains a high-dimensional sum of contributions from every prior layer. Compactly: user question, system instruction, retrieved context, conversation history, entity representations, learned task formats, style expectations, parametric associations, and previously-generated tokens — all written into the same vector through additive residual updates.

The output token is sampled from a distribution computed on this mixture. Crucially, the model does not first construct a symbolic proof and then verbalize it; it computes a high-dimensional state in which many influences compete, and reads out the most likely continuation. For a factual question, the residual stream at the answer position may carry signals like:

signal A: this is a question about a paper.
signal B: the answer should name a method.
signal C: the user expects a confident explanation.
signal D: a related method is salient in similar contexts.
signal E: no retrieved evidence actually supports the related method.

If the architecture and training have not made signal E strong enough — strong enough as a feature direction in the residual stream that suppresses related-method completions — the model writes the related method anyway. Hallucination, in this view, is not "the model invents things" but a failure of signal competition: features supporting plausible completion outweigh whatever features represent insufficient evidence.
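The competition described above can be caricatured as additive contributions to the logits of two candidate completions. Everything in this sketch is an assumption made for illustration: the signal names, the two candidates, and the numeric strengths are invented, and real features do not come with labels.

```python
# Candidate completions at the answer position (hypothetical).
CANDIDATES = ["related_method", "abstain"]

def signals(evidence_strength):
    """Per-signal logit contributions; only signal E's strength is varied."""
    return {
        "B_answer_names_method": {"related_method": 1.0},
        "C_user_expects_confidence": {"related_method": 0.5, "abstain": -0.5},
        "D_related_method_salient": {"related_method": 2.0},
        "E_no_supporting_evidence": {
            "related_method": -evidence_strength,
            "abstain": evidence_strength,
        },
    }

def answer_logits(contributions):
    """Sum the contributions, mirroring additive residual-stream updates."""
    totals = {c: 0.0 for c in CANDIDATES}
    for signal in contributions.values():
        for candidate, amount in signal.items():
            totals[candidate] += amount
    return totals

def winner(logits):
    return max(logits, key=logits.get)

weak = winner(answer_logits(signals(1.0)))     # signal E too weak to matter
strong = winner(answer_logits(signals(3.0)))   # signal E strong enough to win

print(weak, strong)   # related_method abstain
```

Nothing qualitative changes between the two runs. The same sum is computed, and the outcome flips only because one term grew.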

Whether such "insufficient-evidence" features exist as cleanly separable directions in the residual stream is itself an active research question. Mechanistic interpretability work has found well-localized directions for some properties (refusal, sentiment, and to a lesser extent sycophancy), but a robust, universal "supportedness" feature remains to be characterized. Feature competition should therefore be read as a framing for the overall shape of the mechanism; the per-feature claims are still being mapped.

MLPs activate templates; attention provides access

In a simplified two-component picture, attention moves information between token positions, and MLPs activate learned features at each position. Each part contributes a distinct failure mode for factuality.

MLP layers, viewed as key-value memories, encode features such as "biography paragraph", "API documentation answer", "academic citation", "company-founder relation", "capital-city relation", "policy-document phrasing". These features are useful: they are how the model generalizes across topics with similar shape. But a feature can be activated by the shape of a prompt even when the specific factual content is absent or weakly represented.

When asked "Which university released the dataset?", the prompt activates a learned "academic-institution answer" template. If the specific dataset's release institution is weakly stored or absent from training, the template still fires, and the most likely completion under that template — Stanford University, Carnegie Mellon, whichever is most frequent in the training distribution at that template's slot — wins. The output is wrong, but it is not random. It is structurally appropriate. The model knows the form of the answer better than the content.

Attention provides the complementary failure mode: it grants access without verification. In retrieval-augmented generation and long-context QA, the answer position can attend to relevant evidence in the prompt. But attention is a soft-weighted information transfer, not a truth check. Given:

Document A: Refunds are available within 14 days.
Document B: Premium members may request exchanges within 30 days.
Question:   What is the refund window?

the attention pattern can correctly locate Document A as the most relevant, but the answer may still be influenced by "30 days" — semantically related, syntactically nearby, more frequent in policy text. The architecture does not implement answer ⟺ entailed-by-attended-evidence. It implements weighted information mixing, and the mixing can pull toward features the attended evidence does not support.

Context access ≠ evidence use; evidence use ≠ faithful answer.

Hallucination can therefore happen even when the truth is present in context, because the model's internal computation does not implement a hard rule "answer only if claim is entailed by evidence" — it implements soft transformations over representations.
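A minimal single-query attention sketch shows why locating the right document is not the same as using only that document. The relevance scores are hypothetical, and collapsing each document's value vector to a single number of days is a deliberate caricature: real value vectors are high-dimensional and the model does not literally average numbers.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Single-query attention: softmax weights, then a weighted sum of values."""
    weights = softmax(scores)
    mixed = sum(w * v for w, v in zip(weights, values))
    return weights, mixed

# Hypothetical relevance scores for Document A (14 days) and Document B (30 days).
scores = [2.0, 1.0]
values = [14.0, 30.0]   # scalar stand-ins for each document's value vector

weights, mixed = attend(scores, values)
print([round(w, 2) for w in weights], round(mixed, 1))
```

Document A receives the larger weight, yet Document B's weight is never exactly zero, so its content still contributes to what the answer position reads.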

The missing abstention channel

The output layer projects the residual stream to logits over the vocabulary, and softmax converts logits to a probability distribution. This operation always produces a distribution: even when the model is uncertain, some token has the highest probability. The architecture contains no separate epistemic gate of the form

if support_for_answer < threshold:
    abstain()
else:
    answer()

Abstention exists only as a learned text behavior — "I do not have enough information to answer that" — expressed through the same channel as substantive answers. For abstention to win, the model must assign higher probability to that exact text behavior than to any plausible content completion. Whether it does depends on pretraining (which sees abstention rarely outside specific genres), instruction tuning, system prompts, and the helpfulness incentives baked in by RLHF.

This is the architectural shape of the problem. Uncertainty does not automatically convert into refusal; it converts into the highest-probability answer-shaped completion that the prompt activates. The softmax does not know that the question was hard. It just normalizes.
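The point about normalization is easy to check directly. In this sketch the four-entry vocabulary and both sets of logits are invented for illustration; the second set is nearly flat, standing in for a model with no real preference.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Stanford", "MIT", "Oxford", "I don't know"]   # hypothetical candidates

confident_logits = [6.0, 1.0, 0.5, 0.0]
uncertain_logits = [0.30, 0.20, 0.25, 0.10]   # nearly flat: no real preference

confident_probs = softmax(confident_logits)
uncertain_probs = softmax(uncertain_logits)

top_confident = vocab[confident_probs.index(max(confident_probs))]
top_uncertain = vocab[uncertain_probs.index(max(uncertain_probs))]

print(top_confident, round(max(confident_probs), 2))
print(top_uncertain, round(max(uncertain_probs), 2))
```

Both cases yield a valid distribution and the same top token. In the flat case the winner holds well under a third of the probability mass, but greedy decoding emits it all the same.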

This is the point where hallucination stops looking like a mysterious defect and starts looking like the natural consequence of the interface we gave the model.

A sharper formulation: the model always has a next-token distribution; the question is whether "I don't know" has been made competitive with plausible completion at the relevant prompt. Many hallucinations occur because plausible completion wins.

The activation–output gap: latent knowledge often exists without surfacing

Perhaps the most striking finding of recent mechanistic work is that the model often "knows" more than it says.

Probes trained on internal activations can predict the model's own correctness — P(I know) — substantially better than chance, and considerably better than the model's verbalized confidence. The activations carry an epistemic signal that the output distribution does not faithfully express.

A stronger version of the same finding: contrastive consistency search can extract a "this statement is true" feature from activations without any factuality labels at all, by finding directions that satisfy logical consistency over (statement, negation) pairs. The latent knowledge appears to be represented in activation geometry, but the standard output pathway does not reliably convert that state into calibrated verbal behavior.
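The label-free objective can be sketched for a single (statement, negation) pair. The two-term form below, a consistency term plus a confidence term, follows how this method is commonly described; the function name, the equal weighting of the two terms, and the example probabilities are assumptions of this sketch, and the probe that would produce the probabilities is omitted.

```python
def ccs_loss(p_statement, p_negation):
    """Unsupervised loss for one (statement, negation) pair.

    p_statement and p_negation are a probe's "is true" probabilities
    for the statement and for its negation."""
    # Consistency: the two probabilities should sum to one.
    consistency = (p_statement - (1.0 - p_negation)) ** 2
    # Confidence: penalize the degenerate solution of 0.5 for everything.
    confidence = min(p_statement, p_negation) ** 2
    return consistency + confidence

consistent_and_decisive = ccs_loss(0.95, 0.05)
degenerate_fence_sitting = ccs_loss(0.5, 0.5)
contradictory = ccs_loss(0.9, 0.9)

print(consistent_and_decisive, degenerate_fence_sitting, contradictory)
```

No truth labels appear anywhere in the loss. The only supervision is the logical constraint that a statement and its negation cannot both be true.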

This reframes hallucination significantly. The problem is not always that the model lacks the relevant signal. Sometimes the signal is present in activation space but does not propagate to the output distribution because the output layer was trained to optimize next-token likelihood, not to read out epistemic state. A measurement layer exists internally; an expression layer does not.

The activation–output gap is one of the most important conceptual updates the last few years of mechanistic interpretability have produced. It suggests hallucination is not always a knowledge problem; sometimes it is a routing problem — the model has the right state internally and writes the wrong thing externally.

Truthfulness as a mechanistic direction

Subsequent work has located behavioral axes as low-dimensional directions in the residual stream. The cleanest result is on refusal: in chat-tuned models, refusal behavior is well approximated by a single direction. Projecting activations onto the orthogonal complement of that direction reliably eliminates refusals, and adding the direction back reintroduces them. Similar but messier findings hold for honesty, sentiment, and other high-level properties: representation engineering, which adds scaled activation vectors at inference, can steer model behavior without retraining.
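Both operations are plain linear algebra. The sketch below uses a three-dimensional toy; the direction and the activation vector are hypothetical values chosen so the arithmetic is easy to follow, and `ablate` and `steer` are illustrative names rather than any library's API.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def ablate(activation, direction):
    """Project the activation onto the orthogonal complement of the direction."""
    r = unit(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]

def steer(activation, direction, alpha):
    """Add a scaled copy of the unit direction (activation steering)."""
    r = unit(direction)
    return [a + alpha * ri for a, ri in zip(activation, r)]

refusal_direction = [3.0, 4.0, 0.0]   # hypothetical direction
activation = [1.0, 2.0, 5.0]          # hypothetical residual-stream vector

ablated = ablate(activation, refusal_direction)
restored = steer(ablated, refusal_direction, alpha=2.0)

r_hat = unit(refusal_direction)
print(round(dot(ablated, r_hat), 6), round(dot(restored, r_hat), 6))
```

After ablation the component along the direction is zero while everything orthogonal to it is untouched, which is why the intervention can remove one behavior without scrambling the rest of the representation.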

Two implications matter for hallucination.

First, behaviorally relevant axes such as refusal, honesty, and helpfulness are at least approximately separable in the residual stream, meaning the model has learned them as roughly distinct features rather than as one tangled blob. Second, those axes can be moved at inference time. Both findings push toward a research program: if "supportedness" of a generated claim corresponds to a separable direction, can it be measured, monitored, or steered the way refusal can?

The negative results matter too. Honesty steering improves benchmarks but does not fully eliminate confident falsehoods, suggesting that truthfulness may not be a single direction in the way refusal appears to be. The geometry of "supported vs unsupported" is likely more entangled than "refuse vs comply" — supportedness depends on a (claim, world) pair rather than on the model's policy alone, and the relevant direction may vary by domain, by entity, by claim type. The picture is not "find the truth direction and add it"; it is "characterize a family of context-dependent axes and the conditions under which each is well-defined."

Linguistic confidence vs. epistemic confidence

The most useful conceptual distinction for thinking about hallucination is between two kinds of confidence the model expresses:

Linguistic confidence: how confidently the answer is phrased. "The correct answer is...", "This was introduced by...", "According to the paper..." — these are stylistic features of helpful expository text.
Epistemic confidence: how well the underlying claim is actually supported by the model's parametric knowledge or by attended-to evidence.

Transformers are very good at linguistic confidence. They have read enormous quantities of confident expository writing and have learned its surface form to fluent precision. They are markedly worse, and structurally so, at calibrated epistemic confidence — and the two are not the same axis. A confident-sounding sentence can be composed entirely of high-probability tokens while making a claim that has no support in the residual stream's evidence features.

Token-level probabilities do not resolve this on their own. A false claim can be composed of locally high-probability tokens — the model is confident about the style, syntax, and local semantic coherence of the answer while being wrong about the underlying claim. Hallucination is therefore frequently a calibration failure between these two axes: linguistic confidence saturates before epistemic confidence has caught up.

The model says "According to the 2018 paper, the proposed method is X" with the linguistic register of a paper summary, even when the evidence features for that specific (paper, method) pair are weak or absent. The form is calibrated to the genre. The content is not.

Hallucination as unsupported latent completion

A useful research-level abstraction is to treat generation as conditioned on latent variables the model implicitly infers from the prompt. For "What did the 2018 paper by X propose?", the inferred latents include something like:

domain        = machine learning
genre         = paper summary
expected type = method name
register      = concise factual explanation

If the specific (paper, method) association is missing or weakly represented, the model still produces a completion consistent with these latents — a method name, in concise factual register, in the ML domain. The type of answer is correct. The value at the slot is filled by the highest-probability continuation given the activated genre, which need have no relationship to the actual paper.

Hallucination as unsupported latent completion: the model completes the latent task pattern even when the specific grounding variable is absent.

This is why hallucinations are so often structurally appropriate rather than randomly wrong. The model is wrong inside the right genre, which is exactly what makes these errors hard to detect by surface inspection. They pass every shape-level test and fail only the content-level one, which is the test that requires external grounding to evaluate.

The model is not lying in the human sense. It is completing a pattern whose epistemic status was never made explicit enough.

Post-training and the helpfulness/honesty tradeoff

Pretraining sets the base distribution; post-training shapes the policy. Instruction tuning and RLHF teach the model to be helpful, follow instructions, refuse unsafe requests, and produce assistant-like answers. But the preference-data signal pushes in a specific direction: in a large fraction of comparisons, the helpful-sounding answer will be preferred to "I don't know," and the gradient learns from that.

This shows up directly as sycophancy: RLHF-trained models exhibit measurable agreement with users' stated beliefs even when those beliefs are wrong, at levels not present in base models. The mechanism is straightforward — in preference data, helpful agreement wins more comparisons than calibrated disagreement, and the optimization picks up on it. A locally rational policy ("when uncertain, provide the most plausible useful answer") becomes globally miscalibrated.

The same logic applies to abstention. Unless preference data systematically rewards "I don't know" over plausible hallucination on questions where the model lacks support — and existing pipelines mostly do not — RLHF-trained models drift toward confident answers as the high-reward strategy. A model can have the representational capacity to be uncertain while being behaviorally biased toward answering. The capacity exists in the residual stream (cf. the activation–output gap); the post-training policy fails to read it out.

An alternative training-time intervention with a different shape is process supervision: reward correct intermediate reasoning steps, not just correct final answers. This shifts what the gradient encourages — not only "produce the right token at the end" but "activate the right intermediate representations along the way." On math reasoning, process-supervised models are substantially more reliable than outcome-only-supervised ones. The architectural significance is that process supervision adds a denser, claim-level training signal where pretraining provided only token-level likelihood, and where RLHF provides only response-level preference. The deeper the supervision reaches into the network's intermediate computation, the more precise the truthfulness signal can become.

Reasoning models and test-time computation

Reasoning-mode models use additional test-time computation before emitting a final answer. Some expose a reasoning trace; others keep most of that computation hidden or summarized. From a hallucination perspective, this changes the picture in ways that are not yet fully understood.

The intuitive hope is that more test-time compute, structured as intermediate steps, gives the model space to ground claims before committing to a final answer — reducing hallucination by approximating symbolic reasoning. There is empirical support for this on factual and reasoning benchmarks: reasoning-mode models hallucinate less on tasks where the answer can be derived from explicit intermediate steps.

But the picture is more complex. Recent work on chain-of-thought faithfulness has shown that when a reasoning trace is exposed, it is often not the actual computational path that produced the final answer — the steps can be a post-hoc rationalization that does not constrain the answer-generating computation. A reasoning model can therefore hallucinate in two new ways relative to a non-reasoning model:

It can produce a confident-but-wrong reasoning trace that supports a wrong final answer (the trace looks step-by-step reasoned, but the steps are themselves unsupported claims dressed in deductive form).
It can produce a correct reasoning trace and emit a final answer that does not faithfully follow from it (the reasoning is genuine but disconnected from the answer-generating computation).

Both are observed. The architectural change is real but partial: extended test-time computation gives the model more compute and a longer self-conditioned context, and that does help, especially on tasks with verifiable steps. It does not introduce a hard "answer must be entailed by reasoning" rule. The same soft-competition dynamic that operated between evidence and answer in the non-reasoning case now operates between the intermediate computation and the final answer — and where the trace is hidden or summarized, even the audit path that visible chain-of-thought offers is not directly available.

A way to put this: reasoning mode shifts the location of underdetermination but does not eliminate it. The residual-stream competition still happens; it just happens with more tokens of computation in between, some of which we can see, and some of which we cannot.

An open question: mixture-of-experts and routing

Frontier models are increasingly mixture-of-experts: each token is routed to a small subset of expert sub-networks at each layer. From a hallucination standpoint, this introduces a structural concern that has not been cleanly mapped: when an input falls in a domain underrepresented in training, the router may dispatch the token to experts with weak parametric knowledge there, and the resulting output may be confident but uncalibrated. The argument is direct — routing adds a layer of conditional computation between input and parametric memory, optimized for load balancing and downstream loss rather than for recognizing "this input is out of any expert's training distribution" — but a specific MoE-induced hallucination pattern has not been isolated. Calibration of MoE models also cannot be assumed to inherit from dense-model calibration, since latent-knowledge probes have been validated mostly on dense architectures. This is an open area; for now the dense-model picture in the previous sections is the load-bearing one.
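The structural concern can be illustrated with a generic top-k router. This is not any specific model's routing code: the function name, the choice of k = 2, and both sets of router logits are assumptions of this sketch.

```python
def top_k_route(router_logits, k=2):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]

in_distribution = [4.0, 0.1, 3.5, -1.0]          # hypothetical: clear preference
out_of_distribution = [0.02, 0.01, 0.03, 0.00]   # hypothetical: nothing matches well

print(top_k_route(in_distribution))       # [0, 2]
print(top_k_route(out_of_distribution))   # [2, 0], chosen on tiny margins
```

Like the softmax at the output, top-k selection always returns k experts. A near-flat score vector is routed just as decisively as a sharply peaked one, and nothing downstream sees how narrow the margin was.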

Semantic entropy: extracting epistemic state at inference

One of the cleanest inference-time partial answers to the abstention-channel gap is semantic entropy. The method: sample multiple completions for the same prompt, cluster them by semantic equivalence (for example, with an NLI model checking whether two answers entail each other), and compute entropy over the semantic clusters rather than over surface tokens.

The key observation is that token-level entropy conflates two different sources of variability — surface paraphrasing that does not change meaning, and substantive disagreement across completions — and only the second is informative about epistemic state. High semantic entropy — many genuinely different answers across samples — is a strong signal of model uncertainty about the underlying claim. Low semantic entropy with high token-level entropy reflects mere stylistic variation and is uninformative.
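The procedure can be sketched end to end. Two simplifications are assumptions of this sketch: cluster probabilities are estimated from sample frequencies, and `same_city` is a hypothetical stand-in for the NLI equivalence check that only compares the final word of each answer. The sample answers are invented.

```python
import math

def semantic_entropy(samples, equivalent):
    """Cluster samples by meaning, then take entropy over cluster frequencies."""
    clusters = []                       # each cluster is a list of samples
    for s in samples:
        for cluster in clusters:
            if equivalent(s, cluster[0]):
                cluster.append(s)
                break
        else:                           # no existing cluster matched
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

def same_city(a, b):
    """Hypothetical stand-in for an NLI check: compare the final word."""
    def last_word(s):
        return s.rstrip(".").split()[-1].lower()
    return last_word(a) == last_word(b)

paraphrases = ["She was born in Warsaw.", "Born in Warsaw", "Her birthplace: Warsaw."]
disagreement = ["She was born in Warsaw.", "She was born in Paris.", "Born in Krakow."]

low = semantic_entropy(paraphrases, same_city)
high = semantic_entropy(disagreement, same_city)
print(low, high)
```

The three paraphrases differ at the token level but fall into one cluster, so their semantic entropy is zero. The three disagreeing answers form three clusters and reach the maximum for three samples, log 3.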

This is meaningful mechanistically because it shows the model's epistemic state can be extracted from its own behavior without retraining or internal probes, by exploiting the fact that genuine uncertainty about a claim shows up as semantic divergence across samples when the model is not committed to an answer. It reinforces the activation–output-gap picture: uncertainty information is in the system somewhere, and even without architectural changes, sampling-time procedures can recover part of it.

It does not solve hallucination: high semantic entropy correlates with error, but the two are not equivalent. A model can be confidently wrong (low semantic entropy on a wrong answer) when its parametric prior is highly skewed toward an incorrect completion. But it makes the abstention question concrete: there are inference-time signals that approximate epistemic state, and they exist whether or not the architecture has a first-class channel for it.

Underdetermination, not randomness

Sampling parameters — temperature, top-p, top-k — change how the model picks among plausible continuations. They do not change whether the most likely continuation is supported. A deterministic hallucination at T = 0 is still a hallucination. The deeper issue is underdetermination: the prompt and the model's parametric state do not always determine a unique justified answer, but the generation process produces a continuation regardless.

insufficient evidence + answer-shaped prompt + helpful-assistant prior
    →  plausible unsupported answer

This is the architectural attractor. It is not stochastic noise; it is the deterministic output of a system that is asked a question, has no formal way to express "the question has no determined answer given my state," and was trained to produce useful-looking text.

Lower temperature reduces variance in which unsupported answer comes out. It does not make the system any more likely to refuse. To do that, the architecture or training has to introduce a mechanism that treats unsupported-claim emission as a distinct error, and almost nothing in the standard pretraining + RLHF pipeline does.
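The effect of temperature can be checked on a toy distribution. The three candidate continuations and their logits are hypothetical; the only assumption that matters is that the abstention text is not already the top-scoring option.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical continuations and logits.
answers = ["unsupported answer A", "unsupported answer B", "I don't know"]
logits = [2.0, 1.5, 0.5]

warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.1)

print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

Cooling the distribution keeps the ranking intact and concentrates mass on the existing winner. The probability of the abstention text goes down, not up.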

This is also why "the model sounded sure" carries no evidential weight. Sounding sure is part of language generation; it is the linguistic-confidence axis, not the epistemic-confidence axis. The two axes are often correlated, but their correlation is not enforced anywhere in the architecture, and decorrelation between them is exactly the regime in which hallucinations live.

The shape of the open problem

A hallucination is the expected failure mode of a system with these properties:

Trained primarily to model text likelihood.
Stores knowledge in distributed, entangled representations under superposition.
Resolves generation through soft competition between features in the residual stream.
Has no architectural channel for epistemic abstention separate from token output.
Post-trained to be helpful, often against signals favoring abstention.
Evaluated in settings where guessing is rewarded more often than calibrated uncertainty.

Inside such a system, fluent unsupported claims are not surprising. They are an attractor.

The research direction implied by this account is not "stop hallucinations" — that framing presupposes a problem with a cleanly-defined fix. It is "make epistemic status a first-class variable in the model's computation." Concretely, several lines of active work follow:

Latent-knowledge probes wired into output decisions. If internal probes can predict knowing-ness, they can in principle gate or modulate generation rather than living as standalone diagnostics. Selective-prediction frameworks attempt this; the open question is whether the probe signal survives end-to-end training without being absorbed into the policy it is meant to constrain.
Training signals at the claim level rather than the token level. Process supervision generalizes the principle. The question is how to scale claim-level factuality signals beyond verifiable domains such as math, into domains where ground truth for intermediate steps is hard to obtain.
Representation-level interventions. Activation steering for honesty and the refusal-direction line of findings treat behavioral axes as mechanistic directions and manipulate them at inference. The supportedness axis is the natural target — but, as noted, it may not be a single direction. Characterizing the geometry of supportedness is itself a research problem.
Inference-time uncertainty extraction. Semantic entropy and related sampling-based methods exploit the fact that uncertainty leaks into behavior even when the architecture has no abstention channel. These are post-hoc; they do not change the model, only what we can read from it.
Architectural changes that introduce explicit epistemic state. Separate confidence heads, learned <UNCERTAIN> tokens, retrieval-conditioned generation that abstains when retrieval scores are low, gated generation conditioned on internal calibration probes. These remain speculative but follow the same logic — give the system a first-class place to put "I don't know" that is not just another text behavior competing with plausible completions on the same softmax.
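The simplest version of the gating idea in the list above is a selective-prediction wrapper around generation. This is a sketch under stated assumptions: `gated_answer`, the 0.7 threshold, and the example scores are hypothetical, and the hard part, obtaining a trustworthy `support_score` from a probe or a sampling-based estimate, is left outside the function.

```python
def gated_answer(candidate_answer, support_score, threshold=0.7):
    """Emit the answer only if an external support estimate clears a threshold.

    support_score is assumed to come from a calibration probe or a
    sampling-based uncertainty estimate; how to compute it is not shown."""
    if support_score < threshold:
        return "I don't know"
    return candidate_answer

print(gated_answer("Warsaw", support_score=0.9))
print(gated_answer("Stanford University", support_score=0.3))
```

The design choice worth noticing is that abstention here is decided outside the softmax. It does not have to outscore a plausible completion token by token; it only has to be triggered by a separate signal.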

The unifying observation: until something in the architecture or training makes "this claim is unsupported" a distinct, gradient-touchable signal — separable in the residual stream, supervised during training, expressible at the output without competing against helpful-completion priors — hallucination remains the shadow side of the same capability that makes transformers powerful: the ability to continue almost any pattern fluently.