Chapter 8 of 10
Fine-Tuning LLMs: When the Weight Delta Is Worth It
Created May 30, 2026
Fine-tuning is the moment you stop steering a model from the outside and start changing the function itself. You take a checkpoint θ_base, expose it to examples, preferences, or rewards, backpropagate a signal through the transformer, and write a reusable parameter delta Δθ. After that, the same prompt can produce different logits because the computation underneath it is different.
This note opens that box. We will look at how a target token produces a gradient, how many correlated gradients become a default, how LoRA constrains where the delta can live, why fine-tuning can make a format or reasoning mode stable, and why the same mechanism can cause forgetting. The decision question still matters — a weight delta is not free — but first you should see what the delta actually does.
Mechanistically, fine-tuning is not "the model learns an example." One example nudges a probability surface. A dataset creates many nudges. If the nudges point in related directions, the optimizer writes a new slope into weight space: future contexts in the same region now flow toward different logits. That is the thing prompts cannot do persistently and the thing fine-tuning is for.
The pipeline shape — CPT, SFT, preference optimization, RLVR, and distillation — is covered in the companion note on post-training. The implementation knobs — DPO variants, LoRA/QLoRA, memory-efficient full fine-tuning, merging, and tooling — live in the modern fine-tuning deep dive. This note is the upstream decision layer: when is a parameter update justified at all?
What a trained LLM is, mechanistically
Strip an LLM down to the part that matters here, and it is a fixed function from token sequences to next-token distributions.
The residual stream is the central object. It is the per-position vector that flows up through the layers. Every attention head reads from it (forming queries, keys, values), computes its operation, and writes its output back. Every MLP layer reads from it, runs a feed-forward computation, and writes its output back. By the time you reach the final layer, the residual stream at the last position contains the model's "summary" of what should come next; one linear projection and a softmax later, you have the next-token distribution.
Two things live in this picture:
- Weights — the parameters of the attention projections (Q, K, V, output) and of the MLPs. These are fixed once training stops. They define what computation each layer performs on whatever happens to be in the residual stream at that point.
- Activations — the actual vectors flowing through the residual stream on a given forward pass. These depend on the input and are recomputed every time. They are temporary.
This is the crucial split. Anything that depends on the input lives in activations. Anything that is the same across all inputs lives in weights. When you prompt or do RAG, you change the input. That changes activations. It does not change weights. So the kinds of behaviors you can reach by prompting are exactly the kinds that can be expressed as different activation patterns under the same fixed computation. If the durable behavior requires a different computation — different circuits, different feature directions, different output distributions at decode time — the missing ingredient is a parameter update, not a longer prompt.
This is the minimum backdrop. Fine-tuning matters because it changes the fixed computation, not just the vectors flowing through it on one request.
What fine-tuning changes, in one update step
The least mystical way to describe fine-tuning is: start from an existing checkpoint θ_base, compute a loss on examples of the behavior you want, and write a parameter delta Δθ so that some logits become larger and others become smaller in similar future contexts.
On a forward pass, the model maps a context to hidden states, projects the last hidden state to vocabulary logits, and softmaxes those logits into next-token probabilities:
h_t = f_θ(x, y_<t)
z_t = W_U h_t
p_θ(y_t | x, y_<t) = softmax(z_t)
During inference, this is where the story stops: you sample or choose a token from p_θ. During fine-tuning, the output is compared to the fine-tuning signal, a loss is computed, and backpropagation pushes a gradient through the unembedding, MLPs, attention projections, layer norms, and any trainable adapter weights:
L(θ) → ∂L/∂θ → optimizer update → θ_finetuned = θ_base + Δθ
For supervised fine-tuning, the loss is usually cross-entropy on the target completion:
L_SFT(θ) = -Σ_t log p_θ(y_t | x, y_<t)
At the logit level, cross-entropy has a simple shape: for each supervised position, the gradient is p - one_hot(target). If the model assigns too little probability to the target token, the update raises that token's logit in similar contexts and lowers competing logits. Backprop does not create a single "JSON neuron" or "medical fact slot"; it distributes a parameter delta across the matrices that contributed to those logits. But when many fine-tuning examples share the same structure, their gradients point in correlated directions. A fine-tune full of terse answers repeatedly rewards terse continuations. A fine-tune full of tool calls repeatedly rewards the tool-call grammar and the tokens that maintain it. A fine-tune full of refusals repeatedly rewards refusal-shaped continuations. That is how a behavior becomes a default: not because the model memorized one example, but because the accumulated Δθ changed the probability geometry for a whole region of similar contexts.
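The p - one_hot(target) shape is easy to check numerically. The following is a minimal sketch, assuming a made-up four-token vocabulary and a gradient step applied directly to the logits; real fine-tuning steps on θ, and the logits move only as a consequence.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Hypothetical 4-token vocabulary; token 2 is the supervised target.
logits = [2.0, 1.0, 0.5, 0.0]
target = 2

grad = cross_entropy_grad(logits, target)

# Illustrative step taken directly on the logits (an assumption of this sketch).
lr = 1.0
updated = [z - lr * g for z, g in zip(logits, grad)]

p_before = softmax(logits)[target]
p_after = softmax(updated)[target]
print(f"p(target) before: {p_before:.3f}, after: {p_after:.3f}")
```

The target entry of the gradient is negative and every other entry is positive, so a descent step raises the target logit and lowers each competitor, and the entries sum to zero.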
Other fine-tuning objectives change the same base model through different supervision signals:
- Preference optimization does not provide a target token sequence to imitate. It compares two completions and adjusts sequence-level log-probabilities so the preferred completion gains probability relative to the rejected one, usually against a frozen reference model.
- RL from verifiable rewards samples completions from the current policy, scores whole trajectories with a verifier, and increases the probability of high-advantage trajectories. Unlike SFT, it can reinforce strategies that were not present as fixed target completions, as long as the model samples them sometimes.
- LoRA restricts where the parameter delta can live. Instead of updating a full weight matrix W, it trains a low-rank update ΔW = BA, so inference uses W_eff = W + BA. The base computation is still changed, but only through the low-rank subspace and only in the modules where adapters were attached.
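To make W_eff = W + BA concrete, here is a small sketch in plain Python. The matrix values and the rank-1 adapter are made up for illustration; the point is only that the base W is never edited and the delta is the product of two thin matrices.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Frozen base weight W (d_out x d_in); values are illustrative.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Rank-1 adapter: B is d_out x r, A is r x d_in, with r = 1.
B = [[0.5], [0.0], [-0.5]]
A = [[1.0, 2.0, 0.0]]

delta_W = matmul(B, A)   # low-rank update, same shape as W
W_eff = add(W, delta_W)  # what inference effectively uses

for row in W_eff:
    print(row)
```

Here the full matrix has 9 entries while the adapter stores 6; the gap widens quickly as the matrix grows and the rank stays small.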
This is the deep-learning version of "changing the weights": fine-tuning changes the map from contexts to logits by applying an objective-driven delta to an existing model. The next sections slow down what that delta means internally.
There is one more architectural detail hidden inside the phrase "update the weights": which weights are allowed to move? This is not an implementation footnote. It changes what the fine-tune can express, how much memory training needs, how reversible the update is, and how much damage it can do to the base model.
In a full fine-tune, gradients flow into the original checkpoint: embeddings, attention projections, MLP projections, layer norms, and the output head. The model has maximum freedom to change, which is why full FT can handle broad behavioral shifts — and why it can forget more aggressively.
In LoRA / QLoRA, the original base weights are frozen. The model still computes with them, but the trainable delta lives in small low-rank adapter matrices attached to selected projections. If you attach LoRA only to q_proj and v_proj, the trainable update can change how attention queries and values are formed. If you attach it to all attention projections, it can also change keys and output mixing. If you attach it to MLP projections too, the adapter can influence more of the feature transformation inside each block. QLoRA has the same trainable map as LoRA; what it changes is how the frozen base is stored: the base weights are quantized (typically to 4-bit) while the adapters train in higher precision.
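The effect of the attachment choice on the size of the trainable surface can be estimated with simple arithmetic. This sketch assumes hypothetical dimensions (d_model, d_ff, layer count, rank) and a simplified block with four square attention projections and two MLP projections; names other than q_proj and v_proj are illustrative, and real architectures differ in detail.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for one adapter pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical model dimensions, chosen only for illustration.
d_model, d_ff, n_layers, rank = 4096, 11008, 32, 8

# (d_in, d_out) for each projection in one simplified transformer block.
shapes = {
    "q_proj": (d_model, d_model),
    "k_proj": (d_model, d_model),
    "v_proj": (d_model, d_model),
    "o_proj": (d_model, d_model),
    "up_proj": (d_model, d_ff),
    "down_proj": (d_ff, d_model),
}

def trainable(targets):
    """Total adapter parameters when LoRA is attached to `targets` in every layer."""
    per_layer = sum(lora_params(*shapes[name], rank) for name in targets)
    return n_layers * per_layer

qv_only = trainable(["q_proj", "v_proj"])
all_attention = trainable(["q_proj", "k_proj", "v_proj", "o_proj"])
attention_and_mlp = trainable(list(shapes))

print(f"q/v only:        {qv_only:,}")
print(f"all attention:   {all_attention:,}")
print(f"attention + MLP: {attention_and_mlp:,}")
```

Under these assumptions every configuration stays in the millions of trainable parameters, against billions in the frozen projections they sit on.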
In head-only tuning, the transformer body is frozen and only the final head moves. That can be enough for classification or a narrow output remapping, but it cannot deeply change the model's internal computation.
[Interactive figure: map of trainable parameters per tuning mode. Frozen modules still participate in the forward pass but receive no optimizer updates; gradients are written only to the trainable modules or adapter matrices. The LoRA rank sets the size of the adapter matrices, and d_model, vocabulary size, and layer count set the per-layer parameter counts.]
The delta is not a patch; it is a distributed slope
It is tempting to imagine fine-tuning as storing little rules: "answer in JSON", "use the house style", "do not over-refuse", "call the tool like this." That is not what the optimizer writes.
For a supervised token, the immediate logit gradient is simple:
∂L / ∂z = p - one_hot(target)
If the target token is } and the model put too much probability on prose, the local pressure is: raise the } logit, lower the competitors. But the } logit came from a hidden state, and that hidden state came from attention heads, MLPs, layer norms, embeddings, and all the residual-stream writes below it. Backprop pushes the error signal through all of those contributors. The update is distributed because the computation was distributed.
That is why fine-tuning does not create a "JSON neuron" in the cartoon sense. It changes many matrices a little: some attention projections learn to attend to structural cues, some MLP directions become more format-supporting, the final hidden states that used to lean toward prose now lean toward syntax-preserving tokens. The behavior is local in data space but distributed in parameter space.
The important thing is correlation. One JSON example barely matters. Ten thousand examples that all reward schema-valid continuations create gradients with shared structure. Those gradients add. After training, the model does not need to "remember" a specific example; the whole region of similar contexts has a different slope toward the target logits.
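A toy simulation shows why correlation is what matters. It assumes each per-example gradient is a small shared component plus larger example-specific noise, which is a deliberate simplification rather than a claim about real gradient statistics.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

dim = 8
n_examples = 1000

# Shared structure: every example pushes along the first coordinate.
shared = [1.0] + [0.0] * (dim - 1)

total = [0.0] * dim
for _ in range(n_examples):
    # Example-specific component: independent noise in every coordinate.
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    grad = [s + e for s, e in zip(shared, noise)]
    total = [t + g for t, g in zip(total, grad)]

shared_part = total[0]          # grows roughly linearly with n_examples
other_part = norm(total[1:])    # grows roughly with sqrt(n_examples)

print(f"accumulated along shared direction: {shared_part:.1f}")
print(f"accumulated in all other directions: {other_part:.1f}")
```

In any single example the noise is larger than the shared signal, yet after a thousand examples the accumulated update is dominated by the shared direction.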
How a behavior becomes a default
A default is just a high-probability trajectory. At every generation step, the model is choosing among continuations: stay in JSON or drift into prose, keep reasoning or answer now, call the tool or explain why it might call the tool, answer directly or hedge.
Fine-tuning changes those per-token competitions. If SFT repeatedly shows terse answers, the model learns that after this class of instruction, terse continuations are the high-probability path. If preference optimization repeatedly prefers concise correct answers over verbose vague ones, the sequence-level probability of the concise answer rises relative to the verbose one. If RLVR repeatedly rewards solutions that keep working through a math problem until the verifier passes, the "continue reasoning" trajectory gets reinforced.
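For the preference case, one common objective in this family is the DPO loss. The sketch below assumes that form, with made-up sequence log-probabilities and an illustrative beta; it is not meant to describe any particular training recipe.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    Arguments are sequence-level log-probabilities under the trainable policy
    and under a frozen reference model. beta scales the implicit reward.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Before training the policy equals the reference, so the margin is zero.
loss_before = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After some updates the preferred completion gained log-probability
# and the rejected one lost some, relative to the reference.
loss_after = dpo_loss(-9.0, -13.0, -10.0, -12.0)

print(f"loss before: {loss_before:.4f}")
print(f"loss after:  {loss_after:.4f}")
```

The loss depends only on how far the policy has moved the pair apart relative to the reference, which is the sense in which the preferred completion gains probability relative to the rejected one.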
This is the inside view of "style", "format", "policy", and "reasoning mode." They are not separate modules. They are probability ridges through token space. Fine-tuning deepens some ridges and shallows others.
This is also why bad data is so dangerous. If every correct answer in your dataset is long, length becomes part of the ridge. If every refusal is legalistic, legalism becomes part of the refusal policy. If every tool call appears after a verbose preamble, the preamble becomes part of tool use. The optimizer cannot tell "real behavior" from "dataset accident" unless the data or evals tell it.
Why fine-tuning forgets
Forgetting is not a separate monster from learning. It is the same mechanism viewed from another task.
Every update improves loss on the distribution in the batch. If the batch contains only legal QA, the optimizer sees gradients that improve legal-QA continuations. It does not see gradients preserving coding ability, casual conversation, math, or multilingual behavior unless those are in the batch, protected by a reference/KL term, or preserved by the limited capacity of an adapter.
Transformer features are reused. The same MLP direction or attention head can participate in many behaviors. Move it to improve one narrow distribution, and you may move it away from a configuration that supported another. The model did not "erase" coding in a literal database sense. It changed logits: coding continuations became less competitive in contexts where they used to win.
That is why replay data works. It puts old behaviors back into the objective. That is why LoRA often forgets less. It restricts the update, leaving most of the base weights untouched. That is why full fine-tuning can produce more capability transfer and more damage: it gives the optimizer access to more degrees of freedom.
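The interference argument can be caricatured with a single shared parameter. This is a deliberately tiny sketch, assuming two quadratic losses that want different values of the same weight; it illustrates the trade-off replay introduces, not the behavior of a real model.

```python
def loss_new(w):
    """Loss on the fine-tuning task, which prefers w = 2."""
    return (w - 2.0) ** 2

def loss_old(w):
    """Loss on an old behavior that reuses the same weight and prefers w = 0."""
    return (w - 0.0) ** 2

def train(w, replay_fraction, steps=200, lr=0.05):
    """Gradient descent on a mix of new-task and replayed old-task gradients."""
    for _ in range(steps):
        grad_new = 2.0 * (w - 2.0)
        grad_old = 2.0 * (w - 0.0)
        grad = (1.0 - replay_fraction) * grad_new + replay_fraction * grad_old
        w -= lr * grad
    return w

w_start = 0.0  # the base checkpoint is perfect on the old behavior
w_no_replay = train(w_start, replay_fraction=0.0)
w_replay = train(w_start, replay_fraction=0.3)

for name, w in [("no replay", w_no_replay), ("30% replay", w_replay)]:
    print(f"{name}: w={w:.3f} new-task loss={loss_new(w):.3f} "
          f"old-task loss={loss_old(w):.3f}")
```

Without replay the optimizer never sees a gradient from the old behavior and moves the shared weight all the way to the new optimum. With replay it settles in between: less forgetting, at the cost of some target performance.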
So a fine-tune has two products: the behavior you wanted, and the interference pattern created by how you got it.
Superposition and the feature budget
There is a deeper constraint underneath everything above, best seen through one of the most useful lenses in LLM interpretability: superposition. Treat what follows as a model that fits the evidence well and predicts the right things here, not as settled ground truth — the mechanics are still active research.
A transformer layer has a finite hidden dimension — a few thousand in a mid-size model, at most low tens of thousands even in the largest. But empirically, the number of distinct features the model represents at that layer is far larger than the dimension. The way it pulls this off is by packing features into the activation space in overlapping, non-orthogonal directions: each feature corresponds to a direction in activation space, and many features share dimensions because they tend not to co-occur in real text. When a feature fires, it activates its direction; the model can usually recover what is on by looking at which directions have non-negligible projections.
This is superposition, and it has a hard implication: there is a feature budget. The model can represent more features than dimensions, but not unboundedly more. The features that get represented are the ones that pretraining made it useful to represent. Features that were rare or absent in the training distribution may simply not have directions in the model's activation space at all.
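The smallest possible picture of this is three features packed into two dimensions. The sketch assumes unit-length feature directions spaced 120 degrees apart and a plain dot-product readout; real models are far higher-dimensional and their readouts are learned.

```python
import math

def unit(angle_deg):
    """Unit vector in 2D at the given angle."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Three features share a two-dimensional activation space.
features = [unit(0), unit(120), unit(240)]

def encode(active):
    """Activation vector when the listed features fire with strength 1."""
    return (sum(features[i][0] for i in active),
            sum(features[i][1] for i in active))

def readout(x):
    """Projection of the activation onto every feature direction."""
    return [dot(x, f) for f in features]

one_active = readout(encode([0]))
two_active = readout(encode([0, 1]))

print("feature 0 alone:  ", [round(v, 3) for v in one_active])
print("features 0 and 1: ", [round(v, 3) for v in two_active])
```

When one feature fires, its projection is clearly the largest. When two fire together, each reads at half strength because the directions interfere, which is why this packing only pays off for features that rarely co-occur.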
The consequence for our distinction, on this view: a feature the model never allocated is not something a prompt can summon on demand. RAG can supply tokens that would activate the feature in a model that had it, but in a model that does not, those tokens get processed through whatever overlapping nearby features the model does have. The result is generic, fuzzy, or wrong.
This is the mechanistic version of "the model doesn't really know this domain." It does not just mean the model has not memorized the facts. It means the model has not allocated strong, clean activation directions for the relevant concepts. The MLPs are not committing capacity to processing them. The attention heads are not specialized to bind them. Pretraining decided what features to allocate, and that decision is encoded in the weights.
The way to make new or underdeveloped features stable, reusable parts of the model's computation is to update the checkpoint. Continued pretraining does this for domain substrate. SFT does it more narrowly for behavioral concepts. RL does it for capabilities that need to be reinforced as reachable trajectories. Prompting can activate, combine, and temporarily bind existing representations; it does not run the fine-tuning updates that would make those representations easier to reach next time.
There are two practical signatures of "this is a feature-budget problem":
- The model can talk about the domain at a surface level but cannot reason precisely or produce specialist-quality output, however much you put in the prompt or retrieve.
- The model conflates concepts that an expert would distinguish — using terms approximately interchangeably even when handed precise definitions. This is the symptom of features being collapsed into one shared direction in the model's representation.
When you see these signatures, you are most likely looking at a weight problem, and no prompting recipe reliably fixes it.
Format collapse: a probability-shape problem
Here is a concrete, common failure mode where the difference between inference-time steering and training-time updates matters. You ask the model to output JSON. It outputs JSON for the first 100 tokens. By token 500 it has added a trailing comment, mismatched a quote, or drifted into prose. Why?
At every token, the model produces a distribution over the vocabulary. Even in "JSON mode" — features activated by the prompt — the probability of breaking format at any given token is non-zero. Call it ε. Over n tokens, the probability of staying in format is roughly (1 − ε)^n. With ε = 0.001, a 500-token output stays clean with probability around 60%. A 2000-token output stays clean with probability under 15%.
This is the kind of thing that looks like a "small prompt bug" and is actually an exponential process. A 0.1% per-token chance of drifting is invisible in a 20-token demo and fatal in a 2000-token extraction job. Production failures often live in that gap between the demo length and the real length.
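The compounding is quick to reproduce. The sketch below assumes, as the text does, a constant and independent per-token break probability ε, which is a simplification; it also inverts the formula to show how small ε would need to be for a long output.

```python
def p_clean(eps, n_tokens):
    """Probability that an n-token output never breaks format,
    assuming an independent per-token break probability eps."""
    return (1.0 - eps) ** n_tokens

def eps_needed(target_clean, n_tokens):
    """Per-token break probability required to hit a target clean rate."""
    return 1.0 - target_clean ** (1.0 / n_tokens)

eps = 0.001
for n in (20, 100, 500, 2000):
    print(f"n={n:5d}  P(clean)={p_clean(eps, n):.3f}")

required = eps_needed(0.99, 2000)
print(f"eps needed for 99% clean at 2000 tokens: {required:.2e}")
```

Under this assumption, getting a 2000-token output to stay clean 99% of the time requires ε around five in a million, roughly two hundred times lower than the 0.1% that looked harmless in the demo.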
The prompt's job here is to make ε small. It does this by activating format-following features. But the magnitude of ε is determined by the weights: it depends on how strongly the model's output distribution at each position concentrates on format-consistent tokens, and that concentration is a property of the final linear projection plus everything in the residual stream contributing to it. The prompt nudges the residual stream; it does not change the projection.
SFT on JSON outputs, by contrast, directly modifies the weights so that format-consistent tokens are higher probability at every position. The optimization target is exactly to push ε down across the whole sequence distribution, not just at the start. After SFT, "stay in JSON" stops being an instruction to follow and becomes the model's default trajectory.
This is what we mean by a probability-shape problem. The desired behavior is not a single output token but a property of the entire output distribution over many tokens. Prompting can shift the distribution for the current context; constrained decoding can mask invalid continuations; fine-tuning changes the distribution the model naturally assigns before any mask is applied.
One caveat, because it is what production systems actually do: constrained decoding. If you only need the output to parse, you do not have to touch the weights. Grammar- or schema-constrained decoding — JSON-schema mode, GBNF grammars, structured-output APIs — masks the sampler at each step so that only tokens allowed by the grammar can be emitted: valid JSON, guaranteed, with no training. But this clips the distribution rather than reshaping it. The model's pull toward drifting out of format is still there; the mask only forbids acting on it. That has limits. It works only where the target is a formal grammar — it can force valid JSON, but not a style, a reasoning policy, or any soft convention a grammar cannot express. And when the model's distribution and the grammar disagree, syntax wins but content can suffer: forced into tokens it scored as low-probability, the model may emit empty fields, truncate early, or repeat. Fine-tuning is the complementary move — instead of masking a distribution that wants to drift, it changes the distribution so the format is the model's default. In practice the two compose: constrained decoding to guarantee the brackets parse, fine-tuning so the model is not fighting the mask the whole way down.
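The clipping-versus-reshaping distinction shows up in a single decoding step. This sketch assumes a made-up four-token vocabulary and a hand-written allowed set standing in for a grammar; a real constrained decoder derives the allowed set from the parser state at every step.

```python
import math

def softmax(logits):
    """Numerically stable softmax; -inf logits get probability 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_probs(logits, allowed):
    """Mask every token outside `allowed`, then renormalize."""
    masked = [z if i in allowed else float("-inf") for i, z in enumerate(logits)]
    return softmax(masked)

# Hypothetical vocabulary and logits: the model currently prefers drifting into prose.
vocab = ['"', "}", ",", "Note"]
logits = [1.0, 0.5, 0.0, 2.0]
allowed = {0, 1, 2}  # stand-in for "tokens the grammar permits here"

free = softmax(logits)
constrained = constrained_probs(logits, allowed)

for tok, p_free, p_con in zip(vocab, free, constrained):
    print(f"{tok!r:8} free={p_free:.3f} constrained={p_con:.3f}")
```

The mask guarantees a grammatical token, but the unmasked distribution still puts most of its mass on the forbidden one. That leftover pull is what fine-tuning removes and what the mask only hides.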
Format collapse generalizes to a family of related problems: persona drift (the model's voice changes over a long generation because per-token deviation accumulates), style decay (the prompted style fades as more model-generated tokens enter the context), mode collapse in reasoning (the model stops thinking step-by-step and jumps to answers as the prompt's "think carefully" signal weakens with distance). All of these are probability-shape problems. All of them are weight-gap problems by construction.
Reasoning as a trained policy
The single most striking demonstration that some behaviors live in the weights and nowhere else is the reasoning-model wave that started with o1 and DeepSeek-R1.
A base model — even a very strong one — can produce chain-of-thought reasoning if prompted to. But it does so unreliably. Sometimes it thinks for ten steps; sometimes it jumps to an answer; sometimes it starts reasoning and then abandons the reasoning halfway. The prompt "think step by step" works as a nudge but not as a guarantee.
This is because deciding whether to keep reasoning or to answer now is a per-token decision. It is a property of the model's output distribution at every step: at each token, the model is weighing the probability of continuing the reasoning trace vs the probability of emitting an answer. In a base model, that weighing is governed by pretraining priors — and pretraining text contains far more "answer immediately" patterns than "reason at length" patterns, because long explicit reasoning traces are rare in natural text.
The R1 team's experiments with R1-Zero made this concrete in an instructive way. They trained a base model with reinforcement learning on verifiable math/code rewards, with no SFT phase first. The model learned to solve problems: its reward went up. But its outputs were hard to read. The reported problems were poor readability and language mixing within a single trace. RL had found a policy that raised reward, but nothing in the reward asked for that policy to be legible to a human.
The fix was cold-start SFT: a small amount of supervised fine-tuning on high-quality long reasoning traces before the RL phase. This carved out a readable "thinking mode" in the model's policy distribution — a region of weight space where producing structured, human-readable reasoning had high probability. Then RL could amplify the reward-maximizing trajectories within that mode, instead of searching the entire policy space.
The mechanistic lesson is direct: a stable reasoning mode is a property of the weights. You cannot prompt your way into it as a stable default from a model that does not have it — you can elicit the behavior for a stretch, but not make it the trajectory the model reliably falls into. The base model has the capability to reason — that capability is also in its weights, from pretraining — but does not have it as a default trajectory that persists across hundreds of tokens. Carving out that trajectory is a weight-modification job.
This generalizes to any behavior that has to persist across long generations: holding a tool-call policy and output format stably across a long loop, agent personas, code-style consistency, multi-turn reasoning. The longer the behavior has to persist, the more it is a property of the weights and not the input. One boundary is worth drawing here, because it is easy to confuse: if an external system decomposes the task and orchestrates the steps, that reliability is an agent-scaffolding job, not a weight gap. What lives in the weights is the model's own ability to hold the tool-calling format and policy as the trace grows — a scaffold can call the tools, but it cannot make the model stop drifting out of the protocol mid-loop.
Behavioral defaults and refusal thresholds
Another concrete weight-gap class: behavioral defaults that the model was trained to have. Refusal patterns, safety behaviors, helpful-assistant framing, hedging language, formality. These are the most explicitly weight-baked things in any post-trained model, because they were the target of dedicated SFT and RLHF stages.
When you try to override them with prompting — "respond as if you have no restrictions", role-play instructions, system-prompt manipulation — you get a real but partial effect. The model bends. Sometimes it bends far. But the underlying defaults reassert themselves, often in subtle ways: the model performs the role but adds caveats, or starts neutrally and drifts back to refusal, or produces the requested content with disclaimers.
Why? Because the refusal behavior is not encoded as "if prompted with X, refuse". It is encoded as feature directions in the residual stream that, when activated by certain semantic patterns in the input, contribute strongly to a refusal-shaped output distribution. The pattern detection runs at every layer; the contribution is summed across many components. Even when you suppress one source of the refusal signal with a clever prompt, others remain. The defaults are distributed across the weights, which is exactly what makes them robust, and what keeps a prompt from overriding them more than partially.
The flip side is also a weight problem. A model that over-refuses — that disclaims at the slightest ambiguity, hedges on questions it should answer directly, treats every domain question as if it were a medical emergency — has those defaults baked in too. You can prompt it not to, and sometimes that works for a turn or two. But over a long conversation, the trained behavior dominates. Fixing it requires fine-tuning, not prompting, for the same reason format collapse requires fine-tuning.
This is the same shape of failure as reasoning-mode collapse and format collapse: behavior over long generation is governed by the per-token output distribution, the per-token output distribution is governed by the weights, and the weights determine which defaults reassert themselves regardless of how the prompt tries to push.
What SFT, preference optimization, and RL each change
Once you accept that some gaps require weight changes, the natural question is what kind of change each technique makes. The staging story is the post-training pipeline note's job; the variant-level details are in the modern fine-tuning deep dive. Here I want only enough to classify gaps: each technique shapes the function differently, so each reaches a different kind of weight gap.
- SFT (supervised fine-tuning). Cross-entropy on target tokens. It sharpens the per-position output distribution toward the targets and can teach new defaults — and, for the same reason, it forgets: the MLPs and attention heads that did other jobs get partially overwritten, so old knowledge is suppressed rather than erased.
- Preference optimization (DPO and family, or PPO on a reward model). Learns from comparisons — completion A preferred over B. It re-weights tendencies the model can already produce rather than adding new ones, which is why it needs a competent SFT base and why KL regularization against the reference keeps it from drifting far.
- RL from verifiable rewards (RLVR). Amplifies high-return trajectories sampled from the current policy. It can only reinforce what the policy already produces sometimes — hence cold-start SFT — but it can also discover and weight up trajectories no supervised data contained, which is what produced R1's long internal reasoning.
- LoRA. Adds a low-rank update W → W + BA instead of moving all of W. Rank 8 is often enough because the task vector (the fine-tune's shift away from base) tends to be approximately low-rank for behavioral tasks.
- Model merging. Arithmetic on the weight vectors of fine-tunes from a shared base. It treats "task knowledge" as a direction in weight space that can be added or interpolated, as long as the fine-tunes stay in a shared loss basin.
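The "arithmetic on weight vectors" in the merging item can be written out directly. This sketch assumes flat three-element weight vectors with made-up values and a single scaling coefficient; it shows the simplest additive form only, and says nothing about whether a given pair of fine-tunes actually shares a loss basin.

```python
def task_vector(theta_finetuned, theta_base):
    """The fine-tune's shift away from the shared base."""
    return [f - b for f, b in zip(theta_finetuned, theta_base)]

def merge(theta_base, task_vectors, scale=1.0):
    """Add scaled task vectors back onto the base weights."""
    merged = list(theta_base)
    for tau in task_vectors:
        merged = [m + scale * t for m, t in zip(merged, tau)]
    return merged

# Illustrative flat weight vectors; real checkpoints have billions of entries.
theta_base = [1.0, 2.0, 3.0]
theta_style = [1.5, 2.0, 3.0]   # hypothetical fine-tune A
theta_tools = [1.0, 2.0, 2.5]   # hypothetical fine-tune B

tau_style = task_vector(theta_style, theta_base)
tau_tools = task_vector(theta_tools, theta_base)

merged_full = merge(theta_base, [tau_style, tau_tools], scale=1.0)
merged_half = merge(theta_base, [tau_style, tau_tools], scale=0.5)

print("merged (scale 1.0):", merged_full)
print("merged (scale 0.5):", merged_half)
```

In this toy the two task vectors touch different coordinates, so they add without conflict; when fine-tunes move the same weights in different directions, the sum is where interference appears.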
The throughline for this note is just the classification: each technique is a different objective for reshaping the function the model computes. Which one you actually reach for, and how to stage them, is the companion note's subject.
When the gap is a weight gap: a typology
Pulling the mechanistic threads together, here is a list of problem classes where the gap is usually in the weights rather than in missing context:
The mistake most projects make is misreading the second column as the first. They have a fresh-facts problem and reach for fine-tuning, where RAG would have worked. Or they have a multi-step task that decomposes cleanly and reach for fine-tuning, where an agent scaffold would have worked.
The opposite mistake exists too. Teams hammer prompts and retrieval pipelines for months, trying to close a format-collapse or persona-drift problem whose failure pattern is really a probability-distribution problem. Eventually they fine-tune, and the problem disappears because the fine-tuning data repeatedly rewards the trajectory they wanted the model to take all along.
The mechanistic perspective is supposed to make this decision easier. Ask, concretely: if this behavior existed, where in the model would it live? If the answer is "in the residual stream at runtime, supplied by the current input", input intervention can produce it. If the answer is "in the per-token output distribution across the whole generation", or "in feature directions the model doesn't have", or "as a default that overrides input", the durable intervention is a fine-tuning signal that changes those probabilities across similar contexts.
The fine-tuning preflight
The hardcore version of "should we fine-tune?" is an experiment plan, not an opinion. Before a serious run, the experiment plan needs five artifacts:
- A target eval. The metric the fine-tune is supposed to improve: schema-validity plus semantic correctness, refusal threshold, expert rubric score, unit-test pass rate, pairwise preference win rate.
- A regression eval. A fixed set of capabilities the base model must not lose: general chat, coding, reasoning, safety, latency, output length. Training loss does not see what you forgot to measure.
- A cheap baseline. Prompt/schema/RAG/scaffold/larger-teacher performance on the same eval. This is the control group for the training run, not the main story.
- A pilot run. Usually LoRA SFT first, on a small but clean dataset. You are testing whether the gradient points in the right direction before scaling data or complexity.
- A data ablation. Remove templates, mix in general replay, vary length/style, and check whether the model learned the behavior or just copied dataset artifacts.
The scaffold baseline catches one important false positive. A model that fails to review a pull request end-to-end may not need a new policy in its weights; it may need a loop that lists changed files, reads diffs, runs tests, preserves state, and asks the model to comment only on observed evidence. Fine-tuning can improve the review style or the per-step rubric, but it will not give one forward pass a filesystem, test runner, or persistent memory. Scaffold handles the loop; fine-tuning changes the policy inside each step.
The decision rule is brutal: if the pilot improves the target eval but damages the regression eval, you do not have a successful fine-tune — you have a trade-off. It may still be worth shipping, but only if the product values the target gain more than the lost behavior. If the pilot does not beat the best baseline, the answer is even simpler: do not escalate to a bigger training recipe yet.
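The rule is mechanical enough to write down. The following sketch assumes both evals report a single higher-is-better score and adds a hypothetical tolerance parameter for measurement noise; the names and labels are illustrative, not part of any existing tool.

```python
from typing import NamedTuple

class EvalResult(NamedTuple):
    target: float      # score on the target eval, higher is better
    regression: float  # score on the regression eval, higher is better

def decide(base, best_baseline, pilot, tolerance=0.0):
    """Classify a pilot run against the base model and the best cheap baseline.

    `tolerance` is a hypothetical allowance for eval noise on the regression set.
    """
    if pilot.target <= best_baseline.target:
        # The pilot did not beat prompting/RAG/scaffold on the same eval.
        return "do not escalate"
    if pilot.regression < base.regression - tolerance:
        # Target improved, but something the base could do got worse.
        return "trade-off"
    return "candidate"

base = EvalResult(target=0.60, regression=0.90)
best_baseline = EvalResult(target=0.70, regression=0.90)

print(decide(base, best_baseline, EvalResult(0.65, 0.90)))
print(decide(base, best_baseline, EvalResult(0.85, 0.80)))
print(decide(base, best_baseline, EvalResult(0.85, 0.90)))
```

A "trade-off" result is not an automatic rejection; it marks the point where the product has to decide whether the target gain is worth the lost behavior.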
This is where fine-tuning becomes engineering rather than folklore. The weight delta is real. The gain has to be real too.
A summary, mechanistically
Conclusion
Fine-tuning modifies the function the model computes. That is why it is powerful, and why it is dangerous. A good fine-tune does not merely make today's prompt work better; it changes the model's future probability distribution across many similar contexts.
The right question is therefore not "can we train it?" but "what do we expect the delta to buy?" If the answer is a reusable default, a lower per-token failure rate, a compressed instruction/rubric, a domain representation, a preference policy, a verifier-amplified capability, or a cheaper distilled model, fine-tuning is on the table. If the answer is just "the model got this one case wrong", it is not.
The mechanistic perspective makes the decision operational. If the target behavior is a durable property of the per-token output distribution, a domain representation the model lacks, or a policy trajectory the model should enter by default, the only way to make it durable is to train. But the engineering discipline is just as important: freeze evals, compare against cheap baselines, run a pilot, measure regressions, and treat every gain as a trade-off until proven otherwise.
Reduced to a routing rule:
The actual pipeline of weight-modification techniques is covered in the companion note on post-training, and the catalog of modern variants lives in the fine-tuning deep dive. This note is the decision layer before that pipeline starts: when a weight delta is worth writing at all.