Chapter 9 of 10
Fine-Tuning LLMs: Post-Training Is a Pipeline, Not a Step
Created May 30, 2026
Fine-tuning is not one move. It is a sequence of different objective signals applied to the same base object: a pretrained model θ_base whose logits you want to change. Continued pretraining, SFT, preference optimization, RL from verifiable rewards, and distillation all write deltas, but they do not teach the same thing.
The clean way to hold the pipeline in your head is:
Post-training is programming the probability distribution with data. SFT programs by examples. Preference optimization programs by comparisons. RLVR programs by tests. Distillation transfers trajectories. Evals tell you whether the program did the thing you meant or found a shortcut.
That is the central idea of this note. Not every project needs every stage. A customer-support JSON extractor might need domain substrate, SFT examples, constrained decoding, and evals. A reasoning model needs readable traces, on-policy exploration, verifier feedback, and usually distillation. The same umbrella word, "fine-tuning", covers completely different pipelines.
The decision of whether to train at all is covered in the companion note on when the weight delta is worth it. This note starts after that decision: once you know a weight update is justified, which objective signal should create it?
The pipeline as objective signals
A full post-training pipeline tends to look like this:
base model
  ↓ continued pretraining       raw domain substrate
domain-adapted base
  ↓ supervised fine-tuning      examples, format, defaults
SFT model
  ↓ preference optimization     preferences, ranking, policy shaping
aligned model
  ↓ RLVR                        verifiable trajectories
reasoning model
  ↓ distillation / release mix  transfer and product shape
final release
The arrows matter more than the acronyms. Each stage gives the optimizer a different kind of evidence:
- CPT says: "predict this domain text better."
- SFT says: "in this context, imitate these target tokens."
- Preference optimization says: "make this completion score higher than that one."
- RLVR says: "sample attempts, run a verifier, reinforce the attempts that pass."
- Distillation says: "copy the useful trajectories of a stronger or later-stage teacher."
Those signals produce different models because they point gradients in different directions. Treating them as one generic "fine-tuning" knob is how teams end up using DPO when they needed better SFT data, or SFT when they needed verifier-driven exploration.
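One way to see how different the evidence is: look at what a single training record might contain at each stage. The sketch below is illustrative only. The field names and example contents are hypothetical, not a standard schema from any particular trainer.

```python
# Hypothetical record shapes; field names are illustrative, not a standard schema.
cpt_record = {"text": "Raw domain document with no prompt/response split."}

sft_record = {
    "prompt": "Extract the order ID as JSON.",
    "response": '{"order_id": "A-1042"}',  # target tokens to imitate
}

preference_record = {
    "prompt": "Explain what a KL penalty does.",
    "chosen": "A short, direct explanation.",
    "rejected": "A long, evasive explanation.",  # only the comparison is supervised
}

rlvr_record = {
    "prompt": "What is 17 * 6?",
    "expected_answer": "102",  # consumed by a verifier, never shown as a target
}

distillation_record = {
    "prompt": "What is 17 * 6?",
    "teacher_trace": "17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102",
}


def signal_fields(record):
    """Return the non-prompt fields that a stage's objective would consume."""
    return sorted(key for key in record if key != "prompt")
```

The point of the comparison is that only the SFT and distillation records contain tokens the model is asked to reproduce; the preference record carries a ranking, and the RLVR record carries something to check against.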
The shared inner loop
Every stage still touches the model through the same basic shape. For a context, the model produces logits:
h_t = f_θ(x, y_<t)
z_t = W_U h_t
p_θ(y_t | x, y_<t) = softmax(z_t)
Then the post-training objective turns behavior into a scalar loss or reward, and backprop writes a parameter delta:
objective signal → L(θ) or A(y) → ∂L/∂θ → θ_post = θ_base + Δθ
The hard part is not the existence of gradients. The hard part is choosing what signal should point them.
Continued pretraining: substrate before behavior
Continued pretraining is the quiet stage most vertical projects skip until they feel the pain. The loss is ordinary next-token prediction, but the corpus is domain text: medical notes, legal filings, internal code, a low-resource language, product documentation.
CPT is for substrate, not instruction behavior. If the base model has never seen ICD-10 codes, your internal DSL, or Kazakh at meaningful scale, a few thousand SFT examples will not magically allocate clean representations. CPT gives the model more raw material to model; SFT later teaches it how to behave with that material.
The usual recipe is a domain corpus mixed with some broad original-style data to reduce forgetting. That mix is not optional housekeeping. CPT is still gradient descent on a narrower distribution: if the model only sees legal filings, medical notes, or internal code for long enough, it gets better at continuing that text and worse at parts of the original distribution you stopped showing it. General replay keeps broad language, instruction-following substrate, and adjacent skills from silently degrading while the domain gets sharper.
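A minimal sketch of the mixing idea, assuming a fixed replay ratio. The function name, the `replay_fraction` knob, and the value 0.25 are hypothetical choices for illustration; the right ratio is an empirical question that regression evals should answer.

```python
import math


def mixed_stream(domain_docs, replay_docs, replay_fraction, n_samples):
    """Interleave domain and general-replay documents at a fixed ratio."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    stream, d, r = [], 0, 0
    for i in range(n_samples):
        # Emit a replay document whenever the running replay quota crosses an integer.
        take_replay = math.floor((i + 1) * replay_fraction) > math.floor(i * replay_fraction)
        if take_replay and replay_docs:
            stream.append(("replay", replay_docs[r % len(replay_docs)]))
            r += 1
        else:
            stream.append(("domain", domain_docs[d % len(domain_docs)]))
            d += 1
    return stream


stream = mixed_stream(["filing_a", "filing_b"], ["web_text"], replay_fraction=0.25, n_samples=8)
```

A deterministic interleave is used here only to keep the example reproducible; real data loaders typically sample sources at random with the same target proportions.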
The output is not a chat model. It is a domain-adapted base that still needs downstream SFT.
SFT: examples become defaults
Supervised fine-tuning is the workhorse because it is the first stage that turns a base model into something product-shaped. The training objective is simple: cross-entropy on target response tokens, usually with the prompt masked out.
L_SFT(θ) = -Σ_t log p_θ(y_t | x, y_<t)
At a supervised token position, the logit-gradient has the familiar p - one_hot(target) shape. If the target token should have been } but the model put too much probability on prose, the update raises the } logit and lowers competing continuations for similar contexts. One example barely moves the model. Thousands of examples with the same structure produce correlated gradients, and correlated gradients become behavior.
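The masked loss and the p - one_hot(target) gradient can be checked on a toy example. This is a sketch under simplifying assumptions: a three-token vocabulary, and a gradient step applied directly to the logits rather than to the weights that produce them. The token ids and learning rate are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(token_logits, targets, loss_mask):
    """Sum of -log p(target) over positions where loss_mask is 1 (response tokens)."""
    loss = 0.0
    for logits, target, keep in zip(token_logits, targets, loss_mask):
        if keep:
            loss -= math.log(softmax(logits)[target])
    return loss


def logit_gradient(logits, target):
    """Gradient of -log p[target] with respect to the logits: p - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


# Toy vocabulary: 0 = "}", 1 = "Sure", 2 = "The". The model currently prefers prose.
logits = [0.0, 2.0, 1.0]
target = 0
grad = logit_gradient(logits, target)

# One gradient-descent step on the logits themselves (a simplification).
lr = 1.0
new_logits = [z - lr * g for z, g in zip(logits, grad)]
p_before = softmax(logits)[target]
p_after = softmax(new_logits)[target]
```

The gradient is negative on the target entry and positive on the competitors, so a descent step raises the `}` logit and lowers the others, which is the behavior described above.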
The useful intuition is that every completion is a tiny program for the model's future defaults. If all your examples answer in six bullet points, the model learns "six bullet points" as much as it learns the task. If all your refusals are verbose and legalistic, verbosity and legalism become part of the refusal behavior. The labels are not just labels; they are the behavior.
That is why the largest SFT failure is usually data, not optimizer math. Quality beats quantity, synthetic data is normal, and rejection sampling is often the cheapest way to improve a dataset when a verifier exists. But narrow data creates narrow models. Format bleeds into behavior. General replay, template diversity, edge cases, and anti-examples are not polish; they are how you tell the optimizer what you actually mean.
Preference optimization: comparisons, not demonstrations
After SFT, the model can follow instructions, but it may still choose mediocre answers among many plausible ones. Preference optimization teaches a different signal: not "copy this response", but "under this prompt, prefer this response to that one."
Classical RLHF trains a reward model on preference pairs and then optimizes the policy with PPO plus a KL penalty to stay near the SFT model. It works, but it is operationally heavy: a separate reward model, a value head, rollout instability, and many interacting hyperparameters.
DPO keeps the preference idea but removes the explicit RL loop. It compares the trained policy's log-probability ratio against a frozen reference policy:
DPO loss = -log σ( β · [ (log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x)) ] )
The key difference from SFT is sequence-level comparison. DPO is not told the chosen answer is perfect token by token. It is told the chosen answer should score better than the rejected one, without drifting too far from the reference. That makes it powerful for preferences, ranking, policy shaping, helpfulness, directness, and trade-offs that are hard to encode as a single target completion.
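The DPO loss written above can be evaluated for a single pair from four summed sequence log-probabilities. The sketch below assumes those log-probabilities have already been computed; the numbers and the value beta = 0.1 are illustrative, not recommendations.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: the margin is zero and the loss is log(2).
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen answer and lowered the rejected one relative to the reference.
after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Note that the loss depends only on how the policy has moved relative to the reference, not on the absolute log-probabilities, which is what keeps the comparison anchored.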
Two variants are worth keeping in the main mental model. SimPO uses the average per-token log-probability as the implicit reward, which drops the reference model and reduces DPO's length dependence. KTO handles pointwise "good/bad" labels when you do not have paired preferences. The rest of the variant zoo is useful, but it belongs in a reference note, not in the core pipeline story.
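To make the length-normalization point concrete, here is a sketch of a SimPO-style loss for one pair. It covers only SimPO, not KTO. The beta and gamma values and the example log-probabilities are hypothetical.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected, beta=2.0, gamma=0.5):
    """SimPO-style loss: average per-token log-prob as the reward, minus a target margin."""
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    return neg_log_sigmoid(reward_chosen - reward_rejected - gamma)


# A long chosen answer (40 tokens) versus a short rejected one (10 tokens).
# Summed log-probability favors the short answer; the per-token average favors the long one.
logp_chosen, len_chosen = -40.0, 40
logp_rejected, len_rejected = -15.0, 10
loss = simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected)
```

With summed log-probabilities the short rejected answer looks more likely simply because it has fewer tokens; dividing by length removes that advantage.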
The trap is to treat preference labels as truth. They are a measurement interface. If annotators prefer long answers because long answers look thoughtful, the model learns length. If an LLM judge rewards confident tone, the model learns confidence. Preference optimization is only as good as the difference the labels actually encode.
RLVR: tests become the teacher
For math, code, structured output, and formal reasoning, you often do not need a learned reward model. You have a verifier: unit tests, exact answer checks, schema validation, proof checking. This is RL from Verifiable Rewards.
The objective is no longer "match these target tokens." The model samples completions from its current policy, the verifier scores them, and the update increases the probability of trajectories with positive advantage:
sample y ~ π_θ(.|x)
reward r = verifier(x, y)
update roughly ∝ A(x, y) · Σ_t ∇_θ log π_θ(y_t | x, y_<t)
That difference is why RLVR can improve reasoning where SFT plateaus. SFT imitates traces in the dataset. RLVR explores many traces and reinforces the ones that actually pass. The cost is that the verifier becomes part of the objective. If it is incomplete, hackable, or miscalibrated, the model learns the verifier's quirks just as eagerly as it learns the task.
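The sample, verify, reinforce loop can be sketched with a toy policy over three candidate answers. This is a deliberately small illustration, not a training recipe: the policy is a bare softmax over logits, the advantage is just the raw reward with no baseline, and the prompt, candidates, and learning rate are invented for the example.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifier(prompt, answer):
    """Exact-answer check: reward 1.0 if the answer matches, else 0.0."""
    return 1.0 if answer == prompt["expected"] else 0.0


def rlvr_step(logits, candidates, prompt, rng, lr=0.5):
    """Sample one answer, score it, and apply advantage * grad log pi(answer) to the logits."""
    probs = softmax(logits)
    idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    advantage = verifier(prompt, candidates[idx])  # simplification: advantage = raw reward
    # For a softmax policy, d log pi(idx) / d logits = one_hot(idx) - probs.
    return [
        z + lr * advantage * ((1.0 if i == idx else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


prompt = {"question": "What is 17 * 6?", "expected": "102"}
candidates = ["96", "102", "112"]
logits = [0.0, 0.0, 0.0]
rng = random.Random(0)

p_before = softmax(logits)[1]
for _ in range(200):
    logits = rlvr_step(logits, candidates, prompt, rng)
p_after = softmax(logits)[1]
```

Nothing in the loop ever shows the model the correct answer as a target. The policy only learns from which of its own samples the verifier accepted, so a verifier that accepted "112" by mistake would be reinforced just as readily.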
GRPO made this style of training practical outside frontier labs by removing the critic. For each prompt, sample a group of completions, score each one, and normalize each reward against the group's own mean and standard deviation. The group becomes its own baseline.
for each prompt x:
    sample K completions y_1..y_K from current policy
    compute rewards r_1..r_K
    advantage_i = (r_i - mean(r)) / std(r)
    update policy with PPO-style clipped objective using advantage_i
The failure mode is visible in the formula: if every completion passes or every completion fails, the group has no useful advantage spread. Difficulty calibration is the whole game. The prompts that teach are the ones where some rollouts pass and some fail.
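The group-relative advantage and the zero-spread failure mode are easy to reproduce. In this sketch the small `eps` in the denominator, the use of the population standard deviation, and the clip range of 0.2 are assumptions added to make the example runnable; the formula above does not specify them.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Two of four rollouts pass: the group carries a usable signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout passes: all advantages are zero, so the prompt teaches nothing.
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Here `mixed` comes out close to +1 for the passing rollouts and close to -1 for the failing ones, while `all_pass` is identically zero. A group where every rollout fails behaves the same way.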
The R1 recipe as a story
DeepSeek-R1 is useful because it published the shape of a modern reasoning pipeline. It is a public reference recipe, not a universal recipe and not a claim that every frontier lab trains reasoning models in exactly this order:
1. Cold-start SFT
   - Small set of high-quality long reasoning traces
   - Teaches readable reasoning format
2. Reasoning-oriented RLVR
   - Rule-based rewards on math, code, and science
   - The model explores trajectories and reinforces passes
3. Rejection sampling + SFT
   - Sample many completions from the RL checkpoint
   - Keep high-reward traces
   - Mix with general SFT data
4. Broader alignment pass
   - Helpfulness, harmlessness, general assistant behavior
5. Distillation
   - Use the final model as teacher
   - SFT smaller models on its long reasoning traces
The order matters. R1-Zero showed that RL alone can discover reward-seeking reasoning behavior, but the traces were hard for humans to read: mixed languages, unstable formatting, strange reasoning style. Cold-start SFT gives RL a readable region of policy space to amplify. The later rejection-sampling SFT step pulls the model back toward general assistant behavior. Distillation makes the recipe available to smaller models without forcing every team to run frontier-scale RL.
The lesson is not "copy R1 exactly." The lesson is that stages compose because they teach different things. SFT creates a readable mode. RLVR amplifies verified trajectories inside that mode. Rejection sampling turns discovered successes back into supervised data. Distillation transfers the behavior to cheaper models.
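The rejection-sampling step, where discovered successes become supervised data, can be sketched as a simple filter. This is an assumption-laden illustration rather than the R1 implementation: `generate` and `verifier` are hypothetical callables, the canned attempts stand in for sampling from an RL checkpoint, and the deduplication is a design choice added here.

```python
def rejection_sample(prompts, generate, verifier, k=4):
    """Sample k completions per prompt and keep the verified ones as SFT records."""
    records = []
    for prompt in prompts:
        seen = set()
        for i in range(k):
            completion = generate(prompt, i)
            if completion in seen:
                continue  # skip exact duplicates so one trace is not over-weighted
            seen.add(completion)
            if verifier(prompt, completion):
                records.append({"prompt": prompt["question"], "response": completion})
    return records


def toy_generate(prompt, i):
    """Stand-in for sampling from a checkpoint: cycles through canned attempts."""
    attempts = [
        "17 * 6 = 96",
        "17 * 6 = 60 + 42 = 102",
        "17 * 6 = 102",
        "17 * 6 = 60 + 42 = 102",
    ]
    return attempts[i % len(attempts)]


def toy_verifier(prompt, completion):
    """Accept a completion whose final value after the last '=' matches the expected answer."""
    return completion.split("=")[-1].strip() == prompt["expected"]


sft_data = rejection_sample(
    [{"question": "What is 17 * 6?", "expected": "102"}], toy_generate, toy_verifier
)
```

The output has the same shape as ordinary SFT data, which is why the kept traces can be mixed with general SFT examples in the next stage.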
Evals are the pipeline's guardrails
Training loss only says the model fit your training signal. It does not say the product got better. A serious post-training run needs at least:
- Target evals for the behavior being improved: verifier pass rate, schema validity, rubric score, pairwise preference win rate.
- Regression evals for what the base model already did well: chat, code, reasoning, safety, latency, output length.
- Contamination checks so the held-out data stays held out.
- Reward-hacking probes for RL or judge-based pipelines.
- Human spot checks for tone, subtle incorrectness, and failure modes the metrics missed.
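As one example from the list above, a contamination check can start as a plain n-gram overlap test between eval examples and training text. This is a rough sketch: whitespace tokenization and the default n = 8 are assumptions, and exact n-gram matching will miss paraphrased leaks.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(eval_example, training_texts, n=8):
    """True if any n-gram of the eval example also appears in the training texts."""
    eval_grams = ngrams(eval_example, n)
    return any(eval_grams & ngrams(text, n) for text in training_texts)


training_texts = ["the quick brown fox jumps over the lazy dog"]
leaked = contaminated("a quick brown fox appears", training_texts, n=3)
clean = contaminated("completely unrelated held out question", training_texts, n=3)
```

In this toy case `leaked` is true because the trigram "quick brown fox" appears in both texts, and `clean` is false.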
The most important habit is to define the evals before training. The moment you tune hyperparameters against a held-out set, that set starts becoming training data. Keep one rarely touched final set for the go/no-go.
Every stage also needs a kill condition. Before you launch it, know what number would make you stop: target quality too low, regressions too large, refusal rate changed too much, output length too expensive, verifier probes showing reward hacking. Without that, the pipeline becomes a story you tell yourself while the model drifts.
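A kill condition is easiest to honor when it is written down as data before the run starts. The sketch below is one possible shape; the metric names and thresholds are hypothetical and would come from your own target and regression evals.

```python
def check_kill_conditions(metrics, limits):
    """Return the violated limits; an empty list means the stage may continue.

    limits maps a metric name to ("min" | "max", threshold).
    """
    violations = []
    for name, (kind, threshold) in limits.items():
        value = metrics[name]
        if kind == "min" and value < threshold:
            violations.append(f"{name}={value} below {threshold}")
        elif kind == "max" and value > threshold:
            violations.append(f"{name}={value} above {threshold}")
    return violations


# Thresholds fixed before launch (illustrative numbers only).
limits = {
    "target_pass_rate": ("min", 0.80),
    "regression_drop": ("max", 0.02),
    "mean_output_tokens": ("max", 600),
}

metrics = {"target_pass_rate": 0.86, "regression_drop": 0.05, "mean_output_tokens": 540}
violations = check_kill_conditions(metrics, limits)
```

In this example the target metric improved, but the regression limit was exceeded, so the run should stop even though the headline number looks good.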
What this note leaves out
This note is deliberately not a catalog. Once you know which stage you need, there are many implementation choices: LoRA vs QLoRA vs DoRA, DPO variants, memory-efficient full fine-tuning, model merging, trainer stacks, rollout engines, and multi-LoRA serving. Those knobs matter, but they are second-order after the objective choice.
The reference-style deep dive is here: Modern Fine-Tuning Deep Dive.
Conclusion
Post-training is a pipeline because the model needs to learn different things from different evidence. Domain text teaches substrate. SFT teaches examples and defaults. Preference optimization teaches comparisons. RLVR teaches verified trajectories. Distillation transfers the resulting behavior.
The central design decision is not "which acronym is best?" It is: what signal does this model need next? If you answer that correctly, the pipeline becomes legible. If you answer it wrong, more stages just make the wrong objective more expensive.
Fine-tuning is not a step. It is a sequence of objective choices.