Chapter 26 of 28
SFT: Imitation, and the Ceiling It Hits
Created Jun 16, 2026 Updated Jun 16, 2026
SFT — supervised fine-tuning — is usually the first post-training step that turns a pretrained model toward assistant behavior. The recipe is deliberately boring: collect examples of good assistant behavior as (prompt → ideal completion) pairs, and minimize cross-entropy on the target tokens. It is the same objective as pretraining — predict the next token — just run on curated demonstrations instead of the open internet.
What SFT mostly teaches is the demonstrated mode of behavior: format, instruction-following, tool-call conventions, and the assistant register. Pretraining gives the model the ability to continue text plausibly; SFT teaches it the shape of being helpful, which means answering the question that was asked and responding as an assistant rather than an autocomplete. Whoever wrote the demonstrations encoded a house style, and the model imitates not just the answers but their manner.
pretraining: predict next token over all of the internet
SFT: predict next token over (prompt → ideal answer) demos
→ minimize −Σ log p_θ(target_token | context)
That objective is plain maximum likelihood — why that loss is exactly cross-entropy is its own one-pager.
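To make "cross-entropy on the target tokens" concrete, here is a minimal sketch in plain Python. It assumes a loss mask that marks which positions belong to the completion, and it averages over those positions rather than summing; both the mask convention and the names (`sft_loss`, `loss_mask`) are illustrative assumptions, not any particular library's API.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(logits_per_position, token_ids, loss_mask):
    """Mean negative log-likelihood over completion tokens only.

    logits_per_position[t] scores the token at position t + 1, so the
    prediction made at t is compared against token_ids[t + 1].
    loss_mask[t] is 1 for completion tokens and 0 for prompt tokens
    (an assumed convention for this sketch).
    """
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        if loss_mask[t + 1] == 0:
            continue  # prompt token: conditioned on, but not trained on
        target = token_ids[t + 1]
        total -= log_softmax(logits_per_position[t])[target]
        count += 1
    return total / max(count, 1)

# Toy sequence: 2 prompt tokens followed by 2 completion tokens, vocabulary of 4.
token_ids = [0, 1, 2, 3]
loss_mask = [0, 0, 1, 1]

# A model that knows nothing assigns equal probability to every token.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = sft_loss(uniform_logits, token_ids, loss_mask)

# A model that strongly favors the demonstrated tokens gets a lower loss.
confident_logits = [[0.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 10.0, 0.0],
                    [0.0, 0.0, 0.0, 10.0]]
confident_loss = sft_loss(confident_logits, token_ids, loss_mask)
```

With uniform logits the loss equals log(4), the cost of guessing among four tokens. The only way to lower it is to put more probability on exactly the tokens the demonstrator wrote.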
The ceiling — and why it matters. SFT is imitation, and imitation has a structural limit:
- It is bounded by the distribution and quality of the demonstrations: the loss rewards matching the written target, not discovering a better answer the demonstrator never thought of.
- Most of what we want is comparative, not absolute. "Helpful", "honest", "appropriately concise" are judgments you make by comparing two outputs, and there is no slot for a comparison in (prompt → target). SFT can teach a good answer; it cannot teach better-than.
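One way to see the "no slot for a comparison" point is to look at the shape of the training records. The sketch below uses hypothetical record types (`SFTExample`, `PreferencePair`, and their field names are assumptions for illustration): the SFT record has room for one answer, while a preference record needs two.

```python
from dataclasses import dataclass, fields

@dataclass
class SFTExample:
    prompt: str
    target: str  # the single demonstration to imitate

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the output judged better
    rejected: str  # the output judged worse

# Field names of each record type, for comparison.
sft_fields = [f.name for f in fields(SFTExample)]
pref_fields = [f.name for f in fields(PreferencePair)]
```

An `SFTExample` can say "this is good", but it has nowhere to record "this is better than that".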
That gap is exactly why the next stage exists. SFT produces a competent base policy — good enough that you can sample from it and have humans (or a model) rank the samples — and it becomes the reference point the rest of post-training is tethered to.
Where it sits in the pipeline. SFT is the launchpad for RLHF: the reward model is often initialized from the SFT/base checkpoint with a scalar reward head, and the KL leash in policy optimization pulls the trained policy back toward this SFT checkpoint. Skip a strong SFT base and the later stages have nothing trustworthy to optimize from.
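The KL leash can be sketched as a penalty subtracted from the reward-model score. This is a minimal illustration under stated assumptions: the function name, the coefficient `beta`, and its default value are hypothetical, and real pipelines differ in where and how they apply the penalty.

```python
def kl_shaped_reward(reward, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the SFT reference.

    policy_logprobs and sft_logprobs are the per-token log-probabilities that
    each model assigns to the same completion, sampled from the policy. Their
    summed difference is a single-sample estimate of the KL divergence from
    the policy to the SFT reference.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return reward - beta * kl_estimate

# The policy is more confident than the SFT model on both tokens,
# so the estimate is positive and the reward is reduced.
shaped = kl_shaped_reward(
    reward=2.0,
    policy_logprobs=[-0.5, -0.25],
    sft_logprobs=[-1.0, -0.75],
)

# If the policy matches the SFT model exactly, nothing is subtracted.
unchanged = kl_shaped_reward(2.0, [-1.0, -0.75], [-1.0, -0.75])
```

The further the policy's probabilities drift from the SFT checkpoint on its own samples, the more reward it gives up. A larger `beta` means a shorter leash.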
How preference learning breaks that ceiling is the deep dive.