Chapter 26 of 28

SFT: Imitation, and the Ceiling It Hits

Created Jun 16, 2026 Updated Jun 16, 2026

SFT — supervised fine-tuning — is usually the first post-training step that turns a pretrained model toward assistant behavior. The recipe is deliberately boring: collect examples of good assistant behavior as (prompt → ideal completion) pairs, and minimize cross-entropy on the target tokens. It is the same objective as pretraining — predict the next token — just run on curated demonstrations instead of the open internet.
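To make "cross-entropy on the target tokens" concrete, here is a minimal sketch of how one (prompt → completion) pair might be turned into a training example. The function name `build_sft_example` and the integer token IDs are hypothetical, and masking out the prompt positions is an assumption about how "target tokens" is implemented; some recipes train on every token instead.

```python
def build_sft_example(prompt_ids, completion_ids):
    """Turn one (prompt, completion) pair into next-token training arrays.

    Returns (inputs, labels, loss_mask), where labels[i] is the token that
    should follow inputs[: i + 1], and loss_mask[i] is 1 only when labels[i]
    belongs to the completion.
    """
    tokens = list(prompt_ids) + list(completion_ids)
    inputs = tokens[:-1]   # the model sees everything except the last token
    labels = tokens[1:]    # and is asked to predict the sequence shifted by one
    n_prompt = len(prompt_ids)
    # labels[i] is tokens[i + 1], so it is a completion token iff i + 1 >= n_prompt
    loss_mask = [1 if i + 1 >= n_prompt else 0 for i in range(len(labels))]
    return inputs, labels, loss_mask


# Toy token IDs: a 3-token prompt followed by a 2-token ideal completion.
inputs, labels, loss_mask = build_sft_example([5, 6, 7], [8, 9])
print(inputs)     # [5, 6, 7, 8]
print(labels)     # [6, 7, 8, 9]
print(loss_mask)  # [0, 0, 1, 1]
```

Only the two completion tokens contribute to the loss in this sketch; the prompt is context, not something the model is graded on.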

Pretraining gives the model the ability to continue text plausibly. SFT mostly teaches it a mode of behavior, the shape of being helpful: answer the question that was asked, follow the instruction, use the tool-call conventions, and adopt the register of an assistant rather than an autocomplete. Whoever wrote the demonstrations encoded a house style, and the model imitates not just the answers but their manner.

pretraining:   predict next token over all of the internet
SFT:           predict next token over (prompt → ideal answer) demos
               → minimize  −Σ log p_θ(target_token | context)

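That objective is plain maximum likelihood: minimizing the summed negative log-probability of the target tokens is the same as minimizing cross-entropy against a one-hot target at each position. Why that loss is exactly cross-entropy is its own one-pager.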
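The objective above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a training loop: `sft_loss` and `log_softmax` are hypothetical helper names, the logits are toy numbers rather than model outputs, and the loss is summed to match the formula above (many implementations average over target tokens instead).

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(logits_per_position, labels, loss_mask):
    """Sum of -log p(label | context) over positions where loss_mask is 1."""
    total = 0.0
    for logits, label, keep in zip(logits_per_position, labels, loss_mask):
        if keep:
            total -= log_softmax(logits)[label]
    return total


# Toy setup: vocabulary of 3 tokens, 4 positions, the last 2 are target tokens.
labels = [1, 2, 0, 1]
loss_mask = [0, 0, 1, 1]

uniform = [[0.0, 0.0, 0.0]] * 4  # the model has no idea: p = 1/3 everywhere
confident = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]

print(sft_loss(uniform, labels, loss_mask))    # 2 * log(3), about 2.197
print(sft_loss(confident, labels, loss_mask))  # much smaller: targets are likely
```

Pushing probability mass onto the demonstrated tokens is the only way to lower this number, which is why the objective can only ever pull the model toward what the demonstrator wrote.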

The ceiling, and why it matters. SFT is imitation, and imitation has two structural limits:

It is bounded by the distribution and quality of the demonstrations: the loss rewards matching the written target, not discovering a better answer the demonstrator never thought of.
Most of what we want is comparative, not absolute. "Helpful", "honest", "appropriately concise" are judgments you make by comparing two outputs, and there is no slot for a comparison in (prompt → target). SFT can teach a good answer; it cannot teach better-than.

That gap is exactly why the next stage exists. SFT produces a competent base policy — good enough that you can sample from it and have humans (or a model) rank the samples — and it becomes the reference point the rest of post-training is tethered to.
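The missing "slot for a comparison" is easiest to see in the shape of the data. The sketch below contrasts an SFT record with a ranked pair, and shows one common way a comparison becomes a training signal: the pairwise logistic (Bradley-Terry style) loss. The record types, field names, and the choice of this particular loss are assumptions for illustration; the chapter does not say which preference objective the next stage uses.

```python
import math
from typing import NamedTuple


class Demo(NamedTuple):
    """What SFT trains on: one prompt, one target. No notion of 'better'."""
    prompt: str
    target: str


class PreferencePair(NamedTuple):
    """What ranking produces: two completions and which one was preferred."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), computed stably."""
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = PreferencePair(
    prompt="Summarize this email in one line.",
    chosen="Meeting moved to 3pm Thursday.",
    rejected="The email discusses various scheduling matters in some detail.",
)

# Scores here are made-up scalars standing in for a reward model's outputs.
print(pairwise_loss(2.0, 0.0))  # small: the scorer agrees with the ranking
print(pairwise_loss(0.0, 2.0))  # large: the scorer disagrees with the ranking
```

The loss depends only on the difference between the two scores, so it encodes "better than" directly, which a single (prompt → target) record cannot.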

Where it sits in the pipeline. SFT is the launchpad for RLHF: the reward model is often initialized from the SFT/base checkpoint with a scalar reward head, and the KL leash in policy optimization pulls the trained policy back toward this SFT checkpoint. Skip a strong SFT base and the later stages have nothing trustworthy to optimize from.
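The KL leash can be sketched as a penalty on the reward. One common formulation, assumed here because the chapter does not commit to a specific one, subtracts a scaled log-probability ratio between the trained policy and the frozen SFT checkpoint from the reward model's score. The function name `kl_penalized_reward`, the `beta` value, and the per-token log-probabilities are all hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward minus beta times a single-sample estimate of KL(policy || reference).

    policy_logprobs[t] and reference_logprobs[t] are the log-probabilities that
    the trained policy and the frozen SFT checkpoint assign to the same sampled
    token at position t of one completion.
    """
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * log_ratio


# The policy still agrees with the SFT checkpoint: no penalty is applied.
same = kl_penalized_reward(1.0, [-0.5, -0.7], [-0.5, -0.7])

# The policy has drifted toward tokens the SFT checkpoint finds unlikely:
# the log-ratio is 0.9 + 1.0 = 1.9, so the reward drops by 0.1 * 1.9.
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.2])

print(same, drifted)
```

The further the policy moves from the SFT checkpoint on its own samples, the more reward it has to earn to justify the move, which is the sense in which the SFT checkpoint is the reference point.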

How preference learning breaks that ceiling is the deep dive.