Chapter 27 of 28

RLHF: From Likelihood to Preference in Three Stages

Created Jun 16, 2026 Updated Jun 16, 2026

RLHF — reinforcement learning from human feedback — is the stage that takes a model from "answers in the right format" to "answers closer to what the preference data rewards." Its one big idea is a shift in what you optimize: from likelihood (reproduce this token) to preference (this whole answer beats that one). Nothing in next-token prediction ever said which of two fluent answers is better; RLHF installs a proxy for that judgment.

The classic pipeline (the InstructGPT recipe) is three stages, run in order:

1. SFT          imitation on (prompt → ideal answer) demos → a competent base policy
2. reward model humans rank the policy's own samples; fit r(x,y) to those comparisons
3. policy opt.  PPO raises high-reward completions, lowers low-reward ones,
                with a KL leash holding the policy near the SFT reference

Stage 1 — SFT. You need a model whose own samples are good enough to rank, because the later stages compare its outputs.

Stage 2 — the reward model. For a prompt, sample several completions and have humans pick the better of two. Fit a reward model r(x, y) — a scalar "how much a human would like this" — to those comparisons using the Bradley–Terry model: the probability the winner beats the loser is logistic in their reward gap, σ(r_w − r_l). The reward model is the compressed proxy for human judgment, and it is where the labeling guideline's opinions about "good" get baked into a number.
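The Bradley–Terry fit described above can be sketched in a few lines. This is a minimal illustration, not the InstructGPT implementation: it assumes the reward model has already produced scalar scores for each completion, and the function names (preference_probability, bradley_terry_loss) are made up for this example. The loss is the mean negative log-likelihood of the human choices, that is, the average of −log σ(r_w − r_l).

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_winner: float, r_loser: float) -> float:
    """Bradley-Terry: P(winner beats loser) = sigmoid(r_w - r_l)."""
    return sigmoid(r_winner - r_loser)


def bradley_terry_loss(pairs) -> float:
    """Mean negative log-likelihood over (r_winner, r_loser) score pairs.

    `pairs` is a hypothetical input format chosen for this sketch:
    one tuple of reward-model scores per human comparison.
    """
    if not pairs:
        raise ValueError("need at least one comparison")
    total = 0.0
    for r_w, r_l in pairs:
        gap = r_w - r_l
        # -log(sigmoid(gap)) == log(1 + exp(-gap)), computed without overflow
        total += max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))
    return total / len(pairs)
```

A pair the reward model already orders correctly by a wide margin contributes almost nothing to the loss, while a pair it orders wrongly contributes roughly the size of the gap. In a real setup the scores would come from a neural network and this loss would be minimized by gradient descent over its parameters.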

Stage 3 — policy optimization. Optimize the policy against the reward, typically with PPO (Proximal Policy Optimization), on its own samples — which is what lets it improve past the demonstrations. But the reward is a proxy, not the truth, so the objective isn't "maximize reward"; it's:

maximize  E_{y ∼ π_θ(·|x)}[ r(x, y) ]  −  β · KL( π_θ(·|x) ‖ π_ref(·|x) )

The reward pulls the policy toward preferred answers; the KL term (the leash) keeps it from drifting too far from the trusted SFT reference. The reward model was fit on comparisons of samples from near that reference, so that is the region where it still scores reliably. (See: Why KL is the natural distance here.)
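To make the objective concrete, here is a hedged sketch of how it can be estimated from samples. It assumes you already have, for each completion sampled from the policy, a reward score and the sequence log-probability under both the policy and the reference; the function name and all numbers are illustrative. Because the samples are drawn from π_θ, the average of log π_θ(y|x) − log π_ref(y|x) is a Monte Carlo estimate of the KL term.

```python
def kl_regularized_objective(rewards, logp_policy, logp_ref, beta):
    """Monte Carlo estimate of E[r(x, y)] - beta * KL(pi_theta || pi_ref).

    The three lists are aligned per completion y_i sampled from pi_theta:
      rewards[i]      reward-model score r(x, y_i)
      logp_policy[i]  log pi_theta(y_i | x)
      logp_ref[i]     log pi_ref(y_i | x)
    """
    if not rewards:
        raise ValueError("need at least one sample")
    shaped = [
        r - beta * (lp - lr)  # reward minus the per-sample KL penalty
        for r, lp, lr in zip(rewards, logp_policy, logp_ref)
    ]
    return sum(shaped) / len(shaped)


# Illustrative numbers only (not from any real model).
rewards = [1.0, 2.0]
logp_ref = [-1.5, -2.0]

# Policy identical to the reference: no penalty, objective is the mean reward.
on_leash = kl_regularized_objective(rewards, logp_ref, logp_ref, beta=0.1)

# Policy that has made its own samples more likely than the reference does:
# same rewards, but the leash now subtracts a penalty.
drifted = kl_regularized_objective(rewards, [-0.5, -1.0], logp_ref, beta=0.1)
```

With the same rewards, the drifted policy scores lower than the on-leash one, so a move away from the reference only pays off if it buys more reward than β times the KL it costs. Practical PPO implementations often apply this penalty token by token rather than once per sequence; the per-sequence form here is a simplification.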

Why it's worth understanding. Because RLHF optimizes a proxy for human preference, you inherit the proxy's quirks along with its virtues: this is one place where sycophancy, length inflation, and reflexive hedging can be amplified — raters reward agreement and confident-sounding answers in the moment, so the policy learns to produce them. It is also where much of a model's "personality" and warmth get sharpened.

Replace the human labelers with a model and you get RLAIF. The full pipeline — reward modeling, the KL leash, DPO and the offline-preference family, and Goodhart's tax — is in How LLMs Learn Human Preferences.