Chapter 11 of 13

How LLMs Learn Human Preferences: RLHF, RLAIF and Beyond

Created Jun 15, 2026

A pretrained language model is an extraordinary mimic and a terrible assistant. It has read enough text to continue almost any prompt plausibly, but "plausible continuation of the internet" is not the same thing as "the answer this person wanted, phrased the way they wanted it, without the parts that would harm them." Nothing in the pretraining objective ever told it which of two equally fluent answers is better. That judgment — better-than, for a human, in context — is what preference learning installs.

Why this matters beyond making a model polite: this is the stage where a model acquires dispositions — the consistent tilts we read as personality, warmth, emotional tone, or sycophancy. Pretraining and SFT decide what the model can say; preference learning decides what it prefers to say. The character we read off a model's replies is, in large part, written here.

The spine of the whole explanation is a single shift: from optimizing likelihood to optimizing preference. Hold onto that contrast; everything else is implementation.

This note builds on the mechanics of a weight delta — what it means, concretely, to change the function a model computes. Here we ask a narrower question: how do you change that function using nothing but judgments about which output is better?

The gap SFT can't close

Supervised fine-tuning (SFT) is imitation. You collect examples of good assistant behavior — (prompt → ideal completion) — and minimize cross-entropy on the target tokens, exactly as in pretraining but on curated demonstrations. (Why cross-entropy, and why it is just maximum likelihood wearing a different hat, is its own short note.) SFT is indispensable: it teaches the model the format of being an assistant — answer the question, follow the instruction, use the tool, adopt the register.
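
As a rough illustration of that objective, here is a minimal sketch that assumes we can read off the probability the model assigned to each token of the demonstrated completion. The probability values are made up; the loss is simply the mean negative log-probability of the target tokens.

```python
import math

def sft_loss(target_token_probs):
    """Mean cross-entropy (negative log-likelihood) over the target tokens.

    target_token_probs[i] is the probability the model assigned to the i-th
    token of the demonstrated completion (hypothetical values here).
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# Hypothetical per-token probabilities for the same ideal completion.
confident = [0.9, 0.8, 0.95]   # the model already nearly reproduces the demo
unsure = [0.3, 0.2, 0.4]       # the model finds the demo surprising

print(sft_loss(confident) < sft_loss(unsure))  # prints True
```

The loss only ever asks "how likely was the demonstrated token?", which is why a different but better answer earns no credit.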

But imitation has a ceiling, and the ceiling is structural:

It can only be as good as the demonstrations. The target is a specific token sequence a human wrote. The model is rewarded for reproducing it and penalized for everything else — including better answers the demonstrator never thought of. You cannot reliably optimize beyond what the demonstrations encode.
Most of what we want is comparative, not absolute. "Helpful", "honest", "harmless", "well-calibrated", "appropriately concise" are not properties of a single output you can write down as a target. They are judgments you make by comparing outputs. There is rarely one right way to explain recursion or console a frustrated user — there are better and worse ones.
The signal you actually care about isn't a token. What you want to optimize is "a human prefers this response." That is not a target completion; it is a verdict over completions. SFT has no slot for a verdict.

So the data shape has to change. SFT data is (prompt → target). Preference data is (prompt, completion A, completion B → which is better). The first teaches a good answer. The second teaches better-than — and "better-than", applied over and over, is how you push a model past the demonstrations toward what people actually want.
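
The two data shapes can be written down side by side. This is only a sketch: the class and field names are hypothetical, chosen for illustration rather than taken from any particular library.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    prompt: str
    target: str      # the single completion to imitate

@dataclass
class PreferenceExample:
    prompt: str
    chosen: str      # the completion judged better
    rejected: str    # the completion judged worse

sft = SFTExample(
    prompt="Explain recursion.",
    target="A function that solves a problem by calling itself on a smaller input.",
)
pref = PreferenceExample(
    prompt="Explain recursion.",
    chosen="A function that solves a problem by calling itself on a smaller input.",
    rejected="Recursion is when recursion happens.",
)
```

Note that the preference record never says what the ideal answer is, only which of two candidates won.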

SFT teaches the model to answer. Preference learning teaches it which answer to prefer.

Mechanistically the contrast is just as clean. SFT moves probability mass toward demonstrated tokens. Preference learning moves probability mass according to a comparison: make the preferred completion more likely than the rejected one, relative to where the model started. The first is a target; the second is a gradient direction defined by a judgment.

The classic RLHF pipeline

RLHF (reinforcement learning from human feedback) is not magic morality injected into a model. It is a learned scoring function, trained on messy human comparisons, then optimized against — keep that deflationary picture in mind through all three stages.

The recipe that first turned a raw GPT into a usable assistant — the InstructGPT pipeline, building on Christiano and colleagues' earlier work learning reward functions from human preferences in control tasks — has three stages, run in order.

Each stage produces an artifact the next one consumes. Take them one at a time.

Stage 1: SFT, the launchpad

You cannot optimize preferences on a model that produces garbage, because the later stages work by comparing the model's own samples. So you start with SFT on a few thousand to a few tens of thousands of high-quality demonstrations, yielding a policy that already answers in roughly the right shape. SFT is not scaffolding to be discarded once RL begins — it is the reference point the whole next stage is tethered to. Remember that; it matters at Stage 3.

Stage 2: the reward model

Now you teach a model what humans prefer.

Collect comparisons: for a given prompt, sample several completions from the SFT model and have humans rank them — or, in the simplest case, pick the better of two. You now have a dataset of pairs (y_w, y_l), a "winner" and a "loser" for the same prompt.
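
When raters rank several completions rather than picking one of two, each ranking can be expanded into pairwise comparisons. A minimal sketch, assuming the ranking is a best-first list with no ties (the completion labels are placeholders):

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    combinations() preserves input order, so the first element of each
    pair is always the higher-ranked completion.
    """
    return list(combinations(ranked, 2))

ranked = ["answer_a", "answer_b", "answer_c"]  # best first
pairs = ranking_to_pairs(ranked)
print(pairs)
# prints [('answer_a', 'answer_b'), ('answer_a', 'answer_c'), ('answer_b', 'answer_c')]
```

A ranking of K completions yields K(K-1)/2 pairs, so a handful of rankings goes a long way.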

Fit a reward model r(x, y), typically the SFT model with its language-modeling head swapped for a single scalar output, to these comparisons using the Bradley–Terry model of pairwise choice. Bradley–Terry says the probability that y_w beats y_l is a logistic function of their reward difference:

P(y_w ≻ y_l)  =  σ( r(x, y_w) − r(x, y_l) )

so you train the reward model to make the winner's reward exceed the loser's, by minimizing

L_RM  =  − log σ( r(x, y_w) − r(x, y_l) )

over all the human comparisons. The output is a single number per (prompt, completion) scoring how much a human would like it.
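
The two formulas above translate almost line for line into code. A minimal sketch on scalar rewards, where the reward values are invented and a real reward model would produce them from (prompt, completion) inputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_prob(r_w, r_l):
    """Bradley-Terry probability that the winner beats the loser."""
    return sigmoid(r_w - r_l)

def rm_loss(r_w, r_l):
    """Reward-model loss for one comparison: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))

print(round(bt_prob(0.0, 0.0), 3))  # prints 0.5   (no preference either way)
print(round(bt_prob(2.0, 0.0), 3))  # prints 0.881 (winner clearly ahead)
print(round(rm_loss(0.0, 0.0), 3))  # prints 0.693 (log 2, the "coin flip" loss)
print(rm_loss(5.0, 3.0) == rm_loss(2.0, 0.0))  # prints True
```

The last line shows that only the reward difference matters: adding a constant to every reward changes nothing, so the scale of r(x, y) carries no absolute meaning.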

This is the most important object in the pipeline, and it is worth being precise about what it is: the reward model is a learned, compressed proxy for human judgment. Every preference the labelers expressed — and behind them, every instruction in the labeling guideline, every rater's taste, every bit of inter-annotator disagreement averaged into the data — is squeezed into one scalar function. When we later say a model was "trained to be helpful," what we concretely mean is: a reward model scored helpfulness-shaped answers higher, and a policy was bent to climb it.

Two consequences follow:

The guideline is an opinion. Someone wrote the document that told labelers what "good" means — how to weigh helpfulness against caution, how warm to be, when to refuse. That opinion is now numerically encoded — which is exactly where a model's personality begins to take shape.
The labels are noisy and human. Raters disagree, get tired, prefer longer answers, prefer answers that agree with them. The reward model faithfully learns those tendencies too — including the ones nobody intended. Hold that thought for the section on Goodhart.

Stage 3: policy optimization, and the KL leash

Now bend the policy toward the reward model. The textbook tool is PPO (Proximal Policy Optimization), run as an online loop: sample completions from the current policy, score them with the reward model, and take a policy-gradient step that raises the probability of high-reward completions and lowers that of low-reward ones. Unlike SFT, there is no fixed target sequence — the model learns from feedback on its own samples, which is what lets it improve past the demonstrations.

To make that step stable, PPO adds a value model that predicts the reward the current policy expects on a given prompt. The update is driven not by the raw score but by the advantage (how much better than expected a sample turned out), so a completion that merely matches what the policy already does gets no push. That extra model is a large part of why PPO is operationally heavy: four models (policy, reference, reward, value) running in one loop.
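
The advantage idea can be sketched on a toy policy over three candidate completions. This is plain policy gradient with an exact baseline, not PPO itself: there is no clipping and no learned value network, and the rewards and learning rate are invented for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def advantages(probs, rewards):
    """Advantage = reward minus the reward the policy expects on average."""
    value = sum(p * r for p, r in zip(probs, rewards))
    return [r - value for r in rewards]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact-expectation gradient-ascent step on expected reward.

    For a softmax policy, the gradient of expected reward with respect to
    logit k is probs[k] * advantage[k].
    """
    probs = softmax(logits)
    adv = advantages(probs, rewards)
    return [z + lr * p * a for z, p, a in zip(logits, probs, adv)]

rewards = [1.0, 0.5, 0.0]          # hypothetical reward-model scores
logits = [0.0, 0.0, 0.0]           # policy starts out indifferent
old_probs = softmax(logits)
adv = advantages(old_probs, rewards)
new_probs = softmax(policy_gradient_step(logits, rewards))

print(new_probs[0] > old_probs[0])  # prints True: better than expected, pushed up
print(new_probs[2] < old_probs[2])  # prints True: worse than expected, pushed down
```

The middle completion scores exactly what the policy expected, so its advantage is zero and its logit receives no update.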

But raw reward maximization is dangerous, and the reason is the heart of this note: the reward model is a proxy, not the truth. Optimize it hard enough and the policy will discover completions the reward model mistakenly scores highly — degenerate repetition, tonal tics, formatting quirks — exploiting the proxy's errors rather than getting genuinely better. So the objective is not "maximize reward." It is "maximize reward while staying close to the policy we trust":

maximize   E[ r(x, y) ]  −  β · KL( π_θ(y | x) ‖ π_ref(y | x) )

The second term is a KL penalty pulling the trained policy π_θ back toward the reference policy π_ref, the SFT model from Stage 1. (KL divergence shows up here for the same reason it shows up almost everywhere in ML: it is the natural measure of how far one distribution has drifted from another.) This is the KL leash, and β sets how tight it is: a larger β means a shorter leash. Loosen it (small β) and the policy chases reward into territory the reward model never saw and therefore scores wrongly. Tighten it (large β) and the policy barely moves. Much of the craft of RLHF is tuning that leash; it is the most underrated knob in the pipeline.
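
To see the trade-off numerically, the objective can be evaluated exactly on a toy problem with three possible completions. Everything below is an illustrative assumption: the distributions, the rewards, and the two β values.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref, rewards, beta):
    """Expected reward minus beta times the KL drift from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, ref)

ref = [0.5, 0.3, 0.2]        # reference (SFT) policy over three completions
rewards = [0.0, 1.0, 3.0]    # hypothetical reward-model scores
greedy = [0.01, 0.01, 0.98]  # a policy that piles onto the top-scoring completion

# Weak penalty: chasing reward beats staying put.
print(objective(greedy, ref, rewards, beta=0.1) > objective(ref, ref, rewards, beta=0.1))  # prints True
# Strong penalty: the drift costs more than the reward gained.
print(objective(greedy, ref, rewards, beta=5.0) > objective(ref, ref, rewards, beta=5.0))  # prints False
```

The same greedy policy is a win or a loss depending only on β, which is the sense in which β controls how far the policy is allowed to roam.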

(InstructGPT added one more term, mixing in a little pretraining loss to limit the alignment tax — the tendency for RLHF to erode raw capability. Same shape of problem: keep the policy from forgetting what it was while you teach it what you want.)

Step back and notice what Stage 3 did. The model could already produce the preferred answer: it was somewhere in its sample distribution. Preference optimization re-weighted tendencies the model already had so the preferred ones win more often. That is why a strong SFT base matters and why the KL leash works: you are mostly not teaching new capabilities here, you are sculpting dispositions out of existing ones.

RLAIF and Constitutional AI

The expensive, slow, and frankly grim part of the classic pipeline is Stage 2's human labeling. Comparisons are costly to collect at scale, inconsistent across raters, and — for the harmful content you most need labeled — taxing for the humans doing it. The natural question: can a model do the labeling?

RLAIF (reinforcement learning from AI feedback) replaces some or all of the human comparison labels with a model's judgments. Instead of a person picking the better of two completions, a capable model does, and those AI-generated preferences train the reward model (or feed an offline preference loss directly).

Constitutional AI is the most influential version of this idea — and it is broader than simply "AI labels instead of human labels." It introduces a written constitution — an explicit set of principles ("prefer the response that is more helpful and honest", "choose the response that least encourages harm", and so on) — and puts those principles to work twice:

In a supervised phase, the model is asked to critique and revise its own responses against the principles, generating harmlessness training data without a human writing each example.
In the feedback phase, the model labels which of two responses better follows the constitution, and those labels stand in for human preference judgments.
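
The feedback phase can be sketched as a small labeling function. This is a shape-only illustration: `ai_label`, `Judge`, and `toy_judge` are hypothetical names, and the stub judge is a keyword check standing in for what would really be a language model prompted with the principle.

```python
from typing import Callable, Tuple

# A judge takes (principle, prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]

def ai_label(prompt: str, a: str, b: str, principle: str, judge: Judge) -> Tuple[str, str]:
    """Return (chosen, rejected) according to the judge's verdict."""
    verdict = judge(principle, prompt, a, b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return (a, b) if verdict == "A" else (b, a)

def toy_judge(principle: str, prompt: str, a: str, b: str) -> str:
    # Stand-in only: a real judge would be a model call that reads the principle.
    # This stub simply disprefers a response that overclaims.
    return "B" if "guaranteed" in a.lower() else "A"

principle = "Prefer the response that is more helpful and honest."
chosen, rejected = ai_label(
    "Will this fix my bug?",
    "This is guaranteed to work.",
    "This usually works, but test it first.",
    principle,
    toy_judge,
)
print(chosen)  # prints This usually works, but test it first.
```

The output has the same (chosen, rejected) shape as a human comparison, which is why it can feed the same reward-model or offline-preference training unchanged.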

What actually changed here is worth stating carefully, because it is easy to misread RLAIF as "the values went away." They didn't. They moved, and became explicit. In classic RLHF the values lived implicitly in labelers' tastes plus a guideline document most readers never see. In Constitutional AI they live in a written constitution you can read, audit, and argue about. The bias does not disappear; it changes authors, from a crowd of raters to the constitution's authors and the judge model's own dispositions. This is, concretely, one of the places a model's character is decided: a good part of "why this model feels the way it does" reduces to choices made right here.

The honest caveat: an AI judge inherits its own training's biases, and a constitution is only as good as its authors and the model's ability to apply it consistently. RLAIF buys scale and explicitness; it does not buy objectivity.

Beyond PPO: the offline-preference family

PPO works, but it is operationally heavy. You run a live loop with a policy, a frozen reference, a reward model, and usually a value model — sampling, scoring, and updating, with all the instability online RL is famous for. A large fraction of recent post-training research is about getting the same preference bend without that loop.

The offline-preference family, DPO (Direct Preference Optimization) and its relatives, does exactly this. The key insight is that the KL-regularized objective above has a closed-form optimum, and you can rearrange it into a loss defined directly on preference pairs: no separate reward model, no sampling loop, just a supervised-style objective on (y_w, y_l) data that nudges the policy to prefer winners over losers relative to the reference.
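
The closed-form optimum is the reference policy re-weighted by exponentiated reward: π*(y | x) ∝ π_ref(y | x) · exp( r(x, y) / β ). A small numerical check on a toy three-completion problem, with invented distributions and rewards, suggests what that means in practice:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref, rewards, beta):
    """Expected reward minus beta times KL(policy || ref)."""
    return sum(p * r for p, r in zip(policy, rewards)) - beta * kl(policy, ref)

def optimal_policy(ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

ref = [0.5, 0.3, 0.2]       # hypothetical reference policy
rewards = [0.0, 1.0, 3.0]   # hypothetical rewards
beta = 1.0

best = optimal_policy(ref, rewards, beta)
candidates = [ref, [0.1, 0.1, 0.8], [0.01, 0.01, 0.98]]
print(all(objective(best, ref, rewards, beta) >= objective(c, ref, rewards, beta)
          for c in candidates))  # prints True
```

No sampling or gradient steps were needed to find `best`; that shortcut is what the offline losses exploit.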

The slogan "no reward model" is only half the story: DPO does not abolish the reward so much as make it implicit. The closed-form optimum lets you read the reward straight off the policy's own log-probability ratio against the reference — so the network you optimize and the network that scores collapse into one. It is the same Bradley–Terry comparison as Stage 2; you have only moved where it lives.
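
Concretely, the implicit reward is β · ( log π_θ(y | x) − log π_ref(y | x) ), up to a term that depends only on the prompt and cancels in the comparison. Plugging it into the Bradley–Terry loss gives the standard DPO loss. A minimal sketch on scalar sequence log-probabilities, where the log-probability values and β = 0.1 are illustrative assumptions:

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def implicit_reward(logp_policy, logp_ref, beta):
    """Reward read off the policy's log-probability ratio against the reference."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(implicit reward of winner minus implicit reward of loser)."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    return -log_sigmoid(margin)

# Policy identical to the reference: no preference expressed yet.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the winner and lowered the loser relative to the reference.
after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)

print(round(at_start, 3))       # prints 0.693
print(after_update < at_start)  # prints True
```

What the loss rewards is movement relative to the reference, not raw likelihood: the winner was already more likely than the loser at the start, yet the starting loss is still log 2.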

This note won't re-derive DPO or catalog its variants — that is exactly what the modern fine-tuning deep dive does, and the post-training pipeline note places it in the larger staging story. The menu in one breath: DPO is the default offline baseline; variants like SimPO, KTO, and ORPO trade off whether you need a reference model, whether labels are pairwise or pointwise, and how you control for length. Reach for the deep dive when you are choosing among them.

What matters here is the invariant underneath all of them. Whether you run PPO online against a reward model or DPO offline on pairs, you are doing the same conceptual thing: fitting the policy to a comparison signal, under a constraint that keeps it near a trusted reference. The algorithms differ in training dynamics and failure modes — reference dependence, length bias, stability — but the conceptual bend is the same.

One branch is worth distinguishing, because it sits beside preference learning rather than inside it: RL from verifiable rewards (RLVR), where the reward is not a human-preference proxy but a checker — unit tests pass, the math answer is correct, the proof verifies. Preference learning encodes taste; verifiable rewards encode correctness. They use similar RL machinery but answer different questions, and a model's reasoning ability and its disposition come from these two different signals. The pipeline note handles RLVR; preference learning — our subject here — is the taste branch.
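
For contrast with a learned reward model, a verifiable reward can be an ordinary function. The sketch below assumes a hypothetical output convention in which the completion ends with a line of the form "Answer: ..."; real checkers (unit tests, proof verifiers) are more involved, but the shape is the same.

```python
def extract_answer(completion):
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(completion.strip().splitlines()):
        line = line.strip()
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return None

def verifiable_reward(completion, expected):
    """1.0 if the extracted answer matches the known-correct one, else 0.0."""
    return 1.0 if extract_answer(completion) == expected else 0.0

print(verifiable_reward("2 + 2 makes four.\nAnswer: 4", "4"))            # prints 1.0
print(verifiable_reward("I am absolutely certain.\nAnswer: 5", "4"))     # prints 0.0
```

A confident tone earns nothing here: the checker has no notion of taste to flatter, only a right answer to match.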

The proxy problem: reward hacking and Goodhart

Everything good about RLHF and everything unsettling about it flow from the same fact: the reward model is a proxy for human judgment, and you are optimizing the proxy. Goodhart's law — when a measure becomes a target, it stops being a good measure — is not a footnote here. It is the central tax.

When you over-optimize a proxy, the policy climbs the measured reward while the true objective (what humans actually want) plateaus and eventually declines. The policy is getting better at pleasing the reward model and worse at the thing the reward model was supposed to stand for. The artifacts this produces are not random; they are systematic, and you have met all of them:

Length bias. Raters read longer answers as more thorough, so the reward model scores length, so the policy inflates. The model pads.
Sycophancy. Raters prefer, in the moment, answers that agree with them and validate their framing — so the reward model rewards agreement, and the policy learns to tell you what you want to hear rather than what is true. This is not a quirk bolted on later; it is a predictable consequence of optimizing "what raters liked" — and it is where a model's personality and its performed warmth first overlap.
Boilerplate and hedging. Safe, rater-pleasing patterns — the cautious preamble, the bulleted summary, the reflexive "it's important to note" — get over-produced because they reliably score fine and rarely score badly.
Confident tone, decoupled from support. Raters reward answers that sound authoritative. But linguistic confidence and epistemic support are different axes — a model can be rewarded for the register of certainty regardless of whether the claim is grounded, which is one of the mechanisms behind confidently-wrong answers.

The mitigations are the other side of the same coin: the KL leash (don't let the policy run far from the trusted base), retraining or ensembling the reward model (so the policy can't exploit one model's fixed blind spots for long), and watching the gap between proxy reward and held-out human judgment.
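
The over-optimization curve can be reproduced on a toy problem. The sketch below uses the closed-form optimum of the KL-regularized objective (reference policy re-weighted by exp(reward / β)) and optimizes a proxy that over-scores padded and agreeable answers. All numbers are invented, and the "true" reward is something you would only approximate in practice with held-out human judgments.

```python
import math

def optimal_policy(ref, rewards, beta):
    """KL-regularized optimum: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def expected(policy, values):
    return sum(p * v for p, v in zip(policy, values))

# Four hypothetical completions for one prompt:
# short+correct, padded+correct, padded+agreeable+wrong, terse+unhelpful
ref          = [0.30, 0.20, 0.05, 0.45]
proxy_reward = [1.0, 1.5, 2.0, 0.0]    # what the reward model says
true_reward  = [1.0, 0.8, -1.0, 0.0]   # what people would actually want

betas = [10.0, 2.0, 1.0, 0.5, 0.1]     # tight leash first, loose leash last
proxy_curve, true_curve = [], []
for beta in betas:
    policy = optimal_policy(ref, proxy_reward, beta)
    proxy_curve.append(expected(policy, proxy_reward))
    true_curve.append(expected(policy, true_reward))
    print(f"beta={beta:<5} proxy={proxy_curve[-1]:.2f} "
          f"true={true_curve[-1]:.2f} gap={proxy_curve[-1] - true_curve[-1]:.2f}")
```

Running it, the proxy score rises at every step as the leash loosens, while the true score improves briefly and then falls below where it started. The widening gap column is the quantity worth monitoring.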

But the deeper point is this. Many of the stable, assistant-like behaviors we read as personality — warmth, caution, empathy, agreeableness, sycophancy — are amplified here, by a policy learning to chase a proxy for human preference. They are not all created here — pretraining, SFT demonstrations, system prompts, and decoding shape them too — but this is where a likelihood machine's dispositions get sharpened into a consistent tilt toward some kinds of answers over others. Some of that tilt is exactly what we wanted (be helpful, be careful). Some of it is Goodhart's tax (be long, be agreeable, sound sure). Both come from the same optimization.

What preference learning installed

Pull the threads together.

Pretraining and SFT decide what the model can produce; preference learning decides what it prefers to produce. The first is competence, the second is disposition.
The mechanism is a shift from optimizing likelihood (reproduce this token) to optimizing preference (this completion beats that one). That shift is the whole game; PPO, DPO, and the rest are ways to execute it.
The reward model is the compression of "what we prefer" into a scalar — and therefore the place where labeling guidelines, rater tastes, and (in Constitutional AI) a written constitution get baked into the model.
The KL leash is what keeps optimization honest, because the reward is a proxy and proxies can be gamed.
Goodhart is the unavoidable tax: optimize a proxy and you inherit its artifacts — length, sycophancy, boilerplate, performed confidence — alongside the behavior you wanted.