Chapter 28 of 28

RLAIF: When the Labeler Is a Model

Created Jun 16, 2026 Updated Jun 16, 2026

RLAIF (reinforcement learning from AI feedback) is, roughly, RLHF where some or all of the preference labels are produced by a model rather than by human raters. The expensive, slow, and often grim part of the classic pipeline is Stage 2, collecting human comparisons at scale: the labels are costly, they are inconsistent across raters, and producing them is taxing for the people who have to label harmful content. RLAIF's question is simply: can a capable model do the labeling? Instead of a person picking the better of two completions, a model does, and those AI-generated preferences train the reward model (or feed an offline preference loss directly).
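To make the labeling step concrete, here is a minimal sketch of a model standing in for the human rater. Everything named here is an assumption for illustration: the `Judge` signature, the `Preference` record, and `toy_judge`, which is a stand-in for a real model call. Asking the judge twice with the order swapped is one optional safeguard against position bias, not something RLAIF itself prescribes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Preference:
    prompt: str
    chosen: str
    rejected: str


# Hypothetical judge interface: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def label_pair(judge: Judge, prompt: str, first: str, second: str) -> Optional[Preference]:
    """Ask the judge in both orders; keep the label only if the verdicts agree."""
    first_wins_forward = judge(prompt, first, second) == "A"
    first_wins_swapped = judge(prompt, second, first) == "B"
    if first_wins_forward != first_wins_swapped:
        return None  # verdict flipped with position, so discard the pair
    if first_wins_forward:
        return Preference(prompt, chosen=first, rejected=second)
    return Preference(prompt, chosen=second, rejected=first)


def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for a model call: simply prefers the longer response."""
    return "A" if len(response_a) >= len(response_b) else "B"


pair = label_pair(toy_judge, "Explain RLAIF.", "It is RLHF.", "It is RLHF with model-made labels.")
```

The resulting `Preference` records have the same shape a human rater would produce, which is why the rest of the pipeline does not need to change.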

Constitutional AI is the best-known version, and it's broader than "AI labels instead of humans." It introduces a written constitution — an explicit set of principles ("prefer the response that is more helpful and honest", "choose the response that least encourages harm") — and puts those principles to work twice:

supervised phase:  the model critiques and revises its OWN responses against the
                   principles → harmlessness data, no human writing each example
feedback phase:    the model labels which of two responses better follows the
                   constitution → those labels stand in for human preferences
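The two phases can be sketched as two small functions. This is an illustration under stated assumptions, not the published Constitutional AI implementation: `generate` is a hypothetical text-in, text-out model call, and the prompt wording is invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a text prompt to a completion.
Generate = Callable[[str], str]


def critique_and_revise(generate: Generate, prompt: str,
                        principles: List[str]) -> Tuple[str, str]:
    """Supervised phase: draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Point out where the response falls short of the principle."
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it addresses the critique."
        )
    return prompt, response  # one (prompt, revised response) fine-tuning example


def constitutional_label(generate: Generate, prompt: str, response_a: str,
                         response_b: str, principle: str) -> int:
    """Feedback phase: return 0 if the model picks A, otherwise 1."""
    answer = generate(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    # Simplification: any answer that does not start with "A" is counted as B.
    return 0 if answer.strip().upper().startswith("A") else 1
```

Note that the same principles appear in both functions: once as the standard the model revises toward, and once as the standard it judges by.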

The thing to get right: the values didn't go away — they moved, and became explicit. In classic RLHF the values live implicitly in labelers' tastes plus a guideline document most people never see. In Constitutional AI they live in a written constitution you can read, audit, and argue about. Authorship shifts — from a crowd of raters to the constitution's authors and the judge model's own dispositions — rather than disappearing.

The honest caveat. An AI judge inherits its own training's biases, and a constitution is only as good as its authors and the model's ability to apply it consistently. RLAIF buys scale and explicitness; it does not buy objectivity. "We removed the humans" is not "we removed the opinions."

Conceptually it's the same preference-learning machinery as RLHF: comparisons define the direction of optimization; the main change is who supplies the comparison. The fuller treatment, alongside SFT, the reward model, PPO, and the offline-preference family, is in How LLMs Learn Human Preferences.
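As a small illustration of "who supplies the comparison" being the only change, here is a sketch of a pairwise reward-model loss. It assumes the commonly used Bradley-Terry form, -log sigmoid(r_chosen - r_rejected); the chapter does not specify a loss, so treat the choice and the function name `pairwise_loss` as assumptions.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Two algebraically equal branches, chosen so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss only sees which response was labeled "chosen".
# It has no input that says whether a person or a model made that call.
agrees_with_label = pairwise_loss(reward_chosen=2.0, reward_rejected=0.0)
contradicts_label = pairwise_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The loss is small when the reward model already ranks the pair the way the label does and large when it ranks them the other way, so the labels alone set the direction of the update.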