Chapter 10 of 10

Fine-Tuning LLMs: Modern Post-Training Deep Dive

Created May 30, 2026

This note is intentionally more reference-like than the pipeline note. The pipeline note answers the architectural question: which objective signal does the model need next? This one answers the follow-up: once you know the stage, which modern variants and engineering knobs matter?

If you have not read the shorter pipeline explanation, start there: Post-Training Is a Pipeline, Not a Step. If you are still deciding whether to train at all, start one level earlier: When the Weight Delta Is Worth It.

The organizing principle here is practical:

preference variants change what comparison signal you optimize;
PEFT variants change where the delta is allowed to live;
memory-efficient full FT changes whether full-rank updates are affordable;
merging and distillation change how capabilities move between checkpoints;
tooling choices change which bottleneck you are actually solving.

Preference optimization variants

Preference optimization starts from a simple desire: the model already produces multiple plausible answers, but you want one kind of answer to become more likely than another. The training data is not a target completion alone; it is a judgment about completions.

DPO

Direct Preference Optimization is the default baseline because it keeps the RLHF idea but removes the explicit PPO loop. It optimizes preference pairs directly against a frozen reference policy:

DPO loss = -log σ( β · [ (log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x)) ] )

The trained policy should give the chosen completion y_w a better log-probability ratio than the rejected completion y_l, relative to the reference. The useful mental model: DPO does not say "copy this answer"; it says "move the policy so this answer wins this comparison."

The two main knobs are preference quality and β. β sets how tightly the policy is tied to the reference: a higher β keeps it closer, and a lower β lets it move further away. Moving further can improve preference win rate, but it also increases the risk of over-optimizing label artifacts: length, confidence, politeness, boilerplate.
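The loss above can be written out directly. The following is a minimal sketch in plain Python, assuming the four inputs are sequence log-probabilities already summed over completion tokens; the function names and the default beta=0.1 are illustrative choices, not taken from any particular library.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    pol_* are log pi_theta(y|x) and ref_* are log pi_ref(y|x),
    each summed over the completion's tokens. w = chosen, l = rejected.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)

# Policy identical to the reference: the margin is 0 and the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen completion: the loss drops.
after_shift = dpo_loss(-8.0, -13.0, -10.0, -12.0)
```

Only the difference between the two log-ratios enters the loss, which is the sense in which DPO trains on a comparison rather than on a target completion.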

SimPO

SimPO removes the reference policy and scores average per-token log-probability with a target margin. Its practical appeal is length normalization. DPO's implicit reward is a sum over tokens, so length can become a hidden axis of optimization. SimPO makes that failure mode more explicit and often easier to control.

Reach for it when DPO is learning answer length as much as answer quality, or when carrying a reference model is operationally awkward.
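As a rough illustration of the length normalization, here is a sketch of a SimPO-style loss for a single pair. It assumes summed log-probabilities and token counts as inputs; the default values of beta and gamma are placeholders for this example, not recommended settings.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(logp_w: float, len_w: int, logp_l: float, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """Reference-free loss on average per-token log-probability.

    logp_* are summed log-probs under the policy, len_* are token counts,
    gamma is the target margin between chosen (w) and rejected (l).
    """
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)

# Both completions average -2.0 per token, but one is twice as long.
# Their summed log-probs differ (-20 vs -40); their SimPO rewards do not.
same_per_token = simpo_loss(-20.0, 10, -40.0, 20)
```

With equal per-token quality the two rewards cancel and only the margin term remains, so length alone does not separate the pair.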

KTO

KTO is useful when your labels are pointwise rather than paired. Instead of "A is better than B", the data says "this response is good" or "this response is bad." That matches a lot of production feedback systems: thumbs up, thumbs down, accepted, rejected, escalated.

The trade-off is weaker information. A pair tells you a local preference boundary under the same prompt. A pointwise label tells you less about what the model should prefer instead. KTO is convenient, but it does not make noisy product feedback magically precise.
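The difference in data shape can be sketched as follows. All names here (PointwiseExample, from_feedback, unpair, and the signal strings) are hypothetical, and treating "escalated" as an undesirable outcome is an assumption of this example rather than something the feedback system guarantees.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PointwiseExample:
    prompt: str
    completion: str
    desirable: bool

POSITIVE = {"thumbs_up", "accepted"}
NEGATIVE = {"thumbs_down", "rejected", "escalated"}  # assumption: escalation is bad

def from_feedback(events):
    """Turn (prompt, completion, signal) tuples into pointwise examples."""
    examples = []
    for prompt, completion, signal in events:
        if signal in POSITIVE:
            examples.append(PointwiseExample(prompt, completion, True))
        elif signal in NEGATIVE:
            examples.append(PointwiseExample(prompt, completion, False))
        # Unknown signals are skipped rather than guessed.
    return examples

def unpair(prompt, chosen, rejected):
    """A preference pair can always be split into two pointwise examples."""
    return [PointwiseExample(prompt, chosen, True),
            PointwiseExample(prompt, rejected, False)]

events = [("q1", "a1", "thumbs_up"), ("q2", "a2", "escalated"), ("q3", "a3", "viewed")]
pointwise = from_feedback(events)
```

The conversion only runs one way: a pair splits cleanly into two labels, but two labels from different prompts cannot be recombined into a pair.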

IPO, ORPO, CPO

These are worth knowing, but they should not dominate the main pipeline story.

IPO addresses DPO overfitting when preferences are nearly deterministic. It replaces the log-sigmoid loss, which keeps rewarding a larger margin without limit, with a squared-error objective that pulls the log-ratio margin toward a fixed finite target.
ORPO combines SFT and preference optimization in one loss and removes the reference model. It is attractive when you want a single-stage recipe, but gives you less separation between "learn the task" and "prefer this style of answer."
CPO derives a contrastive objective by approximating DPO without a reference model, and adds a likelihood term on the chosen completion. It appears often in translation and settings where reference anchoring is awkward.
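To make the DPO versus IPO contrast concrete, the sketch below evaluates both losses as a function of the same log-ratio margin h = (log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x)). The function names and the values beta=0.1 and tau=0.1 are illustrative assumptions.

```python
import math

def dpo_loss_from_margin(h: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * h)), written in a numerically stable form."""
    z = beta * h
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def ipo_loss_from_margin(h: float, tau: float = 0.1) -> float:
    """Squared distance between the margin and the target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2

margins = [0.0, 5.0, 50.0]
dpo = [dpo_loss_from_margin(h) for h in margins]
ipo = [ipo_loss_from_margin(h) for h in margins]
# dpo keeps shrinking as the margin grows.
# ipo is smallest at h = 5 (the target for tau = 0.1) and grows again beyond it.
```

Under these assumptions DPO always prefers a larger margin, while IPO penalizes overshooting its target, which is the regularizing effect on near-deterministic preferences.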

The meta-lesson is more important than any single acronym: preference optimization is a measurement problem. If your chosen answers are longer, safer-sounding, or more confident for accidental reasons, those properties become trainable signals.

LoRA and PEFT variants

Every objective above can be trained as a full fine-tune or through adapters. PEFT answers a different question from DPO/SFT/RLVR: not which signal directs the gradient, but where the update is allowed to live.

LoRA

LoRA freezes a base weight matrix W and trains a low-rank update:

W_eff = W_frozen + ΔW
ΔW = B A
rank(ΔW) ≤ r

A full fine-tune can move W anywhere in its parameter space. LoRA restricts the delta to rank r and only in the modules where adapters are attached. This is why LoRA is both cheap and limited. It works extremely well when the task shift is compact: style, format, routing preference, domain convention. It can underfit when the desired behavior needs many independent degrees of freedom across layers.
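A small NumPy sketch of the update above, for a single linear layer. The shapes, the zero initialization of B, and the α/r scaling factor follow the common LoRA convention and are assumptions of this example; the formula in the text leaves the scaling out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 48, 4, 8.0

W_frozen = rng.normal(size=(d_out, d_in))     # never updated
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable
B = np.zeros((d_out, r))                      # trainable; zero so the delta starts at 0

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T, without materializing the delta."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y_init = lora_forward(x, W_frozen, A, B, alpha, r)   # equals the frozen layer's output

# Stand-in for a trained B: whatever values it takes, rank(delta) <= r.
B_trained = rng.normal(size=(d_out, r))
delta_W = (alpha / r) * (B_trained @ A)
delta_rank = np.linalg.matrix_rank(delta_W)

full_params = d_out * d_in           # 3072 values in W
lora_params = r * (d_in + d_out)     # 448 values in A and B
```

The parameter count grows linearly in r, so raising the rank buys capacity cheaply, but only inside the modules that carry an adapter.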

QLoRA

QLoRA keeps the LoRA update trainable but stores the frozen base in 4-bit NF4, with double quantization and paged optimizers to manage memory spikes. The important detail is that the base is quantized for storage; the adapter is still trained to steer the model.

QLoRA is the default practical route for large open models when you need fast iteration on limited hardware. The boundary depends on sequence length, batch size, checkpointing, CPU offload, and implementation. Treat "70B on one card" claims as setup-dependent, not as a law of nature.
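Back-of-envelope arithmetic shows why such claims are setup-dependent. The helper below is a hypothetical function that counts weight storage only; it deliberately ignores quantization constants, adapter weights, adapter gradients and optimizer state, activations, and framework overhead.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the base weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 70e9
bf16_gb = weight_gb(n_params, 16)   # 140.0
nf4_gb = weight_gb(n_params, 4)     # 35.0
```

The 4-bit figure is a floor, not a budget: everything left out of the estimate has to fit in whatever memory remains, and that remainder is what sequence length, batch size, and checkpointing decide.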

DoRA

DoRA decomposes weights into magnitude and direction, then applies the low-rank update to the direction while learning magnitude separately. The practical promise is closing some of the LoRA-vs-full-FT gap, especially at low ranks.

A useful mental model: LoRA says "change a low-rank direction"; DoRA says "learn how long each weight vector is separately from where it points." In many training stacks it is now a flag, so it is a reasonable first experiment when plain LoRA underfits.
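The decomposition can be sketched in NumPy. This example assumes a weight of shape (d, k) with one magnitude per column and a column-wise norm; real implementations may lay out the matrix differently, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2

W0 = rng.normal(size=(d, k))                     # frozen base weight
A = rng.normal(scale=0.1, size=(r, k))           # trainable low-rank factor
B = np.zeros((d, r))                             # trainable, zero at init
m = np.linalg.norm(W0, axis=0, keepdims=True)    # trainable magnitudes, shape (1, k)

def dora_weight(W0, A, B, m):
    """Low-rank update moves the direction; m alone sets each column's length."""
    V = W0 + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

W_init = dora_weight(W0, A, B, m)                # reproduces W0 exactly at init

# Stand-in for a trained B: directions change, column norms stay equal to m.
B_moved = rng.normal(size=(d, r))
W_moved = dora_weight(W0, A, B_moved, m)
```

Because the norm is divided out, the low-rank factors can only rotate each column, and any change in scale has to come from the magnitude vector.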

LoRA+, rsLoRA, VeRA, AdaLoRA

These are refinements around capacity and optimization:

LoRA+ uses different learning rates for the two low-rank matrices. It is often a cheap win.
rsLoRA changes scaling from α/r to α/√r, making higher ranks behave more stably.
VeRA shares random matrices and trains small scaling vectors, useful when you need many tiny adapters.
AdaLoRA reallocates rank across layers during training, spending capacity where the update appears most useful.

Most projects should not start by picking among all of these. Start with LoRA or QLoRA, inspect whether rank/module choice is the bottleneck, then reach for variants when you see a specific failure.
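The rsLoRA change is small enough to show as arithmetic. The sketch below compares the two scaling factors at a fixed α; the value alpha=16 and the list of ranks are arbitrary choices for illustration.

```python
import math

def lora_scale(alpha: float, r: int) -> float:
    return alpha / r

def rslora_scale(alpha: float, r: int) -> float:
    return alpha / math.sqrt(r)

alpha = 16.0
table = [(r, lora_scale(alpha, r), rslora_scale(alpha, r)) for r in (4, 16, 64, 256)]
# (rank, alpha / r, alpha / sqrt(r))
# (4,   4.0,    8.0)
# (16,  1.0,    4.0)
# (64,  0.25,   2.0)
# (256, 0.0625, 1.0)
```

Going from rank 4 to rank 256, the standard factor shrinks by 64x while the rsLoRA factor shrinks by 8x, so the adapter's contribution is damped far less at high rank.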

Memory-efficient full fine-tuning

LoRA's original argument was memory: full fine-tuning needs gradients and optimizer states for every parameter. That argument is weaker than it used to be.

GaLore projects gradients into a low-rank subspace before they hit the optimizer, periodically re-estimating the subspace. Every parameter stays trainable and, because the subspace keeps changing, the accumulated update is not confined to a single low-rank subspace; only the optimizer state lives in the projected space. This can approach full fine-tuning quality with memory closer to adapter methods.
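A toy version of the idea, for one weight matrix and a quadratic loss. This is a sketch under stated assumptions, not the published algorithm: the class name and hyperparameters are made up, the subspace comes from a plain SVD of the current gradient, and the Adam moments are reset at each refresh to keep the example simple.

```python
import numpy as np

class GaLoreAdamSketch:
    """Adam whose moment estimates live in a rank-r projection of the gradient."""

    def __init__(self, rank, lr=0.01, update_gap=10, betas=(0.9, 0.999), eps=1e-8):
        self.rank, self.lr, self.update_gap = rank, lr, update_gap
        self.b1, self.b2 = betas
        self.eps = eps
        self.t = 0

    def step(self, W, G):
        if self.t % self.update_gap == 0:
            # Re-estimate the subspace: top-r left singular vectors of the gradient.
            U, _, _ = np.linalg.svd(G, full_matrices=False)
            self.P = U[:, : self.rank]                    # (m, r)
            # Simplification of this sketch: restart the moments with the new basis.
            self.m = np.zeros((self.rank, G.shape[1]))    # (r, n), not (m, n)
            self.v = np.zeros((self.rank, G.shape[1]))
            self.k = 0
        self.t += 1
        self.k += 1
        R = self.P.T @ G                                  # projected gradient, (r, n)
        self.m = self.b1 * self.m + (1 - self.b1) * R
        self.v = self.b2 * self.v + (1 - self.b2) * R ** 2
        m_hat = self.m / (1 - self.b1 ** self.k)
        v_hat = self.v / (1 - self.b2 ** self.k)
        step_low_rank = m_hat / (np.sqrt(v_hat) + self.eps)
        return W - self.lr * (self.P @ step_low_rank)     # project back to (m, n)

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))
W = np.zeros((8, 16))
opt = GaLoreAdamSketch(rank=2)

loss_before = 0.5 * np.sum((W - target) ** 2)
for _ in range(50):
    W = opt.step(W, W - target)       # gradient of the quadratic loss is W - target
loss_after = 0.5 * np.sum((W - target) ** 2)
```

The weight being updated is the full 8x16 matrix, while the optimizer holds two 2x16 moment buffers; that gap is the memory saving the method is after.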

Adafactor and Adam-mini reduce optimizer state by using factored or blockwise estimates instead of full per-parameter Adam moments. 8-bit Adam stores optimizer states in 8 bits rather than 32. These techniques compose with checkpointing, offload, LoRA, and sometimes GaLore.
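The savings are easy to estimate. The helpers below are hypothetical and count optimizer state only; the 7e9 parameter count and the 4096x4096 matrix are illustrative sizes, and the small per-block constants that 8-bit optimizers also store are ignored.

```python
def adam_state_gb(n_params: float, bytes_per_value: int = 4) -> float:
    """Adam keeps two moment estimates per parameter."""
    return 2 * n_params * bytes_per_value / 1e9

def factored_second_moment_values(rows: int, cols: int) -> int:
    """A factored second moment keeps one row vector and one column vector."""
    return rows + cols

n_params = 7e9
adam_fp32_gb = adam_state_gb(n_params, 4)    # 56.0
adam_8bit_gb = adam_state_gb(n_params, 1)    # 14.0

full_values = 4096 * 4096                                   # 16777216
factored_values = factored_second_moment_values(4096, 4096)  # 8192
```

These numbers cover optimizer state alone; weights, gradients, and activations still have to fit, which is why the techniques are usually stacked rather than used singly.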

The practical decision boundary:

use LoRA/QLoRA for fast iteration, many adapters, reversibility, and low serving friction;
use memory-efficient full FT when adapter capacity is visibly limiting the target behavior and you are optimizing one model rather than many adapters;
use ordinary full FT when quality matters more than reversibility and the hardware budget is not the binding constraint.

The dangerous version is pretending hardware is the only axis. It is not. Full FT gives the optimizer more freedom to improve the target and more freedom to damage everything else.

Merging and distillation

Some post-training steps move capability without another ordinary training run.

Model soups and task arithmetic

Model soups average several fine-tuned checkpoints, often from the same task with different hyperparameters. The soup can outperform any single run because it lands in a flatter or more robust region of the loss basin.

Task arithmetic computes a task vector:

τ_task = θ_finetuned - θ_base

Then task vectors can be added, scaled, or subtracted. This works best when the fine-tunes share a base and remain in compatible regions of weight space.
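Both operations reduce to elementwise arithmetic over matching parameters. The sketch below treats a checkpoint as a dict from parameter name to a flat list of floats, which is an assumption made to keep the example dependency-free; the function names are illustrative.

```python
def task_vector(finetuned, base):
    """tau = theta_finetuned - theta_base, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """theta = theta_base + scale * sum(tau_i); a negative scale subtracts a task."""
    merged = {k: list(v) for k, v in base.items()}
    for tau in vectors:
        for k, delta in tau.items():
            merged[k] = [w + scale * d for w, d in zip(merged[k], delta)]
    return merged

def uniform_soup(checkpoints):
    """Average several checkpoints that share an architecture."""
    n = len(checkpoints)
    return {k: [sum(vals) / n for vals in zip(*(c[k] for c in checkpoints))]
            for k in checkpoints[0]}

base = {"w": [1.0, 2.0]}
ft_a = {"w": [1.5, 2.0]}     # fine-tune A moved the first coordinate
ft_b = {"w": [1.0, 2.5]}     # fine-tune B moved the second coordinate

tau_a, tau_b = task_vector(ft_a, base), task_vector(ft_b, base)
both = apply_task_vectors(base, [tau_a, tau_b])         # {"w": [1.5, 2.5]}
undone = apply_task_vectors(ft_a, [tau_a], scale=-1.0)  # back to base
soup = uniform_soup([ft_a, ft_b])                       # {"w": [1.25, 2.25]}
```

In this toy case the two task vectors touch different coordinates, so adding them is lossless; interference starts when fine-tunes move the same weights in different directions.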

TIES, DARE, SLERP

TIES-Merging resolves sign conflicts between task vectors before averaging.
DARE randomly drops many task-vector entries and rescales the survivors before merging; counterintuitively, this can reduce interference.
SLERP interpolates checkpoints along a spherical path instead of a straight linear average.

These are cheap enough to try when you already have compatible checkpoints or adapters. They are not magic glue for arbitrary unrelated models.
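The core of each method fits in a few lines when task vectors are flat lists of floats. This is a simplified sketch: the TIES function omits the magnitude-trimming step, DARE is shown as drop-then-rescale by 1/(1 - p), and all function names are illustrative.

```python
import math
import random

def ties_merge(task_vecs):
    """Elect a sign per coordinate, then average only the entries that agree."""
    merged = []
    for entries in zip(*task_vecs):
        elected = 1.0 if sum(entries) >= 0 else -1.0
        agree = [e for e in entries if e * elected > 0]
        merged.append(sum(agree) / len(agree) if agree else 0.0)
    return merged

def dare(task_vec, drop_prob, rng):
    """Zero each entry with probability drop_prob; rescale the rest."""
    keep = 1.0 - drop_prob
    return [d / keep if rng.random() >= drop_prob else 0.0 for d in task_vec]

def slerp(a, b, t):
    """Interpolate along the arc between a and b instead of the chord."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    if omega < 1e-6:  # nearly parallel: linear interpolation is fine
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

# Second coordinate conflicts in sign: a plain mean gives -0.5, TIES keeps -2.0.
ties_out = ties_merge([[1.0, -2.0], [3.0, 1.0]])            # [2.0, -2.0]
dare_out = dare([1.0] * 10000, drop_prob=0.9, rng=random.Random(0))
slerp_out = slerp([1.0, 0.0], [0.0, 1.0], 0.5)               # keeps unit norm
```

The DARE output is about 90% zeros yet keeps roughly the same mean as its input, and the SLERP midpoint keeps unit norm where a linear average of the same two vectors would shrink.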

Distillation

Distillation is the most important capability-transfer move in modern fine-tuning. A stronger teacher generates traces; a smaller or cheaper student learns those traces with SFT. For reasoning, this often means long CoT trajectories from a stronger RL-trained model.

The caveat is the same as for all SFT: the student can learn the shape of the trace without inheriting the capability. Distilled reasoning needs held-out reasoning evals, not just pretty explanations.

Tooling and serving, as of 2026

Tool names age faster than training concepts, so read this section as a 2026 snapshot of the ecosystem. The stable part is the role each tool plays; the exact winner in each role will keep changing. The right stack follows the bottleneck.

TRL is the reference-style Hugging Face implementation for SFT, DPO, PPO, GRPO, and related algorithms.
axolotl and llama-factory are config-driven recipe systems for common open-source fine-tuning workflows.
Unsloth focuses on single-node LoRA/QLoRA speed and memory through custom kernels.
OpenRLHF and veRL become relevant when rollout throughput, distributed generation, and weight synchronization dominate.
vLLM and SGLang matter when generation is the bottleneck, especially for RL where rollouts can cost more than backprop.

For deployment, multi-LoRA serving lets one base model dispatch requests through many adapters. This is the architecture you want for per-customer, per-domain, or per-task adapters. Without it, LoRA's tiny adapter files are nice at training time but awkward at serving time.

The surprising operational fact in RL runs: generation often dominates cost. For each policy update, the model may generate many long completions before the backward pass begins. At scale, "training" becomes an inference-systems problem.

The toolbox, in order

Post-training is a short sequence of decisions made in order. Each block below is one decision; the rows are the options you choose between.

First — what signal you train on.

Method	What it does	When it's the right call
Continued pretraining	Keeps the next-token loss but swaps in a new corpus	Re-seating the base on a domain's substrate — needs billions of tokens plus general-data replay
SFT	Cross-entropy on prompt/completion pairs	The default for teaching behavior and format; coverage and quality beat raw example count
RFT	Sample N completions, keep the ones a verifier or reward model accepts, then SFT on those	A cheap slice of RL's gains whenever a verifier exists
Reward modeling	Bradley-Terry loss on preference pairs → a learned scorer	When you need a judge: an ORM scores whole outcomes, a PRM scores individual steps
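The RFT row can be sketched as a filter in front of ordinary SFT. Everything here is a toy stand-in: sample_fn and verifier are hypothetical callables, the "model" is a noisy adder, and the dict keys are illustrative.

```python
import random

def rejection_sample(prompts, sample_fn, verifier, n=8):
    """Sample n completions per prompt and keep the distinct ones that verify."""
    dataset = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n)]
        accepted = [c for c in candidates if verifier(prompt, c)]
        for completion in dict.fromkeys(accepted):   # de-duplicate, keep order
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

rng = random.Random(0)

def toy_sampler(prompt):
    """Stand-in for model sampling: the right sum, sometimes off by one."""
    a, b = prompt
    return a + b + rng.choice([-1, 0, 0, 1])

def toy_verifier(prompt, completion):
    a, b = prompt
    return completion == a + b

prompts = [(2, 3), (10, 7), (4, 4)]
sft_data = rejection_sample(prompts, toy_sampler, toy_verifier)
```

The output is an SFT dataset whose completions came from the model itself, so its quality is bounded by how strict the verifier is.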

Then — how preferences and rewards reshape the policy.

Method	What it does	Trade-off
PPO	Classic RLHF: reward model + value head + KL leash to the reference	Powerful, but heavy and finicky to stabilize
DPO	Closed-form preference loss against a frozen reference; no reward model, no rollouts	Default baseline; watch length bias and overfitting to near-deterministic preferences
DPO descendants (IPO, KTO, ORPO, SimPO, CPO)	Each relaxes one DPO assumption	Paired vs unpaired data, reference-model cost, length normalization, stability
RLVR + GRPO	Verifier reward, group-relative advantages, no critic	Extracts math/code/reasoning when some rollouts pass and some fail
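The group-relative advantage in the last row can be sketched in a few lines. This assumes one group of rollouts per prompt, binary verifier rewards, and normalization by the population standard deviation; implementations differ on that last detail, and the function name is illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Some rollouts pass, some fail: passes get positive advantage, failures negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout passes: all advantages are zero, so the group carries no signal.
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

This is why the method needs prompts at the edge of the model's ability: groups that all pass or all fail contribute nothing to the update.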

Then — how much of the model you actually move.

Method	What it does	When
LoRA	Low-rank update `ΔW = BA` on a frozen base	Style, format, domain convention; underfits broad capability shifts
QLoRA	LoRA with the frozen base stored in 4-bit NF4	Same reach, on much smaller hardware
Memory-efficient full FT	Full-rank update, shrunk optimizer state (GaLore, Adam-mini, Adafactor, 8-bit Adam)	When adapters visibly underfit but full Adam state won't fit
Merging	Combine already-learned directions (soups, task arithmetic, TIES, DARE, SLERP)	Adding capability with no training run at all

Underneath all of it, three rules decide whether any of this lands:

Pipelines chain these stages. The public reference is DeepSeek-R1: cold-start SFT → RLVR → rejection-sample back into SFT → optional second RL → distillation. Distillation alone is often the cheapest route to reasoning behavior.
Every stage risks forgetting. A narrow fine-tune moves representations the rest of the model relied on; the defenses are general-data replay, low-rank or KL-constrained updates, and regression evals honest enough to catch the damage.
It's a ladder, not a default. Prompt → few-shot → RAG → tools → prompt-tuning → LoRA SFT → full FT → DPO → RLVR — climb only as far as the problem forces you, starting from the cheapest reversible step that could close the gap.

Conclusion

The main pipeline article gives the strategic choice: substrate, examples, comparisons, verifiers, or distillation. This note gives the knobs once that choice is made.

Do not start with the knob. Start with the failure. If the failure is label quality, no DPO variant saves you. If the failure is adapter capacity, no prompt rewrite fixes it. If the failure is verifier weakness, GRPO will optimize the weakness. The modern fine-tuning toolkit is powerful precisely because it gives you many ways to move the model; the hard part is choosing the movement you actually want.