Chapter 13 of 25

Temperature, top_k, top_p: How LLMs Pick the Next Token

Created May 28, 2026 Updated Jun 7, 2026

At every decoding step, an LLM produces a probability distribution over the entire vocabulary, roughly 30,000 to 250,000 possible next tokens depending on the model. temperature, top_k, and top_p decide how the one token that actually gets emitted is drawn from that distribution. They look similar in the API. They do different things.

Temperature reshapes the distribution. The transformer outputs logits (unnormalized scores). Softmax with a temperature divisor turns them into probabilities:

p_i = exp(logit_i / T) / Σ exp(logit_j / T)

T controls sharpness. Small T inflates the gap between logits — the top token approaches probability 1. Large T flattens everything toward uniform. Three regimes:

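T = 0: argmax, deterministic. The formula divides by T, so implementations special-case this as the limit T → 0 rather than evaluating it. The standard for code, JSON, regex, anything you want to parse downstream.
T = 1: softmax unmodified. The distribution the model actually learned.
T → ∞: uniform noise.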

Typical creative range: 0.7–1.0. Above that, coherence falls off.
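To make the regimes concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The function name and the three example logits are hypothetical, chosen only for illustration, and T = 0 is handled as a special case because the formula cannot be evaluated there.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return probabilities for `logits` after dividing by `temperature`."""
    if temperature <= 0:
        # T = 0 would divide by zero, so treat it as the limit:
        # all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0, 0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 1 the top token gets roughly 0.665 of the mass. Lowering T pushes that share toward 1, and raising T pulls all three probabilities toward 1/3.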

top_k truncates by count. Keep the k highest-probability tokens, drop the rest, renormalize. top_k = 1 is equivalent to argmax. top_k = 50 is a typical default. The flaw is rigidity: a sharp distribution may have only 3 plausible tokens, so k = 50 lets 47 implausible ones through, while a flat distribution may have 200 plausible tokens, so k = 50 cuts 150 of them. Any fixed k is wrong for at least one of those cases.
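The keep, drop, renormalize steps can be sketched in a few lines. This is an illustrative version that operates on a small list of probabilities. The helper name is hypothetical, and real inference stacks usually apply the same idea to logits on the GPU.

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices sorted by probability, highest first (ties keep the lower index).
    keep = set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print(top_k_filter(probs, 2))   # only the first two entries stay non-zero
print(top_k_filter(probs, 1))   # all mass on the argmax
```

Note that k is applied blindly: the filter has no idea whether the tokens it keeps are plausible, only that there are k of them.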

top_p (nucleus sampling) truncates by cumulative probability. Sort tokens by probability descending. Keep the smallest set whose probabilities sum to at least p. With top_p = 0.9, the model considers however many tokens are needed to cover 90% of the mass — 3 in a sharp distribution, 50 in a flat one. top_p adapts to the shape of the distribution; top_k does not. This is why top_p usually replaces top_k in modern stacks.
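A minimal sketch of the nucleus cut, assuming the input is already a normalized list of probabilities. The helper name and both example distributions are hypothetical. It shows the adaptive behavior described above: the same p keeps a different number of tokens depending on the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # stop as soon as the nucleus covers p
            break
    return [q / cumulative if i in keep else 0.0 for i, q in enumerate(probs)]

sharp = [0.92, 0.04, 0.03, 0.01]  # one dominant token
flat = [0.25, 0.25, 0.25, 0.25]   # no clear favorite

kept_sharp = sum(q > 0 for q in top_p_filter(sharp, 0.9))
kept_flat = sum(q > 0 for q in top_p_filter(flat, 0.9))
print(kept_sharp, kept_flat)
```

With p = 0.9, the sharp distribution is cut down to its single dominant token, while the flat one keeps all four candidates.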

The three knobs compose. A typical sampling pipeline: apply temperature → apply top_k (cap the tail) → apply top_p (nucleus cut) → renormalize → sample. Setting them all aggressively at once doesn't always make output better: you can over-filter and lose useful diversity, or under-filter and admit noise. Most production setups tune temperature plus one truncation knob (either top_p or top_k) and leave the other at its default.
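The pipeline order can be sketched end to end as follows. This is one plausible arrangement, not a description of any particular library. The function name and defaults are hypothetical, and it assumes the nucleus cut is measured on the distribution renormalized after top_k, a detail on which implementations may differ.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of one sampled token."""
    if temperature <= 0:
        # Greedy decoding: skip sampling entirely.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Candidate indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])

    # 2. top_k: cap the number of candidates.
    if top_k is not None:
        order = order[:top_k]

    # 3. top_p: nucleus cut on the mass that survived top_k (assumption).
    if top_p is not None:
        mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept

    # 4. Sample; random.choices normalizes the weights itself.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs draw the same token
logits = [3.0, 2.5, 1.0, -1.0, -4.0]  # hypothetical scores
print(sample_next_token(logits, temperature=0.8, top_k=4, top_p=0.9, rng=rng))
```

With these example values, top_k = 4 removes only the last token, and top_p = 0.9 then narrows the candidates to the first two, so the nucleus cut does most of the filtering here.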

Two caveats worth knowing:

T = 0 is not bit-exact deterministic. Floating-point addition is not associative, so a different order of operations on GPUs can shift logits by a rounding error and flip the argmax when the top two tokens are nearly tied. For reproducible outputs you also need a fixed seed and, for multi-replica serving, a consistent backend.
Reasoning models often ignore these. OpenAI's o-series and Anthropic's extended-thinking modes restrict or ignore temperature / top_p. The primary control is reasoning.effort / thinking.budget_tokens — how much the model deliberates internally before the final answer.
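The first caveat can be reproduced on a CPU with a toy example. The sketch below is an analogy, not a GPU kernel: it sums the same three numbers in two different orders and shows that the rounding difference is enough to change which of two near-tied logits wins. All values and names are hypothetical.

```python
def argmax(xs):
    """Index of the largest value; on an exact tie the first index wins."""
    return max(range(len(xs)), key=lambda i: xs[i])

# The same three terms, accumulated in two different orders.
order_a = (0.1 + 0.2) + 0.3
order_b = 0.1 + (0.2 + 0.3)
print(order_a == order_b)  # False: float addition is not associative

rival = 0.6  # a competing logit that is almost exactly tied

# Which token wins depends on how the second logit was accumulated.
winner_a = argmax([rival, order_a])
winner_b = argmax([rival, order_b])
print(winner_a, winner_b)  # 1 0
```

On a GPU the accumulation order can vary with kernel choice, batch size, or hardware, which is why greedy decoding can still diverge between runs once a single near-tie flips.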

Full breakdown of sampling, max_tokens, structured outputs, and the production knobs around them: How LLM Generation Works.