lenatriestounderstand

Chapter 1 of 10

How LLM Generation Works: Transformer, Sampling, Tokens, Batching, and Validation

Created Apr 28, 2026 Updated May 27, 2026

Working with large language models through an API is like driving a complex machine through several knobs: temperature, max_tokens, top_p, batching strategy, retry logic. Pick the wrong combination and you get an overly creative but incoherent answer, a JSON document cut off in the middle, or a bill for 100 thousand tokens where 500 would have sufficed.

This note first breaks down the inner workings of the transformer and the high-level text generation process, then covers five fundamental aspects of practical work with LLMs: the temperature sampling parameter, output length limits via max_tokens, structured outputs and schema validation, the batching strategy for bulk processing, and the validation pattern with "catching up" on what was missed.


Transformer architecture and the generation process

Before breaking down the control knobs, it is useful to picture what is actually happening inside the LLM when we send it a prompt. Without this picture, parameters like temperature, top_p, max_tokens remain abstract settings; with it, they become understandable consequences of how the model is built internally.

Transformer at a high level

The transformer is a neural network architecture introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), written primarily by researchers at Google Brain and Google Research. It replaced recurrent networks (RNN, LSTM) for sequence processing tasks and within a few years became the foundation of modern NLP. Most LLMs in practical use today (GPT, Claude, LLaMA, Mistral, Qwen, Gemini) are transformer-based or transformer-derived architectures, differing mainly in scale, training details, and small architectural modifications. Non-transformer designs exist (Mamba and other state-space models, RWKV, hybrid architectures), but they are not the mainstream of production-deployed LLMs at the moment.

At a high level, a transformer is built as a stack of identical blocks, each performing two key actions: self-attention (an attention mechanism through which each token "looks" at the other tokens of the sequence and mixes their information into its own representation) and a feed-forward network (an ordinary multilayer fully-connected network that processes each token independently). Each of these two sub-layers is wrapped in a residual connection (a short path along which the signal bypasses attention or the FFN, which is critical for training deep networks) and paired with layer normalization (normalization of activations for stability). Modern large models stack anywhere from 32 to more than 100 such blocks.

At its input the transformer receives not raw text but a sequence of token embeddings — numerical vectors into which the input text's tokens have been converted. At its output, for every position in the sequence it produces its own vector. During training all positions are used at once — each predicts its own next token, and the gradients flow from all of them in parallel; this is what makes transformer training much more parallelizable than RNN training, and lets it saturate modern GPU/TPU hardware. During autoregressive generation only the final position is consumed: its vector is projected into a probability distribution over the entire vocabulary, the next token is sampled, appended to the sequence, and the whole stack runs again for the next step.

Tokenization — turning text into tokens

Tokenization is the mandatory first step of any work with an LLM. The model does not see "letters" or "words" in the human sense; it works with a fixed set of units called tokens, of which there are usually 50–250 thousand per vocabulary. Before being fed to the model, text is split into these tokens, and the model produces tokens as its output.

Splitting is done by subword tokenization algorithms: BPE (Byte-Pair Encoding, used in GPT and most open-source models), WordPiece (BERT), SentencePiece (T5, multilingual models). The idea is the same for all of them: an optimal set of frequent byte or character subsequences is learned on a huge corpus of texts, and each such subsequence becomes a separate token. Frequently occurring words (the, and, is) get their own token; rare or compound words are split into several tokens (antidisestablishmentarianism → 6–7 tokens).
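To make the merge idea concrete, here is a minimal sketch of BPE training on a toy corpus. It is a deliberate simplification: production tokenizers typically work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges. The corpus and the function names (learn_bpe_merges, merge_pair, tokenize) are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (character-level toy)."""
    # Every word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (made up): "the" is frequent, so it ends up as a single token.
corpus = ["the"] * 5 + ["then"] * 2 + ["hen"]
merges = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(tokenize("the", merges), tokenize("then", merges), tokenize("hat", merges))
```

With this corpus the frequent word "the" collapses into one token, "then" becomes "the" + "n", and the unseen word "hat" falls back to single characters, which mirrors the frequent-word versus rare-word behavior described above.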

There are two reasons this is needed. First, a fixed-size vocabulary: the model must have a finite number of "output classes" in the softmax layer. A vocabulary of ~50–250 thousand tokens can still encode arbitrary text, including rare words, typos, neologisms, and non-English scripts, by falling back to smaller subword or byte-level pieces when no whole-word match is available. Second, a balance between vocabulary size and sequence length: if each token corresponded to a single character, sequences would be too long; if to a whole word, the vocabulary would explode and out-of-vocabulary words would have nowhere to go. Subword tokenization is a compromise that works well for most tasks.

For Russian and other non-English languages, tokenization has historically been less efficient: training corpora are predominantly English, and frequent non-English subsequences were often not promoted to dedicated tokens. Older tokenizers like cl100k_base (GPT-4 Turbo era) split a typical Russian word into 2–4 tokens against 1–2 for an English one. The current generation — o200k_base (used across GPT-5.5 / 5.4 and the o-series), Llama 3/4's 128K BPE, Gemini's ~256K SentencePiece, and Anthropic's tokenizer behind Opus 4.7 / Sonnet 4.6 — substantially closes the gap, but a budget-aware system should still expect a non-trivial multiplier on non-English text relative to English.

[Interactive demo: the same input tokenized by different vocabularies, showing the English-vs-Russian gap, the cost of emoji, and how punctuation gets bundled into structural tokens.]

Token embeddings — turning tokens into vectors

After tokenization, each token is turned into an embedding vector — a numerical vector of fixed dimensionality (usually 768, 1024, 2048, 4096, 12288 — depending on the size of the model). This process is implemented as a simple lookup table: for each token ID in the vocabulary, its own trained vector is stored; the "token → vector" operation is indexing a row of the table. The table itself is a set of trainable parameters of the model; it is initialized randomly and gradually adjusted so that semantically similar tokens get similar vectors.

Token embeddings are usually combined with positional embeddings: vectors encoding the position of the token in the sequence. Without them the transformer would not distinguish word order: self-attention by itself is permutation-equivariant, so without a position signal "John loves Mary" and "Mary loves John" would be processed as the same bag of tokens. Modern models use various variants of positional encoding (sinusoidal from the original transformer, learned positions, RoPE, i.e. Rotary Position Embedding, in LLaMA, ALiBi). Some of these are added to the embeddings, while RoPE and ALiBi act inside the attention computation, but the idea is the same for all: to inject information about order into the representation.
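A minimal sketch of the lookup-plus-position step, assuming the sinusoidal encoding from the original transformer paper and a tiny random table. The vocabulary size, dimensionality, and function names are illustrative; a real model's table holds trained values.

```python
import math
import random


def sinusoidal_position(pos, dim):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd ones."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:dim]


def embed(token_ids, table):
    """Look up each token's row in the table and add its position vector."""
    dim = len(table[0])
    return [
        [t + p for t, p in zip(table[tok], sinusoidal_position(pos, dim))]
        for pos, tok in enumerate(token_ids)
    ]


random.seed(0)
VOCAB_SIZE, DIM = 10, 8  # toy sizes, far smaller than any real model
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

a = embed([3, 7], table)  # token 3 first, token 7 second
b = embed([7, 3], table)  # same tokens, swapped order
print(a[0] != b[1])       # token 3 gets a different vector at a different position
```

The same token ID yields different input vectors at different positions, which is what lets the later layers tell the two word orders apart.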

It is important to distinguish these token embeddings (an LLM's internal vectors, with the same dimensionality as the model's hidden state — 4096 for Llama-7B/8B, larger for bigger models) from sentence embeddings used in retrieval (typically 768–1536 dimensions, produced by separate models such as E5, BGE, or OpenAI's text-embedding-3). These are different objects trained for different tasks: token embeddings are optimized for next-token prediction; sentence embeddings — for cosine similarity between texts.

Attention mechanism

Self-attention is the central operation of the transformer and the main architectural innovation that made modern LLMs possible. The idea is to let the model, for each token in the sequence, "look" at all the other tokens and gather relevant information from them.

Technically, each token produces three vectors via linear projections: Query (what I am looking for), Key (how I can be found), Value (what I can offer). The attention score for each pair of tokens is the dot product of one token's Query with the other's Key; the scores for a given Query are then normalized via softmax. The resulting weights are used for a weighted summation of Value vectors: the token gets its updated representation, in which information from all relevant positions is mixed in. The formula is Attention(Q, K, V) = softmax(QKᵀ / √d) · V, where d is the dimensionality of the Key vectors. The division by √d keeps the variance of the dot products QKᵀ from growing with d; without it, large dot products would push softmax into a saturated region with vanishing gradients, and training would stall.
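The formula can be sketched in a few lines of dependency-free Python. This is a single-head, unbatched illustration with hand-picked toy vectors; real implementations use batched matrix operations on an accelerator, and the Q, K, V values here are assumptions rather than learned projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, row by row."""
    d = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot product of this query with that key, scaled.
        scores = [sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


# Two tokens with orthogonal queries/keys: each attends mostly to itself.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = attention(Q, K, V)
print(result)
```

Because V is the identity here, each output row is simply the attention weight vector for that token, so the rows sum to one and put the larger weight on the token's own position.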

In real transformers, multi-head attention is applied — several independent attention "heads" in parallel, each with its own Q/K/V projections. This allows the model to simultaneously track different types of relations: one head may specialize in syntactic links (subject ↔ predicate), another — in co-reference (pronoun ↔ its referent), a third — in topical coherence. The results of all heads are concatenated and projected back into the original dimensionality.

For generative models (GPT, LLaMA, Claude) attention is made causal, or masked: each token can look only at itself and previous tokens, not future ones. This is critical for training next-token prediction; otherwise the model would simply "peek" at the answer. It is implemented via a mask on the attention scores: for all pairs (i, j) where j > i, the score is set to −∞ before softmax, so the resulting attention weight is exactly zero. For encoder models like BERT there is no mask, and each token sees the entire sequence in both directions.
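A small sketch of the causal mask applied to a matrix of raw attention scores. The score values are arbitrary; the point is only that masking happens before softmax, so the masked positions receive zero weight and the remaining weights still sum to one.

```python
import math


def causal_attention_weights(scores):
    """Mask scores[i][j] for j > i with -inf, then apply softmax to each row."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # position j == i is never masked, so m is finite
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
]
W = causal_attention_weights(scores)
for row in W:
    print([round(w, 3) for w in row])
```

The first row can only attend to position 0, the second to positions 0 and 1, and only the last row sees all three.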

[Interactive demo: attention masks for bidirectional, causal, and encoder-decoder attention, showing which keys each query attends to. The mask is the structural difference between an "understanding" model (BERT) and a "generating" one (GPT).]

The main strength of attention is the ability to model dependencies of any length within the context window. RNN/LSTM lost information about distant tokens due to the sequential nature of processing; attention, on the other hand, gives a direct path between any two positions, and the gradient during training flows freely between them. The main price is quadratic complexity with respect to sequence length (O(n²) operations and memory), which makes working with very long contexts expensive and stimulates research into efficient attention variants (Flash Attention, sparse attention, sliding window, linear attention).

Next token prediction

After the entire stack of attention and FFN blocks, the model produces a final vector for each position in the sequence — a hidden state with the model's hidden dimensionality (the same dimension as the input token embeddings). This vector is passed through the final output projection: a linear layer mapping the hidden state into logits of dimensionality equal to the vocabulary size (50–250 thousand). One logits vector per position.

The logits are turned into a probability distribution via softmax: each logit is exponentiated, and the sum is normalized to one. The result is, for each position, a distribution over the entire vocabulary — the model's belief about which token comes next. At inference time only the distribution at the last position is consumed: a token is sampled from it (or argmax is taken), appended to the sequence, and the whole stack runs again for the next step. At training time all positions' distributions are scored against the ground-truth next tokens in parallel.

This is the fundamental operating mode of any generative LLM — autoregressive next-token prediction. The model does not "generate the answer in one go"; it adds tokens one at a time, each time passing the entire updated context through the whole network. The role of the sampling parameters now becomes concrete: temperature, top_p, and top_k all change exactly how the final token is chosen from the last-position probability distribution (covered in detail in the next section).
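The loop itself is short enough to sketch. The "model" below is a hypothetical bigram table standing in for a transformer forward pass (it looks only at the last token), and the vocabulary and logit values are invented; the structure of the loop is what matters.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]

# Hypothetical logits: key = id of the last token, value = logits for the next one.
BIGRAM_LOGITS = {
    1: [0.0, 0.0, 3.0, 0.5, 0.0],  # after "the"
    2: [0.0, 0.0, 0.0, 3.0, 0.5],  # after "cat"
    3: [1.0, 0.5, 0.0, 0.0, 3.0],  # after "sat"
    4: [3.0, 0.0, 0.0, 0.0, 0.0],  # after "down"
}


def toy_model(token_ids):
    """Stand-in for a full forward pass: returns logits for the last position."""
    return BIGRAM_LOGITS[token_ids[-1]]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_tokens, rng=None):
    """Autoregressive loop: greedy if rng is None, otherwise sample."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = toy_model(ids)
        if rng is None:
            next_id = max(range(len(logits)), key=logits.__getitem__)
        else:
            next_id = rng.choices(range(len(logits)), weights=softmax(logits))[0]
        if next_id == 0:  # <eos> ends generation early
            break
        ids.append(next_id)  # the new token becomes part of the next step's context
    return ids


print([VOCAB[i] for i in generate([1], max_tokens=10)])
print([VOCAB[i] for i in generate([1], max_tokens=2)])  # stopped by the cap
print([VOCAB[i] for i in generate([1], max_tokens=10, rng=random.Random(0))])
```

Greedy decoding runs until the end-of-sequence token, while the second call shows the token cap stopping the loop mid-sentence regardless of content.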

[Interactive demo: step-by-step sampling from a top-K next-token distribution with an adjustable temperature, from deterministic at T=0 to erratic at T=2. Distributions are illustrative, not from a real model, but the mechanics are exactly the autoregressive loop.]

This is also where the concept of streaming arises — delivering tokens to the client as they are generated, rather than all at once upon completion. Since tokens are generated sequentially, there is technically no reason to wait for all of them — each generated one can be returned immediately into the response stream via Server-Sent Events or WebSocket. This gives the user the feeling that "the model is typing in real time" and reduces perceived latency, especially for long answers.

KV-cache as a generation accelerator

A naive autoregressive process would be extremely inefficient: for each new position, Q, K, V would have to be recomputed for all previous tokens. Modern LLM inference engines instead maintain a KV-cache — stored Key and Value vectors for every token already in the sequence (both the prompt and previously generated tokens).

During decoding, the model still computes the new token's Query, Key, and Value. The Keys and Values for all previous prompt and generated tokens are reused from the cache, so the new token only attends against cached K/V instead of recomputing them across the entire prefix.

The KV-cache does not make initial prompt processing free: the prefill stage over a long input is still expensive — every prompt token has to attend to all earlier ones via the causal mask, which is quadratic in input length. What the cache optimizes is autoregressive decoding: each new step reuses cached K/V instead of recomputing the whole prefix, making per-step decoding cost linear in the accumulated context length rather than quadratic.

The price is memory. The KV-cache grows linearly with context length, and for large models with long contexts it occupies a substantial fraction of GPU memory. This is one of the reasons a 128K- or 1M-token context window is a genuinely expensive feature: processing such sequences requires both a long prefill and a large KV-cache to hold across the rest of the generation.

[Interactive figure. Top panel: the one-time prefill cost (quadratic in input length) followed by the linear-cost decoding steps. Bottom panel: KV-cache memory growing linearly across the whole generation, with a slope of 2 × layers × kv_heads × head_dim × dtype_bytes per token. It illustrates why long-context inference is mostly a memory problem.]
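The per-token slope 2 × layers × kv_heads × head_dim × dtype_bytes is easy to turn into a back-of-the-envelope estimate. The configuration below (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumed shape for a mid-sized model with grouped-query attention, not a measurement of any specific deployment, and the estimate ignores engine overhead such as paging or fragmentation.

```python
def kv_cache_bytes(num_tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """Rough KV-cache size: one Key and one Value vector per layer, head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token * num_tokens


# Assumed model shape, for illustration only.
CONFIG = dict(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)

per_token = kv_cache_bytes(1, **CONFIG)
full_context = kv_cache_bytes(128_000, **CONFIG)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_context / 2**30:.1f} GiB for a 128,000-token sequence")
```

Under these assumptions a single 128,000-token sequence needs roughly 15.6 GiB of cache on top of the model weights, and that figure scales with the number of concurrent sequences.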

LLM (decoder-only) vs BERT (encoder-only)

In the world of transformers there are two main architectural families, and understanding the difference between them is important for choosing the right tool for the task.

Encoder-only models, such as BERT (Bidirectional Encoder Representations from Transformers, Google 2018), RoBERTa, DeBERTa, and E5 for embeddings, are built as a stack of transformer blocks with bidirectional attention (each token sees all the others in both directions). They are trained via masked language modeling: in the source text 15% of tokens are randomly masked, and the model learns to predict them from bidirectional context. At the output, for each position there is a rich contextual representation reflecting the meaning of the token with its entire surrounding context taken into account. These models are not designed for open-ended autoregressive generation; their strength is understanding-oriented tasks: classification, named-entity recognition (NER), similarity search, sentiment analysis. The size is usually 100M–1B parameters, orders of magnitude smaller than modern LLMs.

Decoder-only models, such as GPT (Generative Pre-trained Transformer from OpenAI, 2018+), LLaMA (Meta), Claude (Anthropic), Mistral, and Qwen, are built as a stack of transformer blocks with causal attention (each token sees only the previous ones). They are trained via next-token prediction: the model predicts the next token from the entire previous sequence. This paradigm unifies all NLP tasks under a single format, "give a continuation of the text", and applied to a huge amount of data and parameters (from 7B to 700B+) it gives the ability not only to understand, but also to generate coherent long texts, answer questions, write code, and reason. Nearly everything we call an "LLM" today is a decoder-only transformer.

There is also a third family — encoder-decoder transformers (T5, BART, the original transformer from 2017): first the encoder processes the input text, then the decoder generates the output with cross-attention to the encoder representations. This architecture has historically been used for machine translation and is still applied in specific tasks (T5 — text-to-text for a wide set of NLP tasks), but in mass use it has been displaced by decoder-only models, which turned out to be more universal.

A practical choice: for understanding tasks (classification, retrieval, extraction), encoder-only models such as BERT and its descendants are still a strong baseline, especially with a limited inference budget. For generation tasks (chat, summarization, code, reasoning), decoder-only LLMs. For retrieval embeddings, modern models like E5, BGE, and OpenAI's text-embedding-3 are encoder architectures optimized for cosine similarity rather than next-token prediction.


Temperature

Temperature controls randomness in the LLM's output. Too high gives incoherent noise; too low gives boring, predictable answers; for some tasks you want the lowest possible variance. Choosing the right value requires understanding what temperature actually does — and where in the generation pipeline it lives.

Where temperature lives in the generation loop

Temperature is a sampling parameter — it acts only at the step where a token is picked from the model's output distribution, not during the forward pass through the transformer. At each generation step the model produces a probability distribution over the entire vocabulary, and temperature determines how sharp or blurred that distribution is when we draw a token from it.

Softmax with a temperature divisor

At each step the transformer outputs logits — unnormalized scores for the entire vocabulary (50K–250K entries). To turn them into probabilities, softmax is applied with a temperature divisor:

p_i = exp(logit_i / T) / Σ exp(logit_j / T)

The divisor T controls the sharpness of the resulting distribution. With small T, division inflates the exponents and the gap between tokens widens — the most likely one becomes an almost certain pick. With large T, the exponents flatten and the distribution approaches uniform.

[Interactive demo: fixed logits for ten plausible next tokens after "The cat sat on the ___", with sliders for temperature (sharpening toward argmax as T approaches 0, flattening toward near-uniform at T=3), top_p, and top_k to show which tokens get cut from the candidate set.]

T = 0, T = 1, T → ∞: three regimes

Three values of T give intuition across the range. At T = 0 the formula itself is undefined (division by zero), so implementations treat it as the limit T → 0, which is argmax: the model always selects the maximum-logit token, with no randomness, and the output is deterministic given the same input (with caveats; see below). At T = 1 softmax is applied unmodified: this is the model's original learned distribution, its "natural" degree of diversity. As T grows above one the distribution flattens and unlikely tokens get larger chances; results become more creative and less coherent. As T → ∞ the distribution becomes uniform and we get random noise from the vocabulary.
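The regimes can be checked numerically with a few lines of Python. This is a sketch over three made-up logits; treating T = 0 as argmax is a convention handled with an explicit branch, since the formula cannot be evaluated there.

```python
import math


def softmax_with_temperature(logits, T):
    """p_i = exp(logit_i / T) / sum_j exp(logit_j / T); T == 0 is treated as argmax."""
    if T == 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example values
for T in (0, 0.5, 1.0, 2.0, 1e6):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

The probability of the top token shrinks monotonically as T grows, from 1.0 at T = 0 toward 1/3 (uniform over three tokens) for very large T.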

top_p, top_k, frequency and presence penalties

Several other sampling parameters live alongside temperature.

top_p (nucleus sampling) is an alternative way to limit the tail of the distribution. Instead of reshaping its sharpness the way temperature does, top_p cuts off all tokens beyond a cumulative-probability threshold: with top_p=0.9, the model considers only the smallest set of most-likely tokens that jointly account for 90% of the probability mass.

top_k is a coarser version — keep the K most likely tokens, drop the rest. It is an older technique; top_p usually performs better because it adapts to the local shape of the distribution.
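Both filters can be sketched as operations on an already computed probability list. The probabilities are invented, and this version renormalizes the surviving mass so that the kept tokens sum to one, which is how common sampling implementations behave; provider-side details may differ.

```python
def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


peaked = [0.5, 0.3, 0.15, 0.05]   # confident distribution
flat = [0.25, 0.25, 0.25, 0.25]   # uncertain distribution
print(top_p_filter(peaked, 0.9), top_p_filter(flat, 0.9))
print(top_k_filter(peaked, 2), top_k_filter(flat, 2))
```

With top_p=0.9 the peaked distribution keeps three candidates and the flat one keeps all four, whereas top_k=2 keeps exactly two in both cases; that is the adaptivity difference described above.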

frequency_penalty and presence_penalty (OpenAI-specific; absent from Anthropic's API) fight repetition: the first penalizes tokens by how often they have already appeared in the generation, the second by the mere fact of appearance. They reduce parrot-like loops at the cost of nudging the model away from naturally repeated structure such as lists or code.
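As a rough sketch of how the two penalties differ, the function below adjusts logits the way OpenAI's documentation has described it: the frequency term scales with how many times a token has already been generated, the presence term is a flat deduction for any token that has appeared at all. The function name and the example values are illustrative, and the exact server-side behavior should be treated as an assumption.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appeared in the generated output."""
    counts = Counter(generated_ids)
    adjusted = []
    for token_id, logit in enumerate(logits):
        seen = counts[token_id]
        logit -= seen * frequency_penalty                    # grows with each repeat
        logit -= (1.0 if seen > 0 else 0.0) * presence_penalty  # flat, once
        adjusted.append(logit)
    return adjusted


logits = [1.0, 1.0, 1.0]
history = [0, 0, 1]  # token 0 generated twice, token 1 once, token 2 never
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 is penalized most because it repeated, token 1 pays the frequency term once plus the presence term, and the unseen token 2 is untouched.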

Choosing T by task type

For deterministic tasks — generating regex, JSON against a fixed schema, code — the standard is temperature=0. Any randomness here hurts test reproducibility and can break downstream parsing: if the model generates a regex that changes between runs, tests become flaky and errors surface not at the moment of generation but later, when rules are applied to real data.

For creative tasks — text generation, brainstorming, ideation — 0.7–1.0 is typical: enough diversity without descending into chaos.

For reasoning-oriented models and extended-thinking modes, temperature is no longer the main control knob and may be restricted, ignored, or accepted only in a narrow range depending on the model. The more important parameter is reasoning effort / thinking budget: how much the model deliberates internally before producing the final answer. OpenAI exposes this as reasoning.effort with values like none, minimal, low, medium, high, xhigh (model-dependent) — lower effort favors speed and token usage, higher effort produces more complete reasoning. Anthropic exposes the equivalent as a thinking.budget_tokens cap. Provider docs should be checked per model, since supported values shift between releases.

T=0 is not bit-exactly deterministic

There is a subtle but important nuance. Even at T=0 an LLM may give slightly different answers on the same input. Two main reasons:

  • Floating-point non-associativity on GPU. Atomic adds and varying reduction-tree shapes mean the order of summation can change between runs; the result differs in the lowest bits of the mantissa. Once in a while this flips which token has the maximum logit.
  • Mixture-of-Experts routing under shared batches. Frontier models are increasingly MoE: each token is routed to a small subset of "experts" inside the layer. On a multi-tenant inference server your request shares a batch with other users' requests, and the composition of that batch affects expert load balancing — which affects routing — which affects outputs. Your input is the same; the rest of the batch is not.

In practice this means that on the same prompt in production you may get slightly different answers even at T=0, and code should not rely on bit-exact matching. If reproducibility is critical (regression tests against fixed expected outputs), assert structurally rather than character-for-character, or use a self-hosted, single-tenant deterministic decoder.
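One way to assert structurally is to parse the output and compare only the fields the test actually cares about. The sketch below assumes a hypothetical JSON output with label, confidence, and explanation fields; the helper name and the field names are invented for illustration.

```python
import json


def structurally_equal(raw_output, expected, ignore_keys=()):
    """Compare parsed JSON to an expected dict, skipping free-text fields."""
    actual = json.loads(raw_output)

    def strip(d):
        return {k: v for k, v in d.items() if k not in ignore_keys}

    return strip(actual) == strip(expected)


# Two runs of the "same" request that differ in wording, key order, and whitespace.
run_a = '{"label": "spam", "confidence": 0.93, "explanation": "Looks like an ad."}'
run_b = '{ "explanation": "This looks like advertising.", "label": "spam", "confidence": 0.93 }'
expected = {"label": "spam", "confidence": 0.93}

print(run_a == run_b)  # character-for-character comparison fails
print(structurally_equal(run_a, expected, ignore_keys=("explanation",)))
print(structurally_equal(run_b, expected, ignore_keys=("explanation",)))
```

The raw strings differ, yet both runs pass the structural check, so a regression test written this way tolerates the small run-to-run variation described above.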


Max output tokens

max_tokens (and its provider-specific cousins) caps the length of the model's response. The mechanism is simple but is a frequent source of non-obvious bugs, especially with structured outputs like JSON.

Cap on generated tokens

The parameter sets the maximum number of tokens the model may generate before it is forcibly stopped. It is an upper bound, not a target: if the model decides to finish earlier — after hitting a logical end-of-answer point or a stop sequence — it will. If the natural answer would be longer, the model is truncated mechanically, without regard to structure, semantics, or correctness.

For budget arithmetic, a rough rule of thumb is that one token covers about 0.75 English words, with non-English languages running 1.5–3× higher in token count depending on the tokenizer. For exact counts on OpenAI models use the tiktoken library with the encoding that matches the model; Anthropic exposes a client.messages.count_tokens() endpoint for the same purpose.

Context window vs output cap

Two related but different limits coexist.

The context window is the total cap on input + output tokens for a single call. The output cap (max_tokens, or one of the provider-specific names below) limits only the output portion regardless of input size, but it is itself bounded from above by a per-model hard cap on output that is usually significantly smaller than the full context window. Exact numbers change quickly with each model release and should be checked in the provider's model table; the invariant is the structural one — context window and output cap are separate limits.

Both apply. The effective ceiling on output is min(context_window − input_tokens, provider_output_cap), and the smaller term is often the per-model output cap rather than the window-minus-input arithmetic — easy to miss when you focus only on the headline context-window number.
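The arithmetic is a one-liner, sketched here with an optional third term for the value the caller requested. The window and cap numbers in the examples are placeholders, not the limits of any particular model.

```python
def effective_output_ceiling(context_window, input_tokens, output_cap, requested=None):
    """min(context_window - input_tokens, output_cap[, requested]), floored at zero."""
    ceiling = min(context_window - input_tokens, output_cap)
    if requested is not None:
        ceiling = min(ceiling, requested)
    return max(ceiling, 0)


# Placeholder limits: 128,000-token window, 16,384-token per-model output cap.
print(effective_output_ceiling(128_000, 10_000, 16_384))          # output cap binds
print(effective_output_ceiling(128_000, 120_000, 16_384))         # window minus input binds
print(effective_output_ceiling(128_000, 10_000, 16_384, 4_096))   # the request binds
print(effective_output_ceiling(128_000, 130_000, 16_384))         # input alone overflows
```

For a short input the per-model output cap is the binding term; only when the input nearly fills the window does the window-minus-input term take over.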

[Interactive demo: effective output ceiling as min(window − input, output_cap, requested) for a chosen model class, input size, and requested max_tokens. For most realistic inputs the per-model output cap, not the headline context window, is the binding constraint.]

Parameter naming across providers

The same idea has different names depending on the API surface, which trips up code copied between providers:

  • OpenAI Chat Completions (classic): max_tokens.
  • OpenAI Chat Completions for o-series and newer models: max_completion_tokens. The old max_tokens is still accepted for non-reasoning models but deprecated for reasoning ones.
  • OpenAI Responses API: max_output_tokens.
  • Azure OpenAI: same names as the corresponding OpenAI API version Azure exposes (so max_completion_tokens once Azure rolls out the o-series API version).
  • Anthropic: max_tokens, and unlike OpenAI it is required, not optional.

Leaving the old parameter name when migrating between providers is one of the first bugs to surface: some APIs silently ignore the unknown field and apply a small default (often 256), others fall over with an uninformative error. In production code, factor the name into a provider-specific configuration.
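A minimal sketch of such a configuration: a lookup from API surface to the parameter name, plus a helper that strips any stale cap field before setting the right one. The dictionary keys (openai_chat and so on) and the helper name are labels invented for this example; the parameter names come from the list above.

```python
# Hypothetical labels for API surfaces mapped to their output-cap parameter name.
OUTPUT_CAP_PARAM = {
    "openai_chat": "max_tokens",
    "openai_chat_reasoning": "max_completion_tokens",
    "openai_responses": "max_output_tokens",
    "anthropic_messages": "max_tokens",
}


def build_request_params(api, base_params, output_cap):
    """Return a copy of base_params with exactly one, correctly named, cap field."""
    if api not in OUTPUT_CAP_PARAM:
        raise ValueError(f"unknown API surface: {api!r}")
    known_names = set(OUTPUT_CAP_PARAM.values())
    # Drop any cap field left over from another provider's request shape.
    params = {k: v for k, v in base_params.items() if k not in known_names}
    params[OUTPUT_CAP_PARAM[api]] = output_cap
    return params


legacy = {"model": "some-model", "max_tokens": 100}  # copied from older code
print(build_request_params("openai_responses", legacy, 4096))
print(build_request_params("anthropic_messages", legacy, 4096))
```

Keeping the mapping in one place means a provider migration changes a configuration entry rather than every call site.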

Silent truncation on overflow

On exceeding the limit the model does not finish gracefully — it cuts off at exactly the last allowed token, whatever it is: the middle of a word, the middle of a JSON value, the middle of a closing tag.

The API surface signals this through a completion-status field whose name and semantics depend on the API. In OpenAI Chat Completions it is finish_reason, with "stop" for a normal end and "length" when the cap was hit; in Anthropic Messages it is stop_reason, with "end_turn" for a normal end and "max_tokens" when the cap was hit. In OpenAI's Responses API it is status ("completed" vs "incomplete") plus an incomplete_details.reason such as "max_output_tokens". A client that ignores this signal sees invalid JSON and falls over with a parse error; at first glance the JSON looks "almost right", just without the closing bracket, which is exactly why this bug class is so confusing in production.

Always check the completion status before parsing

Before trying to parse a response, check the provider-specific completion status. In Chat Completions-style APIs that means finish_reason == "stop" (or stop_reason == "end_turn" for Anthropic); in the OpenAI Responses API, status == "completed" with no incomplete_details. If the truncation indicator fires instead, the model did not finish its thought, and parsing is highly likely to fail; either retry with a larger cap or surface the truncation as an error. Silently accepting truncated output is a guaranteed path to invisible bugs that surface in production a week later.
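The check can be sketched as a small guard in front of the JSON parser. To stay independent of any SDK, the function takes a plain dictionary of status fields that the caller is assumed to have copied from the provider's response object; the API labels, the exception class, and the helper names are hypothetical.

```python
import json


class TruncatedOutputError(Exception):
    """Raised when the provider reports that generation did not finish normally."""


def finished_normally(api, status):
    """`status` is a plain dict of completion fields extracted by the caller."""
    if api == "openai_chat":
        return status.get("finish_reason") == "stop"
    if api == "anthropic_messages":
        # Other non-truncation values (e.g. tool use) would need their own handling.
        return status.get("stop_reason") == "end_turn"
    if api == "openai_responses":
        return status.get("status") == "completed" and not status.get("incomplete_details")
    raise ValueError(f"unknown API surface: {api!r}")


def parse_json_output(api, status, text):
    """Refuse to parse output that the provider flagged as incomplete."""
    if not finished_normally(api, status):
        raise TruncatedOutputError(f"incomplete response: {status}")
    return json.loads(text)


print(parse_json_output("openai_chat", {"finish_reason": "stop"}, '{"labels": ["a", "b"]}'))
try:
    parse_json_output("openai_chat", {"finish_reason": "length"}, '{"labels": ["a", "b"')
except TruncatedOutputError as err:
    print("caught:", err)
```

Raising a dedicated error keeps truncation distinguishable from a genuine parse failure, so the caller can retry with a larger cap instead of logging a generic JSON error.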

A typical pattern for batch jobs: set a comfortably large max_tokens (e.g., 4096 — enough for a JSON array of labels); without it the model regularly cuts off the array mid-element and feeds invalid JSON to the parser.


Structured outputs and schema validation

Once a section talks about JSON parsing failures, the natural follow-up is: how do you reduce format errors in the first place? Prompt instructions alone ("return JSON with these fields") are not enough — the model will occasionally produce invalid JSON, miss fields, hallucinate keys, or wrap the output in markdown fences.

If the provider supports JSON schema, structured outputs, or function/tool calling, use them. These mechanisms move many format errors from "random parsing failures" into explicit API-level constraints: the API itself enforces (or strongly conditions on) the schema, and out-of-schema outputs are caught, or never produced, before they reach your parser. OpenAI exposes this as response_format={"type": "json_schema", "json_schema": {"name": ..., "schema": ..., "strict": true}} in Chat Completions and as text.format with type: "json_schema" and strict: true in the Responses API; Anthropic exposes it through tool use with an input schema. In practice, prefer strict schemas where supported: non-strict mode treats the schema as a hint rather than a hard constraint and still lets the model deviate.

This does not remove the need for validation. Schema-aware decoding guarantees syntactic compliance, not semantic correctness, and edge cases remain — over-long generations that still get truncated mid-schema, refusal-style answers that don't match any schema branch, fields populated with hallucinated but well-typed values. Anthropic's docs also call out specifically that on stop_reason: "max_tokens" or refusals the output may not match the schema even with tool use. But schema enforcement shrinks the parse-error surface enough that the catch-up loop in the next section is usually needed only for the rarer case of model omissions on long batches.
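A sketch of the validation that remains useful after schema enforcement: checking that well-formed items are also meaningful. The output shape (a list of objects with id and label), the label set, and the function name are assumptions made for this example.

```python
def validate_batch(items, expected_ids, allowed_labels):
    """Return (errors, missing_ids) for a parsed list of {"id": ..., "label": ...}."""
    errors, seen = [], set()
    expected = set(expected_ids)
    for n, item in enumerate(items):
        if not isinstance(item, dict) or "id" not in item or "label" not in item:
            errors.append(f"item {n}: missing id or label")
        elif item["id"] not in expected:
            errors.append(f"item {n}: unknown id {item['id']!r}")  # hallucinated id
        elif item["id"] in seen:
            errors.append(f"item {n}: duplicate id {item['id']!r}")
        elif item["label"] not in allowed_labels:
            errors.append(f"item {n}: label {item['label']!r} not allowed")
        else:
            seen.add(item["id"])
    # Anything expected but not validly answered still needs processing.
    missing = [i for i in expected_ids if i not in seen]
    return errors, missing


parsed = [
    {"id": 1, "label": "spam"},
    {"id": 2, "label": "maybe"},  # well-typed, but not an allowed label
    {"id": 9, "label": "ham"},    # well-typed, but the id was never sent
]
errors, missing = validate_batch(parsed, expected_ids=[1, 2, 3], allowed_labels={"spam", "ham"})
print(errors)
print(missing)
```

Every item here would pass a type-level schema, yet only one of the three expected inputs received a usable answer; the list of missing ids is what a later retry would work from.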


Batching

Batching is the practice of grouping several elements into a single LLM request. It sounds simple, but the right batch size is a tradeoff between several factors: context limits, attention quality on long inputs, latency, cost. Too large a batch degrades attention quality; too small a batch uses the API inefficiently.

Splitting a large input into N parts

A typical situation: the task is to process thousands of URLs, documents, or other units. You cannot send a single gigantic "process 30,000 elements" request, but a separate request per element is also a poor tradeoff. The right approach is to split the input into batches of reasonable size — each fitting into the context window with room for the answer — and process the batches in parallel.
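The split-and-parallelize approach can be sketched with the standard library. process_batch is a stand-in for a real LLM call and simply echoes its input; the worker count and the function names are placeholders, and the batch size of 30 is just a plausible starting value.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(batch):
    """Placeholder for one LLM request that returns one result per element."""
    return [f"label:{x}" for x in batch]


def process_all(items, batch_size=30, max_workers=8):
    """Send batches concurrently and flatten results in the original order."""
    batches = chunked(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if batches finish out of order.
        results = list(pool.map(process_batch, batches))
    return [r for batch_result in results for r in batch_result]


items = list(range(100))
print(len(chunked(items, 30)), [len(b) for b in chunked(items, 30)])
print(process_all(items)[:3])
```

Threads are a reasonable fit here because the real work is waiting on network I/O; in practice max_workers would be tuned against the provider's rate limits.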

Why not "everything in one request"

The first problem is purely mechanical. Thirty thousand elements at 80 tokens each is 2.4 million tokens, which exceeds the context window of all current production models — and even Gemini 1.5/2.0 Pro's 1M-token mode (and the experimental 2M one) sits below that, with attention quality on the back half noticeably degraded.

Even when the volume fits, the lost-in-the-middle effect becomes a concern (Liu et al., 2023): in classic long-context evaluations, LLMs handled information at the beginning and end of a long context noticeably better than in the middle, with accuracy on mid-context questions dropping several-fold relative to edge-of-context ones. Frontier 2026 models have largely closed this gap on simple needle-in-haystack retrieval, but related long-context failures still appear when retrieval is semantic rather than exact-match, when several relevant items must be aggregated, or when answering requires reasoning over the retrieved content. A bulk request that dumps thousands of items into a single prompt lands squarely in those harder regimes.

Long requests also hit timeouts more often, especially on slow endpoints. And finally, a single big request rules out parallelism: N small batches can be sent simultaneously and, within the provider's rate limits, yield close to an N-fold speedup against a serial baseline.

Figure (interactive in the original): the U-shaped degradation pattern from classic long-context evaluations (Liu et al., 2023). Accuracy is higher when the relevant fact sits near the edges of the context than in the middle, and the dip deepens as the context grows. Illustrative curves inspired by Liu et al. (2023), not exact reproduced measurements.

Why not "one request per element"

Each request carries fixed overhead: network latency to the API, time to first token, model warmup. Across a thousand requests at 200 ms of overhead each, that is 200 seconds spent purely on overhead when the requests run one after another, before any real generation work.

System prompts are also duplicated: the same long instruction is sent a thousand times, and you pay for it in both latency and tokens.
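To make the two fixed costs concrete, here is a small back-of-the-envelope sketch. The 200 ms overhead comes from the text above; the 500-token system prompt and the function name `fixed_costs` are assumptions chosen only for illustration:

```python
import math


def fixed_costs(n_items: int, batch_size: int,
                overhead_ms: int, system_prompt_tokens: int):
    """Return (requests, serial overhead in seconds, system-prompt tokens sent)."""
    requests = math.ceil(n_items / batch_size)
    overhead_s = requests * overhead_ms / 1000
    return requests, overhead_s, requests * system_prompt_tokens


# 1,000 items, 200 ms overhead per request, assumed 500-token system prompt.
one_by_one = fixed_costs(1000, 1, overhead_ms=200, system_prompt_tokens=500)
batched = fixed_costs(1000, 30, overhead_ms=200, system_prompt_tokens=500)
print(one_by_one)  # (1000, 200.0, 500000)
print(batched)     # (34, 6.8, 17000)
```

Under these assumptions, batching by 30 cuts both the serial overhead and the repeated system-prompt tokens by roughly a factor of thirty.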

The most subtle benefit, though, is context sharing. When the model sees a batch of 30 similar elements, it picks up the pattern and processes them consistently — every output is conditioned on the same set of examples. One-by-one requests have no shared context, and similar inputs can produce divergent outputs simply because the model has nothing else to anchor against.

Sweet spot — empirically

The optimal batch size is found empirically. For GPT-4o on classification-style batch tasks, 30 elements per batch is a frequent sweet spot: the model reliably returns exactly as many answers as inputs, pattern recognition works, and throughput is good.

Beyond about 50 elements the model starts omitting items — returning fewer than it was given, requiring a catch-up loop (covered in the next section). Below 10–20 the per-request overhead grows and the benefit of context sharing fades.

For other models and tasks the sweet spot will be different — and on tasks with structured prompts, complex per-element instructions, or wide variance between elements, it can shift dramatically. Treat the 30-element rule as a starting point and tune.

Figure (interactive in the original): throughput climbs as overhead amortizes; completeness falls once the model starts omitting items on long batches; their product, usable items/sec, peaks somewhere in between. The peak moves with per-request overhead, time per item, the batch size at which omissions begin, and the maximum drop on huge batches. There is no universal "30": it is whatever your task makes it.
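The tradeoff between throughput and completeness can be sketched with a toy model. Everything here is an assumption made for illustration: the linear completeness drop, the parameter names, and the default values are not measurements and not the formula behind the original figure.

```python
def usable_items_per_sec(batch_size, overhead_s=0.2, time_per_item_s=0.05,
                         onset=40, max_drop=0.6, ramp=100):
    """Toy model: throughput times completeness for one batch size."""
    # Throughput: items per second once the fixed overhead is amortized.
    throughput = batch_size / (overhead_s + batch_size * time_per_item_s)
    # Completeness: 1.0 up to `onset`, then falls linearly by up to `max_drop`
    # over the next `ramp` items (an assumed shape, not an observed one).
    excess = max(0, batch_size - onset)
    completeness = 1.0 - max_drop * min(1.0, excess / ramp)
    return throughput * completeness


sizes = range(1, 201)
best_default = max(sizes, key=usable_items_per_sec)
best_slow_api = max(sizes, key=lambda b: usable_items_per_sec(b, overhead_s=2.0))
print(best_default, best_slow_api)
```

With the default parameters the peak sits exactly where omissions begin; with a ten times larger per-request overhead it moves past that point, because amortizing the overhead is worth losing a few items. That is the qualitative behavior the text describes, nothing more.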

Formula for an initial estimate

batch_size ≈ (context_window - system_prompt_tokens - max_output_tokens) / tokens_per_item

The formula above gives an initial estimate. Take the model's context window, subtract the size of the system prompt (a constant per request), subtract max output tokens (the reserved output), and divide by the average number of input tokens per element. This is the theoretical upper limit, not a target. The practical sweet spot is smaller: at least one and a half to two times smaller because of the "lost in the middle" effect, and often far smaller on large context windows, where item omission sets in long before the context fills up.
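A worked example of the estimate, as a sketch. The numbers are assumptions for illustration: a 128,000-token context window, a 1,000-token system prompt, 4,000 reserved output tokens, and 80 input tokens per item. The sketch also adds one check the formula leaves implicit: the answers for the whole batch must fit into the reserved output, so an assumed 10 output tokens per item gives a second, often tighter, bound.

```python
def estimate_batch_size(context_window, system_prompt_tokens, max_output_tokens,
                        input_tokens_per_item, output_tokens_per_item):
    """Return (bound from input budget, bound from output budget, the smaller one)."""
    input_budget = context_window - system_prompt_tokens - max_output_tokens
    if input_budget <= 0:
        raise ValueError("no room left for items in the context window")
    by_input = input_budget // input_tokens_per_item
    # The response for the whole batch has to fit into max_output_tokens.
    by_output = max_output_tokens // output_tokens_per_item
    return by_input, by_output, min(by_input, by_output)


by_input, by_output, upper_bound = estimate_batch_size(
    context_window=128_000,
    system_prompt_tokens=1_000,
    max_output_tokens=4_000,
    input_tokens_per_item=80,
    output_tokens_per_item=10,
)
print(by_input, by_output, upper_bound)  # 1537 400 400
```

Both numbers are ceilings. The empirically tuned batch size from the previous subsection is what you actually run with.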

Parallel batching

import asyncio

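async def process_all(batches):
    # process_batch is assumed to be an async function defined elsewhere
    # that sends one batch to the LLM and returns its parsed result.
    # gather returns results in the same order as batches.
    return await asyncio.gather(*[process_batch(b) for b in batches])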

Batching combines well with parallelism. asyncio.gather lets all batches run concurrently and total wall-time approaches the time of the slowest batch rather than the sum. In real code you almost always want to bound concurrency with an asyncio.Semaphore so as not to immediately overrun the provider's rate limits — the constraint shifts from "how fast is the slowest batch" to "how fast can I push tokens through the rate-limit window".
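A minimal sketch of the bounded version. Here `process_batch` is a hypothetical stand-in that sleeps instead of calling an API, and `max_concurrency=5` is an arbitrary assumed default, not a recommended value:

```python
import asyncio


async def process_batch(batch):
    # Hypothetical stand-in for one LLM request; replace with a real API call.
    await asyncio.sleep(0.01)
    return [f"label-for-{item}" for item in batch]


async def process_all_bounded(batches, max_concurrency=5):
    # At most max_concurrency batches are in flight at any moment.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(batch):
        async with semaphore:
            return await process_batch(batch)

    # gather still returns results in the same order as the input batches.
    return await asyncio.gather(*(guarded(b) for b in batches))


batches = [[1, 2], [3], [4, 5, 6]]
results = asyncio.run(process_all_bounded(batches, max_concurrency=2))
print(results)
```

The semaphore caps requests in flight, which indirectly limits RPM; it does not count tokens, so a TPM budget needs separate accounting.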

Rate limits — RPM, TPM, 429 backoff

OpenAI and Azure OpenAI enforce two main kinds of rate limit: RPM (requests per minute) and TPM (tokens per minute). Anthropic's structure is similar but splits the token budget: requests, input tokens, and output tokens each have their own per-minute caps. Batching helps with all of them: fewer requests reduce RPM pressure, and a well-chosen batch size lets you use TPM close to the cap without overshooting. Exceeding any limit returns HTTP 429, which the client must handle with backoff and retry; exponential backoff with jitter is the standard pattern.
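Exponential backoff with jitter can be sketched as follows. `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on HTTP 429, and `send` is any zero-argument callable that performs the request. The delay is drawn uniformly from zero up to an exponentially growing cap (the "full jitter" variant, one of several in use):

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client library's HTTP 429 error."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send(); on RateLimitError wait a random, growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted, let the caller decide
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Demo: a fake request that is rate-limited twice, then succeeds.
calls = {"count": 0}


def flaky_send():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimitError("429")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_backoff(flaky_send, sleep=delays.append)
print(result, len(delays))  # ok 2
```

The jitter keeps parallel workers from retrying in lockstep. If the provider sends a Retry-After header, a real client would normally prefer that value over its own computed delay.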


Validation and catching up on missing outputs

Even with the right batch size and a good prompt, LLMs sometimes return a partial result: skipping elements, mixing up identifiers, producing fewer items than there were in the input. This is not a model bug in the strict sense — it is the freedom the model takes in interpreting the task. The system must be ready for it and able to catch up on what was missed.

Why partial answers happen

Partial answers are typical of all large models, especially on long lists. Even with explicit instructions like "return a label for every input", a model may simply produce 28 results instead of 30, drop random IDs, or merge two adjacent items into one. Waiting for an ideal prompt that guarantees a complete result is futile — it is more reliable to build a system that handles partial answers correctly.

Defensive implementation

def process_with_retry(items, max_retries=3):
    # items: list[dict] with a stable "id" field
    remaining_ids = {item["id"] for item in items}
    items_by_id = {item["id"]: item for item in items}
    results = {}

    for _ in range(max_retries):
        if not remaining_ids:
            break

        batch = [items_by_id[item_id] for item_id in remaining_ids]
        response = llm_batch_process(batch)  # returns {id: label}

        for item_id, label in response.items():
            if item_id in remaining_ids:
                results[item_id] = label

        remaining_ids -= set(results)

    # Fallback for the rest
    for item_id in remaining_ids:
        results[item_id] = "Unknown"

    return results

The algorithm:

  1. Track items by stable id. Keep a remaining_ids set of IDs for which there is no answer yet, and an items_by_id map for rebuilding the next batch.
  2. While remaining_ids is not empty and the retry budget is not exhausted, send a request with only the remaining items.
  3. Each incoming answer is keyed by id, added to results, and the ID is removed from remaining_ids via set difference.
  4. After N attempts, any IDs that are still missing get a fallback value — the pipeline must not crash because of a single stubborn element.

Stable IDs are not optional. Without them you cannot reliably map outputs back to inputs across retries, especially when the model returns a different number of items than it was given, drops some, or merges adjacent ones. Keying by ID also makes the loop's bookkeeping idempotent at the dataset level: re-running with the same IDs produces a results dict with exactly the same keys, regardless of how many retry rounds happened to be needed. Whether the values also match depends on how deterministic the model's outputs are.
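The per-response check can be pulled out into its own step. The sketch below goes slightly beyond the loop above: besides ignoring IDs that were never requested, it also rejects labels outside an allowed set. The function name `validate_response` and the `allowed_labels` parameter are assumptions made for illustration.

```python
def validate_response(response, expected_ids, allowed_labels):
    """Split one {id: label} response into accepted, rejected, and missing."""
    accepted, rejected = {}, {}
    for item_id, label in response.items():
        if item_id not in expected_ids:
            rejected[item_id] = "unknown id"      # hallucinated or mangled ID
        elif label not in allowed_labels:
            rejected[item_id] = "invalid label"   # right item, unusable answer
        else:
            accepted[item_id] = label
    # Anything expected but not accepted goes back into the retry set.
    missing = set(expected_ids) - set(accepted)
    return accepted, rejected, missing


accepted, rejected, missing = validate_response(
    response={"a": "spam", "b": "maybe", "z": "ham"},
    expected_ids={"a", "b", "c"},
    allowed_labels={"spam", "ham"},
)
print(accepted)         # {'a': 'spam'}
print(rejected)         # {'b': 'invalid label', 'z': 'unknown id'}
print(sorted(missing))  # ['b', 'c']
```

Item "b" ends up both rejected and missing: its answer is discarded, and its ID stays in the set to be retried.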

Figure (interactive in the original): thirty input items with stable IDs, processed in up to three attempts. On each attempt every pending item is independently omitted with probability p; items still missing after three attempts receive the fallback value. Under this model, P(item still missing after 3 attempts) = p^3, which gives the expected fallback count before anything is run.
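The closed-form prediction is easy to tabulate. This sketch assumes, as the figure does, that omissions are independent across items and attempts with the same probability p; real models tend to be less well behaved, so treat the numbers as a rough guide only.

```python
def expected_fallbacks(n_items, p_omit, attempts=3):
    """Expected number of items still missing after all attempts: n * p^k."""
    return n_items * p_omit ** attempts


for p in (0.1, 0.3, 0.5):
    print(f"p={p}: {expected_fallbacks(30, p):.2f} of 30 items fall back")
# p=0.1: 0.03 of 30 items fall back
# p=0.3: 0.81 of 30 items fall back
# p=0.5: 3.75 of 30 items fall back
```

At a 10% per-attempt omission rate, three attempts leave almost nothing for the fallback; at 50%, a few items per batch still end up there.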

Properties of a robust catch-up loop

Set difference is a clean way to determine what is missing: take the set of all IDs, subtract the set of already-processed IDs, get remaining_ids. Iterating the source list and checking in results works equally well in big-O terms when results is a dict — both approaches are O(n) per pass. Set difference just expresses the intent more directly: "all the IDs we still owe an answer for".
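Both variants side by side, as a small sketch on made-up data:

```python
items = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
results = {"a": "spam"}  # answers received so far

# Variant 1: set difference, reads as "all the IDs we still owe an answer for".
remaining_by_set = {item["id"] for item in items} - set(results)

# Variant 2: scan the source list; a dict membership check is O(1) on average.
remaining_by_scan = [item["id"] for item in items if item["id"] not in results]

print(sorted(remaining_by_set))  # ['b', 'c']
print(remaining_by_scan)         # ['b', 'c']
```

One practical difference: the scan keeps the input order, while a set has no defined order. If the order of items in the retry prompt matters for reproducibility, build the retry batch from the scan or sort the set first.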

Bounded retries are mandatory. Without a hard cap on attempts, a single stubborn element that the model never processes will spin you in an infinite loop. Three attempts is usually enough.

Fallback values are less satisfying than a real answer, but better than an exception in production or a loop that retries forever. The pipeline keeps moving, the user sees "Unknown" for a couple of elements, and one stuck item does not bring the whole job down.

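Idempotency is what makes retry safe. With temperature=0 (modulo the bit-exactness caveats from the temperature section above), a repeat request for the same element returns the same answer, so the catch-up loop converges on the same results from run to run. Progress on skipped items comes from the retry prompt itself: it contains only the missing items, so it is a different, shorter request than the one that dropped them. With higher temperatures, retries can diverge from earlier attempts, and the result becomes unstable between runs of the same pipeline.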