Chapter 11 of 25
The KV Cache in One Formula
Created May 28, 2026 Updated Jun 7, 2026
LLM decoders generate tokens one at a time. At every step the model needs Q, K, and V for the new token, plus the K and V of every previous token, to compute attention. Recomputing all previous keys and values at every decoding step would redo work that has already been done, which becomes prohibitively expensive as the sequence grows. So inference engines store them in what is called the KV cache:
KV cache bytes ≈ 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element
Each factor maps to a concrete decision:
- 2 — one tensor for K, one for V.
- layers — every transformer block has its own attention, so every block keeps its own cache.
- kv_heads × head_dim — the per-token K and V vector width. Multi-head attention splits the hidden state into kv_heads slices of head_dim each.
- seq_len — every cached token pays that cost again. This is the linear-in-context-length term.
- batch — every concurrent request has its own cache.
- bytes_per_element — FP16 / BF16 = 2 bytes, FP8 ≈ 1, INT4 ≈ 0.5 (approximate — real quantized caches add packing and per-block scale/metadata overhead).
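The formula translates almost directly into code. The sketch below is only an estimate under stated assumptions: the function name `kv_cache_bytes` and the `SHAPE` values (80 layers, 8 KV heads, head_dim 128) are illustrative choices for a 70B-class model with grouped-query attention, and the quantized widths ignore the packing and scale overhead mentioned above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_element=2.0):
    """Approximate KV cache size in bytes, following the formula term by term."""
    return (
        2                      # one tensor for K, one for V
        * layers               # every block keeps its own cache
        * kv_heads * head_dim  # per-token K (or V) vector width
        * seq_len              # linear in context length
        * batch                # one cache per concurrent request
        * bytes_per_element    # storage width of each cached value
    )


# Assumed, illustrative shape for a 70B-class model with grouped-query attention.
SHAPE = dict(layers=80, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    # Same workload (8 requests of 4096 tokens) at three nominal element widths.
    for name, width in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
        size = kv_cache_bytes(**SHAPE, seq_len=4096, batch=8, bytes_per_element=width)
        print(f"{name}: {size / 2**30:.2f} GiB")
```

Because every factor enters as a plain product, halving any one of them (element width, KV heads, context, batch) halves the estimate.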
For a 70B-class model at long context this can reach multiple GB per request. With an assumed grouped-query shape of 80 layers, 8 KV heads, and head_dim 128 at FP16, the cache costs 320 KiB per token, so one 32K-token request holds about 10 GiB. The model weights are huge but fixed once loaded; the KV cache grows with seq_len × batch. At short prompts it's a rounding error next to the weights, but as context and batch grow, generation increasingly becomes a memory problem, and past some point the KV cache, not the model weights, is the serving bottleneck.
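To get a feel for where that crossover sits, one can divide the weight footprint by the per-token cache cost. This is a rough sketch, not a capacity planner: it assumes 70e9 parameters stored at 2 bytes each and the same illustrative shape (80 layers, 8 KV heads, head_dim 128, FP16 cache). The helper names are hypothetical.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element=2):
    """Cache bytes added by one token of one request (K and V, all layers)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def crossover_tokens(weight_bytes, per_token_bytes):
    """Total cached tokens (seq_len x batch) at which cache size equals weight size."""
    return weight_bytes / per_token_bytes


# Assumptions: 70e9 parameters at 2 bytes each; illustrative GQA shape.
WEIGHT_BYTES = 70e9 * 2
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
TOKENS = crossover_tokens(WEIGHT_BYTES, PER_TOKEN)

if __name__ == "__main__":
    print(f"per-token cache: {PER_TOKEN / 1024:.0f} KiB")
    print(f"cache matches weights at about {TOKENS:,.0f} cached tokens")
    for seq_len in (2_048, 8_192, 32_768):
        # Smallest batch at which the cache is at least as large as the weights.
        print(seq_len, math.ceil(TOKENS / seq_len))
```

Under these assumptions the cache overtakes the weights at roughly 427,000 cached tokens in total, which is about 53 concurrent requests at 8K context or 14 at 32K.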
A useful frame: the cache is read from HBM at every decode step, so decoding is typically memory-bandwidth-limited rather than compute-limited. This is why the modern attention variants (MQA, GQA, MLA) all attack the same factor in the formula, kv_heads × head_dim: shrink what the cache stores per token, so less of it has to be loaded each step.
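The bandwidth argument can be turned into a back-of-the-envelope bound: if each decode step must stream the weights and the whole cache from HBM once, step time cannot drop below bytes read divided by bandwidth. The sketch below rests on assumptions throughout: the 2e12 bytes/s bandwidth is a round hypothetical figure, the model has 64 query heads, and the three `kv_heads` settings stand in for MHA, GQA, and MQA. MLA is not modeled here because it caches a compressed latent rather than per-head K and V.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element=2):
    """Cache bytes added by one token of one request (K and V, all layers)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def decode_step_floor_seconds(weight_bytes, cache_bytes, hbm_bytes_per_second):
    """Lower bound on one decode step if weights and cache are each read once."""
    return (weight_bytes + cache_bytes) / hbm_bytes_per_second


# Assumptions: 64 query heads, 80 layers, head_dim 128, FP16 cache and weights.
VARIANTS = {"MHA": 64, "GQA": 8, "MQA": 1}
WEIGHT_BYTES = 70e9 * 2
HBM_BYTES_PER_SECOND = 2e12  # hypothetical round number, not a specific device

if __name__ == "__main__":
    cached_tokens = 32_768 * 8  # seq_len x batch
    for name, kv_heads in VARIANTS.items():
        per_token = kv_bytes_per_token(80, kv_heads, 128)
        cache = per_token * cached_tokens
        floor = decode_step_floor_seconds(WEIGHT_BYTES, cache, HBM_BYTES_PER_SECOND)
        print(f"{name}: {per_token / 1024:.0f} KiB/token, step floor {floor * 1e3:.1f} ms")
```

The bound ignores compute, kernel overheads, and any overlap, so real step times will be higher. Its purpose is to show that cutting `kv_heads` reduces the bytes moved per step in direct proportion.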
Once you can read the six factors, every architecture choice in the attention-variant landscape becomes legible. Full breakdown: Attention Is All You Need — But Not All Attention Is the Same and How LLM Generation Works.