lenatriestounderstand

Chapter 11 of 25

The KV Cache in One Formula

Created May 28, 2026 Updated Jun 7, 2026

LLM decoders generate tokens one at a time. At every step the model computes Q, K, and V for the new token, then attends over the K and V of every previous token. Recomputing all previous keys and values at every decoding step would redo work that has already been done, which becomes prohibitively expensive as the sequence grows. So inference engines store them in the KV cache, whose size is approximately:

KV cache bytes ≈ 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element

Each factor maps to a concrete decision:

  • 2 — one tensor for K, one for V.
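  • layers — every transformer block has its own attention, so every block keeps its own cache.
  • kv_heads × head_dim — the per-token width of each K and V vector. Attention projects each token's hidden state into kv_heads key/value heads of head_dim each; in standard multi-head attention kv_heads equals the number of query heads, while GQA and MQA use fewer.
  • seq_len — every cached token adds the same per-token cost. This is the linear-in-context-length term.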
  • batch — every concurrent request has its own cache.
  • bytes_per_element — FP16 / BF16 = 2 bytes, FP8 ≈ 1, INT4 ≈ 0.5 (approximate — real quantized caches add packing and per-block scale/metadata overhead).
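
As a rough illustration, the formula translates directly into a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` and the example configuration (80 layers, 8 KV heads, head_dim 128, FP16) are assumptions chosen for illustration, not values given in this chapter, and the result ignores quantization metadata and allocator overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_element):
    """Approximate KV cache size in bytes (no scale/metadata or allocator overhead)."""
    k_and_v = 2  # one tensor for K, one for V
    return k_and_v * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element


# Hypothetical 70B-style configuration, chosen only to make the numbers concrete.
example = kv_cache_bytes(
    layers=80,
    kv_heads=8,
    head_dim=128,
    seq_len=4096,
    batch=1,
    bytes_per_element=2,  # FP16 / BF16
)

print(f"{example:,} bytes = {example / 2**30:.2f} GiB")
```

Under these assumptions a single 4,096-token request occupies 1.25 GiB; doubling seq_len or batch doubles the result, and halving bytes_per_element halves it.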

For a 70B-class model at long context this can reach several GB per request: assuming 80 layers, 8 KV heads, head_dim 128, and FP16, a single 32K-token request needs about 10 GiB. The model weights are huge but fixed once loaded; the KV cache grows with seq_len × batch. At short prompts it's a rounding error next to the weights, but as context and batch grow, generation increasingly becomes a memory problem, and past some point the KV cache, not the model weights, is the serving bottleneck.
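
To get a feel for where that crossover sits, one can estimate how many cached tokens (seq_len × batch) it takes for the cache to match the weights in size. The sketch below assumes a hypothetical model with 70 billion parameters stored in FP16, 80 layers, 8 KV heads, and head_dim 128; all of these numbers are illustrative assumptions, and real deployments differ.

```python
# Assumed, illustrative model shape (not taken from any specific checkpoint).
PARAMS = 70_000_000_000
BYTES_PER_ELEMENT = 2  # FP16 for both weights and cache
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

WEIGHT_BYTES = PARAMS * BYTES_PER_ELEMENT

# Cache cost of one token of one request: 2 (K and V) x layers x kv_heads x head_dim x bytes.
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT

# Total cached tokens (seq_len x batch) at which the cache is as large as the weights.
crossover_tokens = WEIGHT_BYTES // KV_BYTES_PER_TOKEN


def context_at_crossover(batch):
    """Per-request context length at which cache size reaches weight size."""
    return crossover_tokens // batch


print(f"per-token cache: {KV_BYTES_PER_TOKEN / 1024:.0f} KiB")
print(f"crossover at about {crossover_tokens:,} cached tokens in total")
for batch in (1, 8, 32, 128):
    print(f"batch {batch:>3}: about {context_at_crossover(batch):,} tokens per request")
```

With these assumptions each cached token costs 320 KiB, so a batch of 32 requests reaches weight-sized cache at roughly 13K tokens of context each.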

A useful frame: the entire cache is read from HBM at every decode step, so decoding is typically memory-bandwidth-limited rather than compute-limited. This is why modern attention variants such as MQA, GQA, and MLA all attack the same part of the formula, the per-token K/V width: MQA and GQA reduce kv_heads, and MLA caches a compressed latent in place of full K and V, so less has to be loaded each step.
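
The bandwidth framing can be made concrete with a lower-bound estimate: the time to stream the cache once is at least cache bytes divided by memory bandwidth. The sketch below compares kv_heads settings for a hypothetical model with 64 query heads (MHA keeps 64 KV heads, GQA here keeps 8, MQA keeps 1). The 2 TB/s bandwidth figure, the model shape, and the helper names are assumptions for illustration only. MLA is left out because its compressed latent does not map onto the kv_heads factor directly.

```python
def per_token_kv_bytes(layers, kv_heads, head_dim, bytes_per_element=2):
    """Bytes of K and V stored per token of one request."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def min_cache_read_seconds(cache_bytes, hbm_bytes_per_second):
    """Lower bound on the time to read the whole cache once (ignores weights and compute)."""
    return cache_bytes / hbm_bytes_per_second


# Assumed, illustrative values.
LAYERS, HEAD_DIM, SEQ_LEN = 80, 128, 32_768
HBM_BANDWIDTH = 2e12  # hypothetical 2 TB/s, not a spec for any particular device
variants = {"MHA": 64, "GQA": 8, "MQA": 1}  # kv_heads per variant

results = {}
for name, kv_heads in variants.items():
    cache = per_token_kv_bytes(LAYERS, kv_heads, HEAD_DIM) * SEQ_LEN  # batch of 1
    seconds = min_cache_read_seconds(cache, HBM_BANDWIDTH)
    results[name] = (cache, seconds)
    print(f"{name}: {cache / 2**30:6.2f} GiB cache, >= {seconds * 1e3:6.2f} ms per decode step")
```

Because the estimate is linear in kv_heads, cutting KV heads by 8x cuts both the cache size and the minimum per-step read time by 8x under these assumptions.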

Once you can read the six factors, every architecture choice in the attention-variant landscape becomes legible. Full breakdown: Attention Is All You Need — But Not All Attention Is the Same and How LLM Generation Works.