lenatriestounderstand

Chapter 4 of 25

What is Multi-Query Attention?

Created May 27, 2026 Updated May 27, 2026

Multi-Query Attention (MQA) is a decoder attention variant that reduces KV-cache memory by sharing keys and values across attention heads, while keeping separate query heads. To see why that matters, start with the KV cache itself.

LLM decoders generate text token by token. To avoid recomputing past keys and values at every step, they cache them — the KV cache. Its size:

KV cache bytes ≈ 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element

Where kv_heads depends on the attention variant:

  • MHA: kv_heads = query_heads — one K/V set per query head.
  • GQA: kv_heads = number of KV groups — a few K/V sets shared across groups of query heads.
  • MQA: kv_heads = 1 — single K/V set shared across all heads.

For a full-MHA 70B-class model at long context, this can reach multiple GB per request. With batching, the cache often becomes the serving bottleneck. The weights are huge but fixed once loaded; the KV cache grows with sequence length and batch size, so long-context serving often becomes limited by KV-cache memory and memory bandwidth.

In standard multi-head attention, every head has its own projections — Qᵢ, Kᵢ, Vᵢ. During autoregressive decoding the new token produces new queries, but all previous keys and values stay in the cache. The cache stores K and V for every layer, every past token, and every attention head.

MQA changes only the K/V side. Query heads stay separate, so different heads can still ask different attention questions. But they all attend over the same shared K and V memory. Relative to full MHA, a 64-query-head model with MQA cuts the K/V-head part of the cache 64×, because 64 K/V heads become 1.

The quality cost can be real, especially for pure MQA. Shared K/V is a strict subset of what full multi-head attention could express. Grouped-Query Attention (GQA) exists because it keeps most of the KV-cache savings while preserving more attention-head diversity — instead of one K/V set for all heads or one per head, it groups heads into a small number of K/V sets. GQA became a common default in modern open LLMs, though the exact number of KV groups varies by model family and size.

DeepSeek's Multi-head Latent Attention (MLA) goes in a different direction: instead of sharing K/V heads, it stores a compressed latent representation and reconstructs K/V through low-rank key-value joint compression.

Full breakdown of MHA / MQA / GQA / MLA and the trade-off each one is making: see Attention Is All You Need — But Not All Attention Is the Same.