lenatriestounderstand

Chapter 10 of 25

What is Grouped-Query Attention (GQA)?

Created May 28, 2026 Updated Jun 7, 2026

Grouped-Query Attention (GQA) is the middle point between full Multi-Head Attention (MHA) and Multi-Query Attention (MQA). Query heads are partitioned into a small number of groups, and each group shares one K/V set. If a model has 64 query heads and 8 KV groups, every 8 query heads attend over the same K and V — so the KV cache stores 8 K/V sets per layer instead of 64.

The whole reason this variant exists is the KV cache. Its size scales with kv_heads:

KV cache bytes ≈ 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element
  • MHA: kv_heads = query_heads — max diversity, max memory.
  • MQA: kv_heads = 1 — max savings, all query heads forced through one shared K/V.
  • GQA: kv_heads = number of groups — typically 4 or 8 — a tunable point between the two.

The intuition behind why this works: query heads asking different questions is what gives multi-head attention its expressivity. Empirically, K/V heads appear more shareable than query heads — so sharing K/V in groups loses much less than sharing queries, and you cut KV-cache bytes by query_heads / kv_groups. The win isn't only cache size: with fewer K/V heads to stream from HBM on every decode step, GQA also cuts the memory-bandwidth traffic that actually bottlenecks autoregressive decoding — often the real constraint, not capacity.

What made GQA take over open-source defaults isn't just the trade-off curve. It's that you can convert an existing MHA checkpoint into a GQA checkpoint cheaply — a small uptraining run, cheap compared with pretraining, recovers almost all the quality. Many open models use GQA, including the larger Llama 2 variants (34B, 70B), Llama 3, Mistral, and Qwen. Pure MQA was usually too aggressive for production quality; full MHA was too expensive at long context. GQA hit a usable point in the middle at a fraction of the cost of training one from scratch.

GQA is a common, durable default across open LLM families. The frontier picture is more varied — some newer architectures use compressed latent-state alternatives such as Multi-head Latent Attention (MLA), used in the DeepSeek line — but that's diversification, not GQA going obsolete; it remains a standard production choice.

The full design space — MHA, MQA, GQA, MLA, sliding window, sparse, linear, position encodings, attention sinks — is in Attention Is All You Need — But Not All Attention Is the Same.