Chapter 6 of 25

What is Multi-head Latent Attention (MLA)?

Created May 27, 2026 Updated May 27, 2026

Multi-head Latent Attention (MLA) is a decoder attention variant introduced in DeepSeek-V2 (2024) that reduces KV-cache memory by storing a low-rank latent representation for each token and using learned projections to produce the K/V information needed by attention at decode time. Conceptually, this reconstructs per-head K/V from the latent; efficient implementations may absorb parts of the projections to avoid materializing the full tensors.

It targets the same problem as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA): during long-context decoding, the KV-cache becomes the memory and bandwidth bottleneck. MLA simply picks a different point in the trade-off space.

The MQA / GQA strategy: share K and V across attention heads. Fewer K/V heads stored → smaller cache → less data to read from memory at each decode step. The cost is reduced attention-head diversity.

The MLA strategy: store a single compressed latent vector per token instead of full per-head K and V. At attention time, learned projections turn the latent into the K and V each head needs. More per-head diversity is preserved than in MQA / GQA, because heads are not forced to share the same K/V tensors — the compression is still a bottleneck, just a different one.

In other words: GQA saves cache by reducing the number of KV heads; MLA saves cache by reducing what is stored per token. MLA replaces a sharing bottleneck with a rank bottleneck.
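To make the cache-then-project idea concrete, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the toy sizes, the random weights, and the names W_down, W_up_k, W_up_v, cache_token, reconstruct_kv, score_naive and score_absorbed are hypothetical and are not taken from DeepSeek's code. Softmax, scaling and positional encoding are left out. It shows two things: only one latent vector is stored per token, and the key up-projection can be folded into the query so that per-head keys never need to be materialized.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen only for illustration (not a real model configuration).
D_MODEL, D_LATENT, N_HEADS, D_HEAD = 16, 4, 4, 8

rng = random.Random(0)
W_down = rand_matrix(D_LATENT, D_MODEL, rng)  # shared: hidden state -> latent
W_up_k = [rand_matrix(D_HEAD, D_LATENT, rng) for _ in range(N_HEADS)]  # per head: latent -> K
W_up_v = [rand_matrix(D_HEAD, D_LATENT, rng) for _ in range(N_HEADS)]  # per head: latent -> V

def cache_token(hidden):
    """What is stored per token: one latent vector, not per-head K and V."""
    return matvec(W_down, hidden)

def reconstruct_kv(latent, head):
    """Conceptual view: rebuild this head's K and V from the cached latent."""
    return matvec(W_up_k[head], latent), matvec(W_up_v[head], latent)

def score_naive(query, latent, head):
    """Attention logit computed by materializing the head's key first."""
    key, _ = reconstruct_kv(latent, head)
    return dot(query, key)

def score_absorbed(query, latent, head):
    """Same logit with the K up-projection folded into the query:
    q . (W c) == (W^T q) . c, so the key itself is never built."""
    q_latent = matvec(transpose(W_up_k[head]), query)
    return dot(q_latent, latent)

hidden = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
query = [rng.uniform(-1.0, 1.0) for _ in range(D_HEAD)]
latent = cache_token(hidden)

cached_mla = len(latent)            # elements cached per token with the latent
cached_mha = 2 * N_HEADS * D_HEAD   # elements cached per token with full K and V
```

With these toy sizes the latent holds 4 numbers per token where full per-head K and V would hold 64, and score_naive and score_absorbed agree up to floating-point error for every head. Each head still gets its own K and V because W_up_k and W_up_v differ per head; what the heads share is the latent, which is where the rank bottleneck comes from.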

DeepSeek-V2's paper describes this as low-rank key-value joint compression: K and V are factorized through a shared low-dimensional latent space. The cached latent has far fewer dimensions than the full kv_heads × head_dim K and V it stands in for, and per-head Kᵢ and Vᵢ are reconstructed from it at attention time instead of being stored. Queries are not part of the KV-cache: Qᵢ is computed from the current token at each step.
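The factorization can be written out as a few equations. The notation loosely follows the DeepSeek-V2 paper, but the symbols are introduced here for illustration: h_t is the hidden state of token t, c_t^{KV} is the cached latent of width d_c, W^{DKV} is the shared down-projection, W_i^{UK} and W_i^{UV} are the up-projections for head i, and n_h and d_h are the number of heads and the head dimension. Positional-encoding handling is left out.

```latex
\begin{aligned}
c_t^{KV} &= W^{DKV} h_t, & W^{DKV} &\in \mathbb{R}^{d_c \times d}, \quad d_c \ll n_h d_h \\
k_{t,i} &= W_i^{UK} c_t^{KV}, & v_{t,i} &= W_i^{UV} c_t^{KV}, \quad i = 1, \dots, n_h \\
q_{s,i}^{\top} k_{t,i} &= q_{s,i}^{\top} W_i^{UK} c_t^{KV}
  = \bigl( (W_i^{UK})^{\top} q_{s,i} \bigr)^{\top} c_t^{KV}
\end{aligned}
```

Only c_t^{KV} is cached. The last line is the absorption identity: the key up-projection can be applied to the query instead, so the attention logit is computed directly against the latent without building k_{t,i}.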

Trade-off space:

MHA: full per-head K/V, no compression, highest memory.
GQA: K/V shared across groups of query heads. Some head diversity preserved. Common open-source default.
MQA: K/V shared across all heads. Strongest cache reduction. Largest quality cost.
MLA: low-rank latent; learned projections give each head the K/V it needs. More per-head diversity than MQA / GQA, at the cost of a rank-limited representation and extra projection compute per step.
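The list above can be turned into rough cache arithmetic. The sketch below uses a hypothetical configuration (32 query heads, head dimension 128, 32 layers, a 32,768-token context, 2-byte elements, 8 KV heads for GQA, and a latent width of 512); none of these numbers describe a specific model. It counts only the K/V or latent entries and ignores anything else an implementation might cache.

```python
def shared_kv_elems(kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer when K and V are stored for kv_heads heads."""
    return 2 * kv_heads * head_dim

def cache_gib(elems_per_token: int, n_layers: int, seq_len: int,
              bytes_per_elem: int = 2) -> float:
    """Total cache size in GiB for one sequence."""
    return elems_per_token * n_layers * seq_len * bytes_per_elem / 2**30

# Hypothetical configuration, for illustration only.
N_HEADS, HEAD_DIM, N_LAYERS, SEQ_LEN = 32, 128, 32, 32768
GQA_KV_HEADS = 8
LATENT_DIM = 512

per_token = {
    "MHA": shared_kv_elems(N_HEADS, HEAD_DIM),       # 8192
    "GQA": shared_kv_elems(GQA_KV_HEADS, HEAD_DIM),  # 2048
    "MQA": shared_kv_elems(1, HEAD_DIM),             # 256
    "MLA": LATENT_DIM,                               # 512: one latent, no separate K and V
}

gib = {name: cache_gib(elems, N_LAYERS, SEQ_LEN) for name, elems in per_token.items()}

for name in per_token:
    print(f"{name}: {per_token[name]:>5} elems/token/layer, {gib[name]:.1f} GiB")
```

Under these assumptions the cache for one sequence is 16 GiB for MHA, 4 GiB for GQA, 0.5 GiB for MQA and 1 GiB for MLA. MQA remains the smallest here, which matches the list above: MLA is not claiming the smallest possible cache, but a small cache without forcing all heads onto one shared K/V.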

MLA shines in decode-throughput-limited deployments, where the bottleneck is repeatedly loading K/V from memory rather than per-step compute. It is the attention mechanism of the DeepSeek-V2 / V3 model family: introduced and validated in V2, then carried forward into V3 as a load-bearing piece of their inference economics.

Full breakdown of MHA / MQA / GQA / MLA and the trade-off each one is making: see Attention Is All You Need — But Not All Attention Is the Same.