Chapter 2 of 10
Attention Is All You Need — But Not All Attention Is the Same
Created May 9, 2026
In the first note on LLM generation, attention could be explained as a single operation: each token builds a query, a key, and a value, compares its query against the keys of previous tokens, gets weights through softmax, and mixes the values. That is the right mental model. It explains why a transformer can connect words, facts, and pieces of context without any recurrent state; why GPT-like models use a causal mask; why long context is expensive; and why the KV-cache matters so much during inference.
But by 2026, that mental model is no longer enough.
Many modern LLMs are no longer just "decoder-only transformers with standard multi-head attention" in the naive sense. The transformer skeleton is still recognizable on the inside: a residual stream, attention-like sequence mixing, MLP/FFN blocks, normalization, positional encoding, logits at the output. But attention itself has stopped being one thing. It has turned into a design space — a set of decisions about what memory to keep, which tokens get exact access to which others, how positions are encoded, how much KV-cache you can afford, where to pay quadratic cost, and where to replace exact access with a compressed state.
That is why the phrase "attention is all you need" sounds almost ironic today. Attention is still needed. But "attention" can now mean very different things.
The old mental model: full multi-head attention
Classical transformer attention looks like this:
tokens → embeddings → Q, K, V
attention_weights = softmax(QKᵀ / √d)
output = attention_weights · V
In a decoder-only LLM, each new token can only attend to previous tokens, because the future does not exist yet. This is causal attention. For each layer and each attention head, the model stores the keys and values of tokens it has already processed, so that generating the next token does not require recomputing everything from scratch. That stored state is the KV-cache.
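The three-line formula above can be made concrete with a minimal sketch of single-head causal attention in plain Python. The function names are illustrative, vectors are plain lists, and a real implementation would use batched tensor operations. The scale here is the square root of the query/key dimension.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention over lists of vectors (one per token)."""
    d = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        # Token t may only score against tokens 0..t (the causal mask).
        scores = [sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Mix the visible value vectors with the attention weights.
        mixed = [sum(w * V[s][j] for s, w in enumerate(weights))
                 for j in range(len(V[0]))]
        out.append(mixed)
    return out
```

In this sketch the first token can only attend to itself, so its output is exactly its own value vector. During decoding, the rows of K and V for earlier tokens are what a KV-cache would keep instead of recomputing.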
In standard multi-head attention, every head has its own Q, K, and V projections. This is expressive: different heads can look at different kinds of relationships — local syntax, entity references, indentation in code, position patterns, quotations, lists, repeating formats. But you pay for that expressivity.
The price is not only that full attention is quadratic in sequence length during prefill. Decoding has its own, very practical pain: the KV-cache. For every previous token, every layer, and every KV-head, the model has to store K and V vectors. The longer the context, the larger the batch, the wider the model, and the more layers it has, the greater the memory pressure. In long-context inference you often hit the wall not on "attention math" as an abstract formula, but on memory bandwidth and KV-cache size.
This is why modern attention variants are best understood not as attempts to make attention "smarter," but as answers to an engineering question:
How much exact access to past tokens do we actually need, and how much memory are we willing to pay for it?
MHA, MQA, GQA: how many K/V memories do we keep?
The first big family of changes does not touch the idea of attention itself. It changes the number of key/value memories that the model maintains.
MHA — Multi-Head Attention. Many query heads, many key/value heads. Each head has its own K and V. Maximum flexibility, large KV-cache.
MQA — Multi-Query Attention. Many query heads, one shared key/value head. Query heads stay distinct, but K/V are shared. This drastically shrinks the KV-cache and speeds up decoding. The price is potential loss of expressivity: heads now ask different questions but look into the same shared memory.
GQA — Grouped-Query Attention. Many query heads, several shared key/value groups. Query heads are partitioned into groups; each group shares one K/V set. The practically important finding behind GQA is that you can uptrain an existing MHA checkpoint into a GQA checkpoint cheaply — a small fraction of the original training compute is enough to recover almost all the quality. That is why GQA propagated through Llama, Mistral, and Qwen so quickly: not just better in theory, but a near-free conversion.
GQA has become a common default in many open LLM families. The frontier picture is more mixed: the DeepSeek line uses MLA (below), and several recent flagship hybrids skip GQA in favor of compressed-state designs.
The important thing is that MQA and GQA do not make attention conceptually deeper. They solve a concrete inference pain: KV-cache. This is not a new theory of language understanding. It is memory engineering inside the transformer.
A useful mental model:
MHA → expensive, expressive, many K/V memories
MQA → cheap, aggressive sharing, one K/V memory
GQA → compromise, grouped K/V memories
If you look at an LLM as a memory system, MQA/GQA answer the question: how many copies of the past do we have to keep around for the model to still work well?
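One way to see MHA, MQA, and GQA as a single family is to write down which K/V head each query head reads from. The sketch below assumes query heads are assigned to groups in contiguous blocks, which is a common convention but not the only possible one. Both function names are hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def cached_values_per_token_per_layer(n_kv_heads, head_dim):
    # K and V each store head_dim numbers per KV head.
    return 2 * n_kv_heads * head_dim

# With 8 query heads: n_kv_heads=8 is MHA, 1 is MQA, anything between is GQA.
mapping_gqa = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

With 8 query heads and 2 KV heads, the first four query heads share KV head 0 and the last four share KV head 1. The cache size depends only on the number of KV heads, not on the number of query heads.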
Figure: the top row is query heads; the bottom row is what the KV-cache actually stores per token, per layer, for MHA (one K and V per Q-head), MQA (one shared K and V), GQA (Q-heads partitioned into groups, each sharing one K/V set), and MLA (a single compressed latent strip used by every head). Varying the number of GQA groups interpolates between MHA and MQA.
Position encoding and attention sinks: where the access pattern actually lives
It is tempting to think of attention as a content-only operation: tokens compare meanings, weights are determined by similarity. In a real transformer that is half of the picture. The other half is position. Without some way to mark where each token sits in the sequence, unmasked attention treats its input as an unordered set (it is permutation-equivariant), and the model literally cannot distinguish "the cat sat on the mat" from "the mat sat on the cat."
How position is injected into attention has become its own subfield, and it is inseparable from the long-context story.
RoPE — Rotary Position Embedding. The dominant choice in modern LLMs. Position is encoded by rotating the Q and K vectors as a function of position; relative position then falls out of the dot product. RoPE is used in many major recent model families, including Llama, Mistral, Qwen, DeepSeek, and Gemma.
ALiBi. Adds a position-dependent bias to attention scores instead of rotating Q/K. Cheaper, with mild length-extrapolation behavior. Used by BLOOM and several earlier models.
NoPE (no positional encoding). Surprisingly, decoder-only transformers can learn position implicitly through the causal mask. Pure NoPE underperforms RoPE for most tasks but appears as one component of some hybrid designs.
RoPE-scaling: PI (Position Interpolation), NTK-aware scaling, YaRN, LongRoPE. These are post-hoc techniques to extend a model's context length beyond what it was trained on, by reshaping how RoPE frequencies map to positions. Many long-context releases ("128K context," "1M context") use some form of RoPE scaling, interpolation, or related positional adaptation.
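The claim that relative position "falls out of the dot product" under RoPE can be checked numerically. The sketch below rotates adjacent coordinate pairs, which is one common convention (some implementations pair the first and second halves of the vector instead). The base of 10000 is the conventional default and is an assumption here, as are the function names.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles that grow with pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)      # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]
near = dot(rope(q, 5), rope(k, 3))        # query 2 positions after the key
far = dot(rope(q, 102), rope(k, 100))     # same offset, much later in the sequence
```

Here `near` and `far` agree up to floating-point error, because the score depends only on the offset between the two positions. In this sketch, Position Interpolation would amount to calling `rope(vec, pos / scale)` so that longer sequences reuse the position range seen in training; the other scaling schemes reshape the per-pair frequencies in more selective ways.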
The reason this matters for our story is that the access pattern of attention does not live only in the K/V layout. It also lives in how positions interact with that layout. MLA, for example, has a non-trivial technical detail precisely here: the latent compression has to be decoupled from RoPE, because RoPE is a position-dependent rotation that does not commute cleanly with the down-projection. Ignore RoPE and MLA looks simple; engineer with RoPE and you discover why the actual implementation is harder than the diagram.
There is one more empirical phenomenon that any modern long-context discussion has to acknowledge: attention sinks. Trained transformers tend to dump a large fraction of attention mass onto the very first few tokens of the sequence, even when those tokens are semantically meaningless. A common interpretation is that softmax forces every row to sum to 1, so heads need somewhere to "park" probability mass when there is no relevant content to attend to — and the first tokens, always visible to every later position, become a natural default sink.
This finding has practical consequences. Naive sliding-window decoding, which simply evicts the oldest tokens from the cache, can collapse once the first tokens fall out of the window: quality drops sharply because the sinks are gone. Many practical sliding-window-style systems therefore keep the first few tokens visible (an "attention sink" or "anchor" region) in addition to the sliding window. Without that, the architecture can look fine on paper and break in practice.
Sliding-window attention: long context is not always full access
In classical full attention, every token can attend to every previous token. Any distant fact in context is potentially directly accessible. For long contexts that gets expensive.
Sliding-window attention makes a harder choice: each token sees only a recent window of previous tokens.
full causal attention:
token t can attend to tokens 1..t
sliding-window attention:
token t can attend only to tokens t-w..t
sliding-window + sinks:
token t can attend to {1..k} ∪ {t-w..t}
This is cheaper, and it works well for many language dependencies, because much of local structure really is local: grammar, neighboring sentences, indentation in code, reasoning steps, list formatting. But the cost is obvious: if the relevant information sits well outside the window, there is no direct access to it.
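The three access rules above can be sketched as a single predicate. Positions are 0-indexed here, unlike the 1-indexed notation above, and the default window and sink sizes are arbitrary illustration values.

```python
def allowed(t, s, mode, window=4, sinks=2):
    """True if query position t may attend to key position s (0-indexed)."""
    if s > t:
        return False                      # causal: never look at the future
    if mode == "full":
        return True
    in_window = t - s <= window           # positions t-window..t
    if mode == "window":
        return in_window
    if mode == "window+sinks":
        return in_window or s < sinks     # first `sinks` tokens always visible
    raise ValueError(f"unknown mode: {mode}")

def density(n, mode, **kw):
    """Fraction of full-causal cells that stay legal under `mode`."""
    legal = sum(allowed(t, s, mode, **kw) for t in range(n) for s in range(t + 1))
    return legal / (n * (n + 1) // 2)
```

For example, `allowed(10, 0, "window")` is false because position 0 has fallen out of the window, while `allowed(10, 0, "window+sinks")` is true. `density` gives a rough sense of how much of the full-causal mask each scheme keeps as the sequence grows.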
Mistral 7B used sliding-window attention as its main mechanism (window 4096). The 2024–2026 lineage has tended toward mixed designs: Gemma 2 alternates sliding-window and full-attention layers; Llama 3 leaned on full attention plus RoPE-scaling; many recent models keep sliding-window only as one ingredient inside a hybrid. Pure sliding-window attention is now rarely presented as the whole long-context solution, partly because IO-aware kernels such as FlashAttention made full attention substantially more practical than early long-context discussions assumed.
The phrase "long context model" therefore does not always mean "every token has full exact access to everything in the long context." Sometimes long context is achieved through a mix of local attention, occasional global layers, special tokens, recurrence-like memory, RoPE-scaling, chunking, or other schemes for propagating information.
This is an important practical point for RAG and long-context QA. You can hand the model a long document and assume the truth is "in the context." But architecturally, access to that truth may be uneven. Some tokens have direct access, others indirect; some layers are local, others global; some information may propagate through compressed representations.
Long context is not always full context.
Sparse and global patterns: from hand-crafted to learned
Sliding-window attention is a special case of a broader idea: not all token-token interactions are equally important. You can build sparse patterns where most tokens look locally but a few positions get global access — special summary tokens, document-boundary tokens, question tokens, or selected global layers.
The first wave of sparse attention was mostly hand-designed: local windows, global tokens, fixed block patterns. It did not really win, for two reasons. The efficiency argument weakened once IO-aware kernels made full attention much cheaper, and the hand-designed patterns tended to leak quality on retrieval-heavy tasks.
The newer wave is more learned and hardware-aware: instead of hard-coding which positions get global access, the model learns which blocks deserve attention, with the sparsity pattern co-designed with the GPU kernel. Two examples worth knowing are NSA and MoBA — both train from scratch with learned sparse access, and both report strong long-context results while keeping sparse access patterns. MoBA in particular dynamically selects historical K/V blocks, somewhat in the spirit of mixture-of-experts applied to attention itself.
The general framing is unchanged. The architecture introduces a hierarchy:
local details → cheap local access
important anchors → wider access
global summaries → compressed propagation of information
This resembles how a human reads a long document. We do not retain every word with the same precision. We hold local detail, headings, summaries, key entities, links between parts. Modern attention variants are moving in a similar direction: memory becomes non-uniform.
But this brings a new failure mode. If an important fact never lands in the "right" place in the memory hierarchy, the model can have a long context formally and still fail to use the relevant evidence. This is one of the bridges between architecture and hallucination: lack of exact access to evidence pushes the model back toward priors, templates, and plausible completions.
Figure: the same lower-triangular causal mask, four ways: full causal (every prior token), sliding window (recent W only), window + sinks (window plus the first K positions always visible), and sparse global (window plus selected anchor columns). A density readout shows how many cells stay legal versus the full-causal baseline.
MLA: compressing the cache, not just sharing it
Multi-Head Latent Attention, introduced in the DeepSeek line and carried through their later models, is another answer to the KV-cache problem. If MQA/GQA ask "how many K/V heads should we keep?", MLA asks a different question:
Can we store a more compact latent representation, from which we later reconstruct the attention information we need?
In standard attention, the KV-cache stores K and V for every token / layer / head. In MLA, the model down-projects K/V into a low-dimensional latent vector that is what actually gets cached, and uses up-projections at attention time to recover per-head keys and values for the dot product. The cached state can be much smaller than standard MHA, and competitive with — or smaller than — aggressive head-sharing schemes, depending on configuration; meanwhile each head still has its own effective K/V at compute time.
The technical subtlety is RoPE. A naive low-rank decomposition of K/V breaks under rotary position encoding, because RoPE applies a position-dependent rotation that does not commute with the down/up-projection cleanly. MLA handles this by splitting each head into two channels: a RoPE-free part that goes through the latent compression, and a small RoPE part that bypasses it. That decoupling is what makes MLA work in practice; without it, the low-rank trick collides with positional encoding.
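The core storage idea can be sketched in a few lines: cache one small latent per token, and expand it into per-head keys and values only when attention is computed. This is a deliberately simplified illustration with toy dimensions and random weights. The names `W_down`, `W_up_k`, and `W_up_v` are hypothetical, and the decoupled RoPE channel described above is omitted entirely.

```python
import random

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_latent, n_heads, head_dim = 16, 4, 4, 8   # toy sizes, not a real config

W_down = rand_matrix(d_latent, d_model, rng)          # shared down-projection
W_up_k = [rand_matrix(head_dim, d_latent, rng) for _ in range(n_heads)]
W_up_v = [rand_matrix(head_dim, d_latent, rng) for _ in range(n_heads)]

def cache_entry(hidden):
    """What gets stored per token: one small latent vector."""
    return matvec(W_down, hidden)

def keys_and_values(latent):
    """Recover per-head K and V from the cached latent at attention time."""
    ks = [matvec(W, latent) for W in W_up_k]
    vs = [matvec(W, latent) for W in W_up_v]
    return ks, vs
```

With these toy sizes the cache holds 4 numbers per token instead of the 64 that per-head K and V would need (2 × 4 heads × 8 dims), yet each head still gets its own key and value after up-projection.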
The point is not that MLA "understands language better" on its own. The point is that it attacks one of the main bottlenecks of modern inference — the memory footprint of the KV-cache — more aggressively than head-sharing schemes can. It can also be more expressive than aggressive head-sharing schemes at a comparable cache size, because per-head distinctions are preserved at compute time even though storage is shared.
MLA is interesting precisely as a symptom of an architectural shift. Attention is no longer only about which tokens to compare. Attention is now also about in what form to store the past.
Linear, recurrent, and state-space attention: replacing explicit past with compressed state
A more aggressive step is to walk away from full explicit access to past tokens. Full attention stores the past as a set of key/value vectors: when needed, a new token can compare itself against every previous token representation. Powerful, but expensive.
Linear attention and recurrent attention try to replace this with a compressed state. Instead of storing all past K/V explicitly, the model maintains a state that is updated as new tokens are read.
full attention memory:
keep representations for many previous tokens
allow explicit token-level lookup
recurrent / linear memory:
update a compressed state
read from that state later
The lineage here is broader than "linear attention" suggests, and the terminology is worth disambiguating:
- Kernelized linear attention — Performer, Linear Transformer, and similar. Replace the softmax with a kernel feature map so attention can be reordered into a recurrent form. Cheap, mathematically clean, historically weak on recall.
- Gated linear attention — GLA, RetNet, DeltaNet. Add gating and decay to the recurrent update. Closes much of the recall gap with full attention while keeping linear cost.
- Selective state-space models — Mamba and Mamba-2. Not linear attention in the kernelized sense; a different formulation derived from state-space models with input-dependent state transitions. Empirically very competitive with transformers on language modeling.
- Recurrent variants in classical form — RWKV and xLSTM. RNN-shaped architectures redesigned for transformer-era scale and parallel training.
History is not stored as a list of all token memories; it is folded into a state. The win is obvious: cheaper long context, smaller KV-cache, faster decoding. The trade-off is that a compressed state can be worse for precise retrieval. If a task requires fishing a rare fact out verbatim from a distant location in context, full attention has a natural advantage — the relevant token is still sitting in memory. In a recurrent/linear scheme, that fact had to be written into the state correctly and not blurred by subsequent updates.
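The idea of folding history into a state can be shown with the simplest possible case: unnormalized causal linear attention with an identity feature map. That simplification is an assumption of this sketch; the named architectures above add feature maps, normalization, gating, or decay. The recurrent version keeps only a fixed-size matrix, while the explicit version looks back at every stored pair, and both give the same output.

```python
def linear_attention_recurrent(Q, K, V):
    """Unnormalized causal linear attention using a running state S = sum k v^T."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]       # fixed-size state
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):                    # write: S += outer(k, v)
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k))   # read: q^T S
                    for j in range(d_v)])
    return out

def linear_attention_explicit(Q, K, V):
    """Same result computed by looking back at every stored (k, v) pair."""
    out = []
    for t, q in enumerate(Q):
        acc = [0.0] * len(V[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, K[s]))
            for j in range(len(acc)):
                acc[j] += score * V[s][j]
        out.append(acc)
    return out
```

The state `S` never grows with sequence length, which is the memory win. It is also why retrieval is harder: every token is summed into the same matrix, so nothing in this plain form keeps one distant fact separate from later writes.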
For a long time the consensus was cautious: efficient and elegant, but not always good enough for every regime. By 2026 the consensus has become less dismissive. Gated linear attention and selective state-space models are no longer viewed as merely elegant but weak alternatives. They are competitive enough that the serious question has shifted from "can they work at all?" to "where do they still need exact attention?" — and that question is what motivates hybrid designs, rather than a wholesale replacement.
Hybrid attention: exact access where it matters, cheap memory elsewhere
The dominant trend is not to replace full attention entirely, but to use hybrids.
Some layers are lightweight: sliding-window, linear, gated linear, recurrent, state-space-like, local. Other layers keep the more expensive exact/global attention. The result is a memory hierarchy inside the model:
cheap layers:
move local information
compress history
support long context cheaply
expensive layers:
provide more exact access
help with retrieval-like operations
repair weaknesses of compressed memory
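A rough way to see what this hierarchy buys is to count cached token slots per layer. The sketch below assumes a hypothetical layout in which every N-th layer is full attention and the rest are sliding-window layers. It does not describe any named model's actual schedule, and recurrent or state-space layers would instead hold a constant-size state.

```python
def layer_schedule(n_layers, full_every):
    """Hypothetical pattern: every `full_every`-th layer is full attention."""
    return ["full" if (i + 1) % full_every == 0 else "window"
            for i in range(n_layers)]

def cached_tokens(schedule, context_len, window):
    """Total cached token slots across layers for one sequence."""
    return sum(context_len if kind == "full" else min(context_len, window)
               for kind in schedule)

schedule = layer_schedule(4, 2)            # window, full, window, full
hybrid = cached_tokens(schedule, 1000, 100)
all_full = cached_tokens(["full"] * 4, 1000, 100)
```

With these toy numbers the hybrid caches 2200 token slots against 4000 for four full-attention layers. The gap widens as the context grows, because only the full-attention layers scale with it.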
The 2024–2026 hybrid landscape is real and varied:
- Jamba — Mamba blocks interleaved with transformer blocks, plus MoE. One of the first production-scale hybrids.
- Zamba and Zamba 2 — Mamba backbone with shared-attention layers.
- Samba — Mamba combined with sliding-window attention.
- Hymba — a parallel hybrid where Mamba and attention heads sit side by side rather than in alternating layers.
- RecurrentGemma — Griffin-style gated linear recurrence with periodic attention.
- Falcon-Mamba — a pure Mamba model with no attention layers, at competitive scale; not a hybrid, but a useful attention-free reference point.
- Kimi Linear — a hybrid linear-attention architecture built around Kimi Delta Attention (in the gated DeltaNet lineage), with periodic full-attention-style layers in reported configurations.
Treat this list as a sample, not a ranking. Different mixes win on different axes — long-context recall, decoding speed, throughput, cost — and there is no single "best" hybrid as of 2026.
In 2024–2026 designs, the architectural border between transformer attention, recurrent memory, and state-space mechanisms has tended to get softer. Models often keep the transformer skeleton but vary the sequence-mixing blocks: some information flows through efficient finite-state memory, some through more expressive attention-like layers. You used to be able to say transformer = attention + MLP. It is more accurate now to say:
modern LLM = residual stream
+ several forms of sequence mixing
+ several forms of memory
+ MLP / experts
+ position encoding choices
+ inference-oriented implementation tricks
Attention does not disappear. It becomes one of several memory mechanisms.
Implementation-level: FlashAttention and KV-cache compression
The variants above all change the model — what it stores, what it can attend to, in what form. There is a parallel layer of techniques that does not change the model at all, only how it runs. These are easy to confuse with architectural variants and worth keeping separate.
FlashAttention: same attention, different hardware behavior
FlashAttention does not change the meaning of attention. It is still exact attention: the same softmax, the same result up to numerical detail. What changes is how it is computed on the GPU.
A naive implementation materializes the full attention matrix — for long sequences a giant n × n matrix. The bottleneck is data movement: reading and writing huge intermediate tensors between GPU HBM and faster on-chip memory is expensive. FlashAttention is IO-aware. It tiles the computation into blocks, loads chunks into fast SRAM, computes block by block, and avoids materializing the full attention matrix in memory. The backward pass uses recomputation rather than storing intermediate softmax outputs.
The technique evolved across generations. FlashAttention-2 (2023) improved work partitioning across GPU warps. FlashAttention-3 (2024) targets Hopper GPUs and exploits asynchrony and FP8. Flash-Decoding is a separate variant for the inference regime where the query is a single new token and the K/V is very long — exactly the long-context decoding case where naive attention becomes memory-bound.
standard attention:
materialize the full score matrix
store / read large intermediate tensors
memory traffic dominates
FlashAttention:
tile the computation
keep small blocks close to the compute
avoid storing the full attention matrix
same attention, better IO behavior
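The reason tiling can give the same result is that softmax can be computed incrementally with a running maximum and a running denominator. The sketch below shows only that arithmetic for a single query row in plain Python. It is not a GPU kernel and not FlashAttention's actual implementation, and the 1/√d scaling is left out for brevity.

```python
import math

def naive_attention_row(q, K, V):
    """Reference: materialize all scores for one query, then softmax and mix."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

def tiled_attention_row(q, K, V, block=2):
    """Same result, visiting K/V in blocks with a running max and denominator."""
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)          # rescale what was accumulated so far
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, Vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

The tiled version never holds more than one block of scores at a time, yet it matches the reference up to floating-point error for any block size. The speedup on real hardware comes from keeping those blocks in fast on-chip memory, which this sketch does not model.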
Architectural variants change the model. FlashAttention changes the kernel. Both matter; they answer different questions.
KV-cache compression: a parallel axis
MHA / MQA / GQA / MLA all decide the shape of the KV-cache at training time. There is a separate, model-agnostic family of techniques that compresses the KV-cache at inference time:
- Eviction policies. Keep only the tokens whose K/V actually carries useful attention mass and drop the rest; representative approaches include H2O, SnapKV, and StreamingLLM-style eviction. Empirically, in many regimes, a relatively small fraction of tokens accounts for most of the attention mass.
- Quantization. Store K and V in 4-bit or 2-bit instead of 16-bit; KIVI, KVQuant and similar work shows that quality loss can be small when quantization is designed around the empirical structure of K/V activations.
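- Cross-layer sharing. Share K/V across layers, not just across heads within a layer (CLA-style). Unlike eviction and quantization, this is normally built into the model at training time, so it sits closer to the architectural choices, but it is another orthogonal axis for cutting cache size.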
These techniques compose with the architectural choice. A GQA model can also be quantized; an MLA model can also have layers with eviction. "How big is the KV-cache" is therefore a product of two decisions: the architectural shape (set at training time) and the inference-time compression policy (set per deployment).
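To make the quantization axis concrete, here is a minimal sketch of 4-bit min/max quantization over one group of values. It is a generic illustration, not the scheme used by KIVI or KVQuant, which choose groupings and outlier handling with more care. A real implementation would also pack two 4-bit codes into each byte.

```python
def quantize_4bit(values):
    """Map floats to integers 0..15 using the group's own min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0        # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize_4bit(codes, lo, scale):
    """Approximate the original floats from the stored codes."""
    return [lo + c * scale for c in codes]
```

Each value is stored as one of 16 levels plus a shared offset and scale per group, so the round-trip error is at most half a quantization step. How much that error matters for attention quality is an empirical question that this sketch does not answer.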
A back-of-the-envelope example
For a Llama-3-70B-shaped model at 64K context, fp16, using only the architectural numbers:
- 80 layers × head dim 128 × 2 (K and V) × 2 bytes = 40 KB per token per KV-head.
- MHA-equivalent (64 KV heads): on the order of 2.5 MB per token, around 160 GB at 64K (counting 64K as 64 × 1024 tokens, in binary units).
- GQA (8 KV heads, what Llama-3 actually uses): on the order of 320 KB per token, around 20 GB at 64K.
- MLA-style compression can push this further down, depending on latent rank and implementation; reported reductions on DeepSeek's own architectures are very large relative to MHA.
- 4-bit KV quantization gives roughly another 4× cut on top of whichever architectural shape you started from.
These are back-of-the-envelope numbers, not a serving-cost estimate. Real memory footprint depends on batch size, tensor parallelism, dtype, padding, paged / block-based KV cache layouts, and framework overhead. The point is the order-of-magnitude gap between architectural choices, not a specific GB figure: that gap is the difference between "fits comfortably on one accelerator" and "needs a multi-GPU setup just to hold the cache."
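The arithmetic in the list above can be reproduced with a small helper. The function name and signature are hypothetical, and the result counts only K and V for a single sequence, with none of the runtime factors just mentioned.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Bytes of K and V stored for one sequence, ignoring all runtime overhead."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
    return per_token * context_len

GIB = 1024 ** 3
ctx = 64 * 1024
mha = kv_cache_bytes(80, 64, 128, ctx)   # hypothetical MHA version of the same shape
gqa = kv_cache_bytes(80, 8, 128, ctx)    # 8 KV heads, as in the example above
print(f"MHA: {mha / GIB:.0f} GiB, GQA: {gqa / GIB:.0f} GiB")
```

This prints `MHA: 160 GiB, GQA: 20 GiB`. Passing `bytes_per_value=0.5` models 4-bit storage and divides either figure by four.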
Figure: KV-cache size for MHA, MQA, GQA, and MLA on a log scale, by model size and context length, with optional KV quantization layered on top of the architectural choice. The two-orders-of-magnitude gap between MHA and MLA at long context is the reason architectural variant choice is not academic: it is what decides whether a configuration fits on a single accelerator or needs a multi-GPU shard just to hold the cache.
MoE: an orthogonal axis
For completeness: Mixture-of-Experts is not an attention variant. It changes the FFN block, not the attention block — each token is routed to a subset of expert MLPs instead of through one shared MLP. MoE and the attention choices above generally compose without conflict: DeepSeek V3 pairs MoE with MLA; Jamba combines Mamba, transformer, and MoE; Mixtral combines GQA with MoE.
It is mentioned here only because the line modern LLM = residual stream + sequence mixing + memory + MLP/experts + ... would otherwise be mysterious. Sequence mixing (attention and its relatives) and the FFN (dense or MoE) are two independent design axes; modern frontier models pick on both.
Attention variants solve different problems
The most common mistake is to lump all attention variants into one bucket — "different ways to make attention faster." Yes, almost all of them are tied to efficiency, but they solve different bottlenecks.
MHA:
baseline expressivity
many independent heads
expensive KV-cache
MQA:
reduce KV-cache aggressively
share K/V across query heads
faster decoding, possible quality trade-off
GQA:
compromise between MHA and MQA
grouped K/V sharing
common modern default
MLA:
low-rank latent compression of K/V
cache size competitive with aggressive head-sharing, at higher expressivity
requires care around RoPE
FlashAttention (and FA-2 / FA-3 / Flash-Decoding):
same mathematical attention
IO-aware GPU implementation
less memory traffic, faster kernels
KV-cache compression (H2O / SnapKV / KIVI / CLA):
inference-time and largely model-agnostic (CLA-style sharing is set at training time)
eviction, quantization, cross-layer sharing
Sliding-window attention:
local token access
cheaper long context
weaker direct long-range access; needs sinks
Sparse / global patterns (NSA, MoBA):
selective access
learned rather than hand-designed in modern variants
Linear / gated linear attention (GLA, RetNet, DeltaNet):
kernelized recurrent form
closed most of the recall gap with gating
State-space (Mamba, Mamba-2):
selective SSM, not kernelized linear attention
competitive with transformers on language modeling
Recurrent (RWKV, xLSTM):
RNN-shaped, transformer-era scale
Hybrid (Jamba, Zamba 2, Samba, Hymba, RecurrentGemma, Kimi Linear):
cheap memory + occasional exact access
dominant direction for long-context models
Position encoding (RoPE, ALiBi, NoPE; PI / NTK / YaRN / LongRoPE):
governs how position interacts with attention
central to long-context behavior
Attention sinks (StreamingLLM):
empirical phenomenon, not a variant
explains why naive sliding-window collapses
So "which attention is best?" is almost the wrong question. The right question is: better for what? Training throughput? Decoding latency? KV-cache size? Long-context retrieval? Copying rare tokens? Reasoning over many steps? Serving many users cheaply? Different variants optimize different axes.
Why this matters for understanding LLM behavior
At first glance all of this can look like low-level engineering. But attention design affects more than speed. It affects how the model uses context.
If the model has full attention, every token potentially has a direct path to every previous token. If the model has sliding-window attention, distant information has to propagate through layers or special patterns. If the model has recurrent or state-space memory, distant information has to be folded into a state. If the KV-cache is compressed or quantized, some token-level detail is represented differently than in standard MHA. If the position encoding is RoPE with NTK scaling, the effective long-context behavior may differ from a model trained with that context natively.
This does not mean efficient attention is necessarily worse. It is often practically better, because it lets you hand the model more context, larger batches, lower latency, cheaper serving. But it does mean that "context length" as a number on a model card does not fully describe the availability of information.
Two models can both support 1M context. But one may have more exact token-level access; the other, more compressed memory. One may be better at copying rare strings out of the middle of a document; the other, better at holding the overall topic over long text. One may be cheaper in production; the other, more stable on needle-in-a-haystack tasks.
The reasoning-model wave makes this sharper. Long thinking traces from o-series and R1-style models can produce very long generated sequences, and generated tokens also become part of the context the model must attend to. Attention design thus becomes a test-time-compute problem, not just a document-context problem, and the pressure is toward architectures with cheap-but-not-blurry memory, roughly the territory MLA and the better hybrids are aimed at. "How much can this model think before its KV-cache becomes a problem?" becomes a question of attention design, not just parameter count.
This is also why the architecture of attention matters for RAG, agents, and hallucination. If evidence is formally present in context, that does not yet mean it will be used. Between "evidence is present" and "the answer is faithful" sits the internal mechanics of access, mixing, and retention.
Conclusion
Attention really did turn out to be one of the central ideas of modern deep learning. But in a 2026 LLM, attention is no longer a single block you can explain once through Q, K, V and consider the topic closed.
It has decomposed into a design space. MHA, MQA, GQA, and MLA decide how many K/V memories to keep and in what form. Sliding-window, sparse, and learned-sparse variants restrict the access pattern. Linear, gated linear, and state-space variants replace explicit memory with a compressed state. Hybrid architectures mix these regimes. Position encodings shape how the access pattern interacts with sequence length. Attention sinks explain why some of these designs collapse without anchor tokens. FlashAttention and KV-cache compression sit alongside the architectural choices as a parallel implementation axis.
"Attention" is no longer a single operation from a formula. It is a family of ways to organize memory inside an LLM, plus a position scheme that decides how that memory is indexed, plus an implementation layer that decides how much of it actually fits on the GPU.
Attention is still what lets the model use context. But not all attention gives the same kind of memory.
So the question to ask of any model is not only how much context it accepts, but what kind of access to that context it actually has on the inside, and where that access breaks down.