lenatriestounderstand

Chapter 24 of 25

Attention Is a Kernel Method in Disguise

Created Jun 7, 2026 Updated Jun 7, 2026

Almost everyone who works with transformers can write the attention formula. Far fewer notice that its shape is sixty years old. Attention is kernel regression — the same weighted-average-by-similarity idea as Nadaraya-Watson smoothing — with one upgrade: the similarity is learned instead of fixed. This isn't a loose metaphor. The two formulas line up term for term, and the alignment is the cleanest way to understand both what attention is doing and how the "linear attention" family was derived.

Attention, for a single query. A query q attends over keys kⱼ and values vⱼ:

out(q) = Σⱼ  softmax(q · kⱼ / √d) · vⱼ
       = Σⱼ  [ exp(q · kⱼ / √d) / Σₗ exp(q · kₗ / √d) ] · vⱼ

That is a weighted average of the values, where each weight is the similarity between the query and that key, normalized to sum to 1.

Kernel regression (Nadaraya-Watson, 1964). To predict an output at a test point x from training pairs (xⱼ, yⱼ):

ŷ(x) = Σⱼ  [ K(x, xⱼ) / Σₗ K(x, xₗ) ] · yⱼ

A weighted average of the training outputs, where each weight is the kernel similarity between x and that training input, normalized to sum to 1. It is the same equation.

The correspondence is exact:

attention                      kernel regression
─────────────────────          ───────────────────────────
query     q              ↔     test point        x
keys      kⱼ             ↔     training inputs   xⱼ
values    vⱼ             ↔     training outputs  yⱼ
exp(q·kⱼ/√d)             ↔     kernel  K(x, xⱼ)
softmax normalization    ↔     Σ-normalization in N-W
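The correspondence can be checked numerically. The following is a minimal sketch, assuming NumPy and random toy data; the function names (`attention_single_query`, `nadaraya_watson`, `exp_dot_kernel`) are illustrative and not taken from any library. It computes single-query softmax attention and a generic Nadaraya-Watson estimate with the kernel exp(q·k/√d), then compares the two.

```python
import numpy as np

def attention_single_query(q, K, V):
    """Softmax attention for one query. q: (d,), K: (n, d), V: (n, d_v)."""
    d = q.shape[0]
    scores = K @ q / np.sqrt(d)
    scores = scores - scores.max()   # stabilises exp; the shift cancels in the ratio
    w = np.exp(scores)
    w = w / w.sum()                  # weights sum to 1
    return w @ V

def nadaraya_watson(x, X_train, Y_train, kernel):
    """Nadaraya-Watson estimate at x for an arbitrary kernel function."""
    k = np.array([kernel(x, xj) for xj in X_train])
    return (k / k.sum()) @ Y_train

def exp_dot_kernel(x, xj):
    """The kernel implied by scaled dot-product attention."""
    return np.exp(x @ xj / np.sqrt(x.shape[0]))

rng = np.random.default_rng(0)
n, d, d_v = 6, 4, 3                  # toy sizes, chosen arbitrarily
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))

out_attn = attention_single_query(q, K, V)
out_nw = nadaraya_watson(q, K, V, exp_dot_kernel)
print(np.allclose(out_attn, out_nw))
```

The two functions differ only in naming and in the max-subtraction used for numerical stability, which is the point: keys play the role of training inputs and values the role of training outputs.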

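So attention is Nadaraya-Watson kernel smoothing with kernel K(q, k) = exp(q · k / √d), evaluated over a "training set" that is the context itself. And that kernel is not exotic: if q and k are length-normalized (‖q‖ = ‖k‖ = 1), then ‖q − k‖² = 2 − 2·q·k, so q · k = 1 − ‖q − k‖² / 2 and

exp(q · k / √d)  =  e^(1/√d) · exp( −‖q − k‖² / (2√d) )  ∝  exp( −‖q − k‖² / (2√d) )

which is literally a Gaussian / RBF kernel with squared bandwidth ℓ² = √d. The constant factor cancels in the normalization, and softmax is just the normalization step Nadaraya-Watson already had.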
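A quick numerical check of the unit-norm identity, as a sketch assuming NumPy and random unit vectors (the helper `unit` and the variable `tau` are illustrative names). It verifies ‖q − k‖² = 2 − 2·q·k, that exp(q·k) and exp(−‖q − k‖²/2) differ only by the constant factor e, and that with the 1/√d scaling restored the softmax weights match normalized RBF weights whose squared bandwidth is √d. Trained queries and keys are generally not unit-norm, so this holds only under the length-normalization assumption stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5                                  # toy sizes, chosen arbitrarily

def unit(v):
    """Scale vectors to unit Euclidean length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

q = unit(rng.normal(size=d))
K = unit(rng.normal(size=(n, d)))

dot = K @ q                                  # q . k_j for every key
sq_dist = np.sum((K - q) ** 2, axis=1)       # ||q - k_j||^2 for every key

# For unit vectors, squared distance and dot product carry the same information.
identity_holds = np.allclose(sq_dist, 2 - 2 * dot)

# exp(q.k) and the Gaussian kernel differ by a constant factor (e).
ratio = np.exp(dot) / np.exp(-sq_dist / 2)

# With the 1/sqrt(d) scaling, the temperature acts as the squared bandwidth.
tau = np.sqrt(d)
softmax_w = np.exp(dot / tau)
softmax_w = softmax_w / softmax_w.sum()
rbf_w = np.exp(-sq_dist / (2 * tau))
rbf_w = rbf_w / rbf_w.sum()

print(identity_holds, np.allclose(softmax_w, rbf_w))
```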

Left: Nadaraya–Watson smoothing — a test point x returns a weighted average of training outputs yⱼ. Right: attention — a query q returns a weighted average of values vⱼ. Same positions, same weights, same output; only the names differ. The kernel bandwidth ℓ on the left is the temperature on the right.

Why the reframe pays off.

1. Linear attention falls right out. The O(n²) cost of attention comes from forming every query-key similarity. But suppose the kernel factors through a feature map, K(q, k) = φ(q) · φ(k). Then you can reassociate the sum:

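softmax form:   out = softmax(Q Kᵀ / √d) V                            → O(n²) in sequence length
kernel form:    out = [ φ(Q) ( φ(K)ᵀ V ) ] / [ φ(Q) ( φ(K)ᵀ 1 ) ]      → O(n)  in sequence length

Here 1 is the all-ones vector and the division is applied row by row. The denominator is the Nadaraya-Watson normalizer, and it reassociates the same way as the numerator.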

Compute φ(K)ᵀV once and reuse it for every query. That single algebraic move (replace the softmax with a kernel feature map so the products reassociate) is the Performer / Linear Transformer family. Much of the research line is "find a φ whose inner product approximates exp(q·k/√d)", as the Performer does with random features, or "find a cheap positive φ that works well in its place", as the Linear Transformer does with elu(x) + 1. The idea is hard to even state without the kernel view.
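The reassociation can be sketched in a few lines. This is a minimal, non-causal illustration assuming NumPy and the feature map φ(x) = elu(x) + 1 from the Linear Transformer; that φ defines its own kernel and does not approximate exp(q·k/√d). The function names are illustrative. Both functions compute the same kernel-weighted average; the first forms the n × n similarity matrix, the second never does.

```python
import numpy as np

def phi(x):
    """Feature map elu(x) + 1: strictly positive, so all weights stay positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_kernel_attention(Q, K, V):
    """Forms every query-key similarity explicitly: O(n^2) in sequence length."""
    A = phi(Q) @ phi(K).T                      # (n, n) kernel matrix
    return (A @ V) / A.sum(axis=1, keepdims=True)

def linear_kernel_attention(Q, K, V):
    """Same result, reassociated: O(n) in sequence length."""
    fQ, fK = phi(Q), phi(K)
    KV = fK.T @ V                              # (r, d_v), computed once
    z = fK.sum(axis=0)                         # (r,), normaliser summary, computed once
    return (fQ @ KV) / (fQ @ z)[:, None]       # reused for every query row

rng = np.random.default_rng(2)
n, d, d_v = 128, 16, 16                        # toy sizes, chosen arbitrarily
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))

out_quad = quadratic_kernel_attention(Q, K, V)
out_lin = linear_kernel_attention(Q, K, V)
print(np.allclose(out_quad, out_lin))
```

The two summaries `KV` and `z` have sizes that depend only on the feature and value dimensions, not on n, which is where the linear cost comes from. A causal (autoregressive) variant would need running prefix sums of these summaries instead of a single global sum; that is not shown here.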

2. It explains what attention actually does. It's non-parametric memory: the context tokens are the training set, and each query runs kernel regression against them on the fly, acting as a learned nearest-neighbor smoother over the prompt. That gives a clean mechanical account of how in-context learning can work without any weight updates: the model is regressing over the examples sitting in its context.
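A toy version of "regressing over the context", as a sketch assuming NumPy, a fixed Gaussian kernel on scalar inputs, and a made-up context of (x, sin x) pairs. The function `in_context_predict` and its `bandwidth` parameter are illustrative names. It shows the smoothing behaviour only and makes no claim about what a trained transformer computes internally.

```python
import numpy as np

def in_context_predict(x_query, x_ctx, y_ctx, bandwidth):
    """Kernel-regress y at x_query from context pairs, written in attention form."""
    scores = -(x_query - x_ctx) ** 2 / (2 * bandwidth ** 2)   # log of a Gaussian kernel
    scores = scores - scores.max()                            # stabilise exp
    w = np.exp(scores)
    w = w / w.sum()                                           # softmax over the context
    return w @ y_ctx

# "Context": 50 example pairs, with no parameters fitted to them.
x_ctx = np.linspace(0.0, 2 * np.pi, 50)
y_ctx = np.sin(x_ctx)
x_query = 1.0

# Moderate bandwidth: a smooth average of nearby examples.
pred_smooth = in_context_predict(x_query, x_ctx, y_ctx, bandwidth=0.3)

# Tiny bandwidth (low temperature): collapses onto the nearest example.
pred_sharp = in_context_predict(x_query, x_ctx, y_ctx, bandwidth=1e-3)
nearest = y_ctx[np.argmin(np.abs(x_ctx - x_query))]

print(pred_smooth, pred_sharp, nearest)
```

Shrinking the bandwidth plays the same role as lowering the softmax temperature: the weights concentrate on the closest key and the smoother turns into a nearest-neighbor lookup.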

3. It says what's genuinely new. The averaging mechanism is 1964. Attention's real innovation is that Q, K, V are learned projections — the similarity metric is trained end-to-end, not fixed in advance like an RBF bandwidth. Attention = kernel smoothing with a learned kernel.

One honest caveat. After row-wise softmax normalization, the attention "kernel" is data-dependent and not a positive-definite Mercer kernel in the strict sense, so the deep RKHS theory doesn't transfer wholesale — the analogy is exact for the smoothing form and looser for the function-space theory. But the practical consequence — linear attention — is real and comes directly from taking the kernel view seriously.

The kernelized-linear-attention lineage (Performer, Linear Transformer, and where it sits among the other variants) is laid out in Attention Is All You Need — But Not All Attention Is the Same; the plain softmax-attention mechanism it builds on is in How LLM Generation Works.