What is Grouped-Query Attention (GQA)?
GQA sits between full multi-head attention (MHA) and multi-query attention (MQA): query heads are partitioned into a small number of groups, and each group shares one K/V set. It retains most of the KV-cache savings of MQA and most of the head diversity of MHA, and offers a cheap conversion path from existing checkpoints.
- llm
- attention
- gqa
- kv-cache
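
As a rough illustration of the grouping described in the summary, the sketch below maps query heads to shared K/V heads and compares KV-cache sizes for MHA, GQA, and MQA. The function names, the contiguous head-to-group assignment, and the model dimensions are all assumptions chosen for illustration; they are not taken from any particular model or library.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the shared K/V head used by a given query head.

    Assumes contiguous grouping: the first n_q_heads // n_kv_heads query
    heads share K/V head 0, the next block shares K/V head 1, and so on.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per group
    return q_head // group_size


def kv_cache_bytes(n_kv_heads: int, head_dim: int, seq_len: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence; the leading 2 counts keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Small case: 8 query heads split into 2 groups.
mapping = [kv_head_for_query_head(h, n_q_heads=8, n_kv_heads=2) for h in range(8)]

# Hypothetical model dimensions (illustrative only), 16-bit cache entries.
N_Q_HEADS, HEAD_DIM, SEQ_LEN, N_LAYERS = 32, 128, 4096, 32

mha_bytes = kv_cache_bytes(N_Q_HEADS, HEAD_DIM, SEQ_LEN, N_LAYERS)  # one K/V per query head
gqa_bytes = kv_cache_bytes(8, HEAD_DIM, SEQ_LEN, N_LAYERS)          # 8 groups
mqa_bytes = kv_cache_bytes(1, HEAD_DIM, SEQ_LEN, N_LAYERS)          # single shared K/V

if __name__ == "__main__":
    print("query head -> K/V head:", mapping)
    for name, size in [("MHA", mha_bytes), ("GQA", gqa_bytes), ("MQA", mqa_bytes)]:
        print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumed dimensions the cache scales linearly with the number of K/V heads, so 8 groups cut it to a quarter of the MHA size while MQA cuts it to 1/32.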