Chapter 16 of 25

BM25 vs Dense Embeddings: Why Hybrid Retrieval Wins

Created May 28, 2026 Updated Jun 7, 2026

Two paradigms for retrieving relevant text. They look like competitors. In production they're complements.

BM25 (1994) is sparse, lexical, statistical. It scores (query, document) pairs by which query terms appear in the document, how often they appear, and how long the document is, with two corrections on top of TF-IDF:

Saturation. The contribution of a term stops growing linearly after a few occurrences. Seeing kubernetes ten times doesn't mean a document is ten times more about kubernetes than seeing it once.
Length normalization. Long documents don't get an automatic advantage from accidentally containing query words.

No learning. No semantic understanding. Just term statistics. The formula is roughly thirty years old, it has been the default similarity in Lucene and Elasticsearch since 2016, and it is still the right tool for any query involving exact identifiers.
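The two corrections are easier to see in code. Below is a minimal sketch of BM25 scoring over pre-tokenized documents. The parameter values k1 = 1.2 and b = 0.75 are common defaults rather than anything the text prescribes, the IDF is the non-negative variant used by Lucene, and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    k1 controls term-frequency saturation, b controls length normalization.
    Both defaults are assumptions (common choices), not values from the text.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (the +1 inside the log).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents get a larger penalty.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Saturation: this fraction approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Toy corpus: one mention, ten mentions, no mention.
docs = [
    "kubernetes pod restart".split(),
    ("kubernetes " * 10 + "pod restart").split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores(["kubernetes"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  ({len(doc)} tokens)")
```

In this toy corpus the document with ten mentions scores higher than the one with a single mention, but by well under a factor of two, and the document with no mention scores zero.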

Dense embeddings are learned, semantic, geometric. A bi-encoder model maps each text to a fixed-length vector, trained so that cosine similarity reflects meaning. car and automobile land close together even though they share no tokens. The space is continuous; what gets retrieved is whatever lies near the query vector in that space.
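A minimal sketch of the geometric side, assuming nothing about any particular model: the three-dimensional vectors below are hand-written stand-ins for what a bi-encoder would produce, chosen only to show how nearest-neighbor lookup by cosine similarity works.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration (real models use hundreds of dimensions).
embeddings = {
    "car": [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.95],
}

def nearest(word, k=2):
    """Rank the other entries by cosine similarity to the given word's vector."""
    query_vec = embeddings[word]
    others = [(w, cosine(query_vec, v)) for w, v in embeddings.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

neighbors = nearest("car")
print(neighbors)
```

With these made-up vectors, automobile ranks far above banana for the query car, although the strings have no token in common.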

They fail in opposite ways.

BM25 fails on paraphrase. Query "the car got more expensive" against "the automobile rose in price": the only shared term is the stopword "the", so the score is near zero. Synonyms, multilingual queries, abstract concepts are all problems.
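A quick check of the term overlap for that example pair, assuming whitespace tokenization and a tiny hypothetical stopword list (real analyzers use longer lists and stemming):

```python
# Hypothetical minimal stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "in", "of"}

def tokens(text):
    """Lowercase whitespace tokenization."""
    return set(text.lower().split())

def content_terms(text):
    """Tokens that remain after dropping stopwords."""
    return tokens(text) - STOPWORDS

query = "the car got more expensive"
document = "the automobile rose in price"

shared = tokens(query) & tokens(document)
content_overlap = content_terms(query) & content_terms(document)

print(shared)           # the only shared token is a stopword
print(content_overlap)  # no content term in common
```

A lexical scorer has nothing to work with here beyond a term that appears in almost every document, which is why the paraphrase goes unmatched.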

Dense fails on exact identifiers. Query "policy ID #INS-2847" and a dense embedding will soften #INS-2847 into something like other policy IDs. The model has every incentive to generalize. Specific tokens — error codes, function names, version strings, model numbers, dates — are exactly what dense retrieval treats as noise.

Hybrid retrieval runs both and merges the rankings. The standard merge is Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σ_r 1 / (k + rank_r(d))

For each retriever r, a document contributes 1 / (k + rank_r(d)), where k ≈ 60 is a typical smoothing constant; a document that a retriever did not return contributes nothing for that retriever. The contributions are summed across retrievers. RRF is rank-based, so it sidesteps the score-normalization problem: BM25 scores and cosine similarities aren't on the same scale, so you can't add them directly. Ranks are on the same scale.
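The formula translates almost directly into code. Here is a minimal sketch, assuming each retriever returns a list of document IDs ordered best-first; the function name rrf_merge and the document IDs are hypothetical.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Fuse several best-first rankings with Reciprocal Rank Fusion.

    rankings: list of lists of document IDs, one list per retriever.
    Returns (doc_id, score) pairs sorted by fused score, highest first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; documents a retriever did not return add nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by document ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up rankings from two retrievers.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]

fused = rrf_merge([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here d1 (ranks 2 and 1) edges out d3 (ranks 1 and 3), and both land well above the documents that only one retriever returned. No raw BM25 or cosine score enters the computation.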

A typical hybrid stack:

Run BM25 and dense retrieval in parallel.
Take top-50 from each.
RRF-merge to a combined top-N.
Optionally cross-encoder rerank the top-N.
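The four steps above can be sketched as one function. This is an illustration under stated assumptions, not a reference implementation: the retrievers are passed in as plain callables of the form (query, limit) -> ranked list of IDs, the reranker is a callable (query, doc_id) -> score, and fake_bm25, fake_dense, and exact_match_rerank are stubs standing in for a real index, a real vector store, and a real cross-encoder.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_search(query, retrievers, per_retriever_k=50, top_n=10, k=60, rerank=None):
    """Run retrievers in parallel, RRF-merge their rankings, optionally rerank."""
    # Steps 1 and 2: query every retriever concurrently, top-k from each.
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(r, query, per_retriever_k) for r in retrievers]
        rankings = [f.result() for f in futures]

    # Step 3: Reciprocal Rank Fusion over the returned rankings.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    merged = sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

    # Step 4 (optional): reorder the merged top-N with a more expensive scorer.
    if rerank is not None:
        merged = sorted(merged, key=lambda d: rerank(query, d), reverse=True)
    return merged

# Stub retrievers with made-up results, standing in for real backends.
def fake_bm25(query, limit):
    return ["INS-2847", "INS-2848", "faq-12"][:limit]

def fake_dense(query, limit):
    return ["faq-12", "INS-2848", "faq-7"][:limit]

# Stub reranker: a real system would call a cross-encoder here.
def exact_match_rerank(query, doc_id):
    return 1.0 if doc_id in query else 0.0

query = "policy ID #INS-2847"
merged_only = hybrid_search(query, [fake_bm25, fake_dense], top_n=3)
reranked = hybrid_search(query, [fake_bm25, fake_dense], top_n=3, rerank=exact_match_rerank)
print(merged_only)
print(reranked)
```

Passing the retrievers and reranker as callables keeps the fusion logic independent of the search backend, so either channel can be swapped without touching the merge.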

The obvious question is whether we can get all these signals from a single model. The 2024 answer is increasingly yes. BGE-M3 emits a dense vector, a sparse lexical vector (per-token weights similar in spirit to BM25 / SPLADE), and a multi-vector representation for late interaction, all from one forward pass, with the three scores combined at query time. The dense + sparse + multi-vector template is increasingly the default for high-quality retrieval, because each signal catches failure modes the other two miss.
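To make the three signals concrete, here is a minimal sketch of how such scores can be computed and combined once the representations exist. It does not call BGE-M3 or any library: the vectors, token weights, and combination weights are invented for illustration, and the function names are hypothetical. The multi-vector score uses the MaxSim pattern familiar from late-interaction models.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dense_score(q_vec, d_vec):
    """Single-vector similarity (inputs assumed already normalized)."""
    return dot(q_vec, d_vec)

def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens present in both."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def maxsim_score(q_vecs, d_vecs):
    """Late interaction: each query token takes its best-matching document token."""
    return sum(max(dot(qv, dv) for dv in d_vecs) for qv in q_vecs) / len(q_vecs)

def combined_score(dense, sparse, multi, weights=(1.0, 0.3, 1.0)):
    """Weighted sum of the three signals; these weights are placeholders."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi

# Made-up representations for one query and one document.
s_dense = dense_score([0.6, 0.8], [0.8, 0.6])
s_sparse = sparse_score(
    {"ins-2847": 0.8, "policy": 0.2},
    {"ins-2847": 0.9, "renewal": 0.1},
)
s_multi = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])

total = combined_score(s_dense, s_sparse, s_multi)
print(s_dense, s_sparse, s_multi, total)
```

The sparse term is what rewards the exact identifier: only tokens present on both sides contribute, so ins-2847 counts and renewal does not.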

If exact identifiers matter — IDs, codes, names, versions — BM25 is often the safer first retriever: cheap, no training cost, and strong on the exact matches dense embeddings systematically blur. Then add dense for paraphrase recall. (For paraphrase-heavy corpora where identifiers rarely appear — semantic search over prose, say — dense-first can be the better starting point instead; match the channel to the queries.) Either way, pure dense without a sparse channel is the configuration that fails on policy ID #INS-2847, and you only find out in production.

Full lineage — BoW → TF-IDF → BM25, then Word2Vec → BERT → SBERT → BGE-M3 — and where each one still lives: How Text Became Geometry and Embeddings: How Geometry Pretends to Be Meaning.