Chapter 17 of 25
Bi-encoder vs Cross-encoder: Why Retrieval Is Two-Stage
Created May 28, 2026 Updated Jun 7, 2026
Two architectures, two different jobs. Production retrieval pipelines run both.
A bi-encoder is the model that produces embeddings. Query goes through one tower → vector. Document goes through another tower (or the same one) → vector. The two vectors meet only at the cosine step. The query and document never see each other inside the model.
This independence is what makes it fast:
- Document vectors can be precomputed once and stored in an index.
- Query time: one forward pass through the encoder, then an HNSW lookup against millions of cached vectors.
- Whole operation runs in milliseconds.
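The precompute-then-lookup flow described above can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_encode` is a hypothetical stand-in for an encoder tower (hashed bag-of-words), the three documents are made up, and the brute-force scan stands in for the HNSW lookup a production system would use.

```python
import math
import zlib

DIM = 64  # toy size; real encoders emit hundreds of dimensions


def toy_encode(text):
    """Hypothetical stand-in for one encoder tower: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Both inputs are unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


# Made-up corpus for illustration.
docs = [
    "how to store lithium batteries safely",
    "recipe for sourdough bread",
    "charging lithium batteries overnight",
]

# Offline: encode every document once and keep the vectors.
index = [toy_encode(d) for d in docs]


def search(query, k=2):
    """Query time: one encode, then a scan (an ANN index such as HNSW in practice)."""
    q = toy_encode(query)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(index)), reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

Note that `search` never passes the query text and a document text into the same function call: the only point of contact is `cosine` over two already-finished vectors.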
The price is what gets lost. A bi-encoder can't say "the query is asking about safety — but this document says not safe." It encoded both before they met. The interaction information is gone before scoring ever happens. This is the source of the classic embedding failures: negation flips, role confusion ("who did what to whom"), topical-but-not-answering matches.
A cross-encoder is built differently. Query and document are concatenated into one sequence and run through a single transformer that attends across all of their tokens at once. Output: one scalar, a relevance score. The model literally sees "safe" in the query and "not" next to "safe" in the document and can lower the score.
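The difference in what each scorer is allowed to see can be shown with a toy contrast. Everything here is an assumption made for illustration: the independent encoder is a plain bag-of-words count, and `toy_pair_score` is a hand-written rule standing in for what a trained transformer learns through cross-attention. It is not a cross-encoder, only a scorer with the same interface (both texts in, one scalar out).

```python
import math
from collections import Counter

NEGATIONS = {"not", "no", "never"}


def encode_independently(text):
    """Bi-encoder style: each text is encoded without seeing the other."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def toy_pair_score(query, doc):
    """Pair style: sees both texts at once. Hand-written rule, not a trained model."""
    q_tokens = query.lower().split()
    d_tokens = doc.lower().split()
    if not q_tokens:
        return 0.0
    q_terms = set(q_tokens)
    score = len(q_terms & set(d_tokens)) / len(q_terms)
    # Penalize a negation word directly before a query term,
    # unless the query itself is negated.
    for prev, tok in zip(d_tokens, d_tokens[1:]):
        if prev in NEGATIONS and tok in q_terms and not (NEGATIONS & q_terms):
            score -= 0.5
    return score


query = "is the procedure safe"
doc_pos = "the procedure is safe"
doc_neg = "the procedure is not safe"

q_vec = encode_independently(query)
sim_pos = cosine(q_vec, encode_independently(doc_pos))
sim_neg = cosine(q_vec, encode_independently(doc_neg))
pair_pos = toy_pair_score(query, doc_pos)
pair_neg = toy_pair_score(query, doc_neg)
```

With these inputs the independent vectors stay close for the negated document (`sim_neg` is about 0.89), while the pair scorer can separate the two (`pair_pos` is 1.0, `pair_neg` is 0.5), because only it can compare the document's "not" against what the query asked.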
The price is speed. Nothing can be precomputed. Every (query, document) pair needs a full transformer forward pass. Running a cross-encoder against a million documents per query isn't a retrieval strategy; it's a pricing accident.
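A back-of-the-envelope estimate makes the gap concrete. The per-pair cost below is a hypothetical placeholder, not a measurement, and the calculation ignores batching and parallelism; only the ratio between the two cases is the point.

```python
def rerank_seconds(num_pairs, ms_per_pair):
    """Total cross-encoder time if every (query, document) pair costs ms_per_pair."""
    return num_pairs * ms_per_pair / 1000.0


MS_PER_PAIR = 5.0  # hypothetical placeholder, not a benchmark

full_scan = rerank_seconds(1_000_000, MS_PER_PAIR)  # score every document
top_100 = rerank_seconds(100, MS_PER_PAIR)          # score first-stage candidates only
```

Under that assumed cost, scoring a million documents takes 5000 seconds per query, while scoring 100 candidates takes 0.5 seconds.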
The standard solution is a cascade:
Stage 1 (bi-encoder + ANN):
    millions of docs → top 100–200 candidates   [milliseconds]
Stage 2 (cross-encoder rerank):
    top 100–200 → top 10   [hundreds of ms]
    → LLM (or user)
The bi-encoder filters: cheap, broad, slightly dumb. The cross-encoder re-sorts: expensive, narrow, much sharper. Each one does the part the other can't afford. In a hybrid retrieval stack, Stage 1 runs BM25 and dense retrieval in parallel and merges the two rankings with reciprocal rank fusion (RRF); the cross-encoder then reranks the merged candidates.
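The hybrid cascade can be sketched as two small functions. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over the rankings it appears in; k = 60 is the conventional default. The function names, the `rerank_score` callable (assumed to close over the query and return a relevance score per document id), and the tie-break by document id are assumptions made for this sketch.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def two_stage(bm25_ranking, dense_ranking, rerank_score, candidates=100, final_k=10):
    """Stage 1: fuse and truncate. Stage 2: re-sort survivors with the expensive scorer."""
    merged = rrf_merge([bm25_ranking, dense_ranking])[:candidates]
    reranked = sorted(merged, key=rerank_score, reverse=True)
    return reranked[:final_k]


# Made-up rankings and reranker scores for illustration.
bm25 = ["a", "b", "c"]
dense = ["b", "a", "d"]
fake_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}

merged = rrf_merge([bm25, dense])
final = two_stage(bm25, dense, fake_scores.get, candidates=3, final_k=2)
```

Here `merged` is `["a", "b", "c", "d"]`. With `candidates=3`, document `"d"` never reaches the reranker even though its reranker score is the second highest, so `final` is `["b", "c"]`: a first-stage cut is irreversible.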
In many retrieval benchmarks (BEIR among them), adding a cross-encoder reranker often delivers a larger quality lift than the next incremental upgrade to a heavier embedding model. The lift is concentrated in exactly the failure modes the bi-encoder is built to have — interaction-sensitive judgments the geometry alone can't make.
The tuning question becomes "top-k of the first stage":
- Top-20 to the reranker: latency stays low, but if the answer-bearing chunk is rank 25 in the bi-encoder, it's already lost.
- Top-100 to the reranker: recall goes up, but the cross-encoder now processes 5× the pairs, and reranking latency grows roughly proportionally. The absolute numbers depend on model size, hardware, and batching. Past some point it's time to move the reranker to a GPU or quantize it.
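One way to pick the cutoff is to measure first-stage recall at each candidate k on a labeled query set and weigh it against the number of pairs the reranker must score. The sketch below assumes you have recorded, per query, the first-stage rank of the answer-bearing chunk; the ranks shown are invented for illustration.

```python
def recall_at_k(gold_ranks, k):
    """Fraction of queries whose answer-bearing chunk sits in the first-stage top k.

    gold_ranks holds 1-indexed ranks, or None when the chunk was not retrieved at all.
    """
    if not gold_ranks:
        return 0.0
    hits = sum(1 for r in gold_ranks if r is not None and r <= k)
    return hits / len(gold_ranks)


# Invented first-stage ranks for six queries.
gold_ranks = [1, 3, 25, 8, 60, None]

recall_20 = recall_at_k(gold_ranks, 20)
recall_100 = recall_at_k(gold_ranks, 100)
pair_ratio = 100 / 20  # relative reranker work, top-100 vs top-20
```

On this made-up data, going from top-20 to top-100 lifts first-stage recall from 0.5 to about 0.83 at 5× the reranker work. The query with rank `None` stays lost at any k, which only a better first stage can fix.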
That trade — recall before reranking vs reranker compute — is the central knob of two-stage retrieval. It's also why "which embedding model is best" stops being the central question once a reranker is in the pipeline.
A third, middle-ground architecture exists: late interaction (ColBERT, ColBERTv2). It keeps one embedding per token and, at query time, computes MaxSim: each query token is matched to its most similar document token, and those maxima are summed. The neat idea is that query–document interaction moves from the document level down to the token level while document vectors stay precomputable, so it recovers much of a cross-encoder's discrimination without paying a per-pair forward pass at query time. Cost: a much larger index (dozens of vectors per chunk instead of one). It is increasingly absorbed as one signal inside hybrid models like BGE-M3.
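The MaxSim operator itself is short. A minimal sketch, assuming token embeddings are already computed and normalized; the 2-dimensional vectors are invented so the arithmetic can be checked by hand.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over the query tokens."""
    if not query_vecs or not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Invented token embeddings: two query tokens, two tokens per document.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]  # has a good match for both query tokens
doc_b = [[0.0, 1.0], [0.0, 1.0]]  # matches only the second query token

score_a = maxsim(query_vecs, doc_a)
score_b = maxsim(query_vecs, doc_b)
```

Here `score_a` is 1.0 + 0.8 = 1.8 and `score_b` is 0.0 + 1.0 = 1.0. The document vectors can be stored ahead of time; only the small query-side matrix and the max-and-sum run at query time.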
Full breakdown of the bi-encoder / cross-encoder split, late interaction, and where each one fails: Embeddings: How Geometry Pretends to Be Meaning.