Chapter 2 of 3
How Text Became Geometry
Created May 19, 2026. Updated May 21, 2026.
In the note "Embeddings: How Geometry Pretends to Be Meaning" I treated embeddings as if the modern world had always been there: text goes in, a vector comes out, cosine similarity decides what's close.
But that picture is the end of a long story, not the beginning. For most of computing history, text wasn't geometry. Text was counted.
It starts with counting words. It ends with a 1024-dimensional vector that "knows" cat and kitten belong together. Between those two points lie sixty years of incremental work, and one query, "is my car covered if I'm driving abroad?", whose representation we'll trace through every era. Each step solved one kind of blindness and exposed the next.
This is not a history of search methods. It's a guided tour of retrieval representations, ordered historically because each method was invented to fix the previous method's blindness. I'll go method by method, but not as a museum tour. For each step, I'll ask the same questions: what object becomes a vector, what creates the geometry, what kind of similarity it can see, and what it remains blind to. The dates are scaffolding; the mechanics are the point.
Bag-of-words: count and you're done
The simplest answer, and the oldest, is just to count.
Bag-of-words (BoW), which goes back to information retrieval research in the 1950s and '60s, represents a document as a vector of word frequencies. The dimensionality of the vector equals the vocabulary size of the entire corpus. The sentence "cat sat on the mat", in a vocabulary [the, cat, sat, on, mat, dog, ran, ...], becomes [1, 1, 1, 1, 1, 0, 0, ...] — five ones for the words that appeared, zeros for everything else.
For a corpus of 20,000 unique words, every document is a 20,000-dimensional sparse vector — sparse because almost every coordinate is zero (most words don't appear in any given document).
Mechanically, BoW builds a vocabulary from the corpus and assigns one coordinate to each word. A document is represented by activating the coordinates of the words it contains. Similarity is then just overlap in that explicit word space: if two documents share important coordinates, they are close; if they use different words, they are far apart, even if they mean the same thing.
This sounds primitive, and it is. But it has three remarkable properties: zero training cost (you just count), full interpretability (every coordinate is one specific word), and very fast computation. And surprisingly, for many tasks — topic classification, retrieval on well-defined queries, document deduplication — it works well enough.
BoW is already geometry, but an extremely literal one. Every axis has a name. One dimension is cat, another is dog, another is insurance. There is no hidden meaning in the space: if two documents are close, it's only because they activate the same word coordinates. The geometry is sparse, explicit, and brutally honest. It cannot infer anything — but it also cannot hallucinate similarity that isn't present in the tokens.
What it doesn't capture: word order ("the dog bit the man" and "the man bit the dog" have identical BoW vectors), and any notion of semantic similarity (synonyms are different coordinates, so the model has no idea that car and automobile are related).
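Both blind spots are easy to reproduce. The sketch below is a minimal illustration, assuming whitespace tokenization and a four-sentence toy corpus that I made up for this purpose; `bow_vector` and `cosine` are hypothetical helper names, not part of any library.

```python
import math
from collections import Counter

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "the dog bit the man",
    "the man bit the dog",
    "car prices rose",
    "automobile costs increased",
]
# One axis per distinct word in the toy corpus.
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [bow_vector(d, vocab) for d in docs]

same_bag = vectors[0] == vectors[1]            # word order is invisible
synonym_sim = cosine(vectors[2], vectors[3])   # no shared tokens at all

print(same_bag, synonym_sim)
```

The first two sentences collapse into the same vector, and the two paraphrases about car prices share no coordinate, so their cosine is exactly zero.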
TF-IDF: rarity is information
The first real refinement was TF-IDF. The term frequency × inverse document frequency idea was crystallized by Karen Spärck Jones (1972) and popularized through Salton's SMART system throughout the 1970s. Instead of raw counts, weight each word by how rare it is in the corpus.
The intuition is simple. If the appears in every document, knowing it appears in this one carries no information. If kubernetes appears in only fifty documents out of a million, its presence is a strong signal that this document is about something specific. So term frequency in the document is multiplied by inverse document frequency, and the rare informative words get high weight while the common filler words get nearly zero.
TF-IDF doesn't change the geometry's axes; it changes their gravity. The coordinate for the gets almost no mass. The coordinate for pre-authorization or kubernetes suddenly matters. Two documents are close now not because they overlap on many words, but because they overlap on the informative ones. Similarity has its first weight function. The notion of "what counts as evidence" has stopped being uniform.
Mechanically, TF-IDF keeps the same sparse vector as BoW but changes the values inside it. A coordinate is no longer just "did this word appear?" or "how often did it appear?" It becomes "how often did this word appear here, adjusted by how rare it is globally?" The geometry is still lexical and still living in vocabulary-sized space, but it's no longer flat: rare words pull documents together much more strongly than common ones.
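As a rough sketch of that reweighting, here is one common TF-IDF variant: raw term count times log(N / df). Real systems usually add smoothing and normalization, so treat the exact formula, the toy corpus, and the helper name `tfidf_vectors` as assumptions of this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) using tf * log(N / df), one common variant."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

docs = [
    "the cluster runs kubernetes",
    "the cat sat on the mat",
    "the dog ran",
]
vocab, vectors = tfidf_vectors(docs)
weight = dict(zip(vocab, vectors[0]))  # weights for the first document

print(weight["the"], weight["kubernetes"])
```

Because "the" occurs in all three documents, its IDF is log(3/3) = 0 and its coordinate vanishes, while "kubernetes" keeps a weight of log(3).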
TF-IDF was, for decades, the workhorse of every search engine and document classifier. It still is, in many production systems where simplicity and interpretability matter more than peak benchmark numbers.
BM25: the industrial standard
The next refinement was BM25 (Robertson and Walker, 1994, building on a longer "Best Match" series with Spärck Jones at City University London and Cambridge). BM25 takes TF-IDF and adds two things. The first is saturation: a word's contribution to a document's score grows sublinearly and flattens out after a few occurrences (seeing kubernetes ten times in a document doesn't make it ten times more about kubernetes than seeing it once). The second is length normalization: long documents don't get an unfair advantage just for being long.
These two corrections encode two very practical beliefs about documents. First: evidence has diminishing returns. Once a document has clearly committed to a topic, repeating the same word doesn't make it any more about that topic — it just inflates the score. Second: a long document isn't automatically more relevant; it just has more chances to accidentally contain your query words. BM25 is the moment when the scoring function stops being naive arithmetic and starts modeling how a reader would actually judge a document. It still doesn't understand meaning — but it understands evidence better than anything that came before.
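Both corrections fit in a few lines. This is a sketch of one widely used BM25 formulation, assuming whitespace tokenization, the common default parameters k1 = 1.2 and b = 0.75, and a smoothed IDF of the kind Lucene uses. The function name `bm25_scores` and the toy documents are mine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    df = Counter(w for doc in tokenized for w in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            # Saturation: this ratio approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

once = "kubernetes guide"
many = " ".join(["kubernetes"] * 10)
other = "pasta recipe"
scores = bm25_scores("kubernetes", [once, many, other])

print(scores)
```

The document that repeats the term ten times does score higher than the one that mentions it once, but by far less than a factor of ten, and the unrelated document scores zero.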
BM25 is not a dense learned embedding model. It's typically used as a query-document scoring function over sparse lexical statistics. You can represent lexical weights as sparse vectors in some hybrid or sparse-vector retrieval systems — but the space is still explicit vocabulary space, not learned semantic geometry. Whenever someone says "BM25 vs dense embeddings," that's the distinction they're really pointing at: lexical evidence vs. learned geometry.
BM25 has been a standard baseline in information retrieval since the mid-1990s, and it is the default scoring function in Lucene and Elasticsearch (both switched to it from TF-IDF-style scoring in 2016) and in many Lucene-style production search systems. When you ship a search bar that mostly works, it's almost certainly BM25, or something close, under the hood, and it's almost certainly fast.
This whole lineage — BoW → TF-IDF → BM25 — is what gets called the bag-of-words family or sparse retrieval. It's lexical, interpretable, and based purely on word statistics. It's the foundation of sparse retrieval today and one half of hybrid search.
LSA: the first attempt at dense semantic vectors
The whole BoW → TF-IDF → BM25 lineage hits a hard wall, and the wall is the same one BoW had on day one: the model has no idea that car and automobile mean the same thing. Query for "the car got more expensive" against a document that says "the automobile rose in price", and BM25 scores them as essentially unrelated — zero word overlap, near-zero similarity. For any text retrieval that has to deal with paraphrases, synonyms, or cross-lingual content, that's a dealbreaker. For decades, the workaround was lexical: stemming, lemmatization, synonym expansion via thesauri, query rewriting — all manual, brittle, language-specific. A different idea was needed: don't represent the document by which words appear in it, represent it by the meaning those words carry. The trouble was — how do you compute meaning from text? It took twenty years to get a serviceable first answer, and another twenty to get a good one.
The first serious attempt to compute meaning by statistics came in 1990, when Deerwester and his Bell Communications Research colleagues published Latent Semantic Analysis (LSA) — also known as Latent Semantic Indexing (LSI). The idea was algebraic, not neural. Take the term-document matrix from BoW or TF-IDF — a huge sparse matrix where each row is a word, each column a document, each cell a weight. Apply singular value decomposition (SVD) (a matrix factorization that finds the strongest directions of variance) and keep only the top-k singular values. What comes out is a low-rank approximation in which each word becomes a dense vector in a k-dimensional latent space (typically k = 100 to 300).
The math forces words that frequently co-occur with similar other words to land close together. automobile and car, even if they never appear in the same document, end up near each other in the LSA space because they tend to co-occur with the same surrounding words — engine, tire, traffic. For the first time, "synonym similarity" emerged from a corpus without anyone hand-curating a thesaurus.
This is also the first moment where the axes of the space stop having names. In BoW and TF-IDF, every coordinate was a specific word — you could point at dimension 4,217 and say "this is the kubernetes axis." In LSA, the 200 latent dimensions don't correspond to any single word. They're directions in space that mix many words together — one axis might roughly capture "vehicle-and-transport-ness," another "legal-document-ness," and most of them resist any clean human label at all. Geometry has stopped being a literal inventory of vocabulary and started being an abstraction over it. Every dense embedding model since works in some version of this nameless space.
Mechanically, LSA is the first method in this story that turns sparse lexical vectors into dense latent vectors. The original term-document matrix is huge and mostly empty. SVD compresses it into a smaller space where the dimensions are no longer words, but latent directions inferred from co-occurrence. Similarity is no longer exact word overlap; it becomes proximity in a compressed co-occurrence space.
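A tiny version of that compression can be run with NumPy's `numpy.linalg.svd`. The corpus below is a toy I constructed so that car and automobile never share a document but do share neighbors; raw counts stand in for TF-IDF weights, and k = 2 is chosen only because the corpus has two obvious topics.

```python
import numpy as np

terms = ["car", "automobile", "engine", "tire", "cat", "kitten", "fur", "purr"]
docs = [
    "car engine tire",
    "automobile engine tire",
    "cat fur purr",
    "kitten fur purr",
]
# Term-document count matrix: one row per term, one column per document.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# Truncated SVD: keep only the k strongest latent directions.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * S[:k]  # one dense k-dimensional vector per term

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

idx = {t: i for i, t in enumerate(terms)}
raw_sim = cosine(A[idx["car"]], A[idx["automobile"]])
lsa_sim = cosine(term_vectors[idx["car"]], term_vectors[idx["automobile"]])
cross_sim = cosine(term_vectors[idx["car"]], term_vectors[idx["cat"]])

print(raw_sim, lsa_sim, cross_sim)
```

In the raw term-document space car and automobile are orthogonal. In the two-dimensional latent space they land on essentially the same direction, while car and cat stay orthogonal. A real corpus is never this clean, so the separation there is a matter of degree.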
LSA worked but didn't scale gracefully. SVD on a vocabulary of a million words across a million documents is computationally brutal, and re-running it whenever the corpus changed was worse. Quality was modest by today's standards. But LSA is the conceptual ancestor of every dense embedding that followed: a learned low-dimensional space where co-occurrence statistics get encoded as geometric proximity. Word2Vec is, in a real sense, LSA without the matrix algebra — a neural network that arrives at very similar representations far more efficiently.
Word2Vec: the distributional hypothesis materializes
The breakthrough came in 2013 with Word2Vec (Mikolov et al., Google).
The underlying idea was older: the distributional hypothesis — words that appear in similar contexts tend to mean similar things. You shall know a word by the company it keeps, as the linguist J. R. Firth put it in 1957. If bank and loan keep showing up near each other, they're related. If cat and kitten appear in similar sentences — about pets, fur, sleeping in the sun — they should be related too.
Word2Vec made the hypothesis computational. Train a shallow neural network to predict words from their surrounding context (CBOW), or predict the context from a word (Skip-gram). The network has a hidden layer; after training, the weight matrix of that hidden layer becomes the embedding matrix. Each word now has a vector — typically 100 to 300 dimensions — and the geometry of the resulting space encodes the distributional structure of the training corpus.
The famous demonstration: vector arithmetic. king − man + woman ≈ queen. Paris − France + Italy ≈ Rome. The model wasn't taught these analogies. They emerged from the structure of where words appear in language.
Conceptually, Word2Vec was doing something close to LSA — but locally, and via prediction rather than decomposition. LSA looked at the entire corpus at once and solved a global matrix factorization; Word2Vec slid a small context window through the text and asked, billions of times, "given these surrounding words, what word goes here?" The geometry fell out as a byproduct of being good at that prediction game. This is also where the term representation learning starts to mean what it means today: the embedding isn't a hand-designed feature or a closed-form decomposition, it's whatever internal state happened to make the network good at its task. The vector for cat is not "the answer to a math problem about cats" — it's the row of weights the network ended up with after a lot of training. Every neural embedding since is some version of this trick.
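The "sliding window" part is simple enough to show directly. This sketch only generates the (center, context) training pairs that a Skip-gram model would be asked to predict; the network, negative sampling, and subsampling are left out, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sleeps in the sun".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

Words that keep producing the same context targets receive similar gradient updates, which is how distributional similarity ends up encoded as geometric proximity.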
For the first time, similarity-by-meaning had a concrete mathematical representation — at the word level. With simple averaging of word vectors or phrase-level representations, texts that share related vocabulary could start moving closer than pure lexical overlap would allow. But this was still a crude sentence representation, not retrieval-ready sentence embedding — the query "refund not received" would land somewhat near "my money hasn't come back yet", but the geometry was approximate. Bridging the gap from word vectors to genuinely retrieval-ready sentence embeddings is the work of the next decade.
It's worth being precise about what Word2Vec embeds. Word2Vec does not embed documents. It embeds words. Its training objective is local prediction: predict a word from nearby words, or nearby words from a word. The embedding is not the output label; it is the internal representation that made that prediction task easier. That's why its geometry captures distributional similarity — but also why it doesn't automatically solve retrieval for full questions and passages, where you need a representation of an entire sentence, not a bag of word vectors stacked together.
GloVe and FastText: refinements
Two follow-ups arrived shortly after. GloVe (Pennington et al., Stanford, 2014) is, in a sense, the return of matrix factorization, but done right. Instead of Word2Vec's local prediction game, GloVe builds a global word-word co-occurrence matrix from the whole corpus and factorizes it, much like LSA did, but with a weighting function that downweights very rare pairs and caps the influence of very frequent ones. The resulting vectors land in a space that's geometrically very similar to Word2Vec's (analogies still work, neighborhoods look familiar), but the training is more stable and the dependence on global statistics is explicit rather than implicit. For a few years, "Word2Vec or GloVe?" was a real engineering choice rather than a sharp conceptual one.
FastText (Facebook, 2016) does something more philosophically different: it changes the unit of representation. Word2Vec and GloVe both treat a word as an atom — running and ran and runner are three completely unrelated vectors that just happen, if the training was good, to land near each other. FastText breaks each word into character n-grams (run, unn, nni, nin, ing, ...) and represents the word as the sum of its piece vectors. Suddenly the model has a built-in notion of morphology: a word it has never seen at training time can still be embedded, because its pieces are familiar. For morphologically rich languages — Russian, Turkish, Finnish, where one root produces dozens of surface forms — this isn't a nice-to-have, it's the difference between working and not working. FastText also makes the embedding space mildly self-repairing: typos, rare technical terms, brand-new compounds all get a plausible vector instead of nothing.
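The decomposition itself is a few lines. The FastText paper wraps each word in boundary markers `<` and `>` and uses n-grams of length 3 to 6 by default. The sketch below follows that convention but stops at listing the n-grams: it does not hash them into buckets, add the whole-word token, or sum any vectors. `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("running", 3, 3)
# Pieces that "running" and "runner" have in common across all lengths.
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))

print(trigrams)
print(sorted(shared))
```

The shared pieces are what let an unseen form inherit part of its vector from forms the model did see.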
Contextual embeddings: BERT and the transformer era
Word2Vec, GloVe, FastText all share one fundamental limitation: the vector for a word is fixed, regardless of context. The word bank gets one vector — but river bank and money bank are entirely different things. The word cell is the same vector whether it's a biological cell or a prison cell. For sentence-level retrieval, where you want a meaningful representation of a whole phrase rather than just one word, static word embeddings hit a ceiling fast. Averaging the word vectors of a sentence gives you something, but it's a blurry average that loses the structure. What was needed: a model where the representation of a word depends on the sentence around it.
In 2018, two things changed everything. First, ELMo (Peters et al., Allen Institute for AI) showed that contextual word representations were possible at scale, using deep bidirectional LSTMs trained as a language model and read out from intermediate layers. A few months later, BERT (Devlin et al., Google) did the same thing on a different architecture: the transformer (Vaswani et al., 2017). The transformer's parallelism, longer effective context, and scaling properties made it dominant within a year; BERT became the foundation for almost every contextual embedding that followed.
In BERT, every token gets a vector that depends on every other token in the sequence. The word bank in river bank and money bank now gets two different vectors. Self-attention — the transformer's core mechanism — lets each token "see" every other token in the input and adjust accordingly. The model produces contextual token representations; to get a sentence-level vector, those token vectors still need to be pooled or otherwise adapted.
Mechanically, BERT changes the unit of representation again. The object being embedded is no longer a word type like bank, but a token occurrence inside a specific sentence. The same surface word receives different vectors depending on the surrounding tokens. This solves polysemy, but it doesn't solve retrieval: BERT was trained to model language (masked-token prediction), not to make cosine similarity between sentence vectors meaningful.
It's worth being precise about what BERT did and didn't solve. What it solved: contextual understanding — the same surface word now means different things in different sentences, and the model's internal representations reflect that. What it did not solve: producing a single sentence vector well suited to retrieval. Those are not the same task. A model can read a sentence beautifully and still produce a [CLS] vector or a mean-pool that's geometrically mediocre for nearest-neighbor search — too anisotropic, too dominated by stop-word artifacts, too aligned to whatever the pretraining objective happened to optimize. Out of the box, raw BERT embeddings were a famous disappointment for direct cosine-similarity retrieval, and the field spent the next several years figuring out how to coax retrieval-grade geometry out of an architecture that was originally designed to fill in blanks. The next step — turning contextual representations into usable sentence vectors — was its own research program, which is exactly what SBERT (and later DPR) addressed.
SBERT: from tokens to sentences
The first practical adaptation came with Sentence-BERT (SBERT) (Reimers and Gurevych, UKP Lab, 2019). Take a BERT-style model, feed it a sentence, pool the token vectors into one sentence vector (typically by averaging — mean pooling), and you have a sentence-level embedding suitable for retrieval. SBERT trained this with a siamese architecture on sentence-similarity datasets, and the result was the first widely usable sentence embedding model.
The contribution wasn't really "we pooled the tokens" — anyone could have pooled the tokens. The contribution was the combination: a pooling strategy plus a training objective that explicitly rewarded the pooled vectors for behaving well under cosine similarity. SBERT fine-tuned on NLI and similarity datasets so that paraphrases would land close and unrelated sentences would land far, in a way raw BERT had no incentive to learn. Out of this came two lessons that the rest of the field absorbed permanently. First: how you pool matters, but how you train the model that you're going to pool matters more. Second: a sentence embedding is not a side-effect of a language model — it's a thing you train for on purpose, with the geometry you eventually want in mind.
Stated as a method: SBERT takes a BERT-style encoder, pools its token vectors into one sentence vector, and trains that pooled vector so that cosine similarity reflects sentence-level semantic similarity. The important part isn't pooling alone — pooling existed before. It's training the pooled representation to actually live in a useful comparison space.
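For reference, the pooling half of that recipe looks roughly like this. The sketch assumes the encoder has already produced one vector per token plus an attention mask marking padding, and that at least one position is unmasked. The three-dimensional vectors are invented for illustration, and nothing here performs the training that makes the pooled vector useful.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [x / count for x in total]

# Hypothetical token vectors; the last position is padding and must be ignored.
token_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [9.0, 9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)

print(sentence_vector)
```

Applied to raw BERT outputs, this gives exactly the mediocre geometry described earlier; it only becomes a retrieval-grade embedding once the encoder underneath has been fine-tuned with the pooled vector as the training target.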
For the first time, you could embed a full sentence — not a word, not an averaged bag — into a single vector that carried contextual meaning. The era of dense retrieval finally became practical.
DPR: encoders specialized for retrieval
The next step came from a different direction. DPR — Dense Passage Retrieval (Karpukhin et al., Facebook AI, 2020) — pushed the dual-encoder idea into open-domain question answering. Two separate BERT encoders, one for questions and one for passages, trained on (question, answer-passage) pairs with contrastive loss and in-batch negatives — each query's passage in the batch becomes a positive, and the other queries' passages serve as free negatives. The architecture made the asymmetry of retrieval explicit: query and document have different roles, so they get different encoders.
Stated as a method: DPR changes the target relation the geometry is trained on. Instead of "these two sentences mean similar things," the objective becomes "this passage is evidence for this question." That is the retrieval leap. Query and passage are no longer symmetric, and they're not expected to be paraphrases. They are trained to be close because one answers the other.
The conceptual leap underneath DPR is worth naming. SBERT learned semantic similarity — "these two sentences mean roughly the same thing." DPR learned answerability — "this passage answers this question." Those are different geometric relations, and they don't always agree. The question "is my car covered if I'm driving abroad?" and the passage "International coverage applies to private passenger vehicles in countries listed in Schedule B..." are not paraphrases of each other; they barely share any vocabulary. Under SBERT-style similarity they might land somewhat near each other, but under DPR-style training they're explicitly trained to be neighbors, because one is the answer to the other. This is also where the asymmetry between query and document becomes a first-class modeling decision — queries are short, vague, often phrased as questions; passages are longer, declarative, often phrased as statements. Treating them with one shared encoder and no role distinction leaves performance on the table; from this point on, retrieval models started taking the asymmetry seriously — through two separate encoders, instruction-conditioned single encoders, or related architectures.
DPR's lasting contribution wasn't just one model; it was the template. Train an encoder on (query, relevant passage) pairs with contrastive loss. Use in-batch negatives so each batch produces many training signals from few labels. Bake the retrieval objective into the embedding itself rather than relying on a generic sentence-similarity model. Many mainstream single-vector retrieval embedding models published since 2020 inherit this basic template, although later systems differ in data construction, hard-negative pipelines, instruction formatting, distillation, and sometimes the retrieval architecture itself (late interaction, multi-vector retrieval).
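The in-batch-negatives objective can be sketched without any deep learning framework. The version below assumes dot-product similarity and a softmax over the passages in the batch, with the passage at the same index as the query treated as its positive. The two-dimensional vectors are made up, and `in_batch_contrastive_loss` is my name for the function, not DPR's.

```python
import math

def in_batch_contrastive_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of the matching passage for each query.

    passage_vecs[i] is the positive for query_vecs[i]; every other passage
    in the batch is treated as a negative for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        # Log of the softmax denominator, shifted by the max for stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

# Toy vectors (made up for illustration).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]   # each query points at its own passage
swapped = [[0.0, 2.0], [2.0, 0.0]]   # each query points at the other passage

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)

print(loss_aligned, loss_swapped)
```

A batch of B pairs yields B positives and B × (B − 1) negatives from the same B labels, which is where the "many training signals from few labels" comes from.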
E5, BGE, and web-scale retrieval embeddings (2022–2024)
Starting around 2022, the field takes DPR's template and pushes it to a different order of magnitude. Open models like E5 from Microsoft Research (Wang et al., 2022), BGE from BAAI, and arctic-embed from Snowflake make the training recipe visible: take a transformer base (often XLM-RoBERTa for multilingual models, or a BERT-style encoder for English-only ones), then train it on enormous corpora of weakly supervised (query, document) pairs scraped from the web (title→body, question→answer, paraphrase pairs, parallel translations) with contrastive loss and in-batch negatives, augmented later with hard-negative mining. Commercial API models such as OpenAI's text-embedding-3 family (released January 2024) belong to the same product era from a user's perspective: strong general-purpose retrieval embeddings, with features like a shortenable dimensions parameter. Their full training recipe and role-handling, however, are not public.
DPR trained on a hundred thousand curated question-answer pairs; E5 and its peers train on hundreds of millions of noisy pairs harvested from the web. Most of the pairs are imperfect; some are outright wrong. But contrastive learning at this scale is robust to a surprising amount of label noise, and the resulting models inherit a statistical sense of what "queries about X look like" and "documents about X look like" across an enormous range of domains. These models bake the retrieval objective into the embedding itself: they learn explicitly that "this query and this passage belong together," not just that two pieces of text are about the same topic.
A second design choice baked in at this point: explicit query/passage role conventions. E5 makes the convention literal — the query side is prefixed with "query: " and the passage side with "passage: ", and the prefixes shift the resulting vector accordingly. BGE-style models use task or query instructions in some variants, especially for short-query-to-long-passage retrieval, where the instruction sits on the query side and the passage stays bare. The exact convention is model-specific, and the model card matters — the shared principle is that explicit role-handling has become part of how these models are deployed, not optional decoration. Run E5 without its prefix and the geometry degrades; cosines stay reasonable for paraphrase pairs but the discrimination between related and unrelated drops noticeably.
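In code, the E5 convention is nothing more than string formatting applied before the text reaches the model. The helper below is a hypothetical wrapper covering only the "query: " and "passage: " prefixes named above; the embedding call itself is omitted, and other models need whatever their model card specifies.

```python
def format_for_e5(text, role):
    """Prepend the role prefix that E5-style models expect."""
    prefixes = {"query": "query: ", "passage": "passage: "}
    if role not in prefixes:
        raise ValueError(f"unknown role: {role!r}")
    return prefixes[role] + text

query_input = format_for_e5("is my car covered if I'm driving abroad?", "query")
passage_input = format_for_e5(
    "International coverage applies to private passenger vehicles.", "passage"
)

print(query_input)
print(passage_input)
```

Centralizing the prefix in one function is a cheap guard against the most common deployment mistake: indexing passages with one convention and querying with another.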
Matryoshka representations (2022)
A parallel branch, also from 2022, went after a different problem: the embedding's dimension is usually fixed at training time. You either store everything at 1024 dimensions and pay the storage cost, or you train a smaller model from scratch and rerun the whole indexing pipeline.
Matryoshka Representation Learning (Kusupati et al., 2022) kept the single-vector shape but changed how it was trained. The model is trained so that the first 256, 512, 1024 dimensions of its output are themselves usable embeddings: nested prefixes, like a Russian doll, each one a valid retrieval vector at its own dimension. One model, multiple effective dimensions, chosen at deployment time without retraining. OpenAI's text-embedding-3 family supports shortenable output dimensions; nomic-embed-text-v1.5, gte-v1.5, and a growing number of newer open models train this way. A production system can now store passages at 768 dimensions for high-recall search and compare only the first 128 dimensions of both query and passage vectors when latency matters more: same model, several operating points.
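At deployment time, using a shorter prefix usually means slicing the vector and rescaling it to unit length, so that cosine and dot product agree again. The sketch below assumes exactly that and uses a made-up four-dimensional vector, not real model output. Query and passage vectors have to be cut to the same length before they are compared.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` coordinates and rescale to unit length."""
    prefix = vector[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0, 0.0]   # toy vector, not real model output
short = truncate_and_normalize(full, 2)

print(short)
```

The same slice applied to a model that was not trained with a Matryoshka objective still runs, but nothing guarantees the prefix preserves the full vector's ranking.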
Comparing a Matryoshka-trained model (nomic-embed-text-v1.5) against a non-Matryoshka one (all-MiniLM-L6-v2) under the same truncation makes the difference visible. The non-MRL model's cosines drift as dimensions are dropped: the leading prefix isn't a meaningful embedding, it's an arbitrary projection. The MRL model holds its full-dim ranking down to 64 dimensions.
NV-Embed and decoder-LLM encoders (2024)
The 2022 family (E5, BGE, arctic-embed) was built on encoder-only transformers, roughly 100M to a few hundred million parameters. From 2024 onward, a new line takes the same contrastive recipe but swaps the backbone for a decoder LLM: NV-Embed (NVIDIA, 2024), gte-Qwen2 (Alibaba, mid-2024), Linq-Embed-Mistral (Linq AI), Stella. The encoder is now a 1.5–7B-parameter pretrained language model (Qwen, Mistral, LLaMA), with the final hidden state pooled into an embedding. The contrastive training recipe is the same as E5/BGE; what changed is the size and pretraining quality of the encoder underneath.
The car-example numbers show the difference. E5 with the right prefix produces cosines like paraphrase: 0.90, unrelated: 0.66 — a gap of about 0.24. gte-Qwen2-1.5B produces paraphrase: 0.82, unrelated: 0.37 — a gap of about 0.45, almost twice as wide. What gets better is not the related-pair cosine; it's the unrelated-pair cosine. The larger pretrained backbone has stronger general-knowledge representations, and it pushes unrelated text much further apart in the embedding space. Retrieval quality on real corpora improves not because relevant passages move closer to the query, but because the long tail of plausibly-relevant-but-wrong passages moves further away.
The cost is real: 1.5–7B parameters, slower inference, more memory per vector. For a year decoder-LLM embeddings topped MTEB by significant margins; by 2026 they're a mainstream baseline wherever serving budget allows, though most production systems still default to E5/BGE-class models for cost reasons.
BGE-M3 and hybrid retrieval (2024)
Up to this point every milestone in this note has been one representation per text — one dense vector, period. BGE-M3 (BAAI, January 2024) broke that assumption in production. A single forward pass through the model returns three signals at once:
- Dense: the familiar single-vector embedding, cosine-scored.
- Sparse-lexical: per-token weights similar in spirit to BM25 (and to SPLADE, the 2021 model that pioneered learned-sparse retrieval). Captures exact-term matches that dense embeddings often soften — model numbers, identifiers, rare domain terms.
- Multi-vector: ColBERT-style late interaction across token vectors. Captures token-level evidence that single-vector pooling loses.
At query time, the three scores are combined — typically by Reciprocal Rank Fusion or a weighted sum — into one final ranking. The point isn't that any single signal is best; it's that the three signals are complementary. Dense beats sparse on car ↔ automobile; sparse beats dense on policy ID #INS-2847; multi-vector beats both on long technical passages where token-level evidence is dispersed.
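Reciprocal Rank Fusion is the simpler of the two combination rules: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The sketch below uses the conventional constant k = 60; the document ids and the three rankings are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_c", "doc_a", "doc_d"]
multi_vector_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion(
    [dense_ranking, sparse_ranking, multi_vector_ranking]
)

print(fused)
```

RRF only looks at ranks, so it needs no calibration between cosine scores, lexical weights, and late-interaction sums. A weighted sum can do better when the scores are comparable, at the price of tuning the weights.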
By 2026 this template (dense + sparse + multi-vector from one model, combined at query time) is increasingly common in high-quality retrieval stacks — it absorbs both the contrastive single-vector recipe and the ColBERTv2 late-interaction recipe into one operational pipeline, which is why ColBERTv2 as a standalone milestone has largely been folded into hybrid models like BGE-M3.
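The late-interaction signal mentioned here is usually scored with ColBERT's MaxSim operator: for each query token, take its best match among the document's token vectors, then sum those maxima. The sketch below assumes plain dot products and uses made-up two-dimensional token vectors; real systems add normalization, compression, and an index.

```python
def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum, over query tokens, of the best dot product with any doc token."""
    total = 0.0
    for q in query_token_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_token_vecs)
    return total

# Toy token vectors (invented for illustration).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
score = maxsim_score(query_tokens, doc_tokens)

print(score)
```

Each query token is matched to a different document token here, which is the kind of dispersed evidence a single pooled vector averages away.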
Where the toy example stops working: 2024–2026 directions
Up to this point, every milestone has shown its difference on the same four insurance sentence pairs. From here on the gains are real, but they live in domains that four short English sentences can't reach, so the toy example stops being the right thread.
Long-context embeddings. Voyage (now part of MongoDB), Cohere Embed v3+, and jina-embeddings-v3 stretched the context window from the 2022 encoder era's 512 tokens into the 8K–128K range. The limit is model-specific: jina-embeddings-v3 sits around 8K, Voyage models around 32K, and Cohere Embed v4 goes up to 128K. This reduces the need for aggressive pre-chunking and enables document-level or section-level representations, but it does not make chunking disappear: a single vector over a long document is still a lossy compression, and production stacks routinely mix long-context embeddings with chunked retrieval, late chunking, and hierarchical search depending on the data. The car example uses 12-token sentences, so the context window is irrelevant here. On real RAG over long technical documents, though, long-context embeddings have become a serious option where the 512-token cap left no good alternative.
Multimodal embeddings. Voyage-multimodal and Cohere Embed v4 put text and images in a shared vector space, so a paragraph and a screenshot can be neighbors. On a text-only insurance example there's nothing to demo, but a production RAG over slide decks, scanned policies, or product catalogs increasingly uses one of these instead of a separate text + image stack.
Specialized verticals. Voyage-code for code, dedicated legal and biomedical embeddings, mega-multilingual models (BGE-M3 covers 100+ languages). The "one model for everything" assumption has loosened — production retrieval over a specific corpus often does better with a verticalized model than the latest general SOTA. None of this changes the insurance cosines, but it's why MTEB rankings stopped being the only thing teams look at.
ColBERTv2 (standalone). The 2021 multi-vector / late-interaction model is still in production at some serious shops (Vespa, RAGatouille), but as a standalone milestone it has been largely absorbed into hybrid models like BGE-M3, which produce the same multi-vector signal as one output among several. On four short sentence pairs there's no measurable difference from single-vector dense; the gain is on long technical passages where token-level evidence is dispersed.
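The late-interaction score itself is small enough to sketch. The version below is a simplified MaxSim in pure Python: for each query token vector, take its best dot product among the document's token vectors, then sum those maxima. The 2-d token vectors are invented for illustration; a real model produces one higher-dimensional vector per token.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best-matching
    document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy unit-length token vectors (made up, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has a strong match for both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.6]]              # only the first query token matches well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token gets to find its own best evidence, a passage can score well even when the relevant tokens are scattered, which is the case single-vector pooling tends to blur.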
Instructor and the instruction-conditioning trend. The 2022 Instructor model itself isn't widely deployed anymore, but the idea — prepend a task instruction so the same model produces task-specific embeddings — propagated into gte-instruct, E5-instruct, and most decoder-LLM embedders by default. On our four pairs, the instruction prefix barely moves cosines; on cross-task retrieval (the same model serving search, clustering, classification, code search) it's the mechanism that makes one model viable across all of them.
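Mechanically, instruction conditioning is just string construction before the text reaches the encoder. A minimal sketch follows; the "Instruct: ... / Query: ..." layout and the helper name build_instructed_query are illustrative assumptions, since each real model documents its own exact template.

```python
def build_instructed_query(task_instruction, query):
    """Prepend a task instruction to a query before embedding it.

    The layout used here is an illustrative assumption, not any
    specific model's required format.
    """
    return f"Instruct: {task_instruction}\nQuery: {query}"

query = "is my car covered if I'm driving abroad?"

# Same text, two tasks: the encoder sees two different inputs,
# so it can place them differently in the vector space.
for_search = build_instructed_query(
    "Given a customer question, retrieve policy passages that answer it", query
)
for_clustering = build_instructed_query(
    "Identify the topic of this customer message", query
)
print(for_search)
print(for_clustering)
```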
Distillation. Small models (~100–300M parameters) now routinely come close to large-model quality by being distilled from 7B teachers. A distilled 200M-parameter embedding model often comes within 2–3% of a 7B model on MTEB at a fraction of the inference cost. This is a deployment story, not a representation story, but it's why "use the biggest embedding you can afford" stopped being good advice by 2026.
The common thread under all of this: 2024–2026 didn't bring a new representational paradigm. The dual-encoder contrastive template from DPR is still doing the work. What changed is the operating envelope it can be deployed in — bigger backbones, longer contexts, multiple modalities, multiple signals from one model, domain specialization, and aggressive distillation back down to small.
What general retrieval embeddings still don't fix
Running quietly underneath all of these branches is synthetic training data: instead of mining noisy pairs from the web, generate them with LLMs ("given this passage, write three questions it could answer", "rewrite this query as someone less technical would phrase it"). That expands training into domains and languages where natural pairs are sparse, and by 2026 it's a standard part of how the leading embedding models get trained.
But even with all of this, it's worth being honest about what general retrieval embeddings still don't solve. A general-purpose retrieval embedding is, by construction, a compromise geometry — one space that has to work well enough for medical literature and legal contracts and customer-support tickets and code search, with no real way to be optimal for any of them. The geometry that minimizes contrastive loss across the whole web is not the geometry that minimizes it on a single insurance KB. This is why production retrieval stacks layer on top of these models — fine-tuning on domain data, prompt-style instructions, hard-negative mining from the actual deployed corpus, and reranking on top — rather than trusting a generic embedding to be perfectly aligned to a specific task. Each generation of these models is genuinely better than the one before it, but the gap between "a strong general embedding" and "a strong embedding for your corpus" is still real, and most of the remaining engineering happens in that gap.
What each step actually fixed
Each step solved one kind of blindness and exposed the next one.
- BoW could count words, but couldn't see synonyms.
- TF-IDF could tell rare words from filler, but still lived inside exact tokens.
- BM25 could weight saturation and document length, but still depended on shared vocabulary.
- LSA discovered latent similarity from co-occurrence, but was heavy and corpus-bound.
- Word2Vec made meaning local and scalable, but froze each word into one vector.
- BERT made words contextual, but its raw representations weren't retrieval-ready.
- SBERT turned BERT into a sentence-level encoder, but trained on general similarity, not retrieval specifically.
- DPR trained two encoders for retrieval as a task, but only on small curated question-answer datasets.
- E5 / BGE / text-embedding-3 (2022–2024) scaled DPR's contrastive recipe to web-size, with explicit query/passage role conventions baked into deployment (E5-style query:/passage: prefixes, BGE-style task instructions, OpenAI's dimensions parameter; all model-specific), but still on encoder-only backbones and inheriting the same corpus-alignment limits.
- Matryoshka representations (2022) kept the single-vector shape but made it truncatable: one stored vector serves multiple latency tiers, at the cost of slightly higher training complexity.
- NV-Embed / gte-Qwen / decoder-LLM encoders (2024) swapped the backbone for a 1.5–7B-parameter pretrained language model — the related cosines barely move, but unrelated cosines drop hard, so discrimination roughly doubles.
- BGE-M3 and hybrid retrieval (2024) broke the "one vector per text" assumption — one model emits dense + sparse + multi-vector signals, combined at query time. Absorbed ColBERTv2 late interaction as one signal among three.
- Long-context, multimodal, verticals, distillation (2024–2026) rounded out the operating envelope — full documents in one vector, text and images in one space, code/legal/biomedical specialists beating general SOTA on their domains, distilled small models matching large-model quality. None of these added a new representational paradigm.
Read down that list, and the modern stack stops looking arbitrary. Each technique is the answer to a specific problem the previous one couldn't see — and none of them is the final answer.
Where we ended up
Sixty years, and the question is the same. The answer has changed.
Take a single query — "is my car covered if I'm driving abroad?" — and trace how each era would have represented it.
An early bag-of-words system sees a bag of tokens: {is, my, car, covered, if, I, am, driving, abroad}. It matches against documents that share those tokens, weighted by nothing in particular.
TF-IDF (1972) keeps the same bag, but kills the weight on is, my, I, am (they're everywhere) and promotes covered, driving, abroad (rarer, informative). Matching gets sharper.
BM25 (1994) keeps TF-IDF's intuition, but adds saturation (the tenth abroad doesn't matter much more than the first) and length-normalization (long documents don't auto-win). For thirty years this is good enough for the entire search industry, and it still is for any query that involves an exact term.
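Saturation and length normalization are easier to see in code than in prose. This is a minimal sketch of Okapi BM25 scoring over whitespace-tokenized documents. The parameter values k1=1.5 and b=0.75 and the smoothed IDF variant are common choices that I am assuming here, and the three documents are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 controls term-frequency saturation; b controls length normalization.
    Both defaults are assumptions, not tuned values.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = 1 - b + b * len(d) / avg_len  # longer documents are penalized
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # no shared token, no contribution
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # The tf term grows with repetition but flattens out (saturation).
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "your car is covered while driving abroad".split(),
    "abroad abroad abroad abroad travel tips".split(),
    "the automobile policy applies outside the country".split(),
]
query = "car covered driving abroad".split()
scores = bm25_scores(query, docs)
print(scores)
```

The second document repeats abroad four times and still loses to the first, and the third document scores zero despite being about the same topic, because it shares no tokens with the query. That zero is the blindness the next step addresses.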
LSA (1990) is the first to leave the bag-of-words family. The query becomes a 200-dimensional dense vector via SVD on the corpus's term-document matrix. car and automobile finally land in the same neighborhood, and so do covered and insurance.
Word2Vec (2013) ditches the matrix algebra and trains a neural network to predict a word's context from the word itself; the same car ≈ automobile geometry emerges from a model that scales to billions of tokens. Vector arithmetic works: king − man + woman ≈ queen.
BERT (2018) finally gives each token a contextual representation. The bank in river bank and the bank in money bank stop colliding. But turning those token representations into a useful sentence-level retrieval vector is still a separate problem.
DPR (2020) specializes BERT for retrieval: two encoders, one for queries and one for passages, trained on question-answer pairs so query→passage similarity becomes the objective.
E5, BGE, text-embedding-3 (2022–2024) take DPR's recipe to web-scale retrieval embeddings. Open models like E5 and BGE make query/passage role handling visible through explicit prefixes or task instructions; commercial APIs such as OpenAI's text-embedding-3 expose different deployment knobs (shortenable dimensions, for instance), while their internal training recipe and role handling stay a black box from the outside.
Matryoshka representations (2022) keep the same single-vector shape but make it nestedly truncatable — the first 64, 128, 256 dimensions of the same output are all valid embeddings at their own dimension. The query lives at multiple sizes at once.
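Using a nested embedding at a smaller size is a two-step operation: keep the leading coordinates, then renormalize so cosine similarity still behaves. A minimal sketch, assuming the vectors come from a model trained with a Matryoshka-style objective (truncating an ordinary embedding this way is not guaranteed to preserve quality). The 8-d vectors below are made up and stand in for real model output.

```python
import math

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` coordinates, then rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

def cosine(u, v):
    # Inputs are unit-length after renormalization, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

# Invented "full" embeddings for a query and a passage.
query_full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.2, 0.1]
passage_full = [0.5, 0.6, 0.3, 0.4, 0.1, 0.3, 0.1, 0.2]

for dim in (2, 4, 8):
    q = truncate_and_renormalize(query_full, dim)
    p = truncate_and_renormalize(passage_full, dim)
    print(dim, round(cosine(q, p), 3))
```

In a deployment, the short prefix could serve a cheap first-pass search and the full vector a more precise second pass, both read from the same stored embedding.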
NV-Embed / gte-Qwen / Linq-Embed-Mistral (2024) swap the encoder for a 1.5–7B-parameter decoder LLM. The query is still one dense vector, but the encoder is much larger and trained on far more diverse data — and unrelated passages are pushed much further away, so the long tail of wrong answers stops crowding the right one.
BGE-M3 (2024) breaks the "one vector per text" assumption entirely. The query produces three signals from one forward pass — a dense vector, per-token sparse-lexical weights, and a set of token vectors for late interaction — combined at query time by rank fusion. The single-vector and multi-vector branches converge into one operational pipeline.
And around that, the 2024–2026 operating envelope kept expanding — long-context embeddings (Voyage, Cohere Embed v3+, jina-v3) that stretched the 512-token encoder cap into the 8K–128K range and reduced (without eliminating) the need for aggressive chunking, multimodal models (Voyage-multimodal, Cohere Embed v4) that put text and images in one space, vertical specialists (Voyage-code, legal- and biomedical-tuned models) that beat general SOTA on their domains, and distillation that brings most of the quality down to 200M-parameter inference cost. The query is still a vector, but the choice of which vector — at which dimension, with which modality, from which specialized model — has become a meaningful design decision in itself.
Sixty years compressed into a paragraph. The progression isn't ornamental — every step is still alive somewhere in production. BM25 and other sparse retrieval handle exact identifiers and lexical precision; contextual encoders handle semantic search; contrastive-trained models handle retrieval-as-task. The 2026 stack is not a replacement for what came before, it's a coexistence: hybrid systems run BM25 next to dense retrievers and reconcile their outputs.
The history of retrieval isn't a march from stupid methods to smart ones. It's a gradual realization that text has many kinds of similarity. Sometimes similarity means exact words. Sometimes it means shared topics. Sometimes it means answerability. Modern search works because it stopped choosing one geometry and learned to combine several.