Chapter 1 of 3
Embeddings: How Geometry Pretends to Be Meaning
Created May 19, 2026
We talk about embeddings as if they're just a technical detail. Text becomes a vector. The vector goes into a vector database. The user's query also becomes a vector, and the system searches for the closest ones.
At the level of the diagram, it all looks almost too tidy.
text -> embedding -> vector search -> retrieved context -> LLM answer
But if you stop here, the idea starts to look strange. Why should the distance between arrays of numbers have anything to do with the distance between meanings? Why does "car engine overheating" end up next to "vehicle temperature problem," even though the words barely overlap? Why do "cat," "kitten," and "small domestic animal" live in the same region of space, even though as strings they're built differently? Why does retrieval sometimes work perfectly, and sometimes return something roughly on-topic and completely useless on substance?
Embeddings start to make sense only once you stop seeing them as "numbers for text." They're not an encoding. They're an attempt to make geometry behave as if it carried meaning. Not to store meaning inside the numbers. Not to build a dictionary. But to train a model so that similar linguistic situations occupy similar positions in space.
From this point on, embeddings stop being magic. They become an engineering system with a specific physics, with losses and trade-offs.
Take a concrete system; we'll come back to it later. A policy KB for an insurance company: coverage terms, claims procedures, plan limits, legal disclaimers. About a million chunks. Queries like "is my car covered if I'm driving abroad for three months?" or "do I need pre-authorization before an MRI?" We need the system to find the right rule, not "something topically similar." Most of this article is about the distortions the geometry of embeddings introduces into a task like that, and where to manage them.
Embedding is compression
An embedding isn't text. It's the trace text leaves inside the model.
When we turn a paragraph into an embedding, we're asking the model to do a brutal thing: take a rich, ambiguous, contextual piece of language and compress it into a fixed-length vector. That piece may have contained topic, tone, entities, relations, causality, time, exceptions, legal qualifications, unstated context. All of it has to pass through a narrow gate of fixed size.
384 numbers. 768. 1024. 1536. Sometimes more.
From the outside, it's an array of floats. From the inside, it's a bottleneck.
Like any bottleneck, it can't preserve everything equally well. It preserves what was useful for the training objective. If the model was trained to bring similar texts close and push dissimilar ones apart, it preserves the features that help solve precisely that task. Not "truth." Not the full text. A compressed geometry of useful similarity.
That's why an embedding can amaze and frustrate at the same time. It catches that two texts are talking about the same problem in different words — and immediately loses the small condition that flips the answer. It understands the topic — but mixes up the roles of the entities. It pulls together documents that are similar on general topic but don't answer the question.
It's not a bug in the "it broke" sense. It's a consequence of compression.
How language becomes geometry
The tempting wrong picture: the model knows the meanings of words, then arranges them by coordinates.
It's the other way around. The model doesn't start with a dictionary. It starts with a huge number of language examples and a task that forces it to predict, distinguish, match, or reconstruct text. To solve such a task well, it has to build internal representations in which similar linguistic situations are handled similarly. If two phrases often appear in similar contexts, it's useful for the model to have similar internal states for them. Not because it "understands." Because similar structure helps it predict.
For many modern retrieval-oriented embedding models, this task has a name: contrastive learning.
The idea is almost embarrassingly simple. The model is shown pairs (query, positive) — two texts we know should be close. A question and its correct answer. A title and a paragraph from the same article. The same text in two languages. A paraphrase and its original. For each positive, we also need negatives — texts that should be far away. And the embedding is trained so the distance to the positive shrinks and the distance to the negatives grows.
The key technical trick is in-batch negatives. If a batch has 256 pairs, then for each query the positive is its own partner, and the remaining 255 positives automatically serve as negatives. You don't need to label negatives separately: positives have already been gathered somehow — manually, heuristically, or synthetically — and negatives come from them for free. The cost is false negatives: sometimes another document in the batch actually is relevant to the query, and the model gets a noisy signal. On large, diverse batches that noise is usually small, but it isn't zero.
In one step, the model learns to solve the task: "here's a query and 256 candidates — which one is the real partner?" The loss function expressing this — InfoNCE — is built as a softmax over dot products: the correct pair has to be ranked above all the others.
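For readers who do want to see the game written down, here is a minimal sketch of InfoNCE with in-batch negatives. It is pure Python on toy 2D vectors, not a training loop; the function name `info_nce`, the toy batch, and the default temperature of 0.05 are illustrative assumptions, and the vectors are assumed to be L2-normalized already.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] and positives[i] are a matched pair; every other
    positives[j] (j != i) acts as a negative for queries[i].
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy where the "correct class" is the true partner.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 3 pairs (hypothetical unit vectors).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # every partner is the closest candidate
shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]   # partners are not the closest candidate

print("aligned batch loss: ", info_nce(queries, aligned))
print("shuffled batch loss:", info_nce(queries, shuffled))
```

The loss is near zero when each query ranks its own partner first and large when it does not, which is all the softmax over dot products is asking for.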
Forget the formulas: the idea is that the model plays the same game over and over — find the real partner among lookalikes. And the embedding space gradually becomes a board on which that game is easy to win.
The scheme scales pleasantly. The bigger the batch, the more "free" negatives, and the finer the distinctions the model is forced to learn. That's why modern embedding models are trained on batches of thousands, and some teams go further and add hard negatives: pre-selected documents that look superficially similar to the positive (shared keywords, overlapping topic) but don't actually answer the query. This is the most valuable signal: the model learns to tell almost-right from right, which is critical for retrieval.
The geometry is born out of this mechanic. After millions of such steps, the space deforms: regions that often turned out to be "correct pairs" contract; regions that diverged spread apart. "cat" and "kitten" end up near each other not because someone wrote down a rule, but because training pressure repeatedly pushed them into related contexts. Same with "vacation policy" and "time off rules," with "refund" and "money back."
Meaning, in this view, is not placed into the vector. It is induced by training pressure.
The embedding space isn't a small encyclopedia. It's a space in which the statistical structure of language plus a training objective have turned into distances, directions, and densities. That's why embeddings work well where meaning can be approximately reconstructed from distributional patterns. And break down where surface contextual closeness isn't the same as the closeness we actually need.
Anatomy of an embedding
Between "the model is trained" and "I have a vector" there are a few steps no one talks about, but they explain half the surprises in production.
Inside a BERT-style model, text is first split into tokens, and each token gets its own contextual vector. If the text has 50 tokens, the model outputs 50 vectors. But retrieval wants one vector per text. So those 50 have to be combined somehow. This step is called pooling, and it determines what actually ends up in the embedding.
Two main variants:
- CLS pooling: take the vector of the special [CLS] token, which the model learned during training to use as a "summary."
- Mean pooling: average the vectors of all tokens.
This isn't cosmetic. The CLS vector tends to concentrate "topical signal," whereas the mean vector represents the text as a blend of its parts. Modern models like E5 or BGE typically use mean pooling, and you can't override it at inference time: if you take an embedding with one pooling and then switch to another somewhere downstream, you'll get a vector from a different space. Geometry will shift.
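The two pooling variants can be sketched in a few lines. This is a toy illustration on hand-written token vectors, not any library's implementation: the function names, the 3-dimensional vectors, and the attention mask are assumptions made for the example.

```python
def cls_pool(token_vectors):
    """Use the vector at position 0, where BERT-style models place [CLS]."""
    return list(token_vectors[0])

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Hypothetical contextual vectors for "[CLS] reset password [PAD]".
tokens = [
    [0.9, 0.1, 0.0],  # [CLS]
    [0.2, 0.8, 0.0],  # reset
    [0.1, 0.3, 0.6],  # password
    [0.0, 0.0, 0.0],  # [PAD], masked out below
]
mask = [1, 1, 1, 0]

print("CLS pooling: ", cls_pool(tokens))
print("mean pooling:", mean_pool(tokens, mask))
```

The same token vectors give two different sentence vectors, which is why the pooling a model was trained with has to be the one used at inference time.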
The next non-obvious thing is the asymmetry of retrieval. Query and document are not the same thing. A query like "how do I reset my password?" is short, idiomatic, grammatically a question. A document that answers it is long, formal, expository. If you train a model to handle both types identically, it ends up as a compromise — good for neither.
That's why many modern models — E5, BGE, instructor — are trained on pairs with explicit roles, and ask you to mark the role at inference time. For E5 it looks like a prefix:
query: how to apply for vacation
passage: To request time off, submit a request via...
These prefixes aren't "query decoration." They change which part of the space the text lands in. Forgetting the prefix, or swapping query and passage, is a textbook production bug when migrating from OpenAI to an open-source model: everything works, retrieval still finds something, but quality silently drops — a realistic range is, for example, 5–15% nDCG (a standard retrieval ranking metric; higher is better), with the exact number depending on model and corpus. And a week will pass before anyone notices.
While we're here, the symmetric case is worth calling out — it's often skipped. The asymmetry described above applies when query and document are structurally different: a short question against a long document. But retrieval can also be symmetric — when the two texts being compared are of the same nature, and neither is structurally a "query." Finding similar tickets in a base of already-labeled tickets. Paraphrase mining. Semantic deduplication. Clustering. Mental test: if you swap the two sides, does the task stay the same? Then it's symmetric.
In the symmetric regime, the E5 FAQ prescribes using query: on both sides — not because "both are queries," but because such a task is easier to solve in one half of the space, and the query half was trained to be suited for it. Applying the asymmetric query: / passage: scheme to a symmetric task (or, conversely, leaving both sides under query: for classical RAG) breaks retrieval just as silently as forgetting the prefix entirely. The geometry, again, isn't the one the model was trained to work in. Choosing the regime is an engineering decision that has to be made deliberately, and it's easy not to notice it as a decision at all.
It's also worth separating which models all of this even applies to. The prefix scheme isn't universal, and transferring the rule "always prefix query and passage" to a model where it doesn't work is a separate way to quietly hurt retrieval. By how the role mechanism is implemented, embedding models fall into three categories.
Text-prefix models. The role is written directly into the text before the content. All of them were trained with explicit prefixes as part of the input, and using them without a prefix, or with the wrong one, produces the silent quality degradation we just described. The exact prefix format varies by model — code written for E5 won't transfer to BGE, and vice versa:
| Model | Query side | Document side |
|---|---|---|
| E5 | query: how do I reset my password? | passage: To reset your password, go to settings… |
| BGE | Represent this sentence for searching relevant passages: how do I reset my password? | To reset your password, go to settings… (no prefix on docs) |
| instructor | Represent the medical question for retrieving relevant answers: what causes diabetes? | Represent the medical answer: diabetes is caused by… |
| nomic-embed-text | search_query: how do I reset my password? | search_document: To reset your password, go to settings… |
| arctic-embed | Represent this sentence for searching relevant passages: how do I reset my password? | To reset your password, go to settings… (no prefix on docs) |
Forms don't transfer between models: applying E5's query: to BGE input does nothing useful, and pasting BGE's long instruction in front of E5 input just inflates the token count without invoking the role mechanism. The asymmetry and symmetry discussion above is all about this category.
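One way to keep these conventions out of scattered string concatenations is a single lookup that every code path goes through. The sketch below copies the prefixes from the table above; the dictionary keys, the `with_role` helper, and the choice to leave instructor out (its instruction is task-specific) are assumptions for illustration, and each model card remains the authority on the exact strings.

```python
# Prefix conventions as listed in the table above. Verify against each
# model card before relying on them; the family keys are hypothetical labels.
BGE_STYLE_QUERY = "Represent this sentence for searching relevant passages: "

PREFIXES = {
    "e5": {"query": "query: ", "document": "passage: "},
    "bge": {"query": BGE_STYLE_QUERY, "document": ""},
    "nomic-embed-text": {"query": "search_query: ", "document": "search_document: "},
    "arctic-embed": {"query": BGE_STYLE_QUERY, "document": ""},
}

def with_role(model_family, role, text):
    """Prepend the role prefix a text-prefix model expects."""
    if model_family not in PREFIXES:
        raise ValueError(f"no prefix convention recorded for {model_family!r}")
    if role not in PREFIXES[model_family]:
        raise ValueError(f"unknown role {role!r}; expected 'query' or 'document'")
    return PREFIXES[model_family][role] + text

print(with_role("e5", "query", "how do I reset my password?"))
print(with_role("bge", "document", "To reset your password, go to settings"))
```

Failing loudly on an unknown model family is deliberate: a missing entry should stop the pipeline rather than silently embed unprefixed text.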
API-param models. The asymmetry is there, but it's implemented through an API parameter, not text. Cohere embed v3 takes an input_type with values search_document, search_query, classification, clustering. The model does the same thing a text-prefix model does, but the prefix is applied server-side, and the calling code doesn't concatenate anything. Conceptually it's the same two or three regimes of the space, just a switch that's a parameter instead of a string. Passing the wrong input_type (say, search_document on the query path), or relying on a default where the client allows one, is the same kind of silent bug as forgetting passage: in E5.
Role-agnostic models. One model for everything, no documented role distinction. OpenAI text-embedding-3 (small and large) doesn't expose or document separate query/document modes; in practice, treat it as role-agnostic unless your own evaluation proves that a custom prefix helps. Classical Sentence-BERT models (all-MiniLM-L6-v2, all-mpnet-base-v2) and most early open-source embedding models fit the same category. For models like these, a prefix like query: is just extra tokens the model wasn't trained to interpret — it nudges the geometry toward noise without an obvious upside. The right move is not to prefix.
Rule of thumb for an unfamiliar model. Open the model card. If the usage examples contain prefixes (query: ..., passage: ..., Represent ...:), they're required. If the API docs have a parameter like input_type, the model is asymmetric server-side and that parameter has to be passed. If neither is present, the model is role-agnostic — don't prefix. The doubt is resolved by a simple experiment: compute cosine similarity on the same pair with and without a prefix. A noticeable difference means the model "hears" the prefix and the regime matters. Identical results mean the model is indifferent to it.
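The experiment can be sketched as a small probe that takes any encoder as a callable. Everything here is illustrative: `prefix_sensitivity` is a hypothetical helper, and `toy_encode` is a letter-count stand-in that exists only so the sketch runs without a model. In practice you would pass your real model's encode function and look at several text pairs, not one.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def prefix_sensitivity(encode, query, passage, query_prefix, passage_prefix):
    """Absolute change in cosine similarity when role prefixes are added."""
    plain = cosine(encode(query), encode(passage))
    prefixed = cosine(encode(query_prefix + query), encode(passage_prefix + passage))
    return abs(plain - prefixed)

def toy_encode(text):
    """Stand-in encoder (letter counts). Replace with a real model's encode call."""
    return [text.lower().count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]

delta = prefix_sensitivity(
    toy_encode,
    "how to apply for vacation",
    "To request time off, submit a request",
    "query: ",
    "passage: ",
)
print("change in cosine with prefixes:", round(delta, 3))
```

A change near zero across many pairs suggests the model ignores the prefix; a consistent shift means the regime matters and has to be chosen deliberately.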
Migration between categories costs more than it looks. Moving from OpenAI to E5 isn't "we'll just change the model name in the config." It's: add text-prefixing to the pipeline, separate the query path from the passage path in code, prefix the diagnostic scripts and the evaluation harness too, agree on a convention so nobody accidentally calls model.encode(text) without a wrapper. And only after all of that — reindex. A model's category is a contract between the embedding service and everything around it, and it can't be changed in one line of config.
The last thing worth knowing about anatomy: embeddings from different models are not comparable to each other. A vector from OpenAI text-embedding-3-small and a vector from E5-large live in different spaces, with different dimensions, different geometries, and different training objectives. You can formally compute cosine similarity between them, but the number won't mean anything. Which means: changing the embedding model is always a full reindex of the database. Not "we'll switch the API." It's "we'll recompute millions of vectors and rebuild the index from scratch."
Similarity is not one thing
When we say "two embeddings are similar," we pretend that similarity is an obvious concept. But there's more than one kind of similarity.
Take a concrete example. In a medical knowledge base, there are two sentences:
A: Recent studies suggest the treatment is safe for patients with mild hypertension.
B: Recent studies suggest the treatment is not safe for patients with mild hypertension.
To an embedding model, these two texts are almost identical. Nearly the same length, same syntax, almost the same vocabulary, same topic, same entities. The cosine similarity between them can be surprisingly high, sometimes high enough to outrank the actually useful evidence. One small "not" got lost in the overall compression. The embedding of the chunk with the opposite statement may end up closer to the query "is this treatment safe?" than the chunk with the correct answer.
This isn't a quirk of one particular model. It's a consequence of how contrastive learning is set up: the training objective often doesn't create enough pressure to separate such pairs as opposite statements. Negation rarely turns out to be the signal the model has to discriminate on in order to do well on the contrastive task. If there's no pressure, the geometry doesn't separate them.
Embeddings have similar difficulty with:
- Entity roles: "John sued Mary" and "Mary sued John" end up close.
- Quantitative differences: "5% growth" and "50% growth" end up close.
- Tense and modality: "may happen" and "did happen" end up close.
- Logical conditions: text with an answer and no condition is close to text with the same answer "only if X."
Embedding similarity isn't magical truth similarity. It's the kind of similarity the embedding model learned in its own space. In production, retrieval doesn't search for "truth." It searches for candidates that look close in the chosen geometry. If this point isn't internalized, you can treat vector search as a smart knowledge base for years and keep wondering why it keeps pulling out almost-right garbage.
Why cosine similarity works at all
Embeddings are almost always compared via cosine similarity.
cosine(A, B) = (A · B) / (||A|| ||B||)
There's nothing mystical about the formula. We're comparing the direction of two vectors and ignoring their length. Intuitively: do these two representations point in the same semantic direction?
Cosine has a technical reason to be the standard. The length of an embedding vector often correlates with the length or "energy" of the source text rather than its meaning. A long document can produce a vector with a larger norm simply because it has more tokens. Cosine normalizes that length out and keeps the angle.
Cosine isn't the only option. Euclidean distance — ‖a − b‖, the straight-line gap between the tips of two vectors — is the other classical choice, and it's tempting because it's geometrically simpler. For raw embeddings it has the same problem as the dot product on its own: the distance mixes the angle (which carries meaning) with the magnitude (which often doesn't). Two semantically related texts of different lengths can end up far apart in Euclidean distance for the wrong reason.
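A tiny numeric example makes the difference concrete. The vectors below are made up: `b` points in exactly the same direction as `a` but is three times longer, while `c` points elsewhere but has the same length as `a`.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a = [1.0, 2.0]
b = [3.0, 6.0]  # same direction as a, three times the length
c = [2.0, 1.0]  # different direction, same length as a

print("a vs b: cosine", round(cosine(a, b), 3), "euclidean", round(euclidean(a, b), 3))
print("a vs c: cosine", round(cosine(a, c), 3), "euclidean", round(euclidean(a, c), 3))
```

Cosine ranks `b` as the closer match to `a`; Euclidean distance on the raw vectors ranks `c` closer, purely because of magnitude.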
But here's the catch: most modern embedding models output L2-normalized vectors — every embedding sits on the unit sphere by construction. On the unit sphere, cosine and Euclidean are not independent metrics; they're related by ‖a − b‖² = 2 − 2·cos(a, b). The function is monotonic, which means ranking by one is identical to ranking by the other. The "cosine vs Euclidean" debate evaporates once vectors are normalized — they give the same top-k.
The convention is still to talk about cosine, partly because the scale (-1 to 1) is more intuitive than Euclidean's (0 to 2 on the unit sphere), and partly because dot(a, b) on normalized vectors is mathematically the same as cos(a, b) — so one fast numpy.dot gets you both, with no division. In production code you usually see normalize once at write time, then dot at query time.
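The "normalize once at write time, dot at query time" pattern, together with the identity that ties cosine to Euclidean distance on the unit sphere, can be checked on a toy index. The document ids and 2D vectors are invented for the example, and pure Python stands in for numpy.dot so the sketch has no dependencies.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sq_euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

# Write time: normalize each vector once and store it.
raw_docs = {"d1": [3.0, 4.0], "d2": [-1.0, 2.0], "d3": [5.0, 0.5]}
index = {doc_id: normalize(vec) for doc_id, vec in raw_docs.items()}

# Query time: normalize the query; a plain dot product is now the cosine.
query = normalize([2.0, 1.0])
by_cosine = sorted(index, key=lambda d: dot(query, index[d]), reverse=True)
by_euclid = sorted(index, key=lambda d: sq_euclidean(query, index[d]))

# On unit vectors, squared distance equals 2 - 2 * cosine.
identity_holds = all(
    math.isclose(sq_euclidean(query, v), 2 - 2 * dot(query, v), rel_tol=1e-9, abs_tol=1e-12)
    for v in index.values()
)

print("ranking by cosine:   ", by_cosine)
print("ranking by euclidean:", by_euclid)
print("identity holds:", identity_holds)
```

Both rankings come out identical, so on normalized vectors the choice between the two metrics is a matter of convention and speed, not of results.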
Two vectors in 2D. Drag the tip of either one to see how cosine similarity (the angle), Euclidean distance (the segment between tips), and the vector lengths respond. Toggle normalize to unit length: both vectors snap to the unit circle, and the two metrics become monotonically equivalent. Every increase in cosine is matched by a decrease in Euclidean distance (proportional in the squared distance, not in the distance itself). The "which metric to use" question stops mattering the moment normalization happens.
But there's a less obvious reason cosine is the right metric, and it matters more. Inside a BERT-style model that hasn't been fine-tuned specifically for retrieval, the embedding space turns out to be anisotropic: all vectors lie inside a narrow cone, very close to each other in angle. The cosine similarity between two random unrelated texts there can be surprisingly high — on the order of, say, 0.8 (with the exact number depending on layer and model) — because all texts in that space basically point in the same direction. In a space like that, cosine doesn't measure meaning; it measures noise.
That's exactly why modern embedding models are trained with a contrastive loss and a temperature τ that deliberately stretches the overall space. The temperature in InfoNCE — exp(sim/τ) — controls how sharply the model "penalizes" close negatives. A small τ (0.01–0.05) makes the distribution sharp: positives have to be substantially closer than negatives, and the model is forced to spread the space out to make that possible. The result is a more isotropic geometry in which cosine actually carries signal.
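To see what "sharp" means here, take one query with three candidates and compare the softmax at two temperatures. The similarity values are hypothetical: a positive at 0.80, a hard negative at 0.70, and an easy negative at 0.10.

```python
import math

def softmax_with_temperature(sims, tau):
    """Softmax over sim / tau, computed in a numerically stable way."""
    logits = [s / tau for s in sims]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

sims = [0.80, 0.70, 0.10]  # positive, hard negative, easy negative (made up)

for tau in (1.0, 0.05):
    probs = softmax_with_temperature(sims, tau)
    print("tau =", tau, "->", [round(p, 3) for p in probs])
```

At τ = 1.0 the three probabilities stay fairly close together, so a 0.1 gap in cosine barely registers in the loss. At τ = 0.05 the same gap is amplified: the easy negative drops out almost entirely, and the remaining competition is between the positive and the hard negative.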
Put differently: contrastive training doesn't only pull similar things together. It simultaneously pushes the space apart, so that similar texts have a chance to be genuinely similar — rather than drowning in a chorus where "every vector points the same way."
Thirty vectors that start clustered inside a narrow cone (the anisotropy that raw BERT-style embeddings have). Press train: contrastive pressure pulls labelled pairs together and pushes everyone else apart. The cone unfolds into a full circle, and the average cosine similarity between random pairs drops from near-1 toward zero — geometry that cosine can actually read. Drag the temperature τ slider: small τ is aggressive separation, larger τ is gentler.
A comparison metric doesn't rescue a bad space. It just reads what's there. A good space is built so that cosine on it means something, because training was specifically designed to make it that way.
That closes the first arc of the article — how the embedding space is built and why comparisons inside it carry any meaning at all. From here on out, we're in engineering parameters: what width of bottleneck we choose, what we push through it, and what native pathologies the resulting space has.
Dimension is the width of the bottleneck
Now it's clearer what embedding dimension is. It's not "the size of the array." It's the width of the channel through which all of text's complexity passes into the fixed representation.
Smaller dimension means harder compression. Fewer coordinates with which to separate different semantic distinctions. More collisions: texts that, for our task, should be distinguishable end up too close. In return, the index is more compact, distance computation is cheaper, latency is lower.
Larger dimension means more room for nuance. But it's not free. The cost of storage and comparison grows linearly with dimension. Concretely: one float32 value is 4 bytes. One 1024-dimensional embedding is 4 KB. A million documents in 1024-dimensional embeddings is about 4 GB just for the vectors, plus the ANN index's own overhead, which varies by index type and settings. At 384 dimensions, the same million documents take about 1.5 GB. If you have ten million chunks, the difference between 384 and 1024 (roughly 15 GB versus 41 GB of raw vectors) can be the difference between "fits in RAM on one machine" and "doesn't fit."
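The storage arithmetic is easy to keep as a helper when sizing an index. This sketch counts raw float32 vectors only; index overhead, metadata, and replication are deliberately left out, and the function name is illustrative.

```python
def vector_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw storage for n_vectors embeddings of the given dimension (float32 by default)."""
    return n_vectors * dim * bytes_per_value

for n in (1_000_000, 10_000_000):
    for dim in (384, 1024):
        gb = vector_bytes(n, dim) / 1e9
        print(f"{n:>10,} vectors x {dim:>4} dims = {gb:.2f} GB")
```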
Larger dimension also doesn't mean linearly better quality. Past a certain point, additional dimensions go into noise, and diminishing returns set in sooner than many people expect. Which is why "what dimension is best?" is almost always the wrong question. The right one is: what dimension gives enough semantic resolution for this task at acceptable latency, memory, and index cost?
A recent twist is Matryoshka representation learning. The idea: train the model so that the first N coordinates already function as a standalone, smaller embedding. The same vector can be used in full where you need quality, and truncated to 256 or 128 coordinates where you need speed. OpenAI's text-embedding-3 and some open-source models are already built this way. It turns the choice of dimension from a "decide once and forever" call into an adaptive knob.
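Using a Matryoshka-style embedding at a smaller size comes down to two steps: keep the first k coordinates, then re-normalize so dot products still behave like cosines. The sketch below assumes the vector came from a model trained this way; for any other model, the leading coordinates have no special status and truncation simply throws information away. The function name and the toy vector are illustrative.

```python
import math

def truncate_and_renormalize(vec, k):
    """Keep the first k coordinates and rescale the result to unit length."""
    head = vec[:k]
    n = math.sqrt(sum(x * x for x in head))
    if n == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / n for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length "embedding"
short = truncate_and_renormalize(full, 2)
print(short)
```

The re-normalization step is easy to forget: a truncated unit vector is no longer unit length, so skipping it quietly breaks the "dot product equals cosine" shortcut.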
At production scale, vector search quickly becomes a memory problem dressed up as a math problem. Brute-force O(N·d) comparisons are too expensive; ANN indexes are used instead — HNSW (a graph of nearest neighbors), IVF (clustering of the space into regions), product quantization (compression of vectors via sub-vector codes). They don't give you the exact nearest neighbor; they give you good-enough neighbors fast enough. And here dimension surfaces again: large vectors fit in cache worse, miss it more often, and the bottleneck ends up being not the arithmetic of the dot product but how quickly hardware can deliver the right chunks of memory.
Bottom line: dimension isn't about "quality vs. size" in some platonic sense. It's about whether your infrastructure can live with the embedding you picked.
Hubness — the pathology of high dimensions
In high-dimensional spaces, a strange thing happens, known as hubness: certain points suddenly become nearest neighbors to a disproportionately large number of other points. Not because they're "very relevant," but because the geometry of high dimensions is built that way. It's not a bug in the embedding model. It's a property of the space.
In retrieval, this shows up as "vampire" points: the same document keeps appearing in the top-k for the most varied queries, even though it actually fits a smaller share of them. Often it's short, "flat" texts — headers, generic phrases, boilerplate disclaimers. Their vector lands in the "central" part of the space and is formally close to everything.
A practical mini-symptom: if the same chunk keeps showing up in top-3 for completely unrelated queries, it's almost certainly a hub. By scale, this isn't an "academic curiosity": the effect becomes noticeable starting at roughly tens of thousands of documents and any production-scale embedding dimension (on the order of 100+). Practical check: track the distribution of how often each chunk appears in top-k. If the head is suspiciously heavy, you have a hubness or boilerplate problem.
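The practical check can be run over logged retrieval results. In this sketch, `results_per_query` is assumed to be a list of top-k chunk-id lists, one per query, and the 50% threshold is an arbitrary illustration rather than a recommended value; the chunk ids are invented.

```python
from collections import Counter

def topk_frequency(results_per_query):
    """Count, for each chunk id, how many queries had it in their top-k."""
    counts = Counter()
    for top_k in results_per_query:
        counts.update(set(top_k))  # count each chunk at most once per query
    return counts

def suspected_hubs(results_per_query, max_share=0.5):
    """Chunk ids that appear in the top-k of more than max_share of queries."""
    n_queries = len(results_per_query)
    counts = topk_frequency(results_per_query)
    return sorted(c for c, hits in counts.items() if hits / n_queries > max_share)

# Hypothetical top-3 logs for four unrelated queries.
logs = [
    ["disclaimer", "auto-abroad", "auto-limits"],
    ["disclaimer", "mri-preauth", "imaging"],
    ["disclaimer", "vision", "dental"],
    ["claims-form", "claims-deadline", "auto-limits"],
]
print(suspected_hubs(logs))
```

On real traffic the useful output is the whole frequency distribution; the threshold only turns "the head looks heavy" into something an alert can fire on.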
Hubness is fixed not at the level of the embedding model, but at the level of the retrieval pipeline: rescoring via mutual k-nearest neighbors, or simply a reranker that sees query and document together and easily discards "generic" neighbors. A useful reminder: even a perfectly trained embedding space doesn't guarantee perfectly working retrieval. The geometry of high dimensions always adds its own distortions.
Stepping back: up to here, the article has been about the properties of the space itself — how it's built, how to read it, what happens with dimension, what its native pathologies are. From here on, it's about what we put into that space. And as it often turns out, this choice weighs more than the choice of model.
Chunking changes the object you embed
One of the least visible traps in RAG is thinking that chunking is just slicing text before embedding. In reality, chunking changes the very object that ends up in semantic space.
A large chunk — the embedding becomes a representation of a mixture. A definition, an example, an exception, and a footnote can all fall into one piece. The model compresses all of it into a single vector. The result is semantic averaging. A chunk like this matches well on broad topic, poorly on a specific question.
A small chunk — the embedding becomes sharper. Less topic mixing, more precise landing, better precision. But small chunks lose surrounding context. They can contain an answer without its condition, a formula without its explanation, a pronoun without its antecedent, a conclusion without its premise.
In our insurance KB, this means: a chunk with the sentence "vision correction is covered" can end up in the index separately from the next sentence, "only after 12 months of continuous coverage." Retrieval confidently finds the chunk. The LLM confidently answers "yes." And gets it wrong, not because it hallucinated, but because the piece of truth it saw had been cut down until it became a lie.
Chunk size doesn't only govern the number of documents in the index. It governs the shape of the semantic space. Large chunks make it smooth and blurry. Small ones make it detailed but fragmented.
That's why changing chunk size isn't a local setting. It changes precision, recall, the number of vectors, index size, ranking behavior, the load on the reranker, and the quality of final context assembly all at once. Smaller chunks mean a bigger base, more candidate points, each cleaner on its own — but loss of context around the answer and, possibly, a need for a reranker. It's a change in the physics of the whole retrieval pipeline.
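The insurance example can be reproduced with the simplest possible chunker. This is a sketch that groups whole sentences into fixed-size chunks; real pipelines usually split on tokens or document structure, and the three sentences are invented to mirror the example above.

```python
def chunk_sentences(sentences, size):
    """Group consecutive sentences into chunks of `size` sentences each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

policy = [
    "Vision correction is covered.",
    "This applies only after 12 months of continuous coverage.",
    "Claims are filed through the member portal.",
]

for size in (1, 2):
    print(f"chunk size = {size}")
    for chunk in chunk_sentences(policy, size):
        has_answer = "covered." in chunk
        has_condition = "12 months" in chunk
        print(f"  answer={has_answer} condition={has_condition} :: {chunk}")
```

At size 1 the chunk that contains the answer does not contain the condition; at size 2 they travel together, at the cost of a blurrier vector for that chunk.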
An insurance policy paragraph with internal dependencies — an answer in one sentence, the condition that gates it in the next. Drag the chunk-size slider: at large sizes, each chunk holds both the answer and the condition; at small sizes, the answer is in one chunk and the condition in another. For a sample query, the simulator shows which chunk gets retrieved, and whether the LLM ends up looking at the complete rule or just the half that says "yes."
Context window is not embedding dimension
A recurring source of confusion in conversations about LLM systems: a bigger context window and a larger embedding dimension are often compared as if they were about the same thing. Which one matters more?
They're different kinds of capacity.
Context window governs how many raw tokens the model can see at once during generation. The size of the working surface. More tokens means more material in front of the model: retrieved documents, conversation history, instructions, examples, tool results.
Embedding dimension governs how richly a single piece of text can be compressed into a vector representation. Not a working surface. The resolution of a compressed semantic fingerprint.
Embeddings compress. Context window retains. These are almost opposite operations.
A bigger context window doesn't replace retrieval. It changes its role. If you can put more text into a prompt, retrieval can be less aggressive, but the noise problem doesn't go away. The model still has to find the relevant pieces inside a long context. And a long context isn't infinite understanding. It's more material that attention has to fight over.
And here a well-documented phenomenon kicks in: "lost in the middle." When you put a long list of retrieved chunks into a prompt, the model tends to use the beginning and the end far better than the middle. Information that lands in the middle of the context is used much less reliably. In practice: if you have 20 retrieved chunks and the right answer is in the 11th, the model often acts as if it doesn't see it, even though it technically has access. A big context window doesn't solve the retrieval problem; it relocates it. You stop losing on "didn't find" and start losing on "found and dropped in the middle."
A larger embedding dimension also doesn't replace context window. It helps you pick what to show better. The text you found still has to fit in the prompt and actually be used by the model when it generates.
Two different bottlenecks in one system. One determines how well we choose what to show. The other determines how much the model can hold on to, and how well, once we've shown it.
Dense retrieval is not keyword search with AI flavor
Dense embeddings work great on paraphrases and conceptual similarity. That's exactly why they sometimes work worse on exactness.
If the user is searching for an error code ERR_TIMEOUT_4012, a function name parseRequestBuffer, an invoice number, or a legally precise phrase — sparse retrieval can be stronger. BM25 and inverted indexes don't try to understand meaning. They latch onto tokens. And sometimes that's exactly what you need.
Dense search may decide that two texts are close on topic, even though the identifier didn't match. Sparse search may find the exact identifier match, even though the overall meaning of the document isn't ideal.
Hybrid search exists because the two approaches fail in different ways. Sparse preserves lexical precision. Dense adds semantic recall. Together they often make a more robust system. But hybrid isn't "best of both worlds" out of the box. You have to decide how to combine the two result lists (often via reciprocal rank fusion, which merges ranks rather than raw scores), how to normalize scores if you fuse those directly, when to trust an exact match, and when to trust a semantic one. And, again: embeddings aren't a feature you add. They're a change in retrieval physics.
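Reciprocal rank fusion itself is short enough to show in full. Each document earns 1 / (k + rank) from every list it appears in, and the sums are sorted. The constant k = 60 is the commonly used default; the document ids and the two input rankings are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse = ["err-4012-runbook", "timeouts-overview", "retry-policy"]   # e.g. BM25 order
dense = ["timeouts-overview", "network-faq", "err-4012-runbook"]     # e.g. embedding order

print(reciprocal_rank_fusion([sparse, dense]))
```

Because only ranks enter the formula, BM25 scores and cosine similarities never have to be put on a common scale, which is a large part of why this method is a popular starting point.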
Reranking as the second pass
Many of the problems above — negation, role confusion, hubness, topical-but-not-answering — share a common source. A bi-encoder (the model that produces the embeddings) encodes query and document independently. They never "see" each other. All the comparison work happens afterward, as cosine similarity. That's exactly why it's fast: document embeddings are precomputed once, and each query costs only a single inference and a fast ANN lookup.
But independent encoding pays a price. A bi-encoder can't say "wait, the query is about safety — and this document says not." It encoded the query before seeing the document, and the document before seeing the query. The information about the interaction between them is gone.
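The workflow can be sketched in a few lines. Everything here is a toy: `toy_embed` is a hypothetical stand-in that buckets words into a small vector, nowhere near a trained model, and the two documents are invented. The only thing meant to carry over is the order of operations: document vectors exist before any query does.

```python
import math

def toy_embed(text, dim=16):
    """Hypothetical stand-in for one bi-encoder tower.

    Buckets each word by the sum of its character codes. A real tower is a
    trained transformer; only the workflow is meant to carry over.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Offline: documents are embedded once, before any query exists.
docs = {
    "safe": "this drug is safe during pregnancy",
    "not_safe": "this drug is not safe during pregnancy",
}
doc_vectors = {doc_id: toy_embed(text) for doc_id, text in docs.items()}

# Online: one embedding call for the query, then only vector comparisons.
query_vector = toy_embed("is this drug safe during pregnancy")
scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in doc_vectors.items()}
```

With this toy encoder both documents score above 0.9 against the query, despite saying opposite things. That closeness is partly an artifact of the crude encoder, but the structural point holds for real models too: nothing in this pipeline ever looks at query and document together.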
A cross-encoder is built differently. It takes query and document together, as one concatenated sequence, and runs them through a transformer that sees attention between all of their tokens simultaneously. The output is a single number — a relevance score. A cross-encoder can explicitly match safe in the query against not safe in the document and lower the score. It can notice that entity roles don't match. It can tell almost-right from right where cosine gets confused.
The price is speed. A cross-encoder can't be precomputed. Every query-document pair needs a full inference. Comparing a query against a million documents through a cross-encoder isn't a retrieval strategy anymore; it's far too slow and expensive for interactive search.
Which is why many production retrieval systems become two-stage once quality starts to matter:
- Bi-encoder + ANN pull the top-100 or top-200 candidates out of millions in milliseconds.
- Cross-encoder reranker rescores those 100–200 and returns the top-10 that go to the LLM.
This isn't "another optimization." It's a structural answer to the limits of embeddings. Embeddings are there to filter candidates quickly. The cross-encoder is there to re-sort them carefully. On retrieval benchmarks like BEIR (a standard suite of tasks for evaluating retrieval models), a cross-encoder reranker often delivers a large quality lift, sometimes noticeably larger than moving to a heavier embedding model. The price is latency and compute on top of every query.
And that's exactly why an article about embeddings without reranking is a story about an engine without a transmission. They work as a pair.
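The two-stage shape can be sketched as follows. The scoring functions are hypothetical placeholders passed in as arguments: in a real system the first would be an ANN lookup over precomputed vectors and the second a cross-encoder forward pass. The scores in the example are invented.

```python
def two_stage_search(query, doc_ids, cheap_score, careful_score,
                     n_candidates=100, top_k=10):
    """Filter with a cheap scorer, then re-sort survivors with a careful one.

    cheap_score and careful_score are hypothetical callables taking
    (query, doc_id) and returning a number, higher meaning more relevant.
    """
    # Stage 1: cheap scoring over the whole collection. A real system would
    # use an ANN index here instead of scoring every document.
    candidates = sorted(doc_ids, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive scoring, but only over the shortlist.
    reranked = sorted(candidates, key=lambda d: careful_score(query, d),
                      reverse=True)
    return reranked[:top_k]

# Invented scores: "d" is what the careful scorer likes best,
# but the cheap scorer ranks it last.
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
careful = {"a": 0.2, "b": 0.5, "c": 0.9, "d": 1.0}

result = two_stage_search(
    "q", list(cheap),
    cheap_score=lambda q, d: cheap[d],
    careful_score=lambda q, d: careful[d],
    n_candidates=3, top_k=2,
)
```

The result is `["c", "b"]`. The reranker promotes "c" from third place to first, but it never gets to see "d". A second pass can only fix ordering among the candidates the first pass let through.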
Left: a bi-encoder. The query goes through one tower into a vector. The document goes through another tower into a vector. The two vectors meet only at the cosine step — the model never saw query and document together. Right: a cross-encoder. Query and document are concatenated and pass through a single transformer that attends across both. The output is one number — a relevance score that reflects token-level interaction. The diagram makes visible what the prose says: interaction information is either preserved or thrown away, and which one you chose shapes everything downstream.
RAG fails in layers
When RAG returns a bad answer, the easiest thing to say is: "the LLM hallucinated." Sometimes that's true. But often the error happened earlier.
The embedding model may have failed to express the relevant distinction. Chunking may have blurred the answer-bearing passage. Vector search may have returned a broad topical match instead of the specific evidence. Hybrid ranking may have placed the exact match below a semantic neighbor. The reranker may have failed because the right chunk wasn't in the top-100. Context assembly may have dropped the right chunk to fit the token budget. A long context may have buried the answer in the middle. The LLM may have received the evidence — and not used it.
From the outside, all of this looks like one bad answer. From the inside, these are different failure points that need different fixes. And evaluation has to tell them apart: recall@k checks the retriever, not the model; faithfulness checks the model, not the retriever.
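As an illustration of a retriever-only metric, here is a minimal recall@k, assuming you have a labeled set of relevant chunk ids per query. The ids below are invented.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without any relevant ids")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)

# Hypothetical retriever output, best first, and hand-labeled relevant chunks.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = ["c4", "c8"]

recall_at_3 = recall_at_k(retrieved, relevant, k=3)
recall_at_5 = recall_at_k(retrieved, relevant, k=5)
```

Here recall@3 is 0.0 and recall@5 is 0.5. Note what the function never touches: the generated answer. A system can score well here and still answer badly, which is why this metric has to be read alongside a separate check on the model's use of the evidence.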
Embeddings are responsible only for part of the chain: how text becomes searchable semantic geometry. They create the candidate space. If the candidate space is built badly, everything downstream is built on sand.
The same chain from the prose, drawn as a horizontal pipeline: chunk → bi-encoder → ANN → reranker → context assembly → LLM. For each stage, the diagram shows what can go wrong, what the symptom looks like at output time, and the natural class of fix. It is mostly a reminder that "the LLM hallucinated" is one node out of six, and that the question of where the system failed is upstream of the question of what to fix.
Where the system pays
Back to the insurance KB from the beginning. Say it's in production with the following starting configuration:
- ~1M chunks in the index.
- Embedding dimension 1024.
- Bi-encoder + HNSW pull the top-20 candidates.
- A cross-encoder reranker rescores those 20.
- 8 chunks go into the LLM.
These aren't "the right" numbers. They're a starting point. What happens if we pull any one of the levers?
Shrink chunk size from 500 to 200 tokens. Retrieval precision goes up: each chunk is now about a single thought, and the embedding doesn't get blurred. But the answer-bearing chunk is now often missing conditions, antecedents, or exception footnotes, exactly the scenario we just walked through. The chunk base grows by 2–3×, the ANN index along with it. To cover the same amount of answer context, top-k probably has to grow from 20 to 50. One change has cascaded through the whole pipeline.
Raise embedding dimension from 1024 to 3072. Fine distinctions separate better. Memory for the index triples — from ~4 GB to ~12 GB per million chunks. On a large collection, that's the transition from one machine to a cluster. Latency may not triple, because ANN search isn't just raw dot products — but memory pressure and cache behavior get much worse.
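The memory figures follow from simple arithmetic, assuming float32 storage (4 bytes per value) and counting only the raw vectors, not the HNSW graph or metadata on top of them:

```python
def raw_vector_gb(n_vectors, dim, bytes_per_value=4):
    """Size of the raw vectors alone, in GB (10**9 bytes).

    Assumes float32 by default. Graph links and metadata are not included.
    """
    return n_vectors * dim * bytes_per_value / 1e9

size_1024 = raw_vector_gb(1_000_000, 1024)
size_3072 = raw_vector_gb(1_000_000, 3072)
```

That gives about 4.1 GB at dimension 1024 and about 12.3 GB at 3072, which is where the rounded ~4 GB and ~12 GB come from.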
Raise top-k from 20 to 100. Recall goes up — we lose the answer-bearing chunk before the reranker less often. But the cross-encoder is now processing 5× more pairs. If the reranker's raw latency is 30 ms per pair, the budget grew from 600 ms to 3 s per query. Maybe it's time to move to GPU. Maybe to quantize the model. Final answer quality may improve substantially — or barely move, if the bi-encoder was already finding the right chunks in the top-20.
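The latency numbers above are a worst-case estimate. The sketch below spells out the assumption behind them: pairs are scored strictly one after another at a fixed 30 ms each, with no batching.

```python
def rerank_latency_ms(n_candidates, ms_per_pair=30.0):
    """Total reranking time if query-document pairs are scored sequentially."""
    return n_candidates * ms_per_pair

baseline_ms = rerank_latency_ms(20)
expanded_ms = rerank_latency_ms(100)
```

Sequential scoring gives 600 ms for 20 candidates and 3000 ms for 100. Batching pairs on a GPU breaks the linear assumption, which is exactly why the move to GPU shows up as a lever.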
Raise context budget from 8 to 30 chunks. The model sees more evidence. But lost-in-the-middle kicks in for real: middle chunks may get systematically ignored. Sometimes answer quality even drops — because a shorter, more relevant set gave the model less room to err.
Turn on a cross-encoder reranker if it wasn't there. Ranking quality often jumps significantly. The price is a few hundred extra milliseconds per query, a new infrastructure component, and a new signal to monitor separately from the bi-encoder scores.
At this point the system stops asking "which embedding model is best?" It starts asking: where do we want to pay — in RAM, in latency, in annotation, in reranking, or in hallucination risk?
Each of these "taxes" is real. They can't be zeroed out, only redistributed.
Beyond the standard picture
Everything we've discussed is the standard RAG architecture of around 2023: chunk → bi-encoder → ANN → reranker → stuffed context → LLM. It works, but it has several structural weaknesses, which we named along the way:
- Chunking severs the connection between a chunk and its surroundings before the model ever sees it.
- A bi-encoder encodes query and document independently — interaction information is lost.
- Retrieval happens once, up front — no feedback loop if the first attempt is bad.
- The context window grows, but lost-in-the-middle doesn't go away.
In the last couple of years, the main direction of work has been trying to step around each of these weaknesses — not by tuning parameters, but by changing structure.
Late chunking (Jina, 2024) inverts the order: the model first runs the entire document through the transformer, producing contextual token embeddings, and only then pools over chunks. Each chunk gets an embedding that "remembers" the surroundings of the whole document. Small chunks no longer lose antecedents and conditions as severely. Contextual Retrieval (Anthropic, 2024) solves a similar problem more simply: for each chunk, an LLM writes a short summary of its role in the document and prepends it to the chunk before embedding. A direct response to our insurance scenario, where a chunk without context is useless.
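The pooling step of late chunking can be sketched like this, assuming mean pooling and assuming the token embeddings have already been produced by one pass over the whole document. The vectors and chunk boundaries below are invented, and the function names are ours.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool contextual token embeddings into one vector per chunk.

    token_embeddings: one vector per token, computed over the full document,
    so each already reflects its surroundings.
    spans: (start, end) token index pairs, end exclusive, one per chunk.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Hypothetical output of a long-context encoder: 4 tokens, 2 dimensions.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
chunk_vectors = late_chunk(token_embeddings, spans=[(0, 2), (2, 4)])
```

The difference from ordinary chunking is invisible in this function and lives entirely in its input: the same pooling over token embeddings computed per chunk in isolation would give the standard pipeline.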
Late interaction (ColBERT and its descendants) keeps token-level embeddings and compares them via MaxSim: for each query token, the maximum similarity against any token of the document is taken, and the maxima are summed. It's a compromise between a bi-encoder (everything compressed to one vector, cheap, dumb) and a cross-encoder (sees all interactions, accurate, expensive). Late interaction recovers much of the token-level matching a cross-encoder does, but the index can still be precomputed. The price is a much larger index: instead of one vector per chunk, one vector per token is stored, so dozens to hundreds per chunk.
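MaxSim itself is short enough to write out. This is a minimal sketch with invented two-dimensional token vectors, using the dot product as the similarity and assuming the vectors are already normalized.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token take its best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Invented unit-length token vectors.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]   # has a good match for both query tokens
doc_b = [[0.0, 1.0], [0.0, 1.0]]   # matches only the second query token

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
```

Document A scores 1.8 and document B scores 1.0. Each query token gets to find its own best partner in the document, which is what a single pooled vector cannot offer, and the document token vectors can still be stored ahead of time.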
Agentic retrieval. The standard pipeline does one retrieval call up front. Agentic systems are built so that the model itself decides whether a search is needed, what specifically to look for, and whether to reformulate the query if the first attempt produced nothing. This isn't "another component." It's a shift in who makes retrieval decisions. Instead of a fixed pipeline, you get a loop in which retrieval is one tool the model has alongside generation.
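The loop can be sketched with three hypothetical callables standing in for the decisions the model would make: `search`, `is_sufficient`, and `reformulate`. In a real agentic system these would be LLM or tool calls; here they are toy functions over an invented one-entry knowledge base, so only the control flow is meaningful.

```python
def retrieve_with_retries(question, search, is_sufficient, reformulate,
                          max_rounds=3):
    """Search, judge the evidence, and rewrite the query if it falls short.

    Returns (last query searched, evidence from that search, rounds used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    query = question
    for round_no in range(1, max_rounds + 1):
        evidence = search(query)
        if is_sufficient(question, evidence):
            return query, evidence, round_no
        if round_no < max_rounds:
            query = reformulate(question, query, evidence)
    return query, evidence, max_rounds

# Toy stand-ins. The knowledge-base entry is made up for illustration.
KB = {"pre-authorization mri": ["MRI scans require pre-authorization."]}

def search(query):
    return KB.get(query, [])

def is_sufficient(question, evidence):
    return len(evidence) > 0

def reformulate(question, last_query, evidence):
    return "pre-authorization mri"

final_query, evidence, rounds = retrieve_with_retries(
    "do I need approval before an MRI?", search, is_sufficient, reformulate
)
```

The first search finds nothing, the query is rewritten, and the second search succeeds. The fixed pipeline would have stopped after round one with an empty context.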
Long context as an alternative to retrieval. When context windows reach 1–2M tokens, the temptation appears to "just put everything in the prompt." For a million chunks that's still unrealistic, but the boundary has moved. On many mid-sized tasks, retrieval is no longer required for the system to work at all; it's needed for speed, cost, and focus. Retrieval's role shifts from "find what to show the model at all" to "choose what to show first, so the model doesn't spend attention on filler."
Not all of these directions will take root equally. Late interaction lives in niches because of storage cost; agentic retrieval needs more capable models and careful orchestration; long-context economics don't add up everywhere. But the overall trajectory is visible: each approach takes one of the limits of the standard 2023 pipeline and tries to step around it structurally.
Embeddings don't disappear in this world. They become one layer in a more layered system, where the boundaries between retrieval and reasoning blur.
What it means to understand embeddings
To understand embeddings isn't to memorize that "embeddings are vectors." It's to start seeing consequences.
Smaller dimension — harder compression, more collisions, cheaper memory. Smaller chunks — sharper signal, more fragmented context. Longer window — more material, higher attention cost, more lost-in-the-middle. Dense only — better paraphrases, worse identifiers. Sparse only — the reverse. No reranker — the bi-encoder runs into its ceiling. Without understanding contrastive training, it won't be clear why a model confuses safe with not safe.
After this, questions about RAG stop being a set of recipes. They become questions about how information moves through the system. Where do we compress? What do we lose in compression? Where do we search? What do we count as closeness? Where does raw text come back into the prompt? How much evidence does the model actually see? What happens if the right chunk was retrieved but ended up below top-k? If retrieved text is on topic but doesn't contain the answer?
That's the level at which real work with embeddings begins. Not at the vector database. Not at the API call. Not at the choice of a fashionable embedding model. But at the understanding that we've built an artificial geometry of meaning, and the whole system depends on the distortions that geometry introduces.
The useful lie
Embeddings are a useful lie. They pretend that meaning can be placed at a point in space.
In a strict sense, it isn't true. Meaning isn't a vector. The same text means different things in different contexts. The importance of a detail depends on the task. Similarity doesn't exist on its own; it's always relative to a question.
But as an engineering approximation, this lie has turned out to be remarkably useful. It lets us search not only by words, but by neighborhoods of meaning. It lets us build semantic search, recommendation systems, clustering, deduplication, RAG, memory systems, routing, anomaly detection. It turns messy language into an object we can do math with.
The main thing is not to forget that it's an approximation.
The embedding space doesn't know what you need. It knows the geometry it was taught. Retrieval doesn't know what counts as evidence. It knows what's close. The reranker doesn't know what's true. It knows that the interaction between query and document hints at relevance. The LLM doesn't know whether the retrieved context is sufficient. It receives tokens and continues generation.
A good AI system appears not when we trust embeddings. It appears when we understand where embeddings usefully lie, where they start to lie dangerously, and which engineering layers are needed around them.
And then the whole subject suddenly becomes less magical. Not simple. But understandable.