Chapter 3 of 3
Chunking Strategies
Created Apr 28, 2026 Updated May 26, 2026
Most introductions to RAG treat chunking as plumbing. Pick a chunk_size, set an overlap, use RecursiveCharacterTextSplitter, move on. The interesting decisions look like they sit elsewhere — which embedding model, whether to add a reranker, how to evaluate.
What pulls attention back to chunking is a class of retrieval failure that doesn't fit anywhere else. The answer is in the corpus. The embedding model is recent. The reranker is on. But the right passage isn't in the top-K, and the failure is visible at the boundary: the condition for the claim ended up in a different chunk than the claim itself. The answer built on the chunk that did get retrieved was faithful, but faithful to incomplete evidence. Nothing downstream had a chance.
Looked at carefully, "chunking" isn't a preprocessing step. It's the choice that shapes what every downstream component can ever see. A chunk is what the embedding model gets, what the vector index stores, what BM25 scores, what the reranker reads, what the LLM treats as evidence, and what the user clicks as a citation. Six different systems, one stored object. They don't agree on what the object should look like.
The 2024–2026 toolbox is also larger than chunk_size and overlap make it sound. There's late chunking — one forward pass over a long document, then chunk embeddings derived by pooling token ranges. There's contextual retrieval — a small LLM call per chunk that prepends a one-sentence document-context summary before embedding. There's contextualized chunk embeddings — the same idea baked into a model architecture. There's BGE-M3, which returns dense, sparse, and multi-vector signals from one forward pass. There's ColBERT-style late interaction, which changes the storage model entirely. None of these is a tweak to chunk size; each is a different shape of stored object.
What follows is a walk through what those decisions actually decide.
The contract a chunk has to satisfy
The first thing to see clearly: a chunk is one stored object, but it gets consumed by several different systems, and they don't all want the same thing.
The embedding model reads the chunk through a fixed context window and truncates silently if it overruns. The vector index stores it as one node in an HNSW graph plus dim × bytes of vector data. BM25 tokenizes it into a sparse term-frequency vector and wants exact identifiers — error codes, function names, dates — present in the chunk text. A cross-encoder reranker reads (query, chunk) together and wants enough context to judge relevance, but not so much that the relevant span gets buried. The LLM stuffs the chunk into its generation prompt and wants surrounding context that lets it answer safely. The user sees a citation and wants something that looks like a coherent unit of source material.
These pull in different directions. The embedding model wants semantic coherence — one topic per chunk. The reranker wants the same. But BM25 wants exact rare-term presence, which can argue for slightly larger chunks so that context-words around an identifier appear. The LLM wants enough surrounding context to interpret the chunk safely, which argues for even bigger units. The citation wants something the user can read in one glance, which argues for smaller units.
A single chunk size cannot be optimal across all of these. One pattern that resolves the tension architecturally is parent-child retrieval: index small chunks (good for embedding precision, reranking, BM25 exact matching), serve large parents (good for generation and citation). That's an architectural choice, not a chunk_size parameter, and it'll come back later in detail.
Most chunking failures in production turn out to be cases where one of these requirements got quietly satisfied at the expense of another. Shrinking chunks improves dense retrieval, but the BM25 hit rate drops, usually because generated summaries replaced original identifier tokens. Switching to semantic chunking pushes recall@10 up but answer faithfulness down, usually because semantic boundaries cut just before the condition that qualified the claim.
Chunking decisions look like one-parameter tuning. They're multi-objective trade-offs.
Long context didn't kill chunking
When long-context embedding models started shipping it was tempting to expect chunking to dissolve as a problem. Older encoder models like E5-base and BGE-large still cap at 512 tokens, but most mainstream embedding APIs now sit in the 8K–32K range, and Cohere Embed v4 advertises 128K. On the LLM generation side, 200K+ token windows are routine in 2026. If everything fits, why chunk?
The argument doesn't quite go through, and seeing why is worth doing carefully.
Retrieval is selection, not fitting. The job of retrieval is to surface a small relevant subset of a corpus. A 100M-token codebase, a 10M-passage support KB, a 1M-clause contract corpus — none of those fit anywhere. Something searchable still has to be embedded, and that something is the chunk.
Generation cost scales with prompt length. Even when 128K fits in the LLM, you pay for the tokens. Per-call cost is roughly linear in input tokens for everything that isn't served from a prompt cache. Passing 50K tokens of "probably-related" context for every query is the kind of decision that feels free in development and shows up as a six-figure monthly invoice in production.
Long context degrades attention. Even within a model's stated window, retrieval accuracy from the LLM drops as the relevant span moves toward the middle of long context. The "lost in the middle" effect is less dramatic on 2024+ models than it was on GPT-3.5, but it's still measurable. Tight relevant chunks beat sprawling document dumps for answer quality, not just for cost.
So chunking-as-fitting is largely over. Chunking-as-evidence-selection turns out to be the part that was doing the work all along, just hidden behind the simpler framing.
The two failure modes
Read enough retrieval failures and two patterns show up over and over. Almost everything goes one way or the other.
Blur. A chunk covers too many topics, and its embedding compresses all of them into a single vector. The vector has decent cosine similarity to many queries but sharp similarity to none. Retrieval misses because the specific signal has been diluted.
A clean example: a 4000-token "Account Management" section covering password resets, account deletion, contractor access, SSO troubleshooting, VPN setup, and temporary credentials. As one chunk, its dense vector lives in some vague "account stuff" region. A query about VPN setup for contractors pulls adjacent chunks first; this one ranks somewhere on page 3.
Amputation. A chunk is small and sharp, but cut off from the context that made it true.
The classic example:
Vision correction is covered.
Coverage starts after 12 months of continuous enrollment.
If the chunk boundary falls between these two sentences — even with overlap — and the query is "Is vision correction covered?", retrieval returns sentence 1. The LLM answers "Yes." The answer is faithful to the retrieved evidence. The system failed before generation. This is the failure mode that doesn't look like a chunking bug, because the LLM did exactly what it was asked to do.
Almost every modern chunking technique — overlap, parent-child, late chunking, contextual retrieval, contextualized embeddings — is an attempt at the same problem: how to avoid amputation without forcing blur. They pick different points in the (storage cost, embedding compute, indexing complexity) space.
What RecursiveCharacterTextSplitter actually does
Worth knowing the baseline algorithm exactly, because most pipelines start there:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.create_documents([text])
Inside, the algorithm is a recursive split followed by a merge:
- Take the input text. If len(text) <= chunk_size, return it as one chunk.
- Try the first separator ("\n\n", paragraph break). Split the text on it.
- For each piece, if len(piece) <= chunk_size, accept it.
- If a piece is still too large, recurse with the next separator down ("\n", then " ", then "", which splits character by character).
- Merge accepted pieces back into chunks of approximately chunk_size, then apply overlap by copying the trailing chunk_overlap characters of each chunk to the start of the next.
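The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is left out, and merging only happens among pieces produced at the same separator level.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Simplified recursive character splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort: no separators left, or the "" separator -> hard character cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then recurse with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


SEPARATORS = ["\n\n", "\n", " ", ""]
sample = "aaaa bbbb\n\ncccc dddd eeee\n\nff"
result = recursive_split(sample, 10, SEPARATORS)
```

On the sample, the middle paragraph is too long for one chunk, so it falls through to the space separator. A string with no separators at all falls through to the character cut, which is the failure discussed next.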
Deterministic, O(n) per document, easy to debug. With the algorithm visible, three properties become important.
The character-fallback failure. If the text has no \n\n and no \n boundaries within chunk_size, the splitter eventually lands on the character separator. This is fine on continuous English prose. It is destructive on a 2000-token JSON document, a code file with rare line breaks, or a minified HTML extraction. The splitter doesn't warn. A common production accident: a parsing change strips newlines from an upstream extractor; the chunker silently starts producing 800-character windows over JSON syntax; recall collapses; the cause isn't obvious without reading actual chunks.
Overlap is purely textual. It copies the last 150 characters of chunk N to the start of chunk N+1. Sentence and token boundaries are ignored. Overlap can begin mid-word. Every overlapping span is re-embedded, so on per-token pricing the overlap is a real, linear cost multiplier.
Character length is a leaky proxy for token length. A chunk_size=800 setting produces chunks that tokenize to roughly 200 tokens on English prose (~4 chars/token), 250–350 tokens on Python code (~3 chars/token), and 400–500 tokens on Russian or Chinese, depending on the tokenizer. If the embedding model's hard limit is 512 tokens and the corpus is multilingual, chunks that look "safe" in characters silently truncate. Most embedding APIs and most sentence-transformers invocations cut at the model's max length without raising.
The token-aware version:
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=300,  # tokens, not characters
    chunk_overlap=50,
)
The split-time tokenizer should match the embed-time tokenizer, or at least approximate it. Mismatches are a quiet source of off-by-one truncation. In any corpus with code, JSON, multilingual text, or unusual layout, token-length drift exists until measured. Tokens are what the model actually sees.
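One way to measure that drift is to count tokens per chunk before indexing and flag anything over the embedder's limit. The sketch below assumes a `count_tokens` callable supplied by the caller, ideally wrapping the embedding model's own tokenizer. The whitespace counter here is only a stand-in so the example runs without dependencies, and the name `audit_token_lengths` is hypothetical.

```python
from typing import Callable, List, Tuple


def audit_token_lengths(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk over max_tokens."""
    over_limit = []
    for index, chunk in enumerate(chunks):
        n_tokens = count_tokens(chunk)
        if n_tokens > max_tokens:
            over_limit.append((index, n_tokens))
    return over_limit


def whitespace_count(text: str) -> int:
    # Stand-in tokenizer for the example only; real tokenizers produce
    # more tokens than this on code, JSON, and non-English text.
    return len(text.split())


chunks = ["a b c", "a b c d e f", "x"]
flagged = audit_token_lengths(chunks, whitespace_count, max_tokens=4)
```

An empty result means no chunk would be truncated under the tokenizer that was passed in. It says nothing about a different tokenizer.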
Overlap economics
Overlap is the cheapest defense against amputation. It's also a multiplier on every downstream cost.
Total indexed tokens, ignoring final-chunk effects, work out to approximately the following when chunk_size includes the overlap (as it does in the splitter above), so that each chunk advances by chunk_size − chunk_overlap:
indexed_tokens ≈ source_tokens × chunk_size / (chunk_size − chunk_overlap)
= source_tokens / (1 − overlap_ratio)
For small ratios this is close to source_tokens × (1 + overlap_ratio). A 15% overlap (a common default) costs roughly 18% more embeddings, vector storage, HNSW graph nodes, and BM25 index size. A 30% overlap means roughly 43% more indexed tokens, and a much higher chance that near-duplicate neighbors occupy the same top-K.
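The multiplier is easy to check numerically. This sketch assumes chunk_size is the full window including the overlap (the convention of the splitter shown earlier), so consecutive chunks advance by chunk_size − chunk_overlap. Under that assumption the multiplier sits slightly above 1 + overlap_ratio. Both function names are hypothetical.

```python
import math


def count_chunks(source_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of sliding windows needed to cover source_tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if source_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # how far each new chunk advances
    return 1 + math.ceil((source_tokens - chunk_size) / stride)


def overlap_multiplier(chunk_size: int, chunk_overlap: int) -> float:
    """Indexed tokens per source token, ignoring final-chunk effects."""
    return chunk_size / (chunk_size - chunk_overlap)


m15 = overlap_multiplier(100, 15)
m30 = overlap_multiplier(100, 30)
n_chunks = count_chunks(1000, 100, 20)
```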
The harder cost shows up at retrieval. With overlap, adjacent chunks become near-duplicates in vector space. A query that matches the relevant span often pulls 2–3 overlapping chunks into the top-10. Without deduplication, the reranker scores them separately (cross-encoder cost is linear in candidate count) and the LLM gets the same content three times. With deduplication, the storage cost was paid but most of the retrieval contribution was discarded.
How different corpus shapes tend to respond:
- Short, well-bounded units (FAQ entries, support tickets, individual log messages): little or no overlap. Overlap mostly hurts when chunks are already self-contained.
- Continuous prose (articles, documentation): 10–15% of chunk size. The default exists for a reason.
- Dense reference material (contracts, policies, scientific papers): 15–25%. Boundary conditions matter more.
- Ordered procedures (troubleshooting steps, recipes): overlap aligned to step boundaries works better than fixed-length overlap, or step-aware splitting that keeps each step intact.
- Code: overlap is usually the wrong tool. Function- or block-level chunking solves the same problem more cleanly.
The signal that overlap has stopped being the right answer is needing to push it past 25% to fix retrieval gaps. At that point the problem isn't boundaries; it's that chunks need information they don't have, and the techniques further down solve that more directly.
Structured content: when chunks carry the layout that produced them
The hardest chunking work is on text that wasn't really text in the first place. Tables, lists, code, headings, slides, and PDFs all carry meaning in their structure. Treat them as prose and the structure evaporates from the embedding.
Tables
A table row without its header is data without keys. Splitting
Plan | Waiting period | Includes vision | Includes dental
Premium | 30 days | Yes | No
Basic | 90 days | No | No
so that Premium | 30 days | Yes | No lands as a standalone chunk produces evidence that retrieves on the query "what is the waiting period?" but doesn't actually answer it — the chunk doesn't say "waiting period." A query for "premium plan waiting period" might still match because of "Premium" and "30 days." A query for "policy with 30-day waiting period" usually won't.
A table-aware handling typically does:
- Detect tables before recursive splitting touches them. Markdown tables are easy; HTML tables need parsing; PDF tables need a layout-aware extractor (Marker, Docling, Unstructured, pdfplumber).
- Tables fitting in one chunk: keep them whole, embed with content_type=table metadata.
- Oversized tables: split by row groups, repeat the header in every chunk, tag with the table title and source section.
- Dual-indexing where it helps: keep the structured form for citation, add a generated textual summary ("Premium plan: 30-day waiting period, vision included, dental not included") as a parallel chunk for retrieval. The summary embeds better; the table cites better.
The underlying point: the chunk has to carry the metadata the layout was carrying. If the splitter can't see structure, the structure has to be put back through metadata or summarization.
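A minimal sketch of the row-group split with repeated headers, assuming the table has already been extracted into a header line and a list of row lines. The function name `split_table`, the metadata keys other than content_type, and the title "Plans" are illustrative, not taken from any particular parser.

```python
from typing import Dict, List


def split_table(
    header: str, rows: List[str], rows_per_chunk: int, title: str
) -> List[Dict[str, object]]:
    """Split a table into row groups, repeating the header in each chunk."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        chunks.append({
            # The header travels with every group so each row keeps its keys.
            "text": "\n".join([header] + group),
            "metadata": {
                "content_type": "table",
                "table_title": title,
                "row_start": start,
                "row_end": start + len(group) - 1,
            },
        })
    return chunks


header = "Plan | Waiting period | Includes vision | Includes dental"
rows = ["Premium | 30 days | Yes | No", "Basic | 90 days | No | No"]
table_chunks = split_table(header, rows, rows_per_chunk=1, title="Plans")
```

With one row per chunk, the Premium row now carries the words "Waiting period" alongside "30 days", which is what the standalone row was missing.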
Lists
A bullet without its introduction is a fragment.
The following documents are required for visa renewal:
- Passport
- Proof of address
- Proof of income
If Proof of income becomes a standalone chunk, the words are correct but the meaning is incomplete. Proof of income for what? A visa renewal? A mortgage? A school application?
Short lists: keep them together. Long lists — procedure steps, eligibility criteria, compliance items — split into groups while repeating the introductory sentence in each chunk. Same pattern as table headers: the parent sentence is the key under which the bullets are values.
Code
Code is text but it's not prose. A function loses meaning if cut from its signature. A class method loses meaning if cut from class context. An import can change the meaning of the entire file.
Cutting code by character count is almost always wrong. The natural units are functions, classes, methods, top-level configuration blocks, notebook cells, file sections.
For Python, JavaScript, Java, Go, Rust, and a few others, language-aware splitters look for structural boundaries. Full implementations parse an AST (or use tree-sitter); LangChain's built-in variant uses language-specific separators:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=600,
    chunk_overlap=100,
)
The separators get language-specific: \nclass , \ndef , \n\tdef . Still recursive splitting — just with better fallback boundaries.
For serious code-search RAG, AST-aware splitting goes further: each chunk becomes one function or method, with file path, surrounding class signature, and imports attached as metadata. Identifier-level BM25 hits become reliable. Dense retrieval matches "JWT validation" against def validate_bearer_token(...). Hybrid retrieval is required, because dense vectors blur identifiers — only BM25 reliably finds validate_token, TS-999, /api/v1/users, NullPointerException.
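For Python sources, the one-function-per-chunk idea can be sketched with the standard library's `ast` module. This is an illustration under simplifying assumptions: only top-level functions and methods directly inside top-level classes are collected, and the metadata layout and the name `chunk_python_source` are hypothetical.

```python
import ast
from typing import Dict, List, Optional


def chunk_python_source(source: str, file_path: str) -> List[Dict[str, object]]:
    """One chunk per top-level function or class method, with context metadata."""
    tree = ast.parse(source)
    # Top-level imports are attached to every chunk as metadata.
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    chunks: List[Dict[str, object]] = []

    def add(node: ast.AST, class_name: Optional[str]) -> None:
        chunks.append({
            "text": ast.get_source_segment(source, node),
            "metadata": {
                "file_path": file_path,
                "symbol": node.name,
                "class": class_name,
                "imports": imports,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            },
        })

    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if isinstance(node, func_types):
            add(node, None)
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, func_types):
                    add(item, node.name)
    return chunks


demo_source = (
    "import os\n\n"
    "def top(x):\n    return x\n\n"
    "class A:\n    def m(self):\n        return os.getcwd()\n"
)
code_chunks = chunk_python_source(demo_source, "demo.py")
```

Other languages need a different parser (tree-sitter is the usual choice), but the chunk shape stays the same: function body as text, everything else as metadata.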
Headings as metadata
A heading is compressed context.
Security > VPN Setup > Contractor temporary access
Open the settings page and enable temporary access. The change expires
automatically after 24 hours.
The paragraph alone might refer to any kind of access. The heading path disambiguates. Authors don't repeat heading words in every paragraph — they assume the reader sees the structure. The embedding model doesn't, unless it gets put there.
The move that pays back most reliably: extract the heading path during parsing (every parser worth using supports this for markdown, most for HTML), prepend it to the chunk text before embedding, keep it as separate metadata for filtering and display.
chunk_text = f"{heading_path}\n\n{body}"
chunk_metadata = {
    "source": doc_id,
    "heading_path": heading_path,
    "section_id": section_id,
    "page": page_number,
    "content_type": "text",
}
The cost is ~10–30 tokens per chunk. The retrieval gain is usually visible at recall@10 when sections have descriptive headings and bodies that don't repeat them. The reason it works: the heading carries information the author deliberately didn't restate in the body. Adding it back to the embedded text restores that signal.
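For markdown input, extracting the heading path can be as small as a stack of open headings. The sketch below is a simplified illustration: it only recognizes ATX headings (lines starting with #), it does not skip # lines inside fenced code, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def sections_with_heading_paths(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading_path, body) pairs for each non-empty section body."""
    stack: List[str] = []  # stack[i] holds the open heading at level i + 1
    body: List[str] = []
    sections: List[Tuple[str, str]] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(h for h in stack if h), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # the body collected so far belongs to the previous path
            level = len(match.group(1))
            del stack[level - 1:]          # close headings at this level or deeper
            while len(stack) < level - 1:  # pad if a level was skipped
                stack.append("")
            stack.append(match.group(2))
        else:
            body.append(line)
    flush()
    return sections


doc = (
    "# Security\n\n## VPN Setup\n\n### Contractor temporary access\n\n"
    "Open the settings page.\n\n## SSO\n\nUse the portal.\n"
)
sections = sections_with_heading_paths(doc)
```

Each returned pair maps directly onto heading_path and body in the snippet above.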
PDFs and slides
PDF text extraction is an adversarial problem. A PDF is a layout, not a stream of tokens — multi-column documents become interleaved gibberish under naive extraction, tables flatten into row-major streams that lose column headers, footnotes interleave with body text, page numbers and headers appear mid-paragraph.
In 2026, layout-aware parsers are the baseline for serious RAG over PDFs:
- Marker (datalab): converts to markdown with table and figure detection, GPU-accelerated.
- Docling (IBM Research): structured document representation with tables, lists, figures, code blocks.
- Unstructured: broader format coverage (PDF, DOCX, HTML, PPTX, EPUB), with category labels per element.
- Microsoft's Document Intelligence (cloud): high accuracy for forms and tables, API-only.
The output of each is a structured document model, not plain text. Chunking happens over the structure, not over a flattened text dump.
Slides are stranger. A slide might have five words on the canvas, one chart, three arrows, and a 300-word speaker note that contains the actual content. Indexing only the visible text leaves the slide near-empty. Indexing only the notes loses the visual evidence. The unit of evidence that holds together: slide title + canvas text + speaker notes + rendered image, treated as one chunk with multiple content channels.
Multimodal embedding models — Voyage-multimodal-3, Cohere Embed v4 multimodal — allow the rendered slide image itself to become part of the chunk vector. That doesn't remove the chunking decision; it changes what a chunk can contain.
Late chunking — an architecture choice, not a flag
Standard chunking is early: split first, embed each chunk independently. The embedding model only sees the chunk. Pronouns, references to prior sections, and context-dependent terms all have to be resolved from inside the chunk text, which often means they can't be.
Late chunking (Günther et al., Jina AI, 2024) inverts the order. The exact computation:
- Take a document up to the embedding model's context limit (8K–128K depending on the model).
- Run one forward pass through a long-context embedding model. Get token-level hidden states for the whole document.
- Apply chunk boundaries to the token stream after the forward pass.
- For each chunk, mean-pool the token hidden states within its boundary range to produce the chunk embedding.
The chunk embedding now incorporates document-level attention. The token representations were computed with the whole document in context. Pronouns, anaphora, and cross-section references get partially resolved in the hidden states before pooling.
What's worth being careful about: this is not the same as making chunks larger. Larger chunks change the indexed object. Late chunking keeps small, precise indexed objects but enriches their vectors. Retrieval granularity stays high; the representation gets a context boost.
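The pooling step itself is small. In the sketch below, `token_states` stands in for the per-token hidden states from a single forward pass over the whole document. Obtaining them is model-specific and not shown. Boundaries are assumed to be half-open token index ranges, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(
    token_states: Sequence[Sequence[float]],
    boundaries: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool document-level token states over [start, end) token spans."""
    pooled = []
    for start, end in boundaries:
        if not 0 <= start < end <= len(token_states):
            raise ValueError(f"invalid token span ({start}, {end})")
        span = token_states[start:end]
        dim = len(span[0])
        # Average each dimension across the tokens that fall inside the chunk.
        pooled.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return pooled


# Toy 2-dimensional "hidden states" for a 4-token document.
states = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk_pool(states, [(0, 2), (2, 4)])
```

The chunk boundaries only decide which tokens are averaged. The context enrichment comes entirely from how the token states were computed upstream.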
Late chunking is an embedding-model decision, not a chunking-strategy flag. Conditions under which it actually works:
- The model has to have meaningful long-context pretraining. Stretching a 512-token encoder to accept 8K context doesn't produce useful late chunking: the hidden states past the trained context are noise.
- Token-level hidden states have to be accessible, not just the final pooled output. Many embedding APIs don't expose this. Open-weight long-context embedders are easier here.
- One forward pass over a 32K document is much more expensive than 64 forward passes over 512-token chunks (attention is quadratic in sequence length). The trade is per-chunk compute for shared whole-document compute.
Late chunking helps most on documents with heavy local references — research papers ("the above limitation"), legal documents ("the aforementioned party"), narrative content ("the second experiment"). On documents where each chunk is already self-contained — FAQ entries, support tickets, independent product descriptions — it adds cost without changing retrieval quality.
Contextual retrieval — an LLM call per chunk
Anthropic published Contextual Retrieval in September 2024. The algorithm:
- For each chunk, take the whole containing document.
- Make an LLM call: "Here is a chunk from this document. Write a short (~50-token) explanation of where it comes from and what it refers to."
- Prepend the generated context to the chunk text before embedding and before BM25 indexing.
Example. Raw chunk:
The company's revenue grew by 3% over the previous quarter.
Contextualized chunk:
This excerpt is from ACME Corp's Q2 2023 financial report, in the section
discussing quarter-over-quarter revenue growth.
The company's revenue grew by 3% over the previous quarter.
The chunk now has the company name, the period, and the topic visible to both the embedding model and BM25. The retrieval signal becomes much more specific.
Anthropic's reported result on their benchmark: contextual embeddings alone reduced the retrieval failure rate by 35%. Adding contextual BM25 brought the reduction to 49%. Adding a reranker on top of contextual hybrid retrieval brought the total reduction to 67%.
The cost model is the part that's not obvious. Naive contextual retrieval is N_chunks × LLM_call. For a 10M-chunk corpus that's prohibitive. The technique works in practice because of prompt caching: the document content is the same for every chunk in that document, so it gets cached. With cache hits (about 10% of full input cost on Anthropic's pricing), the per-chunk cost drops by roughly 90%. For a document with N chunks and average chunk length C, indexing cost is approximately:
cost ≈ doc_tokens × cache_write_price (once per document)
+ N × C × full_input_price (per-chunk text)
+ (N − 1) × doc_tokens × cached_input_price (each later call re-reads the cached document)
+ N × ~50 tokens × output_price
The bulk is the cached re-read, and prompt caching is what makes it manageable. Without caching or batching, contextual retrieval becomes difficult to justify at large corpus sizes — the indexing economics fall apart.
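A rough calculator makes the effect of caching concrete. It assumes each per-chunk call sends the whole document plus the chunk, that the first call writes the document to the cache, and that every later call reads it back at the cached rate. All prices are per-token parameters supplied by the caller, and the values in the example are placeholders, not any vendor's actual pricing.

```python
from typing import Dict


def contextual_indexing_cost(
    doc_tokens: int,
    n_chunks: int,
    chunk_tokens: int,
    context_tokens: int,
    input_price: float,
    cache_write_price: float,
    cache_read_price: float,
    output_price: float,
) -> Dict[str, float]:
    """Estimated cost of contextualizing one document, with and without caching."""
    cache_write = doc_tokens * cache_write_price
    # Every call after the first re-reads the full document from the cache.
    cache_reads = max(n_chunks - 1, 0) * doc_tokens * cache_read_price
    chunk_inputs = n_chunks * chunk_tokens * input_price
    outputs = n_chunks * context_tokens * output_price
    with_cache = cache_write + cache_reads + chunk_inputs + outputs
    # Without caching, the document is billed at full input price on every call.
    without_cache = n_chunks * (doc_tokens + chunk_tokens) * input_price + outputs
    return {"with_cache": with_cache, "without_cache": without_cache}


# Placeholder unit prices chosen for easy arithmetic.
estimate = contextual_indexing_cost(
    doc_tokens=1000, n_chunks=10, chunk_tokens=100, context_tokens=50,
    input_price=1.0, cache_write_price=1.25, cache_read_price=0.1, output_price=5.0,
)
```

The gap between the two totals grows with document length and chunk count, since the uncached document re-read scales with n_chunks × doc_tokens.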
Things to watch for when running it:
- The generated context is text that becomes part of the indexed object. Hallucinations enter the retrieval corpus.
- Caching strategy has to match indexing strategy. If the pipeline doesn't reuse the document across all its chunks within a single prompt-cache TTL window, full price gets paid.
- Retrieval recall can go up while answer faithfulness drifts, because the prefix introduces claims the chunk itself didn't make.
Contextualized chunk embeddings — the model-native version
Same direction, baked into the model. Voyage's voyage-context-3, introduced in 2025, is one example of this model-native direction: trained with a contextualized chunk embedding objective, it produces chunk-level embeddings that already incorporate document context, without a separate LLM call at indexing time.
Architecturally this is closer to late chunking than to contextual retrieval. The model sees more than the chunk during the embedding forward pass. The difference from late chunking is that the model is trained to produce chunk-level outputs incorporating context, rather than chunk-level outputs being derived post-hoc by token-range pooling over a generic long-context model's hidden states.
The trade-off space across the three context-aware techniques:
| Technique | Per-chunk cost at index | Per-doc cost at index | Model lock-in | Latency at query |
|---|---|---|---|---|
| Standard chunking + embed | 1 embed call | 0 | None | low |
| Contextual retrieval | 1 embed + 1 LLM call (with caching: cheap re-reads) | 1 cache write | LLM choice | low |
| Late chunking | 0 separate calls | 1 long-context embed call | long-context embedder | low |
| Contextualized chunk embeddings | 0 separate calls | 1 model call (returns N chunk vectors) | vendor-specific | low |
All three produce richer per-chunk embeddings. They differ in where the work happens and who pays for the model. Late chunking is open-source-friendly (any long-context embedder + a custom pooling layer). Contextual retrieval is model-agnostic but adds an LLM dependency at indexing time. Contextualized chunk embeddings tie you to a specific vendor's model in exchange for a cleaner integration.
Parent-child retrieval
Some queries want precision in retrieval and context in generation. Parent-child retrieval is the architectural pattern that handles that explicitly.
The shape:
- Children: small chunks (~200–400 tokens). Indexed in the vector DB. Used as first-stage retrieval candidates.
- Parents: larger sections (~1500–3000 tokens) containing one or more children. Not indexed; stored in a key-value store keyed by parent ID.
- Children carry parent IDs as metadata.
At query time:
- Vector-search children.
- For each retrieved child, look up its parent.
- Deduplicate parents (multiple top-K children often share a parent).
- Pass parents to the reranker and then to the LLM.
The dedup step is the easy-to-miss part. With overlapping children, top-K can easily contain 8 children from 3 parents. Without dedup, the LLM gets the same parent text three times; with naive dedup applied after reranking, the highest-ranked child sometimes gets discarded by accident.
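One way to keep dedup from discarding the best evidence is to walk the children in rank order and keep each parent at the position of its highest-ranked child. A minimal sketch, with plain dicts standing in for the vector DB metadata and the key-value store (the function name and argument layout are assumptions):

```python
from typing import Dict, List, Tuple


def parents_from_children(
    ranked_children: List[Tuple[str, float]],
    child_to_parent: Dict[str, str],
    parent_store: Dict[str, str],
    max_parents: int,
) -> List[Tuple[str, str]]:
    """Map best-first children to unique parents, preserving best-child order."""
    seen = set()
    parents: List[Tuple[str, str]] = []
    for child_id, _score in ranked_children:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # a higher-ranked sibling already pulled this parent in
        seen.add(parent_id)
        parents.append((parent_id, parent_store[parent_id]))
        if len(parents) == max_parents:
            break
    return parents


ranked = [("c3", 0.9), ("c1", 0.8), ("c4", 0.7), ("c2", 0.6)]
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2", "c4": "p2"}
parent_store = {"p1": "P1 text", "p2": "P2 text"}
unique_parents = parents_from_children(ranked, child_to_parent, parent_store, max_parents=5)
```

Four children collapse to two parents, and p2 comes first because its child c3 ranked highest.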
Storage cost: only children are vectorized, so vector index size matches a fine-grained-chunks-only strategy. Parents add raw text storage (cheap) and an extra key-value lookup at query time (also cheap). The real cost moves to the generation prompt, which is larger — pay-per-token inference becomes more expensive linearly.
What parent-child trades: it solves the amputation problem at the generation stage rather than the embedding stage. Late chunking and contextual retrieval solve it at the embedding stage. The choice between them is essentially a question of where the context should live — in the indexed object's vector (late chunking), in the indexed object's text (contextual retrieval), or in a separate lookup at query time (parent-child).
Semantic chunking — boundary detection by embedding distance
Recursive splitting picks boundaries from text-surface separators. Semantic chunking picks them from embedding distance between consecutive units (usually sentences).
The algorithm:
- Split text into sentences (any sentence tokenizer; quality matters).
- Embed each sentence individually.
- Compute cosine distance between consecutive sentence embeddings.
- Find boundaries where the distance exceeds a threshold — fixed, or a percentile of the distribution, or a multiple of the median absolute deviation.
- Group consecutive sentences into chunks at those boundaries.
The cost is N_sentences embed calls during indexing, plus a final chunk-embedding pass. Roughly 2–5× the embedding compute of recursive splitting, depending on sentence length.
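The boundary-detection steps can be sketched with a fixed distance threshold. Sentence embeddings are passed in precomputed, so the embedding model is left out, and a percentile or MAD-based threshold could be computed from the same distances. The function names and toy vectors are illustrative.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_a * norm_b)


def semantic_groups(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float,
) -> List[List[str]]:
    """Start a new group wherever consecutive sentences are farther apart than threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            groups.append([sentences[i]])  # topic shift: open a new chunk
        else:
            groups[-1].append(sentences[i])
    return groups


toy_sentences = ["a", "b", "c", "d"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = semantic_groups(toy_sentences, toy_embeddings, threshold=0.5)
```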
Where it pays off: long continuous prose where topic shifts happen but headings don't mark them — articles, narratives, transcripts. Where it doesn't: documents with strong existing structure (headings, lists, tables), where structural splitting is cheaper and more accurate; repetitive content (FAQ entries, product catalogs), where threshold detection misfires on similar consecutive sentences.
Semantic chunking isn't the senior version of recursive splitting. It's a different tool with different failure modes, addressing a specific kind of problem (unmarked topic shifts in continuous prose) that recursive splitting can't reach.
Hybrid retrieval and chunk size
For many serious document RAG systems, hybrid retrieval becomes hard to avoid: dense embeddings plus BM25, SPLADE, BGE-M3 sparse signals, or another lexical layer, and sometimes ColBERT-style late interaction on top. Reciprocal Rank Fusion combines the rankings, typically with k=60 (Cormack et al., 2009) as the default constant.
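Reciprocal Rank Fusion is short enough to write out. Each document scores the sum of 1 / (k + rank) over the rankings it appears in, with ranks starting at 1. The alphabetical tie-break on document ID is an assumption added here to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings into one list of document IDs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Document b wins because it ranks near the top of both lists. Document a is first in one list but third in the other, which costs it slightly more than b's second place.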
Chunk size affects each signal differently:
- Dense embeddings want semantically coherent units. Below ~100 tokens the embedding has too little signal. Above ~500 tokens blur takes over.
- BM25 wants chunks containing the exact terms users search. Small enough to be precise; large enough to include context-words around rare identifiers. The range that typically works best is 200–400 tokens.
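- ColBERT / multi-vector wants chunks short enough that per-token storage cost is bearable. ColBERT stores one 128-dimensional vector per token: about 256 bytes at fp16, which ColBERTv2's residual compression cuts to roughly 20–36 bytes. A 200-token chunk therefore costs ~50KB uncompressed, or ~4–7KB compressed plus index overhead (versus ~4KB for a dense 1024-d f32 vector). Storage cost is usually the binding constraint.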
BGE-M3 makes the contract problem unusually visible: the same input produces dense, sparse, and multi-vector representations from one forward pass, so a single chunk-boundary choice affects three retrieval signals at once. A chunk optimized for dense recall can be worse for sparse exact-match, and vice versa.
The storage shape of multi-vector indexing is what changes the design. There's no longer one vector per chunk — there are many token-level vectors per chunk. Naive ColBERT indexes can reach an order of magnitude more storage than dense-only for the same corpus; ColBERTv2-style residual compression and PLAID-style optimized indexes pull that back down substantially, but it remains specialized infrastructure with its own operational profile, not a drop-in replacement for a dense vector store.
The common production shape in 2026 is still dense + sparse hybrid. Adding multi-vector late interaction is a real precision win on the right corpora — long technical documents, dense reference material — at the cost of running a separate indexing path.
Reranking inverts the chunk-size question
Without a reranker, first-stage retrieval ranking is what the LLM sees. First-stage precision matters a lot, because the right chunks have to be in the top-5 or top-10.
With a cross-encoder reranker (BGE-reranker-v2-m3, Cohere Rerank 3, or similar), first-stage retrieval becomes a recall problem. The top 50–100 candidates go to the cross-encoder, which reads (query, chunk) pairs jointly and scores them — typically at 10–50ms per pair on a GPU. So 50 candidates adds 0.5–2.5 seconds of latency.
This inverts the chunk-size question. Without reranking, larger chunks help first-stage precision — each candidate has to be self-explanatory because the LLM sees it directly. With reranking, smaller chunks help first-stage recall — the cross-encoder can judge relevance from less context, so it's safe to retrieve more candidates and let it filter.
Reranking also changes whether parent-child retrieval is worth it. With reranking, the cross-encoder runs on children (cheap, small inputs), and parents get fetched only for the survivors. Without reranking, parent-child decouples retrieval ranking (based on child text) from what the LLM sees (parent text), and that decoupling sometimes hurts.
Typical latency budget of a hybrid retrieval pipeline: vector search (~10–50ms) → BM25 (~5–20ms) → RRF (~1ms) → rerank top 50–100 (~500ms–2s) → generation (~500ms–5s). The rerank step is usually the latency dominator.
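The RRF step in that pipeline is small enough to show in full. The sketch below assumes each retriever returns a ranked list of chunk IDs and uses the conventional constant k = 60; the function name and the tie-break on ID are illustrative choices, not something the pipeline above prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


dense_hits = ["a", "b", "c"]    # hypothetical vector-search ranking
sparse_hits = ["b", "d", "a"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers, which is part of why it costs about a millisecond.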
HNSW and what chunk count actually costs
The vector index is not a black box. Its parameters interact with chunking choices.
HNSW (Hierarchical Navigable Small World, Malkov & Yashunin, 2018) is the dominant index in production. The parameters that matter:
- M: number of bidirectional connections per node. Typical values: 16, 32, 48. Higher M improves recall but adds memory and slows insertion.
- ef_construction: dynamic candidate list size during construction. Typical: 200–400. Higher gives better-quality graph at higher build cost.
- ef_search: dynamic candidate list at query time. Typical: 50–200. Trades recall for latency.
Memory cost of an HNSW index is approximately M × 8 bytes × N_nodes for the graph structure, plus dim × 4 bytes × N_nodes for the vectors if they are stored as f32. A 1M-chunk index at dim=1024 with M=32:
vectors: 1M × 1024 × 4 bytes ≈ 4.1 GB
graph: 1M × 32 × 8 bytes ≈ 256 MB
total: ≈ 4.35 GB (without payload metadata)
With int8 quantization, vectors drop to 1 GB. With binary, to 128 MB (plus rerank-on-fp32 to recover precision). For 10M chunks at f32, that's 40+ GB just for vectors — a single-machine memory question becomes real.
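The same arithmetic as a small estimator, useful for trying other chunk counts and dimensions. It is a back-of-the-envelope sketch that uses only the approximation given above (M × 8 bytes of graph per node, plus the vector bytes); real engines add payload, ID maps, and allocator overhead, and the function and parameter names are made up for illustration.

```python
def hnsw_memory_bytes(
    n_chunks: int,
    dim: int,
    m: int,
    bytes_per_component: float = 4.0,  # 4.0 = f32, 1.0 = int8, 0.125 = binary (1 bit)
    bytes_per_link: int = 8,           # approximation used in the text: M x 8 bytes per node
) -> dict[str, float]:
    """Rough index footprint: vector storage plus HNSW graph links."""
    vectors = n_chunks * dim * bytes_per_component
    graph = n_chunks * m * bytes_per_link
    return {"vectors": vectors, "graph": graph, "total": vectors + graph}


for label, width in [("f32", 4.0), ("int8", 1.0), ("binary", 0.125)]:
    est = hnsw_memory_bytes(1_000_000, dim=1024, m=32, bytes_per_component=width)
    print(f"{label:>6}: vectors {est['vectors'] / 1e9:.3f} GB, total {est['total'] / 1e9:.3f} GB")
```

The graph term does not shrink under quantization, so at binary precision the links become the larger share of the footprint.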
Chunk count drives all of this. Halving chunk size roughly doubles chunk count, which doubles vector memory and graph size, at least doubles graph construction time, and raises per-query latency. Doubling chunk size reverses all of that but invites blur.
The lever is quantization. Modern production stacks routinely run int8 (4× compression) with negligible recall loss, or binary (32× compression) with a rerank-on-uncompressed step. Quantization changes the chunking economics — if storage is 32× cheaper, 32× more chunks become viable, which means smaller chunks are economically reachable in a way they weren't before.
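A toy version of the binary path, to make "rerank on uncompressed" concrete. It assumes sign-based binarization (one bit per dimension), a brute-force Hamming scan in place of a real index, and dot-product rescoring on the original floats; production engines implement each piece differently, and every name here is illustrative.

```python
def binarize(vec: list[float]) -> int:
    """Pack one sign bit per dimension into an int (1 if the component is > 0)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_binary_then_rescore(
    query: list[float],
    vectors: list[list[float]],   # full-precision vectors, kept for the rescoring step
    codes: list[int],             # binarize(v) for each vector, the compressed index
    shortlist: int = 10,
    k: int = 3,
) -> list[int]:
    """Coarse pass on 1-bit codes, then exact rescoring of the shortlist only."""
    q_code = binarize(query)
    coarse = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    rescored = sorted(coarse, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:k]
```

Only the shortlist touches full-precision vectors, so those can live on disk or in slower memory while the 1-bit codes stay in RAM.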
Reindexing economics
The painful part of changing chunking strategy is that everything gets reindexed.
Under a new strategy, every chunk is re-tokenized, re-embedded (API or GPU), and re-inserted into the vector index; the BM25 index is rebuilt and the metadata stores are updated. For a 10M-chunk corpus that's 10M embeddings to recompute, however the calls are batched.
At current OpenAI pricing, text-embedding-3-small is cheap enough that pure embedding cost is rarely the bottleneck on a one-time reindex — rate limits, pipeline runtime, index rebuilds, validation, and rollout safety usually matter more. Re-embedding billions of tokens takes hours to days depending on rate limits, and the index build that follows often dominates wall-clock time.
The patterns that show up on live systems:
- Shadow indexing. Build the new index alongside the old one. Compare retrieval quality on a held-out eval set. Switch only after the new index demonstrably wins.
- Dual writes. During cutover, write to both indexes. Query the old one until the new one is verified, then flip the read path.
- A/B at the retrieval layer. Route some fraction of queries to the new index, compare downstream metrics (answer quality, citation accuracy, user feedback) before full rollout.
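For the A/B pattern, one common way to split traffic is a deterministic hash of a stable key, so the same key always lands on the same index and results stay comparable across sessions. The sketch below is an assumption-laden illustration: the key (a query or user ID), the salt string, and the function name are all hypothetical.

```python
import hashlib


def route_to_new_index(key: str, fraction: float, salt: str = "chunking-v2") -> bool:
    """Return True if this key falls into the treatment bucket of the given size."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # First 4 bytes -> uniform value in [0, 1); exactly representable as a float.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < fraction


# Send roughly 10% of traffic to the index built with the new chunking strategy.
index_name = "new" if route_to_new_index("user-1234", fraction=0.10) else "old"
```

Changing the salt reshuffles the buckets, which helps when a second experiment should not reuse the first one's treatment group.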
The practical implication: chunking changes are expensive, so they're worth exploring offline before committing. A representative small corpus (10K–100K chunks) is enough to compare recursive vs semantic vs late-chunked vs contextual on an eval set — measuring recall@k, MRR, and downstream answer faithfulness — before paying for a full reindex.
A common trap: varying chunk size and embedding model in the same experiment. The numbers move; the cause isn't separable.
Evaluation, separated by stage
Chunking impacts every stage downstream, so chunking evaluation has to separate those stages. Three independent measurement layers:
1. Retrieval recall. Given a query and a known relevant chunk (or set), did first-stage retrieval surface it in the top-K?
- Metrics: recall@k (k = 5, 10, 50), MRR, nDCG.
- Needs: ground truth (query, relevant-chunk) pairs. The bottleneck for most teams.
2. Reranking quality. Given retrieved candidates, did the reranker order them well?
- Metrics: same, evaluated on reranker output.
- Useful for isolating reranker contributions from first-stage retrieval.
3. Answer faithfulness and grounding. Given retrieved chunks and the generated answer, is the answer supported by the chunks?
- Metrics: faithfulness scores (Ragas, autogen-style LLM judges), citation accuracy, hallucination rate.
- Needs: a judge LLM and a small corpus of human-evaluated answers to calibrate the judge.
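The retrieval-side metrics in layers 1 and 2 are simple enough to compute without a framework. The sketch below assumes binary relevance (a chunk is either in the ground-truth set or not) and a ranked list with no duplicate IDs; the function names are illustrative.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0
```

Running the same functions on first-stage output and on reranker output gives the layer 1 versus layer 2 comparison directly, as long as both are scored against the same ground-truth pairs.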
A chunking change can improve retrieval recall and degrade answer faithfulness at the same time. Smaller chunks can improve recall (more focused, easier to embed precisely) while amputating conditions the LLM needs to answer correctly. When that happens, a pure recall@k metric signals that chunking got better when end-to-end it got worse.
A small corpus-specific eval set — 200–500 (query, expected-answer) pairs on real queries against the real corpus — is enough to make meaningful comparisons. Ragas-style automated eval works as a screening tool, not as a replacement for human-labeled ground truth on a corpus that actually matters.
Failure mode fingerprints
When retrieval fails in production, the most useful thing to do is also the most boring: open the retrieved chunks, open the source document, find the evidence that should have been retrieved, look at chunk boundaries.
Each failure mode has a fingerprint:
- The right chunk wasn't in top-K, but its neighbor was. Boundary issue. Either increase overlap, switch to parent-child, or use late chunking.
- The retrieved chunk contained part of the answer but missed a qualifying condition. Amputation. Parent-child or contextual retrieval.
- The retrieved chunk was correct but the LLM's answer was wrong. Look at generation, not chunking — unless the chunk was so small it forced the LLM to fill gaps.
- The retrieved chunk was related but on the wrong topic. Blur. Reduce chunk size or add heading paths.
- Dense retrieval missed an exact identifier (error code, function name, ID). BM25 probably not in the stack, or chunks lost the exact token. Hybrid retrieval; check identifier survival through preprocessing.
- The retrieved chunk was a table row with no header. Layout failure. Re-run document parsing with a table-aware extractor.
- Multiple near-duplicate chunks in top-K, no useful diversity. Overlap or chunking redundancy. Add dedup, reduce overlap, or use MMR (maximum marginal relevance) post-processing.
Without this kind of inspection, chunking tuning becomes guesswork. With it, every fix becomes targeted at a specific decision in the pipeline.
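For the near-duplicate fingerprint, MMR post-processing can be sketched as a greedy loop: at each step, pick the candidate that is most relevant to the query after subtracting a penalty for similarity to what has already been picked. This minimal version assumes cosine similarity over plain Python lists and a trade-off weight `lambda_` (1.0 means pure relevance); names and defaults are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def mmr_select(
    query_vec: list[float],
    cand_vecs: list[list[float]],
    k: int,
    lambda_: float = 0.7,
) -> list[int]:
    """Greedy maximum marginal relevance; returns indices into cand_vecs."""
    remaining = list(range(len(cand_vecs)))
    selected: list[int] = []

    def mmr_score(i: int) -> float:
        relevance = cosine(query_vec, cand_vecs[i])
        # Penalty: similarity to the closest already-selected candidate.
        redundancy = max((cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
        return lambda_ * relevance - (1.0 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

How much diversity this buys depends heavily on `lambda_`: at high values near-duplicates still win on relevance, so the weight is worth tuning against the same eval set used for recall.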
Rough decision sheet by corpus shape
A shorthand for which kinds of corpora respond to which strategies. Not a prescription — just the patterns that show up repeatedly when reading failure cases:
| Corpus type | Working strategy | Modifiers |
|---|---|---|
| FAQ entries / support tickets | One entry = one chunk | No overlap. Heading metadata. |
| Long-form articles / blogs | Recursive (token-aware) | 10–15% overlap. Heading paths. Late chunking if context budget allows. |
| Technical documentation | Markdown-aware split by H2 | Heading paths. Hybrid retrieval. |
| Code repositories | AST-aware split by function/class | Hybrid retrieval. Identifier metadata. |
| Legal contracts / policies | Clause-level | Parent-child for full-section context. Heading paths. |
| Financial filings / earnings | Section-aware + table-aware | Dual-index tables (raw + summary). |
| Scientific papers | Section-aware | Late chunking for cross-reference resolution. |
| Customer support KB | Recursive + contextual retrieval | Helps cross-document references. |
| Slide decks | Slide as chunk (title + canvas + notes) | Multimodal embedding if budget allows. |
| Multimodal docs (PDF with charts) | Layout-aware parser (Marker/Docling) → chunk per element | Multimodal embeddings. Image + caption as a unit. |
| Code + documentation mixed | Separate strategies per content type | Track content_type in metadata. |
This is the starting cut. The real decision happens against the specific corpus, the specific queries, and the specific answer-quality eval, and corpora surprise often enough that the table should not be read as a conclusion.
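As one concrete instance of a table row, here is a sketch of the "Markdown-aware split by H2" strategy with heading paths attached. It assumes ATX-style headings (`#`, `##`), treats deeper headings as body text, and skips heading detection inside fenced code blocks; the output dict shape is a hypothetical choice, not a standard.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example readable


def split_markdown_by_h2(markdown: str) -> list[dict]:
    """Split on H1/H2 headings; each chunk carries its heading path as metadata."""
    chunks: list[dict] = []
    h1, h2 = "", ""
    buf: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            path = " > ".join(h for h in (h1, h2) if h)
            chunks.append({"heading_path": path, "text": text})
        buf.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # a '#' inside a code block is a comment, not a heading
        match = None if in_fence else re.match(r"^(#{1,2})\s+(.*)$", line)
        if match:
            flush()
            if len(match.group(1)) == 1:
                h1, h2 = match.group(2).strip(), ""
            else:
                h2 = match.group(2).strip()
        else:
            buf.append(line)
    flush()
    return chunks
```

The heading path can be prepended to the chunk text before embedding or kept as metadata only; which works better is the kind of question the eval set above is for.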
The mental model
A RAG system doesn't retrieve from your documents. It retrieves from your representation of your documents. Chunking is the act of creating that representation, and it happens before embedding, before vector indexing, before BM25, before reranking, before generation, before citation. Every stage downstream inherits whatever chunking produced.
The 2024–2026 toolbox has more entries than chunk_size and overlap. Late chunking gives chunk-level embeddings document-level context through one long-context forward pass. Contextual retrieval prepends LLM-generated context, economically viable with prompt caching. Contextualized chunk embeddings bake the same idea into a model architecture. Parent-child retrieval separates the indexed unit from the generation unit. BGE-M3 multi-functionality produces three signal types from one forward pass. ColBERT late interaction changes the storage model entirely — many token-level vectors per chunk — and pays back in precision on the right corpora.
Recent chunking research is converging on the same shape: chunking isn't one knob, it's a design space across segmentation strategy and embedding paradigm, and the best choice depends on the task.
The chunk is the unit of evidence the retrieval system gets to see. Pick it badly and none of the cleverer machinery downstream — better embeddings, better rerankers, better generators — recovers what got thrown away. Much of what later looks like a retrieval, reranking, or generation problem was already shaped at that one step.