Why Your RAG Pipeline Returns Confident Garbage: Chunk Sizing, Reranking, and the Eval Harness You Need Before Shipping

May 26, 2026

#RAG production failures

Sandor Farkas

Founder & Lead Developer

Expert in software development and legacy code optimization

Retrieval-Augmented Generation has a dirty secret: the demo always works. You paste a few documents into a notebook, fire a query at the vector store, and the LLM answers fluently and correctly. You show the stakeholder. Everyone is impressed. You ship.

Three weeks later, production RAG is hallucinating policy numbers, confidently misquoting the wrong version of a document, and sending your support team a queue full of confused follow-up tickets.

This is not a model problem. The model is doing exactly what you asked: synthesising an answer from the retrieved chunks. The problem is that the wrong chunks came back — or the right chunks came back sliced in ways that destroyed the context the model needed. RAG production failures are almost always retrieval failures dressed up as generation failures.

In this post I want to go through the three layers that break most RAG systems in the wild — chunking, reranking, and evaluation — using failure modes I see repeatedly during code audits and AI integration projects. By the end you should have a concrete eval harness you can run offline before every significant change to your pipeline.

The Chunking Problem Nobody Talks About Honestly

When developers first build a RAG pipeline they usually copy the chunk size from whatever tutorial they found. 512 tokens is common. Sometimes 256. Occasionally 1,000. There is rarely a principled reason behind the choice, and that matters more than most people expect.

Chunk size is not a hyperparameter you tune once and forget. It is a function of three things that interact: the structure of your source documents, the context window of your embedding model, and the nature of the queries users actually ask.

Short chunks (128–256 tokens) preserve lexical precision — a specific clause in a contract, a single step in a procedure. They also destroy semantic context. If a user asks "what is the refund window for enterprise customers?" and the answer spans a paragraph that says "Enterprise accounts are eligible for refunds within 90 days, provided the request is submitted through the billing portal and approved by account management" — a 128-token chunk might give the LLM the 90-day window without the conditions, or the conditions without the window. The model will synthesise a confident but incomplete answer.

Long chunks (1,024–2,048 tokens) preserve context but dilute the embedding. A 1,500-token chunk about your pricing page, your refund policy, and your enterprise SLA rolled into one will produce an embedding that sits in a vague region of vector space. It will be retrieved for questions it is not the best answer to, and missed for questions it is the best answer to.

The honest answer is that there is no universally correct chunk size — but there are principled ways to choose one. Start with the typical length of the meaningful unit in your source corpus. Legal documents have clauses. Support documentation has procedures. Product changelogs have releases. Identify the semantic unit and set your target chunk size to contain roughly one of them, with overlap of 10–20% of the chunk size to prevent context severing at boundaries.

Overlap is not optional for sequential documents. If you have a ten-step procedure and a chunk boundary falls between steps 3 and 4, or in the middle of step 4, a query about step 4 might return a chunk that starts mid-sentence. The embedding of that fragment represents the step poorly, and the model receives an instruction with its opening missing. An overlap of 50–100 tokens on a 512-token chunk is cheap and prevents most of these failures.
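A minimal sketch of overlapping windows, assuming the document is already tokenised into a list. The function name `chunk_with_overlap` and the 64-token overlap are illustrative choices, and the generated `t0`, `t1`, ... tokens stand in for the output of whatever tokenizer your embedding model uses.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Hypothetical 1,200-token document; real tokens would come from a tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=64)

# The tail of each chunk is repeated at the head of the next one.
shared = chunks[0][-64:] == chunks[1][:64]
```

Each boundary is now covered twice, so a step that straddles the cut appears whole in at least one chunk as long as it is shorter than the overlap.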

One pattern that consistently improves retrieval quality in audits is hierarchical chunking: store both a small chunk (for precise retrieval) and a larger parent chunk (for context delivery). You embed and search the small chunks, but when a small chunk is selected, you inject the full parent chunk into the prompt. This gives you retrieval precision and generation context simultaneously. Libraries like LlamaIndex call this "small-to-big retrieval" but the pattern is straightforward to implement directly against pgvector if you are running a Symfony stack — see our pgvector architecture blueprint post for the schema details.
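The parent lookup step of that pattern can be sketched in a few lines. This is an in-memory illustration, not the schema from the blueprint post: `ChildChunk`, `expand_to_parents`, and the sample texts are hypothetical names and data, and `hits` stands in for the ranked result of a vector search over the small chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    chunk_id: str
    parent_id: str
    text: str


def expand_to_parents(hits, parents, max_parents=3):
    """Map ranked child hits to unique parent texts, preserving rank order."""
    seen = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.append(hit.parent_id)
        if len(seen) == max_parents:
            break
    return [parents[pid] for pid in seen]


# Parent chunks are stored once, keyed by id; only children are embedded.
parents = {
    "refunds": "Refund policy section, including the conditions for approval.",
    "sla": "Enterprise SLA section, including response time commitments.",
}
hits = [
    ChildChunk("c7", "refunds", "refunds within 90 days"),
    ChildChunk("c8", "refunds", "submitted through the billing portal"),
    ChildChunk("c2", "sla", "response time commitments"),
]
context = expand_to_parents(hits, parents)
```

Two child hits from the same parent collapse into one parent chunk, which keeps the prompt from carrying the same section twice.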

Why a Reranker Changes the Outcome More Than a Better Embedding Model

Most teams that optimise their RAG pipeline focus on swapping embedding models. They move from text-embedding-ada-002 to a newer model, run a few manual tests, and feel good about the change. Embedding model quality matters — but it is not the highest-leverage intervention for most production pipelines.

The highest-leverage intervention is adding a cross-encoder reranker between vector retrieval and prompt assembly.

Here is the difference. A bi-encoder (the kind of model used to generate embeddings) encodes the query and each document independently, then computes similarity in vector space. This is fast and scalable, which is why it works for retrieval over millions of documents. But bi-encoders lose the fine-grained token-level interaction between the query and the document — two texts can be semantically close in embedding space while one of them is actually a much better answer to the specific question.

A cross-encoder takes the query and a candidate chunk together as a single input and produces a relevance score. It attends to every token of the query in relation to every token of the document. This is far more accurate — and far slower, which is why you do not use it for the initial retrieval pass.

The standard pattern is: retrieve 20–50 candidates from the vector store (cheap), rerank with a cross-encoder (moderate cost, bounded by candidate count), take the top 3–5 for the prompt. On most of the RAG pipelines I have audited, adding a cross-encoder reranker improved faithfulness scores by 15–30 percentage points without touching the embedding model or the prompt template. That is a significant improvement for the cost of one additional model call per query.
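The retrieve-then-rerank step can be sketched with a pluggable scorer. Everything here is an assumption for illustration: `rerank` and `toy_overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading a model. In a real pipeline, the `predict` method of a sentence-transformers `CrossEncoder` accepts a list of (query, text) pairs and should slot into the `score_pairs` argument.

```python
def rerank(query, candidates, score_pairs, top_n=3):
    """Score (query, chunk) pairs jointly and keep the top_n chunks."""
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def toy_overlap_scorer(pairs):
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    scores = []
    for query, chunk in pairs:
        q_words = set(query.lower().split())
        c_words = set(chunk.lower().split())
        scores.append(len(q_words & c_words) / len(q_words) if q_words else 0.0)
    return scores


# Candidates as they might come back from the vector store, in retrieval order.
candidates = [
    "Our pricing page lists three plans",
    "Enterprise customers have a refund window of 90 days",
    "The enterprise SLA covers uptime",
]
top = rerank("refund window for enterprise customers", candidates,
             toy_overlap_scorer, top_n=2)
```

The cost stays bounded because the scorer only ever sees the candidates the vector store returned, never the whole corpus.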

For self-hosted stacks, cross-encoder/ms-marco-MiniLM-L-6-v2 from Hugging Face is a practical starting point: it is small enough to run on a single GPU or even a beefy CPU instance for moderate query volumes. For managed options, Cohere Rerank is simple to integrate and works well if you are already using their embedding model. The latency budget for a cross-encoder rerank over 20 candidates is typically 50–150 ms, which is acceptable inside a user-facing chat interface.

The decision tree for "is a reranker worth it" is roughly: if your retrieval recall (does the right document appear somewhere in the top-20 results?) is already above 80% but your precision (is the right document in the top-3 going into the prompt?) is below 60%, add a reranker. If your recall is below 80%, fix your chunking and embedding first — a reranker cannot recover documents that were never retrieved.
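That decision can be sketched as two small functions, assuming each test query has exactly one known relevant document. The names `hit_rate_at_k` and `reranker_advice` are hypothetical, the sample ids are made up, and `k=4` stands in for the top-20 cut-off to keep the example short. Both "recall" and "precision" are computed the way the paragraph above defines them: does the right document appear in the top k.

```python
def hit_rate_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Share of queries whose relevant document appears in the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


def reranker_advice(recall_wide, precision_narrow):
    """Apply the 80% / 60% thresholds from the text."""
    if recall_wide < 0.80:
        return "fix chunking and embeddings first"
    if precision_narrow < 0.60:
        return "add a reranker"
    return "retrieval looks healthy"


# Made-up retrieval results for five test queries.
ranked = [
    ["a", "b", "c", "d"],
    ["x", "y", "z", "q"],
    ["m", "n", "o", "p"],
    ["d1", "d2", "d3", "d4"],
    ["e1", "e2", "e3", "e4"],
]
relevant = ["d", "x", "p", "d4", "e2"]

recall = hit_rate_at_k(ranked, relevant, k=4)     # wide cut-off
precision = hit_rate_at_k(ranked, relevant, k=3)  # what reaches the prompt
advice = reranker_advice(recall, precision)
```

In this made-up data the right document is always retrieved but often sits in fourth place, which is exactly the situation a reranker is meant to fix.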

The Four-Question Eval Harness

The single most consistent failure I see in AI product development is shipping without an offline evaluation harness. Teams test by feel — they run a few queries, the answers look reasonable, they deploy. This works until it catastrophically does not.

A good RAG eval harness answers four questions about every query in a representative test set:

1. Faithfulness — does every claim in the generated answer appear in the retrieved context? A faithfulness score of 1.0 means the model made no claims that were not grounded in the retrieved chunks. Hallucinations show up as faithfulness below 1.0. You measure this by extracting the atomic claims in the answer and checking each one against the retrieved context — this can be done with a small LLM judge prompt. RAGAS is the most widely used open-source library for automating this.

2. Answer relevance — does the answer actually address what the user asked? A model can be perfectly faithful (every claim is grounded) while still dodging the question or answering a slightly different question than the one asked. Measure this with a semantic similarity score between the question and the answer, or with a judge prompt that scores whether the answer resolves the user intent.

3. Context relevance — are the retrieved chunks actually relevant to the question? This is a retrieval quality metric, not a generation quality metric. If your context relevance is low, your retrieval layer is broken regardless of what the model does with the bad context. Track this separately from answer quality so you know whether to tune retrieval or generation.

4. End-to-end latency — P50, P95, P99. RAG pipelines under production load behave differently than in testing because vector stores have cold caches, rerankers have warm-up costs, and LLM providers have rate limits. Include latency in your eval so you have a regression baseline, not just quality baselines.
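Two of the four measurements above, faithfulness and latency percentiles, can be sketched without any dependencies. This is not how RAGAS implements its metrics: `faithfulness`, `substring_judge`, and `latency_percentiles` are hypothetical names, and the substring check is a deliberately naive stand-in for an LLM judge prompt. Claim extraction is assumed to have happened already.

```python
import statistics


def faithfulness(claims, context, is_supported):
    """Fraction of atomic claims the judge finds supported by the context."""
    if not claims:
        return 1.0  # no claims means nothing ungrounded
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def substring_judge(claim, context):
    """Stand-in for an LLM judge: naive case-insensitive containment."""
    return claim.lower() in context.lower()


def latency_percentiles(latencies_ms):
    """Return P50, P95 and P99 from per-query latencies (needs 2+ samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


context = "Enterprise accounts are eligible for refunds within 90 days."
claims = ["refunds within 90 days", "refunds are automatic"]
score = faithfulness(claims, context, substring_judge)

# Made-up latencies of 1..101 ms, one per test query.
percentiles = latency_percentiles(list(range(1, 102)))
```

Keeping the judge as a parameter means the same harness can start with a cheap heuristic and move to an LLM judge later without changing how scores are aggregated.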

Building a test set requires discipline. You need at least 50–100 question-answer pairs drawn from realistic user queries, not queries you made up while building the feature. The fastest way to build a realistic test set is to collect the first two weeks of production queries, sample a representative spread, and annotate ground-truth answers manually. Yes, manually. Synthetic test sets generated by the same LLM you are evaluating are circular and unreliable — the model will score well on its own failure modes.

Run this harness on every change to your pipeline: every prompt edit, every chunk size experiment, every embedding model swap, every reranker threshold adjustment. A change that improves faithfulness but degrades context relevance has moved the problem, not solved it.
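A simple regression gate makes that kind of trade-off visible. This is a sketch under stated assumptions: `find_regressions`, the metric names, the 0.02 tolerance, and all the numbers are invented for illustration, not results from a real pipeline.

```python
def find_regressions(baseline, candidate, tolerance=0.02,
                     lower_is_better=("p95_latency_ms",)):
    """List metrics where the candidate is worse than baseline beyond tolerance."""
    regressions = {}
    for name, base in baseline.items():
        new = candidate[name]
        if name in lower_is_better:
            worse = new > base * (1 + tolerance)  # relative slack for latency
        else:
            worse = new < base - tolerance        # absolute slack for 0..1 scores
        if worse:
            regressions[name] = (base, new)
    return regressions


# Illustrative numbers only.
baseline = {
    "faithfulness": 0.82,
    "answer_relevance": 0.78,
    "context_relevance": 0.74,
    "p95_latency_ms": 900,
}
candidate = {
    "faithfulness": 0.90,
    "answer_relevance": 0.79,
    "context_relevance": 0.61,
    "p95_latency_ms": 910,
}
regressions = find_regressions(baseline, candidate)
```

Here the candidate improves faithfulness while context relevance drops well past the tolerance, so the gate reports it even though the headline metric went up.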

The Failure Modes We Actually See in Audits

When we conduct AI integration audits on RAG-backed products, the issues cluster into a few recurring patterns worth knowing in advance.

Version confusion is one of the most damaging. If your document corpus contains multiple versions of the same policy, procedure, or specification, your RAG pipeline will retrieve whichever version has the highest embedding similarity to the query, which may not be the current version. The LLM will answer confidently from an outdated source. The fix is to include version or effective-date metadata in your chunks and filter at retrieval time: for each document, only retrieve chunks from the most recent version whose effective date has already passed. Metadata filtering in pgvector is inexpensive and this single change eliminates a large class of confident-but-wrong answers.
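The filter logic can be sketched in memory, assuming each chunk carries a `doc_id` and an `effective_date`. Those field names, the function `latest_effective_chunks`, and the sample policy texts are hypothetical. In a database this would normally be a WHERE clause applied alongside the vector search rather than a Python loop.

```python
from datetime import date


def latest_effective_chunks(chunks, as_of):
    """Keep chunks from each document's newest version effective on or before as_of."""
    newest = {}
    for chunk in chunks:
        doc, eff = chunk["doc_id"], chunk["effective_date"]
        if eff <= as_of and (doc not in newest or eff > newest[doc]):
            newest[doc] = eff
    return [c for c in chunks if newest.get(c["doc_id"]) == c["effective_date"]]


# Three versions of the same hypothetical policy, one not yet in force.
chunks = [
    {"doc_id": "refund-policy", "effective_date": date(2024, 1, 1),
     "text": "Refunds within 60 days"},
    {"doc_id": "refund-policy", "effective_date": date(2025, 6, 1),
     "text": "Refunds within 90 days"},
    {"doc_id": "refund-policy", "effective_date": date(2027, 1, 1),
     "text": "Refunds within 120 days"},
]
current = latest_effective_chunks(chunks, as_of=date(2026, 5, 26))
```

Filtering on "effective on or before today" rather than simply "newest" keeps a policy that has been published but is not yet in force out of the answers.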

The lost needle problem appears when a small piece of critical information — a specific number, a threshold, a name — is embedded inside a long chunk full of surrounding context. The embedding for the full chunk will not strongly signal the specific fact, so queries targeting that fact will miss it. Hierarchical chunking addresses this, but so does explicitly extracting high-value structured data (numbers, dates, named entities) into a separate retrieval path — sometimes the right answer is a SQL lookup, not a vector search.
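A structured retrieval path can be as small as one table. The sketch below uses an in-memory SQLite database purely for illustration: the `facts` table, its columns, the `lookup_fact` helper, and the `source_chunk` ids are all assumptions, and the values echo the refund example from earlier in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (entity TEXT, attribute TEXT, value TEXT, source_chunk TEXT)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("enterprise", "refund_window_days", "90", "refund-policy#3"),
        ("enterprise", "refund_channel", "billing portal", "refund-policy#3"),
    ],
)


def lookup_fact(conn, entity, attribute):
    """Return (value, source_chunk) for an extracted fact, or None if unknown."""
    return conn.execute(
        "SELECT value, source_chunk FROM facts WHERE entity = ? AND attribute = ?",
        (entity, attribute),
    ).fetchone()


hit = lookup_fact(conn, "enterprise", "refund_window_days")
miss = lookup_fact(conn, "enterprise", "sla_response_hours")  # fall back to vector search
```

Storing the source chunk id next to each fact lets the answer still cite where the number came from, and a `None` result is the signal to fall back to the normal vector path.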

Prompt bleeding happens when you retrieve multiple chunks and they contain contradictory information. The model will typically synthesise across them, producing an answer that blends both sources. Without faithfulness evaluation you will never detect this. The retrieval layer should deduplicate near-duplicate chunks and your prompt template should instruct the model to flag contradictions rather than resolve them silently.
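Near-duplicate removal can be sketched with word-level Jaccard similarity. The 0.8 threshold and the names `jaccard` and `drop_near_duplicates` are illustrative assumptions; production systems often compare embeddings or use MinHash instead.

```python
def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def drop_near_duplicates(chunks, threshold=0.8):
    """Keep the first (highest-ranked) chunk of each near-duplicate group."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, other) < threshold for other in kept):
            kept.append(chunk)
    return kept


# Ranked chunks; the second is a light rewording of the first.
ranked_chunks = [
    "Refunds are available within 90 days of purchase",
    "Refunds are available within 90 days of the purchase",
    "Enterprise SLA guarantees a four hour response time",
]
kept = drop_near_duplicates(ranked_chunks)
```

One caveat: two chunks that differ only in a single number look like near-duplicates to this measure, so deduplication complements the instruction to flag contradictions and does not replace it.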

Token budget collapse is a silent killer at scale. Your retrieval logic retrieves 5 chunks. Each chunk is 512 tokens. You have a system prompt, a user message, and conversation history. Suddenly you are approaching your model context limit and the framework silently truncates the last two retrieved chunks before sending the prompt. The model never sees context you assumed it had, and your eval harness (which tested with short conversations) never caught it. Always instrument your actual token usage in production, set hard limits on retrieved context length, and reserve that budget explicitly so a growing conversation history cannot crowd it out.
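An explicit budget check might look like the sketch below. The function `fit_to_budget`, the 8,192-token limit, and the 6,000 reserved tokens are assumptions for illustration, and word counting stands in for the tokenizer of whichever model you call.

```python
def fit_to_budget(chunks, count_tokens, context_limit, reserved_tokens,
                  max_context_tokens):
    """Take ranked chunks in order until the retrieved-context budget is used up."""
    budget = min(max_context_tokens, context_limit - reserved_tokens)
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop here rather than let a framework truncate silently
        selected.append(chunk)
        used += cost
    return selected, used


def count_words(text):
    """Stand-in token counter; a real one would use the model's tokenizer."""
    return len(text.split())


# Five retrieved chunks of 512 "tokens" each, in rank order.
chunks = [" ".join(["tok"] * 512) for _ in range(5)]
selected, used = fit_to_budget(
    chunks,
    count_words,
    context_limit=8192,       # hypothetical model limit
    reserved_tokens=6000,     # system prompt, long history, answer headroom
    max_context_tokens=2560,  # hard cap on retrieved context
)
dropped = len(chunks) - len(selected)
```

Logging `dropped` and `used` on every request gives you the production signal the paragraph above asks for, and running the harness with long conversation histories exercises the same code path offline.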

Before You Ship the Next RAG Feature

None of these problems are exotic. Chunking strategy, reranking, and eval harnesses are all well-understood — the gap is usually that teams do not know they need them until after they have already shipped without them.

If you are building or maintaining a RAG-backed product and want an independent assessment of your retrieval quality before the next release, reach out at hello@wolf-tech.io or start a conversation at wolf-tech.io. We can usually identify the highest-leverage fixes within a single audit session and give you a concrete before/after baseline from the eval harness.

A production AI feature that hallucinates is not an AI problem — it is an engineering problem. And engineering problems have engineering solutions.