RAG Evaluation Metrics: Measuring Retrieval Quality Before You Blame the Model

Sandor Farkas - Founder & Lead Developer at Wolf-Tech


A RAG feature gives a wrong answer in a demo, and the room reaches the same conclusion within seconds: the model is not good enough. Someone suggests a bigger model, someone else suggests a different provider, and a sprint disappears into prompt tuning that changes nothing. The missing step is measurement. RAG evaluation metrics tell you whether the retrieval layer actually handed the model the right context, and in most failing pipelines it did not. Before you touch the generation prompt or upgrade the LLM, you need numbers that separate a retrieval problem from a generation problem.

This post is about those numbers. Not the academic versions you will never compute, but the handful of retrieval metrics that fit on a dashboard and tell you where the failure lives. I will define each one in plain terms, show what a bad score looks like in practice, and explain how to assemble a small offline harness you can run before every meaningful change to your pipeline.

Why "Blame the Model" Is Almost Always Wrong

Retrieval-Augmented Generation has two stages, and the failure modes look identical from the outside. A user asks a question, your system retrieves some chunks from a vector store, and the model writes an answer using those chunks. When the answer is wrong, it could be because the model reasoned badly over good context, or because the model reasoned perfectly over bad context. The output looks the same: a confident, fluent, incorrect paragraph.

The instinct to blame the model is understandable because the model is the part that produced the visible text. But in the pipelines we review during code audits and AI integration work, the retrieved context is wrong far more often than the reasoning is. The model is usually doing exactly what you asked: summarising whatever chunks arrived. If the right chunk never came back, no model upgrade will save you. This is why RAG evaluation metrics focus on the retrieval layer first. You measure what came back before you judge what the model did with it.

The practical consequence is a rule worth writing on the wall: never tune generation until retrieval is measured. A team that swaps models while retrieval recall sits at 0.4 is optimising the wrong half of the system.

The Core RAG Evaluation Metrics for Retrieval

You need a labelled evaluation set to compute any of these. That means a list of representative questions, and for each question the set of document chunks that genuinely contain the answer. Building this set is the real work, and I will come back to it. Once you have it, four metrics carry most of the signal.

Precision at k answers the question: of the top k chunks I retrieved, how many were actually relevant? If you retrieve 5 chunks and 2 of them contain answer-bearing content, precision@5 is 0.4. Low precision means you are stuffing the model's context window with noise. The model then has to ignore irrelevant chunks, and it often fails to, which produces answers that drift toward whatever distracting text was retrieved.
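As a minimal sketch of the calculation, assuming chunks are identified by unique string IDs (the function name and the IDs below are illustrative, not taken from any particular library):

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Divide by k rather than len(top_k): if the retriever returns fewer
    # than k chunks, the empty slots count against it. This is one common
    # convention, chosen here as an assumption.
    return hits / k


# Hypothetical ranked result: 5 chunks retrieved, 2 of them answer-bearing.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
```

Lowering k on the same ranking is a quick way to see whether the noise sits at the tail of the list or is mixed in throughout.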

Recall at k answers the complementary question: of all the chunks that contain the answer, how many made it into the top k? If a question has 3 relevant chunks in your corpus and only 1 appears in the top 5, recall@5 is 0.33. Low recall is the most damaging failure in RAG, because the model cannot use information it never received. When recall is poor, the answer is incomplete or hallucinated, and no prompt engineering fixes it.
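The same example can be sketched in a few lines, again assuming unique string chunk IDs and a labelled set with at least one relevant chunk per question (names and IDs are illustrative):

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labelled relevant chunks that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        # A question with no labelled chunks has no defined recall.
        raise ValueError("each question needs at least one relevant chunk")
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)


# Hypothetical case: 3 relevant chunks exist, only "c2" reaches the top 5.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c3", "c8"}
print(round(recall_at_k(retrieved, relevant, k=5), 2))  # 0.33
```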

Mean Reciprocal Rank (MRR) measures how high the first relevant chunk sits in your ranking. If the first correct chunk is in position 1, the reciprocal rank is 1. If it is in position 4, the reciprocal rank is 0.25. Averaged across your evaluation set, MRR tells you whether your reranking is putting good context near the top, where the model weights it most heavily. A pipeline can have decent recall and still answer badly because the relevant chunk is buried at position 8 behind seven distracting ones.
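A small sketch of the calculation follows. It assumes each evaluation run is a pair of ranked chunk IDs and the labelled relevant set, and that a question with no relevant chunk in the ranking contributes 0, which is a common convention rather than the only one:

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank across all questions in the evaluation set."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical runs: (ranked chunk IDs, labelled relevant IDs).
runs = [
    (["c1", "c5", "c9"], {"c1"}),        # first hit at position 1 -> 1.0
    (["c3", "c8", "c2", "c4"], {"c4"}),  # first hit at position 4 -> 0.25
    (["c6", "c7"], {"c0"}),              # no hit at all -> 0.0
]
print(mean_reciprocal_rank(runs))  # (1.0 + 0.25 + 0.0) / 3
```

Because MRR only looks at the first relevant chunk, it says nothing about where the second and third ones landed. Recall@k covers that gap.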

Context relevance is the softer, judgement-based metric. For each retrieved chunk, you ask: is this actually on-topic for the question? You can score this with human raters on a sample, or with an LLM-as-judge prompt at scale. Context relevance catches the case where chunks are topically adjacent but not answer-bearing, which keyword and embedding similarity scores will happily reward.
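One way to keep this metric testable is to treat the judge as a pluggable function. The sketch below assumes a hypothetical `judge(question, chunk)` callable that returns True or False. In practice it would wrap a human rating lookup or an LLM call, and the prompt wording shown is only an illustration, not a tuned template.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (question, chunk_text) -> is the chunk answer-bearing?
Judge = Callable[[str, str], bool]


def build_judge_prompt(question: str, chunk: str) -> str:
    """Illustrative prompt for an LLM-as-judge call (wording is an assumption)."""
    return (
        f"Question: {question}\n"
        f"Chunk: {chunk}\n"
        "Does the chunk contain information needed to answer the question? "
        "Reply with exactly YES or NO."
    )


def context_relevance(question: str, chunks: Sequence[str], judge: Judge) -> float:
    """Share of retrieved chunks the judge marks as relevant to the question."""
    if not chunks:
        return 0.0
    verdicts = [bool(judge(question, chunk)) for chunk in chunks]
    return sum(verdicts) / len(verdicts)


# Stand-in judge backed by made-up human ratings, so the sketch runs offline.
human_ratings = {
    "Refunds are issued within 14 days of a return.": True,
    "Our returns desk is open on weekdays.": False,  # adjacent topic, no answer
}
question = "How long does a refund take?"
score = context_relevance(
    question, list(human_ratings), lambda q, c: human_ratings[c]
)
print(score)  # 0.5
```

Swapping the lambda for a function that sends `build_judge_prompt(...)` to a model and parses the YES or NO reply gives the at-scale version, with the human-rated sample kept as a check on the judge itself.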

These four cover most diagnoses. Precision and recall tell you what the retriever found and missed. MRR tells you whether ranking order helps or hurts. Context relevance tells you whether "similar" actually means "useful" for your corpus.

Reading the Metrics Together

The diagnostic power comes from looking at the metrics as a pattern, not in isolation.

High recall and low precision means you are retrieving the right chunks but also a lot of junk. The fix is usually a reranker or a smaller k, not a different embedding model. You are finding the answer; you are just drowning it.

Low recall regardless of precision is the serious one. The answer-bearing chunks are not in your candidate set at all. Look at chunking strategy first, because chunks that split a fact across boundaries cannot be retrieved whole. Then look at how well the embedding model fits your domain, and at whether your queries are being embedded the same way as your documents.

Good recall and good precision but bad answers points back at generation or at ranking. Check MRR. If the relevant chunk is consistently buried, the model is weighting earlier, irrelevant chunks more heavily. If MRR is also healthy and answers are still wrong, only then do you have a genuine generation problem worth a prompt or model change.
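The triage order described above can be written down as a small function. This is a sketch only: the threshold defaults are placeholder assumptions, not recommended values, and should be set per product.

```python
def diagnose(
    precision: float,
    recall: float,
    mrr: float,
    *,
    recall_floor: float = 0.8,     # placeholder threshold, an assumption
    precision_floor: float = 0.5,  # placeholder threshold, an assumption
    mrr_floor: float = 0.5,        # placeholder threshold, an assumption
) -> str:
    """Return the first layer to investigate, checked in triage order."""
    # 1. Low recall comes first: the answer never reached the model.
    if recall < recall_floor:
        return "recall: check chunking, embedding fit, and query embedding"
    # 2. Recall is fine but the context is noisy.
    if precision < precision_floor:
        return "precision: add a reranker or lower k"
    # 3. Right chunks are present but ranked too low.
    if mrr < mrr_floor:
        return "ranking: relevant chunks are buried, review reranking"
    # 4. Retrieval looks healthy, so generation is the remaining suspect.
    return "generation: retrieval is healthy, review prompt or model"


print(diagnose(precision=0.2, recall=0.9, mrr=0.7))
print(diagnose(precision=0.7, recall=0.4, mrr=0.7))
print(diagnose(precision=0.7, recall=0.9, mrr=0.8))
```

The function returns only the first failing layer on purpose, so a team fixes recall before spending time on ranking or prompts.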

This ordering matters because it stops the expensive guessing. Each metric narrows the search space for the fix, so a sprint goes toward the layer that is actually broken.

Building an Offline Eval Harness You Will Actually Run

A metric you compute once during a panic is not an evaluation system. The point is to run these numbers automatically, offline, every time you change chunk size, swap an embedding model, adjust k, or add a reranker. That turns RAG evaluation metrics from a forensic tool into a regression gate.

Start with the dataset, because it is the hard part. Collect 50 to 100 real questions, ideally from actual user logs rather than invented ones, since real queries are messier and more revealing than the clean examples a developer writes. For each question, have a domain expert identify which chunks in your corpus contain the answer. This labelling is tedious and there is no shortcut, but 50 well-labelled questions are worth more than 5,000 synthetic ones. You can bootstrap the first pass with an LLM that proposes candidate relevant chunks, then have a human confirm, which cuts the labelling time substantially.
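As an illustration of what the labelled set can look like on disk, here is a sketch that stores one question per line as JSON. The field names (`question`, `relevant_chunk_ids`, `source`, `confirmed_by_human`) and the sample rows are hypothetical choices, not a standard format.

```python
import json
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class EvalCase:
    question: str
    relevant_chunk_ids: FrozenSet[str]
    source: str                # e.g. "user_log" or "llm_proposed"
    confirmed_by_human: bool   # False until an expert has checked the labels


def parse_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse JSON Lines text into eval cases, rejecting unlabelled questions."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        ids = frozenset(row["relevant_chunk_ids"])
        if not ids:
            raise ValueError(f"line {line_no}: no relevant chunks labelled")
        cases.append(
            EvalCase(
                question=row["question"],
                relevant_chunk_ids=ids,
                source=row.get("source", "user_log"),
                confirmed_by_human=bool(row.get("confirmed_by_human", False)),
            )
        )
    return cases


# Made-up sample rows for illustration only.
sample = """
{"question": "how do i reset 2fa??", "relevant_chunk_ids": ["doc12#3"], "confirmed_by_human": true}
{"question": "refund time", "relevant_chunk_ids": ["doc4#1", "doc4#2"], "source": "llm_proposed"}
"""
cases = parse_cases(sample)
print(len(cases), sum(c.confirmed_by_human for c in cases))
```

Keeping `confirmed_by_human` on every row makes it easy to compute metrics only over confirmed labels while the LLM-proposed ones are still waiting for review.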

With the set in place, the harness is mechanical. For each question, run retrieval, capture the ranked chunk IDs, and compare against the labelled relevant set to compute precision@k, recall@k, and MRR. Run a context-relevance judge over a sample. Store the run with a timestamp and the configuration that produced it. The first run becomes your baseline, and every subsequent change is measured against it rather than against a feeling.
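The loop described above might look like the following sketch. It assumes a hypothetical `retrieve(question)` callable that returns ranked chunk IDs, computes reciprocal rank within the top k only, and uses a dictionary lookup as a stand-in retriever so the example runs without a vector store.

```python
import json
from datetime import datetime, timezone
from statistics import mean
from typing import Callable, Dict, List, Sequence, Set, Tuple

Case = Tuple[str, Set[str]]                # (question, labelled relevant IDs)
Retriever = Callable[[str], Sequence[str]]  # hypothetical retrieval interface


def evaluate(cases: List[Case], retrieve: Retriever, k: int, config: Dict) -> Dict:
    """Run retrieval for every case and return averaged metrics plus metadata."""
    if not cases or k <= 0:
        raise ValueError("need at least one case and a positive k")
    precisions, recalls, reciprocal_ranks = [], [], []
    for question, relevant in cases:
        if not relevant:
            raise ValueError(f"no relevant chunks labelled for: {question!r}")
        top_k = list(retrieve(question))[:k]
        hits = {cid for cid in top_k if cid in relevant}
        precisions.append(len(hits) / k)
        recalls.append(len(hits) / len(relevant))
        first = next(
            (pos for pos, cid in enumerate(top_k, start=1) if cid in relevant), None
        )
        reciprocal_ranks.append(1.0 / first if first else 0.0)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config": dict(config),  # the settings that produced this run
        "k": k,
        "metrics": {
            f"precision@{k}": mean(precisions),
            f"recall@{k}": mean(recalls),
            "mrr": mean(reciprocal_ranks),
        },
    }


# Stand-in retriever with made-up rankings, in place of a real vector search.
fake_index = {"q1": ["c1", "c2", "c3"], "q2": ["c4", "c5", "c6"]}
cases = [("q1", {"c1", "c9"}), ("q2", {"c5"})]
run = evaluate(cases, fake_index.__getitem__, k=3, config={"chunk_size": 512})
print(json.dumps(run, indent=2))
```

Writing each run's JSON to a file named after its timestamp or commit hash is enough to start with. The first stored file is the baseline.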

Two practices keep the harness honest. Version the evaluation set alongside your code so a metric change always corresponds to either a pipeline change or a deliberate dataset change, never a silent drift. And separate retrieval metrics from end-to-end answer quality, because a single blended score hides exactly the precision-versus-recall distinction that tells you where to look. If you are running pgvector on a Symfony or Next.js stack, the harness slots in next to your existing test suite, and the schema work we cover in our Symfony and pgvector architecture blueprint gives you the chunk and metadata structure these metrics need. For the upstream chunking and reranking failures these metrics will surface, our walkthrough of why RAG pipelines return confident garbage covers the fixes in detail.

What Good Looks Like

There is no universal pass mark, because acceptable recall for a customer support bot differs from acceptable recall for a system that drafts legal summaries. What matters is direction and stability. You want recall@k high enough that the answer-bearing chunks are reliably present, MRR high enough that they sit near the top, and precision high enough that the model is not wading through noise to find them. Then you want those numbers to hold or improve with every change, never quietly regress.
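A regression check against the stored baseline can be sketched as below. The metric names, the sample numbers, and the tolerance of 0.01 are illustrative assumptions. The tolerance exists so that small run-to-run noise does not fail a build.

```python
from typing import Dict, Tuple


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.01,  # assumed noise allowance, tune per project
) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (baseline, current)} for metrics that dropped too far."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            raise KeyError(f"metric missing from current run: {name}")
        if current[name] < base_value - tolerance:
            regressions[name] = (base_value, current[name])
    return regressions


# Made-up numbers: precision improved, recall dropped, MRR moved within noise.
baseline = {"precision@5": 0.42, "recall@5": 0.81, "mrr": 0.66}
current = {"precision@5": 0.47, "recall@5": 0.74, "mrr": 0.655}
print(find_regressions(baseline, current))  # {'recall@5': (0.81, 0.74)}
```

In a CI job, a non-empty result would typically fail the build, which forces the trade-off (here, better precision bought with worse recall) to be an explicit decision.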

The teams that ship reliable RAG are not the ones with the largest model. They are the ones who can answer "did retrieval improve or get worse?" with a number instead of an opinion. Once you have that, the conversation in the room changes. Nobody guesses at the model. Somebody pulls up the metrics and points at the layer that is actually failing.

If you are debugging a RAG feature that works in the demo and falls apart in production, or you want a second set of eyes on a pipeline before it ships, we help teams build exactly this kind of measurement into their custom software. Reach us at hello@wolf-tech.io or at wolf-tech.io, and bring your worst-performing query. It usually tells the whole story.