Agent Evaluation Frameworks: Testing Non-Deterministic AI Features Before Release

#agent evaluation
Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Expert in software development and legacy code optimization

Your test suite is built on a promise: the same input produces the same output. Assert that add(2, 2) returns 4, and it returns 4 today, tomorrow, and on every CI run until someone changes the function. That promise is the foundation of everything we trust about automated testing, and agent evaluation is what you reach for when a feature breaks it.

Agentic AI features quietly break it. Ask an LLM-backed agent to "summarise this support thread and decide whether to escalate" and you get a different phrasing every time, occasionally a different decision, and once in a while a confidently wrong one. There is no assertEquals for that. This is exactly where agent evaluation comes in: a discipline and a set of tools for testing non-deterministic features before they reach users, so you ship on evidence rather than on a hopeful demo.

I have watched more than one team push an agentic feature that worked beautifully in the founder's hands and then misbehaved for one customer in twenty, with no test that could have caught it because the team was still testing AI features like deterministic code. This post is about the framework that prevents that, and the engineering decisions that make it work in a real CI pipeline.

Why Traditional Testing Fails for Agents

A conventional unit test answers a binary question: did this exact output appear? Agentic features fail that question for three independent reasons, and you need to understand all three before you can design around them.

The first is non-determinism in the model itself. Even at temperature zero, the same prompt can yield different tokens across model versions, across providers, and sometimes across calls, because floating-point non-determinism and provider-side batching leak through. You cannot pin the output the way you pin a hash.

The second is valid variation. "The customer wants a refund and is frustrated" and "Refund requested; customer is unhappy with the delay" are both correct summaries. A string comparison marks the second one as a failure even though a human would call it a pass. Your test harness has to judge meaning, not characters.

The third, and the one that bites hardest, is multi-step trajectories. A real agent does not produce one output. It plans, calls tools, reads results, and decides what to do next. An agent can reach the right final answer through a broken path that happens to work this time and fails next week, or reach a wrong answer through reasoning that looked fine at every individual step. You are not testing a value; you are testing a process that unfolds over several decisions.

This is why "it worked when I tried it" is not evidence of anything. A single manual run samples one path through a probability distribution. Agent evaluation exists to sample that distribution systematically and tell you what fraction of the time the feature does the right thing.

The Anatomy of an Agent Evaluation Framework

Strip away the tooling and every serious evaluation setup has the same four parts. Get these right and the specific library you choose barely matters.

An evaluation dataset. This is a curated set of inputs, each paired with a description of what a good outcome looks like: not necessarily an exact expected string, but often a rubric or a set of must-have properties. The dataset is the single most valuable asset you will build, and it is worth more than any framework, because it encodes your actual definition of correct behaviour. Start it with twenty to fifty hand-written cases covering your happy paths, then grow it from production. Every time the agent fails for a real user, that interaction becomes a new test case. Within a few months your dataset reflects the messy reality of your users rather than your optimistic assumptions.
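As a rough illustration of what such a dataset could look like in code, here is a minimal sketch. The `EvalCase` fields, the example cases, and the `add_production_failure` helper are all hypothetical names chosen for this example; your own rubric will likely need different properties.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    # Properties a good outcome must have, instead of one exact expected string.
    must_mention: Tuple[str, ...] = ()
    expected_decision: Optional[str] = None
    source: str = "hand-written"  # or "production-failure"


# Hypothetical starter cases; a real dataset would hold twenty to fifty.
DATASET: List[EvalCase] = [
    EvalCase(
        case_id="refund-happy-path",
        input_text="Customer asks for a refund after a two-week delivery delay.",
        must_mention=("refund", "delay"),
        expected_decision="escalate",
    ),
    EvalCase(
        case_id="simple-question",
        input_text="Customer asks how to change their email address.",
        must_mention=("email",),
        expected_decision="no_escalation",
    ),
]


def add_production_failure(dataset, case_id, input_text, must_mention, expected_decision):
    """Return a new dataset with a real user failure appended as a test case."""
    new_case = EvalCase(
        case_id,
        input_text,
        tuple(must_mention),
        expected_decision,
        source="production-failure",
    )
    return [*dataset, new_case]
```

Tagging each case with its `source` makes it easy to see, later on, how much of the dataset came from real failures rather than from your own guesses.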

A runner. This executes the agent against every case in the dataset, capturing not just the final output but the full trajectory: which tools were called, in what order, with what arguments, and what came back. You need the trajectory because two runs with identical final answers can have wildly different reliability, and you want to catch the agent that got lucky.
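One way to capture a trajectory is to wrap every tool before handing it to the agent, so each call is logged as a side effect. The sketch below assumes the agent is a plain callable that receives its tools as a dictionary; `ToolCall`, `RunRecord`, `toy_agent`, and `lookup_order` are illustrative stand-ins, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    result: Any


@dataclass
class RunRecord:
    case_id: str
    final_output: str
    trajectory: List[ToolCall] = field(default_factory=list)


def record_tools(tools: Dict[str, Callable], trajectory: List[ToolCall]):
    """Wrap each tool so every call is appended to the trajectory, in order."""
    def wrap(name, fn):
        def wrapped(**kwargs):
            result = fn(**kwargs)
            trajectory.append(ToolCall(name, kwargs, result))
            return result
        return wrapped
    return {name: wrap(name, fn) for name, fn in tools.items()}


def run_case(case_id, input_text, agent, tools):
    trajectory: List[ToolCall] = []
    output = agent(input_text, record_tools(tools, trajectory))
    return RunRecord(case_id, output, trajectory)


# Stand-in tool and agent, for illustration only.
def lookup_order(order_id):
    return {"order_id": order_id, "status": "delayed"}


def toy_agent(input_text, tools):
    order = tools["lookup_order"](order_id="A-100")
    return f"Order {order['order_id']} is {order['status']}; escalate."


record = run_case(
    "refund-delay-001", "Where is my order A-100?", toy_agent, {"lookup_order": lookup_order}
)
print([call.name for call in record.trajectory], record.final_output)
```

Because the record keeps tool names, arguments, and results, scorers can later inspect the path as well as the final answer.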

Scorers. These turn each run into a measurement. Some are deterministic and cheap. Did the agent call the refund tool with a valid order ID? Did the output validate against the JSON schema you require? Did it stay under the latency budget? Others are semantic and need an LLM-as-judge, which I will come back to because it is both the most powerful and the most misused part of the stack. The combination matters: deterministic scorers give you hard guarantees on structure and safety, while semantic scorers handle the fuzzy question of whether the answer is actually good.
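The deterministic scorers are the easy part to sketch. The example below assumes a trajectory stored as `(tool_name, arguments)` pairs, an order ID format like `A-100`, and an output with `summary` and `escalate` fields; all three are assumptions made for illustration. It also checks only for required keys rather than a full JSON schema.

```python
import json
import re

ORDER_ID_PATTERN = re.compile(r"^[A-Z]-\d+$")  # assumed order ID format


def called_refund_with_valid_order(trajectory):
    """trajectory: list of (tool_name, arguments) pairs."""
    return any(
        name == "refund" and ORDER_ID_PATTERN.match(str(args.get("order_id", "")))
        for name, args in trajectory
    )


def output_has_required_fields(raw_output, required=("summary", "escalate")):
    """True if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required)


def within_latency_budget(latency_seconds, budget_seconds=10.0):
    # The ten-second default is a placeholder, not a recommendation.
    return latency_seconds <= budget_seconds
```

Each scorer returns a plain boolean, which keeps aggregation simple and makes failures easy to explain.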

A reporting and gating layer. This aggregates scores across the dataset into numbers you can act on, tracks them over time, and decides whether a change is allowed to ship. Without gating, evaluation is a dashboard nobody looks at. With it, evaluation becomes a real release gate, the same way a failing unit test blocks a merge.

If you are building AI features into an existing product and want a second set of eyes on how these pieces fit your stack, this is the kind of architecture work our team does during custom software development engagements.

Metrics That Actually Mean Something

The temptation is to reduce agent quality to a single accuracy number. Resist it. Non-deterministic systems need a small panel of metrics, each answering a different question.

Task success rate is the headline: across your dataset, what fraction of runs achieve the intended outcome under your rubric? Because the system is non-deterministic, you should run each case several times (five is a reasonable starting point) and report the success rate per case. A case that passes three times out of five is not a passing case; it is a coin flip you have not noticed yet.
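A per-case success rate with repetitions can be computed in a few lines. In this sketch, `run_and_score` stands for whatever function runs your agent once and returns pass or fail; the `scripted_runner` helper just replays fixed outcomes so the example is reproducible, and the `required_rate` threshold is an assumption you would set yourself.

```python
def success_rate_per_case(cases, run_and_score, repetitions=5):
    """cases: dict of case_id -> input. run_and_score(input) -> bool for one run."""
    rates = {}
    for case_id, input_text in cases.items():
        passes = sum(1 for _ in range(repetitions) if run_and_score(input_text))
        rates[case_id] = passes / repetitions
    return rates


def unreliable_cases(rates, required_rate=1.0):
    """Case IDs whose success rate falls below the required rate."""
    return sorted(case_id for case_id, rate in rates.items() if rate < required_rate)


def scripted_runner(outcomes):
    """Illustration only: replays fixed pass/fail outcomes instead of calling a model."""
    remaining = iter(outcomes)
    return lambda _input_text: next(remaining)


cases = {"stable": "input one", "coin-flip": "input two"}
runner = scripted_runner([True] * 5 + [True, False, True, False, True])
rates = success_rate_per_case(cases, runner)
print(rates)                    # {'stable': 1.0, 'coin-flip': 0.6}
print(unreliable_cases(rates))  # ['coin-flip']
```

Reporting the rate per case, rather than one pooled number, is what surfaces the three-out-of-five cases that a dataset-wide average would hide.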

Trajectory validity measures whether the agent took a sound path, independent of the final answer. Did it call only tools it was allowed to call, in a sensible order, without redundant or dangerous steps? You can have a high task success rate sitting on top of fragile trajectories, and that gap is where future incidents live.
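A trajectory check can start as a handful of explicit rules over the sequence of tool names. The allow-list, the ordering rule (look up the order before refunding), and the repeat limit below are hypothetical examples of such rules, not a complete policy.

```python
ALLOWED_TOOLS = {"lookup_order", "refund", "escalate"}  # assumed allow-list
REQUIRED_BEFORE = {"refund": "lookup_order"}            # assumed ordering rule


def trajectory_problems(tool_names, max_repeats=2):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen = []
    for name in tool_names:
        if name not in ALLOWED_TOOLS:
            problems.append(f"disallowed tool: {name}")
        prerequisite = REQUIRED_BEFORE.get(name)
        if prerequisite and prerequisite not in seen:
            problems.append(f"{name} called before {prerequisite}")
        seen.append(name)
        # Report redundancy once, at the first call past the limit.
        if seen.count(name) == max_repeats + 1:
            problems.append(f"redundant calls to {name}")
    return problems


print(trajectory_problems(["lookup_order", "refund"]))  # []
print(trajectory_problems(["refund"]))                  # ['refund called before lookup_order']
```

Returning the list of problems rather than a single boolean makes the eval report far more useful when a run fails.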

Consistency is the metric teams forget. Run the same input multiple times and measure how much the outcomes vary. A feature that is correct on average but swings between three different decisions for the same input will erode user trust faster than one that is slightly less accurate but stable. For anything touching money or compliance, low variance can matter more than peak accuracy.
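One simple way to put a number on consistency, assuming each run ends in a discrete decision, is the share of runs that agree with the most common decision for that input. This is only one possible definition, and the function name is made up for the example.

```python
from collections import Counter


def decision_consistency(decisions):
    """Share of runs agreeing with the most common decision (1.0 = fully stable)."""
    if not decisions:
        raise ValueError("need at least one run")
    most_common_count = Counter(decisions).most_common(1)[0][1]
    return most_common_count / len(decisions)


print(decision_consistency(["escalate"] * 5))                                        # 1.0
print(decision_consistency(["escalate", "escalate", "refund", "ignore", "escalate"]))  # 0.6
```

Note that this measures stability only: an agent that is wrong the same way every time scores 1.0, which is why consistency sits next to task success rate rather than replacing it.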

Cost and latency per task belong in the eval report too, because an agent that quietly starts making twice as many tool calls after a prompt change is a regression even if accuracy holds. Tracking these alongside quality stops you from shipping a "better" agent that doubles your inference bill. We dug into the spend side of this in our post on LLM cost control: token budgets, caching, and graceful fallbacks.
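To make that kind of regression visible, the eval report can carry per-task averages and their change against a baseline. The field names `tool_calls`, `tokens`, and `latency_s` and the sample numbers below are assumptions for illustration.

```python
from statistics import mean

USAGE_KEYS = ("tool_calls", "tokens", "latency_s")


def usage_summary(runs):
    """runs: list of dicts with per-task usage. Returns the mean of each metric."""
    return {key: mean(run[key] for run in runs) for key in USAGE_KEYS}


def relative_change(baseline, candidate):
    """Fractional change per metric; assumes every baseline value is non-zero."""
    return {key: (candidate[key] - baseline[key]) / baseline[key] for key in baseline}


baseline_runs = [
    {"tool_calls": 3, "tokens": 1000, "latency_s": 2.0},
    {"tool_calls": 3, "tokens": 1200, "latency_s": 4.0},
]
candidate_runs = [
    {"tool_calls": 6, "tokens": 2100, "latency_s": 3.0},
    {"tool_calls": 6, "tokens": 2300, "latency_s": 3.0},
]
change = relative_change(usage_summary(baseline_runs), usage_summary(candidate_runs))
print(change)  # tool calls and tokens doubled (+1.0), latency unchanged (0.0)
```

Here the candidate would look fine on latency alone, while the doubled tool calls and tokens show up immediately.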

LLM-as-Judge: Powerful and Easy to Get Wrong

For semantic scoring, the dominant pattern is to use a second LLM to grade the output of the first against a rubric. Done well, this scales human judgement to thousands of cases. Done badly, it gives you a confident green dashboard that means nothing.

Three rules keep it honest. First, give the judge a specific rubric, not a vibe. "Is this a good summary?" produces noise. "Does the summary mention the refund amount, the reason for the request, and the customer's stated deadline? Score one point for each" produces a measurement you can reason about. Decompose quality into checkable properties.
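The first rule translates fairly directly into code: the rubric becomes a list of yes/no questions, and the judge is asked to answer each one separately. The sketch below only builds the prompt and parses the reply; the call to the judge model is left out, and the JSON reply format is an assumption you would have to enforce in your own prompt.

```python
import json

# One checkable property per rubric line, one point each.
RUBRIC = [
    ("refund_amount", "Does the summary mention the refund amount?"),
    ("reason", "Does the summary mention the reason for the request?"),
    ("deadline", "Does the summary mention the customer's stated deadline?"),
]


def build_judge_prompt(thread, summary):
    questions = "\n".join(f"- {key}: {question}" for key, question in RUBRIC)
    return (
        "Grade the summary against the support thread.\n"
        "Answer each question with true or false, as a JSON object keyed by name.\n"
        f"{questions}\n\nThread:\n{thread}\n\nSummary:\n{summary}\n"
    )


def parse_judge_reply(reply):
    """Score one point per rubric item answered true; missing keys score zero."""
    verdict = json.loads(reply)  # raises on malformed JSON, by design
    if not isinstance(verdict, dict):
        raise ValueError("judge reply must be a JSON object")
    return sum(1 for key, _ in RUBRIC if verdict.get(key) is True)


print(parse_judge_reply('{"refund_amount": true, "reason": true, "deadline": false}'))  # 2
```

Keeping the per-property answers, and not just the total, also tells you which part of the rubric a regression hit.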

Second, validate the judge against humans before you trust it. Take fifty cases, have a person score them, have the judge score them, and measure agreement. If the judge disagrees with your team on a quarter of cases, its numbers are not yet trustworthy and the rubric needs tightening. This calibration step is non-negotiable and it is the one almost everyone skips.
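The calibration step needs nothing more than two lists of scores for the same cases. A minimal sketch, assuming human and judge scores are directly comparable values such as pass/fail or rubric points:

```python
def agreement_rate(human_scores, judge_scores):
    """Fraction of cases where the judge gave the same score as the human."""
    if len(human_scores) != len(judge_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(1 for h, j in zip(human_scores, judge_scores) if h == j)
    return matches / len(human_scores)


def disagreements(case_ids, human_scores, judge_scores):
    """Case IDs worth reading by hand when tightening the rubric."""
    return [
        case_id
        for case_id, h, j in zip(case_ids, human_scores, judge_scores)
        if h != j
    ]


print(agreement_rate([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.75
```

Raw agreement is the simplest measure; if most cases pass, a chance-corrected statistic such as Cohen's kappa may give a more honest picture.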

Third, watch for known judge biases. LLM judges tend to favour longer answers, favour outputs from the same model family, and can be swayed by confident tone over correctness. Where it matters, randomise position, strip identifying style, and keep a deterministic check alongside the judge so a single biased grader cannot wave through a structurally broken output. The guardrail patterns in our piece on AI safety nets pair naturally with judge-based scoring, because the same checks that protect users in production can run as scorers in evaluation.
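Position randomisation applies when the judge compares two candidate answers side by side. A minimal sketch, assuming the judge replies with either "first" or "second": shuffle the order per comparison, then map the verdict back to the original labels.

```python
import random


def order_for_judging(answer_a, answer_b, rng):
    """Randomly decide which answer the judge sees first."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    return first, second, swapped


def unswap_verdict(verdict, swapped):
    """Map the judge's 'first'/'second' verdict back to 'a' or 'b'."""
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    return "b" if picked_first == swapped else "a"


rng = random.Random(42)  # seeded only so the example is repeatable
first, second, swapped = order_for_judging("answer from A", "answer from B", rng)
print(first, "|", second, "| swapped:", swapped)
```

If the judge's preference changes when the order flips, that is a sign of position bias rather than a real quality difference.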

Wiring Evaluation Into Your Pipeline

An evaluation suite that runs only when someone remembers to run it is worth very little. The goal is to make agent evaluation a normal part of how changes ship, with the same reflexes you already have around unit tests.

Run a fast subset on every pull request. Full evaluation across a large dataset with multiple repetitions per case is too slow and too expensive for every commit, but a curated smoke set of ten to twenty critical cases can run in a couple of minutes and catch the obvious regressions. Treat a drop in success rate on that subset the way you treat a failing test: it blocks the merge.

Run the full suite nightly and before every release, and store the results so you can see trends. The first time someone tweaks a prompt to fix one customer complaint and the nightly run shows task success quietly dropping two points elsewhere, the investment pays for itself. Prompt changes have non-local effects, and only a standing evaluation suite makes those effects visible.

Crucially, gate on regression, not on a perfect score. Agentic features rarely hit a hundred percent, and chasing that number wastes effort. The useful gate is relative: this change must not reduce task success rate below the current baseline minus a small tolerance, and must not increase cost or latency beyond budget. That turns evaluation from a vanity metric into a contract about not making things worse, which is exactly what a release gate should enforce. To see those trends you also need production telemetry feeding back in, which is where the practices in LLM observability on a budget connect directly to your offline evals: production failures become tomorrow's test cases.
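A relative gate of this kind fits in one small function. In the sketch below, `EvalSummary`, the two-point tolerance, and the cost and latency budgets are all placeholder assumptions; the point is the shape of the comparison, not the specific numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    success_rate: float       # fraction between 0 and 1
    cost_per_task_usd: float
    p95_latency_s: float


def gate(baseline, candidate, tolerance=0.02, max_cost_usd=0.05, max_latency_s=10.0):
    """Return the reasons to block the change; an empty list means it may ship."""
    reasons = []
    floor = baseline.success_rate - tolerance
    if candidate.success_rate < floor:
        reasons.append(
            f"success rate {candidate.success_rate:.2f} is below the floor {floor:.2f}"
        )
    if candidate.cost_per_task_usd > max_cost_usd:
        reasons.append(f"cost per task ${candidate.cost_per_task_usd:.3f} is over budget")
    if candidate.p95_latency_s > max_latency_s:
        reasons.append(f"p95 latency {candidate.p95_latency_s:.1f}s is over budget")
    return reasons


baseline = EvalSummary(success_rate=0.90, cost_per_task_usd=0.03, p95_latency_s=6.0)
print(gate(baseline, EvalSummary(0.89, 0.03, 6.0)))  # [] : within tolerance
print(gate(baseline, EvalSummary(0.85, 0.08, 6.0)))  # two reasons to block
```

In CI, a non-empty list would translate into a non-zero exit code, which is what actually blocks the merge.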

Start Smaller Than You Think

The most common reason teams never build agent evaluation is that they imagine a large platform with a labelling team and a bespoke harness. You do not need that to start, and waiting for it is how features ship untested.

Begin with a JSON file of twenty cases and a script that runs your agent against each one, applies two or three deterministic checks plus a single rubric-based judge, and prints a success rate. Run it by hand before each release this week. Add it to CI next week. Grow the dataset from real failures every week after that. Within a quarter you will have something that genuinely tells you whether an AI feature is ready, and you will wonder how you ever shipped non-deterministic code on a single manual try.
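To show how little is needed, here is a sketch of that first script. The case format is made up, the JSON is inlined as a string where a real script would read a file, and `toy_agent` and `toy_judge` are placeholders for your actual agent call and your rubric-based LLM judge.

```python
import json

# Stands in for the contents of a cases.json file (hypothetical format).
CASES_JSON = """
[
  {"id": "refund-001", "input": "I want a refund, my order is two weeks late.",
   "expect_escalate": true, "must_mention": ["refund"]},
  {"id": "email-001", "input": "How do I change my email address?",
   "expect_escalate": false, "must_mention": ["email"]}
]
"""


def toy_agent(text):
    """Placeholder for the real agent call; returns a JSON string."""
    return json.dumps({"summary": text, "escalate": "refund" in text.lower()})


def toy_judge(case, summary):
    """Placeholder for a rubric-based LLM judge: checks must-mention terms."""
    return all(term in summary.lower() for term in case["must_mention"])


def evaluate_case(case, agent, judge):
    raw = agent(case["input"])
    try:
        output = json.loads(raw)                        # check 1: valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(output, dict):
        return False
    if not isinstance(output.get("summary"), str):      # check 2: required field
        return False
    if output.get("escalate") != case["expect_escalate"]:  # check 3: decision
        return False
    return judge(case, output["summary"])               # semantic check


def success_rate(cases, agent=toy_agent, judge=toy_judge):
    passed = sum(1 for case in cases if evaluate_case(case, agent, judge))
    return passed / len(cases)


cases = json.loads(CASES_JSON)
print(f"success rate: {success_rate(cases):.0%}")
```

Swapping the two placeholders for real calls and pointing the loader at a file is enough to turn this into the by-hand release check described above.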

The teams that win with agentic features are not the ones with the cleverest prompts. They are the ones who can change a prompt on Friday and know by the numbers, not by feel, whether they made the product better or worse.

If you are about to ship an agentic feature and want confidence that it will behave for the twentieth customer as well as the first, we help teams design evaluation harnesses and audit AI features before release as part of our code quality consulting work. Reach us at hello@wolf-tech.io or at wolf-tech.io, and bring the feature you are least sure about.