Engineering Notes · Agentic QA

Evals Are the Test Suite for Your Test Suite: Running Agentic QA in Production

A passing agentic test tells you nothing if the agent silently got dumber. The fix is to treat the agent like a product that ships — with golden traces, versioned prompts, a model-upgrade gate, and an SLO on cost and latency. Without that, your agentic suite is decaying every week and nobody is measuring it.

27 May 202610 min read

TL;DR

Agentic QA gives you two regressable systems — the app and the agent. Treat them both as production.
Golden traces are the unit of truth. A known-good run, captured verbatim, that future runs are diffed against on reasoning, tool calls, and final state.
Prompt and model versioning belongs in your repo, with a changelog and a CI gate. "We updated the prompt last Tuesday" is not a production answer.
Before flipping a model version on a live fleet, replay the golden traces under the new model and review the diff. No exceptions.
Cost and latency need SLOs the same way correctness does. p95 token spend per test and p95 wall-clock are the two numbers we alert on.

The two-system problem

When you ship agentic QA, you have introduced a second piece of production software into your release pipeline. Your tests don't just exercise the app any more — they are themselves a system, with model dependencies, prompt logic, tool wiring, and emergent behaviour.

Most teams instrument only the app. The agent is treated as a fixed instrument, the way a traditional end-to-end framework is. It isn't. A deterministic framework doesn't change behaviour because the vendor pushed a new model checkpoint last night. Your agent does. And the failure mode is brutal: a model that has become slightly less precise will pass most of its tests and quietly stop catching the regressions it used to catch. You will not notice. The dashboard stays green.

An evals harness is what keeps the agent itself honest. It is the regression suite for your regression suite.

If the only signal you have on your agent's health is whether its tests pass, you are flying blind. A test that passes against a regressed agent is a test that has stopped doing its job.

Golden traces: the unit of truth

A golden trace is a single agentic test run, captured verbatim and frozen, that represents "this is what correct looks like." It contains four things:

The starting state: app version, seed, initial URL, fixture data.
The reasoning trace: every plan the agent emitted, step by step, in natural language.
The tool-call trace: every action the agent took — click coordinates, typed strings, assertions — in order.
The final state: a deterministic fingerprint of the app at the end (DOM hash of the asserted region, screenshot diff anchor, URL).

When the agent re-runs the same test under the same conditions, the harness diffs the new trace against the golden one along all four dimensions. A different final state is a probable real bug. A different tool-call sequence with the same final state is a candidate prompt regression — it got the right answer the wrong way. A different reasoning trace with the same actions is fine; that's the model thinking out loud differently, and you ignore it.

The discipline this enforces is enormous. Suddenly you can answer "did the agent change behaviour?" with a yes/no instead of a vibes-based debate about screenshots.

Versioning prompts the same way you version code

On every engagement we have done, the same artifact appears: a single chat message that says "I tweaked the prompt last week and now things look better, I think." That message is the symptom of a missing version control story.

Prompts are code. They live in the repo. They have a changelog. They are reviewed in a PR like any other change. The agent loads them from a pinned version, not from a shared doc that someone edits on Tuesday afternoons. Same for the tool definitions and the model identifier.

agent.yaml in the repo, with prompt, tool list, and model version pinned by hash or explicit identifier.
Every change to agent.yaml requires a PR, a reviewer, and a golden-trace run before merge.
The CI gate refuses to ship an agent.yaml change if the golden-trace diff is non-empty and unreviewed.
The release notes for the test fleet read like product release notes: "v0.7.1: tightened the planning prompt to handle multi-step confirmations; expected effect on traces 14, 22, 31."

Treating agent.yaml as production code is the single highest-leverage change a team makes. It moves the agent from a craft artifact to an engineered one.

The model-upgrade protocol

Frontier model vendors push improvements constantly. Most of those improvements are good. Some of them change behaviour your tests depend on in invisible ways. A model that has become more cautious will refuse to click a button it used to click. A model that has become faster will skip a step it used to perform. A model that has become more verbose will blow your token budget by 30% on the same task.

The protocol we use before flipping a model version on a live fleet is short and non-negotiable:

Pin the current model version in agent.yaml. Pinning by date or by explicit identifier, never by alias.
Run the full golden-trace suite under the new model in shadow mode — the new model executes; the old model still gates merges.
Diff the shadow traces against the golden ones. A green diff (final-state-equivalent on every test) is the only condition under which the new model gets promoted.
Where the diff is non-empty, classify each delta as either an agent regression (the new model got worse), an app regression the new model caught that the old one missed (rare, valuable), or a benign reasoning change. Update goldens only for the third case.
Promote the new model in agent.yaml as a single, reviewable commit. Tag the release. Keep the prior version available for rollback for at least one full release cycle.

A model upgrade is a release. Treat it like one and the surprises stop.

Cost and latency are SLOs too

Most teams find out their agentic test fleet has become expensive when the invoice arrives. By then the regression is six weeks old and nobody can remember which change caused it. The fix is to instrument cost and latency the same way you instrument correctness.

We track two numbers per test, persisted between runs:

p95 token spend per test, by class (input, output, cached). A 30% jump on a test that used to be steady is a signal — usually a prompt change that bloated the context, occasionally a model version that became chattier.
p95 wall-clock per test. Catches model latency regressions and runaway plans before they become a flaky-build complaint. A test whose p95 has doubled is broken even if it still passes.
Total fleet spend per PR and per day, with a budget. The budget exists for the same reason error budgets exist — to make the decision once, in calm.

An agentic suite that has no cost SLO will get expensive. An agentic suite that has no latency SLO will get slow. Neither failure mode looks like a test failure on the dashboard, which is exactly why you have to measure them separately.

Who owns what

The last thing that has to be in place is organisational, not technical. On every customer engagement, the moment the agentic fleet stops being a prototype and starts being production, the question of ownership becomes urgent.

We have learned to set it out explicitly:

The agent platform team owns agent.yaml, the golden-trace harness, the eval suite, the model-upgrade protocol, and the cost/latency SLOs.
The product test teams own the individual test cases — the journeys being tested, the assertions, the fixture data.
App regressions route to product. Agent regressions route to the platform team. The triage rubric in the companion post is what makes the routing mechanical.

Conflate these two roles and you get the worst of both worlds: prompt changes that quietly break test cases, and test-case changes that quietly mask prompt regressions. Two teams, one rubric. It is the only configuration we have seen work past the pilot stage.

What "done" looks like

An agentic QA fleet running in production with all of this in place feels boring. A model upgrade is a routine PR with a green shadow run attached. A prompt change is a PR with a one-line changelog and a golden-trace diff. The cost dashboard is flat. The latency dashboard is flat. The flake budget (see the companion post) is comfortably under its ceiling.

Most of the engineering investment is in the harness, not in the agent. The agent is just the actor; the harness is the stage, the lighting, the prompter, and the union rep. Get the harness right and the agent gets boring in the good way — which is the only state in which it is safe to let it gate your releases.

An agentic test that passes against a regressed agent is a test that has stopped doing its job. Without evals, you will not know.

Key takeaways

Capture a golden trace for every test — starting state, reasoning, tool calls, final state — and diff future runs against it.
Version agent.yaml (prompt, tools, model) in the repo. PR, review, CI gate. No untracked edits.
Shadow-run a new model against golden traces before promoting it. A model upgrade is a release.
Track p95 token spend and p95 wall-clock per test. Cost and latency regressions are silent on a correctness dashboard.
Agent regressions and app regressions route to different teams. Make the ownership explicit before the first incident.

FAQs

Isn't a golden trace just a snapshot test under a different name?+

It rhymes with one, but the diff is multi-dimensional. A snapshot test compares one artifact; a golden-trace diff compares starting state, reasoning, tool calls, and final state separately, and applies different rules to each. Different reasoning with identical tool calls is allowed; different tool calls with identical reasoning is not. That distinction is impossible in a single-artifact snapshot.

How often do you actually update golden traces?+

Whenever the underlying app changes a journey deliberately (most common), whenever a model upgrade produces a benign reasoning-only diff that we accept (occasional), and never as a quiet fix to make a red diff go away. The third one is the failure mode that destroys eval suites.

Do we need a separate model-upgrade protocol if our vendor pins versions for us?+

Yes. Pinned versions still get retired, and the underlying serving stack changes within a version. The protocol is cheap to run — a single shadow execution of the golden-trace suite — and the one time it catches a regression you would have shipped, it pays for the next two years of the practice.

Won't tracking p95 token spend per test be noisy on a small fleet?+

It is, below about 50 runs per test per week. We use a rolling 100-run window and don't alert until that fills. For a smaller fleet, watch the median instead of p95, but the discipline of having a number is what matters more than which percentile you pick.

Can the platform team and the product test team be the same people early on?+

Yes, for as long as the fleet is a pilot. The moment the fleet is gating merges in more than one squad, split them. The earliest sign you have left it too late is when nobody can confidently say whether a red run is an app problem or an agent problem — that ambiguity is the cost of conflated ownership.

Scaling agentic QA past the pilot?

If you have an agentic test fleet working in a single squad and you are about to roll it out wider, this is the moment to put the harness in. We scope evals-first engagements that ship the platform alongside the tests, not after.

Talk to us

About the authorVenkata Kari · Founder, GVK Technologies

GVK Technologies builds and operates agentic test suites — including the eval harness, golden-trace pipeline, and model-upgrade protocol described above. Talk to us if you are about to scale agentic QA past a single squad.

All posts