Engineering Notes · Agentic QA

Making an Agentic Test Run Boring: Determinism, Retries, and the Flake Budget

An LLM-driven test fails for reasons a traditional end-to-end test doesn't — the agent misread the screen, picked the wrong tool, or reasoned itself into a corner. Treating every one as a bug produces noise; treating every one as flake hides regressions. The fix is engineering discipline, not better prompts.

27 May 20269 min read

TL;DR

Agentic test failures split into three classes — tool errors, perception errors, reasoning errors. Log them separately or you cannot triage them.
Pick a flake budget number on day one and enforce it. We treat 3–5% as the working ceiling; above that, the agent is broken, not the app.
"Retry until green" is the wrong default. It masks regressions and burns tokens. Retry only inside a logged budget, never as a quiet fallback.
Temperature pinning, seeded tool order, screenshot-diff gates, and bounded plan length are the four levers that actually move flake. Prompts matter less than people think.

The problem nobody warns you about

When you wire up your first agentic test — an LLM driving a browser, with tools for click, type, and assert — it works. You demo it, the room nods, and you push it into CI. A week later you have a 12% non-reproducible failure rate and an open chat thread asking whether the agent is broken or the app is.

Nobody can answer that question, because the run logs are a wall of model output and nothing labels what went wrong. The on-call engineer reruns it. It passes. The PR ships. You have just trained your team to ignore your test suite.

This is the moment most agentic QA pilots quietly die. Not because the technology doesn't work — it does — but because the operating discipline around it was never built. The fix is engineering, not better prompts.

Failures come in three shapes, not one

A traditional end-to-end test fails in one shape: an assertion was false, or a selector didn't resolve. An agentic test fails in three:

Tool errors — the agent issued a valid action but the tool layer broke: the click coordinates were stale, the network timed out, the page hadn't loaded. This is infrastructure, and it's the easiest to fix.
Perception errors — the agent looked at the screen and saw something that wasn't there, or missed something that was. A button rendered three pixels lower than expected. A modal animation in flight. A skeleton loader the agent took for real content.
Reasoning errors — the agent saw the screen correctly but planned the wrong next step. It tried to log in twice. It misread "Save draft" as "Save". It hallucinated a confirmation page that doesn't exist.
These three failures want three different fixes. Lumping them together as "flake" is what trains the team to stop looking.

Every agentic test run must emit a labelled failure class. If your harness can't tell tool from perception from reasoning, you have one bit of information per run, and no path to improvement.

The flake budget is a first-class artifact

SRE teams set error budgets — a quantified amount of failure they will tolerate before they stop shipping and start fixing. Agentic QA needs the same idea, applied to the agent.

Pick a number on day one. Below it, the run is healthy and retries (if any) are allowed inside the budget. Above it, the agent is broken — you stop accepting its output as signal until the cause is named. We operate at a 3–5% working ceiling for a stable test fleet against a stable app. Our published lab benchmark on a mature open-source web app landed at 3.30% with retries off, which is what a well-engineered deterministic end-to-end suite gets; an agentic suite should not be worse.

The budget number matters less than the discipline of having one. The point is to make the decision once, in calm, instead of every Tuesday morning when the dashboard is red and someone wants to ship.

If you don't have a flake budget, every red run becomes a political argument. With one, it becomes a number.

Why "retry until green" is the wrong default

Most CI configs default to two automatic retries. It feels harmless — flaky tests pass on the second try and stop blocking the queue. For agentic tests it's actively destructive, for two reasons.

First, retries don't just hide flake; they hide regressions. An agent that occasionally misclicks the "Confirm" button looks the same on the dashboard as an app that occasionally renders the wrong button. Both go green on retry. One of them is a real bug you just shipped.

Second, every retry costs tokens. A single retry of a 40-step agentic test on a frontier model can run to a meaningful fraction of a dollar. Three retries across a hundred tests per PR, run twenty times a day, is a line item.

The fix isn't to disable retries; it's to retry inside a logged budget. Each retry emits its failure class. If a test retries past N within a window, the harness fails the build and surfaces the trace. Quiet recovery is what kills test suites; loud, accounted recovery is fine.

The four levers that actually move the number

We have tried, on customer engagements and in the lab, almost every promised "fix" for agentic test flake. Most prompt-engineering tricks shift the number by a percentage point or two. Four levers consistently move it by more:

Temperature pinning. Run the agent at temperature 0 for tests. Yes, you give up some creativity. You are not running a creative writing competition; you are running a regression suite. A pinned model is a more predictable model, and predictable is what "test" means.
Seeded tool-call order. When the agent has multiple valid first moves, seed the choice deterministically (hash of test name, run index). Identical input should yield identical first action. The downstream branching is where the agent earns its keep.
Screenshot-diff gates before agent calls. Before asking the model what to do, hash the visible region and check it against the previous step's expectation. If the page hasn't changed since the last action, you have a tool error, not a perception puzzle — surface it before spending a model call.
Bounded plan length. Cap the number of steps per test. An agent that needs 80 steps to do what should take 12 has lost the plot, and every extra step is another chance to fail. The cap turns a runaway trace into a fast, clean failure.

Notice none of these are prompt tweaks. The wins are in the harness, not the words.

A triage rubric that pays for itself

Once failures are labelled and the budget exists, triage stops being a debate. We use a four-line rubric, posted in the channel:

Tool error — file an infra ticket against the test harness; do not change the prompt; do not change the app.
Perception error, reproducible — file an app ticket (the UI rendered in a way the agent could not parse, which a screen-reader user probably can't either; this is a real accessibility signal).
Perception error, not reproducible — log against the flake budget; if the budget holds, do nothing this week; if it breaks, escalate to the model-version owner.
Reasoning error — file a prompt regression; pin the offending trace; require a new golden trace before the prompt change ships.

The first two are app-or-infra bugs and route to the team that owns them. The last two are agent bugs and route to the team that owns the agent. The single line that took us the longest to learn: those are not the same team, and you cannot let them be.

What this looks like after six weeks

On the customer engagements where we have done this work, the shape is roughly the same. Week one is denial: the team has been retrying their way to green and the new labelling makes the noise visible for the first time. Week two is panic: the number is higher than anyone wanted to admit. Weeks three and four are the four levers — usually most of the gain comes from temperature pinning and the screenshot-diff gate alone. By week six the agentic suite is inside its flake budget, the budget is documented, and the on-call rotation has a one-page rubric instead of a back-channel argument.

None of this is exotic. It is the same discipline a good end-to-end shop applied to selectors and waits a decade ago, transposed to a non-deterministic actor. The technology is new; the engineering isn't.

The agentic suite isn't a magic black box you tune with better prompts. It's a flaky-by-default actor you make legible with logging, budgets, and a rubric. That's the whole job.

Key takeaways

Emit a labelled failure class on every agentic run — tool, perception, or reasoning. Without that label you cannot improve the suite.
Pick a flake budget number on day one. 3–5% is a defensible ceiling for a stable fleet against a stable app.
Retries are allowed only inside a logged budget. Quiet retries hide regressions and burn tokens.
Temperature 0, seeded tool order, screenshot-diff gates, and bounded plan length move the flake number more than any prompt change.
Agent bugs and app bugs route to different teams. The triage rubric must say so out loud.

FAQs

Is a 3–5% flake budget realistic for agentic tests?+

For a stable agentic suite against a stable app, yes. Our published lab benchmark on a mature open-source web app landed at 3.30% with retries off using the four levers in this post. Early in an engagement the number is usually higher; the budget is what tells you when it stops being acceptable.

Won't temperature 0 make the agent worse at handling new pages?+

Slightly, in theory. In practice the regression suite is not where you want creative behaviour — you want the same input to produce the same first action. Save the higher temperatures for exploratory / discovery agents, not for tests that gate a merge.

How do you actually log a 'reasoning error' versus a 'perception error'?+

Two artifacts per step: the screenshot the agent saw, and the natural-language plan the agent emitted. If the plan describes the screen correctly but picks the wrong action, it's reasoning. If the plan describes a screen that doesn't match the screenshot, it's perception. The split is mechanical once both are captured.

Can we use this without changing our existing end-to-end suite?+

Yes. The patterns sit at the harness level — your existing end-to-end tests are untouched. The agentic layer wraps the tool calls and adds the labelling, budget, and gates. We almost always run agentic and deterministic tests side by side rather than replacing the latter.

What model do you run the tests on?+

Whichever model the customer is standardised on, pinned by version. The patterns in this post are model-agnostic; what matters is that the version is pinned and that a model upgrade triggers a re-run of the golden traces (see the companion post on evals).

Want this discipline applied to your suite?

We scope agentic QA engagements that start with measurement, name a flake budget out loud, and don't ship until the harness can label its own failures. No retries-to-green theatre.

Talk to us

About the authorVenkata Kari · Founder, GVK Technologies

Twenty years in QA leadership, most of it spent watching teams ship around a red dashboard. GVK Technologies builds and operates agentic test suites for product engineering teams — see the case studies for measured runs against real open-source apps.

All posts