Your CI Isn't Broken. It's Flaky — And Retries Are Hiding the Bill
We took a mature open-source web app with a serious Playwright end-to-end suite and ran it five times on the same commit, same machine, retries off. It gave five different answers — and every red result was noise, not a bug.
From the founder
A note on why this exists: GVK is a new firm, so rather than point at client logos I don't have yet, I ran the experiment myself and I'm showing you exactly what I found — including the parts that complicate the headline.
I started with flaky CI because it's the failure I see teams quietly normalise. After enough false alarms people stop trusting red — and that learned shrug is the moment a real bug ships behind a green tick.
Five identical runs, five different answers
We ran ~273 end-to-end tests five times against the same commit on the same machine, with nothing changed between runs. The suite failed 7 tests the first time. Then 2. Then 6. Then 1. Then 1.
Across all five runs, 264 tests passed every single time, zero tests failed every time, and 9 tests failed only sometimes. That last number is the whole story: none of the failures pointed to a bug in the application. Every red result came from a test that was perfectly capable of passing; it just didn't, that run.
The suite wasn't catching bugs. It was generating noise — and a dashboard can't tell the two apart.
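The three buckets above (passed every time, failed every time, failed only sometimes) can be computed mechanically once per-test outcomes are kept for each run. Here is a minimal sketch, assuming outcomes are stored as one boolean per run. The type names, function names, and sample data are all hypothetical and are not taken from the benchmark.

```typescript
// Hypothetical per-test outcomes across identical runs: true = passed, false = failed.
type RunHistory = Record<string, boolean[]>;

type Verdict = "stable-pass" | "deterministic-fail" | "flaky";

// Classify one test by looking at all of its outcomes on the same commit.
function classify(outcomes: boolean[]): Verdict {
  if (outcomes.length === 0) {
    throw new Error("no runs recorded for this test");
  }
  const passes = outcomes.filter((passed) => passed).length;
  if (passes === outcomes.length) return "stable-pass";
  if (passes === 0) return "deterministic-fail";
  return "flaky"; // passed at least once and failed at least once
}

// Count how many tests land in each bucket.
function tally(history: RunHistory): Record<Verdict, number> {
  const counts: Record<Verdict, number> = {
    "stable-pass": 0,
    "deterministic-fail": 0,
    "flaky": 0,
  };
  for (const outcomes of Object.values(history)) {
    counts[classify(outcomes)] += 1;
  }
  return counts;
}

// Illustrative data only: test names and outcomes are made up.
const sampleHistory: RunHistory = {
  "login works": [true, true, true, true, true],
  "booking flow": [false, true, false, true, true],
  "custom form question": [true, true, true, false, true],
  "broken checkout": [false, false, false, false, false],
};

const sampleTally = tally(sampleHistory);
console.log(sampleTally);
```

A single run can only ever report "passed" or "failed". The "flaky" verdict exists only when the same commit has been run more than once with retries off.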
The distinction that changes how you fix it
A deterministic failure — one that fails every time — is a signal: something is actually wrong, and you want to see it. A flaky failure — one that fails sometimes — is noise: the code is fine, the test is unreliable. On a dashboard they are the same colour.
When you can't tell them apart, you reach for the one tool that hides both: retries. Many teams run CI with two automatic retries (Playwright's scaffolded config, for one, sets retries to 2 on CI), so a test that fails one run in five almost always passes on a retry and disappears. The dashboard goes green; the problem doesn't go away. Now your real bugs are buried under the same mechanism that buries your flake, and you've trained everyone to ignore red.
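The masking effect of retries can be put in numbers. The sketch below assumes each attempt fails independently with the same probability, which is a simplification (real flake is often correlated with machine load). The function name is hypothetical.

```typescript
// Probability that a flaky test is still reported as failed after all retries.
// Assumption: every attempt fails independently with the same probability.
function reportedFailureRate(failureRatePerAttempt: number, retries: number): number {
  if (failureRatePerAttempt < 0 || failureRatePerAttempt > 1) {
    throw new RangeError("failureRatePerAttempt must be between 0 and 1");
  }
  if (!Number.isInteger(retries) || retries < 0) {
    throw new RangeError("retries must be a non-negative integer");
  }
  // The test is reported red only if the first attempt and every retry all fail.
  const attempts = retries + 1;
  return Math.pow(failureRatePerAttempt, attempts);
}

// A test that fails one run in five, with retries off and with two retries.
const withoutRetries = reportedFailureRate(0.2, 0);
const withTwoRetries = reportedFailureRate(0.2, 2);

console.log(`retries off: ${(withoutRetries * 100).toFixed(1)}% of runs are red`);
console.log(`two retries: ${(withTwoRetries * 100).toFixed(1)}% of runs are red`);
```

Under these assumptions, two retries take the test from red about one run in five to red about one run in 125. The test is exactly as unreliable as before; it has only become much harder to notice.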
What was actually wrong with those 9 tests
We explained every flaky failure rather than re-running it. They sorted into a small number of root causes:
- Timing and race conditions (the large majority): the test checked for something a beat before the app was ready — slow navigation, an element not yet rendered, a toast that appeared and vanished.
- An ambiguous selector (a genuine, fixable defect in the test): one test looked for an element that, under the right timing, matched two things on the page at once. That is a real test bug, but one that only surfaces intermittently.
- A missing visual baseline (not flake at all): one "failure" was simply a snapshot that had never been recorded on this platform. We flagged it as exactly that rather than letting it inflate the number.
It would have been easy — and dishonest — to call all nine "flaky" and quote a bigger figure. One of them wasn't flake. Knowing the difference is the entire job.
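A first-pass sort into the categories above can be sketched as a set of pattern rules over error messages. This is not how the failures in this benchmark were explained (that took reading each trace). The message patterns are assumptions modelled on typical Playwright error text, and the names and sample messages are hypothetical.

```typescript
type RootCause = "ambiguous-selector" | "missing-baseline" | "timing" | "unclassified";

// Assumed patterns, modelled on typical Playwright error text; adjust them to your own logs.
const rules: Array<{ cause: RootCause; pattern: RegExp }> = [
  { cause: "ambiguous-selector", pattern: /strict mode violation/i },
  { cause: "missing-baseline", pattern: /snapshot doesn't exist/i },
  { cause: "timing", pattern: /timeout .*exceeded|timed out/i },
];

// Return the first matching category, or "unclassified" when nothing matches.
function guessRootCause(errorMessage: string): RootCause {
  for (const rule of rules) {
    if (rule.pattern.test(errorMessage)) {
      return rule.cause;
    }
  }
  return "unclassified";
}

// Made-up messages for illustration only.
const sampleMessages: string[] = [
  "Error: strict mode violation: getByRole('button', { name: 'Save' }) resolved to 2 elements",
  "Error: A snapshot doesn't exist at /tmp/example.png, writing actual.",
  "Test timeout of 30000ms exceeded.",
  "TypeError: Cannot read properties of undefined",
];

const guesses = sampleMessages.map(guessRootCause);
console.log(guesses);
```

Anything that lands in "unclassified", and ideally a sample of everything else, still needs a human or an agent to read the trace. A rule like this is a starting guess, not a diagnosis.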
Where the flake hides
The flaky tests weren't randomly scattered. They clustered in the most complex, highest-value user journeys — booking flows, multi-option configuration, custom form questions. The hardest parts of the product are exactly where flake hides, which is exactly where you least want your safety net to have holes.
Where agentic AI changes the economics
Doing this by hand doesn't scale: re-running a suite five times, reading every trace, deciding flake-versus-real for each, finding root cause, and proposing a fix is days of senior time per cycle — and it has to happen continuously, not once.
That is the part we automate. Agents run and re-run the suite under controlled conditions to expose inconsistency that single runs and retries hide; triage every failure to separate genuine breakage from flake; classify each flake by root cause; and propose fixes that are re-verified before a human sees them. The outcome your team feels: red builds that mean something again, a flake rate that goes down instead of being papered over, and senior engineers who stop being human retry buttons.
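Re-verifying a fix for a flaky test takes more than one green run, because the unfixed test would often have passed anyway. As a rough sketch of the arithmetic, assuming the unfixed test fails independently on each run at its observed rate, the number of consecutive green runs needed looks like this. The function name and the 5% threshold are illustrative choices, not a description of any particular tool.

```typescript
// How many consecutive green runs are needed before "it passed" is evidence the fix worked?
// Assumption: the unfixed test fails independently on each run with probability observedFailureRate.
function runsNeededToVerify(observedFailureRate: number, maxChanceOfLuck: number): number {
  if (!(observedFailureRate > 0 && observedFailureRate < 1)) {
    throw new RangeError("observedFailureRate must be strictly between 0 and 1");
  }
  if (!(maxChanceOfLuck > 0 && maxChanceOfLuck < 1)) {
    throw new RangeError("maxChanceOfLuck must be strictly between 0 and 1");
  }
  // An unfixed test passes k runs in a row with probability (1 - p)^k.
  // Find the smallest k for which that probability is at or below the threshold.
  return Math.ceil(Math.log(maxChanceOfLuck) / Math.log(1 - observedFailureRate));
}

// A test that used to fail one run in five, with at most a 5% chance of a lucky streak.
const runsForOneInFive = runsNeededToVerify(0.2, 0.05);
console.log(`consecutive green runs needed: ${runsForOneInFive}`);
```

The rarer the flake, the more re-runs it takes to tell a fix from luck, which is part of why this work is expensive to do by hand.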
Automatic retries are the most expensive lie in CI. You pay the compute, you pay the pipeline minutes, and you pay a third time when a genuine failure hides behind the same green tick that buried the flake. I'd rather show you the ugly number than a comfortable one.
Key results
- 3.30% raw flake rate (9 of ~273 tests) at retries: 0 across five identical runs; 2.93% genuine (8 of ~273) after excluding a missing-baseline artifact.
- Zero deterministic failures — every red result was noise, not a real bug.
- 9 intermittent failures classified by root cause: timing/race conditions, an ambiguous selector (a real defect in the test), and one non-flake (a missing visual baseline) misreported as a failure.
- Flake concentrated in the most complex, highest-value user journeys.
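The headline percentages are consistent with counting flaky tests over total tests, taking the approximate suite size of 273 as exact (an assumption). The sketch below reproduces them from the counts reported above. The per-execution rate at the end is derived from the same counts rather than quoted from the results.

```typescript
// Counts reported in the article; 273 is treated as exact here (the article says "~273").
const totalTests = 273;
const flakyTests = 9;
const missingBaselineArtifacts = 1;
const failuresPerRun: number[] = [7, 2, 6, 1, 1];

// Format a ratio as a percentage with two decimal places.
function percent(part: number, whole: number): string {
  return ((100 * part) / whole).toFixed(2) + "%";
}

// Share of tests that failed at least once across the five runs.
const rawFlakeRate = percent(flakyTests, totalTests);

// Same share after excluding the missing-baseline artifact.
const genuineFlakeRate = percent(flakyTests - missingBaselineArtifacts, totalTests);

// Share of individual test executions that failed (derived, not quoted).
const failedExecutions = failuresPerRun.reduce((sum, n) => sum + n, 0);
const executionFailureRate = percent(failedExecutions, totalTests * failuresPerRun.length);

console.log({ rawFlakeRate, genuineFlakeRate, failedExecutions, executionFailureRate });
```

The two views answer different questions: the per-test rate says how much of the suite cannot be trusted, while the per-execution rate says how often any single result is wrong.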
Methodology & honesty
Independent benchmark on a public, mature open-source codebase with a strong Playwright suite, run locally with retries off across five identical runs. Every figure traces to a logged run. We deliberately separated a missing-baseline artifact out rather than inflate the flake number. No production systems were touched and no endorsement is implied.
FAQs
Don't retries already solve flaky tests?
How is this different from flaky-test detection tools?
Won't an AI quarantine real bugs as flake?
Can you measure our suite without naming it publicly?
What's your real flake rate?
If you can't state your real flake rate today, you're in good company — almost no one has run the experiment. I'll run it on your suite: your real number, your flaky tests ranked by root cause, and a fix plan. Email me — it comes straight to me.
Book a flake audit
I ran these benchmarks myself. GVK is new, and I don't have client logos to borrow yet, so instead of marketing claims I publish the experiments I wish vendors would. The contact button reaches me directly, and I read every message.