Case Study · Web E2E

Your CI Isn't Broken. It's Flaky — And Retries Are Hiding the Bill

We took a mature open-source web app with a serious Playwright end-to-end suite and ran it five times on the same commit, same machine, retries off. It gave five different answers — and every red result was noise, not a bug.

21 May 2026 · 6 min read

3.30% flake rate at retries:0 (2.93% genuine)

0 tests that failed every run (nothing was actually broken)

264 / 273 tests passed every single run

From the founder

A note on why this exists: GVK is a new firm, so rather than point at client logos I don't have yet, I ran the experiment myself and I'm showing you exactly what I found — including the parts that complicate the headline.

I started with flaky CI because it's the failure I see teams quietly normalise. After enough false alarms people stop trusting red — and that learned shrug is the moment a real bug ships behind a green tick.

Five identical runs, five different answers

We ran 273 end-to-end tests five times against the same commit on the same machine, with nothing changed between runs. The suite failed 7 tests the first time. Then 2. Then 6. Then 1. Then 1.

Across all five runs, 264 tests passed every single time, zero tests failed every time, and 9 tests failed only sometimes. That split is the whole story: none of the failures were real. Every red result came from a test that was perfectly capable of passing. It just didn't, that run.
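The three buckets above (always passed, always failed, failed only sometimes) can be sketched as a small classifier. This is a minimal illustration, not the tooling used for the benchmark; the test names and outcomes below are hypothetical and are not the benchmark's data.

```typescript
// Outcome of one test in one run. Skipped tests are ignored in this sketch.
type Outcome = "pass" | "fail";

type Verdict = "stable" | "deterministic-failure" | "flaky";

// Classify a test from its outcomes across identical runs of the same commit.
function classify(outcomes: Outcome[]): Verdict {
  const failures = outcomes.filter((o) => o === "fail").length;
  if (failures === 0) return "stable"; // passed every run
  if (failures === outcomes.length) return "deterministic-failure"; // failed every run: a signal
  return "flaky"; // failed only sometimes: noise
}

// Hypothetical names and outcomes, for illustration only.
const runHistory: { test: string; outcomes: Outcome[] }[] = [
  { test: "login works", outcomes: ["pass", "pass", "pass", "pass", "pass"] },
  { test: "booking flow", outcomes: ["fail", "pass", "fail", "pass", "pass"] },
  { test: "broken feature", outcomes: ["fail", "fail", "fail", "fail", "fail"] },
];

const verdicts = runHistory.map((h) => ({ test: h.test, verdict: classify(h.outcomes) }));
console.log(verdicts);
```

A single run can only ever tell you "pass" or "fail". The verdict needs the whole row of outcomes, which is why the suite has to be run more than once with nothing changed.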

The suite wasn't catching bugs. It was generating noise — and a dashboard can't tell the two apart.

The distinction that changes how you fix it

A deterministic failure — one that fails every time — is a signal: something is actually wrong, and you want to see it. A flaky failure — one that fails sometimes — is noise: the code is fine, the test is unreliable. On a dashboard they are the same colour.

When you can't tell them apart, you reach for the one tool that hides both: retries. Many teams run CI with two automatic retries, so a test that fails one run in five almost always passes on a retry and disappears from view. The dashboard goes green; the problem doesn't go away. Now your real bugs are buried under the same mechanism that buries your flake, and you've trained everyone to ignore red.
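To put a rough number on how well retries hide a test like that, here is a back-of-envelope sketch. It assumes every attempt fails independently with the same probability, which real flake often does not (failures can be correlated with machine load or test order), so treat the result as an order of magnitude, not a measurement. The function name is ours, for illustration.

```typescript
// Probability that a flaky test still shows red after the first attempt and
// every retry all fail. ASSUMPTION: attempts are independent and identically likely to fail.
function chanceStillRed(failureRate: number, retries: number): number {
  return Math.pow(failureRate, retries + 1);
}

const redWithoutRetries = chanceStillRed(0.2, 0); // fails one run in five
const redWithRetries = chanceStillRed(0.2, 2); // same test, two automatic retries

console.log(`retries:0 shows red in about ${(redWithoutRetries * 100).toFixed(1)}% of runs`);
console.log(`retries:2 shows red in about ${(redWithRetries * 100).toFixed(1)}% of runs`);
```

Under that assumption, a test that is red one run in five becomes red roughly one run in 125 once two retries are switched on. The test is exactly as unreliable as before; it is just far less visible.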

What was actually wrong with those 9 tests

We explained every flaky failure rather than re-running it. They sorted into a small number of root causes:

Timing and race conditions (the large majority): the test checked for something a beat before the app was ready — slow navigation, an element not yet rendered, a toast that appeared and vanished.
An ambiguous selector (a genuine, fixable defect in the test): one test looked for an element that, under the right timing, matched two things on the page at once. It is a real bug in the test code, and it only surfaces intermittently.
A missing visual baseline (not flake at all): one "failure" was simply a snapshot that had never been recorded on this platform. We flagged it as exactly that rather than letting it inflate the number.
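A first pass at sorting failures into these three buckets can be sketched as pattern matching on the error text. The message fragments below are assumptions about how Playwright typically words these errors and may differ between versions; the rule table and function name are hypothetical. A match is a starting hypothesis to confirm against the trace, not a verdict.

```typescript
type RootCause = "ambiguous-selector" | "missing-baseline" | "timing" | "unclassified";

// ASSUMPTION: these fragments approximate Playwright's error wording.
// Order matters: the more specific causes are checked before the generic timeout rule.
const rules: { cause: RootCause; pattern: RegExp }[] = [
  { cause: "ambiguous-selector", pattern: /strict mode violation/i },
  { cause: "missing-baseline", pattern: /snapshot doesn't exist/i },
  { cause: "timing", pattern: /timeout|timed out/i },
];

// Return the first matching root cause, or "unclassified" for a human to read.
function rootCause(errorMessage: string): RootCause {
  for (const rule of rules) {
    if (rule.pattern.test(errorMessage)) return rule.cause;
  }
  return "unclassified";
}

// Illustrative messages, not copied from the benchmark logs.
console.log(rootCause("Error: strict mode violation: locator('button') resolved to 2 elements"));
console.log(rootCause("Error: A snapshot doesn't exist at example.png, writing actual."));
console.log(rootCause("Timed out 5000ms waiting for expect(locator).toBeVisible()"));
```

Anything that lands in "unclassified" is deliberately left for a person. Forcing every failure into a bucket is how a non-flake ends up counted as flake.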

It would have been easy — and dishonest — to call all nine "flaky" and quote a bigger figure. One of them wasn't flake. Knowing the difference is the entire job.

Where the flake hides

The flaky tests weren't randomly scattered. They clustered in the most complex, highest-value user journeys — booking flows, multi-option configuration, custom form questions. The hardest parts of the product are exactly where flake hides, which is exactly where you least want your safety net to have holes.

Where agentic AI changes the economics

Doing this by hand doesn't scale: re-running a suite five times, reading every trace, deciding flake-versus-real for each, finding root cause, and proposing a fix is days of senior time per cycle — and it has to happen continuously, not once.

That is the part we automate. Agents run and re-run under controlled conditions to expose inconsistency that single runs and retries hide; triage every failure to separate genuine breakage from flake; classify each flake by root cause; and propose fixes that are re-verified before a human sees them. The outcome your team feels: red builds that mean something again, a flake rate that goes down instead of being papered over, and senior engineers who stop being human retry buttons.
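Re-verification deserves a number too, because a flaky test can pass a handful of reruns by luck even when nothing was fixed. The sketch below estimates how many consecutive clean reruns are needed before luck becomes an unlikely explanation. It assumes independent runs with a constant failure rate, and the function name and the 5% threshold are illustrative choices, not a description of our agents' internals.

```typescript
// Number of consecutive clean reruns after which an UNFIXED test, failing at
// failureRate per run, would have passed them all by luck with probability
// of about maxLuck or less. ASSUMPTION: runs are independent.
function rerunsNeeded(failureRate: number, maxLuck: number): number {
  if (failureRate <= 0 || failureRate >= 1) {
    throw new RangeError("failureRate must be between 0 and 1 (exclusive)");
  }
  if (maxLuck <= 0 || maxLuck >= 1) {
    throw new RangeError("maxLuck must be between 0 and 1 (exclusive)");
  }
  // Solve (1 - failureRate)^n <= maxLuck for the smallest whole n.
  return Math.ceil(Math.log(maxLuck) / Math.log(1 - failureRate));
}

// A test that fails one run in five, with a 5% tolerance for being fooled.
console.log(rerunsNeeded(0.2, 0.05));
```

For a test that fails one run in five, five clean reruns are not much evidence: an unfixed test passes all five about a third of the time. Under these assumptions it takes 14 clean reruns to push that below 5%.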

Automatic retries are the most expensive lie in CI. You pay the compute, you pay the pipeline minutes, and you pay a third time when a genuine failure hides behind the same green tick that buried the flake. I'd rather show you the ugly number than a comfortable one.

Key results

3.30% raw flake rate at retries:0 across five identical runs (2.93% genuine after excluding a missing-baseline artifact).
Zero deterministic failures — every red result was noise, not a real bug.
9 intermittently failing tests classified by root cause: timing/race conditions, an ambiguous selector (a real defect), and one non-flake (a missing visual baseline) reported as a failure.
Flake concentrated in the most complex, highest-value user journeys.
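The two headline percentages follow directly from the counts in this case study. The sketch below reproduces them; it assumes the denominator stays at 273 when the missing-baseline test is excluded from the numerator, which is what matches the published 2.93%.

```typescript
// Share of tests that failed intermittently, as a percentage with two decimals.
function flakeRatePercent(flakyTests: number, totalTests: number): string {
  return ((flakyTests / totalTests) * 100).toFixed(2);
}

const raw = flakeRatePercent(9, 273); // all 9 intermittently failing tests
const genuine = flakeRatePercent(8, 273); // excluding the one missing-baseline artifact

console.log(`raw: ${raw}%, genuine: ${genuine}%`);
```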

Methodology & honesty

Independent benchmark on a public, mature open-source codebase with a strong Playwright suite, run locally with retries off across five identical runs. Every figure traces to a logged run. We deliberately separated a missing-baseline artifact out rather than inflate the flake number. No production systems were touched and no endorsement is implied.
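For readers who want to try the same setup, a minimal Playwright configuration with retries off might look like the sketch below. The benchmarked project's actual configuration is not shown in this case study, and the output path is a hypothetical choice.

```typescript
// playwright.config.ts (a sketch, not the benchmarked project's real config)
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Retries off, so an intermittent failure is reported as a failure.
  retries: 0,
  // Hypothetical output path; the JSON reporter writes a machine-readable result file.
  reporter: [["json", { outputFile: "results/run.json" }]],
});
```

Run the suite several times against the same commit, keeping each run's result file, and compare outcomes per test. The same setting is available on the command line as --retries=0 if you prefer not to touch the config.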

FAQs

Don't retries already solve flaky tests?

No — retries hide flake, they don't fix it. You still pay the compute and the pipeline minutes, and you train the team to ignore red. Worse, retries suppress real failures too, so genuine bugs hide behind the same mechanism.

How is this different from flaky-test detection tools?

Detection tells you a test is flaky. We tell you why — timing versus selector versus environment — and propose a verified fix. Detection gives you a longer to-do list; root-cause classification shrinks it.

Won't an AI quarantine real bugs as flake?

That's the exact failure mode we engineer against. We separate deterministic failures out, every proposed fix is re-verified before a human sees it, and in this benchmark zero real failures were misclassified — we even caught a non-flake masquerading as one.

Can you measure our suite without naming it publicly?

Yes. This benchmark is anonymous by design. A paid Flake Audit on your repository is confidential — the findings are yours.

What's your real flake rate?

If you can't state your real flake rate today, you're in good company — almost no one has run the experiment. I'll run it on your suite: your real number, your flaky tests ranked by root cause, and a fix plan. Email me — it comes straight to me.

Book a flake audit

About the author

Venkata Kari · Founder, GVK Technologies

I ran these benchmarks myself. GVK is new — I don't have client logos to borrow yet, so instead of marketing claims I publish the experiments I wish vendors would. The contact button reaches me directly, and I read every message.
