Case Study · Reference Pipeline

We Built a SaaS With Agentic QA Wired In From Commit One

Most teams bolt quality onto a codebase that already exists. We asked the opposite question: what if quality were the first thing in the repo? We built Statusbeacon, a small but real status-page SaaS, with the full agentic QA pipeline in place from the first commit, and measured what it catches on day one.

21 May 2026 · 7 min read

30 / 30 OpenAPI operations with passing contract tests

72% → 85% mutation score after the pipeline flagged untested boundaries

3 / 3 public endpoints within their P95 latency budget

From the founder

I'll be honest about how far I took this. To test whether quality wired in from commit one really beats quality bolted on later, I built an entire working SaaS just to run the experiment on. (Statusbeacon is a lab artifact to prove the point — not a product I'm pitching you.)

And it humbled me too. I thought the tests were good: high coverage, all green. On its first run the mutation stage scored them 72% and showed me, line by line, which of those tests were lying. I fixed them to 85%. The tool corrected its maker — which is precisely why I trust it on someone else's code.

Quality first, not last

Retrofitting discipline onto an existing codebase is hard, slow work — it's most of what we do. So we built Statusbeacon to find out what the opposite looks like: a small public status-page generator with authentication, a public read API, a management API with tokens, an incident state machine, an audit log, idempotency keys, and double-opt-in email subscribers. Thirty API operations — the kind of surface a team ships in a sprint.

This post is honest about its own scope. This is the launch state — the day-one reference architecture, not a report on 90 days of production operation. We have not run it for a quarter and we are not going to pretend we have. What we can show is exactly what a from-commit-one pipeline catches before a single feature ships.

The feature has to clear the bar to merge. The bar never has to be retrofitted.

The seven components, in plain language

The pipeline has seven parts. None is exotic — the value is that they're all present from the start and they all feed one place.

Contract tests: the OpenAPI spec is the single source of truth; every operation must have a passing test — golden path, auth-required cases, and business invariants.
Mutation testing: deliberately breaks the code in hundreds of small ways to check that a test notices. It is the only honest measure of whether your tests test anything.
Performance budgets: each public endpoint has a latency budget; a regression is a build failure, not a surprise three weeks later.
Self-healing E2E: browser tests that walk a ladder of alternative locators to recover when a selector breaks, before falling back to a human.
Agent-authored tests from PR diffs: an agent reads each PR, works out what changed, and proposes tests as review comments.
The evidence pack: one PR comment, updated in place, pulling together every signal — no dashboard archaeology.
Production-telemetry-to-test loop: a real production error is captured, PII-scrubbed, and turned into a proposed regression test.
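To make "one PR comment, updated in place" concrete, here is a minimal sketch of the decision behind it. This is not Statusbeacon's implementation: the hidden marker, the `plan_upsert` name, and the comment shape (`id`, `body`) are assumptions made for illustration, and the actual call to a code host's API is left out.

```python
# Hypothetical hidden marker that identifies the evidence-pack comment.
MARKER = "<!-- evidence-pack -->"


def plan_upsert(existing_comments: list[dict], body: str) -> dict:
    """Decide whether to create the evidence-pack comment or update it in place.

    `existing_comments` is assumed to be a list of {"id": ..., "body": ...}
    dicts already fetched from the pull request.
    """
    full_body = f"{MARKER}\n{body}"
    for comment in existing_comments:
        if MARKER in comment["body"]:
            # A previous run already posted the pack: overwrite that comment.
            return {"action": "update", "comment_id": comment["id"], "body": full_body}
    # First run on this PR: post a new comment carrying the marker.
    return {"action": "create", "body": full_body}
```

Keying on a marker rather than on the comment author means a re-run overwrites the earlier summary instead of stacking a new comment under it.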

What it catches on day one

We stood the product up locally — a production build against a real Postgres database — and ran the pipeline. Every number here is reproducible.

Contract coverage: 30 of 30 operations, 100% — and not just "returns 200." The tests pin the invariants: an incident's severity can only move forward and a backward move is rejected with a 409; subscribing the same email twice is idempotent and never duplicates; an API token's plaintext is shown exactly once. Forty-one assertions, all green.
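As an illustration of what "pinning an invariant" means, the sketch below models two of the rules above in plain Python. The severity names, the `Incident` and `SubscriberList` types, and the choice to treat a same-severity update as a no-op success are all assumptions; the post does not describe the real data model or handlers.

```python
from dataclasses import dataclass, field

# Hypothetical severity ordering, lowest to highest.
SEVERITY_ORDER = ["minor", "major", "critical"]


@dataclass
class Incident:
    severity: str


def change_severity(incident: Incident, new_severity: str) -> int:
    """Return an HTTP-style status: 200 on success, 409 on a backward move."""
    current = SEVERITY_ORDER.index(incident.severity)
    proposed = SEVERITY_ORDER.index(new_severity)
    if proposed < current:
        return 409  # backward move rejected; state is left untouched
    incident.severity = new_severity
    return 200


@dataclass
class SubscriberList:
    emails: set[str] = field(default_factory=set)

    def subscribe(self, email: str) -> int:
        """Idempotent: repeating a subscription succeeds without duplicating it."""
        self.emails.add(email.strip().lower())
        return 200


# Contract-style checks: assert the rule, not just a 200.
incident = Incident("major")
assert change_severity(incident, "minor") == 409
assert incident.severity == "major"

subscribers = SubscriberList()
subscribers.subscribe("ops@example.com")
subscribers.subscribe("ops@example.com")
assert len(subscribers.emails) == 1
```

The second assertion in each pair is the important one: it checks that the rejected or repeated request left the state alone, which a status-code check by itself would miss.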

Mutation score: 84.75%. This is the number most teams never see, and it's where the story gets interesting. The first run scored 72.46% — tests passed, coverage was high, but mutation testing found 64 ways to break the code no test would catch. Most were boundary conditions. We wrote the boundary tests the pipeline told us were missing, and the score rose above our 80% bar. That converted "looks well-tested" into "is well-tested," and told us exactly where the gap was.
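A toy example shows how a boundary mutant survives a suite with full line coverage. The function, the ten-minute cooldown, and the hand-written mutant are hypothetical; a real mutation tool generates and runs such mutants automatically.

```python
def can_resend(minutes_since_last: int, cooldown: int = 10) -> bool:
    """Original: a resend is allowed once the cooldown has fully elapsed."""
    return minutes_since_last >= cooldown


def can_resend_mutant(minutes_since_last: int, cooldown: int = 10) -> bool:
    """Mutant: '>=' changed to '>', the kind of edit a mutation tool makes."""
    return minutes_since_last > cooldown


def weak_suite(fn) -> bool:
    """Executes every line of the function but never probes the boundary."""
    return fn(0) is False and fn(60) is True


def boundary_suite(fn) -> bool:
    """Adds the one case that separates '>=' from '>'."""
    return weak_suite(fn) and fn(10) is True


print(weak_suite(can_resend))             # True: suite is green on the original
print(weak_suite(can_resend_mutant))      # True: the mutant survives
print(boundary_suite(can_resend))         # True: still green on the original
print(boundary_suite(can_resend_mutant))  # False: the mutant is killed
```

Both suites run every line of `can_resend`, so a coverage report cannot tell them apart. Only the boundary case changes the outcome for the mutant.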

The most valuable thing we found wasn't a passing test. It was the mutation score moving from 72% to 85% because the pipeline told us, precisely, which of our tests were lying.

Performance, E2E, and the evidence pack

Against the production build over local Postgres, the status roll-up, the badge, and the incident history all returned a P95 latency comfortably under the 150ms budget at a 100% success rate. (These are single-client local figures — a floor, not a production SLA under load — and we say so plainly.)
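A latency budget gate can be sketched in a few lines. The version below assumes the nearest-rank definition of P95 and takes the 150 ms budget from the text; the function names are illustrative, and the post does not say how the actual stage samples or computes its percentile.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(samples_ms: list[float], budget_ms: float = 150.0) -> bool:
    """True when P95 is at or under the budget; a CI step would fail on False."""
    return p95(samples_ms) <= budget_ms
```

With 20 samples the nearest rank is 19, so a single slow outlier does not fail the build but two do. That is one reason to collect a meaningful number of samples before trusting the gate.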

For self-healing E2E we deliberately broke a selector: the test asked for a heading by a name that no longer existed after a rename. Instead of failing, the locator walked its fallback ladder, found the heading by role and level, recovered, and logged the repair as "amber" so a human can see that drift happened. The test passed and surfaced the drift. The unit core adds 98 passing tests with 100% statement, line, and function coverage.
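The fallback ladder can be modelled without a browser. In this sketch the page is a list of simple `Element` records and each rung is a predicate. The types, the heading names, and the rule that a rung counts only when exactly one element matches are assumptions for illustration, not the pipeline's actual locator code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Element:
    role: str
    name: str
    level: Optional[int] = None


Locator = Callable[[Element], bool]


def locate(page: list[Element], ladder: list[tuple[str, Locator]]):
    """Try each rung in order; return (element, status, rung label).

    Status is "green" when the preferred locator matched and "amber" when a
    fallback rung had to recover, so the drift stays visible to a human.
    """
    for rung, (label, matches) in enumerate(ladder):
        found = [el for el in page if matches(el)]
        if len(found) == 1:  # assumption: an ambiguous match is not a recovery
            return found[0], ("green" if rung == 0 else "amber"), label
    raise LookupError("no rung of the locator ladder matched; needs a human")


# The heading was renamed, so the name-based rung no longer matches.
page = [Element("heading", "Service status", level=1), Element("button", "Subscribe")]
ladder = [
    ("by name", lambda el: el.role == "heading" and el.name == "System status"),
    ("by role and level", lambda el: el.role == "heading" and el.level == 1),
]
element, status, used = locate(page, ladder)
print(status, used)  # amber by role and level
```

Returning the status alongside the element is the design point: a recovered test still passes, but the amber result records that the preferred selector needs attention.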

What we are not claiming

We are allergic to case studies that quietly imply more than they measured, so let us be explicit. We did not operate this for 90 days. We have no escaped-defect count, no real-user number, no uptime figure, no PR-feedback-time distribution, and no weekly cost number — every one of those requires a real operating window, and inventing them would defeat the purpose.

Two of the seven components are LLM-driven (the PR-test agent and the telemetry loop); we ran them in dry-run, so they produced zero suggestions at zero cost — and we report exactly that rather than a flattering accept-rate we didn't earn.

I'd rather lose your deal than quote you a number I made up. That's the whole company in one sentence.

Key results

Contract: 30/30 OpenAPI operations covered (100%), 41 assertions green, including the headline business invariants.
Mutation score: 72.46% → 84.75% after adding the boundary tests the pipeline flagged on its first run.
Performance: 3/3 public endpoints within the P95 budget, 0% errors.
Self-healing E2E: 3/3 passing, including a real recovery from a deliberately renamed selector.
Unit core: 98/98 passing, 100% statement, line, and function coverage.

Methodology & honesty

Built and measured on a GVK-owned reference product (Statusbeacon), run locally as a production build over a real Postgres database. Every figure is reproducible from a logged command. This is the day-one launch state, not a 90-day operating report — operating-window metrics (escaped defects, real users, uptime, PR feedback time, weekly cost, agent accept-rate) are deliberately reported as unmeasured rather than invented.

FAQs

Is this a 90-day operating report?

No. It's the day-one launch state — what the pipeline catches before any feature ships. Operating-window metrics (escaped defects, real users, uptime, cost) require a real operating period and are reported as unmeasured, not invented.

Why mutation testing instead of code coverage?

Coverage tells you a line ran during a test. Mutation testing tells you a test would actually fail if that line were wrong. In this build, coverage was high while mutation testing still found 64 ways to break the code that no test caught.

Can these components drop into our existing codebase?

Yes — they're designed to be installed into an existing repo with config, not just used greenfield. We typically start with the contract test and the mutation gate because they pay back fastest.

What about the AI-driven components?

The PR-test agent and telemetry loop are LLM-driven and were run in dry-run mode for this benchmark (zero suggestions, zero cost). In a live engagement they run against real PRs and production signals with cost caps in place.

What would this catch in your repo?

The two parts that pay back fastest are the contract test and the mutation gate — they tell you the truth about everything else. I'll show you what a from-commit-one pipeline catches in your codebase and where to start. Email me — it reaches me directly.


About the author

Venkata Kari · Founder, GVK Technologies

I ran these benchmarks myself. GVK is new — I don't have client logos to borrow yet, so instead of marketing claims I publish the experiments I wish vendors would. The contact button reaches me directly, and I read every message.
