Case Study · Mobile / React Native

Your Mobile Test Suite Is Probably Healthy. Your Mobile QA Still Isn't

We took a major open-source React Native app — a real, widely-installed product — and put its quality setup through its paces on the newest Xcode, the newest iOS simulator, and the newest Expo SDK. The suite scored an A. The mobile QA still had two specific holes, and the environment ate most of the day.

6 min read

474 / 474 unit tests passing (types and lint clean)
0 automated visual-regression checks
0 runtime accessibility gates

From the founder

Same deal as the rest of this series: I'm a new founder running the benchmarks I wish vendors published, on my own time, and reporting what actually happened — including what didn't work.

This one humbled me before I could test a single screen. I lost the better part of two days to the environment alone: a native build that died because a folder name contained a space, a file-watcher that hung for nine hours before I killed it, and a day-old simulator that flatly refused to render the app. None of that shows up in a glossy demo. All of it is your Tuesday — which is exactly why I think mobile QA is a pipeline problem, not a 'write more tests' problem.

First: the suite is genuinely good

This is not a story about a sloppy codebase — the opposite. Its unit suite ran 474 tests with zero failures. Type-checking: clean. Linting: clean. It even ships a native end-to-end suite of two dozen scripted device flows, which most mobile teams never get around to writing. By the usual dashboard, you'd give this team an A.

And yet its mobile QA still has two specific, important holes — the same two we find on almost every mobile team, however good the unit coverage:

  • No automated visual regression. Nothing catches the button that drifted ten pixels, the text that clipped on a smaller screen, the dark-mode contrast that broke. On mobile — where the same code renders across dozens of device sizes, OS versions, and font settings — this is the gap that ships the bugs users actually see.
  • No runtime accessibility gating. There's a lint rule for a11y, which catches some issues in source — but nothing audits the running app, screen by screen, and fails the build when a control loses its label or a contrast ratio drops below standard.

Unit tests don't see layout, rendering, or accessibility — the hardest, most user-visible parts of a mobile app. And that blindness looks identical to safety on a dashboard.
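To make the first gap concrete, here is a minimal sketch of what a visual-regression gate does at its core: compare a screenshot against a known-good baseline, skip regions that legitimately change between runs, and fail when too many pixels moved. Everything in it is an assumption for illustration (the `Screenshot` and `Rect` shapes, the `diffRatio` and `visualGatePasses` names, the tolerance and threshold defaults). It is not the app's code and not any particular tool's API.

```typescript
// Hypothetical types and names, for illustration only.
interface Screenshot {
  width: number;
  height: number;
  data: Uint8Array; // RGBA bytes, row-major, 4 bytes per pixel
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// True when pixel (x, y) falls inside any masked region
// (for example a clock, an avatar, or a timestamp).
function isMasked(x: number, y: number, masks: Rect[]): boolean {
  return masks.some(
    (m) => x >= m.x && x < m.x + m.width && y >= m.y && y < m.y + m.height
  );
}

// Fraction of unmasked pixels that differ from the baseline by more
// than `tolerance` on at least one channel.
function diffRatio(
  baseline: Screenshot,
  current: Screenshot,
  masks: Rect[] = [],
  tolerance = 8 // assumed default; tune per project
): number {
  if (baseline.width !== current.width || baseline.height !== current.height) {
    return 1; // a size change is treated as a full mismatch
  }
  let compared = 0;
  let changed = 0;
  for (let y = 0; y < baseline.height; y++) {
    for (let x = 0; x < baseline.width; x++) {
      if (isMasked(x, y, masks)) continue;
      compared++;
      const i = (y * baseline.width + x) * 4;
      for (let c = 0; c < 4; c++) {
        if (Math.abs(baseline.data[i + c] - current.data[i + c]) > tolerance) {
          changed++;
          break; // count each pixel at most once
        }
      }
    }
  }
  return compared === 0 ? 0 : changed / compared;
}

// Gate: pass only when at most `maxRatio` of unmasked pixels changed.
function visualGatePasses(ratio: number, maxRatio = 0.001): boolean {
  return ratio <= maxRatio;
}
```

A real pipeline would decode screenshots captured from the simulator or device and would probably use a perceptual diff rather than a raw per-channel comparison. The point of the sketch is the shape of the gate: baseline, mask, threshold, pass or fail.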

Then: the environment ate most of the day

Here's the part nobody puts in a case study, which is precisely why we will. Getting this healthy app to simply build and run on current Apple tooling took diagnosing several distinct failures that had nothing to do with the app's quality:

  • The standard "run on simulator" command silently targeted a physical device and demanded code-signing — because the newest Xcode changed an output format the tooling didn't expect yet.
  • The native build failed outright because the project folder's name contained a space, breaking a generated build script. (Yes, in 2026.)
  • The file-watcher the bundler depends on wouldn't start and hung the bundler until we disabled it.
  • Once built and launched, the app still wouldn't fully render on the day-old simulator: one build configuration hit a runtime module error unique to the new OS; the other aborted on a security feature that doesn't exist on simulators at all.

None of these are the app's fault. All of them are the daily reality of mobile QA: the environment isn't a footnote to the testing — the environment is most of the work.
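One of these failures is cheap to catch before a build ever starts. The sketch below is a hypothetical preflight check for the folder-name problem: it flags any path segment that contains whitespace. The `PreflightIssue` shape and the `checkProjectPath` and `formatReport` names are my own assumptions, not part of the app or of any build tool.

```typescript
// Hypothetical preflight helper; all names are illustrative.
interface PreflightIssue {
  check: string;
  detail: string;
}

// Flags path segments containing whitespace, which unquoted
// expansions in generated build scripts tend to split.
function checkProjectPath(projectPath: string): PreflightIssue[] {
  const issues: PreflightIssue[] = [];
  // Split on forward or back slashes and drop empty segments.
  const segments = projectPath.split(/[\\/]+/).filter((s) => s.length > 0);
  for (const segment of segments) {
    if (/\s/.test(segment)) {
      issues.push({
        check: "path-whitespace",
        detail: `Path segment "${segment}" contains whitespace`,
      });
    }
  }
  return issues;
}

// Formats issues for a CI log; an empty list means the check passed.
function formatReport(issues: PreflightIssue[]): string {
  if (issues.length === 0) return "preflight: ok";
  return issues.map((i) => `preflight: [${i.check}] ${i.detail}`).join("\n");
}
```

In practice you would call something like this with the project's working directory at the top of the build script and stop early if it reports anything. It covers only one of the four failures above; the others depend on toolchain behavior that a static check cannot see.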

Why mobile is structurally harder than web

Web QA runs in a browser that's broadly stable. Mobile QA runs on a moving target: two operating systems, annual breaking toolchain updates, native build systems, simulators that lag real devices, and security features that behave differently in test than in production.

You can have 474 green unit tests and still spend your week fighting a simulator. That's the thing generic "add more tests" advice completely ignores.

Where agentic AI changes the math

If the environment is the hard part, the win isn't an AI that writes more unit tests. It's a QA capability that is resilient to the churn and fills the holes unit tests can't reach:

  • Autonomous exploration that adapts: an agent drives the running app, maps what's actually there, and exercises journeys — extending coverage beyond the handful of hand-scripted flows and re-deriving its path when the UI moves instead of breaking.
  • Visual regression as a first-class gate: every screen, every run, diffed against a known-good baseline with dynamic regions masked — so drift and dark-mode breaks get caught before a user finds them.
  • Runtime accessibility auditing, gated: every discovered screen scanned in its running state, violations classified by severity, and the build held when accessibility regresses.
  • Built to survive the environment, not assume it, so a toolchain upgrade is a logged event, not a lost week.

I treat an accessibility regression as a shipped defect, not a 'nice-to-have we'll get to'. If your build can go green while a control loses its label or the contrast drops below standard, your build is lying to you.
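Here is a minimal sketch of what "gated" could mean for the two regressions named above: a missing label and a contrast drop. The contrast math follows the WCAG 2.x relative-luminance formula, and 4.5:1 is the WCAG AA minimum for normal-size text. Everything else is an assumption for illustration: the `AuditedElement` shape, the function names, and the severity mapping are mine, not a real auditing tool's API.

```typescript
// Hypothetical shape for an element captured from the running app.
type RGB = [number, number, number];

interface AuditedElement {
  id: string;
  interactive: boolean;
  label?: string; // e.g. the value of React Native's accessibilityLabel
  foreground?: RGB;
  background?: RGB;
}

interface Violation {
  id: string;
  rule: "missing-label" | "low-contrast";
  severity: "critical" | "serious"; // assumed mapping, not a standard
}

// WCAG 2.x relative luminance of an sRGB colour (channels 0-255).
function relativeLuminance([r, g, b]: RGB): number {
  const lin = (v: number): number => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// WCAG contrast ratio, from 1 (identical) to 21 (black on white).
function contrastRatio(a: RGB, b: RGB): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Audits one screen's elements and returns every violation found.
function auditScreen(elements: AuditedElement[], minContrast = 4.5): Violation[] {
  const violations: Violation[] = [];
  for (const el of elements) {
    if (el.interactive && (!el.label || el.label.trim() === "")) {
      violations.push({ id: el.id, rule: "missing-label", severity: "critical" });
    }
    if (el.foreground && el.background &&
        contrastRatio(el.foreground, el.background) < minContrast) {
      violations.push({ id: el.id, rule: "low-contrast", severity: "serious" });
    }
  }
  return violations;
}

// Gate: any violation holds the build in this sketch.
function buildShouldFail(violations: Violation[]): boolean {
  return violations.length > 0;
}
```

A production gate would read these properties from the running app's accessibility tree and would likely compare against the previous run, so that only regressions hold the build. The sketch fails on any violation to keep the logic visible.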

Key results

  • 474 of 474 unit tests passing; TypeScript type-check and lint clean — a healthy, well-maintained suite.
  • An existing native end-to-end suite of two dozen scripted flows already in place.
  • Zero automated visual-regression coverage and zero runtime accessibility gating — the two gaps that ship user-visible bugs.
  • The app builds on the newest Apple toolchain — but several distinct environment failures stood between a healthy repo and a running app.

Methodology & honesty

Independent benchmark on a public open-source React Native app, run on current-release Apple developer tooling. The baseline numbers (unit tests, type-check, lint, build) are measured and reproducible. We did not complete an autonomous-exploration cycle this round — the day-old simulator blocked full app render, and we report that plainly rather than quote exploration numbers we didn't produce. No production systems were touched and no endorsement is implied.

FAQs

We already have Detox or Maestro flows. Isn't that covered?
Hand-written flows break when the UI moves and only cover what someone had time to script. We add adaptive exploration that extends coverage and survives change, plus the visual-regression and runtime-accessibility gates that scripted flows don't provide.

Isn't this just screenshot testing?
Screenshot testing tells you something changed. We tell you what changed and whether it matters, audit accessibility on the running app, and adapt to UI change, which screenshot tools don't.

Does AI driving our app put anything at risk?
It never ships anything. It explores, captures, and reports; a human approves. A hard guardrail blocks destructive actions: it cannot post, delete, or message.

We use a real-device cloud already. Does this replace it?
No, it's complementary. Device clouds give you devices; we provide the intelligence that drives them and judges the result, and can run on your existing device farm.

Can you see your mobile blind spots?

Most teams can quote their unit-test count but not whether their layout or accessibility regressed three releases ago. I'll map your coverage, rank the visual and accessibility regressions with reproductions, and hand you an honest environment-fragility report. Email me directly.

Book a mobile QA gap assessment
About the author
Venkata Kari · Founder, GVK Technologies

I ran these benchmarks myself. GVK is new — I don't have client logos to borrow yet, so instead of marketing claims I publish the experiments I wish vendors would. The contact button reaches me directly, and I read every message.
