Project · AI Tooling

UX Gap Detection

A multi-bot pipeline that finds the UX issues automated testing misses. It runs on every automated test pass, deduplicates candidates against known bugs, and files verified issues automatically.

QA · AI Tooling · Enterprise
Role: Senior QA Analyst
Started: Mar 2026
Status: Active · 2 apps
Stack: Python, Anthropic API, GitHub Actions
Pipeline overview: Auto runs → AI analysis → Dedup filter → Simulator verify → Auto filed

The problem

When the offshore manual testing team was replaced by automated test passes, the coverage gap wasn't obvious at first. Automated tests are written to verify behavior — they pass as long as the app does what it's programmed to do. They don't notice when something looks off: a misaligned UI element, a flow that's technically correct but confusing, a rendering issue that only appears in certain states.

Real users notice these things. There was no systematic way to catch them without reintroducing manual review, which was exactly what automation was supposed to eliminate. The question became: how do you get the coverage of a manual tester without the manual tester?

The challenge

The biggest risk was noise. A pipeline that files inaccurate or duplicate bugs erodes engineering trust fast, and a system engineers learn to ignore is worse than no system at all. Getting it right required iteration: bugs had to be reproduced in a simulator before filing, not just theorized; correct details (description, build info, reproduction steps) had to be consistently attached; and the pipeline had to coexist with existing automated workflows without disruption. That calibration happened through review cycles with engineering and cross-functional partners, adjusting the detection logic until accuracy was high enough to trust.
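
The filing preconditions described above can be sketched as a small check that runs before anything is filed. This is a minimal illustration, not the pipeline's actual code: the `Candidate` record, its field names, and the `filing_blockers` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field


# Hypothetical record shape; the field names are assumptions for this sketch.
@dataclass
class Candidate:
    description: str = ""
    build_info: str = ""
    repro_steps: list[str] = field(default_factory=list)
    reproduced_in_simulator: bool = False


def filing_blockers(c: Candidate) -> list[str]:
    """Return the reasons a candidate may not be filed; an empty list means ready."""
    blockers = []
    if not c.reproduced_in_simulator:
        blockers.append("not reproduced in simulator")
    if not c.description.strip():
        blockers.append("missing description")
    if not c.build_info.strip():
        blockers.append("missing build info")
    if not c.repro_steps:
        blockers.append("missing reproduction steps")
    return blockers


# A theorized issue: described, but never reproduced and missing details.
theorized = Candidate(description="Button overlaps label on settings screen")

# A fully documented, reproduced issue.
ready = Candidate(
    description="Button overlaps label on settings screen",
    build_info="example-build-id",  # placeholder value
    repro_steps=["Open settings", "Rotate to landscape"],
    reproduced_in_simulator=True,
)

print(filing_blockers(theorized))
print(filing_blockers(ready))  # [] means nothing blocks filing
```

Returning the list of reasons, rather than a bare boolean, makes it easier to see during calibration why a candidate was held back.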

Key decisions

Decision 01

Bot-per-concern, not one monolith

Each stage of the pipeline runs as a separate bot — ingestion, analysis, dedup, filing. This means any stage can be updated, retrained, or replaced independently without rebuilding the whole pipeline. When the analysis bot needed refinement, the filing and dedup logic stayed untouched.
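
One way to picture the bot-per-concern layout is a list of stages that share a single input and output shape, so one stage can be swapped without touching the rest. The sketch below is illustrative only: the stage bodies are stand-ins, and the plain-dict item shape, `KNOWN_TITLES`, and `analyze_v2` are assumptions rather than the pipeline's real interfaces.

```python
from typing import Callable, Iterable

# Every stage takes a list of items and returns a list of items (assumed shape).
Stage = Callable[[list[dict]], list[dict]]

KNOWN_TITLES = {"login screen: spinner never stops"}  # illustrative data


def ingest(_: list[dict]) -> list[dict]:
    # Stand-in for pulling job outputs and post-run reports.
    return [{"run": "nightly", "log": "settings screen: label clipped"}]


def analyze(runs: list[dict]) -> list[dict]:
    # Stand-in for the AI analysis bot.
    return [{"title": r["log"], "source_run": r["run"]} for r in runs]


def analyze_v2(runs: list[dict]) -> list[dict]:
    # A refined analysis stage: same input and output shape, different logic.
    return [{"title": r["log"].capitalize(), "source_run": r["run"]} for r in runs]


def dedup(candidates: list[dict]) -> list[dict]:
    return [c for c in candidates if c["title"] not in KNOWN_TITLES]


def file_issues(candidates: list[dict]) -> list[dict]:
    return [{**c, "status": "filed"} for c in candidates]


def run_pipeline(stages: Iterable[Stage]) -> list[dict]:
    items: list[dict] = []
    for stage in stages:
        items = stage(items)
    return items


# Swapping the analysis stage leaves ingestion, dedup, and filing untouched.
original = run_pipeline([ingest, analyze, dedup, file_issues])
refined = run_pipeline([ingest, analyze_v2, dedup, file_issues])
print(original[0]["title"], "|", refined[0]["title"])
```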

Decision 02

Dedup before testing, not after

The dedup check runs before simulator testing, not after. Running a simulator test only to find out the issue is already filed wastes compute and time. Checking the bug database first means only net-new issues ever reach the testing stage.
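
The ordering can be sketched as follows. How the real dedup bot matches candidates against the bug database is not described here, so the fuzzy title comparison with `difflib`, the `SIMILARITY_THRESHOLD` cutoff, and the sample data are all assumptions chosen to keep the example self-contained.

```python
from difflib import SequenceMatcher

KNOWN_BUGS = ["Settings label clipped on small screens"]  # illustrative data
SIMILARITY_THRESHOLD = 0.8  # assumed cutoff; a real value would need tuning

simulator_runs: list[str] = []  # records which candidates reached the simulator


def is_known(title: str) -> bool:
    """Treat a candidate as known if its title closely matches a filed bug."""
    return any(
        SequenceMatcher(None, title.lower(), known.lower()).ratio()
        >= SIMILARITY_THRESHOLD
        for known in KNOWN_BUGS
    )


def reproduce_in_simulator(title: str) -> bool:
    # Stand-in for the expensive step; here it only records that it was called.
    simulator_runs.append(title)
    return True


def process(candidates: list[str]) -> list[str]:
    confirmed = []
    for title in candidates:
        if is_known(title):
            continue  # cheap check first: known issues never reach the simulator
        if reproduce_in_simulator(title):
            confirmed.append(title)
    return confirmed


confirmed = process([
    "settings label clipped on small screens",  # already filed
    "Checkout total overlaps promo banner",     # net new
])
print(confirmed)
print(len(simulator_runs))  # only the net-new candidate cost a simulator run
```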

Decision 03

Auto-file with a human review gate

Issues are filed automatically but reviewed before being assigned. This keeps the output trustworthy — a fully autonomous pipeline that silently files noise is worse than no pipeline. The review gate stays until confidence is high enough to remove it for specific issue types.
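
A review gate that can later be relaxed per issue type might look like the sketch below. The status strings, the `initial_status` helper, and the issue type names are hypothetical; the empty trusted set reflects the current state, in which every filed issue is reviewed before assignment.

```python
def initial_status(issue_type: str, trusted_types: set[str]) -> str:
    """Pick the status a newly filed issue starts in."""
    if issue_type in trusted_types:
        return "ready_to_assign"  # gate removed for this issue type
    return "needs_review"         # default: a human looks before assignment


# Current state: no issue type bypasses review.
print(initial_status("rendering", trusted_types=set()))

# Possible later state: one type has earned enough confidence to skip the gate.
print(initial_status("rendering", trusted_types={"rendering"}))
print(initial_status("broken_flow", trusted_types={"rendering"}))
```

Keeping the trusted set as data rather than code means the gate can be loosened one issue type at a time, and tightened again just as easily.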

How it works
Step 01
Ingest automated runs
Bots pull the latest automated test results — job outputs and post-run reports across both apps
Step 02
Identify UX gaps
AI reviews run output and flags potential UX issues — rendering problems, broken flows, functional anomalies that automation passes but users would notice
Step 03
Dedup against existing bugs
Candidate issues are checked against the open bug database. Already-known issues are filtered out — only net-new gaps proceed
Known issue: already filed, so it is skipped. No duplicate noise.
Net new: proceeds to simulator testing.
Step 04
Simulator testing
Net-new issues are reproduced in device simulators and emulators. Flows with account limitations or hardware dependencies are verified manually.
Step 05
Auto-file & triage
Confirmed issues are filed automatically to the correct product area on my behalf, with title, reproduction steps, and severity pre-populated.
Step 06
Review & assign
Filed issues are reviewed and assigned for prioritization. The goal is fully autonomous triage; for now, manual review is still required for edge cases and account-gated flows.
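
For Step 02, one plausible way to keep model output from leaking noise into later steps is to parse it defensively. The sketch below assumes the analysis bot is asked to reply with a JSON list of objects carrying `title` and `category` keys; that response format, the category names (taken from the issue kinds listed in Step 02), and the `parse_candidates` helper are assumptions, and no API call is made here.

```python
import json

# Categories mirror the issue kinds named in Step 02; the identifiers are assumed.
ALLOWED_CATEGORIES = {"rendering", "broken_flow", "functional_anomaly"}


def parse_candidates(model_text: str) -> list[dict]:
    """Keep only well-formed candidates; drop anything malformed or off-category."""
    try:
        data = json.loads(model_text)
    except json.JSONDecodeError:
        return []  # unparseable output yields no candidates
    if not isinstance(data, list):
        return []
    candidates = []
    for item in data:
        if not isinstance(item, dict):
            continue
        title = str(item.get("title", "")).strip()
        category = item.get("category")
        if title and isinstance(category, str) and category in ALLOWED_CATEGORIES:
            candidates.append({"title": title, "category": category})
    return candidates


sample = json.dumps([
    {"title": "Promo banner overlaps total", "category": "rendering"},
    {"title": "", "category": "rendering"},          # dropped: empty title
    {"title": "Vague concern", "category": "style"}, # dropped: unknown category
])

print(parse_candidates(sample))
print(parse_candidates("not json"))  # []
```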

Where it stands

Started with one app and a few features. Now covers two apps with a full feature suite. The pipeline has been running for around two months — long enough to validate the approach and start tuning accuracy.

Apps covered: 2
Pipeline steps: 6
In production: ~2 months