Verification
Once the tests are filled in, you need a single checkpoint that asks whether they cover the spec, whether they are flaky, whether the selectors are fragile, and whether any real key leaked. /run-suite checks behavior; /review-e2e checks static quality. Passing only one of the two is not a real pass.
| Tool | What it looks at |
|---|---|
| /run-suite | Behavior. Runs the chosen tracks and synthesizes the result in critical-flow language |
| /review-e2e | Static. Scores structural quality against the six-dimension rubric |
run-suite’s result synthesis
Raw reporter output never reaches the user as is. It gets regrouped by e2e-spec.md §2 flow. A pass is one line; a failure gets what, where, the likely cause, and the next step.

    ## Behavioral verification — web

    ✓ Guest visitor buys Pro from the pricing page → lands on the dashboard
    ✓ Logged-in user edits their profile in settings
    ✗ Logged-in user opens billing management (Customer Portal)
      - Location: tests/web/specs/billing.spec.ts:42
      - Message: customer-portal call returned 401
      - Likely cause: missing Authorization header or expired session
      - Next: re-call implement-web-suite — review the session extraction

The assertion lines are kept only in the report file (run-{ISO}.md), and the full log goes into .claude/state/last-run.log in one piece. When output exceeds 50 lines, only errors and warnings flow into the context (Silence on Success).
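The regrouping described above can be sketched in a few lines. This is only an illustration: the `RawResult` shape and the `synthesize` function are hypothetical names, not the actual reporter format, and the sketch assumes a flow passes only when every test mapped to it passes.

```typescript
// Hypothetical input shape; the real reporter output may look different.
interface RawResult {
  flow: string;         // the e2e-spec.md §2 flow this test belongs to
  passed: boolean;
  location?: string;    // e.g. "tests/web/specs/billing.spec.ts:42"
  message?: string;
  likelyCause?: string;
  nextStep?: string;
}

// Regroup raw results by flow and render one line per pass,
// or a five-line block (what, where, message, cause, next) per failure.
function synthesize(results: RawResult[]): string[] {
  const byFlow = new Map<string, RawResult[]>();
  for (const result of results) {
    const group = byFlow.get(result.flow) ?? [];
    group.push(result);
    byFlow.set(result.flow, group);
  }

  const lines: string[] = [];
  byFlow.forEach((group, flow) => {
    const failure = group.find((r) => !r.passed);
    if (!failure) {
      lines.push(`✓ ${flow}`);
      return;
    }
    lines.push(`✗ ${flow}`);
    lines.push(`  - Location: ${failure.location ?? "unknown"}`);
    lines.push(`  - Message: ${failure.message ?? "unknown"}`);
    lines.push(`  - Likely cause: ${failure.likelyCause ?? "unknown"}`);
    lines.push(`  - Next: ${failure.nextStep ?? "unknown"}`);
  });
  return lines;
}
```

Only the first failing test of a flow is reported here, which keeps a failure to a fixed size; whether the real tool does the same is an assumption.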
review-e2e, six dimensions
The rubric’s SSOT is AI_AUTOMATION.md §7. Each dimension scores 5/3/1, and passing means every dimension scores 3 or higher.
| Dimension | 5 points | 1 point |
|---|---|---|
| Coverage | Every §2 critical flow has a test | Core flow uncovered |
| Flakiness | Zero hard sleeps, stable on rerun, retry policy stated | Sleep sprinkled everywhere, unstable |
| Selector quality | role/label/testID first, zero CSS/XPath | Many XPath and fragile selectors |
| Test isolation | Context, DB, device state isolated; only auth reused | Order-dependent, global state |
| Run performance | Parallel and sharding, reasonable time | Excessively slow, redundant |
| Maintainability | POM reuse, fixture cleanup, clear failure messages | Locators scattered, hard to read |
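The pass rule (every dimension at 3 or higher) is simple enough to write down. In this sketch the dimension keys and the `approved` status name are assumptions; only `changes_requested` appears in this page.

```typescript
type Score = 1 | 3 | 5;

// Hypothetical keys for the six rubric dimensions.
const DIMENSIONS = [
  "coverage",
  "flakiness",
  "selectorQuality",
  "testIsolation",
  "runPerformance",
  "maintainability",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type Scores = Record<Dimension, Score>;

interface Verdict {
  status: "approved" | "changes_requested"; // "approved" is an assumed name
  failing: Dimension[];                     // dimensions scoring below 3
}

// A review passes only when no dimension scores below 3.
function verdict(scores: Scores): Verdict {
  const failing = DIMENSIONS.filter((d) => scores[d] < 3);
  return {
    status: failing.length === 0 ? "approved" : "changes_requested",
    failing,
  };
}
```

Returning the failing dimensions alongside the status matches the idea that a blocked review names what has to be fixed.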
The verifier does not hand out unearned 5s (Evaluator Calibration). When in doubt it scores low and attaches a fix suggestion. Flakiness gets caught with grep -rn 'waitForTimeout\|sleep', and selector quality counts CSS/XPath overuse with grep. The result drops as a single page at .claude/state/review-{ISO}.md with per-dimension scores and reasoning.
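The flakiness grep can be mirrored in TypeScript when the check needs to run inside a script. This is a minimal sketch that works on in-memory sources so it stays easy to test; `findHardSleeps` and the `Finding` shape are hypothetical, and a real run would read the spec files from disk first.

```typescript
interface Finding {
  file: string;
  line: number; // 1-based, like grep -n
  text: string;
}

// Same substring match as grep 'waitForTimeout\|sleep'.
const HARD_SLEEP = /waitForTimeout|sleep/;

// files maps a file name to its source text.
function findHardSleeps(files: Record<string, string>): Finding[] {
  const findings: Finding[] = [];
  for (const [file, source] of Object.entries(files)) {
    source.split("\n").forEach((text, index) => {
      if (HARD_SLEEP.test(text)) {
        findings.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return findings;
}
```

Like the grep, this is a plain substring match, so any identifier that merely contains "sleep" is flagged too; the hits are candidates for review, not proof.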
Forbidden patterns
AI_AUTOMATION.md §5 holds the patterns the reviewer blocks. Most can be caught with grep.
| Forbidden | Why | Replacement |
|---|---|---|
| Hard sleep, waitForTimeout(n) | Number-one cause of flakiness | auto-wait, web-first assertion, waitForResponse, idle sync (Detox) |
| CSS/XPath selector overuse | Fragile under DOM changes | getByRole → getByLabel/getByText → last resort getByTestId (testID on mobile) |
| Reaching a production SUT or DB | Data contamination, incidents | local/staging/dedicated test environment (S6) |
| Hardcoding a webview context ID | Shifts from run to run | Query with getContextHandles() then switch |
| Caching the browser binary (CI) | Version mismatch, OS dependencies | playwright install --with-deps every time |
| Clicking Electron native UI directly | Cannot be automated | evaluate/stub the main-process API |
| Sharing state between tests | Order-dependent, flaky | A fresh context per test; reuse only auth via storageState |
| Exposing real keys in spec/CI logs | Leak | env and a secret store, masking (S2, S7) |
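For the row on shared state, a Playwright config fragment might look like the following. Playwright Test already gives each test its own browser context, so the only thing reused is the saved sign-in state. The file path is a hypothetical example, and the sketch assumes a separate login step has written that file.

```typescript
// playwright.config.ts (sketch; the storageState path is a hypothetical example)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Reuse only the signed-in session. Every test still runs in a fresh context.
    storageState: 'tests/web/.auth/user.json',
  },
});
```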
Security baseline S1 through S8
These are the eight lines of AI_AUTOMATION.md §4, pinned to the e2e domain.
- S1 least privilege. The test account is dedicated and low-privilege. You do not run e2e under an admin or a real user account.
- S2 secret isolation. Real keys and passwords are never pinned in code or the repo. Only .env.test.local (gitignored) holds them, and the repo carries only the key list in .env.test.example.
- S3 input validation. SUT URLs and credentials received from outside are not trusted as is; their format is validated.
- S4 output audit. If a token or PII shows up in a trace, video, or screenshot, it gets masked. Watch the visibility scope of artifacts.
- S5 destructive-command block. A db reset or an account deletion runs only after explicit confirmation. The same goes before deleting a track or a directory.
- S6 environment isolation. Reaching a production SUT or a production DB is absolutely banned. Local, staging, or dedicated test only.
- S7 log masking. Reports and logs never expose credentials or session tokens.
- S8 external-communication audit. Name the external endpoints the tests call, and block side effects like real payments or real sends with test mode.
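S7 (and the text side of S4) can be approximated with a small masking helper that runs before anything is written to a report or log. This is a sketch under stated assumptions: `maskSecrets` is a hypothetical name, the list of secret values is assumed to come from the test environment, and the bearer-token pattern is only one example of a shape worth hiding.

```typescript
// Replace known secret values and anything that looks like a bearer token.
function maskSecrets(text: string, secrets: string[]): string {
  let masked = text;
  for (const secret of secrets) {
    if (secret.length === 0) continue; // splitting on "" would mangle the text
    masked = masked.split(secret).join("***");
  }
  // Hide bearer tokens even when their value is not in the known list.
  return masked.replace(/Bearer\s+\S+/g, "Bearer ***");
}
```

Masking by known value catches secrets wherever they appear, while the pattern catches tokens minted at run time; neither covers screenshots or video, which still need the artifact-level handling S4 describes.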
S6 and S2 are the first checks enforced in this domain. Connect to production or leak a real key and the run stops right there.
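A guard for S3 and S6 could run before any track starts. In this sketch the allowlist contents and the `assertSafeSutUrl` name are hypothetical; the idea is simply that any host not explicitly listed as a test environment is treated as production.

```typescript
// Hypothetical allowlist; replace with your own local, staging, and test hosts.
const ALLOWED_HOSTS = ["localhost", "127.0.0.1", "staging.example.com"];

// Validate the format of a SUT URL (S3) and refuse unknown hosts (S6).
function assertSafeSutUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error(`SUT URL is not a valid URL: ${raw}`);
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`SUT URL must use http or https: ${raw}`);
  }
  if (!ALLOWED_HOSTS.includes(url.hostname)) {
    // Anything outside the allowlist is assumed to be production.
    throw new Error(`SUT host is not an allowed test environment: ${url.hostname}`);
  }
  return url;
}
```

An allowlist fails closed: a new production hostname is blocked by default, which a denylist of known production hosts would not guarantee.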
Sprint Contract and the iteration limit
When /review-e2e returns changes_requested, it names the target track and dimensions and re-calls that track’s implement-*. The reviewer does not touch the code; the builder makes a surgical fix to just that part. This loop runs at most two iterations. Two blocks in a row flip the stage to blocked, and /recover-from-blocked takes over. The same recovery flow, at a non-developer pace, is covered in the beginner guide when-ai-gets-stuck.
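The iteration limit can be read as a small loop. The sketch below is one interpretation, not the actual implementation: it assumes "two blocks in a row" means two consecutive changes_requested results, with one surgical fix between them. The `approved` result and the function names are hypothetical.

```typescript
type ReviewResult = "approved" | "changes_requested"; // "approved" is an assumed name
type Stage = "approved" | "blocked";

const MAX_CONSECUTIVE_BLOCKS = 2;

// review stands in for /review-e2e, fix for the re-called implement-* builder.
function runSprintLoop(
  review: () => ReviewResult,
  fix: () => void,
): { stage: Stage; blocks: number } {
  let blocks = 0;
  for (;;) {
    if (review() === "approved") {
      return { stage: "approved", blocks };
    }
    blocks += 1;
    if (blocks >= MAX_CONSECUTIVE_BLOCKS) {
      // Hand over to recovery instead of looping again.
      return { stage: "blocked", blocks };
    }
    fix();
  }
}
```

Keeping the reviewer and the builder as separate callbacks mirrors the rule that the reviewer never edits code itself.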
Quick reference

    /run-suite      # run chosen tracks + scenario-language report
    /review-e2e     # six-dimension synthesis

    PM="$(jq -r '.name' .claude/state/package-manager.json)"
    $PM exec playwright test --project=chromium   # web direct
    $PM exec playwright test --project=electron   # electron direct (wrap Linux local with xvfb-run)
    maestro test tests/mobile/flows               # mobile (Maestro)

ci comes next. It covers how .github/workflows/e2e.yml splits by track, web sharding and the blob merge, electron’s Linux-only xvfb, and mobile’s emulator and cloud.