
Verification

Once the tests are filled in, you need one spot that asks whether they cover the spec, whether they are flaky, whether the selectors are fragile, and whether any real key leaked. /run-suite looks at behavior, /review-e2e looks at static quality. Passing only one of the two is not a real pass.

| Tool | What it looks at |
| --- | --- |
| /run-suite | Behavior. Runs the chosen tracks and synthesizes the result in critical-flow language |
| /review-e2e | Static. Scores structural quality against the six-dimension rubric |

Raw reporter output never reaches the user as is. It gets regrouped by e2e-spec.md §2 flow. A pass is one line; a failure gets what, where, the likely cause, and the next step.

## Behavioral verification — web
✓ Guest visitor buys Pro from the pricing page → lands on the dashboard
✓ Logged-in user edits their profile in settings
✗ Logged-in user opens billing management (Customer Portal)
- Location: tests/web/specs/billing.spec.ts:42
- Message: customer-portal call returned 401
- Likely cause: missing Authorization header or expired session
- Next: re-call implement-web-suite — review the session extraction
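The regrouping in the report above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `RawResult` shape and the `renderFlowLines` name are hypothetical, since the source does not define the reporter schema. A flow counts as passed only when every raw result mapped to it passed.

```typescript
// Hypothetical shape: the source does not specify the reporter schema.
interface RawResult {
  flow: string; // critical-flow sentence from e2e-spec.md §2 (assumed grouping key)
  passed: boolean;
  location?: string;
  message?: string;
  likelyCause?: string;
  next?: string;
}

// Groups raw results by flow: one line per passing flow,
// and what / where / likely cause / next step per failing flow.
function renderFlowLines(results: RawResult[]): string[] {
  const order: string[] = [];
  const byFlow: Record<string, RawResult[]> = {};
  for (const r of results) {
    if (!Object.prototype.hasOwnProperty.call(byFlow, r.flow)) {
      byFlow[r.flow] = [];
      order.push(r.flow);
    }
    byFlow[r.flow].push(r);
  }

  const lines: string[] = [];
  for (const flow of order) {
    const failures = byFlow[flow].filter((r) => !r.passed);
    if (failures.length === 0) {
      lines.push(`✓ ${flow}`);
      continue;
    }
    const first = failures[0]; // report the first failure for the flow
    lines.push(`✗ ${flow}`);
    lines.push(`- Location: ${first.location ?? "unknown"}`);
    lines.push(`- Message: ${first.message ?? "unknown"}`);
    lines.push(`- Likely cause: ${first.likelyCause ?? "unknown"}`);
    lines.push(`- Next: ${first.next ?? "unknown"}`);
  }
  return lines;
}
```

Reporting only the first failure per flow is a choice made for this sketch; it keeps a failing flow to five lines.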

The assertion lines are kept only in the report file (run-{ISO}.md), and the full log goes into .claude/state/last-run.log in one piece. When output exceeds 50 lines, only errors and warnings flow into the context (Silence on Success).
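One way to picture the Silence on Success rule is the filter below. It is a sketch under assumptions: the 50-line threshold comes from the text, but the `toContext` name and the way a line is recognized as an error or warning (a case-insensitive keyword match) are hypothetical.

```typescript
const MAX_CONTEXT_LINES = 50; // threshold stated in the text

// Assumption: a line counts as an error or warning if it contains one of these words.
const ERROR_OR_WARNING = /\b(error|warn(ing)?)\b/i;

// Short output passes through untouched; long output is reduced
// to its error and warning lines before it reaches the context.
function toContext(logLines: string[]): string[] {
  if (logLines.length <= MAX_CONTEXT_LINES) {
    return logLines;
  }
  return logLines.filter((line) => ERROR_OR_WARNING.test(line));
}
```

The full log is unaffected by this filter; per the text it still lands in .claude/state/last-run.log.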

The rubric’s single source of truth (SSOT) is AI_AUTOMATION.md §7. Each dimension scores 5, 3, or 1, and passing means every dimension scores 3 or higher.

| Dimension | 5 points | 1 point |
| --- | --- | --- |
| Coverage | Every §2 critical flow has a test | Core flow uncovered |
| Flakiness | Zero hard sleeps, stable on rerun, retry policy stated | Sleep sprinkled everywhere, unstable |
| Selector quality | role/label/testID first, zero CSS/XPath | Many XPath and fragile selectors |
| Test isolation | Context, DB, device state isolated; only auth reused | Order-dependent, global state |
| Run performance | Parallel and sharding, reasonable time | Excessively slow, redundant |
| Maintainability | POM reuse, fixture cleanup, clear failure messages | Locators scattered, hard to read |
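The pass rule for the rubric above (every dimension at 3 or higher) can be written as a small check. The type and function names here are hypothetical, chosen only for illustration; the six dimensions and the 5/3/1 scale come from the text.

```typescript
type Score = 1 | 3 | 5;

type Dimension =
  | "coverage"
  | "flakiness"
  | "selectorQuality"
  | "testIsolation"
  | "runPerformance"
  | "maintainability";

type Rubric = Record<Dimension, Score>;

// Passing requires every dimension to score 3 or higher;
// the failing dimensions are returned so a fix can target them.
function rubricVerdict(rubric: Rubric): { pass: boolean; failing: Dimension[] } {
  const failing = (Object.keys(rubric) as Dimension[]).filter((d) => rubric[d] < 3);
  return { pass: failing.length === 0, failing };
}
```

A single dimension at 1 fails the review regardless of how high the others score, so an average would be the wrong aggregate here.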

The verifier does not hand out unearned 5s (Evaluator Calibration). When in doubt it scores low and attaches a fix suggestion. Flakiness gets caught with grep -rn 'waitForTimeout\|sleep', and selector quality counts CSS/XPath overuse with grep. The result drops as a single page at .claude/state/review-{ISO}.md with per-dimension scores and reasoning.
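For readers who prefer to see the flakiness check as code, the sketch below mirrors the grep pattern in TypeScript. It works on in-memory file contents so it stays self-contained; the `scanForHardSleeps` name and the `Hit` shape are hypothetical, and the actual verifier is described only as using grep.

```typescript
interface Hit {
  file: string;
  line: number; // 1-based, like grep -n
  text: string;
}

// Same alternation as grep 'waitForTimeout\|sleep' (no g flag, so test() is stateless).
const HARD_SLEEP = /waitForTimeout|sleep/;

// files maps a path to its source text.
function scanForHardSleeps(files: Record<string, string>): Hit[] {
  const hits: Hit[] = [];
  for (const file of Object.keys(files)) {
    files[file].split("\n").forEach((text, index) => {
      if (HARD_SLEEP.test(text)) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}
```

Like the grep it imitates, this is a plain text match, so it also flags the word "sleep" inside comments or identifiers; hits are candidates for review rather than confirmed violations.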

AI_AUTOMATION.md §5 holds the patterns the reviewer blocks. Most are spots that grep catches.

| Forbidden | Why | Replacement |
| --- | --- | --- |
| Hard sleep, waitForTimeout(n) | Number-one cause of flakiness | auto-wait, web-first assertion, waitForResponse, idle sync (Detox) |
| CSS/XPath selector overuse | Fragile under DOM changes | getByRole, getByLabel/getByText → last resort getByTestId (testID on mobile) |
| Reaching a production SUT or DB | Data contamination, incidents | local/staging/dedicated test environment (S6) |
| Hardcoding a webview context ID | Shifts from run to run | Query with getContextHandles() then switch |
| Caching the browser binary (CI) | Version mismatch, OS dependencies | playwright install --with-deps every time |
| Clicking Electron native UI directly | Cannot be automated | evaluate/stub the main-process API |
| Sharing state between tests | Order-dependent, flaky | A fresh context per test; reuse only auth via storageState |
| Exposing real keys in spec/CI logs | Leak | env and a secret store, masking (S2, S7) |

AI_AUTOMATION.md §4 supplies eight rules, S1 through S8, pinned here to the e2e domain.

  • S1 least privilege. The test account is dedicated and low-privilege. You do not run e2e under an admin or a real user account.
  • S2 secret isolation. Real keys and passwords are never pinned in code or the repo. Only .env.test.local (gitignored) holds them, and the repo carries only the key list in .env.test.example.
  • S3 input validation. SUT URLs and credentials received from outside are not trusted as is; their format is validated.
  • S4 output audit. If a token or PII shows up in a trace, video, or screenshot, it gets masked. Watch the visibility scope of artifacts.
  • S5 destructive-command block. A db reset or an account deletion runs only after explicit confirmation. The same goes before deleting a track or a directory.
  • S6 environment isolation. Reaching a production SUT or a production DB is absolutely banned. Local, staging, or dedicated test only.
  • S7 log masking. Reports and logs never expose credentials or session tokens.
  • S8 external-communication audit. Name the external endpoints the tests call, and block side effects like real payments or real sends with test mode.
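Three of the rules above (S3, S6, and S7) lend themselves to a short guard. The following is a sketch under stated assumptions: the function names, the example production host list, and the `***` mask are all hypothetical, and the source does not say how production hosts are identified.

```typescript
// Hypothetical list: the source does not say how production is identified.
const PRODUCTION_HOSTS = ["app.example.com"];

// S3: validate the format of an externally supplied SUT URL.
// S6: refuse any host on the production list.
function validateSutUrl(raw: string, productionHosts: string[] = PRODUCTION_HOSTS): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error(`S3: not a valid URL: ${raw}`);
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`S3: unsupported protocol: ${url.protocol}`);
  }
  if (productionHosts.indexOf(url.hostname) !== -1) {
    throw new Error(`S6: production host refused: ${url.hostname}`);
  }
  return url;
}

// S7: replace every occurrence of a known secret before a line reaches a report or log.
function maskSecrets(line: string, secrets: string[]): string {
  let masked = line;
  for (const secret of secrets) {
    if (secret.length > 0) {
      masked = masked.split(secret).join("***");
    }
  }
  return masked;
}
```

An exact-match host list is the simplest possible S6 check; a real setup might prefer an allowlist of local, staging, and test hosts so that unknown hosts are refused by default.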

S6 and S2 are the first rules to block in this domain. Connect to production or leak a real key and the run stops right there.

When /review-e2e returns changes_requested, it names the target track and dimensions and re-calls that track’s implement-*. The reviewer does not touch the code; the builder makes a surgical fix to just that part. This loop runs at most two iterations. Two blocks in a row flip the stage to blocked, and /recover-from-blocked takes over from there. The same recovery flow, at a non-developer pace, is covered in the beginner page when-ai-gets-stuck.
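The review loop can be modeled as a tiny state machine. This sketch reads "two blocks in a row" as: the second consecutive changes_requested verdict ends the loop with the stage set to blocked. That reading, the `approved` verdict name, and all function names are assumptions; only `changes_requested` and `blocked` appear in the source.

```typescript
type Verdict = "approved" | "changes_requested"; // "approved" is an assumed name
type Stage = "approved" | "blocked";

interface Review {
  verdict: Verdict;
  track?: string;        // target track named by the reviewer
  dimensions?: string[]; // rubric dimensions that need work
}

const MAX_BLOCKS = 2; // two blocks in a row flip the stage to blocked

// review stands in for /review-e2e; fix stands in for re-calling the track's implement-*.
function runReviewLoop(
  review: () => Review,
  fix: (track: string, dimensions: string[]) => void,
): Stage {
  let blocks = 0;
  while (true) {
    const result = review();
    if (result.verdict === "approved") {
      return "approved";
    }
    blocks += 1;
    if (blocks >= MAX_BLOCKS) {
      return "blocked"; // hand off to /recover-from-blocked
    }
    // The reviewer never edits code; the builder fixes only the named part.
    fix(result.track ?? "unknown", result.dimensions ?? []);
  }
}
```

Under this reading the builder gets exactly one fix attempt between the two reviews, which keeps the loop from retrying indefinitely.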

/run-suite # run chosen tracks + scenario-language report
/review-e2e # six-dimension synthesis
PM="$(jq -r '.name' .claude/state/package-manager.json)"
$PM exec playwright test --project=chromium # web direct
$PM exec playwright test --project=electron # electron direct (wrap Linux local with xvfb-run)
maestro test tests/mobile/flows # mobile (Maestro)

ci comes next: how .github/workflows/e2e.yml splits by track, web sharding and the blob merge, electron’s Linux-only xvfb, and mobile’s emulator and cloud.