Verification

Once the tests are filled in, you need a single place that checks whether they cover the spec, whether they are flaky, whether the selectors are fragile, and whether any real key has leaked. /run-suite checks behavior; /review-e2e checks static quality. Passing only one of the two does not count as a pass.

Tool	What it looks at
`/run-suite`	Behavior. Runs the chosen tracks and synthesizes the result in critical-flow language
`/review-e2e`	Static. Scores structural quality against the six-dimension rubric

run-suite’s result synthesis

Raw reporter output never reaches the user as is. It is regrouped by the critical flows in e2e-spec.md §2. A pass is one line; a failure lists what failed, where, the likely cause, and the next step.

## Behavioral verification — web

✓ Guest visitor buys Pro from the pricing page → lands on the dashboard
✓ Logged-in user edits their profile in settings
✗ Logged-in user opens billing management (Customer Portal)
   - Location: tests/web/specs/billing.spec.ts:42
   - Message: customer-portal call returned 401
   - Likely cause: missing Authorization header or expired session
   - Next: re-call implement-web-suite — review the session extraction
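
The regrouping above can be sketched roughly as follows. This is a minimal illustration, not the actual /run-suite implementation: the `FlowResult` shape and the function names are hypothetical, and only the output layout (one line per pass, four detail lines per failure) comes from the sample report.

```typescript
// Hypothetical result shape; the real reporter output and the
// e2e-spec.md §2 flow schema may look different.
type FlowResult =
  | { flow: string; status: "passed" }
  | {
      flow: string;
      status: "failed";
      location: string;
      message: string;
      likelyCause: string;
      next: string;
    };

// A pass collapses to one line; a failure expands to what, where,
// the likely cause, and the next step.
function renderFlow(result: FlowResult): string[] {
  if (result.status === "passed") {
    return ["✓ " + result.flow];
  }
  return [
    "✗ " + result.flow,
    "   - Location: " + result.location,
    "   - Message: " + result.message,
    "   - Likely cause: " + result.likelyCause,
    "   - Next: " + result.next,
  ];
}

// Renders every flow in order and joins the lines into one report body.
function renderReport(results: FlowResult[]): string {
  const lines: string[] = [];
  for (const result of results) {
    lines.push(...renderFlow(result));
  }
  return lines.join("\n");
}
```

Keeping the failure fields as required properties of the failed variant means a failure cannot be rendered without a location or a next step.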

Assertion lines are kept only in the report file (run-{ISO}.md), and the full log is written in one piece to .claude/state/last-run.log. When output exceeds 50 lines, only errors and warnings are passed into the context (Silence on Success).
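
The Silence on Success rule can be pictured as a small filter. This is a sketch under assumptions: the 50-line threshold comes from the text, but the markers used to recognize errors and warnings (`error`, `warn`, `✗`) and the function name are illustrative guesses, since reporter formats vary.

```typescript
// Threshold taken from the rule described above.
const MAX_CONTEXT_LINES = 50;

// Assumption: error and warning lines can be recognized by these markers.
const NOISY_LINE = /\b(error|warn)|✗/i;

// Short logs pass through untouched; long logs keep only errors and warnings.
function toContext(rawLog: string): string[] {
  const lines = rawLog.split("\n");
  if (lines.length <= MAX_CONTEXT_LINES) {
    return lines;
  }
  return lines.filter((line) => NOISY_LINE.test(line));
}
```

The full log still needs to be written to disk separately, because this filter discards everything else once the threshold is crossed.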

review-e2e, six dimensions

The rubric’s single source of truth (SSOT) is AI_AUTOMATION.md §7. Each dimension is scored 5, 3, or 1, and passing requires every dimension to score 3 or higher.

Dimension	5 points	1 point
Coverage	Every §2 critical flow has a test	Core flow uncovered
Flakiness	Zero hard sleeps, stable on rerun, retry policy stated	Sleep sprinkled everywhere, unstable
Selector quality	role/label/testID first, zero CSS/XPath	Many XPath and fragile selectors
Test isolation	Context, DB, device state isolated; only auth reused	Order-dependent, global state
Run performance	Parallel and sharding, reasonable time	Excessively slow, redundant
Maintainability	POM reuse, fixture cleanup, clear failure messages	Locators scattered, hard to read
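
The pass rule (every dimension at 3 or higher) is easy to state as code. The sketch below is illustrative only: the dimension keys and the `approved` status name are assumptions, while `changes_requested` and the 5/3/1 scale come from this page.

```typescript
type Score = 5 | 3 | 1;

// Hypothetical keys for the six rubric dimensions.
const DIMENSIONS = [
  "coverage",
  "flakiness",
  "selectorQuality",
  "testIsolation",
  "runPerformance",
  "maintainability",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type Review = Record<Dimension, Score>;

interface Verdict {
  status: "approved" | "changes_requested"; // "approved" is an assumed name
  failing: Dimension[];
}

// A review passes only when no dimension scores below 3.
function verdict(review: Review): Verdict {
  const failing = DIMENSIONS.filter((dimension) => review[dimension] < 3);
  return {
    status: failing.length === 0 ? "approved" : "changes_requested",
    failing,
  };
}
```

Returning the failing dimensions alongside the status keeps the verdict actionable: a high score elsewhere cannot offset a 1, and the caller knows exactly which dimensions need work.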

The verifier does not hand out unearned 5s (Evaluator Calibration). When in doubt, it scores low and attaches a fix suggestion. Flakiness is caught with grep -rn 'waitForTimeout\|sleep', and selector quality is measured by counting CSS/XPath selectors with grep. The result is written as a single page to .claude/state/review-{ISO}.md with per-dimension scores and reasoning.
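
For readers who want to see what the flakiness grep amounts to, here is a rough in-memory equivalent. It is a sketch, not the reviewer's code: the function name and the `files` map (path to file contents) are assumptions, and the pattern simply mirrors the alternation in the grep command.

```typescript
// Same alternation as the grep pattern 'waitForTimeout\|sleep'.
const HARD_WAIT = /waitForTimeout|sleep/;

interface Hit {
  file: string;
  line: number; // 1-based, like grep -n
  text: string;
}

// Scans each file line by line and records every line containing a hard wait.
function findHardWaits(files: Record<string, string>): Hit[] {
  const hits: Hit[] = [];
  for (const file of Object.keys(files)) {
    files[file].split("\n").forEach((text, index) => {
      if (HARD_WAIT.test(text)) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}
```

Like the grep itself, this is a plain text match, so it will also flag comments or unrelated identifiers that happen to contain `sleep`; a hit is a prompt for review rather than proof of a problem.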

Forbidden patterns

AI_AUTOMATION.md §5 lists the patterns the reviewer blocks. Most of them can be caught with grep.

Forbidden	Why	Replacement
Hard `sleep`, `waitForTimeout(n)`	Number-one cause of flakiness	auto-wait, web-first assertion, `waitForResponse`, idle sync (Detox)
CSS/XPath selector overuse	Fragile under DOM changes	`getByRole` → `getByLabel`/`getByText` → last resort `getByTestId` (testID on mobile)
Reaching a production SUT or DB	Data contamination, incidents	local/staging/dedicated test environment (S6)
Hardcoding a webview context ID	Shifts from run to run	Query with `getContextHandles()` then switch
Caching the browser binary (CI)	Version mismatch, OS dependencies	`playwright install --with-deps` every time
Clicking Electron native UI directly	Cannot be automated	`evaluate`/stub the main-process API
Sharing state between tests	Order-dependent, flaky	A fresh context per test; reuse only auth via storageState
Exposing real keys in spec/CI logs	Leak	env and a secret store, masking (S2, S7)

Security baseline S1 through S8

These are the eight lines of AI_AUTOMATION.md §4, applied to the e2e domain.

S1 least privilege. The test account is dedicated and low-privilege. You do not run e2e under an admin or a real user account.
S2 secret isolation. Real keys and passwords are never pinned in code or the repo. Only .env.test.local (gitignored) holds them, and the repo carries only the key list in .env.test.example.
S3 input validation. SUT URLs and credentials received from outside are not trusted as is; their format is validated.
S4 output audit. If a token or PII shows up in a trace, video, or screenshot, it gets masked. Watch the visibility scope of artifacts.
S5 destructive-command block. A db reset or an account deletion runs only after explicit confirmation. The same goes before deleting a track or a directory.
S6 environment isolation. Reaching a production SUT or a production DB is absolutely banned. Local, staging, or dedicated test only.
S7 log masking. Reports and logs never expose credentials or session tokens.
S8 external-communication audit. Name the external endpoints the tests call, and block side effects like real payments or real sends with test mode.
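
As one way to picture S4 and S7, the sketch below replaces known secret values before text reaches a report or log. It assumes the secrets are available as plain strings (for example, values loaded from .env.test.local); the function name and the `***` mask are illustrative choices, not part of the baseline.

```typescript
// Replaces every occurrence of each known secret with a fixed mask.
function maskSecrets(text: string, secrets: string[]): string {
  let masked = text;
  for (const secret of secrets) {
    // Skip empty values: splitting on "" would insert the mask
    // between every character of the text.
    if (secret.length === 0) continue;
    masked = masked.split(secret).join("***");
  }
  return masked;
}
```

This only covers secrets whose exact value is known up front. Tokens minted during the run, or PII inside screenshots and videos, need separate handling.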

S6 and S2 are the first checks enforced in this domain. Connect to production or leak a real key, and the work stops right there.
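
A guard in the spirit of S3 and S6 might look like the following. It is a minimal sketch: the allowlist entries and the function name are hypothetical, and a real project would load its permitted hosts from its own configuration.

```typescript
// Hypothetical allowlist of local, staging, and dedicated test hosts.
const ALLOWED_HOSTS = ["localhost", "127.0.0.1", "staging.example.test"];

// Validates the format (S3) and refuses anything off the allowlist (S6).
function assertSafeSutUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error("S3: not a valid URL: " + raw);
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error("S3: unsupported protocol: " + url.protocol);
  }
  if (ALLOWED_HOSTS.indexOf(url.hostname) === -1) {
    throw new Error("S6: host is not on the test allowlist: " + url.hostname);
  }
  return url;
}
```

An allowlist is used here instead of a blocklist of production hosts, so an unknown host fails closed.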

Sprint Contract and the iteration limit

When /review-e2e returns changes_requested, it names the target track and dimensions and re-calls that track’s implement-*. The reviewer does not touch the code; the builder makes a surgical fix to just that part. This loop runs at most two iterations. Two blocks in a row flip the stage to blocked, and /recover-from-blocked becomes the next command to trigger. The same recovery flow, at a non-developer pace, is covered in the beginner guide when-ai-gets-stuck.
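
The iteration limit can be read as a small loop. The sketch below encodes one interpretation, namely that two consecutive changes_requested results flip the stage to blocked; the `approved` and `done` names, the function name, and the injected `review` and `fix` callbacks are all assumptions made for illustration.

```typescript
type ReviewStatus = "approved" | "changes_requested"; // "approved" is an assumed name
type Stage = "done" | "blocked";

// Two blocks in a row flip the stage to blocked.
const MAX_BLOCKS = 2;

// review and fix are injected so the sketch stays independent of the
// real /review-e2e and implement-* commands.
function runSprint(review: () => ReviewStatus, fix: () => void): Stage {
  let blocks = 0;
  while (true) {
    if (review() === "approved") {
      return "done";
    }
    blocks += 1;
    if (blocks >= MAX_BLOCKS) {
      return "blocked"; // the recovery flow takes over from here
    }
    fix(); // the builder applies a surgical fix, then the review runs again
  }
}
```

Under this reading, the builder gets exactly one fix attempt between the first block and the second.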

Quick reference

/run-suite                    # run chosen tracks + scenario-language report
/review-e2e                   # six-dimension synthesis

PM="$(jq -r '.name' .claude/state/package-manager.json)"
$PM exec playwright test --project=chromium    # web direct
$PM exec playwright test --project=electron    # electron direct (wrap Linux local with xvfb-run)
maestro test tests/mobile/flows                # mobile (Maestro)

CI comes next: how .github/workflows/e2e.yml splits by track, web sharding and the blob merge, electron’s Linux-only xvfb, and mobile’s emulator and cloud.