بنقرة واحدة
verify-ui
Verify UI changes by capturing before/after screenshots and asking Claude vision to judge whether the change matches intent and didn't break anything else. Mano-verify pattern (arXiv 2509.17336 —
القائمة
Verify UI changes by capturing before/after screenshots and asking Claude vision to judge whether the change matches intent and didn't break anything else. Mano-verify pattern (arXiv 2509.17336 —
Run an agent N times in parallel against the same prompt, then aggregate via one of 5 modes — critic argmax, USC consistency, confidence-weighted, hybrid, or AggAgent synthesis. Use for high-stakes invocations where you'd rather pay Nx to be sure. Triggers on /best-of-n, "best of n", "run multiple", "give me three options", "synthesize across rollouts".
Run a complete SCRUM sprint with all ceremonies — Product Owner authors stories, Tech Lead designs, Scrum Master orchestrates implementers, Verifier+Aspect-Panel guard quality, Sprint Auditor adversarially audits, Retro feeds back to /tune-agent. Triggers on /scrum, "run a sprint", "scrum me this", "ship this with full process".
Run an agent N times in parallel against a code-change task, with each rollout in its own sandboxed copy of the codebase, scored by ACTUAL test pass rate (not just critic opinion). Use for SWE-bench-style code changes where the test suite is the ground truth. Triggers on /exec-grounded, "execution-grounded", "verify by running tests", "best-of-N with tests".
Mine past Claude Code session transcripts (JSONL) to extract user corrections, successful patterns, and recurring failure modes. Proposes diffs against lessons.md and rules/*.md for human review. Triggers on /mine-transcripts, "mine my sessions", "what did I learn this week", "session retro". Use weekly or after a fleet completes.
A/B test an agent's current prompt against a candidate variant from policy/<agent>/candidates/. Runs both on the same scenarios, aggregates rewards, recommends promotion if candidate beats current by >1 stderr with n≥10. Triggers on /ab-test, "compare prompts", "test the candidate", "is the new prompt better".
Apply the diff/patch outputs from a /mine-transcripts run to lessons.md and rules/*.md after human review. Use after running /mine-transcripts and reviewing the proposed diffs. Triggers on /apply-mining-patches, "apply the mining patches", "apply mining run", "approve and apply the trajectory diffs".
| name | verify-ui |
| description | Verify UI changes by capturing before/after screenshots and asking Claude vision to judge whether the change matches intent and didn't break anything else. Mano-verify pattern (arXiv 2509.17336 — |
| when_to_use | After any change to frontend code (HTML/CSS/React/Vue/etc) where the visible behavior matters. Cheaper and broader than running E2E browser tests. NOT for pure logic changes (use /verify or /exec-grounded). |
| allowed-tools | mcp__Claude_Preview__preview_start, mcp__Claude_Preview__preview_screenshot, mcp__Claude_Preview__preview_eval, mcp__Claude_Preview__preview_inspect, mcp__Claude_in_Chrome__tabs_create_mcp, mcp__Claude_in_Chrome__navigate, mcp__Claude_in_Chrome__computer, Read, Bash |
| arguments | ["intent"] |
/verify-ui "<intent>" captures before/after screenshots and asks Claude vision to judge whether the visible change matches the stated intent — and whether anything else changed unexpectedly.
Pattern from Mano-verify (arXiv 2509.17336, Sep 2025) — the verifier reasons over (pre-image, post-image, intent text) jointly, not just text. #1 on OSWorld specialized at 58.2%.
/verify with the verifier agent — runs tests, no UI).claude/launch.json or ask the userpreview_start if not runningpreview_screenshot — save the result. This is the BASELINE.preview_eval with location.reload() or wait for HMR)preview_screenshot again. This is the POST-CHANGE state.You have BEFORE and AFTER images in context plus the intent text. Reason explicitly:
Don't pixel-diff — diff meaningfully. Anti-aliasing, font hinting, browser version drift produce pixel changes that aren't real regressions.
INTENT: <one sentence>
BEFORE: <path to screenshot 1>
AFTER: <path to screenshot 2>
VERDICT: pass | fail | escalate
CONFIDENCE: 0.0 - 1.0
INTENDED_CHANGE: <observed | not observed | partial>
UNEXPECTED_CHANGES: <none | list with descriptions>
Last line: RESULT_verify-ui=PASS|FAIL|ESCALATE
ESCALATE = "I see changes I can't classify as expected or regression — surface to lead."
For high-stakes frontend work (release-blocking, accessibility-critical, brand-affecting), run BOTH:
/verify-ui — visual judge of the rendered change/aspect-panel with accessibility-reviewer agent — code-level a11y checkTogether they catch:
Going beyond pure pixel comparison: preview_inspect returns computed styles + bounding box for any selector. Use this when the user's intent is precise:
height beforeThis is more reliable than pure vision for measurable changes.