بنقرة واحدة
بنقرة واحدة
Compare two eval runs and report what changed. Reads both runs' events, transcripts, and produced artifacts. Writes a short markdown summary classifying differences as regression, improvement, or neutral.
Generalized recursive iteration loop. Runs parallel sub-agents against a target, scores deterministically, diagnoses instruction gaps, applies fixes, and recurses until the stop condition is met or max depth is reached.
Run an evaluation against an eval with a specific config
Write a handoff document so a new chat can continue the work
Test-driven development of a Hono/Bun WebSocket application. Read requirements, read tests, build server, verify, iterate until all tests pass.
Scaffold a new evaluation
| name | report |
| description | Show evaluation results and comparisons |
| argument-hint | ["run_id | --latest | --all | --summarize <eval>"] |
/report or /report --latest — Dashboard summary of the most recent run/report <run_id> — Dashboard summary of a specific run/report --all — Full comparison report across all runs/report --summarize <eval_name> — Generate eval summary (terminal + RESULTS.md)Parse the arguments and dispatch:
If args contain --summarize, extract the eval name and run:
python3 "$CLAUDE_PROJECT_DIR/scripts/summarize.py" <eval_name> --filter-eval
If args contain --all, run:
python3 "$CLAUDE_PROJECT_DIR/scripts/report.py" --all
If --group-by is also present, pass it through.
If args contain a run_id (8+ hex chars), run:
python3 "$CLAUDE_PROJECT_DIR/scripts/dashboard.py" <run_id> --summary
Otherwise (no args or --latest):
python3 "$CLAUDE_PROJECT_DIR/scripts/dashboard.py" --latest --summary