Run any Skill in Manus with one click

exec-grounded

Run an agent N times in parallel against a code-change task, with each rollout in its own sandboxed copy of the codebase, scored by ACTUAL test pass rate (not just critic opinion). Use for SWE-bench-style code changes where the test suite is the ground truth. Triggers on /exec-grounded, "execution-grounded", "verify by running tests", "best-of-N with tests".

Run Skill in Manus

Stars0

Forks0

UpdatedMay 10, 2026 at 16:07

Source

sethdford

sethdford/claude-code-teams

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

Useful forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

SKILL.md

readonly

More from this repository

same repository

best-of-n

sethdford/claude-code-teams

Run an agent N times in parallel against the same prompt, then aggregate via one of 5 modes — critic argmax, USC consistency, confidence-weighted, hybrid, or AggAgent synthesis. Use for high-stakes invocations where you'd rather pay Nx to be sure. Triggers on /best-of-n, "best of n", "run multiple", "give me three options", "synthesize across rollouts".

2026-05-100

scrum

sethdford/claude-code-teams

Run a complete SCRUM sprint with all ceremonies — Product Owner authors stories, Tech Lead designs, Scrum Master orchestrates implementers, Verifier+Aspect-Panel guard quality, Sprint Auditor adversarially audits, Retro feeds back to /tune-agent. Triggers on /scrum, "run a sprint", "scrum me this", "ship this with full process".

2026-05-100

mine-transcripts

sethdford/claude-code-teams

Mine past Claude Code session transcripts (JSONL) to extract user corrections, successful patterns, and recurring failure modes. Proposes diffs against lessons.md and rules/*.md for human review. Triggers on /mine-transcripts, "mine my sessions", "what did I learn this week", "session retro". Use weekly or after a fleet completes.

2026-05-100

verify-ui

sethdford/claude-code-teams

Verify UI changes by capturing before/after screenshots and asking Claude vision to judge whether the change matches intent and didn't break anything else. Mano-verify pattern (arXiv 2509.17336 —

2026-05-100

ab-test

sethdford/claude-code-teams

A/B test an agent's current prompt against a candidate variant from policy/<agent>/candidates/. Runs both on the same scenarios, aggregates rewards, recommends promotion if candidate beats current by >1 stderr with n≥10. Triggers on /ab-test, "compare prompts", "test the candidate", "is the new prompt better".

2026-05-100

apply-mining-patches

sethdford/claude-code-teams

Apply the diff/patch outputs from a /mine-transcripts run to lessons.md and rules/*.md after human review. Use after running /mine-transcripts and reviewing the proposed diffs. Triggers on /apply-mining-patches, "apply the mining patches", "apply mining run", "approve and apply the trajectory diffs".

2026-05-100

name	exec-grounded
description	Run an agent N times in parallel against a code-change task, with each rollout in its own sandboxed copy of the codebase, scored by ACTUAL test pass rate (not just critic opinion). Use for SWE-bench-style code changes where the test suite is the ground truth. Triggers on /exec-grounded, "execution-grounded", "verify by running tests", "best-of-N with tests".
when_to_use	When you have a clear test command and a code change to implement. The test suite IS the verifier — agent rollouts that don't make tests green get scored down. Top SWE-bench leaders spend 60-80% of compute on this kind of verification.
allowed-tools	Bash, Read
arguments	["agent","prompt","target-dir","test-cmd","n"]

/exec-grounded — Best-of-N where tests decide

/exec-grounded <agent> "<prompt>" --target-dir <dir> --test-cmd "<cmd>" [--n 3]

Mirrors the SWE-bench-leader pattern: each candidate is executed, not just judged. The test exit code is the dominant signal (0.7 weight); critic score is the tiebreaker (0.3).

How

python3 ~/.claude/rl/exec_grounded.py <agent> "<prompt>" \
  --target-dir ./src \
  --test-cmd "pytest -q" \
  --n 3

Per rollout:

Stage: cp -r ./src /tmp/bon-<ts>/work-<idx>/ (skips node_modules, .git, pycache, etc.)
Run: Claude works in /tmp/bon-<ts>/work-<idx>/ (passed via --add-dir); makes whatever edits the agent decides
Verify in sandbox: sandbox_run.py --cwd <workdir> -- <test-cmd> — OS-level isolation (no writes outside workdir, no network unless --allow-network)
Parse test output: pytest / jest / vitest / go test patterns recognized; falls back to exit-code check
Score: test_score * 0.7 + critic_score * 0.3
- test_score = 1.0 (all green), -1.0 (all red), 0.0 (no tests found), (passed - failed)/total (partial)
- critic_score: parsed from RESULT_critic=CLEAN|HAS_FINDINGS_n_n if present

Output

{
  "n": 3,
  "winner_idx": 1,
  "winner_score": 0.85,
  "winner_test_stats": {"passed": 12, "failed": 0, "total": 12, "runner": "pytest"},
  "spread": 1.10,
  "out_dir": "~/.claude/rl/exec-grounded-runs/<ts>"
}

out_dir/decision.json has the full per-rollout breakdown including stdout/stderr excerpts. The winner's workdir is preserved for inspection or merging back.

Test runners recognized

pytest: N passed, M failed, K errors in T s
jest / vitest: Tests: A failed, B passed, C total
go test: --- PASS: / --- FAIL: line counts

For other runners (rspec, mocha, cargo test), the test_score falls back to exit-code-only (1.0 if exit 0, 0.0 otherwise). Add a parser to parse_test_output in rl/exec_grounded.py to extend.

Cost discipline

N rollouts × --max-budget-usd (default $1) = up to $N total per call
Plus N test runs (typically <$0.001 each — they're local subprocess)
Budget gate via --max-budget-usd 0.30 per rollout for routine use

When NOT to use

No test suite (/exec-grounded collapses to plain best-of-N — use that instead)
Test suite takes >5 minutes (sandbox timeout is 600s; one slow run blocks parallelism)
Single-line trivial fix (overhead > value)
Tests have no isolation (e.g., they hit a real DB you can't sandbox-network) — surface failure mode

Anti-patterns

Setting --allow-network on every run — defeats the isolation
Running with N=8 by default — 3 is a strong baseline; escalate only if disagreement is high
Treating green tests as final correctness — tests can be incomplete; pair with /aspect-panel on the winner's diff for a second read
Picking the rollout by score and then trying to merge by hand — the winner.workdir IS the change; copy from there