Run any Skill in Manus with one click

ab-test

A/B test an agent's current prompt against a candidate variant from policy/<agent>/candidates/. Runs both on the same scenarios, aggregates rewards, recommends promotion if candidate beats current by >1 stderr with n≥10. Triggers on /ab-test, "compare prompts", "test the candidate", "is the new prompt better".

Run Skill in Manus

Stars0

Forks0

UpdatedMay 10, 2026 at 15:17

Source

sethdford

sethdford/claude-code-teams

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

Useful forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

SKILL.md

readonly

More from this repository

same repository

best-of-n

sethdford/claude-code-teams

Run an agent N times in parallel against the same prompt, then aggregate via one of 5 modes — critic argmax, USC consistency, confidence-weighted, hybrid, or AggAgent synthesis. Use for high-stakes invocations where you'd rather pay Nx to be sure. Triggers on /best-of-n, "best of n", "run multiple", "give me three options", "synthesize across rollouts".

2026-05-100

scrum

sethdford/claude-code-teams

Run a complete SCRUM sprint with all ceremonies — Product Owner authors stories, Tech Lead designs, Scrum Master orchestrates implementers, Verifier+Aspect-Panel guard quality, Sprint Auditor adversarially audits, Retro feeds back to /tune-agent. Triggers on /scrum, "run a sprint", "scrum me this", "ship this with full process".

2026-05-100

exec-grounded

sethdford/claude-code-teams

Run an agent N times in parallel against a code-change task, with each rollout in its own sandboxed copy of the codebase, scored by ACTUAL test pass rate (not just critic opinion). Use for SWE-bench-style code changes where the test suite is the ground truth. Triggers on /exec-grounded, "execution-grounded", "verify by running tests", "best-of-N with tests".

2026-05-100

mine-transcripts

sethdford/claude-code-teams

Mine past Claude Code session transcripts (JSONL) to extract user corrections, successful patterns, and recurring failure modes. Proposes diffs against lessons.md and rules/*.md for human review. Triggers on /mine-transcripts, "mine my sessions", "what did I learn this week", "session retro". Use weekly or after a fleet completes.

2026-05-100

verify-ui

sethdford/claude-code-teams

Verify UI changes by capturing before/after screenshots and asking Claude vision to judge whether the change matches intent and didn't break anything else. Mano-verify pattern (arXiv 2509.17336 —

2026-05-100

apply-mining-patches

sethdford/claude-code-teams

Apply the diff/patch outputs from a /mine-transcripts run to lessons.md and rules/*.md after human review. Use after running /mine-transcripts and reviewing the proposed diffs. Triggers on /apply-mining-patches, "apply the mining patches", "apply mining run", "approve and apply the trajectory diffs".

2026-05-100

name	ab-test
description	A/B test an agent's current prompt against a candidate variant from policy/<agent>/candidates/. Runs both on the same scenarios, aggregates rewards, recommends promotion if candidate beats current by >1 stderr with n≥10. Triggers on /ab-test, "compare prompts", "test the candidate", "is the new prompt better".
when_to_use	After agent-tuner produces a candidate. After manual prompt edits to a high-traffic agent. Periodically to detect drift.
allowed-tools	Bash, Read, Edit, Write
arguments	["agent"]

/ab-test — Prompt variant A/B with reward gating

/ab-test <agent> [--candidate <path>] [--runs <k>] [--promote <name>] compares current vs candidate prompts on held-out scenarios and either promotes the winner or surfaces inconclusive.

How

# Run A/B (reads candidates from policy/<agent>/candidates/)
python3 ~/.claude/rl/ab_test.py <agent> --runs 3

# Promote a named candidate (after reviewing the decision.json)
python3 ~/.claude/rl/ab_test.py <agent> --promote <candidate-name>

The runner:

For each scenario in ~/.claude/evals/scenarios/<agent>/, runs the scenario <runs> times against the live agent prompt.
Same for each candidate in ~/.claude/rl/policy/<agent>/candidates/*.md.
Aggregates per variant: n, mean reward, stderr.
Promotion rule: candidate's mean > current's mean + max(current.stderr, 0.05) AND n ≥ 10.
Writes decision.json and prints it.

Output

{
  "agent": "verifier",
  "scenarios": ["happy-path", "ambiguous-contract", "no-tests-exist"],
  "runs_per_scenario": 3,
  "current": {"n": 9, "mean": 0.78, "stderr": 0.12},
  "candidates": {
    "tuner-2026-05-10": {"n": 9, "mean": 0.92, "stderr": 0.08}
  },
  "winner": "tuner-2026-05-10",
  "promote": true,
  "promote_command": "python3 ~/.claude/rl/ab_test.py verifier --promote tuner-2026-05-10"
}

If promote: true, the user runs the command to apply. The previous live agent is archived to policy/<agent>/history/<ts>.md.

Cost

A typical /ab-test runs scenarios × runs × (1 + candidates) claude invocations. For 5 scenarios × 3 runs × 2 variants (current + 1 candidate) = 30 calls. With the budget cap (--max-budget-usd 0.3 per call), worst case ~$10. Run when it matters; don't run nightly.

When NOT to use

No held-out scenarios authored — /ab-test returns immediately with an error. Author scenarios via /eval-author <agent> first.
Candidate prompt is identical to current — useless.
Sample size will be too small (e.g., 1 scenario × 1 run) — no statistical power. Need n ≥ 10 for promotion.

Anti-patterns

Promoting after a single run — too noisy. Always run /ab-test and let the promotion rule gate.
Running A/B with scenarios that exclusively tested the current prompt's strengths — selection bias. Mix scenarios.
Promoting on mean alone without checking stderr — high variance candidates often regress later.