| name | eval-run |
| description | Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations. |
| user-invocable | true |
| allowed-tools | Read, Write, Edit, Bash, Glob, Grep, Agent, Skill, AskUserQuestion |
You are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.
For the full data flow (dataset → workspace → execution → collection → scoring), see ${CLAUDE_SKILL_DIR}/references/data-pipeline.md. For tool interception mechanics, see ${CLAUDE_SKILL_DIR}/references/tool-interception.md.
Step 0: Parse Arguments and Load Config
Parse $ARGUMENTS:
| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | `eval.yaml` | Path to eval config |
| `--model <model>` | no | `models.skill` from config | Skill model. Required if `models.skill` is unset in eval.yaml. |
| `--subagent-model <model>` | no | `models.subagent` → falls back to skill model | Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7) |
| `--skill <name>` | no | from config | Override the skill to test |
| `--run-id <id>` | no | `YYYY-MM-DD-<model>` | Identifier for this run |
| `--case <filter>` | no | all cases | Substring match to select cases |
| `--baseline <run-id>` | no | — | Previous run to compare against |
| `--no-judge` | no | false | Skip LLM judges, run inline checks only |
| `--gold` | no | false | Save outputs as gold references after run |
| `--effort <level>` | no | `runner.effort` from config | Claude Code reasoning effort (low/medium/high/xhigh/max) |
Check if the config file exists (use the parsed config path, not hardcoded eval.yaml):
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If the config is missing, invoke eval-analyze to bootstrap it:
Use the Skill tool to invoke /eval-analyze [--skill <skill>]
Once the config exists, read it to understand the eval setup — the skill under test, runner, dataset, outputs, judges, models, and any tool interception. The downstream scripts read the same config, so you don't need to pass these fields through; just confirm they're present and warn the user about anything missing or surprising.
If inputs.tools has entries but the skill uses AskUserQuestion or external APIs, verify the handlers cover those tools. Warn the user if a tool the skill uses isn't intercepted — headless execution may hang.
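For orientation, a config of roughly this shape is what the later steps read. The field names below are the ones this document references; the authoritative schema comes from /eval-analyze, so treat the layout and values as illustrative, not canonical:

```yaml
# Illustrative sketch only; field names are taken from this document,
# values and key placement are hypothetical.
skill: invoice-extractor          # skill under test (name hypothetical)
runner:
  type: claude-code
  effort: medium
models:
  skill: claude-sonnet-4-6        # example model id
  subagent: claude-sonnet-4-6     # optional; falls back to models.skill
dataset:
  path: eval/dataset              # directory of case subdirectories
execution:
  arguments: "--invoice {invoice_path}"   # {field} placeholders resolved per case
  parallelism: 2
inputs:
  tools: []                       # tool-interception handlers, if any
outputs: []                       # artifacts to collect per case
judges: []                        # LLM judges run by score.py
traces:
  metrics: true
  stdout: true
  stderr: true
mlflow:
  experiment: skill-evals         # optional; enables Step 8
```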
Persist parsed flags:
mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
gold=<true/false> no_judge=<true/false>
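The keys mirror the flags parsed above; a sketch of what the persisted state file might contain afterwards (values hypothetical):

```yaml
# tmp/eval-config.yaml after state.py init; keys match the key=value pairs passed above.
model: claude-sonnet-4-6
skill: invoice-extractor
run_id: 2026-04-11-sonnet
baseline: 2026-04-10-sonnet   # empty if --baseline was not given
gold: false
no_judge: false
```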
Step 1: Find Dataset
Read dataset.path from eval.yaml. Verify the directory exists and contains at least one case subdirectory:
ls <dataset_path>/ | head -20
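Each case subdirectory holds the per-case inputs used later in the run. A minimal sketch of what one might contain (the file names are referenced elsewhere in this document; field names and values are hypothetical):

```yaml
# <dataset_path>/case-001/input.yaml: fields consumed by {field} placeholders
# in execution.arguments (Step 4); names and values are hypothetical.
invoice_path: fixtures/invoice-001.pdf
customer: "Acme GmbH"
---
# <dataset_path>/case-001/annotations.yaml: optional expected outcomes for
# outcome-aware judges (Step 6); judges see an empty dict when the file is absent.
expected_total: 1042.50
requires_manual_review: false
```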
If --case filter was specified, note it for the workspace step.
If no cases are found, stop and tell the user clearly:
- What path was checked
- That they need test cases there
- Suggest running `/eval-dataset` to generate test cases, or `/eval-analyze --update` to reconfigure the dataset path
Step 2: Preflight Check
Before setting up the workspace, verify the project's artifact directories are clean. Skills write to the project directory (not the workspace), so stale artifacts from previous runs contaminate results — wrong IDs, stale run reports, inflated file counts.
python3 ${CLAUDE_SKILL_DIR}/scripts/preflight.py \
--config <config> \
[--run-id <id>]
The script checks tmp/ state files and whether $AGENT_EVAL_RUNS_DIR/<id> already has results from a previous run.
- If `CLEAN`: proceed to workspace setup.
- If `DIRTY`: report the findings to the user and ask what to do:
  - Force clean: run `preflight.py --clean --force` to delete all stale artifacts, then proceed.
  - Change run-id: append a version suffix (e.g., `2026-04-11-opus-v2`) and re-check. This avoids overwriting previous run results but still requires cleaning project artifacts — re-run preflight with `--clean` and the new run-id.
  - Abort: let the user handle cleanup manually.
Step 3: Prepare Workspace
Create an isolated workspace with the test cases and output directories:
python3 ${CLAUDE_SKILL_DIR}/scripts/workspace.py \
--config <config> \
--run-id <id> \
[--case-filter <filter>]
The script prints WORKSPACE: <path>, CASES: <count>, BATCH: <path>. Report these to the user. If inputs.tools is configured, it also prints HOOKS: N tool interceptors configured.
If the case count is 0, stop — the filter matched nothing.
Step 3b: Resolve Tool Interception (if inputs.tools configured)
If eval.yaml has inputs.tools entries, this step is mandatory. workspace.py emits a skeleton in tool_handlers.yaml; you must resolve each handler's prompt into concrete runtime checks (input_filters, env_checks, case_overrides). Do not skip this even when the eval.yaml is unchanged — the workspace is created fresh each time.
Read ${CLAUDE_SKILL_DIR}/references/tool-interception.md for the full format, field reference, and resolution examples. Then read <workspace>/tool_handlers.yaml, resolve every handler, and write it back.
Critical: any handler with patterns: [Bash, ...] and no input_filters is non-functional and will pass through unchecked.
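As a rough illustration of what a resolved handler might look like. The field names (patterns, input_filters, env_checks, case_overrides) come from this document; the authoritative format lives in references/tool-interception.md, and the key layout and values below are hypothetical:

```yaml
# Hypothetical resolved handler; consult references/tool-interception.md for the
# real schema before writing tool_handlers.yaml.
handlers:
  - patterns: [Bash]
    input_filters:                # without these, a Bash handler passes through unchecked
      - match: "curl *api.example.com*"     # illustrative filter
        respond: '{"status": "ok"}'          # illustrative canned response
    env_checks: []
    case_overrides: {}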
Step 4: Execute Skill
Run the skill headlessly against test cases. In case mode (default), execute.py runs the skill once per case with case-specific arguments and workspace — each case gets its own stdout.log and subagent transcripts. In batch mode, all cases run in a single invocation via batch.yaml.
The execute script handles CLI construction, streaming progress, and result capture:
python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
--config <config> \
--workspace <workspace_path> \
--skill <skill_name> \
--skill-args "<skill arguments>" \
--model <model> \
--output $AGENT_EVAL_RUNS_DIR/<id> \
[--agent <runner>] \
[--subagent-model <model>] \
[--mlflow-experiment <name>] \
[--effort <level>] \
[--parallelism <n>]
Most flags fall back to the config:
- `--agent` falls back to `runner.type` (default `claude-code`).
- `--model` falls back to `models.skill`. If neither is set, execute.py errors out.
- `--mlflow-experiment` falls back to `mlflow.experiment`.
- `--skill-args` falls back to `execution.arguments`. In case mode, `{field}` placeholders are resolved per case from input.yaml (see the sketch after this list).
- `--effort` falls back to `runner.effort` (Claude Code only; ignored by other runners).
- `--parallelism` falls back to `execution.parallelism`. When > 1, cases run concurrently via a thread pool. Each case gets its own log prefix (e.g., `eval:case-003`) so interleaved output is distinguishable.
Override via CLI only when testing different combinations than what the config specifies.
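A quick illustration of per-case placeholder resolution; the template, field name, and path are hypothetical:

```yaml
# execution.arguments in eval.yaml (hypothetical skill flags):
arguments: "--invoice {invoice_path} --strict"
# case-003/input.yaml (hypothetical field):
invoice_path: fixtures/invoice-003.pdf
# In case mode, execute.py resolves the placeholder and runs case-003 with
# skill args equivalent to: --invoice fixtures/invoice-003.pdf --strict
```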
Monitoring Progress
Skill execution can take minutes to hours. Launch execute.py using the Bash tool with run_in_background: true. Do NOT pipe the command through tail, head, grep, or any other filter — piping buffers all output and prevents progress monitoring. The command must be the bare python3 ... execute.py ... invocation with no pipes.
Once launched, the Bash tool returns an output file path. Monitor progress by reading that file periodically:
tail -20 <output_file>
Look for phase markers (## Phase, ## Step, Batch N/M), agent counts (N agents launched, N/M done), and completion signals (Done). Summarize concisely — e.g., "Batch 2/4: review agents 3/5 complete" rather than dumping raw output.
Detecting problems: If the last lines haven't changed across two checks (~2-3 min apart), the pipeline may be stuck. Common signs:
- Repeated `sleep` commands with no progress change → agents may have timed out or crashed
- `ERROR` or `Traceback` in the output → script failure, report immediately
- No new output for 5+ minutes → possible hang, check if the process is still running
- An exit code or `EXIT:` appearing → execution finished (check the code)
When you spot an issue, report it to the user with the relevant output lines rather than waiting for completion.
After execution, check run_result.json for exit_code, duration_s, wall_clock_s, cost_usd, num_turns, and per-model token usage. duration_s is the sum of per-case durations; wall_clock_s is the actual elapsed time (lower when parallelism is used). Read it with cat (it's JSON — state.py would corrupt it to YAML).
cat $AGENT_EVAL_RUNS_DIR/<id>/run_result.json
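The shape to expect: the file itself is JSON; this YAML sketch only illustrates the fields named above, with hypothetical values:

```yaml
# Fields of run_result.json (shown as YAML for readability; values hypothetical).
exit_code: 0
duration_s: 412.7        # sum of per-case durations
wall_clock_s: 228.4      # actual elapsed time; lower when parallelism > 1
cost_usd: 1.87
num_turns: 64
# per-model token usage is also recorded; the exact key layout depends on the runner
```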
If exit_code is non-zero, report the failure with the exit code, duration, and the first few lines of $AGENT_EVAL_RUNS_DIR/<id>/stderr.log. Do not continue to scoring.
Step 5: Collect Artifacts
Distribute workspace outputs into per-case directories so judges can score each case independently:
python3 ${CLAUDE_SKILL_DIR}/scripts/collect.py \
--config <config> \
--workspace <workspace_path> \
--output $AGENT_EVAL_RUNS_DIR/<id>
Read the collection summary (JSON file — do not use state.py on it):
cat $AGENT_EVAL_RUNS_DIR/<id>/collection.json
Report per-case counts. If any case has 0 artifacts, warn — the skill may not have produced output for that case.
Step 6: Score
Run all configured judges against the collected outputs. Skip this step if --no-judge was specified.
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py judges \
--run-id <id> \
--config <config>
Judges receive a record dict with:
- File contents: `outputs["files"]`, `outputs["<dir>_content"]`
- Execution metadata: `outputs["exit_code"]`, `outputs["duration_s"]`, `outputs["cost_usd"]`, `outputs["num_turns"]` (if `traces.metrics` enabled)
- Tool calls: `outputs["tool_calls"]` (if outputs has `tool:` entries)
- Logs: `outputs["stdout"]`, `outputs["stderr"]` (if `traces.stdout`/`traces.stderr` enabled)
- Annotations: `outputs["annotations"]` — parsed annotations.yaml from the dataset case directory (always present, empty dict if no file). Use for outcome-aware scoring where expected results depend on the test case.
This means judges can check output quality, execution efficiency, AND expected outcomes from annotations.
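Put together, the record for one case might look roughly like this; the keys come from the list above, while file names and values are hypothetical:

```yaml
# Hypothetical record passed to a judge for one case.
outputs:
  files: [report.md, totals.csv]
  report_content: "..."           # one <dir>_content entry per collected output dir
  exit_code: 0
  duration_s: 93.2
  cost_usd: 0.41
  num_turns: 17
  tool_calls: []                  # present only if outputs has tool: entries
  stdout: "..."
  stderr: ""
  annotations:
    expected_total: 1042.50       # from the case's annotations.yaml, if any
```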
If --baseline was specified, also run pairwise comparison:
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py pairwise \
--run-id <id> \
--baseline <baseline_id> \
--config <config>
Read the full results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>/summary.yaml
summary.yaml has three sections: judges (per-judge mean and pass_rate), per_case (per-case {value, rationale} per judge), and pairwise (only if --baseline was used: run_a, run_b, wins_a, wins_b, ties).
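A sketch of that structure; section and key names are from this document, while judge names, case ids, scores, and rationales are hypothetical:

```yaml
# Illustrative summary.yaml shape.
judges:
  accuracy:
    mean: 0.78
    pass_rate: 0.8
per_case:
  case-001:
    accuracy:
      value: 1.0
      rationale: "All totals match the annotations."
pairwise:            # only present when --baseline was used
  run_a: 2026-04-11-sonnet
  run_b: 2026-04-10-sonnet
  wins_a: 3
  wins_b: 1
  ties: 1
```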
Step 7: Interpret and Report
Read the summary and analyze the results. Read ${CLAUDE_SKILL_DIR}/prompts/analyze-results.md for the full analysis framework — it covers aggregate assessment, failure patterns, root causes, regressions, cost attribution, and recommendations. Lead with the Recommendation so the call-to-action is the first thing the reader sees. Be decisive — state assessments, not hedges.
Save analysis to file so it persists in the report. Prepend YAML frontmatter recording the agent and model that wrote the analysis, plus the UTC timestamp — the report uses these to attribute the analysis in its subtitle:
cat > $AGENT_EVAL_RUNS_DIR/<id>/analysis.md << 'EOF'
---
agent: Claude Code
model: <your-model-id>
date: <UTC ISO 8601>
---
<your full analysis — Recommendation first, then Summary, Failure Patterns, Root Causes, Regressions, Cost Attribution>
EOF
Write the analysis body as markdown with these sections in order: ## Recommendation (verdict + top actions), ## Summary (aggregate scores, run metrics), ## Failure Patterns, ## Root Causes, ## Regressions (only if --baseline was provided), ## Cost Attribution (always — cite run_metrics plus a derived cost_per_<unit>). The Recommendation must be self-contained — many readers will only read that section. This file is rendered as a prominent callout near the top of the HTML report; the frontmatter is consumed by the report renderer and not displayed verbatim.
Generate HTML report:
python3 ${CLAUDE_SKILL_DIR}/scripts/report.py \
--run-id <id> \
--config <config> \
[--baseline <baseline_id>] \
--open
Tell the user the report is at $AGENT_EVAL_RUNS_DIR/<id>/report.html.
If the `--gold` flag was set: after scoring, copy the collected artifacts into the dataset case directories as reference files. Report which cases were saved.
Suggest next steps (include `--config <config>` if a non-default config was used):
- `/eval-review --run-id <id>` for interactive human review of the results
- `/eval-optimize --model <model>` for automated improvement based on failures
- `/eval-mlflow --run-id <id>` to log results to MLflow
Step 8: Log to MLflow (optional)
If mlflow.experiment is configured in eval.yaml:
Use the Skill tool to invoke /eval-mlflow --action log-results --run-id <id> --config <config>
Rules
- Never read large artifact files into your context — delegate content analysis to agents. The summary.yaml has everything you need for reporting.
- Persist state at every step — use `python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py` so flags and results survive context compression.
- Report progress at each step so the user knows what's happening and how long it's taking.
- Fail fast — if execution fails, report it immediately. Don't continue to scoring with no artifacts.
- Be decisive in analysis — the user wants to know what's wrong and what to do about it, not a list of possibilities.
$ARGUMENTS