| name | eval-run |
| description | Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations. |
| user-invocable | true |
| allowed-tools | Read, Write, Edit, Bash, Glob, Grep, Agent, Skill, AskUserQuestion |
You are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.
For the full data flow (dataset → workspace → execution → collection → scoring), see ${CLAUDE_SKILL_DIR}/references/data-pipeline.md. For tool interception mechanics, see ${CLAUDE_SKILL_DIR}/references/tool-interception.md.
Step 0: Parse Arguments and Load Config
Parse $ARGUMENTS:
| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | `eval.yaml` | Path to eval config |
| `--model <model>` | no | `models.skill` from config | Skill model. Required if `models.skill` is unset in eval.yaml. |
| `--subagent-model <model>` | no | `models.subagent` → falls back to skill model | Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7) |
| `--skill <name>` | no | from config | Override the skill to test |
| `--run-id <id>` | no | `YYYY-MM-DD-<model>` | Identifier for this run |
| `--case <filter>` | no | all cases | Substring match to select cases |
| `--baseline <run-id>` | no | — | Previous run to compare against |
| `--no-judge` | no | false | Skip LLM judges, run inline checks only |
| `--gold` | no | false | Save outputs as gold references after run |
| `--effort <level>` | no | `runner.effort` from config | Claude Code reasoning effort (low/medium/high/xhigh/max) |
Check if the config file exists (use the parsed config path, not hardcoded eval.yaml):
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If the config is missing, invoke eval-analyze to bootstrap it:
Use the Skill tool to invoke /eval-analyze [--skill <skill>]
Once the config exists, read it to understand the eval setup — the skill under test, runner, dataset, outputs, judges, models, and any tool interception. The downstream scripts read the same config, so you don't need to pass these fields through; just confirm they're present and warn the user about anything missing or surprising.
If inputs.tools has entries but the skill uses AskUserQuestion or external APIs, verify the handlers cover those tools. Warn the user if a tool the skill uses isn't intercepted — headless execution may hang.
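For orientation, a config of roughly this shape is what the later steps read. The field names below are the ones this document references; the authoritative schema comes from /eval-analyze, so treat the layout and values as illustrative, not canonical:

```yaml
# Illustrative sketch only; field names are taken from this document,
# values and key placement are hypothetical.
skill: invoice-extractor          # skill under test (name hypothetical)
runner:
  type: claude-code
  effort: medium
models:
  skill: claude-sonnet-4-6        # example model id
  subagent: claude-sonnet-4-6     # optional; falls back to models.skill
dataset:
  path: eval/dataset              # directory of case subdirectories
execution:
  arguments: "--invoice {invoice_path}"   # {field} placeholders resolved per case
  parallelism: 2
inputs:
  tools: []                       # tool-interception handlers, if any
outputs: []                       # artifacts to collect per case
judges: []                        # LLM judges run by score.py
traces:
  metrics: true
  stdout: true
  stderr: true
mlflow:
  experiment: skill-evals         # optional; enables Step 8
```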
Persist parsed flags:
mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
gold=<true/false> no_judge=<true/false>
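The keys mirror the flags parsed above; a sketch of what the persisted state file might contain afterwards (values hypothetical):

```yaml
# tmp/eval-config.yaml after state.py init; keys match the key=value pairs passed above.
model: claude-sonnet-4-6
skill: invoice-extractor
run_id: 2026-04-11-sonnet
baseline: 2026-04-10-sonnet   # empty if --baseline was not given
gold: false
no_judge: false
```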
Step 1: Find Dataset
Read dataset.path from eval.yaml. Verify the directory exists and contains at least one case subdirectory:
ls <dataset_path>/ | head -20
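Each case subdirectory holds the per-case inputs used later in the run. A minimal sketch of what one might contain (the file names are referenced elsewhere in this document; field names and values are hypothetical):

```yaml
# <dataset_path>/case-001/input.yaml: fields consumed by {field} placeholders
# in execution.arguments (Step 4); names and values are hypothetical.
invoice_path: fixtures/invoice-001.pdf
customer: "Acme GmbH"
---
# <dataset_path>/case-001/annotations.yaml: optional expected outcomes for
# outcome-aware judges (Step 6); judges see an empty dict when the file is absent.
expected_total: 1042.50
requires_manual_review: false
```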
If --case filter was specified, note it for the workspace step.
If no cases are found, stop and tell the user clearly:
- What path was checked
- That they need test cases there
- Suggest running `/eval-dataset` to generate test cases, or `/eval-analyze --update` to reconfigure the dataset path
Step 2: Preflight Check
Before setting up the workspace, verify the project's artifact directories are clean. Skills write to the project directory (not the workspace), so stale artifacts from previous runs contaminate results — wrong IDs, stale run reports, inflated file counts.
python3 ${CLAUDE_SKILL_DIR}/scripts/preflight.py \
--config <config> \
[--run-id <id>]
The script checks tmp/ state files and whether $AGENT_EVAL_RUNS_DIR/<id> already has results from a previous run.
- If `CLEAN`: proceed to workspace setup.
- If `DIRTY`: report the findings to the user and ask what to do:
  - Force clean: run `preflight.py --clean --force` to delete all stale artifacts, then proceed.
  - Change run-id: append a version suffix (e.g., `2026-04-11-opus-v2`) and re-check. This avoids overwriting previous run results but still requires cleaning project artifacts — re-run preflight with `--clean` and the new run-id.
  - Abort: let the user handle cleanup manually.
Step 3: Prepare Workspace
Create an isolated workspace with the test cases and output directories:
python3 ${CLAUDE_SKILL_DIR}/scripts/workspace.py \
--config <config> \
--run-id <id> \
[--case-filter <filter>]
The script prints WORKSPACE: <path>, CASES: <count>, BATCH: <path>. Report these to the user. If inputs.tools is configured, it also prints HOOKS: N tool interceptors configured.
If the case count is 0, stop — the filter matched nothing.
Step 3b: Resolve Tool Interception (if inputs.tools configured)
If eval.yaml has inputs.tools entries, this step is mandatory. workspace.py emits a skeleton in tool_handlers.yaml; you must resolve each handler's prompt into concrete runtime checks (input_filters, env_checks, case_overrides). Do not skip this even when the eval.yaml is unchanged — the workspace is created fresh each time.
Read ${CLAUDE_SKILL_DIR}/references/tool-interception.md for the full format, field reference, and resolution examples. Then read <workspace>/tool_handlers.yaml, resolve every handler, and write it back.
Critical: any handler with patterns: [Bash, ...] and no input_filters is non-functional and will pass through unchecked.
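As a rough illustration of what a resolved handler might look like. The field names (patterns, input_filters, env_checks, case_overrides) come from this document; the authoritative format lives in references/tool-interception.md, and the key layout and values below are hypothetical:

```yaml
# Hypothetical resolved handler; consult references/tool-interception.md for the
# real schema before writing tool_handlers.yaml.
handlers:
  - patterns: [Bash]
    input_filters:                # without these, a Bash handler passes through unchecked
      - match: "curl *api.example.com*"     # illustrative filter
        respond: '{"status": "ok"}'          # illustrative canned response
    env_checks: []
    case_overrides: {}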
Step 4: Execute Skill
Run the skill headlessly against test cases. In case mode (default), execute.py runs the skill once per case with case-specific arguments and workspace — each case gets its own stdout.log and subagent transcripts. In batch mode, all cases run in a single invocation via batch.yaml.
The execute script handles CLI construction, streaming progress, and result capture:
python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
--config <config> \
--workspace <workspace_path> \
--skill <skill_name> \
--skill-args "<skill arguments>" \
--model <model> \
--output $AGENT_EVAL_RUNS_DIR/<id> \
[--agent <runner>] \
[--subagent-model <model>] \
[--mlflow-experiment <name>] \
[--effort <level>] \
[--parallelism <n>]
Most flags fall back to the config:
- `--agent` falls back to `runner.type` (default `claude-code`).
- `--model` falls back to `models.skill`. If neither is set, execute.py errors out.
- `--mlflow-experiment` falls back to `mlflow.experiment`.
- `--skill-args` falls back to `execution.arguments`. In case mode, `{field}` placeholders are resolved per case from input.yaml (see the sketch after this list).
- `--effort` falls back to `runner.effort` (Claude Code only; ignored by other runners).
- `--parallelism` falls back to `execution.parallelism`. When > 1, cases run concurrently via a thread pool. Each case gets its own log prefix (e.g., `eval:case-003`) so interleaved output is distinguishable.
Override via CLI only when testing different combinations than what the config specifies.
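A quick illustration of per-case placeholder resolution; the template, field name, and path are hypothetical:

```yaml
# execution.arguments in eval.yaml (hypothetical skill flags):
arguments: "--invoice {invoice_path} --strict"
# case-003/input.yaml (hypothetical field):
invoice_path: fixtures/invoice-003.pdf
# In case mode, execute.py resolves the placeholder and runs case-003 with
# skill args equivalent to: --invoice fixtures/invoice-003.pdf --strict
```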
Monitoring Progress
Skill execution can take minutes to hours. Launch execute.py using the Bash tool with run_in_background: true. Do NOT pipe the command through tail, head, grep, or any other filter — piping buffers all output and prevents progress monitoring. The command must be the bare python3 ... execute.py ... invocation with no pipes.
Once launched, the Bash tool returns an output file path. Monitor progress by reading that file periodically:
tail -20 <output_file>
Look for phase markers (## Phase, ## Step, Batch N/M), agent counts (N agents launched, N/M done), and completion signals (Done). Summarize concisely — e.g., "Batch 2/4: review agents 3/5 complete" rather than dumping raw output.
Detecting problems: If the last lines haven't changed across two checks (~2-3 min apart), the pipeline may be stuck. Common signs:
- Repeated `sleep` commands with no progress change → agents may have timed out or crashed
- `ERROR` or `Traceback` in the output → script failure, report immediately
- No new output for 5+ minutes → possible hang, check if the process is still running
- An exit code or `EXIT:` appearing → execution finished (check the code)
When you spot an issue, report it to the user with the relevant output lines rather than waiting for completion.
After execution, check run_result.json for exit_code, duration_s, wall_clock_s, cost_usd, num_turns, and per-model token usage. duration_s is the sum of per-case durations; wall_clock_s is the actual elapsed time (lower when parallelism is used). Read it with cat (it's JSON — state.py would corrupt it to YAML).
cat $AGENT_EVAL_RUNS_DIR/<id>/run_result.json
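The shape to expect: the file itself is JSON; this YAML sketch only illustrates the fields named above, with hypothetical values:

```yaml
# Fields of run_result.json (shown as YAML for readability; values hypothetical).
exit_code: 0
duration_s: 412.7        # sum of per-case durations
wall_clock_s: 228.4      # actual elapsed time; lower when parallelism > 1
cost_usd: 1.87
num_turns: 64
# per-model token usage is also recorded; the exact key layout depends on the runner
```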
If exit_code is non-zero, report the failure with the exit code, duration, and the first few lines of $AGENT_EVAL_RUNS_DIR/<id>/stderr.log. Do not continue to scoring.
Step 5: Collect Artifacts
Distribute workspace outputs into per-case directories so judges can score each case independently:
python3 ${CLAUDE_SKILL_DIR}/scripts/collect.py \
--config <config> \
--workspace <workspace_path> \
--output $AGENT_EVAL_RUNS_DIR/<id>
Read the collection summary (JSON file — do not use state.py on it):
cat $AGENT_EVAL_RUNS_DIR/<id>/collection.json
Report per-case counts. If any case has 0 artifacts, warn — the skill may not have produced output for that case.
Step 6: Score
Run all configured judges against the collected outputs. Skip this step if --no-judge was specified.
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py judges \
--run-id <id> \
--config <config>
Judges receive a record dict with:
- File contents: `outputs["files"]`, `outputs["<dir>_content"]`
- Execution metadata: `outputs["exit_code"]`, `outputs["duration_s"]`, `outputs["cost_usd"]`, `outputs["num_turns"]` (if `traces.metrics` enabled)
- Tool calls: `outputs["tool_calls"]` (if outputs has `tool:` entries)
- Logs: `outputs["stdout"]`, `outputs["stderr"]` (if `traces.stdout`/`traces.stderr` enabled)
- Annotations: `outputs["annotations"]` — parsed annotations.yaml from the dataset case directory (always present, empty dict if no file). Use for outcome-aware scoring where expected results depend on the test case.
This means judges can check output quality, execution efficiency, AND expected outcomes from annotations.
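Put together, the record for one case might look roughly like this; the keys come from the list above, while file names and values are hypothetical:

```yaml
# Hypothetical record passed to a judge for one case.
outputs:
  files: [report.md, totals.csv]
  report_content: "..."           # one <dir>_content entry per collected output dir
  exit_code: 0
  duration_s: 93.2
  cost_usd: 0.41
  num_turns: 17
  tool_calls: []                  # present only if outputs has tool: entries
  stdout: "..."
  stderr: ""
  annotations:
    expected_total: 1042.50       # from the case's annotations.yaml, if any
```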
If --baseline was specified, also run pairwise comparison:
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py pairwise \
--run-id <id> \
--baseline <baseline_id> \
--config <config>
Read the full results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>/summary.yaml
summary.yaml has three sections: judges (per-judge mean and pass_rate), per_case (per-case {value, rationale} per judge), and pairwise (only if --baseline was used: run_a, run_b, wins_a, wins_b, ties).
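A sketch of that structure; section and key names are from this document, while judge names, case ids, scores, and rationales are hypothetical:

```yaml
# Illustrative summary.yaml shape.
judges:
  accuracy:
    mean: 0.78
    pass_rate: 0.8
per_case:
  case-001:
    accuracy:
      value: 1.0
      rationale: "All totals match the annotations."
pairwise:            # only present when --baseline was used
  run_a: 2026-04-11-sonnet
  run_b: 2026-04-10-sonnet
  wins_a: 3
  wins_b: 1
  ties: 1
```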
Step 7: Interpret and Report
Read the summary and analyze the results. Read ${CLAUDE_SKILL_DIR}/prompts/analyze-results.md for the full analysis framework — it covers aggregate assessment, failure patterns, root causes, regressions, cost attribution, and recommendations. Lead with the Recommendation so the call-to-action is the first thing the reader sees. Be decisive — state assessments, not hedges.
Save analysis to file so it persists in the report. Prepend YAML frontmatter recording the agent and model that wrote the analysis, plus the UTC timestamp — the report uses these to attribute the analysis in its subtitle:
cat > $AGENT_EVAL_RUNS_DIR/<id>/analysis.md << 'EOF'
---
agent: Claude Code
model: <your-model-id>
date: <UTC ISO 8601>
---
<your full analysis — Recommendation first, then Summary, Failure Patterns, Root Causes, Regressions, Cost Attribution>
EOF
Write the analysis body as markdown with these sections in order: ## Recommendation (verdict + top actions), ## Summary (aggregate scores, run metrics), ## Failure Patterns, ## Root Causes, ## Regressions (only if --baseline was provided), ## Cost Attribution (always — cite run_metrics plus a derived cost_per_<unit>). The Recommendation must be self-contained — many readers will only read that section. This file is rendered as a prominent callout near the top of the HTML report; the frontmatter is consumed by the report renderer and not displayed verbatim.
Generate HTML report:
python3 ${CLAUDE_SKILL_DIR}/scripts/report.py \
--run-id <id> \
--config <config> \
[--baseline <baseline_id>] \
--open
Tell the user the report is at $AGENT_EVAL_RUNS_DIR/<id>/report.html.
If the `--gold` flag was set: after scoring, copy the collected artifacts into the dataset case directories as reference files. Report which cases were saved.
Suggest next steps (include `--config <config>` if a non-default config was used):
- `/eval-review --run-id <id>` for interactive human review of the results
- `/eval-optimize --model <model>` for automated improvement based on failures
- `/eval-mlflow --run-id <id>` to log results to MLflow
Step 8: Log to MLflow (optional)
If mlflow.experiment is configured in eval.yaml:
Use the Skill tool to invoke /eval-mlflow --action log-results --run-id <id> --config <config>
Rules
- Never read large artifact files into your context — delegate content analysis to agents. The summary.yaml has everything you need for reporting.
- Persist state at every step — use `python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py` so flags and results survive context compression.
- Report progress at each step so the user knows what's happening and how long it's taking.
- Fail fast — if execution fails, report it immediately. Don't continue to scoring with no artifacts.
- Be decisive in analysis — the user wants to know what's wrong and what to do about it, not a list of possibilities.
$ARGUMENTS