eval-run

Name: Eval Run
Author: opendatahub-io

Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.


stars: 15

forks: 18

updated: May 29, 2026, 17:16

File explorer

13 files

SKILL.md

readonly

related-skills.json

same repository

eval-dataset.md

from "opendatahub-io/agent-eval-harness"

Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."

updated: 2026-05-29 · stars: 15

eval-mlflow.md

from "opendatahub-io/agent-eval-harness"

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

updated: 2026-05-29 · stars: 15

eval-optimize.md

from "opendatahub-io/agent-eval-harness"

Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.

updated: 2026-05-29 · stars: 15

eval-review.md

from "opendatahub-io/agent-eval-harness"

Interactive review of evaluation results. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on what the user identifies. Use when the user wants to review eval results, look at results, check scores, see what went wrong, give qualitative feedback on skill outputs, or iterate on a skill based on human judgment rather than automated fixes. Triggers on "review the run", "how did my skill do", "what failed", "look at the eval results", "check the scores". Complements /eval-optimize (automated) with human-in-the-loop review.

updated: 2026-05-29 · stars: 15

eval-analyze.md

from "opendatahub-io/agent-eval-harness"

Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.

updated: 2026-05-29 · stars: 15

eval-setup.md

from "opendatahub-io/agent-eval-harness"

Optional environment configurator for the agent-eval-harness. Configures MLflow tracking, verifies API keys, and troubleshoots dependency issues. Not required for basic usage — dependencies auto-install via SessionStart hook and agent_eval is available via symlinks. Use when the user wants to configure MLflow tracking, troubleshoot import errors, verify the environment, or set up a remote MLflow server. Also triggers on "configure mlflow", "set up tracking", "ModuleNotFoundError", "mlflow not installed", "missing dependencies", or "check my eval environment".

updated: 2026-05-06 · stars: 15

package.json

"author": "opendatahub-io"

"repository": "opendatahub-io/agent-eval-harness"


| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | | `eval.yaml` | Path to eval config |
| `--model <model>` | | `models.skill` from config | Skill model. Required if `models.skill` is unset in eval.yaml. |
| `--subagent-model <model>` | | `models.subagent`, falling back to the skill model | Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7) |
| `--skill <name>` | | from config | Override the skill to test |
| `--run-id <id>` | | `YYYY-MM-DD-<model>` | Identifier for this run |
| `--cases <id> [<id> ...]` | | all cases | Exact case IDs to run |
| `--baseline <run-id>` | | (none) | Previous run to compare against |
| `--no-llm-judges` | | false | Skip LLM judges (prompt, prompt_file, LLM builtins). Run deterministic judges (check, Python builtins, external code). |
| `--gold` | | false | Save outputs as gold references after run |
| `--effort <level>` | | `runner.effort` from config | Claude Code reasoning effort (Claude Code only; ignored by other runners) |

mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
  model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
  gold=<true/false> no_llm_judges=<true/false>

python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
  --config <config> \
  --workspace <workspace_path> \
  --skill <skill_name> \
  --skill-args "<skill arguments>" \
  --model <model> \
  --output $AGENT_EVAL_RUNS_DIR/<id> \
  [--agent <runner>] \
  [--subagent-model <model>] \
  [--mlflow-experiment <name>] \
  [--effort <level>] \
  [--parallelism <n>]
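When the execute command is assembled from a script rather than typed by hand, building it as an argument list avoids shell-quoting problems with `--skill-args`. The sketch below uses only the flags shown in the command above; `build_execute_cmd` is a hypothetical helper, not part of the harness, and it assumes the bracketed flags are simply left out when unset.

```python
from pathlib import Path

# Optional flags from the command above, in the order they are listed.
OPTIONAL_FLAGS = ("agent", "subagent_model", "mlflow_experiment", "effort", "parallelism")


def build_execute_cmd(skill_dir, runs_dir, run_id, *, config, workspace,
                      skill, skill_args, model, **optional) -> list[str]:
    """Return an argv list for scripts/execute.py (hypothetical helper)."""
    cmd = [
        "python3", str(Path(skill_dir) / "scripts" / "execute.py"),
        "--config", config,
        "--workspace", workspace,
        "--skill", skill,
        "--skill-args", skill_args,  # passed as one argv entry, no extra quoting
        "--model", model,
        "--output", str(Path(runs_dir) / run_id),
    ]
    for name in OPTIONAL_FLAGS:
        value = optional.get(name)
        if value is not None:
            # subagent_model -> --subagent-model, and so on.
            cmd += [f"--{name.replace('_', '-')}", str(value)]
    return cmd
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`; `skill_dir` and `runs_dir` would typically come from the `CLAUDE_SKILL_DIR` and `AGENT_EVAL_RUNS_DIR` environment variables.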

cat > $AGENT_EVAL_RUNS_DIR/<id>/analysis.md << 'EOF'
---
agent: Claude Code        # the agent/runtime writing this analysis (e.g. Claude Code)
model: <your-model-id>    # e.g. claude-opus-4-7, claude-sonnet-4-6; the model backing the agent
date: <UTC ISO 8601>      # e.g. 2026-04-17T14:32:11Z
---
<your full analysis: Recommendation first, then Summary, Failure Patterns, Root Causes, Regressions>
EOF


eval-run

Step 0: Parse Arguments and Load Config

Step 1: Find Dataset

Step 2: Preflight Check

Step 3: Prepare Workspace

Step 3b: Resolve Tool Interception (if `inputs.tools` configured)

Step 4: Execute Skill

Monitoring Progress

Step 5: Collect Artifacts

Step 6: Score

Step 7: Interpret and Report

Step 8: Log to MLflow (optional)

Rules


name	eval-run
description	Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.
user-invocable	true
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, Agent, Skill, AskUserQuestion
