Run any Skill in Manus
with one click


$pwd:

eval-review

Name: Eval Review
Author: opendatahub-io

// Interactive review of evaluation results. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on what the user identifies. Use when the user wants to review eval results, look at results, check scores, see what went wrong, give qualitative feedback on skill outputs, or iterate on a skill based on human judgment rather than automated fixes. Triggers on "review the run", "how did my skill do", "what failed", "look at the eval results", "check the scores". Complements /eval-optimize (automated) with human-in-the-loop review.


$ git log --oneline --stat

stars:15

forks:18

updated:May 29, 2026, 14:02

File explorer

3 files

SKILL.md

readonly

related-skills.json

same repository

eval-dataset.md

from "opendatahub-io/agent-eval-harness"

Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."

2026-05-29 · 15

eval-mlflow.md

from "opendatahub-io/agent-eval-harness"

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

2026-05-29 · 15

eval-run.md

from "opendatahub-io/agent-eval-harness"

Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.

2026-05-29 · 15

eval-optimize.md

from "opendatahub-io/agent-eval-harness"

Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.

2026-05-29 · 15

eval-analyze.md

from "opendatahub-io/agent-eval-harness"

Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.

2026-05-29 · 15

eval-setup.md

from "opendatahub-io/agent-eval-harness"

Optional environment configurator for the agent-eval-harness. Configures MLflow tracking, verifies API keys, and troubleshoots dependency issues. Not required for basic usage — dependencies auto-install via SessionStart hook and agent_eval is available via symlinks. Use when the user wants to configure MLflow tracking, troubleshoot import errors, verify the environment, or set up a remote MLflow server. Also triggers on "configure mlflow", "set up tracking", "ModuleNotFoundError", "mlflow not installed", "missing dependencies", or "check my eval environment".

2026-05-06 · 15

package.json

"author": "opendatahub-io"

"repository": "opendatahub-io/agent-eval-harness"



| Argument | Required | Default | Description |
|---|---|---|---|
| --run-id <id> | yes | - | Which eval run to review |
| --config <path> | no | eval.yaml | Path to eval config |
| --cases <name> [<name> ...] | no | all | Exact case directory names to review |
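The arguments listed above can be illustrated with a small parser. This is only a sketch of the argument contract: the skill itself is invoked by an agent, and the `build_parser` helper and the choice of `argparse` are assumptions made for illustration, not part of the harness. Leaving `--cases` unset is treated here as "all cases".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented eval-review arguments."""
    parser = argparse.ArgumentParser(prog="eval-review")
    # --run-id is the only required argument.
    parser.add_argument("--run-id", required=True,
                        help="Which eval run to review")
    # --config falls back to eval.yaml when omitted.
    parser.add_argument("--config", default="eval.yaml",
                        help="Path to eval config")
    # --cases takes one or more exact case directory names.
    # None stands in for the documented default of "all" (an assumption).
    parser.add_argument("--cases", nargs="+", default=None, metavar="NAME",
                        help="Exact case directory names to review")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--run-id", "run-42", "--cases", "case-001", "case-002"]
)
print(args.run_id, args.config, args.cases)
```

Passing an explicit list to `parse_args` keeps the example deterministic; a real entry point would call `parse_args()` with no arguments.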

run_id: "<id>"
reviewed_cases: <count>
feedback_cases: <count_with_feedback>
reviewer: "human"
feedback:
  case-001-simple-null-pointer-fix: "User's comment about this case"
  case-002-complex-refactor: "Another comment"
  case-003-edge-case: ""  # empty = acceptable
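As a rough illustration of the feedback record shown above, the sketch below assembles the same fields from a mapping of case names to reviewer comments. The function names `build_feedback` and `render_feedback_yaml` are hypothetical, and counting only non-empty comments toward `feedback_cases` is an assumption inferred from the `<count_with_feedback>` placeholder and the "empty = acceptable" note.

```python
import json


def build_feedback(run_id: str, comments: dict[str, str]) -> dict:
    """Assemble a feedback summary (hypothetical helper, not harness API)."""
    return {
        "run_id": run_id,
        "reviewed_cases": len(comments),
        # Assumption: an empty comment means the case was acceptable,
        # so only non-empty comments count as feedback.
        "feedback_cases": sum(1 for c in comments.values() if c.strip()),
        "reviewer": "human",
        "feedback": dict(comments),
    }


def render_feedback_yaml(summary: dict) -> str:
    """Render the summary as YAML text without third-party dependencies.

    json.dumps is used for quoting because JSON double-quoted strings
    are also valid YAML double-quoted scalars.
    """
    lines = [
        f"run_id: {json.dumps(summary['run_id'])}",
        f"reviewed_cases: {summary['reviewed_cases']}",
        f"feedback_cases: {summary['feedback_cases']}",
        f"reviewer: {json.dumps(summary['reviewer'])}",
        "feedback:",
    ]
    for case, comment in summary["feedback"].items():
        lines.append(f"  {case}: {json.dumps(comment)}")
    return "\n".join(lines) + "\n"


summary = build_feedback(
    "run-42",
    {
        "case-001-simple-null-pointer-fix": "Output misses the null check",
        "case-003-edge-case": "",
    },
)
print(render_feedback_yaml(summary))
```

Hand-rolling the YAML keeps the sketch dependency-free; a project that already uses a YAML library would more likely serialize the dictionary with that instead.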


eval-review

Step 0: Parse Arguments

Step 1: Load Results

Step 2: Present Overview

Step 3: Walk Through Cases

Step 4: Check Transcripts (if available)

Step 5: Save Feedback

Step 6: Analyze Patterns

Step 7: Propose Changes

Step 8: Next Steps

Rules


name	eval-review
description	Interactive review of evaluation results. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on what the user identifies. Use when the user wants to review eval results, look at results, check scores, see what went wrong, give qualitative feedback on skill outputs, or iterate on a skill based on human judgment rather than automated fixes. Triggers on "review the run", "how did my skill do", "what failed", "look at the eval results", "check the scores". Complements /eval-optimize (automated) with human-in-the-loop review.
user-invocable	true
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion, Skill
