eval-optimize

Name: Eval Optimize
Author: opendatahub-io

Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.

stars: 15

forks: 18

updated: 29 May 2026 at 14:27

File explorer: 2 files

SKILL.md

readonly

related-skills.json

Same repository

eval-dataset.md

from "opendatahub-io/agent-eval-harness"

Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."

2026-05-29 (15 stars)

eval-mlflow.md

from "opendatahub-io/agent-eval-harness"

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

2026-05-29 (15 stars)

eval-run.md

from "opendatahub-io/agent-eval-harness"

Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.

2026-05-29 (15 stars)

eval-review.md

from "opendatahub-io/agent-eval-harness"

Interactive review of evaluation results. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on what the user identifies. Use when the user wants to review eval results, look at results, check scores, see what went wrong, give qualitative feedback on skill outputs, or iterate on a skill based on human judgment rather than automated fixes. Triggers on "review the run", "how did my skill do", "what failed", "look at the eval results", "check the scores". Complements /eval-optimize (automated) with human-in-the-loop review.

2026-05-29 (15 stars)

eval-analyze.md

from "opendatahub-io/agent-eval-harness"

Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.

2026-05-29 (15 stars)

eval-setup.md

from "opendatahub-io/agent-eval-harness"

Optional environment configurator for the agent-eval-harness. Configures MLflow tracking, verifies API keys, and troubleshoots dependency issues. Not required for basic usage — dependencies auto-install via SessionStart hook and agent_eval is available via symlinks. Use when the user wants to configure MLflow tracking, troubleshoot import errors, verify the environment, or set up a remote MLflow server. Also triggers on "configure mlflow", "set up tracking", "ModuleNotFoundError", "mlflow not installed", "missing dependencies", or "check my eval environment".

2026-05-06 (15 stars)

package.json

"author": "opendatahub-io"

"repository": "opendatahub-io/agent-eval-harness"

| Argument | Required | Default | Description |
|---|---|---|---|
| --config <path> | | eval.yaml | Path to eval config |
| --model <model> | | models.skill from eval.yaml | Model to use for eval runs (overrides config default) |
| --max-iterations <N> | | | Stop after N improvement cycles |
| --run-id <id> | | auto-generated | Base run ID (iterations append -iter-N) |
| --target-judge <name> | | all judges | Focus on a specific failing judge |
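The argument table above can be illustrated with a small parser. This is only a sketch of the argument semantics: the skill itself is invoked through an agent rather than a Python CLI, and `build_parser` is a hypothetical name. The source lists no default for `--max-iterations`, so `None` is used here as a placeholder meaning "not specified".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented eval-optimize arguments."""
    parser = argparse.ArgumentParser(prog="eval-optimize")
    parser.add_argument("--config", default="eval.yaml",
                        help="Path to eval config")
    # None stands for "fall back to models.skill from eval.yaml".
    parser.add_argument("--model", default=None,
                        help="Model to use for eval runs (overrides config default)")
    # Assumption: the source gives no default, so None means "not specified".
    parser.add_argument("--max-iterations", type=int, default=None,
                        help="Stop after N improvement cycles")
    # None stands for "auto-generated"; iterations append -iter-N to this base ID.
    parser.add_argument("--run-id", default=None,
                        help="Base run ID")
    # None stands for "all judges".
    parser.add_argument("--target-judge", default=None,
                        help="Focus on a specific failing judge")
    return parser
```

Calling `build_parser().parse_args(["--target-judge", "accuracy"])` would leave every other option at its documented fallback; the judge name `accuracy` is made up for illustration.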

Agent tool, subagent_type="Explore":
"Read the transcript at <path>. The judge '<judge_name>' (<judge_type>) failed this case with rationale: '<rationale from summary.yaml>'

Find evidence explaining WHY this failure happened:
- Where in the transcript did the skill handle (or skip) the relevant task?
- What instructions from SKILL.md led to this behavior?
- Did the skill attempt the right thing but produce wrong output, or skip it entirely?
- If it tried multiple approaches, which one stuck and why?"

Agent tool, subagent_type="Explore": "Read the outputs in $AGENT_EVAL_RUNS_DIR/<id>/cases/<failing_case>/. The judge '<judge_name>' failed with: '<rationale>'. Compare the actual output against this expectation — what specifically is missing or wrong?"

    # Targeted re-run (failing cases only)
    Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --cases <failing-case-id> [<failing-case-id> ...] --baseline <id>-iter-<N-1> --config <config> [--model <model>]

    # Full re-run (all cases): use for final verification
    Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --baseline <id>-iter-<N-1> --config <config> [--model <model>]
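The run-ID and baseline naming in the two re-run templates can be sketched as a small helper. `eval_run_args` is a hypothetical function, not part of the harness; it follows the templates literally, so iteration N always uses `<id>-iter-<N-1>` as its baseline. How the initial run (before iteration 1) is named is not stated in the source, so the `iter-0` baseline produced for N=1 is an assumption.

```python
from typing import List, Optional, Sequence


def eval_run_args(base_run_id: str,
                  iteration: int,
                  config: str = "eval.yaml",
                  model: Optional[str] = None,
                  failing_cases: Optional[Sequence[str]] = None) -> List[str]:
    """Build the /eval-run invocation for one optimize iteration (hypothetical helper)."""
    if iteration < 1:
        raise ValueError("iteration must be >= 1")
    args = ["/eval-run", "--run-id", f"{base_run_id}-iter-{iteration}"]
    # Targeted re-run: restrict to the failing cases. Omit for a full re-run.
    if failing_cases:
        args += ["--cases", *failing_cases]
    # The previous iteration is the baseline, as in the templates above.
    args += ["--baseline", f"{base_run_id}-iter-{iteration - 1}", "--config", config]
    # --model is optional and only overrides the config default when given.
    if model:
        args += ["--model", model]
    return args
```

Passing `failing_cases` yields the targeted form; leaving it out yields the full re-run used for final verification.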

eval-optimize

Step 0: Parse Arguments

Step 1: Initial Eval Run

Step 2: Identify Failures

Step 3: Analyze Root Causes

Step 4: Edit the Skill

Step 5: Re-Run and Verify

Step 6: Handle Regressions

Step 7: Iterate or Report

Rules
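The step outline above reads as an iterate-until-pass loop, which can be sketched roughly as follows. Everything here is hypothetical: the `Results` mapping, the injected callables (`run_eval`, `edit_skill`, `revert_edit`), and in particular the choice to revert an edit when a regression appears, since this outline does not say how Step 6 handles regressions. A regression is taken to mean a check that passed in the baseline and fails in the new run.

```python
from typing import Callable, Dict, Set

# Hypothetical result shape: "case/judge" key -> whether the judge passed.
Results = Dict[str, bool]


def failures(results: Results) -> Set[str]:
    """Keys whose judge did not pass (Step 2)."""
    return {key for key, passed in results.items() if not passed}


def regressions(before: Results, after: Results) -> Set[str]:
    """Keys that passed before and fail now; keys missing from `after` are ignored."""
    return {key for key, passed in before.items()
            if passed and not after.get(key, True)}


def optimize(run_eval: Callable[[int], Results],
             edit_skill: Callable[[Set[str]], None],
             revert_edit: Callable[[], None],
             max_iterations: int) -> Results:
    """Sketch of the loop: eval, find failures, edit, re-run, check regressions."""
    current = run_eval(0)                       # Step 1: initial eval run
    for n in range(1, max_iterations + 1):
        failing = failures(current)             # Step 2: identify failures
        if not failing:
            break                               # all judges pass
        edit_skill(failing)                     # Steps 3-4: analyze and edit
        candidate = run_eval(n)                 # Step 5: re-run and verify
        if regressions(current, candidate):     # Step 6: assumed policy is revert
            revert_edit()
            continue
        current = candidate
    return current                              # Step 7: report the last accepted results
```

Injecting the eval and edit steps as callables keeps the control flow testable without a real harness; in the actual skill those steps are carried out by the agent and by /eval-run.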

name	eval-optimize
description	Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.
user-invocable	true
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, Agent, Skill, AskUserQuestion
