eval-analyze

Name: Eval Analyze
Author: opendatahub-io

Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config: execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or has just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?", this is the right starting point.

stars:15

forks:18

updated: 2026-05-29 13:49

File explorer

8 files

SKILL.md

readonly

related-skills.json

Same repository

eval-dataset.md

from "opendatahub-io/agent-eval-harness"

Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."

2026-05-29 · stars: 15

eval-mlflow.md

from "opendatahub-io/agent-eval-harness"

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

2026-05-29 · stars: 15

eval-run.md

from "opendatahub-io/agent-eval-harness"

Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.

2026-05-29 · stars: 15

eval-optimize.md

from "opendatahub-io/agent-eval-harness"

Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.

2026-05-29 · stars: 15

eval-review.md

from "opendatahub-io/agent-eval-harness"

Interactive review of evaluation results. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on what the user identifies. Use when the user wants to review eval results, look at results, check scores, see what went wrong, give qualitative feedback on skill outputs, or iterate on a skill based on human judgment rather than automated fixes. Triggers on "review the run", "how did my skill do", "what failed", "look at the eval results", "check the scores". Complements /eval-optimize (automated) with human-in-the-loop review.

2026-05-29 · stars: 15

eval-setup.md

from "opendatahub-io/agent-eval-harness"

Optional environment configurator for the agent-eval-harness. Configures MLflow tracking, verifies API keys, and troubleshoots dependency issues. Not required for basic usage — dependencies auto-install via SessionStart hook and agent_eval is available via symlinks. Use when the user wants to configure MLflow tracking, troubleshoot import errors, verify the environment, or set up a remote MLflow server. Also triggers on "configure mlflow", "set up tracking", "ModuleNotFoundError", "mlflow not installed", "missing dependencies", or "check my eval environment".

2026-05-06 · stars: 15

package.json

"author": "opendatahub-io"

"repository": "opendatahub-io/agent-eval-harness"

| Argument | Required | Default | Description |
|---|---|---|---|
| `--skill <name>` | | auto-detect | Which skill to analyze |
| `--config <path>` | | `eval.yaml` | Output path for the config |
| `--update` | | false | Fill in missing sections only, preserve user edits |
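The arguments above can be illustrated with a small Python sketch. It is not the harness's own code: the skill is invoked as a slash command and the agent interprets the arguments, so the `argparse` parser, the use of `None` to stand for "auto-detect", and the `fill_missing_sections` helper are all assumptions made for illustration. The helper shows one plausible reading of `--update`: sections the user already has are kept untouched, and only absent top-level sections are taken from the freshly generated config. The section names in the demo come from the skill description; their values are made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three documented arguments."""
    parser = argparse.ArgumentParser(prog="eval-analyze")
    # None stands in for "auto-detect"; how detection works is not shown here.
    parser.add_argument("--skill", metavar="name", default=None,
                        help="Which skill to analyze (auto-detected if omitted)")
    parser.add_argument("--config", metavar="path", default="eval.yaml",
                        help="Output path for the config")
    parser.add_argument("--update", action="store_true",
                        help="Fill in missing sections only, preserve user edits")
    return parser


def fill_missing_sections(existing: dict, generated: dict) -> dict:
    """Return a config that keeps every section in `existing` unchanged
    and adds only the top-level sections that `existing` lacks."""
    merged = dict(existing)
    for section, value in generated.items():
        # setdefault never overwrites a section the user already has.
        merged.setdefault(section, value)
    return merged


if __name__ == "__main__":
    args = build_parser().parse_args(["--skill", "my-skill", "--update"])
    print(args.skill, args.config, args.update)

    # Illustrative values only; the real eval.yaml schema is not shown in this page.
    user_config = {"judges": ["my-custom-judge"]}
    generated_config = {"judges": ["default-judge"], "models": ["model-a"]}
    if args.update:
        print(fill_missing_sections(user_config, generated_config))
```

With `--update`, the user's `judges` section survives and only the missing `models` section is added, which is the behavior the table describes as "preserve user edits".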

eval-analyze

Step 0: Parse Arguments

Step 1: Find the Target Skill

Step 2: Check If Analysis Is Needed

Step 3: Deep-Read the Skill

Step 4: Explore the Dataset

Step 5: Generate eval.yaml

Step 5b: Validate Generated Config

Step 6: Generate eval.md

Step 7: Report

Rules

name	eval-analyze
description	Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.
user-invocable	true
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion
