Run any Skill in Manus with one click

eval-dataset

Name: Eval Dataset
Author: opendatahub-io

Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."

stars: 15

forks: 18

updated: May 29, 2026 at 17:16

File explorer

2 files

SKILL.md

readonly

related-skills.json

same repository

eval-mlflow.md

from "opendatahub-io/agent-eval-harness"

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

updated: 2026-05-29 · stars: 15

eval-run.md

from "opendatahub-io/agent-eval-harness"

Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.

updated: 2026-05-29 · stars: 15

eval-optimize.md

from "opendatahub-io/agent-eval-harness"

Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.

updated: 2026-05-29 · stars: 15

eval-review.md

from "opendatahub-io/agent-eval-harness"

Interactive review of evaluation results. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on what the user identifies. Use when the user wants to review eval results, look at results, check scores, see what went wrong, give qualitative feedback on skill outputs, or iterate on a skill based on human judgment rather than automated fixes. Triggers on "review the run", "how did my skill do", "what failed", "look at the eval results", "check the scores". Complements /eval-optimize (automated) with human-in-the-loop review.

updated: 2026-05-29 · stars: 15

eval-analyze.md

from "opendatahub-io/agent-eval-harness"

Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.

updated: 2026-05-29 · stars: 15

eval-setup.md

from "opendatahub-io/agent-eval-harness"

Optional environment configurator for the agent-eval-harness. Configures MLflow tracking, verifies API keys, and troubleshoots dependency issues. Not required for basic usage — dependencies auto-install via SessionStart hook and agent_eval is available via symlinks. Use when the user wants to configure MLflow tracking, troubleshoot import errors, verify the environment, or set up a remote MLflow server. Also triggers on "configure mlflow", "set up tracking", "ModuleNotFoundError", "mlflow not installed", "missing dependencies", or "check my eval environment".

updated: 2026-05-06 · stars: 15

package.json

"author": "opendatahub-io"

"repository": "opendatahub-io/agent-eval-harness"

| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | | `eval.yaml` | Path to eval config |
| `--count <N>` | | | Number of cases to generate |
| `--strategy <type>` | | `bootstrap` | Generation strategy (see Step 3) |
| `--run-id <id>` | | — | Previous eval run to learn from (used with expand) |
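As an illustration of the argument table, the flags could be parsed roughly as follows. This is a minimal sketch, not the skill's own parsing logic: the function name `parse_args` is hypothetical, treating `--count` as an integer is an assumption, and `--strategy` is left unrestricted because the full list of strategies is not shown in this excerpt.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the eval-dataset flags listed in the argument table.

    Hypothetical helper: defaults come from the table; types are assumptions.
    """
    parser = argparse.ArgumentParser(prog="eval-dataset")
    parser.add_argument("--config", default="eval.yaml", metavar="<path>",
                        help="Path to eval config")
    # No default is listed for --count, so None means "not specified".
    parser.add_argument("--count", type=int, default=None, metavar="<N>",
                        help="Number of cases to generate")
    # Strategy names are not validated here (assumption: see Step 3 for the real set).
    parser.add_argument("--strategy", default="bootstrap", metavar="<type>",
                        help="Generation strategy")
    parser.add_argument("--run-id", default=None, metavar="<id>",
                        help="Previous eval run to learn from (used with expand)")
    return parser.parse_args(argv)
```

For example, `parse_args(["--strategy", "expand", "--run-id", "some-run"])` leaves `config` at `eval.yaml` and `count` unset, which mirrors the defaults in the table.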

# answers.yaml — LLM answerer guidance for this case
dedup_is_duplicate: true
dedup_guidance: >
  This RFE is intentionally a rephrased version of an existing RFE about model
  signature verification. If asked whether existing RFEs cover this need, the
  answer is yes.

# annotations.yaml — fields for outcome-aware judges
dedup_is_duplicate: true  # or false — tells judges whether no RFE is expected
tags: [dedup, high-overlap]
known_issues:
  - dedup should flag this as overlapping with RHAIRFE-1001
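To make the role of `dedup_is_duplicate` concrete, here is a small sketch of how an outcome-aware judge might consume the annotation. The harness's real judge interface is not shown in this excerpt, so the function name `rfe_outcome_matches` and the `rfe_created` flag are hypothetical, and the annotations are written as a plain dict standing in for the parsed `annotations.yaml`.

```python
from typing import Any, Mapping


def rfe_outcome_matches(annotations: Mapping[str, Any], rfe_created: bool) -> bool:
    """Return True when the observed outcome agrees with the annotation.

    Assumption: dedup_is_duplicate == True means no RFE is expected,
    and a missing field means an RFE is expected.
    """
    expect_no_rfe = bool(annotations.get("dedup_is_duplicate", False))
    return rfe_created != expect_no_rfe


# Dict form of the annotations.yaml example above.
annotations = {
    "dedup_is_duplicate": True,
    "tags": ["dedup", "high-overlap"],
    "known_issues": ["dedup should flag this as overlapping with RHAIRFE-1001"],
}
```

With these annotations, a run that created no RFE agrees with the expectation, while a run that created one does not.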

eval-dataset

Step 0: Parse Arguments

Step 1: Read Context

Assess recommended case count

Step 2: Parse Schema into Generation Template

Step 3: Assess Current State

Step 4: Choose Strategy

Step 5: Generate Cases

Step 6: Validate

Step 7: Report

Rules

name	eval-dataset
description	Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."
user-invocable	true
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion

eval-dataset
