$pwd:

eval-skills

Name: Eval Skills
Author: JetBrains

// Run structured evaluations on skills to measure quality and track improvements.

$ git log --oneline --stat

stars:30

forks:7

updated:March 31, 2026 at 17:42

SKILL.md

name	eval-skills
description	Run structured evaluations on skills to measure quality and track improvements.
argument-hint	[skill-name ...] (e.g. local-code-review review-architecture)

Eval Skills

Steps

1. Determine skills to evaluate

If skill names are provided via $ARGUMENTS, evaluate those. Otherwise, list the skills that have an evals/evals.json file and ask the user to pick (accept "all").
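The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the `skills_root` location and both function names are hypothetical, and the only detail taken from the text is that an evaluable skill has an `evals/evals.json` file.

```python
from pathlib import Path


def find_evaluable_skills(skills_root: Path) -> list[str]:
    """Return names of skill directories that contain evals/evals.json.

    skills_root is a hypothetical location; the source does not say
    where skills live on disk.
    """
    if not skills_root.is_dir():
        return []
    return sorted(
        child.name
        for child in skills_root.iterdir()
        if (child / "evals" / "evals.json").is_file()
    )


def select_skills(arguments: list[str], available: list[str], choice: str = "all") -> list[str]:
    """Prefer explicit names from $ARGUMENTS; otherwise apply the user's pick."""
    if arguments:
        return arguments
    if choice.strip().lower() == "all":
        return available
    picked = choice.split()
    return [name for name in available if name in picked]
```

Passing the names through unchanged when arguments are given mirrors the text; whether unknown names should be rejected at this point is left open here.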

2. Create iteration directory

mkdir -p .claude/evals-workspace/iteration-<N>

Use the next sequential number for <N>.
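One way to pick the next sequential number is to scan the workspace for existing `iteration-<N>` directories. The sketch below assumes numbering starts at 1 when no iteration exists yet, which the source does not state; the function names are illustrative.

```python
import re
from pathlib import Path

# Matches directory names such as "iteration-3".
ITERATION_DIR = re.compile(r"^iteration-(\d+)$")


def next_iteration_number(existing_names: list[str]) -> int:
    """Return one more than the highest iteration number found.

    Assumption: the first iteration is numbered 1.
    """
    numbers = [
        int(match.group(1))
        for name in existing_names
        if (match := ITERATION_DIR.match(name))
    ]
    return max(numbers, default=0) + 1


def create_iteration_dir(workspace: Path) -> Path:
    """Create and return workspace/iteration-<N> for the next N."""
    workspace.mkdir(parents=True, exist_ok=True)
    names = [entry.name for entry in workspace.iterdir() if entry.is_dir()]
    target = workspace / f"iteration-{next_iteration_number(names)}"
    target.mkdir()
    return target
```

Calling `create_iteration_dir(Path(".claude/evals-workspace"))` would then play the role of the `mkdir -p` command above.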

3. Run eval cases

For each test case in evals.json, run twice:

With skill: run a subagent with the skill loaded and save its output to iteration-<N>/<skill>-<id>/with_skill/outputs/
Without skill: run a subagent without the skill and save its output to iteration-<N>/<skill>-<id>/without_skill/outputs/

Each run starts with a clean context.
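The output layout for the two runs can be expressed as a small path helper. This is only a sketch of the directory convention described above; `output_dirs` is a hypothetical name, and how the subagents are launched is outside its scope.

```python
from pathlib import Path

# The two run modes named in the text.
MODES = ("with_skill", "without_skill")


def output_dirs(iteration_dir: Path, skill: str, case_id: str) -> dict[str, Path]:
    """Map each run mode to <iteration>/<skill>-<id>/<mode>/outputs."""
    return {
        mode: iteration_dir / f"{skill}-{case_id}" / mode / "outputs"
        for mode in MODES
    }
```

Keeping the two modes under one `<skill>-<id>` directory makes it easy to compare the paired outputs for a single test case.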

4. Grade

Evaluate each assertion against the run's output. Save the results to grading.json:

{
  "assertion_results": [{"text": "...", "passed": true, "evidence": "..."}],
  "summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}
}

Require concrete evidence for every PASS.
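The `summary` object can be derived from `assertion_results` rather than written by hand. The following sketch uses the field names from the JSON above; the `summarize` name, the choice to raise on a PASS with empty evidence, and the 0.0 pass rate for an empty list are assumptions.

```python
def summarize(assertion_results: list[dict]) -> dict:
    """Build the grading.json summary from per-assertion results."""
    for result in assertion_results:
        # Enforce "concrete evidence for every PASS" (assumed to mean non-empty text).
        if result["passed"] and not str(result.get("evidence", "")).strip():
            raise ValueError(f"PASS without evidence: {result['text']!r}")
    total = len(assertion_results)
    passed = sum(1 for result in assertion_results if result["passed"])
    # Assumption: an empty result list yields a pass rate of 0.0.
    pass_rate = passed / total if total else 0.0
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": pass_rate,
    }
```

With three passing assertions and one failing, this reproduces the example summary shown above (3 passed, 1 failed, pass rate 0.75).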

5. Aggregate

Save iteration-<N>/benchmark.json with the mean pass rates (with and without the skill) and the delta between them.
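A possible shape for the aggregation is shown below. The source does not specify the layout of benchmark.json, so the key names here are hypothetical, and the delta is assumed to be the with-skill mean minus the without-skill mean.

```python
from statistics import mean


def build_benchmark(with_rates: list[float], without_rates: list[float]) -> dict:
    """Aggregate per-eval pass rates into mean rates and their difference.

    Key names are illustrative; statistics.mean raises StatisticsError
    if either list is empty.
    """
    with_mean = mean(with_rates)
    without_mean = mean(without_rates)
    return {
        "with_skill_mean_pass_rate": with_mean,
        "without_skill_mean_pass_rate": without_mean,
        # Assumed direction: positive means the skill helped.
        "delta": with_mean - without_mean,
    }
```

The returned dictionary could then be written with `json.dump` to `iteration-<N>/benchmark.json`.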

6. Present results

Show per-eval pass rates, the overall delta, always-pass candidates (remove?), and always-fail candidates (revise?). Save feedback to feedback.json.
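The source does not define "always-pass" and "always-fail" precisely. The sketch below assumes one plausible reading: an assertion that passes both with and without the skill does not discriminate and may be removable, while one that fails in both runs may need revising. The function name and input shape are hypothetical.

```python
def classify_assertions(
    with_skill: dict[str, bool], without_skill: dict[str, bool]
) -> dict[str, list[str]]:
    """Split assertions into always-pass and always-fail candidates.

    Each argument maps assertion text to whether it passed in that run.
    Assertions missing from either run are skipped.
    """
    always_pass: list[str] = []
    always_fail: list[str] = []
    for text, passed_with in with_skill.items():
        passed_without = without_skill.get(text)
        if passed_without is None:
            continue
        if passed_with and passed_without:
            always_pass.append(text)
        elif not passed_with and not passed_without:
            always_fail.append(text)
    return {"always_pass": always_pass, "always_fail": always_fail}
```

Assertions that pass only with the skill are the ones that actually measure its effect, so they appear in neither list.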

Iteration loop

Update SKILL.md based on the findings, run a new iteration, and compare benchmarks. Stop when pass rates plateau.
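The stopping rule can be made concrete with a simple plateau check. The source gives no threshold, so both `window` and `tolerance` below are assumed values chosen for illustration, as is the function name.

```python
def has_plateaued(pass_rates: list[float], window: int = 2, tolerance: float = 0.01) -> bool:
    """Return True when the last `window` iterations each improved by at most `tolerance`.

    pass_rates holds one with-skill mean pass rate per iteration, oldest first.
    window and tolerance are illustrative defaults, not values from the source.
    """
    if len(pass_rates) < window + 1:
        return False  # Not enough history to judge.
    recent = pass_rates[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain <= tolerance for gain in gains)
```

Requiring more than one flat iteration guards against stopping after a single noisy run.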

related-skills.json

same repository

autosteer.md

from "JetBrains/databao-cli"

Run the full development pipeline autonomously without pausing between phases. Stops only on quality-gate failures.

updated: 2026-03-31, stars: 30

check-coverage.md

from "JetBrains/databao-cli"

Run test coverage measurement, analyze results, and fix gaps when coverage falls below the 80% threshold.

updated: 2026-03-31, stars: 30

check-pr-comments.md

from "JetBrains/databao-cli"

Fetch unresolved PR review threads, triage them, implement fixes, validate, reply in-thread, and resolve.

updated: 2026-03-31, stars: 30

create-pr.md

from "JetBrains/databao-cli"

Stage, commit, push, and open a GitHub PR following project conventions. Use when code is ready to ship.

updated: 2026-03-31, stars: 30

local-code-review.md

from "JetBrains/databao-cli"

Review local code changes for correctness, regressions, missing tests, and Databao-specific risks.

updated: 2026-03-31, stars: 30

make-yt-issue.md

from "JetBrains/databao-cli"

Ensure a YouTrack issue exists before starting work. Validates existing tickets or creates new ones.

updated: 2026-03-31, stars: 30

package.json

"author": "JetBrains"

"repository": "JetBrains/databao-cli"

$ useful --forSOC

Software Quality Assurance Analysts and Testers | Computer and Mathematical Occupations | 15-1253 | L4
