$pwd:

run-eval

Name: Run Eval
Author: adam-s

// Run an evaluation against an eval with a specific config

$ git log --oneline --stat

stars:2

forks:1

updated: April 6, 2026, 11:29

SKILL.md

readonly

name	run-eval
description	Run an evaluation against an eval with a specific config
argument-hint	<eval> [config] [--model MODEL] [--budget USD] [--challenge NAME] [--prompt-variant VARIANT] [--keep]

/run-eval — Run an evaluation

Run a Claude agent in a workspace, score the result, then compare against the most recent prior run.

Before Starting

Confirm with the user:

Eval and config to run
Single challenge or all challenges (matrix evals run all by default)
Single run or parallel (default: single)

Do NOT launch until the user confirms.

Arguments

$1 — eval name (directory in evals/)
$2 — config name (directory in evals/<eval>/configs/, default: baseline)
--model <name> — override model (default from EVAL.md frontmatter)
--budget <usd> — override budget
--challenge <name> — run only this challenge (matrix evals)
--prompt-variant <name> — use prompt-<name>.md instead of prompt.md (e.g. --prompt-variant vague)
--keep — keep workspace after completion for inspection
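
As an illustration of how the arguments above fit together, here is a minimal argparse sketch. It is not the actual run_eval.py, whose source is not shown here; build_parser and prompt_filename are hypothetical names, and the defaults simply mirror the list above.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments (not the real script)."""
    parser = argparse.ArgumentParser(prog="run_eval.py")
    parser.add_argument("eval", help="eval name (directory in evals/)")
    parser.add_argument("config", nargs="?", default="baseline",
                        help="config name (directory in evals/<eval>/configs/)")
    parser.add_argument("--model", default=None,
                        help="override the model from EVAL.md frontmatter")
    parser.add_argument("--budget", type=float, default=None,
                        help="override the budget, in USD")
    parser.add_argument("--challenge", default=None,
                        help="run only this challenge (matrix evals)")
    parser.add_argument("--prompt-variant", default=None,
                        help="use prompt-<name>.md instead of prompt.md")
    parser.add_argument("--keep", action="store_true",
                        help="keep the workspace after completion")
    return parser


def prompt_filename(variant: Optional[str]) -> str:
    """Map a prompt variant to its file name; no variant means prompt.md."""
    return f"prompt-{variant}.md" if variant else "prompt.md"


# "my-eval" is a placeholder eval name used only for this example.
args = build_parser().parse_args(["my-eval", "--prompt-variant", "vague", "--keep"])
print(args.config, prompt_filename(args.prompt_variant), args.keep)
# prints: baseline prompt-vague.md True
```

Leaving --model and --budget as None when omitted lets the caller fall back to whatever EVAL.md specifies.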

Steps

Run the evaluation:

python3 scripts/run_eval.py <eval> <config> [--model MODEL] [--budget USD] [--challenge NAME] [--prompt-variant VARIANT] [--keep]

run_eval.py handles everything: EVAL.md parsing, config resolution, challenge iteration, prompt templating, and invoke.py delegation. It always runs in stream mode so the agent's transcript is archived to stream.jsonl for /compare to read.

Use run_in_background: true for the run command. Monitor with python3 scripts/dashboard.py --latest.

After the run finishes, show the result:

python3 scripts/dashboard.py --latest --summary

For each challenge that just ran, find the most recent prior run for the same target/config (if any) and call /compare:

/compare <prior-run-id> <current-run-id>

To find the prior run id: list evals/<eval>/results/ sorted by mtime, skip the current run, take the next one whose events.jsonl shows the same target and config in agent_started. If there is no prior run, skip the comparison and just show the run result.
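
The lookup described above could be sketched roughly as follows. The events.jsonl schema is not documented here, so this assumes each line is a JSON object with an "event" field plus "target" and "config" fields on the agent_started record, and that each run lives in its own directory under results/ named by its run id. All of these are assumptions, as are the function names.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def agent_started_key(run_dir: Path) -> Optional[Tuple[str, str]]:
    """Return (target, config) from the first agent_started event, or None.

    Assumed schema: {"event": "agent_started", "target": ..., "config": ...}
    """
    events_file = run_dir / "events.jsonl"
    if not events_file.is_file():
        return None
    with events_file.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written line
            if isinstance(record, dict) and record.get("event") == "agent_started":
                return (record.get("target"), record.get("config"))
    return None


def find_prior_run(results_dir: Path, current_run_id: str) -> Optional[str]:
    """Return the newest other run with the same target and config, if any."""
    current_key = agent_started_key(results_dir / current_run_id)
    if current_key is None:
        return None
    # Newest first, by directory modification time.
    runs = sorted((p for p in results_dir.iterdir() if p.is_dir()),
                  key=lambda p: p.stat().st_mtime, reverse=True)
    for run_dir in runs:
        if run_dir.name == current_run_id:
            continue  # skip the run that just finished
        if agent_started_key(run_dir) == current_key:
            return run_dir.name
    return None
```

A return value of None corresponds to the "no prior run" case, where the comparison is skipped.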

The /compare skill spawns a sub-agent that reads both runs' evidence (events, transcript, produced artifacts) and writes a markdown summary. Print its output below the run summary so the developer sees both in one place.

If the developer ran multiple challenges (matrix eval), call /compare once per challenge — each comparison is independent.
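
For a matrix eval, the per-challenge comparisons can be pictured as one /compare line per challenge that has a prior run. The sketch below is illustrative only: compare_commands is a hypothetical helper, and the challenge names and run ids are made up.

```python
from typing import Dict, List, Optional


def compare_commands(current_runs: Dict[str, str],
                     prior_runs: Dict[str, Optional[str]]) -> List[str]:
    """Build one /compare invocation per challenge that has a prior run."""
    commands = []
    for challenge, current_id in current_runs.items():
        prior_id = prior_runs.get(challenge)
        if prior_id is None:
            continue  # no prior run: show only the run result
        commands.append(f"/compare {prior_id} {current_id}")
    return commands


# Made-up challenge names and run ids, for illustration only.
current = {"auth": "run-103", "chat": "run-104"}
prior = {"auth": "run-097", "chat": None}
print(compare_commands(current, prior))
# prints: ['/compare run-097 run-103']
```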

related-skills.json

same repository

compare.md

from "adam-s/agent-spec"

Compare two eval runs and report what changed. Reads both runs' events, transcripts, and produced artifacts. Writes a short markdown summary classifying differences as regression, improvement, or neutral.

2026-04-06 (2 stars)

iterate.md

from "adam-s/agent-spec"

Generalized recursive iteration loop. Runs parallel sub-agents against a target, scores deterministically, diagnoses instruction gaps, applies fixes, and recurses until the stop condition is met or max depth is reached.

2026-04-06 (2 stars)

handoff.md

from "adam-s/agent-spec"

Write a handoff document so a new chat can continue the work

2026-04-05 (2 stars)

report.md

from "adam-s/agent-spec"

Show evaluation results and comparisons

2026-04-05 (2 stars)

build.md

from "adam-s/agent-spec"

Test-driven development of a Hono/Bun WebSocket application. Read requirements, read tests, build server, verify, iterate until all tests pass.

2026-04-05 (2 stars)

new-eval.md

from "adam-s/agent-spec"

Scaffold a new evaluation

2026-04-03 (2 stars)

package.json

"author": "adam-s"

"repository": "adam-s/agent-spec"

$ useful --forSOC

Software Quality Assurance Analysts and Testers; Computer and Mathematical Occupations; SOC 15-1253; L4
