eval

Updated: April 27, 2026, 14:26

Run the full evaluation pipeline (execute, judge, report) for an SDK usability benchmark. Use when running a complete benchmark end-to-end, resuming an interrupted pipeline, or checking pipeline status.

Source: PSPDFKit-labs/agentic-usability

Run Full Evaluation Pipeline

Run the complete benchmark pipeline: execute → judge → report.

echo "Arguments: $ARGUMENTS"

Pipeline Stages

Execute: For each test case × target, spins up a sandboxed VM, has an AI agent solve the problem, and extracts the solution files
Judge: For each test case × target, an LLM judge compares the generated solution to the reference solution and scores it
Report: Aggregates all judge scores into a terminal scorecard and writes report.json
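
Because execute and judge both iterate over every test case × target pair, the amount of work per stage is the product of the two lists. The following is a minimal Python sketch of that enumeration; the function name `work_units` and the sample IDs are hypothetical and not part of the agentic-usability CLI.

```python
from itertools import product

# Stage order as described above.
STAGES = ("execute", "judge", "report")


def work_units(test_case_ids, targets):
    """Return every (target, test_case_id) pair that execute and judge each process."""
    return [(target, tc) for target, tc in product(targets, test_case_ids)]


units = work_units(["TC-001", "TC-002", "TC-003"], ["node-20", "node-22"])
print(len(units))  # 3 test cases x 2 targets = 6 units per stage
```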

Options

--resume: Resume from the last checkpoint of an interrupted pipeline
--fresh: Only useful with --resume. Resets pipeline state so the run re-executes from scratch in the same run directory. Does NOT delete result files. Without --resume, a new run always starts fresh anyway.
--label <name>: Human-readable label for this run
--run <runId>: Only used with --resume. Target a specific run instead of auto-detecting the latest incomplete one.
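
The flag dependencies above can be checked before the CLI is invoked. The sketch below assembles an argument list from the options listed here and the `agentic-usability eval -p` invocation shown under Running the Pipeline. The helper name `build_eval_command` is hypothetical, and rejecting `--fresh` or `--run` without `--resume` is a choice made by this helper; the CLI itself may simply ignore them.

```python
def build_eval_command(project_dir, resume=False, fresh=False, label=None, run_id=None):
    """Assemble argv for `agentic-usability eval`, rejecting flag combinations
    that the option descriptions say have no effect."""
    if fresh and not resume:
        raise ValueError("--fresh only has an effect together with --resume")
    if run_id is not None and not resume:
        raise ValueError("--run is only used together with --resume")

    argv = ["agentic-usability", "eval", "-p", project_dir]
    if resume:
        argv.append("--resume")
    if fresh:
        argv.append("--fresh")
    if label is not None:
        argv += ["--label", label]
    if run_id is not None:
        argv += ["--run", run_id]
    return argv
```

The returned list can be passed to `subprocess.run(argv)` as-is, which avoids shell quoting problems with labels that contain spaces, such as "baseline v2".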

Detecting Pipeline Status

Before running, you can check if a pipeline is paused/interrupted by reading the pipeline state file:

Pipeline state location: <project>/results/<runId>/pipeline-state.json

{
  "stage": "execute",
  "startedAt": "2026-04-25T10:30:00.000Z",
  "testCases": 15,
  "completed": {
    "execute": { "node-20": ["TC-001", "TC-002"] },
    "judge": { "node-20": [] }
  }
}

How to check status:

stage is "execute" or "judge" → pipeline is incomplete/paused
stage is "report" → pipeline completed successfully
Compare completed[stage][target].length vs testCases to see progress
No report.json in the run directory → pipeline didn't finish
List runs: look for subdirectories in results/ containing run.json
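
The checks above can be combined into one small reader. This is a sketch that assumes only the `pipeline-state.json` fields shown in the example and the presence or absence of `report.json`; the function name `pipeline_status` and the shape of its return value are illustrative.

```python
import json
from pathlib import Path


def pipeline_status(run_dir):
    """Summarize a run directory using the status rules listed above."""
    run_dir = Path(run_dir)
    state = json.loads((run_dir / "pipeline-state.json").read_text(encoding="utf-8"))
    stage = state["stage"]
    total = state["testCases"]

    # Progress for the current stage: completed test count per target vs. testCases.
    progress = {
        target: f"{len(done)}/{total}"
        for target, done in state["completed"].get(stage, {}).items()
    }
    return {
        "stage": stage,
        # Finished only if the stage is "report" and report.json was written.
        "finished": stage == "report" and (run_dir / "report.json").exists(),
        "progress": progress,
    }
```

For the example state file above, the result would report stage "execute", not finished, with progress "2/15" for node-20.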

Run manifest (results/<runId>/run.json):

{
  "id": "run-2026-04-25T10-30-00-000Z",
  "createdAt": "2026-04-25T10:30:00.000Z",
  "targets": ["node-20"],
  "testCount": 15,
  "label": "baseline v2"
}
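
Listing runs amounts to collecting every `run.json` one level below `results/`. A minimal sketch, assuming the manifest fields shown above; `list_runs` is a hypothetical helper name, and sorting by `createdAt` relies on the timestamps being uniform ISO 8601 strings, which sort chronologically as plain text.

```python
import json
from pathlib import Path


def list_runs(project_dir):
    """Return the run manifests found under <project>/results, oldest first."""
    results = Path(project_dir) / "results"
    if not results.is_dir():
        return []
    manifests = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in results.glob("*/run.json")  # subdirectories without run.json are skipped
    ]
    return sorted(manifests, key=lambda manifest: manifest["createdAt"])
```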

How Resume Works

When --resume is passed:

Finds the latest incomplete run (where stage !== "report"), or uses --run <id>
Loads the saved pipeline state
Skips completed stages entirely (e.g., if stage="judge", execute is skipped)
Within a stage: only runs tests not yet in the completed map for each target
Progress is saved after each individual test, so a crash loses at most the test that was in flight
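
The skip logic can be expressed as a pure function over the saved state. The sketch below is an illustration of the rules listed above, not the tool's actual implementation; `remaining_work` and its parameters are hypothetical names, and the full list of test IDs is assumed to be supplied by the caller.

```python
STAGES = ("execute", "judge", "report")


def remaining_work(state, all_test_ids, targets):
    """Map each unfinished stage to the test IDs still pending for each target."""
    pending = {}
    # Stages before state["stage"] are complete and skipped entirely.
    for stage in STAGES[STAGES.index(state["stage"]):]:
        if stage == "report":
            continue  # report aggregates scores; it has no per-test work
        done_by_target = state.get("completed", {}).get(stage, {})
        pending[stage] = {}
        for target in targets:
            done = set(done_by_target.get(target, []))
            pending[stage][target] = [tc for tc in all_test_ids if tc not in done]
    return pending


state = {
    "stage": "execute",
    "completed": {
        "execute": {"node-20": ["TC-001", "TC-002"]},
        "judge": {"node-20": []},
    },
}
print(remaining_work(state, ["TC-001", "TC-002", "TC-003"], ["node-20"]))
# {'execute': {'node-20': ['TC-003']}, 'judge': {'node-20': ['TC-001', 'TC-002', 'TC-003']}}
```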

Abort Handling

First Ctrl+C: Graceful — finishes current test, saves state, prints "use --resume to continue"
Second Ctrl+C: Hard exit — immediate process termination
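
A two-step abort like this is commonly built from a SIGINT handler that sets a flag on the first signal and exits on the second. The sketch below shows that general pattern in Python; it is not taken from the agentic-usability source, and `AbortController`, `run_tests`, and the callbacks are hypothetical names. It uses `sys.exit(130)` so that the behavior is testable; a truly immediate termination would use `os._exit` instead.

```python
import signal
import sys


class AbortController:
    """First SIGINT requests a graceful stop; a second one exits the process."""

    def __init__(self):
        self.stop_requested = False

    def handle_sigint(self, signum, frame):
        if self.stop_requested:
            sys.exit(130)  # 128 + SIGINT(2), the conventional exit code for Ctrl+C
        self.stop_requested = True
        print("Stopping after the current test; use --resume to continue")

    def install(self):
        signal.signal(signal.SIGINT, self.handle_sigint)


def run_tests(test_ids, run_one, save_state, controller):
    """Run tests in order, checking the abort flag only between tests."""
    finished = []
    for test_id in test_ids:
        if controller.stop_requested:
            break
        run_one(test_id)      # the test in flight always runs to completion
        save_state(test_id)   # state is saved after each individual test
        finished.append(test_id)
    return finished
```

Checking the flag only between tests is what makes the first Ctrl+C graceful: the handler never interrupts a test, it only prevents the next one from starting.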

Running the Pipeline

Run agentic-usability eval -p $ARGUMENTS and monitor the output. If interrupted, suggest --resume to continue.

For detailed pipeline internals, see pipeline-guide.md.

More Skills from the same repository (PSPDFKit-labs/agentic-usability)

init: Initialize a new agentic-usability benchmark pipeline project. Use when setting up a new SDK benchmark, creating a config.json, or starting a new evaluation project. (Updated 2026-05-14)
sandbox: Launch an interactive shell inside a microsandbox for debugging. Supports bare mode, executor setup, or judge setup with optional test case scaffolding. (Updated 2026-05-14)
execute: Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions. (Updated 2026-04-27)
export: Export a benchmark pipeline as a zip file for sharing or archiving. Excludes cache and large snapshots. (Updated 2026-04-27)
generate: Generate SDK usability test cases by exploring source code. Use when creating benchmark test suites, generating test cases for an SDK, or when the user wants to create evaluation scenarios. (Updated 2026-04-27)
insights: Analyze benchmark results and identify SDK improvement areas. Use when reviewing evaluation results, finding failure patterns, identifying documentation gaps, or understanding API design issues. (Updated 2026-04-27)

name: eval
description: Run the full evaluation pipeline (execute, judge, report) for an SDK usability benchmark. Use when running a complete benchmark end-to-end, resuming an interrupted pipeline, or checking pipeline status.
argument-hint: [project-directory] [--resume] [--fresh] [--label name]
disable-model-invocation: true
allowed-tools: Bash(agentic-usability *) Read Glob