
execute

Stars: 19

Forks: 0

Updated: April 27, 2026, 14:26

Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.


Source

PSPDFKit-labs

PSPDFKit-labs/agentic-usability


Execute Test Cases

Run the executor stage of the benchmark pipeline. For each test case and target, this stage:

1. Creates a sandboxed VM from the target image
2. Scaffolds the workspace (template → setup script → per-test setup instructions)
3. Uploads PROBLEM.md with the problem statement
4. Installs the agent CLI inside the sandbox
5. Uploads public sources (docs, packages) into /workspace/sources/
6. Runs the executor agent to solve the problem
7. Extracts solution files from /workspace/solution/
8. Saves all artifacts and destroys the sandbox
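
The per-test lifecycle above can be sketched as an ordered list of steps with guaranteed teardown. This is only an illustration: the step names, `run_test_case`, and the two callbacks are hypothetical, and the choice to destroy the sandbox even when a step fails is an assumption that the text above does not confirm.

```python
from typing import Callable, List

# Step names mirror the list above. They are illustrative labels,
# not identifiers taken from the agentic-usability code base.
STEPS: List[str] = [
    "create_sandbox",
    "scaffold_workspace",
    "upload_problem",
    "install_agent_cli",
    "upload_sources",
    "run_agent",
    "extract_solution",
    "save_artifacts",
]


def run_test_case(
    run_step: Callable[[str], None],
    destroy: Callable[[], None],
) -> List[str]:
    """Run every step in order and return the names of the steps that ran.

    `destroy` is called in a `finally` block, so the sandbox is torn down
    whether or not a step raises (an assumption of this sketch).
    """
    done: List[str] = []
    try:
        for step in STEPS:
            run_step(step)
            done.append(step)
    finally:
        destroy()
    return done
```

Passing the steps in as a callback keeps the ordering logic separate from whatever actually talks to the sandbox.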

echo "Arguments: $ARGUMENTS"

Options

--tests <ids>: Comma-separated test case IDs to run (e.g., --tests TC-001,TC-003)
--run <runId>: Target a specific run directory (default: latest run)
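
As a rough illustration of how these options combine, the helper below assembles an argument list for the CLI. The `-p`, `--tests`, and `--run` flags come from this page; the function name `build_execute_command` and the example project path are hypothetical.

```python
from typing import List, Optional, Sequence


def build_execute_command(
    project_dir: str,
    tests: Optional[Sequence[str]] = None,
    run_id: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `agentic-usability execute`.

    Only the command is built here; nothing is executed.
    """
    cmd = ["agentic-usability", "execute", "-p", project_dir]
    if tests:
        # The CLI expects a single comma-separated value.
        cmd += ["--tests", ",".join(tests)]
    if run_id:
        # When omitted, the latest run is targeted by default.
        cmd += ["--run", run_id]
    return cmd


# Hypothetical project directory, used only for illustration.
command = build_execute_command("./my-benchmark", tests=["TC-001", "TC-003"])
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where the CLI is installed.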

Per-Test Output Files

Saved to results/<runId>/<target>/<testId>/:

| File | Description |
| --- | --- |
| `generated-solution.json` | Agent's solution `[{path, content}]` |
| `agent-notes.md` | Agent's self-reported working notes |
| `agent-output.log` | Raw agent stdout/stderr |
| `agent-cmd.log` | Exact command executed |
| `agent-session.jsonl` | Agent conversation log (if available) |
| `agent-egress.log.json` | Network traffic logs |
| `workspace-snapshot.tar.gz` | Full sandbox workspace tarball |
| `setup.log` | Workspace scaffolding log |
| `agent-error.log` | Error details (only on failure) |
| `install-error.log` | Agent install failure (only on error) |
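
A minimal sketch of reading these artifacts, assuming `generated-solution.json` is a JSON array of objects with string `path` and `content` fields, as the `[{path, content}]` notation suggests. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def load_solution(test_dir: Path) -> Dict[str, str]:
    """Return a {path: content} mapping from generated-solution.json."""
    raw = (test_dir / "generated-solution.json").read_text(encoding="utf-8")
    entries = json.loads(raw)
    return {entry["path"]: entry["content"] for entry in entries}


def has_error_logs(test_dir: Path) -> bool:
    """Report whether either failure-only log exists in the test directory."""
    names = ("agent-error.log", "install-error.log")
    return any((test_dir / name).exists() for name in names)
```

Here `test_dir` would point at one `results/<runId>/<target>/<testId>/` directory.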

Progress Tracking

Progress is tracked in results/<runId>/pipeline-state.json:

completed.execute["<target>"] lists test IDs that have finished
State is saved after each test — safe to interrupt and resume
Use agentic-usability eval --resume to continue from where it stopped

To check which tests completed, read the pipeline state:

results/<runId>/pipeline-state.json → completed.execute.<targetName>
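
The lookup above might be scripted as follows. This is a sketch that assumes `completed.execute` maps each target name to a JSON array of test ID strings; the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Set


def load_state(run_dir: Path) -> dict:
    """Read pipeline-state.json from a results/<runId>/ directory."""
    return json.loads((run_dir / "pipeline-state.json").read_text(encoding="utf-8"))


def completed_tests(state: dict, target: str) -> Set[str]:
    """Return the test IDs recorded under completed.execute[target]."""
    return set(state.get("completed", {}).get("execute", {}).get(target, []))


def pending_tests(state: dict, target: str, all_ids: List[str]) -> List[str]:
    """Return the IDs from all_ids that are not yet completed, in order."""
    done = completed_tests(state, target)
    return [test_id for test_id in all_ids if test_id not in done]
```

Missing keys are treated as "nothing completed yet", so the helpers also work on a state file written before the execute stage has finished any test.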

Retry Behavior

Failed tests are retried up to 2 times with backoffs of 1s and 3s before being marked as failed.
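
One way to read this policy is one initial attempt plus two retries, waiting 1s before the first retry and 3s before the second. The sketch below implements that reading; it is not the tool's actual retry code, and `run_with_retries` is a hypothetical name.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Delays before the first and second retry, per the text above.
BACKOFFS_SECONDS = (1.0, 3.0)


def run_with_retries(
    task: Callable[[], T],
    backoffs: Sequence[float] = BACKOFFS_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task() once, then retry once per backoff entry on failure.

    The last exception is re-raised once every retry is used up, which
    is the point at which a test would be marked as failed.
    """
    for attempt in range(len(backoffs) + 1):
        try:
            return task()
        except Exception:
            if attempt == len(backoffs):
                raise
            sleep(backoffs[attempt])
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behaviour can be checked without actually waiting.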

Concurrency

The number of sandboxes that run in parallel is controlled by sandbox.concurrency in config.json.
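
Bounded parallelism of this kind can be sketched with a thread pool whose size comes from the config. The `sandbox.concurrency` key is from this page; the function `run_all`, the fallback value of 1, and the use of threads are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_all(
    test_ids: List[str],
    run_one: Callable[[str], str],
    config: dict,
) -> Dict[str, str]:
    """Run run_one for every test ID, at most sandbox.concurrency at a time."""
    # Falling back to 1 when the key is missing is an assumption of this sketch.
    limit = int(config.get("sandbox", {}).get("concurrency", 1))
    with ThreadPoolExecutor(max_workers=max(1, limit)) as pool:
        # pool.map yields results in the same order as test_ids.
        results = list(pool.map(run_one, test_ids))
    return dict(zip(test_ids, results))
```

With `{"sandbox": {"concurrency": 2}}`, no more than two calls to `run_one` are in flight at any moment.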

Run agentic-usability execute -p $ARGUMENTS and report the results.

For detailed internals, see pipeline-guide.md.

More Skills from the same repository

init

PSPDFKit-labs/agentic-usability

Initialize a new agentic-usability benchmark pipeline project. Use when setting up a new SDK benchmark, creating a config.json, or starting a new evaluation project.

Updated 2026-05-14, 19 stars

sandbox

PSPDFKit-labs/agentic-usability

Launch an interactive shell inside a microsandbox for debugging. Supports bare mode, executor setup, or judge setup with optional test case scaffolding.

Updated 2026-05-14, 19 stars

eval

PSPDFKit-labs/agentic-usability

Run the full evaluation pipeline (execute, judge, report) for an SDK usability benchmark. Use when running a complete benchmark end-to-end, resuming an interrupted pipeline, or checking pipeline status.

Updated 2026-04-27, 19 stars

export

PSPDFKit-labs/agentic-usability

Export a benchmark pipeline as a zip file for sharing or archiving. Excludes cache and large snapshots.

Updated 2026-04-27, 19 stars

generate

PSPDFKit-labs/agentic-usability

Generate SDK usability test cases by exploring source code. Use when creating benchmark test suites, generating test cases for an SDK, or when the user wants to create evaluation scenarios.

Updated 2026-04-27, 19 stars

insights

PSPDFKit-labs/agentic-usability

Analyze benchmark results and identify SDK improvement areas. Use when reviewing evaluation results, finding failure patterns, identifying documentation gaps, or understanding API design issues.

Updated 2026-04-27, 19 stars

name	execute
description	Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.
argument-hint	[project-directory] [--tests TC-001,TC-002] [--run runId]
disable-model-invocation	true
allowed-tools	Bash(agentic-usability *) Read Glob
