execute

Last updated: 27 April 2026 at 14:26

Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.

Source: PSPDFKit-labs/agentic-usability

SKILL.md

name	execute
description	Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.
argument-hint	[project-directory] [--tests TC-001,TC-002] [--run runId]
disable-model-invocation	true
allowed-tools	Bash(agentic-usability *) Read Glob

Execute Test Cases

Run the executor stage of the benchmark pipeline. For each test case and target, this stage:

1. Creates a sandboxed VM from the target image
2. Scaffolds workspace (template → setup script → per-test setup instructions)
3. Uploads PROBLEM.md with the problem statement
4. Installs the agent CLI inside the sandbox
5. Uploads public sources (docs, packages) into /workspace/sources/
6. Runs the executor agent to solve the problem
7. Extracts solution files from /workspace/solution/
8. Saves all artifacts and destroys the sandbox

echo "Arguments: $ARGUMENTS"

Options

- `--tests <ids>`: Comma-separated test case IDs to run (e.g., `--tests TC-001,TC-003`)
- `--run <runId>`: Target a specific run directory (default: latest run)
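
A wrapper script might assemble the invocation from these options roughly as follows. This is a minimal Python sketch, not part of the tool: `build_execute_command` is a hypothetical helper, and only the `-p`, `--tests`, and `--run` flags are taken from this page.

```python
from typing import List, Optional, Sequence


def build_execute_command(
    project_dir: str,
    tests: Optional[Sequence[str]] = None,
    run_id: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `agentic-usability execute` (hypothetical helper)."""
    cmd = ["agentic-usability", "execute", "-p", project_dir]
    if tests:
        # --tests takes one comma-separated value, e.g. TC-001,TC-003
        cmd += ["--tests", ",".join(tests)]
    if run_id:
        # Without --run, the latest run directory is targeted
        cmd += ["--run", run_id]
    return cmd


print(build_execute_command("./my-benchmark", tests=["TC-001", "TC-003"]))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with the project path.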

Per-Test Output Files

Saved to results/<runId>/<target>/<testId>/:

| File | Description |
| --- | --- |
| `generated-solution.json` | Agent's solution `[{path, content}]` |
| `agent-notes.md` | Agent's self-reported working notes |
| `agent-output.log` | Raw agent stdout/stderr |
| `agent-cmd.log` | Exact command executed |
| `agent-session.jsonl` | Agent conversation log (if available) |
| `agent-egress.log.json` | Network traffic logs |
| `workspace-snapshot.tar.gz` | Full sandbox workspace tarball |
| `setup.log` | Workspace scaffolding log |
| `agent-error.log` | Error details (only on failure) |
| `install-error.log` | Agent install failure (only on error) |
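
To inspect one per-test directory after a run, something like the following Python sketch could be used. It assumes only what the file list states: `generated-solution.json` holds a list of `{path, content}` objects, and the two error logs exist only when something failed. The function name `summarize_test_output` is hypothetical.

```python
import json
from pathlib import Path


def summarize_test_output(test_dir: Path) -> dict:
    """Summarize one results/<runId>/<target>/<testId>/ directory."""
    solution_file = test_dir / "generated-solution.json"
    solution_paths = []
    if solution_file.exists():
        # The file holds a list of {path, content} objects
        entries = json.loads(solution_file.read_text(encoding="utf-8"))
        solution_paths = [entry["path"] for entry in entries]
    return {
        "solution_paths": solution_paths,
        # These two logs are only written when something went wrong
        "agent_failed": (test_dir / "agent-error.log").exists(),
        "install_failed": (test_dir / "install-error.log").exists(),
    }
```

A missing `generated-solution.json` is reported as an empty path list rather than an error, so the caller can still check the failure flags.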

Progress Tracking

Progress is tracked in results/<runId>/pipeline-state.json:

- `completed.execute["<target>"]` lists test IDs that have finished
- State is saved after each test, so it is safe to interrupt and resume
- Use `agentic-usability eval --resume` to continue from where it stopped

To check which tests completed, read the pipeline state:

results/<runId>/pipeline-state.json → completed.execute.<targetName>
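
Reading that state from a script could look like the Python sketch below. It assumes the JSON nests as `completed` → `execute` → target name → list of test IDs, as described above; any other keys in the file are ignored. The helper names and the `my-target` example value are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def completed_tests(state: dict) -> Dict[str, List[str]]:
    """Return {target: [test IDs]} from a parsed pipeline-state.json."""
    # Treating missing keys as "nothing completed yet" is an assumption
    return dict(state.get("completed", {}).get("execute", {}))


def load_completed(run_dir: Path) -> Dict[str, List[str]]:
    """Load results/<runId>/pipeline-state.json and extract completed tests."""
    state_file = run_dir / "pipeline-state.json"
    return completed_tests(json.loads(state_file.read_text(encoding="utf-8")))


# Hypothetical state contents, for illustration only
example_state = {"completed": {"execute": {"my-target": ["TC-001", "TC-002"]}}}
print(completed_tests(example_state))
```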

Retry Behavior

Failed tests are retried up to 2 times with backoffs of 1s and 3s before being marked as failed.
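
The retry schedule can be illustrated with a short Python sketch. This is not the tool's implementation; it only shows one initial attempt followed by two retries with 1 s and 3 s waits. `run_with_retries` is a hypothetical name, and the `sleep` parameter exists so the waits can be stubbed out.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    backoffs: Sequence[float] = (1.0, 3.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task once, then retry once per backoff entry before giving up."""
    attempts = len(backoffs) + 1  # one initial attempt plus one retry per backoff
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == len(backoffs):
                raise  # retries exhausted: the caller marks the test as failed
            sleep(backoffs[attempt])
    raise AssertionError("unreachable")
```

With the default schedule, a task that always fails is called three times in total and the last exception propagates to the caller.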

Concurrency

The number of sandboxes that run in parallel is controlled by sandbox.concurrency in config.json.
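
As an illustration of bounded parallelism driven by that setting, here is a Python sketch using a thread pool. The tool itself may schedule sandboxes differently; `run_tests_in_parallel` and `run_one` are hypothetical names, and the fallback to a single worker when the key is absent is an assumption rather than documented behaviour.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_tests_in_parallel(
    test_ids: Iterable[str],
    run_one: Callable[[str], str],
    config: dict,
) -> Dict[str, str]:
    """Run run_one for each test ID with at most sandbox.concurrency workers."""
    # Assumed fallback: one worker if sandbox.concurrency is missing or below 1
    concurrency = max(1, int(config.get("sandbox", {}).get("concurrency", 1)))
    ids = list(test_ids)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order, so results line up with ids
        return dict(zip(ids, pool.map(run_one, ids)))
```

Here `config` stands for the parsed contents of config.json, for example `{"sandbox": {"concurrency": 2}}`.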

Run agentic-usability execute -p $ARGUMENTS and report the results.

For detailed internals, see pipeline-guide.md.
