execute
Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.
| Field | Value |
|---|---|
| name | execute |
| description | Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions. |
| argument-hint | `[project-directory] [--tests TC-001,TC-002] [--run runId]` |
| disable-model-invocation | true |
| allowed-tools | `Bash(agentic-usability *)` `Read` `Glob` |
Run the executor stage of the benchmark pipeline. For each test case and target, this:
- `PROBLEM.md` with the problem statement
- `/workspace/sources/`
- `/workspace/solution/`

`echo "Arguments: $ARGUMENTS"`

- `--tests <ids>`: Comma-separated test case IDs to run (e.g., `--tests TC-001,TC-003`)
- `--run <runId>`: Target a specific run directory (default: latest run)

Saved to `results/<runId>/<target>/<testId>/`:
| File | Description |
|---|---|
| `generated-solution.json` | Agent's solution `[{path, content}]` |
| `agent-notes.md` | Agent's self-reported working notes |
| `agent-output.log` | Raw agent stdout/stderr |
| `agent-cmd.log` | Exact command executed |
| `agent-session.jsonl` | Agent conversation log (if available) |
| `agent-egress.log.json` | Network traffic logs |
| `workspace-snapshot.tar.gz` | Full sandbox workspace tarball |
| `setup.log` | Workspace scaffolding log |
| `agent-error.log` | Error details (only on failure) |
| `install-error.log` | Agent install failure (only on error) |
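To inspect a solution outside the sandbox, the `[{path, content}]` entries in `generated-solution.json` can be written back to disk. The following is a minimal sketch, assuming the file is a JSON array of objects with string `path` and `content` keys, as the table suggests. The helper name `write_solution_files` is hypothetical and not part of `agentic-usability`.

```python
import json
from pathlib import Path


def write_solution_files(solution_json: Path, dest_dir: Path) -> list[Path]:
    """Write each {path, content} entry under dest_dir and return the written paths.

    Assumes the file holds a JSON array of {"path": str, "content": str} objects.
    """
    entries = json.loads(solution_json.read_text(encoding="utf-8"))
    dest_root = dest_dir.resolve()
    written = []
    for entry in entries:
        target = (dest_root / entry["path"]).resolve()
        # Refuse entries that would land outside dest_dir ("../x" or absolute paths).
        if not target.is_relative_to(dest_root):
            raise ValueError(f"unsafe path in solution: {entry['path']}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(entry["content"], encoding="utf-8")
        written.append(target)
    return written
```

The path check is a defensive choice for agent-generated output, not something the source describes; it needs Python 3.9 or later for `Path.is_relative_to`.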
Progress is tracked in `results/<runId>/pipeline-state.json`:

- `completed.execute["<target>"]` lists test IDs that have finished
- Run `agentic-usability eval --resume` to continue from where it stopped

To check which tests completed, read the pipeline state:

`results/<runId>/pipeline-state.json` → `completed.execute.<targetName>`
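Reading the state file can be scripted. This is a minimal sketch, assuming `pipeline-state.json` nests the finished test IDs as a list under `completed` → `execute` → target name, which is what the path above implies. The helpers `completed_tests` and `pending_tests` are hypothetical names used only for illustration.

```python
import json
from pathlib import Path


def completed_tests(run_dir: Path, target: str) -> list[str]:
    """Return the test IDs recorded as finished for one target in a run directory."""
    state = json.loads((run_dir / "pipeline-state.json").read_text(encoding="utf-8"))
    # Missing keys are treated as "nothing completed yet" (an assumption).
    return list(state.get("completed", {}).get("execute", {}).get(target, []))


def pending_tests(run_dir: Path, target: str, all_ids: list[str]) -> list[str]:
    """Return the IDs from all_ids that have not finished, preserving their order."""
    done = set(completed_tests(run_dir, target))
    return [test_id for test_id in all_ids if test_id not in done]
```

For example, `pending_tests(Path("results/<runId>"), "<target>", ["TC-001", "TC-002"])` would list whichever of the two IDs still needs to run.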
Failed tests are retried up to 2 times with backoffs of 1s and 3s before being marked as failed.
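The retry policy can be sketched as follows. This illustrates the behavior described above (one initial attempt, then up to 2 retries after waits of 1s and 3s) and is not the tool's actual implementation. `run_with_retries` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    backoffs: Sequence[float] = (1.0, 3.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task once, then retry after each backoff; re-raise the last failure."""
    for attempt in range(len(backoffs) + 1):
        try:
            return task()
        except Exception:
            # No backoffs left: the test is marked as failed by propagating the error.
            if attempt == len(backoffs):
                raise
            sleep(backoffs[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is a design choice for this sketch so the waits can be stubbed out in tests.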
Concurrency is controlled by `sandbox.concurrency` in `config.json`; multiple sandboxes run in parallel.
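Bounded parallelism of this kind can be sketched with a worker pool. The example assumes `config.json` stores the setting as a nested `{"sandbox": {"concurrency": N}}` object, inferred from the dotted name. The default of 1, the helper names, and the use of threads are all assumptions for illustration and do not describe how `agentic-usability` schedules sandboxes.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def load_concurrency(config_path: Path, default: int = 1) -> int:
    """Read sandbox.concurrency from config.json, falling back to an assumed default."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    value = config.get("sandbox", {}).get("concurrency", default)
    return max(1, int(value))


def run_bounded(
    test_ids: list[str],
    run_one: Callable[[str], str],
    concurrency: int,
) -> dict[str, str]:
    """Run run_one for every test ID with at most `concurrency` calls in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in input order, so zip pairs each ID with its result.
        return dict(zip(test_ids, pool.map(run_one, test_ids)))
```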
Run `agentic-usability execute -p $ARGUMENTS` and report the results.
For detailed internals, see pipeline-guide.md.