sandbox
Launch an interactive shell inside a microsandbox for debugging. Supports bare mode, executor setup, or judge setup with optional test case scaffolding.
| Field | Value |
|---|---|
| name | sandbox |
| description | Launch an interactive shell inside a microsandbox for debugging. Supports bare mode, executor setup, or judge setup with optional test case scaffolding. |
| argument-hint | `[project-directory] [--mode executor\|judge] [--test TC-001] [--target node-20] [--run runId]` |
| disable-model-invocation | true |
| allowed-tools | `Bash(agentic-usability *) Read Glob` |
Launch an interactive shell inside a microsandbox identical to what the pipeline uses. Useful for debugging agent auth, inspecting environment variables, testing commands, and reproducing sandbox issues.
echo "Arguments: $ARGUMENTS"
By default the sandbox boots with just the target image, secrets, and env vars — no agent install or workspace setup.
agentic-usability sandbox -p <project>
Boots a sandbox with the configured secrets and env vars. Nothing else is installed or scaffolded.
agentic-usability sandbox -p <project> --mode executor
agentic-usability sandbox -p <project> --mode executor --test TC-001
Installs the executor agent CLI. With --test, also scaffolds the workspace, uploads PROBLEM.md, and uploads public sources — mirroring the execute stage setup.
agentic-usability sandbox -p <project> --mode judge --test TC-001
agentic-usability sandbox -p <project> --mode judge --test TC-001 --run <runId>
Installs the judge agent CLI. With --test, it also restores the workspace snapshot from a previous run (or uploads solution files) and uploads all sources (private and public), mirroring the judge stage setup.
| Flag | Default | Description |
|---|---|---|
| `--target <name>` | first in config | Which target image to use |
| `--mode <mode>` | (none) | `executor` or `judge`; installs the agent CLI and optionally sets up the workspace |
| `--test <id>` | (none) | Test case to scaffold (requires `--mode`) |
| `--run <runId>` | latest | Run to load the workspace snapshot from (judge mode) |
| `--output <dir>` | `results/sandbox-debug-<timestamp>/` | Directory to save debug artifacts |
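The flag combinations above can be assembled by a small wrapper. What follows is a minimal sketch, assuming a hypothetical helper named `build_sandbox_cmd` that is not part of `agentic-usability`. It only prints the command line instead of running it, and it enforces the one documented constraint that `--test` requires `--mode`.

```shell
# Hypothetical helper (not part of agentic-usability): prints the sandbox
# command line for a project, mode, test case, and run without executing it.
# Arguments: <project> [mode] [test-id] [run-id]
build_sandbox_cmd() {
  _project="$1"; _mode="${2:-}"; _test="${3:-}"; _run="${4:-}"

  # --test is documented as requiring --mode, so reject it in bare mode.
  if [ -n "$_test" ] && [ -z "$_mode" ]; then
    echo "error: --test requires --mode" >&2
    return 1
  fi

  # Append each optional flag only when a value was supplied.
  _cmd="agentic-usability sandbox -p $_project"
  if [ -n "$_mode" ]; then _cmd="$_cmd --mode $_mode"; fi
  if [ -n "$_test" ]; then _cmd="$_cmd --test $_test"; fi
  if [ -n "$_run" ]; then _cmd="$_cmd --run $_run"; fi
  echo "$_cmd"
}

build_sandbox_cmd ./my-project judge TC-001
# prints: agentic-usability sandbox -p ./my-project --mode judge --test TC-001
```

The printed string is meant for review or logging. Paths containing spaces would need quoting before the line is pasted into a shell.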
Once inside the sandbox, you have a full shell. Press Ctrl-] to detach and destroy the sandbox.
Common debugging tasks:
- `printenv | grep KEY`: check which env vars are set
- `codex login --with-api-key`: test Codex auth
- `cat /workspace/PROBLEM.md`: verify problem statement
- `ls /workspace/sources/`: check uploaded sources

After detaching, the following artifacts are saved to the output directory:
| File | Description |
|---|---|
| `agent-egress.log.json` | Network traffic captured during the session |
| `setup.log` | Scaffolding and agent install output |
| `workspace-snapshot.tar.gz` | Tarball of `/workspace` after the session ends |
| `agent-session.jsonl` | Agent CLI session log (if available) |
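After a session, a quick check of the output directory can confirm that the artifacts listed above were written. The sketch below assumes a hypothetical helper named `check_sandbox_artifacts`, which is not part of `agentic-usability`. It uses only the file names from the table and standard `tar`, and it treats `agent-session.jsonl` as optional because the table marks it "if available".

```shell
# Hypothetical helper (not part of agentic-usability): reports which of the
# documented artifacts exist in an output directory and lists the members of
# the workspace tarball. Returns non-zero if a required artifact is missing.
check_sandbox_artifacts() {
  _dir="$1"
  _missing=0
  for _f in agent-egress.log.json setup.log workspace-snapshot.tar.gz agent-session.jsonl; do
    if [ -f "$_dir/$_f" ]; then
      echo "found: $_f"
    elif [ "$_f" = "agent-session.jsonl" ]; then
      # Documented as "if available", so its absence is not counted.
      echo "optional, absent: $_f"
    else
      echo "missing: $_f"
      _missing=$((_missing + 1))
    fi
  done

  # List archive members without extracting anything.
  if [ -f "$_dir/workspace-snapshot.tar.gz" ]; then
    tar -tzf "$_dir/workspace-snapshot.tar.gz"
  fi

  [ "$_missing" -eq 0 ]
}
```

Call it with the directory passed to `--output`, or with the default `results/sandbox-debug-<timestamp>/` directory created for the session.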
Run agentic-usability sandbox -p $ARGUMENTS and report the results.