judge
Have an LLM judge compare reference and generated solutions, scoring on API discovery, correctness, completeness, and functional correctness.
| Field | Value |
|---|---|
| name | judge |
| description | Have an LLM judge compare reference and generated solutions, scoring on API discovery, correctness, completeness, and functional correctness. |
| argument-hint | [project-directory] [--tests TC-001,TC-002] [--run runId] |
| disable-model-invocation | true |
| allowed-tools | Bash(agentic-usability *) Read Glob |
Run the judge stage. For each test case and target, an LLM judge compares the reference solution against the generated solution and produces scores.
echo "Arguments: $ARGUMENTS"
- `--tests <ids>`: Comma-separated test case IDs to judge
- `--run <runId>`: Target a specific run (default: latest)

Each test case receives scores on:
| Dimension | Range | What it measures |
|---|---|---|
| apiDiscovery | 0-100 | Did the agent find the correct SDK endpoints/methods? |
| callCorrectness | 0-100 | Are API calls constructed correctly (params, headers, body)? |
| completeness | 0-100 | Does the solution handle all requirements? |
| functionalCorrectness | 0-100 | Does the code run and produce correct output? |
| overallVerdict | boolean | Does the solution actually work? |
| notes | string | Brief explanation of scoring decisions |
Written to results/<runId>/<target>/<testId>/judge.json:
{
  "testId": "TC-001",
  "target": "node-20",
  "apiDiscovery": 85,
  "callCorrectness": 90,
  "completeness": 75,
  "functionalCorrectness": 80,
  "overallVerdict": true,
  "notes": "Found correct APIs, minor parameter issue in error handling path"
}
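When consuming `judge.json` from another script, it can help to confirm that a result matches the documented dimensions before relying on it. The following is a minimal sketch in Python, not part of the `agentic-usability` CLI: `validate_judge_result` is a hypothetical helper, and it assumes scores are plain numbers in the 0-100 range listed in the table above.

```python
import json

# The four numeric dimensions documented for judge.json (each 0-100).
SCORE_FIELDS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def validate_judge_result(result: dict) -> list[str]:
    """Return a list of problems in a parsed judge.json; empty means it looks valid.

    Hypothetical helper for illustration only.
    """
    problems = []
    for field in SCORE_FIELDS:
        value = result.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
        elif not 0 <= value <= 100:
            problems.append(f"{field}: {value} is outside 0-100")
    if not isinstance(result.get("overallVerdict"), bool):
        problems.append("overallVerdict: expected a boolean")
    if not isinstance(result.get("notes"), str):
        problems.append("notes: expected a string")
    return problems


# The sample result shown above, parsed from JSON text.
sample = json.loads("""
{
  "testId": "TC-001",
  "target": "node-20",
  "apiDiscovery": 85,
  "callCorrectness": 90,
  "completeness": 75,
  "functionalCorrectness": 80,
  "overallVerdict": true,
  "notes": "Found correct APIs, minor parameter issue in error handling path"
}
""")

print(validate_judge_result(sample))  # prints []
```

An out-of-range or missing score is reported as a string rather than raised, so a caller can collect problems across many test cases before deciding how to react.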
If the executor produced no solution, the judge writes an all-zero score:
{ "apiDiscovery": 0, "callCorrectness": 0, "completeness": 0, "functionalCorrectness": 0, "overallVerdict": false, "notes": "No solution produced (DNF)" }
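Because DNF results are all-zero, including them in an average drags every dimension down. A reporting script may want to count them separately. The sketch below is one possible way to do that in Python; `is_dnf` and `summarize` are hypothetical helpers, and the DNF check is a heuristic based only on the all-zero shape shown above (a genuinely attempted solution that scored zero everywhere would look the same).

```python
from statistics import mean

# The four numeric dimensions documented for judge.json.
SCORE_FIELDS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def is_dnf(result: dict) -> bool:
    """Heuristic: every score is zero and the verdict is false."""
    all_zero = all(result.get(field, 0) == 0 for field in SCORE_FIELDS)
    return all_zero and result.get("overallVerdict") is False


def summarize(results: list[dict]) -> dict:
    """Count DNFs and passes, and average scores over attempted cases only."""
    attempted = [r for r in results if not is_dnf(r)]
    mean_scores = {}
    if attempted:
        mean_scores = {f: mean(r[f] for r in attempted) for f in SCORE_FIELDS}
    return {
        "total": len(results),
        "dnf": len(results) - len(attempted),
        "passed": sum(1 for r in results if r.get("overallVerdict") is True),
        "meanScores": mean_scores,
    }


scored = {
    "apiDiscovery": 85,
    "callCorrectness": 90,
    "completeness": 75,
    "functionalCorrectness": 80,
    "overallVerdict": True,
    "notes": "Found correct APIs, minor parameter issue in error handling path",
}
dnf = {
    "apiDiscovery": 0,
    "callCorrectness": 0,
    "completeness": 0,
    "functionalCorrectness": 0,
    "overallVerdict": False,
    "notes": "No solution produced (DNF)",
}

summary = summarize([scored, dnf])
print(summary["total"], summary["dnf"], summary["passed"])  # prints 2 1 1
```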
| File | Description |
|---|---|
| judge.json | Full scoring result |
| judge-cmd.log | Judge command executed |
| judge-output.log | Raw judge stdout/stderr |
| judge-session.jsonl | Judge conversation log (if available) |
| judge-egress.log.json | Judge network traffic |
| judge-error.log | Error (only on failure) |
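Given the `results/<runId>/<target>/<testId>/` layout and the file names in the table above, the results of one run can be gathered with a directory walk. This is a rough sketch rather than anything the CLI provides: `collect_judge_results` and `failed_cases` are hypothetical helpers, and the `run-demo` run ID and the temporary directory exist only to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def collect_judge_results(run_dir: Path) -> dict[tuple[str, str], dict]:
    """Map (target, testId) to the parsed judge.json found under a run directory."""
    results = {}
    for path in sorted(run_dir.glob("*/*/judge.json")):
        test_id = path.parent.name
        target = path.parent.parent.name
        results[(target, test_id)] = json.loads(path.read_text(encoding="utf-8"))
    return results


def failed_cases(run_dir: Path) -> list[tuple[str, str]]:
    """List (target, testId) pairs whose directory contains judge-error.log."""
    return sorted(
        (p.parent.parent.name, p.parent.name)
        for p in run_dir.glob("*/*/judge-error.log")
    )


# Build a throwaway directory tree that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp, "results", "run-demo")  # hypothetical run ID
    ok_dir = run_dir / "node-20" / "TC-001"
    bad_dir = run_dir / "node-20" / "TC-002"
    ok_dir.mkdir(parents=True)
    bad_dir.mkdir(parents=True)
    (ok_dir / "judge.json").write_text(
        json.dumps({"testId": "TC-001", "overallVerdict": True}), encoding="utf-8"
    )
    (bad_dir / "judge-error.log").write_text("judge failed", encoding="utf-8")

    collected = collect_judge_results(run_dir)
    errored = failed_cases(run_dir)

print(sorted(collected))  # prints [('node-20', 'TC-001')]
print(errored)            # prints [('node-20', 'TC-002')]
```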
Tracked in results/<runId>/pipeline-state.json:
`completed.judge["<target>"]` lists judged test IDs.

Run `agentic-usability judge -p $ARGUMENTS` and report the results.
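To check which test cases still need judging, the `completed.judge["<target>"]` list in `pipeline-state.json` can be compared against the full set of test IDs. The sketch below assumes only what is stated above, namely that the file is a JSON object with a `completed.judge` mapping from target name to a list of test IDs. The rest of the file's shape is unknown here, and `load_state`, `judged_ids`, and `pending_ids` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_state(run_dir: str) -> dict:
    """Read results/<runId>/pipeline-state.json from the given run directory."""
    return json.loads(Path(run_dir, "pipeline-state.json").read_text(encoding="utf-8"))


def judged_ids(state: dict, target: str) -> set[str]:
    """Return the test IDs recorded under completed.judge[target], if any."""
    return set(state.get("completed", {}).get("judge", {}).get(target, []))


def pending_ids(state: dict, target: str, all_ids: list[str]) -> list[str]:
    """Return the IDs from all_ids that are not yet marked as judged, in order."""
    done = judged_ids(state, target)
    return [test_id for test_id in all_ids if test_id not in done]


# In-memory stand-in for a parsed pipeline-state.json (assumed shape).
state = {"completed": {"judge": {"node-20": ["TC-001", "TC-002"]}}}

print(pending_ids(state, "node-20", ["TC-001", "TC-002", "TC-003"]))  # prints ['TC-003']
```

The pending IDs could then be passed as a comma-separated list to `--tests`.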
For detailed internals, see pipeline-guide.md.