judge

Stars: 19

Forks: 0

Updated: April 27, 2026, 14:26

Have an LLM judge compare reference and generated solutions, scoring each on API discovery, call correctness, completeness, and functional correctness.

Source

PSPDFKit-labs

PSPDFKit-labs/agentic-usability

Judge Solutions

Run the judge stage. For each test case and target, an LLM judge compares the reference solution against the generated solution and produces scores.

echo "Arguments: $ARGUMENTS"

Options

--tests <ids>: Comma-separated test case IDs to judge
--run <runId>: Target a specific run (default: latest)
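
As a rough illustration of how these options combine, the sketch below assembles the command line in Python. The helper name `build_judge_command` and the `./my-benchmark` path are hypothetical; the `-p` flag and the two options are taken from this page, and nothing else about the CLI is assumed.

```python
from typing import List, Optional, Sequence


def build_judge_command(
    project_dir: str,
    tests: Optional[Sequence[str]] = None,
    run_id: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the judge stage (hypothetical helper)."""
    cmd = ["agentic-usability", "judge", "-p", project_dir]
    if tests:
        # --tests takes a single comma-separated list of test case IDs.
        cmd += ["--tests", ",".join(tests)]
    if run_id:
        # Omitting --run targets the latest run.
        cmd += ["--run", run_id]
    return cmd


print(build_judge_command("./my-benchmark", tests=["TC-001", "TC-002"]))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues around the comma-separated ID list.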

Scoring Dimensions

Each judged test case records four 0-100 scores, a boolean verdict, and free-text notes:

Dimension	Range	What it measures
`apiDiscovery`	0-100	Did the agent find the correct SDK endpoints/methods?
`callCorrectness`	0-100	Are API calls constructed correctly (params, headers, body)?
`completeness`	0-100	Does the solution handle all requirements?
`functionalCorrectness`	0-100	Does the code run and produce correct output?
`overallVerdict`	boolean	Does the solution actually work?
`notes`	string	Brief explanation of scoring decisions

Score Bands

0-20: Fundamentally wrong
21-40: Major issues, partially correct
41-60: Mostly correct with notable mistakes
61-80: Correct with minor issues
81-100: Excellent, matches reference
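
When reading many results, it can help to translate a numeric score back into its band. The following is a minimal sketch of that lookup; the function name `score_band` is hypothetical and is not part of the tool.

```python
def score_band(score: int) -> str:
    """Map a 0-100 score to the band label listed above (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0-100, got {score}")
    if score <= 20:
        return "Fundamentally wrong"
    if score <= 40:
        return "Major issues, partially correct"
    if score <= 60:
        return "Mostly correct with notable mistakes"
    if score <= 80:
        return "Correct with minor issues"
    return "Excellent, matches reference"


print(score_band(75))  # Correct with minor issues
```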

Judge Output

Written to results/<runId>/<target>/<testId>/judge.json:

{
  "testId": "TC-001",
  "target": "node-20",
  "apiDiscovery": 85,
  "callCorrectness": 90,
  "completeness": 75,
  "functionalCorrectness": 80,
  "overallVerdict": true,
  "notes": "Found correct APIs, minor parameter issue in error handling path"
}
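
A downstream script may want to sanity-check a `judge.json` record before aggregating it. The sketch below checks only the fields and ranges documented above; `validate_judge_record` is a hypothetical helper, and accepting any number (rather than integers only) for the scores is an assumption.

```python
import json

SCORE_FIELDS = ("apiDiscovery", "callCorrectness", "completeness", "functionalCorrectness")


def validate_judge_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in SCORE_FIELDS:
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 100:
            problems.append(f"{field}: expected a number in 0-100, got {value!r}")
    if not isinstance(record.get("overallVerdict"), bool):
        problems.append("overallVerdict: expected a boolean")
    if not isinstance(record.get("notes"), str):
        problems.append("notes: expected a string")
    return problems


sample = json.loads("""
{
  "testId": "TC-001",
  "target": "node-20",
  "apiDiscovery": 85,
  "callCorrectness": 90,
  "completeness": 75,
  "functionalCorrectness": 80,
  "overallVerdict": true,
  "notes": "Found correct APIs, minor parameter issue in error handling path"
}
""")
print(validate_judge_record(sample))  # []
```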

DNF (Did Not Finish)

If the executor produced no solution, the judge writes an all-zero score:

{ "apiDiscovery": 0, "callCorrectness": 0, "completeness": 0, "functionalCorrectness": 0, "overallVerdict": false, "notes": "No solution produced (DNF)" }
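
When aggregating results, DNF records are often counted separately from scored ones. The sketch below flags a record as a likely DNF when all four scores are zero and the verdict is false. This is a heuristic under that assumption: a solution that was judged and legitimately scored zero everywhere would look the same apart from its `notes`. The name `is_dnf` is hypothetical.

```python
SCORE_FIELDS = ("apiDiscovery", "callCorrectness", "completeness", "functionalCorrectness")


def is_dnf(record: dict) -> bool:
    """Heuristic: all-zero scores plus a false verdict suggest a DNF record."""
    all_zero = all(record.get(field) == 0 for field in SCORE_FIELDS)
    return all_zero and record.get("overallVerdict") is False


dnf_record = {
    "apiDiscovery": 0,
    "callCorrectness": 0,
    "completeness": 0,
    "functionalCorrectness": 0,
    "overallVerdict": False,
    "notes": "No solution produced (DNF)",
}
print(is_dnf(dnf_record))  # True
```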

Per-Test Judge Files

File	Description
`judge.json`	Full scoring result
`judge-cmd.log`	Judge command executed
`judge-output.log`	Raw judge stdout/stderr
`judge-session.jsonl`	Judge conversation log (if available)
`judge-egress.log.json`	Judge network traffic
`judge-error.log`	Error (only on failure)
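
To inspect a single test's artifacts, the per-test directory can be derived from the path pattern given under Judge Output. The sketch below is illustrative only: `judge_dir` and `present_judge_files` are hypothetical helpers, and the `results` root is assumed to be relative to the project directory.

```python
from pathlib import Path

JUDGE_FILES = (
    "judge.json",
    "judge-cmd.log",
    "judge-output.log",
    "judge-session.jsonl",
    "judge-egress.log.json",
    "judge-error.log",
)


def judge_dir(results_root: str, run_id: str, target: str, test_id: str) -> Path:
    """Build results/<runId>/<target>/<testId> from its parts."""
    return Path(results_root) / run_id / target / test_id


def present_judge_files(directory: Path) -> dict:
    """Report which of the documented judge files exist in a directory."""
    return {name: (Path(directory) / name).is_file() for name in JUDGE_FILES}


# "example-run" is a placeholder run ID, not a real one.
directory = judge_dir("results", "example-run", "node-20", "TC-001")
for name, exists in present_judge_files(directory).items():
    print(f"{name}: {'present' if exists else 'missing'}")
```

A `judge-error.log` that is present alongside a missing `judge.json` would be the first thing to read when a test fails to produce a score.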

Progress Tracking

Tracked in results/<runId>/pipeline-state.json:

completed.judge["<target>"] lists judged test IDs
State is saved after each test, so the stage is safe to interrupt and resume
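
The resume behaviour can be pictured as a set difference between all test IDs and those already recorded for a target. The sketch below assumes `pipeline-state.json` is a nested JSON object shaped like `{"completed": {"judge": {"<target>": [...]}}}`, inferred from the accessor above; the real file may hold more fields, and the helper names are hypothetical.

```python
def judged_ids(state: dict, target: str) -> set:
    """Return test IDs already judged for a target (assumed state layout)."""
    return set(state.get("completed", {}).get("judge", {}).get(target, []))


def pending_ids(state: dict, target: str, all_ids: list) -> list:
    """Return test IDs still to be judged, preserving the input order."""
    done = judged_ids(state, target)
    return [test_id for test_id in all_ids if test_id not in done]


# In practice the state would come from json.load() on pipeline-state.json.
state = {"completed": {"judge": {"node-20": ["TC-001", "TC-003"]}}}
print(pending_ids(state, "node-20", ["TC-001", "TC-002", "TC-003"]))  # ['TC-002']
```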

Run agentic-usability judge -p $ARGUMENTS and report the results.

For detailed internals, see pipeline-guide.md.

name	judge
description	Have an LLM judge compare reference and generated solutions, scoring on API discovery, correctness, completeness, and functional correctness.
argument-hint	[project-directory] [--tests TC-001,TC-002] [--run runId]
disable-model-invocation	true
allowed-tools	Bash(agentic-usability *) Read Glob
