judge

Updated April 27, 2026, 14:26

Have an LLM judge compare reference and generated solutions, scoring on API discovery, correctness, completeness, and functional correctness.

Source: PSPDFKit-labs/agentic-usability

Judge Solutions

Run the judge stage. For each test case and target, an LLM judge compares the reference solution against the generated solution and produces scores.

echo "Arguments: $ARGUMENTS"

Options

--tests <ids>: Comma-separated test case IDs to judge
--run <runId>: Target a specific run (default: latest)

Scoring Dimensions

Each test case receives scores on:

| Dimension | Range | What it measures |
| --- | --- | --- |
| `apiDiscovery` | 0-100 | Did the agent find the correct SDK endpoints/methods? |
| `callCorrectness` | 0-100 | Are API calls constructed correctly (params, headers, body)? |
| `completeness` | 0-100 | Does the solution handle all requirements? |
| `functionalCorrectness` | 0-100 | Does the code run and produce correct output? |
| `overallVerdict` | boolean | Does the solution actually work? |
| `notes` | string | Brief explanation of scoring decisions |
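
As a rough illustration of the field types and ranges listed above, the sketch below checks a judge record against them. The function name `validate_judge_record` and the shape of its return value are assumptions made for this example; they are not part of the agentic-usability tool.

```python
from typing import Any

# Numeric dimensions taken from the table above; each is expected in 0-100.
SCORE_FIELDS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def validate_judge_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid.

    Hypothetical helper for illustration only.
    """
    problems = []
    for field in SCORE_FIELDS:
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
        elif not 0 <= value <= 100:
            problems.append(f"{field}: {value} is outside 0-100")
    if not isinstance(record.get("overallVerdict"), bool):
        problems.append("overallVerdict: expected a boolean")
    if not isinstance(record.get("notes"), str):
        problems.append("notes: expected a string")
    return problems
```

A check like this could run before aggregating scores, so that a malformed record is reported rather than silently averaged.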

Score Bands

0-20: Fundamentally wrong
21-40: Major issues, partially correct
41-60: Mostly correct with notable mistakes
61-80: Correct with minor issues
81-100: Excellent, matches reference
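
The bands above can be expressed as a small lookup. This is a minimal sketch for reading scores in reports; `score_band` is a hypothetical helper name, and only the band boundaries and labels come from the list above.

```python
# Upper bound (inclusive) of each band, paired with its label from the list above.
BANDS = [
    (20, "Fundamentally wrong"),
    (40, "Major issues, partially correct"),
    (60, "Mostly correct with notable mistakes"),
    (80, "Correct with minor issues"),
    (100, "Excellent, matches reference"),
]


def score_band(score: int) -> str:
    """Map a 0-100 score to its band label."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0-100, got {score}")
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: 100 is the last upper bound")
```

For example, the `completeness` value of 75 in the sample output falls in the "Correct with minor issues" band.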

Judge Output

Written to results/<runId>/<target>/<testId>/judge.json:

{
  "testId": "TC-001",
  "target": "node-20",
  "apiDiscovery": 85,
  "callCorrectness": 90,
  "completeness": 75,
  "functionalCorrectness": 80,
  "overallVerdict": true,
  "notes": "Found correct APIs, minor parameter issue in error handling path"
}
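
To show how these files might be consumed, the sketch below collects every `judge.json` under one run and averages a single dimension. It assumes only the directory layout stated above; the function names and the `results_root` parameter are illustrative and not part of the tool.

```python
import json
from pathlib import Path


def load_judge_results(results_root: str, run_id: str) -> list[dict]:
    """Load every judge.json under <results_root>/<run_id>/<target>/<testId>/."""
    run_dir = Path(results_root) / run_id
    records = []
    # Two wildcard levels: one for <target>, one for <testId>.
    for path in sorted(run_dir.glob("*/*/judge.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


def mean_score(records: list[dict], field: str) -> float:
    """Average one score dimension; returns 0.0 when there are no records."""
    values = [record[field] for record in records]
    return sum(values) / len(values) if values else 0.0
```

A call such as `mean_score(load_judge_results("results", run_id), "completeness")` would then give the run-wide average for that dimension, where `run_id` is whichever run directory exists locally.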

DNF (Did Not Finish)

If the executor produced no solution, the judge writes an all-zero score:

{ "apiDiscovery": 0, "callCorrectness": 0, "completeness": 0, "functionalCorrectness": 0, "overallVerdict": false, "notes": "No solution produced (DNF)" }
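
When summarizing a run, it can help to separate DNF records from solutions that were judged and scored poorly. The check below is a sketch based only on the all-zero shape shown above; `is_dnf` is a hypothetical helper, and a genuinely judged solution that scores zero everywhere would match it too, so treat the result as a heuristic.

```python
SCORE_FIELDS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def is_dnf(record: dict) -> bool:
    """Heuristic: all four scores are 0 and the verdict is false."""
    all_zero = all(record.get(field) == 0 for field in SCORE_FIELDS)
    return all_zero and record.get("overallVerdict") is False
```

Filtering with `[r for r in records if not is_dnf(r)]` keeps DNF entries from dragging down averages when the goal is to compare only the solutions that were actually produced.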

Per-Test Judge Files

| File | Description |
| --- | --- |
| `judge.json` | Full scoring result |
| `judge-cmd.log` | Judge command executed |
| `judge-output.log` | Raw judge stdout/stderr |
| `judge-session.jsonl` | Judge conversation log (if available) |
| `judge-egress.log.json` | Judge network traffic |
| `judge-error.log` | Error (only on failure) |
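
For a quick look at what the judge left behind for one test, the sketch below reports which of the files listed above exist in a given directory. The file names come from the table; `judge_file_status` and its `test_dir` parameter are assumptions for this example.

```python
from pathlib import Path

# File names from the table above.
JUDGE_FILES = (
    "judge.json",
    "judge-cmd.log",
    "judge-output.log",
    "judge-session.jsonl",
    "judge-egress.log.json",
    "judge-error.log",
)


def judge_file_status(test_dir: str) -> dict[str, bool]:
    """Map each expected judge file name to whether it exists in test_dir."""
    base = Path(test_dir)
    return {name: (base / name).is_file() for name in JUDGE_FILES}
```

Since `judge-error.log` is written only on failure, its presence in the returned mapping is a reasonable first thing to check when a test has no `judge.json`.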

Progress Tracking

Tracked in results/<runId>/pipeline-state.json:

completed.judge["<target>"] lists judged test IDs
State saved after each test — safe to interrupt and resume
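
The resume behavior can be pictured as a set difference between all test IDs and those already recorded. The sketch below assumes only the `completed.judge["<target>"]` path described above; any other keys in `pipeline-state.json` are unknown here, and `pending_judge_tests` is a hypothetical helper rather than the tool's own logic.

```python
def pending_judge_tests(state: dict, target: str, all_test_ids: list[str]) -> list[str]:
    """Return the test IDs not yet listed under completed.judge[target].

    Missing keys are treated as "nothing judged yet" for that target.
    """
    done = set(state.get("completed", {}).get("judge", {}).get(target, []))
    # Preserve the caller's ordering of test IDs.
    return [test_id for test_id in all_test_ids if test_id not in done]


# Example state, shaped after the path described in the text.
state = {"completed": {"judge": {"node-20": ["TC-001"]}}}
remaining = pending_judge_tests(state, "node-20", ["TC-001", "TC-002", "TC-003"])
```

Here `remaining` holds `TC-002` and `TC-003`, the tests that a resumed run would still need to judge for that target.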

Run agentic-usability judge -p $ARGUMENTS and report the results.
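
When driving the stage from a script, the argument list can be assembled as below. This is a sketch under assumptions: it uses only the `-p`, `--tests`, and `--run` flags mentioned on this page, it assumes the value after `-p` is the project directory, and both `build_judge_command` and the `./my-benchmark` path are made up for illustration.

```python
import shlex
from typing import Optional


def build_judge_command(
    project_dir: str,
    tests: Optional[list[str]] = None,
    run_id: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list for the judge stage (hypothetical helper)."""
    # Assumption: -p is followed by the project directory.
    command = ["agentic-usability", "judge", "-p", project_dir]
    if tests:
        # --tests takes comma-separated test case IDs.
        command += ["--tests", ",".join(tests)]
    if run_id:
        # Without --run, the latest run is targeted by default.
        command += ["--run", run_id]
    return command


# Print a shell-safe rendering of the command for inspection.
print(shlex.join(build_judge_command("./my-benchmark", ["TC-001", "TC-002"])))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.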

For detailed internals, see pipeline-guide.md.

Other skills in this repository (PSPDFKit-labs/agentic-usability)

init (2026-05-14)

Initialize a new agentic-usability benchmark pipeline project. Use when setting up a new SDK benchmark, creating a config.json, or starting a new evaluation project.

sandbox (2026-05-14)

Launch an interactive shell inside a microsandbox for debugging. Supports bare mode, executor setup, or judge setup with optional test case scaffolding.

eval (2026-04-27)

Run the full evaluation pipeline (execute, judge, report) for an SDK usability benchmark. Use when running a complete benchmark end-to-end, resuming an interrupted pipeline, or checking pipeline status.

execute (2026-04-27)

Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.

export (2026-04-27)

Export a benchmark pipeline as a zip file for sharing or archiving. Excludes cache and large snapshots.

generate (2026-04-27)

Generate SDK usability test cases by exploring source code. Use when creating benchmark test suites, generating test cases for an SDK, or when the user wants to create evaluation scenarios.

name: judge
description: Have an LLM judge compare reference and generated solutions, scoring on API discovery, correctness, completeness, and functional correctness.
argument-hint: [project-directory] [--tests TC-001,TC-002] [--run runId]
disable-model-invocation: true
allowed-tools: Bash(agentic-usability *) Read Glob
