write-judge-prompt

Guides design of LLM-as-judge prompts for subjective evaluation criteria. Use when deterministic checks are insufficient and you need a judge prompt for quality dimensions like helpfulness, faithfulness, clarity, or tone.

Stars: 3

Forks: 0

Updated: May 24, 2026, 22:13

Source

itseffi

itseffi/ai-product-evals

SKILL.md

name	write-judge-prompt
description	Guides design of LLM-as-judge prompts for subjective evaluation criteria. Use when deterministic checks are insufficient and you need a judge prompt for quality dimensions like helpfulness, faithfulness, clarity, or tone.

Write Judge Prompt

Overview

Use judge prompts only when deterministic checks are not enough.
Define the judgment criteria precisely before writing the prompt.
Ask the judge for structured outputs that can be parsed consistently.
Keep the scoring rubric narrow and behavior-based.
Validate the judge prompt against real examples before trusting it.

Prerequisites

Confirm that deterministic evaluation is inadequate for the target behavior. Inspect evaluators/index.mjs, especially llmJudge, and read traces of real failures before designing the judge prompt.
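One way to confirm that a judge is needed is to list what can already be checked in code and reserve the judge for whatever is left. A minimal sketch, assuming hypothetical names (`deterministicChecks`, `DETERMINISTIC_CRITERIA`, `needsJudge`) and criterion labels that are not taken from evaluators/index.mjs:

```javascript
// Hypothetical deterministic checks; none of these names come from the repo.
const deterministicChecks = {
  // Format requirements can be verified without a model.
  isValidJson: (output) => {
    try {
      JSON.parse(output);
      return true;
    } catch (err) {
      return false;
    }
  },
  // Length limits are a simple word count.
  underWordLimit: (output, limit) =>
    output.trim().split(/\s+/).filter(Boolean).length <= limit,
  // Required terms are a case-insensitive string search.
  mentions: (output, term) =>
    output.toLowerCase().includes(term.toLowerCase()),
};

// Criteria this sketch treats as checkable in code (assumed labels).
const DETERMINISTIC_CRITERIA = new Set(["valid-json", "word-limit", "mentions-term"]);

// Anything outside that set is assumed to need a judge prompt.
function needsJudge(criterion) {
  return !DETERMINISTIC_CRITERIA.has(criterion);
}
```

In this sketch a criterion such as "professional-tone" falls through to a judge, while "valid-json" never does.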

Core Instructions

Start With A Narrow Rubric

Define the exact behavior being judged, such as:

faithfulness to context
relevance to query
clarity
conciseness
professional tone

Avoid vague umbrella prompts like “judge overall quality.”

Require Structured Output

Judge outputs should be easy to parse. Put the reasoning before the verdict, so that the judge reasons before it commits to a label:

REASON: [one sentence]
SCORE: [0-100]
PASS: [YES or NO]

PASS is the primary verdict; SCORE is advisory. The parser should not rely on free-form prose, and the order above matches the templates in judges/.
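The format above can be parsed strictly, rejecting anything that does not match rather than guessing. A minimal sketch, assuming a hypothetical `parseJudgeOutput` helper; it is not the repo's llmJudge implementation, and the choice to return `null` on malformed output is an assumption of this example:

```javascript
// Parses the three-line judge format. The function name and return shape
// are illustrative, not taken from evaluators/index.mjs.
function parseJudgeOutput(text) {
  const field = (name) => {
    // Match "NAME: value" at the start of a line, case-insensitively.
    const match = text.match(new RegExp(`^\\s*${name}:\\s*(.+?)\\s*$`, "im"));
    return match ? match[1] : null;
  };

  const reason = field("REASON");
  const scoreRaw = field("SCORE");
  const passRaw = field("PASS");

  // Treat anything malformed as a parse failure instead of guessing.
  if (reason === null || scoreRaw === null || passRaw === null) return null;
  if (!/^\d{1,3}$/.test(scoreRaw)) return null;
  const score = Number(scoreRaw);
  if (score > 100) return null;

  const verdict = passRaw.toUpperCase();
  if (verdict !== "YES" && verdict !== "NO") return null;

  // PASS is the primary verdict; SCORE is kept only as advisory metadata.
  return { reason, score, pass: verdict === "YES" };
}
```

Returning `null` for malformed output lets the caller count parse failures separately from real FAIL verdicts, which keeps a judge that ignores the format from quietly inflating either number.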

Use Behavior-Based Criteria

Describe what counts as success and failure in concrete terms. Good judge prompts refer to observable properties of the response, not abstract claims about “goodness.”
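A behavior-based rubric can be written as explicit pass and fail conditions for one criterion. A minimal sketch, assuming a hypothetical `buildJudgePrompt` helper and field names (`passWhen`, `failWhen`); the templates in judges/ may be organized differently, and the refund example is made up for illustration:

```javascript
// Builds a single-criterion judge prompt. All field names are hypothetical.
function buildJudgePrompt({ criterion, passWhen, failWhen, context, response }) {
  const bullets = (items) => items.map((item) => `- ${item}`).join("\n");
  return [
    `You are evaluating one criterion only: ${criterion}.`,
    "",
    "The response PASSES when:",
    bullets(passWhen),
    "",
    "The response FAILS when:",
    bullets(failWhen),
    "",
    "Context:",
    context,
    "",
    "Response to evaluate:",
    response,
    "",
    "Answer in exactly this format:",
    "REASON: [one sentence]",
    "SCORE: [0-100]",
    "PASS: [YES or NO]",
  ].join("\n");
}

// Illustrative input: each condition names an observable property.
const examplePrompt = buildJudgePrompt({
  criterion: "faithfulness to context",
  passWhen: ["Every factual claim in the response is stated in the context."],
  failWhen: [
    "The response states a fact that the context does not contain.",
    "The response contradicts the context.",
  ],
  context: "Refunds are available within 30 days of purchase.",
  response: "You can get a refund within 30 days.",
});
```

Each condition describes something a reader could verify by comparing the response to the context, so two reviewers applying the rubric should reach the same verdict.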

Test Against Real Examples

Before trusting the judge:

pass it obvious positives
pass it obvious negatives
pass it edge cases
check whether it is stable across similar examples
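The checks above can be tallied once the judge has been run on a small labeled set. A minimal sketch, assuming a hypothetical record shape (`kind`, `group`, `expected`, `actual`), where `group` ties together near-duplicate examples; the sample records are invented for illustration and are not results from the repo:

```javascript
// Summarizes judge verdicts against human labels. The record shape
// ({ kind, group, expected, actual }) is an assumption of this sketch.
function summarizeCalibration(records) {
  const byKind = {};
  const byGroup = new Map();

  for (const { kind, group, expected, actual } of records) {
    // Agreement with the human label, tracked per example kind.
    if (!byKind[kind]) byKind[kind] = { total: 0, agreed: 0 };
    byKind[kind].total += 1;
    if (expected === actual) byKind[kind].agreed += 1;

    // Verdicts collected per group of similar examples.
    if (!byGroup.has(group)) byGroup.set(group, new Set());
    byGroup.get(group).add(actual);
  }

  // A group is unstable when similar examples received different verdicts.
  const unstableGroups = [...byGroup]
    .filter(([, verdicts]) => verdicts.size > 1)
    .map(([group]) => group);

  return { byKind, unstableGroups };
}

// Made-up records: two positives, one negative, two similar edge cases.
const summary = summarizeCalibration([
  { kind: "positive", group: "refund-correct", expected: true, actual: true },
  { kind: "positive", group: "refund-correct", expected: true, actual: true },
  { kind: "negative", group: "refund-invented", expected: false, actual: false },
  { kind: "edge", group: "refund-partial", expected: false, actual: true },
  { kind: "edge", group: "refund-partial", expected: false, actual: false },
]);
```

Here the judge agrees with the labels on the obvious positives and negatives but splits on the two similar edge cases, so `summary.unstableGroups` flags "refund-partial" as the rubric wording to revisit.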

Repo Files To Inspect

evaluators/index.mjs
run-eval.mjs
evals/
traces/

Anti-Patterns

Using a judge when deterministic scoring is possible.
Asking for unstructured natural-language reasoning only.
Combining too many criteria in one rubric.
Trusting a judge prompt without calibration.
Treating judge output as ground truth without validation.

Other Skills in this repository

Same repository

build-review-interface

itseffi/ai-product-evals

Guides building or improving interfaces for human review of eval traces. Use when humans need to inspect failures, label outputs, compare model behavior, or audit evaluator decisions at scale.

2026-04-29 · 3 stars

error-analysis

itseffi/ai-product-evals

Guides systematic analysis of eval failures using traces. Use when a suite is failing, model outputs seem inconsistent, evaluator behavior is suspect, or you need to classify failures before changing prompts, metrics, or datasets.

2026-04-29 · 3 stars

evaluate-rag

itseffi/ai-product-evals

Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.

2026-04-29 · 3 stars

generate-synthetic-data

itseffi/ai-product-evals

Guides creation of synthetic eval cases that expand coverage without drifting away from real usage. Use when the current eval set is too small, too repetitive, or missing edge cases, and you need more diverse prompts, distractors, or structured scenarios.

2026-04-29 · 3 stars

propose-judge-patch

itseffi/ai-product-evals

Drafts a reviewable judge-template patch from evaluator validation disagreements.

2026-04-29 · 3 stars

validate-evaluator

itseffi/ai-product-evals

Guides validation of evaluators, especially LLM judges, against labeled examples. Use when evaluator quality is uncertain, judge scores seem inconsistent, or you need to check whether the evaluator is biased, noisy, or misaligned.

2026-04-29 · 3 stars
