write-judge-prompt
Guides design of LLM-as-judge prompts for subjective evaluation criteria. Use when deterministic checks are insufficient and you need a judge prompt for quality dimensions like helpfulness, faithfulness, clarity, or tone.
| name | write-judge-prompt |
| description | Guides design of LLM-as-judge prompts for subjective evaluation criteria. Use when deterministic checks are insufficient and you need a judge prompt for quality dimensions like helpfulness, faithfulness, clarity, or tone. |
Confirm that deterministic evaluation is inadequate for the target behavior. Inspect evaluators/index.mjs, especially llmJudge, and read traces of real failures before designing the judge prompt.
Define the exact behavior being judged, such as:
Avoid vague umbrella prompts like “judge overall quality.”
Judge outputs should be easy to parse, and should put the reasoning before the verdict so the judge reasons before it commits to a label:
REASON: [one sentence]
SCORE: [0-100]
PASS: [YES or NO]
PASS is the primary verdict; SCORE is advisory. The parser should not rely on
free-form prose, and the order above matches the templates in judges/.
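A parser for this format can be sketched as follows. This is a minimal illustration, not the llmJudge implementation in evaluators/index.mjs: the function name `parseJudgeOutput` and the shape of its return value are assumptions made for the example. It treats PASS as required and SCORE as optional, and it ignores any prose outside the three labeled lines.

```javascript
// Hypothetical helper: parses the three-line judge format shown above.
// It is not the llmJudge implementation from evaluators/index.mjs.
function parseJudgeOutput(text) {
  const fields = {};
  for (const line of text.split(/\r?\n/)) {
    // Match "KEY: value" for the three expected keys only.
    const match = line.match(/^\s*(REASON|SCORE|PASS)\s*:\s*(.*)$/);
    // Keep the first occurrence of each key and ignore later repeats.
    if (match && !(match[1] in fields)) {
      fields[match[1]] = match[2].trim();
    }
  }

  // PASS is the primary verdict, so a missing or malformed PASS line is an error.
  const verdict = (fields.PASS ?? "").toUpperCase();
  if (verdict !== "YES" && verdict !== "NO") {
    return { ok: false, error: "missing or invalid PASS line" };
  }

  // SCORE is advisory: keep it when it is an integer in 0-100, otherwise null.
  const rawScore = fields.SCORE ?? "";
  const isValidScore = /^\d{1,3}$/.test(rawScore) && Number(rawScore) <= 100;
  const score = isValidScore ? Number(rawScore) : null;

  return { ok: true, pass: verdict === "YES", score, reason: fields.REASON ?? "" };
}

// Example usage with a well-formed judge reply.
const parsed = parseJudgeOutput(
  "REASON: Cites the source.\nSCORE: 85\nPASS: YES"
);
console.log(parsed);
```

Returning an explicit error for an unparseable reply, instead of defaulting to a fail, keeps judge formatting problems separate from genuine failures of the system under test.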
Describe what counts as success and failure in concrete terms. Good judge prompts refer to observable properties of the response, not abstract claims about “goodness.”
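One way to keep criteria concrete is to assemble the judge prompt from explicit pass and fail conditions. The sketch below is illustrative only: `buildJudgePrompt` and its field names (`criterion`, `passWhen`, `failWhen`, `input`, `response`) are hypothetical and are not taken from the templates in judges/. The example criterion and conditions are invented for demonstration.

```javascript
// Hypothetical prompt builder; the field names are illustrative and do not
// come from the templates in judges/.
function buildJudgePrompt({ criterion, passWhen, failWhen, input, response }) {
  // Render each observable condition as its own bullet.
  const bullets = (items) => items.map((item) => `- ${item}`).join("\n");
  return [
    `You are evaluating one criterion only: ${criterion}.`,
    "",
    "The response passes when ALL of these hold:",
    bullets(passWhen),
    "",
    "The response fails when ANY of these hold:",
    bullets(failWhen),
    "",
    "Input:",
    input,
    "",
    "Response:",
    response,
    "",
    // Reasoning comes first so the judge reasons before committing to a label.
    "Reply in exactly this format:",
    "REASON: [one sentence]",
    "SCORE: [0-100]",
    "PASS: [YES or NO]",
  ].join("\n");
}

// Example: a faithfulness judge with observable conditions.
const prompt = buildJudgePrompt({
  criterion: "faithfulness to the provided context",
  passWhen: ["Every factual claim in the response appears in the context."],
  failWhen: ["The response states a number, name, or date that the context does not contain."],
  input: "Context: The library was released in 2019.\nQuestion: When was it released?",
  response: "It was released in 2019.",
});
console.log(prompt);
```

Each condition names something a reader could check in the response text, which is what makes a judge's verdict auditable against human labels.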
Before trusting the judge:
evaluators/index.mjs
run-eval.mjs
evals/traces/