write-judge-prompt

Guides design of LLM-as-judge prompts for subjective evaluation criteria. Use when deterministic checks are insufficient and you need a judge prompt for quality dimensions like helpfulness, faithfulness, clarity, or tone.


Updated: May 24, 2026, 22:13

Source: itseffi/ai-product-evals


SKILL.md



Write Judge Prompt

Overview

Use judge prompts only when deterministic checks are not enough.
Define the judgment criteria precisely before writing the prompt.
Ask the judge for structured outputs that can be parsed consistently.
Keep the scoring rubric narrow and behavior-based.
Validate the judge prompt against real examples before trusting it.

Prerequisites

Confirm that deterministic evaluation is inadequate for the target behavior. Inspect evaluators/index.mjs, especially llmJudge, and read traces of real failures before designing the judge prompt.
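As a rough illustration of what "deterministic evaluation" can cover, the sketch below shows a few checks that need no model at all. The object and function names are hypothetical and are not taken from evaluators/index.mjs; if a criterion can be expressed this way, it probably does not need a judge prompt.

```javascript
// Hypothetical deterministic checks (illustrative names, not from the repo).
const deterministicChecks = {
  // True when the output parses as JSON.
  isValidJson: (output) => {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  },
  // True when every required substring appears in the output.
  containsAll: (output, required) => required.every((s) => output.includes(s)),
  // True when the output does not exceed a character budget.
  withinLength: (output, maxChars) => output.length <= maxChars,
};
```

Criteria such as tone or faithfulness usually cannot be reduced to checks like these, which is where a judge prompt becomes worth its cost.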

Core Instructions

Start With A Narrow Rubric

Define the exact behavior being judged, such as:

faithfulness to context
relevance to query
clarity
conciseness
professional tone

Avoid vague umbrella prompts like “judge overall quality.”

Require Structured Output

Judge outputs should be easy to parse, and should put the reasoning before the verdict so the judge reasons before it commits to a label:

REASON: [one sentence]
SCORE: [0-100]
PASS: [YES or NO]

PASS is the primary verdict; SCORE is advisory. The parser should not rely on free-form prose, and the order above matches the templates in judges/.
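A minimal sketch of a parser for the three-line format above, assuming each field sits on its own line. The function name `parseJudgeOutput` and the shape of its return value are assumptions made for illustration; the actual parsing in evaluators/index.mjs may differ.

```javascript
// Hypothetical parser for the REASON / SCORE / PASS format (not from the repo).
function parseJudgeOutput(text) {
  // Match each labeled field at the start of a line, case-insensitively.
  // [ \t]* (rather than \s*) keeps an empty field from swallowing the next line.
  const reason = /^REASON:[ \t]*(.+)$/im.exec(text);
  const score = /^SCORE:[ \t]*(\d{1,3})[ \t]*$/im.exec(text);
  const pass = /^PASS:[ \t]*(YES|NO)[ \t]*$/im.exec(text);

  // PASS is the primary verdict, so a missing or malformed PASS line
  // is reported as a parse failure instead of being guessed from prose.
  if (!pass) {
    return { ok: false, error: "missing or malformed PASS line" };
  }

  // SCORE is advisory: keep it only when it is an integer in [0, 100].
  let scoreValue = null;
  if (score) {
    const n = Number(score[1]);
    if (n >= 0 && n <= 100) scoreValue = n;
  }

  return {
    ok: true,
    pass: pass[1].toUpperCase() === "YES",
    score: scoreValue,
    reason: reason ? reason[1].trim() : null,
  };
}
```

Returning an explicit parse failure keeps malformed judge replies visible in the results rather than silently counting them as PASS or FAIL.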

Use Behavior-Based Criteria

Describe what counts as success and failure in concrete terms. Good judge prompts refer to observable properties of the response, not abstract claims about “goodness.”
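One way to keep criteria observable is to spell out the pass and fail conditions as explicit rules in the prompt. The builder below is a sketch under that assumption; `buildJudgePrompt`, its field names, and the example rules are all hypothetical and do not come from the templates in judges/.

```javascript
// Hypothetical prompt builder; every field name here is illustrative.
function buildJudgePrompt({ criterion, passWhen, failWhen, context, response }) {
  return [
    `You are evaluating one criterion only: ${criterion}.`,
    "",
    "The response PASSES if:",
    ...passWhen.map((rule) => `- ${rule}`),
    "",
    "The response FAILS if:",
    ...failWhen.map((rule) => `- ${rule}`),
    "",
    "Context:",
    context,
    "",
    "Response:",
    response,
    "",
    // Reasoning is requested before the verdict, matching the format above.
    "Reply in exactly this format:",
    "REASON: [one sentence]",
    "SCORE: [0-100]",
    "PASS: [YES or NO]",
  ].join("\n");
}

// Example: a faithfulness judge with concrete, checkable rules.
const prompt = buildJudgePrompt({
  criterion: "faithfulness to context",
  passWhen: ["every factual claim in the response is stated in or directly implied by the context"],
  failWhen: [
    "the response states a fact, number, or name that does not appear in the context",
    "the response contradicts the context",
  ],
  context: "The refund window is 30 days from delivery.",
  response: "You can request a refund within 30 days of delivery.",
});
```

Each rule names something a reader could verify by looking at the response and the context, which is the property the paragraph above asks for.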

Test Against Real Examples

Before trusting the judge:

run it on obvious positives
run it on obvious negatives
run it on edge cases
check whether its verdicts are stable across similar examples
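The checks above can be summarized numerically once the judge has been run on labeled examples. The sketch below assumes each example carries a human label (`expectedPass`) and the judge's PASS verdicts on several near-identical variants of that example (`verdicts`); the function name, field names, and sample data are made up for illustration.

```javascript
// Hypothetical calibration summary over already-collected judge verdicts.
function summarizeCalibration(examples) {
  let agree = 0;
  let falsePass = 0;
  let falseFail = 0;
  let unstable = 0;

  for (const ex of examples) {
    const yes = ex.verdicts.filter(Boolean).length;
    // Majority vote across variants; a tie counts as a fail.
    const majority = yes * 2 > ex.verdicts.length;
    // Unstable: the judge gave different verdicts on similar inputs.
    if (yes !== 0 && yes !== ex.verdicts.length) unstable += 1;

    if (majority === ex.expectedPass) agree += 1;
    else if (majority) falsePass += 1; // judge passed what a human failed
    else falseFail += 1; // judge failed what a human passed
  }

  const total = examples.length;
  return {
    total,
    agreement: total === 0 ? null : agree / total,
    falsePass,
    falseFail,
    unstable,
  };
}

// Illustrative data only: two obvious cases and two edge cases.
const summary = summarizeCalibration([
  { id: "pos-1", expectedPass: true, verdicts: [true, true, true] },
  { id: "neg-1", expectedPass: false, verdicts: [false, false, false] },
  { id: "edge-1", expectedPass: false, verdicts: [true, false, true] },
  { id: "edge-2", expectedPass: true, verdicts: [true, true, true] },
]);
// summary.agreement is 0.75, with one false pass and one unstable example.
```

Separating false passes from false fails matters because the two errors usually have different costs, and a high agreement rate alone can hide a judge that flips on small rewordings.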

Repo Files To Inspect

evaluators/index.mjs
run-eval.mjs
evals/
traces/

Anti-Patterns

Using a judge when deterministic scoring is possible.
Asking for unstructured natural-language reasoning only.
Combining too many criteria in one rubric.
Trusting a judge prompt without calibration.
Treating judge output as ground truth without validation.

