propose-judge-patch
Drafts a reviewable judge-template patch from evaluator validation disagreements.
| name | propose-judge-patch |
| description | Drafts a reviewable judge-template patch from evaluator validation disagreements. |
Use this after skills/validate-evaluator.md has produced a JSON validation report with disagreements.
node scripts/validate-evaluator.mjs labels/sample-goldens.json --json > reports/evaluator-validation.json
node scripts/propose-judge-patch.mjs reports/evaluator-validation.json --judge-template rag-quality --output reports/proposed-judge.patch
The script is deterministic. It does not call a model and does not edit judge templates directly. It reads disagreement reasons and human critiques, infers likely rubric gaps, and writes a patch file for human or agent review.
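The deterministic step described above can be pictured with a small sketch. Everything here is an assumption made for illustration, not the actual contents of `scripts/propose-judge-patch.mjs`: the report shape (`disagreements` entries with `reason` and `critique` fields), the `GAP_RULES` keyword table, and the function names `inferRubricGaps` and `renderProposedLines` are all hypothetical. It shows one way to map disagreement text to rubric gaps with plain string matching, with no model call and no writes to a judge template.

```javascript
// Hypothetical keyword-to-gap rules. The real script's heuristics are not
// shown in the source; these exist only to illustrate the idea.
const GAP_RULES = [
  { gap: "faithfulness", keywords: ["hallucinat", "unsupported", "not in context"] },
  { gap: "completeness", keywords: ["missing", "incomplete", "omits"] },
  { gap: "tone", keywords: ["tone", "rude", "too casual"] },
];

// Assumed report shape: { disagreements: [{ id, reason, critique }] }.
// Counts how many disagreements mention each rubric gap.
function inferRubricGaps(report) {
  const counts = new Map();
  for (const item of report.disagreements ?? []) {
    const text = `${item.reason ?? ""} ${item.critique ?? ""}`.toLowerCase();
    for (const rule of GAP_RULES) {
      if (rule.keywords.some((kw) => text.includes(kw))) {
        counts.set(rule.gap, (counts.get(rule.gap) ?? 0) + 1);
      }
    }
  }
  // Sort by count (descending), then by name, so repeated runs on the
  // same report always produce the same order.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([gap, count]) => ({ gap, count }));
}

// Render each gap as an added line, in the style of a diff hunk body,
// so a human or agent can review it before anything is applied.
function renderProposedLines(gaps) {
  return gaps
    .map(({ gap, count }) => `+ - Check ${gap} explicitly (${count} disagreement(s)).`)
    .join("\n");
}

// Made-up sample input, not real validation output.
const sampleReport = {
  disagreements: [
    { id: "q1", reason: "Judge passed an unsupported claim", critique: "Answer hallucinates a date." },
    { id: "q2", reason: "Judge failed a correct answer", critique: "Marked incomplete, but nothing is missing." },
    { id: "q3", reason: "Unsupported citation accepted", critique: "" },
  ],
};

console.log(renderProposedLines(inferRubricGaps(sampleReport)));
```

Because the mapping is a fixed table plus a stable sort, the same report always yields the same proposal, which is what makes the output reviewable as a patch rather than a one-off suggestion.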
Rerun scripts/validate-evaluator.mjs after applying any judge-template change.