validate-evaluator

Guides validation of evaluators, especially LLM judges, against labeled examples. Use when evaluator quality is uncertain, judge scores seem inconsistent, or you need to check whether the evaluator is biased, noisy, or misaligned.

Last updated: 29 April 2026 at 14:29

Source: itseffi/ai-product-evals (GitHub)

SKILL.md

name	validate-evaluator
description	Guides validation of evaluators, especially LLM judges, against labeled examples. Use when evaluator quality is uncertain, judge scores seem inconsistent, or you need to check whether the evaluator is biased, noisy, or misaligned.

Validate Evaluator

Overview

Treat the evaluator as a model that can fail.
Compare evaluator decisions against trusted labels or strong reference examples.
Measure false positives and false negatives separately.
Check whether the evaluator is biased toward verbosity, formatting, or certain providers.
Calibrate the evaluator before expanding its use.

Prerequisites

Collect labeled examples or high-confidence gold cases first. Inspect evaluators/index.mjs, especially judge-based paths, and read traces where the evaluator’s decision seems suspicious.

Core Instructions

Use The Repo Harness First

Run:

npm run skill:validate-evaluator

For a specific label file:

node scripts/validate-evaluator.mjs labels/sample-goldens.json

Use repeated runs when validating a judge:

node scripts/validate-evaluator.mjs labels/sample-goldens.json --repeat 5

Treat agreement, false positives, false negatives, disagreement samples, and stability as the primary output. A judge that flips verdicts on repeated calls is not calibrated, even if one run reports high agreement.
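
The stability idea above can be sketched as follows. This is a minimal illustration, assuming each case's verdicts from repeated judge calls are collected as an array of "pass"/"fail" strings. The function name `stabilityReport`, the data shape, and the definition of stability used here (share of cases whose verdict never flips) are assumptions for illustration, not the output format or formula of scripts/validate-evaluator.mjs.

```javascript
// Hypothetical data shape: case id -> verdicts from repeated judge calls.
// This is NOT the format produced by scripts/validate-evaluator.mjs.
function stabilityReport(verdictsByCase) {
  const unstable = [];
  let total = 0;
  for (const [caseId, verdicts] of Object.entries(verdictsByCase)) {
    total += 1;
    // A case counts as stable only if every repeat returned the same verdict.
    if (new Set(verdicts).size > 1) unstable.push(caseId);
  }
  const stability = total === 0 ? 0 : (total - unstable.length) / total;
  return { stability, unstable };
}

const report = stabilityReport({
  "case-1": ["pass", "pass", "pass", "pass", "pass"],
  "case-2": ["pass", "fail", "pass", "pass", "fail"],
  "case-3": ["fail", "fail", "fail", "fail", "fail"],
  "case-4": ["fail", "fail", "fail", "fail", "fail"],
});
console.log(report.stability, report.unstable); // 0.75 [ 'case-2' ]
```

Listing the unstable case ids, rather than only the ratio, makes it easier to read the traces for the cases where the judge flips.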

The validator exits nonzero when provider calls fail, parse failures occur, agreement or stability falls below its threshold, or drift exceeds the supplied baseline. Override the defaults with --min-agreement, --min-stability, --drift-baseline, --max-agreement-drop, and --max-stability-drop.
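
The exit conditions above can be pictured as a simple gate. The sketch below is a hedged illustration: the option names mirror the CLI flags, but the threshold values, the `metrics` and `baseline` shapes, and the function name `gateFailures` are all assumptions, and the real script may compute drift differently.

```javascript
// Illustrative gate only. Threshold values and the shapes of `metrics`
// and `baseline` are assumptions, not the validator's actual internals.
function gateFailures(metrics, options, baseline = null) {
  const failures = [];
  if (metrics.providerErrors > 0) failures.push("provider calls failed");
  if (metrics.parseFailures > 0) failures.push("parse failures occurred");
  if (metrics.agreement < options.minAgreement) failures.push("agreement below threshold");
  if (metrics.stability < options.minStability) failures.push("stability below threshold");
  if (baseline) {
    // Drift is treated here as the drop relative to a previously saved run.
    const agreementDrop = baseline.agreement - metrics.agreement;
    const stabilityDrop = baseline.stability - metrics.stability;
    if (agreementDrop > options.maxAgreementDrop) failures.push("agreement dropped too far from baseline");
    if (stabilityDrop > options.maxStabilityDrop) failures.push("stability dropped too far from baseline");
  }
  return failures;
}

const failures = gateFailures(
  { agreement: 0.85, stability: 0.88, providerErrors: 0, parseFailures: 0 },
  { minAgreement: 0.8, minStability: 0.8, maxAgreementDrop: 0.05, maxStabilityDrop: 0.05 },
  { agreement: 0.92, stability: 0.9 }
);
// A nonzero exit code signals that at least one check failed.
const exitCode = failures.length > 0 ? 1 : 0;
console.log(exitCode, failures);
```

In this example the run clears both absolute thresholds but still fails, because agreement fell by about 0.07 against the baseline. That is the case a drift check exists to catch.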

The validator also reports Cohen's kappa. Use kappa alongside raw agreement because raw agreement can look strong on imbalanced labels.
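
To see why raw agreement misleads on imbalanced labels, here is a minimal sketch of Cohen's kappa using the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). The function name `agreementAndKappa` and the sample data are illustrative; the validator's own implementation is not shown in this document and may differ in edge-case handling.

```javascript
// Cohen's kappa for two label sequences of equal length.
function agreementAndKappa(gold, predicted) {
  if (gold.length === 0 || gold.length !== predicted.length) {
    throw new Error("gold and predicted must be non-empty and the same length");
  }
  const n = gold.length;
  let matches = 0;
  for (let i = 0; i < n; i++) {
    if (gold[i] === predicted[i]) matches += 1;
  }
  const observed = matches / n;

  // Chance agreement: sum over labels of P(gold = label) * P(predicted = label).
  const labels = new Set([...gold, ...predicted]);
  let expected = 0;
  for (const label of labels) {
    const goldShare = gold.filter((x) => x === label).length / n;
    const predictedShare = predicted.filter((x) => x === label).length / n;
    expected += goldShare * predictedShare;
  }

  // Kappa is undefined when chance agreement is 1; this sketch returns 0 then.
  const kappa = expected === 1 ? 0 : (observed - expected) / (1 - expected);
  return { agreement: observed, kappa };
}

// 90 passes and 10 fails in the gold labels; the judge says "pass" every time.
const gold = [...Array(90).fill("pass"), ...Array(10).fill("fail")];
const alwaysPass = Array(100).fill("pass");
const result = agreementAndKappa(gold, alwaysPass);
console.log(result); // agreement is 0.9, kappa is 0
```

A judge that never fails anything scores 90% agreement here, yet its kappa is 0, meaning it does no better than chance given the label distribution.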

Use a judge panel when one model is too noisy:

node scripts/validate-evaluator.mjs labels/sample-goldens.json --judge-panel openai:gpt-5.5,anthropic:claude-haiku-4-5
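
This document does not say how the panel combines verdicts. One common approach is a majority vote, sketched below as an assumption rather than a description of what --judge-panel does. The judge ids and the "undecided" outcome for ties are hypothetical.

```javascript
// Majority vote over per-judge verdicts. The aggregation rule is an
// assumption; the repo's --judge-panel option may combine votes differently.
function panelVerdict(votes) {
  const counts = new Map();
  for (const verdict of Object.values(votes)) {
    counts.set(verdict, (counts.get(verdict) || 0) + 1);
  }
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { verdict: "undecided", unanimous: false };
  // A tie for first place is surfaced instead of being broken arbitrarily.
  const tied = ranked.length > 1 && ranked[0][1] === ranked[1][1];
  return {
    verdict: tied ? "undecided" : ranked[0][0],
    unanimous: ranked.length === 1,
  };
}

// Hypothetical judge ids, for illustration only.
const threeJudges = panelVerdict({ "judge-a": "pass", "judge-b": "pass", "judge-c": "fail" });
const twoJudges = panelVerdict({ "judge-a": "pass", "judge-b": "fail" });
console.log(threeJudges, twoJudges);
```

With an even number of judges, as in the two-model command above, ties are possible, so it helps to decide up front whether a split counts as a fail or goes to human review.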

Run the public synthetic bias checks before trusting pairwise judges:

npm run skill:judge-bias-check

Build A Validation Set

Use:

human-labeled examples
obvious positive examples
obvious negative examples
edge cases that are hard but still interpretable

The validation set should include both passes and fails.

Measure Error Types Separately

Do not rely on one aggregate accuracy value alone. Check:

false positives
false negatives
consistency on repeated or near-duplicate cases
disagreement patterns by capability area
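
The checks above can be sketched as a per-area confusion count. This is a minimal illustration, assuming each row carries a gold label, a judge verdict, and a capability area, and that "pass" is the positive class. The field names `area`, `gold`, and `judge`, and the function name `errorBreakdown`, are hypothetical and not the schema of the repo's label files.

```javascript
// Hypothetical row shape: { area, gold, judge } with "pass"/"fail" values.
// False positive = the judge passes a case the gold label marks as a fail.
function errorBreakdown(rows) {
  const byArea = {};
  for (const { area, gold, judge } of rows) {
    if (!byArea[area]) byArea[area] = { tp: 0, fp: 0, tn: 0, fn: 0 };
    const bucket = byArea[area];
    if (gold === "pass" && judge === "pass") bucket.tp += 1;
    else if (gold === "fail" && judge === "pass") bucket.fp += 1;
    else if (gold === "fail" && judge === "fail") bucket.tn += 1;
    else bucket.fn += 1;
  }
  for (const bucket of Object.values(byArea)) {
    const negatives = bucket.fp + bucket.tn;
    const positives = bucket.fn + bucket.tp;
    // Rates are null when an area has no examples of that gold class.
    bucket.falsePositiveRate = negatives === 0 ? null : bucket.fp / negatives;
    bucket.falseNegativeRate = positives === 0 ? null : bucket.fn / positives;
  }
  return byArea;
}

const breakdown = errorBreakdown([
  { area: "retrieval", gold: "pass", judge: "pass" },
  { area: "retrieval", gold: "fail", judge: "pass" },
  { area: "retrieval", gold: "fail", judge: "pass" },
  { area: "tone", gold: "pass", judge: "fail" },
  { area: "tone", gold: "fail", judge: "fail" },
]);
console.log(breakdown);
```

Overall accuracy on these five rows is 40%, which hides the pattern: the judge is too lenient on retrieval cases and too strict on tone cases.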

Look For Systematic Bias

Check whether the evaluator:

rewards verbosity
changes its answer when response order is swapped
over-penalizes formatting differences
prefers one provider’s style
mistakes plausible hallucinations for grounded answers
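
The order-swap check in the list above can be sketched as follows. The example assumes a pairwise judge is a function that takes two responses and returns "A" when the first argument wins and "B" otherwise. The toy judges `alwaysFirst` and `longerWins` are stand-ins for illustration, not real model calls, and this is separate from what npm run skill:judge-bias-check does.

```javascript
// Fraction of pairs where the judge picks a different underlying response
// after the two responses are presented in the opposite order.
function positionFlipRate(judge, pairs) {
  let inconsistent = 0;
  for (const [a, b] of pairs) {
    const forwardWinner = judge(a, b) === "A" ? a : b;
    const swappedWinner = judge(b, a) === "A" ? b : a;
    if (forwardWinner !== swappedWinner) inconsistent += 1;
  }
  return pairs.length === 0 ? 0 : inconsistent / pairs.length;
}

// Toy judges (assumptions for illustration only).
const alwaysFirst = () => "A"; // pure position bias
const longerWins = (x, y) => (x.length >= y.length ? "A" : "B"); // verbosity bias

const pairs = [
  ["Paris.", "The capital of France is Paris, a city on the Seine."],
  ["42", "The answer is 42."],
];
const firstRate = positionFlipRate(alwaysFirst, pairs);
const longerRate = positionFlipRate(longerWins, pairs);
console.log(firstRate, longerRate); // 1 0
```

The second judge passes the swap check while still rewarding length, so a clean swap result does not rule out the other biases in the list. Each one needs its own probe.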

Calibrate Before Production Use

If the evaluator is unreliable on labeled examples, fix the rubric or the parsing before using it to score more data.

Use suite-specific judge templates from judges/ when possible. Keep generic criteria as a fallback only. For RAG calibration sets, use rag-quality unless you are intentionally testing a narrower relationship such as rag-faithfulness.

Repo Files To Inspect

evaluators/index.mjs
run-eval.mjs
evals/
traces/
app.html
labels/
judges/
scripts/validate-evaluator.mjs

Anti-Patterns

Assuming the evaluator is correct because it is an LLM.
Reporting only one overall accuracy number.
Calibrating on cherry-picked easy cases.
Ignoring false positives or false negatives.
Expanding judge usage before validating it on labeled data.
