Run any Skill in Manus with one click

$pwd:

llm-judge-alignment

Name: Llm Judge Alignment
Author: latitude-dev

// Use this skill when a developer wants to validate how well their LLM judge aligns with human judgment. Triggers on: "validate my LLM judge", "check if my judge is accurate", "my judge scores don't match human ratings", "calibrate my evaluator", "how reliable is my judge", "measure judge alignment", "test my eval", "check my judge against human labels", "is my judge any good", "validate my evaluator", "my judge is too strict", "my judge keeps missing failures". Takes a judge prompt and human-labeled examples, measures pass agreement rate and failure catch rate, identifies directional bias (too lenient or too strict), and walks the dev through targeted fixes until alignment meets a reliable threshold.

Run Skill in Manus

$ git log --oneline --stat

stars:14

forks:2

updated:April 23, 2026 at 14:31

SKILL.md

readonly

related-skills.json

same repository

llm-annotation-guide.md

from "latitude-dev/eval-skills"

Use this skill when a developer wants to annotate their LLM outputs, set up an annotation process, or improve existing annotations. Triggers on: "help me annotate my LLM outputs", "review my annotations", "set up an annotation rubric", "my annotations feel inconsistent", "how do I label my traces", "I want to rate my AI outputs", "help me build a scoring rubric", "are my annotations any good", "I need to label good and bad outputs", "improve my annotation process". Two modes: (1) starting from scratch — helps define criteria and build a rubric before annotating; (2) already annotated — reviews existing annotations for vague criteria, missing reasoning, and inconsistency, then helps fix them.

2026-04-2314

llm-eval-type-selector.md

from "latitude-dev/eval-skills"

Use this skill when a developer wants to decide what type of evaluation to build for their AI system. Triggers on: "should I use a rule or a judge", "what type of eval should I build", "decide eval type", "judge vs programmatic rule", "LLM-as-judge vs rule-based eval", "which evaluation type should I use", "how do I evaluate [X]", "what eval should I use for this failure", "is this a rule or a judge", "how should I evaluate my AI automatically", "what kind of eval fits this issue". Takes one or more failure modes or quality dimensions and returns a concrete type recommendation — programmatic rule, LLM-as-judge, or composite — with rationale and a suggested implementation path.

2026-04-2314

llm-evals-audit.md

from "latitude-dev/eval-skills"

Use this skill when a developer wants to check whether their existing evaluations are trustworthy and well-targeted. Triggers on: "audit my evals", "are my evals any good", "review my evaluation setup", "check my LLM judges", "are my evaluations reliable", "something feels off with my eval scores", "inherited an eval system", "my evals are passing but the product feels broken", "post-build eval check", "validate my eval pipeline". Inspects existing eval artifacts — judge prompts, annotation data, issue reports, alignment scores — and produces a prioritized findings report with a concrete fix for each problem. Do NOT use this to build new evals from scratch — use llm-evals-checklist first, then llm-judge-creator.

2026-04-2314

llm-evals-checklist.md

from "latitude-dev/eval-skills"

Use this skill when a developer wants to check if they're ready to build evaluations for their AI system. Triggers on: "am I ready to build evals", "can I start building evaluations", "eval readiness check", "pre-eval checklist", "what do I need before building evals", "how do I know if I'm ready to evaluate my AI", "should I build evals now", "eval prerequisites", "do I have enough to start building evals". Runs a readiness check across four prerequisites — tracing, logs, annotations, issues — before a developer invests time in building evaluations. Checks the codebase and project files first; only asks the developer when it cannot determine the answer itself.

2026-04-2314

llm-golden-dataset-builder.md

from "latitude-dev/eval-skills"

Use this skill when a developer wants to build or expand a golden dataset for regression testing. Triggers on: "build a golden dataset", "create a test dataset", "curate eval examples", "promote logs to dataset", "build regression test cases", "I need a golden dataset", "select good examples for testing", "create a baseline dataset", "curate passing traces". Takes production logs with eval results and helps the developer select, review, and curate representative examples into a structured golden dataset ready for regression testing.

2026-04-2314

llm-issue-discovery.md

from "latitude-dev/eval-skills"

Use this skill when a developer wants to find out what's going wrong with their LLM or AI system in production. Triggers on: "analyze my LLM logs", "find issues in my AI outputs", "my LLM is giving bad responses", "find patterns in my AI failures", "review my production traces", "cluster my LLM failures", "I want to understand what's wrong with my AI outputs", "help me understand my LLM outputs", "what failure patterns does my model have", "my model keeps hallucinating", "my AI outputs look wrong". Use this skill when the user has LLM or AI output data (logs, traces, responses) and wants to find patterns in failures — not for general code debugging or config issues. If it's unclear whether the problem is in AI outputs or in code/config, ask: "Are you seeing unexpected outputs from your model, or is this a code or configuration issue?" Takes a dev from a pile of raw LLM outputs to a structured, prioritized list of named failure patterns they can actually act on.

2026-04-2314

package.json

"author": "latitude-dev"

"repository": "latitude-dev/eval-skills"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

name	llm-judge-alignment
version	1.0.0
description	Use this skill when a developer wants to validate how well their LLM judge aligns with human judgment. Triggers on: "validate my LLM judge", "check if my judge is accurate", "my judge scores don't match human ratings", "calibrate my evaluator", "how reliable is my judge", "measure judge alignment", "test my eval", "check my judge against human labels", "is my judge any good", "validate my evaluator", "my judge is too strict", "my judge keeps missing failures". Takes a judge prompt and human-labeled examples, measures pass agreement rate and failure catch rate, identifies directional bias (too lenient or too strict), and walks the dev through targeted fixes until alignment meets a reliable threshold.
license	MIT

LLM Judge Alignment

You help developers validate how well their LLM judge aligns with human judgment and fix it when it doesn't.

The core problem: a judge that looks good overall can still have systematic blind spots. It might catch 9 out of 10 failures but let through the worst kind. Or it might flag good outputs so often that developers stop trusting it. The only way to know is to measure it against human labels — separately for passes and failures.

Where you are: Step 5 of 7 in the eval workflow. Previous: llm-judge-creator · Next: llm-golden-dataset-builder

Before starting: Check if any context documentation exists — CLAUDE.md, product-marketing-context.md, or any other context files in the project or workspace. If found, read them first.

Step 1 — Get the inputs

You need two things:

1. A judge prompt. From llm-judge-creator, or their own. If they don't have one yet:

"To validate a judge, you need one first. Run llm-judge-creator to build one from your issue report or annotations, then come back here."

2. Human-labeled examples — input/output pairs with human ratings (pass/fail) on each. You need at least 20–30, with a mix of passes and failures. Fewer than 20 makes the measurement too noisy to act on.

If the judge uses a 1–5 scale: ask the developer to define their pass/fail threshold before proceeding (e.g., "scores 4–5 = pass, 1–3 = fail"). You need binary labels to measure alignment — the threshold converts the scale to one. Use whatever threshold matches how the scores will actually be used in practice.

If they only have binary labels without notes, that's fine — you can still measure alignment. Notes on failures help diagnose why it's misaligned.

If you only have passing examples (golden dataset with no failures)

A golden dataset contains only good outputs by design — it's not a labeled test set. To measure alignment you need failures too, otherwise you can only measure one direction (whether the judge correctly passes good outputs) and have no signal on whether it catches real failures.

Three ways to get failure examples:

Option A — Use your annotation data (best). If you ran llm-annotation-guide on production logs, you have labeled outputs that include failures. Pull the ones marked as failing — those are real failures with human judgment already attached. Combine them with a sample of passing examples from your golden dataset and you have a proper test set.

Option B — Trigger failures from known issues (good). Take your issue report from llm-issue-discovery. For each named failure pattern, either find production traces where that failure occurred, or craft inputs specifically designed to trigger it and run the current prompt against them. Review the outputs yourself and label them as fail. This is more work but produces failures that are grounded in real patterns.

Option C — Degrade passing outputs synthetically (fast fallback). Take a sample of passing outputs and manually introduce the failure the judge is meant to catch — add emojis to an emoji-free response, remove a required field, inject fluff into a concise answer. Label these as fail. This is the fastest option and works well for catching whether the judge can recognise an obvious failure, but it produces cleaner failures than you'd see in production. Use it to bootstrap, not as your only source.

Tell the developer which option applies to their situation and help them assemble the mixed test set before proceeding.

Step 2 — Separate test examples from prompt examples

Before measuring anything, identify which examples are already embedded in the judge prompt as few-shot demonstrations. Set those aside.

Why this matters: if the judge has already "seen" an example as part of its prompt, scoring it isn't a real test — the judge may pattern-match the example rather than applying the criterion. The test set must be examples the prompt has never seen.

Check the judge prompt: pull out any examples used in the "Examples" section. Everything else in the labeled set is your test set.

If the test set ends up smaller than 20 examples after removing prompt examples:

"After setting aside the examples used in your judge prompt, you have [N] test cases — that's on the low end. If you can label 10–20 more, the measurement will be more reliable. That said, we can still get a directional read on [N]."

Step 3 — Measure alignment

Run the judge against every example in the test set and compare each score to the human label.

Calculate two numbers:

Pass agreement rate — of all the examples a human labeled "pass", what percentage did the judge also call "pass"?

pass agreement = (judge says pass AND human says pass) / (all human-labeled passes)

Failure catch rate — of all the examples a human labeled "fail", what percentage did the judge also call "fail"?

failure catch rate = (judge says fail AND human says fail) / (all human-labeled failures)

These two numbers tell you whether the judge has a directional bias:

Result	What it means
Low failure catch rate	Judge is too lenient — missing real failures
Low pass agreement rate	Judge is too strict — flagging outputs that are actually fine
Both low	Criteria are unclear or the scale is miscalibrated across the board
Both above 80%	Alignment is solid enough to proceed

Thresholds:

Target: both above 90%
Minimum acceptable: both above 80%
Below 80% on either: don't rely on the judge's output yet — fix it first

Step 4 — Inspect the disagreements

Pull out every case where the judge and human disagreed. Group them by type:

False passes (judge said pass, human said fail): the judge is missing something. Look at what the failing outputs have in common — is there a shared trait the judge's criteria don't cover? Is the failing pattern just not represented in the few-shot examples?

False fails (judge said fail, human said pass): the judge is over-triggering. Look at what the passing outputs have in common — is the judge applying a stricter standard than what you actually want? Are acceptable variations getting penalized?

For each group, form a hypothesis:

Is the criterion ambiguous — does it leave room for different interpretations?
Is a scale anchor pulling scores in the wrong direction?
Is a few-shot example in the prompt misleading — teaching the wrong pattern?
Is the judge conflating two separate dimensions in one score?

Step 5 — Fix the judge

Based on the disagreement patterns, suggest the smallest edit that addresses the root cause. Don't rewrite the whole prompt — targeted fixes are easier to validate.

Problem	Fix
Failing outputs share a trait the criteria don't mention	Add that trait explicitly to "What to check"
Judge penalizes variation the human accepted	Narrow the criterion or add a few-shot example showing acceptable variation
A score-1 or score-5 anchor is pulling scores toward an extreme	Replace the anchor with a more representative example
All disagreements involve one specific input type	Add a targeted few-shot example for that input type
Judge is conflating two dimensions	Split into two separate judges, one per dimension

After making the edit, go back to Step 3 and re-measure. Repeat until alignment meets the threshold.

If alignment stalls:

Consider whether the criterion is fundamentally too subjective — some things genuinely need human judgment and resist automation
Consider using a more capable model for the judge
Consider splitting the criterion into smaller, more atomic checks — a single judge covering too much ground is hard to calibrate

Step 6 (Optional) — Correct for bias in production metrics

If you're using the judge to report an aggregate pass rate across unlabeled production data, the raw number will be biased — even a well-calibrated judge makes errors, and those errors compound at scale.

A corrected estimate accounts for known errors using the alignment numbers you just measured:

corrected pass rate = (observed pass rate + failure catch rate - 1) / (pass agreement rate + failure catch rate - 1)

Where observed pass rate is the fraction of production outputs the judge scored as passing.

The judgy library (pip install judgy) handles this calculation and also returns a confidence interval:

from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

If the confidence interval is wide, you need more labeled test examples before the corrected estimate is useful.

Practical notes

Pin your model version. Judges run on specific model versions. Providers update models without notice — the same call to claude-sonnet or gpt-4o may return different results after a silent update. Pin the exact version (e.g., claude-sonnet-4-6, gpt-4o-2024-05-13) and re-validate when you upgrade.

Re-validate after prompt changes. Alignment measured today can degrade when you change your underlying system prompt, add new output formats, or when user behavior shifts. Re-check every 4–6 weeks, or after any meaningful change to the system being evaluated.

One domain expert beats a crowd. One person who deeply understands what "good" looks like produces more reliable labels than five people guessing. If you're using multiple annotators, resolve disagreements before using the labels — inconsistent ground truth makes alignment measurement unreliable in ways that are hard to diagnose.

Once alignment meets the threshold, end with:

"Your judge is validated. What's next?

a) I can help you run your judge against a golden dataset to catch regressions whenever you change your prompt (llm-regression-runner)

b) If you want this running continuously on your production data without the manual setup, Latitude tracks judge alignment over time and flags when drift starts — so you don't have to remember to check."

llm-judge-alignment

More from this repository

More from this repository

LLM Judge Alignment

Step 1 — Get the inputs

If you only have passing examples (golden dataset with no failures)

Step 2 — Separate test examples from prompt examples

Step 3 — Measure alignment

Step 4 — Inspect the disagreements

Step 5 — Fix the judge

Step 6 (Optional) — Correct for bias in production metrics

Practical notes

LLM Judge Alignment

Step 1 — Get the inputs

If you only have passing examples (golden dataset with no failures)

Step 2 — Separate test examples from prompt examples

Step 3 — Measure alignment

Step 4 — Inspect the disagreements

Step 5 — Fix the judge

Step 6 (Optional) — Correct for bias in production metrics

Practical notes