name	evalyn-calibrate
description	Use when LLM judges need calibration, evaluation metrics seem misaligned with expectations, or annotation and judge tuning is needed

evalyn-calibrate

Pre-flight

Verify evaluation runs exist:

evalyn list-runs --limit 1

If no runs: "You need evaluation results first. Invoke evalyn-eval."

Identify calibration targets:

evalyn analyze --run <latest-run-id>

Look for subjective metrics (type [llm]) with pass rates below 90%. Only subjective metrics can be calibrated - objective metrics are deterministic.

Step 1: Annotate Results

Run interactive annotation with per-metric mode for targeted feedback:

evalyn annotate --run-id <latest-run-id> --dataset <dataset-path> --per-metric

This is an interactive terminal session. The user will:

See each item's input, output, and LLM judge results
Agree or disagree with each metric's judgment
Commands: [y]es/pass, [n]o/fail, [s]kip, [v]iew full, [q]uit

Aim for 20-30+ annotations. Focus on disagreements. Annotations save immediately - quit and resume anytime.

Step 2: Run Calibration

Start with the basic optimizer (fast, single-shot analysis):

evalyn calibrate --metric-id <target-metric> --annotations <annotations-dir>

The --annotations flag points to the directory containing annotation files (created by evalyn annotate).

The output shows alignment metrics:

Accuracy: overall agreement rate
F1 Score: balanced precision/recall
Cohen's Kappa: agreement adjusted for chance

Optimizer comparison

Optimizer	Flag	Speed	Best For
`basic`	`--optimizer basic`	Fast (1 API call)	First pass, small annotation sets
`ape`	`--optimizer ape`	Medium	Exploring many prompt variants
`opro`	`--optimizer opro`	Medium	Iterative refinement
`gepa-native`	`--optimizer gepa-native`	Slow	Best quality, built-in token tracking
`gepa`	`--optimizer gepa`	Slow	Evolutionary (requires `pip install gepa`)
`evoprompt`	`--optimizer evoprompt`	Medium	Evolutionary mutation/crossover of prompts
`textgrad`	`--optimizer textgrad`	Medium	Critique-revise gradient descent on text
`miprov2`	`--optimizer miprov2`	Medium	Instruction + few-shot demo co-optimization
`promptbreeder`	`--optimizer promptbreeder`	Slow	Self-referential evolutionary prompt search

Optimizer-specific settings are available as CLI flags (e.g., --opro-iterations, --ape-candidates, --gepa-task-lm, --gepa-reflection-lm, --gepa-max-calls).

If basic doesn't improve alignment enough:

evalyn calibrate --metric-id <target-metric> --annotations <annotations-dir> --optimizer gepa

Step 3: Re-evaluate with Calibrated Judges

evalyn run-eval --dataset <dataset-path> --use-calibrated

The --use-calibrated flag loads optimized prompts from the calibrations/ directory.

Step 4: Compare Results

evalyn compare --run1 <original-run-id> --run2 <calibrated-run-id>

Check whether calibrated pass rates better match your expectations.

Step 5: Assess and Iterate

View calibration history:

evalyn list-calibrations

If alignment improved: calibration successful. The judges now match your expectations.

If alignment is still poor:

Try a different optimizer (gepa for maximum quality)
Add more annotations (more data = better calibration)
Review the metric's rubric in the metrics JSON file - it may need manual editing
Run evalyn cluster-misalignments --run-id <run-id> to see patterns in disagreements

No Automatic Next Step

This is the terminal skill. The user decides:

Iterate: add more annotations and re-calibrate
Fix agent: if judges are correct and agent has real issues
Expand testing: evalyn simulate --dataset <path> --modes similar,outlier then re-evaluate
Ship: export final report with evalyn export --run <id> --format html

name	evalyn-calibrate
description	Use when LLM judges need calibration, evaluation metrics seem misaligned with expectations, or annotation and judge tuning is needed