evalyn-calibrate
// Use when LLM judges need calibration, evaluation metrics seem misaligned with expectations, or annotation and judge tuning is needed
| name | evalyn-calibrate |
| description | Use when LLM judges need calibration, evaluation metrics seem misaligned with expectations, or annotation and judge tuning is needed |
evalyn list-runs --limit 1
If no runs: "You need evaluation results first. Invoke evalyn-eval."
evalyn analyze --run <latest-run-id>
Look for subjective metrics (type [llm]) with pass rates below 90%. Only subjective metrics can be calibrated; objective metrics are deterministic.
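The selection rule above can be sketched in a few lines of Python. This is only an illustration: the `MetricSummary` record, its field names, and the sample values are hypothetical and do not reflect the actual output format of evalyn analyze.

```python
from dataclasses import dataclass


@dataclass
class MetricSummary:
    # Hypothetical record; evalyn's real analysis output may be shaped differently.
    metric_id: str
    kind: str         # "llm" for subjective judges, "objective" otherwise (assumed labels)
    pass_rate: float  # fraction of items that passed, in [0, 1]


def calibration_candidates(metrics, threshold=0.90):
    """Return subjective metrics below the threshold, lowest pass rate first."""
    picked = [m for m in metrics if m.kind == "llm" and m.pass_rate < threshold]
    return sorted(picked, key=lambda m: m.pass_rate)


# Made-up sample data for illustration only.
metrics = [
    MetricSummary("helpfulness", "llm", 0.72),
    MetricSummary("latency_ms", "objective", 0.55),  # objective: never a candidate
    MetricSummary("tone", "llm", 0.95),              # already above the threshold
    MetricSummary("faithfulness", "llm", 0.88),
]
candidates = [m.metric_id for m in calibration_candidates(metrics)]
print(candidates)  # ['helpfulness', 'faithfulness']
```

Sorting by pass rate puts the most misaligned judge first, which is usually the one worth annotating first.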
Run interactive annotation with per-metric mode for targeted feedback:
evalyn annotate --run-id <latest-run-id> --dataset <dataset-path> --per-metric
This is an interactive terminal session. The user will:
[y]es/pass, [n]o/fail, [s]kip, [v]iew full, [q]uit
Aim for 20-30+ annotations. Focus on disagreements. Annotations are saved immediately, so you can quit and resume at any time.
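To make "disagreements" concrete, here is a minimal Python sketch that compares human pass/fail annotations with judge verdicts. It computes the raw agreement rate and Cohen's kappa, and lists the items where the two differ. The input lists and the `alignment` helper are assumptions for illustration; they are not evalyn's API, and evalyn may report different alignment metrics.

```python
def alignment(human, judge):
    """Compare paired boolean verdicts (True = pass) from a human and a judge."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    n = len(human)
    # Fraction of items where both raters gave the same verdict.
    agree = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's marginal pass rate.
    ph, pj = sum(human) / n, sum(judge) / n
    expected = ph * pj + (1 - ph) * (1 - pj)
    # Cohen's kappa; defined as 1.0 when both raters are constant and identical.
    kappa = 1.0 if expected == 1 else (agree - expected) / (1 - expected)
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    return {"agreement": agree, "kappa": kappa, "disagreements": disagreements}


# Made-up verdicts for eight items.
human = [True, True, False, False, True, False, True, True]
judge = [True, False, False, True, True, False, True, True]
result = alignment(human, judge)
print(result["agreement"])        # 0.75
print(round(result["kappa"], 4))  # 0.4667
print(result["disagreements"])    # [1, 3]
```

Kappa corrects for agreement that would occur by chance, so it is more informative than raw agreement when most items pass (or most fail).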
Start with the basic optimizer (fast, single-shot analysis):
evalyn calibrate --metric-id <target-metric> --annotations <annotations-dir>
The --annotations flag points to the directory containing annotation files (created by evalyn annotate).
The output shows alignment metrics:
| Optimizer | Flag | Speed | Best For |
|---|---|---|---|
| basic | --optimizer basic | Fast (1 API call) | First pass, small annotation sets |
| ape | --optimizer ape | Medium | Exploring many prompt variants |
| opro | --optimizer opro | Medium | Iterative refinement |
| gepa-native | --optimizer gepa-native | Slow | Best quality, built-in token tracking |
| gepa | --optimizer gepa | Slow | Evolutionary (requires pip install gepa) |
| evoprompt | --optimizer evoprompt | Medium | Evolutionary mutation/crossover of prompts |
| textgrad | --optimizer textgrad | Medium | Critique-revise gradient descent on text |
| miprov2 | --optimizer miprov2 | Medium | Instruction + few-shot demo co-optimization |
| promptbreeder | --optimizer promptbreeder | Slow | Self-referential evolutionary prompt search |
Optimizer-specific settings are available as CLI flags (e.g., --opro-iterations, --ape-candidates, --gepa-task-lm, --gepa-reflection-lm, --gepa-max-calls).
If basic doesn't improve alignment enough:
evalyn calibrate --metric-id <target-metric> --annotations <annotations-dir> --optimizer gepa
evalyn run-eval --dataset <dataset-path> --use-calibrated
The --use-calibrated flag loads optimized prompts from the calibrations/ directory.
evalyn compare --run1 <original-run-id> --run2 <calibrated-run-id>
Check whether calibrated pass rates better match your expectations.
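One way to make this check concrete is to compare each run's pass rate against the pass rate you expect, for example the fraction of items you marked as passing during annotation. The sketch below assumes per-metric pass rates are available as plain dictionaries; the `expectation_gap` helper and all numbers are hypothetical and are not part of evalyn.

```python
def expectation_gap(expected, original, calibrated):
    """For each metric, measure how far each run's pass rate is from the expected one."""
    rows = {}
    for metric, target in expected.items():
        before = abs(original[metric] - target)
        after = abs(calibrated[metric] - target)
        rows[metric] = {
            "before": round(before, 4),
            "after": round(after, 4),
            "closer": after < before,  # True if calibration moved the judge toward the target
        }
    return rows


# Made-up pass rates for illustration only.
expected = {"helpfulness": 0.85, "faithfulness": 0.80}    # e.g. from human annotations
original = {"helpfulness": 0.72, "faithfulness": 0.88}    # run before calibration
calibrated = {"helpfulness": 0.81, "faithfulness": 0.86}  # run with calibrated prompts
gaps = expectation_gap(expected, original, calibrated)
print(gaps["helpfulness"])   # {'before': 0.13, 'after': 0.04, 'closer': True}
print(gaps["faithfulness"])  # {'before': 0.08, 'after': 0.06, 'closer': True}
```

A higher pass rate is not automatically better: a judge that was too lenient should pass fewer items after calibration, which is why the sketch measures distance from the expected rate, not the raw change.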
View calibration history:
evalyn list-calibrations
If alignment improved: calibration successful. The judges now match your expectations.
If alignment is still poor:
Try another optimizer (gepa for maximum quality)
Run evalyn cluster-misalignments --run-id <run-id> to see patterns in disagreements
This is the terminal skill. The user decides:
evalyn simulate --dataset <path> --modes similar,outlier then re-evaluate
evalyn export --run <id> --format html