evalyn-analyze
// Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance
| name | evalyn-analyze |
| description | Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance |
Analyze evaluation results progressively: summary, insights, failure clustering, and trend analysis. Interpret findings and recommend next actions based on pass rates.
Verify evaluation runs exist:
evalyn list-runs --limit 3
If no runs exist, tell the user: "You need to run an evaluation first. Invoke evalyn-eval."
Identify the latest run ID from the output.
evalyn analyze --run <run-id>
This shows:
You can also use a short ID (the first 8 characters of the run ID) in place of the full run ID.
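As a minimal sketch of the short-ID convention, the helper below truncates a full run ID to its first 8 characters and builds the argument list for `evalyn analyze`. The function names and the sample run ID are hypothetical, introduced only for illustration; the `--run` flag and the 8-character rule are the only parts taken from the text.

```python
def short_run_id(run_id: str, length: int = 8) -> str:
    """Return the first `length` characters of a run ID (hypothetical helper)."""
    run_id = run_id.strip()
    if len(run_id) < length:
        raise ValueError(f"run ID must have at least {length} characters")
    return run_id[:length]


def analyze_command(run_id: str) -> list[str]:
    """Build the argv list for `evalyn analyze --run <short-id>`."""
    return ["evalyn", "analyze", "--run", short_run_id(run_id)]


if __name__ == "__main__":
    # The run ID below is made up for the example.
    print(" ".join(analyze_command("3f9a1c7e-52b4-4d0e-9a6b-1c2d3e4f5a6b")))
```

Returning an argument list rather than a single string keeps the command safe to pass to a process launcher without shell quoting.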
evalyn insights --run <run-id>
Provides diagnostic and prescriptive analysis:
If multiple runs exist (check evalyn list-runs output):
evalyn compare --run1 <previous-run-id> --run2 <latest-run-id>
For longer history across all runs in a dataset:
evalyn trend --project <project-name>
Note: evalyn compare takes the --run1 and --run2 flags, not positional arguments.
If any metric has a pass rate below 90%:
evalyn cluster-failures --run-id <run-id>
This clusters failed items by failure reason, revealing patterns (e.g., "all failures involve long inputs" or "failures cluster around a specific topic").
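The 90% trigger for failure clustering can be sketched as a simple check. This is an illustration under stated assumptions: the per-metric pass rates are supplied as a plain dictionary of percentages, and the metric names are invented; how the rates are read from evalyn output is not covered here.

```python
CLUSTER_THRESHOLD = 90.0  # percent; from the text: cluster when any metric is below 90%


def metrics_below_threshold(
    pass_rates: dict[str, float], threshold: float = CLUSTER_THRESHOLD
) -> list[str]:
    """Return the names of metrics whose pass rate is strictly below the threshold."""
    return sorted(name for name, rate in pass_rates.items() if rate < threshold)


def should_cluster(pass_rates: dict[str, float]) -> bool:
    """True when at least one metric falls below the clustering threshold."""
    return bool(metrics_below_threshold(pass_rates))


if __name__ == "__main__":
    # Hypothetical metric names and values, for illustration only.
    rates = {"helpfulness": 97.0, "groundedness": 88.0, "accuracy": 90.0}
    print(metrics_below_threshold(rates))
    print(should_cluster(rates))
```

A metric sitting at exactly 90% does not trigger clustering in this sketch, since the text says "below 90%".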
Based on the results, recommend next action:
| Overall Pass Rate | Interpretation | Recommendation |
|---|---|---|
| Above 95% | Agent performing well | Consider evalyn simulate --dataset <path> --modes similar,outlier for edge case testing. Export report: evalyn export --run <id> --format html |
| 80-95% | Moderate issues | Review failing items. The cause could be agent issues or judge misalignment. Consider invoking evalyn-calibrate to verify that the judges are accurate. |
| Below 80% | Significant issues | Invoke evalyn-calibrate to annotate items and check if judges agree with human expectations. Fix agent if judges are correct. |
Key distinction to communicate: low pass rates can mean either (a) the agent is actually performing poorly, or (b) the LLM judges are too strict or misaligned. Calibration determines which.
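The recommendation table can be expressed as a small lookup, shown here as a sketch rather than anything evalyn ships. The function name is hypothetical and the pass rate is assumed to be a percentage between 0 and 100. Following the table's wording, exactly 95% falls into the 80-95% band and exactly 80% counts as moderate.

```python
def recommend(pass_rate: float) -> tuple[str, str]:
    """Map an overall pass rate (percent) to (interpretation, recommendation)."""
    if not 0.0 <= pass_rate <= 100.0:
        raise ValueError("pass_rate must be a percentage between 0 and 100")
    if pass_rate > 95.0:
        return (
            "Agent performing well",
            "Consider edge case testing with evalyn simulate, or export a report.",
        )
    if pass_rate >= 80.0:
        return (
            "Moderate issues",
            "Review failing items; consider evalyn-calibrate to verify the judges.",
        )
    return (
        "Significant issues",
        "Invoke evalyn-calibrate; fix the agent if the judges are correct.",
    )


if __name__ == "__main__":
    for rate in (97.0, 95.0, 80.0, 62.5):
        interpretation, action = recommend(rate)
        print(f"{rate:5.1f}% -> {interpretation}: {action}")
```

The lookup only classifies the number. Whether a low pass rate reflects the agent or the judges still has to be settled through calibration.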
Offer a shareable report:
evalyn export --run <run-id> --format html
Other formats: json, csv, markdown.
If calibration is recommended: "Invoke evalyn-calibrate to annotate results and align LLM judges with your expectations."
If the agent is performing well: "Your agent looks good. To strengthen confidence, run it on more varied inputs and re-evaluate, or use evalyn simulate --dataset <path> --modes similar,outlier to generate edge cases."