evalyn-eval
| name | evalyn-eval |
| description | Use when building evaluation datasets, selecting metrics, or running evaluations on an LLM agent project with evalyn |
Build a dataset from traces, auto-recommend metrics based on trace analysis, and run evaluation. This skill reads actual trace data to make metric recommendations rather than asking abstract questions.
evalyn list-calls --limit 5
If no traces: "You need to instrument your agent first. Invoke evalyn-setup."
ls data/*/dataset.jsonl 2>/dev/null
If a dataset already exists, skip to Step 2.
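The existence check can be wrapped in a small helper so a script can branch on it. This is a minimal sketch: the function name `find_dataset` is hypothetical (it is not an evalyn command), and it assumes the `data/<name>/dataset.jsonl` layout implied by the `ls` command above.

```shell
# Hypothetical helper (not part of evalyn): print the first existing
# dataset path under the given root, or return 1 if none is found.
# Assumes the data/<name>/dataset.jsonl layout used by the ls check.
find_dataset() {
  local root="${1:-data}"
  local f
  for f in "$root"/*/dataset.jsonl; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}
```

For example, `if DATASET=$(find_dataset); then echo "Using $DATASET"; fi` would skip the build step when a dataset is already present.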
Identify the project name from the evalyn list-calls output (project column).
evalyn build-dataset --project <project-name>
Capture the output path: the command prints "Wrote N items to ". Use this path for all subsequent commands.
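One way to capture the path in a script is to filter the command output. The sketch below assumes the path directly follows the text "Wrote N items to " on a single line of standard output; that format is inferred from the message quoted above and may differ in practice. `extract_dataset_path` is a hypothetical helper name.

```shell
# Hypothetical helper: read build-dataset output on stdin and print whatever
# follows "Wrote N items to ". The exact line format is an assumption.
extract_dataset_path() {
  sed -n 's/^Wrote [0-9][0-9]* items to \(.*\)$/\1/p' | tail -n 1
}

# Usage (not run here):
#   DATASET=$(evalyn build-dataset --project <project-name> | extract_dataset_path)
```

If the message is written to standard error instead, the pipe would need `2>&1` before it.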
Inspect a trace to understand the agent's behavior:
evalyn show-trace --last -v
Analyze the trace structure and recommend a bundle. Evalyn has 17 curated metric bundles; the table below maps common trace patterns to ten of them:
| Trace Pattern | Recommended Bundle |
|---|---|
| Multiple tool calls, planning steps | orchestrator |
| Tool calls + multi-turn context | multi-step-agent |
| URLs or citations in output | research-agent |
| RAG retrieval spans, source docs | rag-qa |
| Conversational, multi-turn | chatbot |
| Code blocks in output | code-assistant |
| Short summary outputs | summarization |
| Educational/tutorial content | tutor |
| Content generation, blog posts | content-writer |
| Customer-facing Q&A | customer-support |
To see all available bundles:
evalyn suggest-metrics --mode bundle --help
Apply the recommended bundle:
evalyn suggest-metrics --dataset <path> --mode bundle --bundle <recommended>
Then expand coverage with LLM-based selection from the full 130+ metric registry:
evalyn suggest-metrics --dataset <path> --mode llm-registry --append
This two-pass approach gives a solid base (curated bundle) plus tailored additions (LLM picks from full registry).
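The two passes can be chained in one wrapper. This is a sketch that uses only the flags shown above; `suggest_two_pass` is a hypothetical function name, not an evalyn command.

```shell
# Hypothetical wrapper around the two suggest-metrics passes.
suggest_two_pass() {
  local dataset="$1" bundle="$2"
  if [ -z "$dataset" ] || [ -z "$bundle" ]; then
    echo "usage: suggest_two_pass <dataset-path> <bundle>" >&2
    return 2
  fi
  # Pass 1: curated bundle as the base metric set.
  evalyn suggest-metrics --dataset "$dataset" --mode bundle --bundle "$bundle" || return 1
  # Pass 2: LLM-selected additions, appended to the base set (needs an API key).
  evalyn suggest-metrics --dataset "$dataset" --mode llm-registry --append
}
```

Stopping after a failed first pass keeps the append step from running against a missing base set.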
| Mode | What it does | Speed | API key needed |
|---|---|---|---|
| basic | Heuristic-based suggestion | Instant | No |
| bundle | Preset metric bundles (17 available) | Instant | No |
| llm-registry | LLM picks from 130+ built-in metrics | ~10s | Yes |
| llm-brainstorm | LLM generates custom metrics | ~10s | Yes |
Do NOT use modes like agent, rag, or classify - those do not exist.
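A script can guard against invalid modes before calling the CLI. The check below is a sketch; `is_valid_mode` is a hypothetical helper that accepts only the four mode names documented in this section.

```shell
# Hypothetical guard: succeed only for the four documented modes.
is_valid_mode() {
  case "$1" in
    basic|bundle|llm-registry|llm-brainstorm) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_valid_mode "$MODE" || { echo "unknown mode: $MODE" >&2; exit 1; }` rejects names such as agent or rag before they reach evalyn.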
evalyn run-eval --dataset <path>
This runs all metrics, generates results.json in eval_runs/, and prints a summary table. Note the run ID from the output.
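To find the newest results file after a run, a helper along these lines may be useful. It is a sketch that assumes results.json sits somewhere beneath eval_runs/; the naming of the run directories is not specified here, so the helper searches recursively. `latest_results` is a hypothetical name.

```shell
# Hypothetical helper: print the most recently modified results.json
# under the given root (default: eval_runs). Prints nothing if none exist.
latest_results() {
  local root="${1:-eval_runs}"
  # find collects every results.json; ls -t sorts them newest first.
  find "$root" -name results.json -type f -exec ls -t {} + 2>/dev/null | head -n 1
}
```

The printed path identifies the run directory, but the authoritative run ID is still the one shown in the run-eval output.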
Useful flags:
- --workers 8: increase parallel workers (default 4, max 16)
- --provider openai: use OpenAI instead of Gemini for LLM judges
- --provider ollama: use local Ollama models

"Evaluation complete. Invoke evalyn-analyze to dig into the results, identify failures, and get recommendations."
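When the --workers value comes from a variable, it can be clamped to the documented range first. The sketch below uses the default of 4 and the maximum of 16 stated for --workers; the lower bound of 1 and the helper name `clamp_workers` are assumptions.

```shell
# Hypothetical helper: clamp a requested worker count to the documented
# range (default 4, max 16). The minimum of 1 is an assumption.
clamp_workers() {
  local n="${1:-4}"
  case "$n" in
    ''|*[!0-9]*) n=4 ;;  # non-numeric input falls back to the default
  esac
  if [ "$n" -lt 1 ]; then n=1; fi
  if [ "$n" -gt 16 ]; then n=16; fi
  echo "$n"
}

# Usage (not run here):
#   evalyn run-eval --dataset <path> --workers "$(clamp_workers "$WORKERS")"
```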