Use when LLM judges need calibration, evaluation metrics seem misaligned with expectations, or annotation and judge tuning are needed
Use when building evaluation datasets, selecting metrics, or running evaluations on an LLM agent project with evalyn
Use to evaluate an LLM agent with evalyn. Orchestrates the full pipeline: install, instrument, trace, build dataset, suggest metrics, run eval, analyze, calibrate.
Use when setting up evalyn evaluation for an LLM agent project, instrumenting agent code, or adding the evalyn decorator
Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance