benchmark-models
Guides model and provider comparison on a shared eval suite. Use when comparing providers, selecting a default model, investigating model-specific regressions, or turning one suite into a reusable benchmark.
| name | benchmark-models |
| description | Guides model and provider comparison on a shared eval suite. Use when comparing providers, selecting a default model, investigating model-specific regressions, or turning one suite into a reusable benchmark. |
Verify that the compared models are available and that the suite is provider-agnostic. Inspect evals/llm-comparison.json, run-eval.mjs, providers/, and .github/workflows/eval.yml before running comparisons.
Use the same:
Only provider/model should change.
Example runs:
node run-eval.mjs --provider openai --model gpt-5.4-mini evals/llm-comparison.json
node run-eval.mjs --provider anthropic --model claude-haiku-4-5 evals/llm-comparison.json
node run-eval.mjs --provider google --model gemini-2.5-flash evals/llm-comparison.json
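One way to keep everything except provider and model fixed is to generate the runs from a single suite path. The following is a minimal sketch, not part of the skill itself: `buildRunArgs` and `targets` are hypothetical names, and the only flags used are `--provider` and `--model` as they appear in the example runs above.

```javascript
// Hypothetical target list; the provider/model pairs are copied from the example runs.
const targets = [
  { provider: "openai", model: "gpt-5.4-mini" },
  { provider: "anthropic", model: "claude-haiku-4-5" },
  { provider: "google", model: "gemini-2.5-flash" },
];

// Build one argument list per target. The suite path is shared by every run,
// so only the provider and model differ between them.
function buildRunArgs(suitePath, targetList) {
  return targetList.map(({ provider, model }) => [
    "run-eval.mjs",
    "--provider", provider,
    "--model", model,
    suitePath,
  ]);
}

const runs = buildRunArgs("evals/llm-comparison.json", targets);

// Print the commands; each argument list could instead be handed to a process runner.
for (const args of runs) {
  console.log(["node", ...args].join(" "));
}
```

Because the suite path is passed in once, a typo or a swapped file cannot affect one provider's run without affecting the others.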
Group failures by capability:
A benchmark is more useful when it explains where a model fails, not just whether it lost.
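Grouping failures by capability can be sketched as a small tally over per-case results. This is an illustration under stated assumptions: the result shape (`model`, `caseId`, `capability`, `passed`), the capability labels, and the model names are all hypothetical and may not match what run-eval.mjs actually writes.

```javascript
// Hypothetical per-case results; real traces may use different field names.
const sampleResults = [
  { model: "model-a", caseId: "c1", capability: "extraction", passed: true },
  { model: "model-a", caseId: "c2", capability: "reasoning", passed: false },
  { model: "model-a", caseId: "c3", capability: "reasoning", passed: false },
  { model: "model-b", caseId: "c1", capability: "extraction", passed: false },
  { model: "model-b", caseId: "c2", capability: "reasoning", passed: true },
  { model: "model-b", caseId: "c3", capability: "reasoning", passed: false },
];

// Count failures per capability, broken down by model.
function failuresByCapability(results) {
  const table = {};
  for (const r of results) {
    if (r.passed) continue; // only failures are tallied
    if (!table[r.capability]) table[r.capability] = {};
    table[r.capability][r.model] = (table[r.capability][r.model] || 0) + 1;
  }
  return table;
}

const failureTable = failuresByCapability(sampleResults);
console.log(JSON.stringify(failureTable, null, 2));
```

A table like this shows where each model loses, for example whether one model's failures cluster in a single capability while another's are spread out.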
Determine whether the suite is strong enough to distinguish models. If weak `contains` checks let every model pass, the benchmark is not sensitive enough to separate them. If formatting rules account for most failures, the benchmark may be too brittle.
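Both symptoms can be checked mechanically once per-case results are available. The sketch below assumes a hypothetical result shape (`model`, `caseId`, `checkType`, `passed`) and hypothetical check-type labels; it lists the cases every model passed and computes what share of all failures each check type accounts for.

```javascript
// Hypothetical results for two models on four cases.
const benchmarkResults = [
  { model: "model-a", caseId: "c1", checkType: "contains", passed: true },
  { model: "model-b", caseId: "c1", checkType: "contains", passed: true },
  { model: "model-a", caseId: "c2", checkType: "format", passed: false },
  { model: "model-b", caseId: "c2", checkType: "format", passed: false },
  { model: "model-a", caseId: "c3", checkType: "judge", passed: true },
  { model: "model-b", caseId: "c3", checkType: "judge", passed: false },
  { model: "model-a", caseId: "c4", checkType: "format", passed: false },
  { model: "model-b", caseId: "c4", checkType: "format", passed: true },
];

function suiteSensitivity(results) {
  const allPassedByCase = new Map(); // caseId -> true while every model has passed
  const failuresByType = {};
  let totalFailures = 0;

  for (const r of results) {
    const soFar = allPassedByCase.has(r.caseId) ? allPassedByCase.get(r.caseId) : true;
    allPassedByCase.set(r.caseId, soFar && r.passed);
    if (!r.passed) {
      failuresByType[r.checkType] = (failuresByType[r.checkType] || 0) + 1;
      totalFailures += 1;
    }
  }

  // Cases that every model passed cannot distinguish the models.
  const allPassCases = [...allPassedByCase]
    .filter(([, ok]) => ok)
    .map(([caseId]) => caseId);

  // Fraction of all failures attributable to each check type.
  const failureShare = {};
  for (const [type, count] of Object.entries(failuresByType)) {
    failureShare[type] = count / totalFailures;
  }
  return { allPassCases, failureShare };
}

const report = suiteSensitivity(benchmarkResults);
console.log(report);
```

In this made-up data, case `c1` separates nothing and formatting checks account for three of the four failures, which is the pattern the paragraph above warns about. What counts as "too many" of either is a judgment call for the suite owner.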
Inspect traces/ to answer:
evals/llm-comparison.json
run-eval.mjs
evaluators/index.mjs
providers/
.github/workflows/eval.yml
traces/