benchmark-models

Guides model and provider comparison on a shared eval suite. Use when comparing providers, selecting a default model, investigating model-specific regressions, or turning one suite into a reusable benchmark.

Last updated: 16 April 2026 at 10:00

Source: itseffi/ai-product-evals

SKILL.md

name: benchmark-models
description: Guides model and provider comparison on a shared eval suite. Use when comparing providers, selecting a default model, investigating model-specific regressions, or turning one suite into a reusable benchmark.

Benchmark Models

Overview

Hold the eval suite constant and vary only provider/model.
Use one shared suite first: evals/llm-comparison.json.
Compare pass rate and failure categories, not just aggregate score.
Read traces to understand whether differences are capability, formatting, or evaluator issues.
Keep quality findings separate from infrastructure and pricing metadata.

Prerequisites

Verify that the compared models are available and that the suite is provider-agnostic. Inspect evals/llm-comparison.json, run-eval.mjs, providers/, and .github/workflows/eval.yml before running comparisons.

Core Instructions

Keep The Benchmark Constant

Use the same:

prompts
assertions
evaluator mode
output format

Only provider/model should change.
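
One way to enforce this rule is to compare the settings of two runs before trusting their scores. The sketch below is illustrative only: the run-config shape and the field names (suite, evaluatorMode, outputFormat) and their values are assumptions, not something run-eval.mjs is known to expose.

```javascript
// Hypothetical run-config shape; field names and values are assumptions for illustration.
const VARIABLE_FIELDS = new Set(["provider", "model"]);

// Return the names of fields that differ between two runs but are not allowed to.
function unexpectedDifferences(runA, runB) {
  const keys = new Set([...Object.keys(runA), ...Object.keys(runB)]);
  const diffs = [];
  for (const key of keys) {
    if (VARIABLE_FIELDS.has(key)) continue;
    // JSON.stringify is a crude deep comparison; adequate for plain config values.
    if (JSON.stringify(runA[key]) !== JSON.stringify(runB[key])) diffs.push(key);
  }
  return diffs.sort();
}

const baselineRun = {
  provider: "openai",
  model: "gpt-5.4-mini",
  suite: "evals/llm-comparison.json",
  evaluatorMode: "strict",
  outputFormat: "json",
};
// This run accidentally changes the evaluator mode as well as the model.
const candidateRun = {
  ...baselineRun,
  provider: "anthropic",
  model: "claude-haiku-4-5",
  evaluatorMode: "lenient",
};

console.log(unexpectedDifferences(baselineRun, candidateRun)); // [ 'evaluatorMode' ]
```

An empty result means the two runs differ only in provider and model, so their scores are comparable.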

Example runs:

node run-eval.mjs --provider openai --model gpt-5.4-mini evals/llm-comparison.json
node run-eval.mjs --provider anthropic --model claude-haiku-4-5 evals/llm-comparison.json
node run-eval.mjs --provider google --model gemini-2.5-flash evals/llm-comparison.json
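
To keep the command line identical apart from provider and model, the runs above could be generated from a single list. This is a small sketch that only builds the command strings; the flags and paths are the ones shown above, while the MATRIX and buildCommands names are illustrative.

```javascript
// The suite path is fixed once so every run uses the same file.
const SUITE = "evals/llm-comparison.json";

// Only these two values vary between runs.
const MATRIX = [
  { provider: "openai", model: "gpt-5.4-mini" },
  { provider: "anthropic", model: "claude-haiku-4-5" },
  { provider: "google", model: "gemini-2.5-flash" },
];

// Build one command string per provider/model pair.
function buildCommands(matrix, suite) {
  return matrix.map(({ provider, model }) =>
    ["node", "run-eval.mjs", "--provider", provider, "--model", model, suite].join(" ")
  );
}

for (const command of buildCommands(MATRIX, SUITE)) {
  console.log(command);
}
```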

Compare By Failure Mode

Group failures by capability:

instruction following
factual recall
reasoning
code generation
formatting compliance

A benchmark is more useful when it explains where a model fails, not just whether it lost.
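
As a rough illustration, failures can be tallied per model and per capability rather than collapsed into one score. The record shape below (model, category, passed) is an assumption; the real output of run-eval.mjs may look different.

```javascript
// Hypothetical result records, one per (model, case). The shape is an assumption.
const caseResults = [
  { model: "model-a", category: "reasoning", passed: true },
  { model: "model-a", category: "formatting compliance", passed: false },
  { model: "model-b", category: "reasoning", passed: false },
  { model: "model-b", category: "formatting compliance", passed: true },
  { model: "model-b", category: "reasoning", passed: false },
];

// Count failures per model and capability category; passing cases are skipped.
function failuresByCategory(records) {
  const table = {};
  for (const { model, category, passed } of records) {
    if (passed) continue;
    table[model] = table[model] || {};
    table[model][category] = (table[model][category] || 0) + 1;
  }
  return table;
}

console.log(failuresByCategory(caseResults));
```

In this made-up data, model-a fails only on formatting while model-b fails twice on reasoning, a distinction a single pass rate would hide.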

Check Evaluator Sensitivity

Determine whether the suite is strong enough to distinguish models. If weak "contains" (substring) checks let every model pass, the benchmark is not sensitive enough to separate them. If formatting rules account for most failures, the benchmark may be too brittle.
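
Both checks can be approximated with two small helpers. This is a sketch under assumed data shapes: the per-case outcome map and the failure records are hypothetical, and what counts as "too few" discriminating cases or "too many" formatting failures is left to judgment.

```javascript
// Hypothetical per-case outcomes: caseId -> { modelName: passed }.
const outcomesByCase = {
  "case-1": { "model-a": true, "model-b": true },
  "case-2": { "model-a": true, "model-b": false },
  "case-3": { "model-a": true, "model-b": true },
};

// A case discriminates when at least one model passes and at least one fails.
function discriminatingCases(byCase) {
  return Object.entries(byCase)
    .filter(([, perModel]) => new Set(Object.values(perModel)).size > 1)
    .map(([caseId]) => caseId);
}

// Share of failures attributed to formatting; a high value hints at brittleness.
function formattingShare(failures) {
  if (failures.length === 0) return 0;
  const formatting = failures.filter(
    (failure) => failure.category === "formatting compliance"
  ).length;
  return formatting / failures.length;
}

console.log(discriminatingCases(outcomesByCase)); // [ 'case-2' ]
console.log(
  formattingShare([
    { category: "formatting compliance" },
    { category: "reasoning" },
  ])
); // 0.5
```

Here only one of three cases separates the two models, which suggests the suite needs harder or more specific assertions.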

Read Traces Before Ranking

Inspect traces/ to answer:

Did the lower-ranked model actually misunderstand the task?
Did it fail only on formatting?
Did the evaluator mis-score a good output?
Did the provider produce parsing differences that look like capability differences?
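
The questions above can be turned into a first-pass triage label for each trace, to decide which ones to read by hand. The trace fields used here (transportError, parsedOk, evaluatorPassed, humanLabel) are hypothetical and are not the documented format of the files in traces/.

```javascript
// Assign a coarse label to one trace. All field names are assumptions.
function triageTrace(trace) {
  // Provider or network problems say nothing about model quality.
  if (trace.transportError) return "infrastructure";
  // Output that could not be parsed may be a formatting or provider-parsing issue.
  if (!trace.parsedOk) return "parsing or formatting";
  // A reviewer judged the output good but the evaluator failed it.
  if (!trace.evaluatorPassed && trace.humanLabel === "good") return "evaluator mis-score";
  if (!trace.evaluatorPassed) return "possible capability gap";
  return "pass";
}

const sampleTraces = [
  { transportError: true },
  { parsedOk: false },
  { parsedOk: true, evaluatorPassed: false, humanLabel: "good" },
  { parsedOk: true, evaluatorPassed: false },
  { parsedOk: true, evaluatorPassed: true },
];

console.log(sampleTraces.map(triageTrace));
```

The labels are a starting point for reading traces, not a substitute for it.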

Repo Files To Inspect

evals/llm-comparison.json
run-eval.mjs
evaluators/index.mjs
providers/
.github/workflows/eval.yml
traces/

Anti-Patterns

Comparing different prompt sets across models.
Mixing benchmark edits and model changes in the same run.
Ranking by one average number without reading traces.
Treating provider transport failures as model quality failures.
Using cost as a primary benchmark outcome when the goal is quality.
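
To avoid counting transport failures as quality failures, the pass rate can be computed over scored cases only, with infrastructure errors reported separately. A minimal sketch, assuming each record carries a hypothetical status of "pass", "fail", or "transport_error":

```javascript
// Compute pass rate over cases that actually produced a model output.
// The `status` field and its values are assumptions for illustration.
function qualityPassRate(runRecords) {
  const scored = runRecords.filter((record) => record.status !== "transport_error");
  const passed = scored.filter((record) => record.status === "pass").length;
  return {
    // null signals that no case could be scored at all.
    passRate: scored.length === 0 ? null : passed / scored.length,
    scored: scored.length,
    transportErrors: runRecords.length - scored.length,
  };
}

const sampleRun = [
  { status: "pass" },
  { status: "pass" },
  { status: "fail" },
  { status: "transport_error" },
];

// Two of three scored cases pass; the transport error is reported on its own.
console.log(qualityPassRate(sampleRun));
```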
