
error-analysis

Guides systematic analysis of eval failures using traces. Use when a suite is failing, model outputs seem inconsistent, evaluator behavior is suspect, or you need to classify failures before changing prompts, metrics, or datasets.


Stars: 3

Forks: 0

Last updated: April 29, 2026 at 14:29

Source: itseffi/ai-product-evals



SKILL.md


name: error-analysis
description: Guides systematic analysis of eval failures using traces. Use when a suite is failing, model outputs seem inconsistent, evaluator behavior is suspect, or you need to classify failures before changing prompts, metrics, or datasets.

Error Analysis

Overview

Read traces before changing metrics, prompts, or model choices.
Group failures into stable categories rather than looking at them one by one.
Separate model failures, evaluator failures, provider failures, and dataset problems.
Quantify the dominant failure modes before proposing fixes.
Fix the highest-volume or highest-severity class first.
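One way to make "highest-volume or highest-severity" concrete is to score each failure class by count times a severity weight. The sketch below is only an illustration: the `SEVERITY` weights, the `rankFailureClasses` helper, and the example counts are assumptions, not part of the repository.

```javascript
// Hypothetical severity weights; tune these for your own product.
const SEVERITY = { low: 1, medium: 3, high: 10 };

// classes: [{ name, count, severity }]
// Returns a new array sorted by count * severity weight, highest first.
function rankFailureClasses(classes) {
  return classes
    .map((c) => ({ ...c, score: c.count * (SEVERITY[c.severity] ?? 1) }))
    .sort((a, b) => b.score - a.score);
}

// Made-up counts, for illustration only.
const ranked = rankFailureClasses([
  { name: "formatting-only failure", count: 12, severity: "low" },
  { name: "factual error", count: 4, severity: "high" },
  { name: "provider/auth or transport error", count: 5, severity: "medium" },
]);

console.log(ranked.map((c) => `${c.name}: ${c.score}`));
```

With these weights a small number of high-severity factual errors outranks a larger number of formatting-only failures; a team that cares mostly about volume could set every weight to 1.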

Prerequisites

Run the relevant eval suite and collect traces. Inspect traces/, run-eval.mjs, tracer.mjs, and evaluators/index.mjs before proposing changes. If traces do not exist yet, run the smallest relevant suite first.

Core Instructions

Check Human-Labeled Pivots

Run:

npm run skill:error-analysis

If labels exist under labels/, the script includes pivots by failure_mode, feature, scenario, persona, and suite. Use those pivots to prioritize fixes before changing prompts or eval metrics.
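To show what a pivot over labels amounts to, here is a minimal sketch that counts label records by each of the five fields named above. It is not the repository's script: the `buildPivots` helper and the flat record shape are assumptions, and the real label schema is described in docs/schemas/labels.md.

```javascript
// Pivot fields named in the skill text.
const PIVOT_FIELDS = ["failure_mode", "feature", "scenario", "persona", "suite"];

// labels: array of plain objects (assumed shape, one object per labeled trace).
// Returns { field: { value: count } } for every pivot field.
function buildPivots(labels, fields = PIVOT_FIELDS) {
  const pivots = {};
  for (const field of fields) {
    const counts = {};
    for (const label of labels) {
      // Records missing a field are counted explicitly rather than dropped.
      const key = label[field] ?? "(unlabeled)";
      counts[key] = (counts[key] ?? 0) + 1;
    }
    pivots[field] = counts;
  }
  return pivots;
}

// Made-up labels, for illustration only.
const exampleLabels = [
  { failure_mode: "factual error", feature: "search", suite: "rag" },
  { failure_mode: "factual error", feature: "summary", suite: "rag" },
  { failure_mode: "formatting-only failure", feature: "search", suite: "smoke" },
];

const pivots = buildPivots(exampleLabels);
console.log(pivots.failure_mode);
```

Counting missing values under "(unlabeled)" keeps gaps in the labeling visible instead of silently shrinking the totals.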

Start With Real Outputs

For the failing suite, inspect:

the prompt
the system prompt
the raw model response
the parsed evaluator result
the reported reason for failure

Do not infer a failure category without reading the actual trace.
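A small helper can enforce that all five items are in front of you before you assign a category. The sketch below assumes a hypothetical trace shape; the field names (`prompt`, `systemPrompt`, `rawResponse`, `evaluatorResult`, `failureReason`) are illustrative, so check tracer.mjs for the names the repository actually writes.

```javascript
// Collects the five items to review and reports which ones the trace lacks.
// The field names are assumptions, not the repository's trace schema.
function summarizeTrace(trace) {
  const review = {
    prompt: trace.prompt,
    systemPrompt: trace.systemPrompt,
    rawResponse: trace.rawResponse,
    evaluatorResult: trace.evaluatorResult,
    failureReason: trace.failureReason,
  };
  const missing = Object.keys(review).filter(
    (key) => review[key] === undefined || review[key] === null
  );
  return { ...review, missing };
}

// Made-up trace, for illustration only.
const view = summarizeTrace({
  prompt: "What is the capital of France?",
  systemPrompt: "Answer in one word.",
  rawResponse: "The capital of France is Paris.",
  evaluatorResult: { pass: false },
});

console.log(view.missing);
```

A non-empty `missing` list is a signal to go back to the trace source before classifying, since a category assigned without the failure reason or raw response is a guess.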

Classify Failures Into Buckets

Use categories such as:

provider/auth or transport error
invalid model or request format
no response or truncated response
instruction-following failure
factual error
reasoning error
formatting-only failure
false positive from a weak evaluator
evaluator false negative
bad expected answer or bad dataset label

If a failure does not fit an existing bucket, add a new one rather than forcing it into the wrong class.
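The rule about adding buckets can be made mechanical: keep the bucket list in one place, reject names that are not in it, and require an explicit step to extend it. This is a minimal sketch; `buckets`, `addBucket`, `assignBucket`, and the "tool-call failure" example are hypothetical names, not repository code.

```javascript
// Starting categories, taken from the list above.
const buckets = new Set([
  "provider/auth or transport error",
  "invalid model or request format",
  "no response or truncated response",
  "instruction-following failure",
  "factual error",
  "reasoning error",
  "formatting-only failure",
  "weak evaluator false positive",
  "evaluator false negative",
  "bad expected answer or bad dataset label",
]);

// Adding a bucket is a deliberate, visible step.
function addBucket(name) {
  buckets.add(name);
}

// Refuses unknown bucket names instead of silently accepting them.
function assignBucket(failure, bucket) {
  if (!buckets.has(bucket)) {
    throw new Error(`Unknown bucket "${bucket}"; add it with addBucket() first`);
  }
  return { ...failure, bucket };
}

addBucket("tool-call failure"); // hypothetical new category
const labeled = assignBucket({ id: "case-17" }, "tool-call failure");
console.log(labeled);
```

Throwing on unknown names catches typos and vague ad hoc labels early, while `addBucket` keeps the list open to genuinely new failure classes.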

Quantify Failure Modes

Count how many failures land in each category. A good fix targets the dominant class instead of optimizing for one memorable example.
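Counting is simple enough to sketch directly. The example assumes each classified failure is an object with a `bucket` field; that shape and the `countByBucket` name are illustrative, not taken from the repository.

```javascript
// failures: [{ bucket: string, ... }]
// Returns [{ bucket, count, share }] sorted by count, largest first.
function countByBucket(failures) {
  const counts = new Map();
  for (const failure of failures) {
    counts.set(failure.bucket, (counts.get(failure.bucket) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([bucket, count]) => ({ bucket, count, share: count / failures.length }))
    .sort((a, b) => b.count - a.count);
}

// Made-up classified failures, for illustration only.
const tally = countByBucket([
  { id: 1, bucket: "instruction-following failure" },
  { id: 2, bucket: "instruction-following failure" },
  { id: 3, bucket: "factual error" },
  { id: 4, bucket: "instruction-following failure" },
  { id: 5, bucket: "evaluator false negative" },
]);

console.log(tally[0]);
```

In this made-up run, instruction-following failures account for three of five failures, so that class is the first candidate for a fix even if the single factual error was the most memorable case.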

Separate Root Cause From Surface Symptom

Examples:

A regex mismatch may actually be an instruction-following failure.
A failed contains check may actually be a bad assertion.
A hallucination may actually be missing retrieval context.
A bad answer may actually be a provider-side parser issue.
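As an illustration of the second example, a failed contains check can be re-tested after normalizing case and whitespace. If it then passes, the symptom points at the assertion or at formatting rather than at answer quality. The `diagnoseContainsFailure` helper and its return labels are hypothetical, and the heuristic is only a first hint, not a replacement for reading the trace.

```javascript
// Lowercases, collapses runs of whitespace, and trims.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns "pass", "surface-only", or "absent" for a contains-style check.
function diagnoseContainsFailure(response, expected) {
  if (response.includes(expected)) return "pass";
  if (normalize(response).includes(normalize(expected))) {
    // Only case or whitespace differs: suspect the assertion or formatting.
    return "surface-only";
  }
  // The expected text is really missing: read the trace for the root cause.
  return "absent";
}

console.log(diagnoseContainsFailure("The answer is  PARIS.", "Paris"));
console.log(diagnoseContainsFailure("The answer is Lyon.", "Paris"));
```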

Repo Files To Inspect

traces/
tracer.mjs
run-eval.mjs
evaluators/index.mjs
evals/
app.html
labels/
docs/schemas/labels.md

Anti-Patterns

Changing the suite before reading traces.
Fixing one anecdotal example instead of the dominant failure class.
Treating evaluator mistakes as model mistakes.
Treating infrastructure errors as quality regressions.
Collapsing multiple root causes into one vague bucket like “bad output.”
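To avoid the infrastructure anti-pattern in particular, it can help to report infrastructure errors separately and compute the pass rate only over cases that actually produced a scorable output. The sketch below is an assumption-laden illustration: the result shape (`pass`, `bucket`), the `summarizeRun` helper, and the choice of which buckets count as infrastructure are all hypothetical.

```javascript
// Buckets treated as infrastructure here; this grouping is an assumption.
const INFRA_BUCKETS = new Set([
  "provider/auth or transport error",
  "invalid model or request format",
]);

// results: [{ pass: boolean, bucket?: string }]
// Infrastructure failures are counted but excluded from the quality pass rate.
function summarizeRun(results) {
  const isInfra = (r) => !r.pass && INFRA_BUCKETS.has(r.bucket);
  const infraErrors = results.filter(isInfra).length;
  const scored = results.filter((r) => !isInfra(r));
  const passed = scored.filter((r) => r.pass).length;
  return {
    infraErrors,
    scored: scored.length,
    qualityPassRate: scored.length > 0 ? passed / scored.length : null,
  };
}

// Made-up results, for illustration only.
const summary = summarizeRun([
  { pass: true },
  { pass: true },
  { pass: true },
  { pass: false, bucket: "factual error" },
  { pass: false, bucket: "provider/auth or transport error" },
]);

console.log(summary);
```

Here the raw pass rate would be 3 of 5, but only one failure reflects output quality; the transport error is reported on its own so it is not read as a regression.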
