
$pwd:

eval-audit

Name: Eval Audit
Author: hamelsmu

// Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).


$ git log --oneline --stat

stars:1,333

forks:138

updated:March 3, 2026 at 02:53

SKILL.md

readonly

related-skills.json

same repository

build-review-interface.md

from "hamelsmu/evals-skills"

Build a custom browser-based annotation interface tailored to your data for reviewing LLM traces and collecting structured feedback. Use when you need to build an annotation tool, review traces, or collect human labels.

2026-03-03 · 1.3k

error-analysis.md

from "hamelsmu/evals-skills"

Help the user systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.

2026-03-03 · 1.3k

evaluate-rag.md

from "hamelsmu/evals-skills"

Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.

2026-03-03 · 1.3k

generate-synthetic-data.md

from "hamelsmu/evals-skills"

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.

2026-03-03 · 1.3k

validate-evaluator.md

from "hamelsmu/evals-skills"

Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).

2026-03-03 · 1.3k

write-judge-prompt.md

from "hamelsmu/evals-skills"

Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness). Do NOT use when the failure mode can be checked with code (regex, schema validation, execution tests). Do NOT use when you need to validate or calibrate the judge — use validate-evaluator instead.

2026-03-03 · 1.3k

package.json

"author": "hamelsmu"

"repository": "hamelsmu/evals-skills"


name	eval-audit
description	Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).

Eval Audit

Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.

Overview

Gather eval artifacts: traces, evaluator configs, judge prompts, labeled data, metrics dashboards
Run diagnostic checks across six areas
Produce a findings report ordered by impact, with each finding linking to a fix

Prerequisites

Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."

Connecting to Eval Infrastructure

Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight, or similar). If one is available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.

Diagnostic Checks

Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.

Prioritize findings by impact on the user's product. Present the most impactful findings first.

1. Error Analysis

Check: Has the user done systematic error analysis on real or synthetic traces?

Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but no documented failure categories, error analysis was likely skipped.

Finding if missing: Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with error-analysis, or generate-synthetic-data first if no traces exist.

See: Your AI Product Needs Evals, LLM Evals FAQ

Check: Were failure categories brainstormed or observed?

Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.

Finding if brainstormed: Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with error-analysis, starting from traces.

See: Who Validates the Validators?

2. Evaluator Design

Check: Are evaluators binary pass/fail?

Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.

Finding if not binary: Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using write-judge-prompt.
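One way to run this check on exported evaluator results is to look at the distinct values each evaluator emits. The sketch below is illustrative only: `is_binary_evaluator` and the accepted label set are hypothetical, and real exports may encode verdicts differently.

```python
from typing import Iterable

# Assumption: these spellings count as pass/fail verdicts in the export.
BINARY_LABELS = {"pass", "fail", "true", "false", "0", "1"}


def is_binary_evaluator(outputs: Iterable[object]) -> bool:
    """Return True if the observed outputs look like a two-valued verdict."""
    seen = {str(o).strip().lower() for o in outputs}
    # Empty output, more than two distinct values, or unknown labels
    # all mean the evaluator should be flagged for manual inspection.
    return bool(seen) and len(seen) <= 2 and seen <= BINARY_LABELS


if __name__ == "__main__":
    print(is_binary_evaluator(["Pass", "Fail", "pass"]))  # binary verdicts
    print(is_binary_evaluator([1, 2, 3, 4, 5]))           # Likert scale
```

A flagged evaluator still needs a human look: a numeric score with a documented pass/fail threshold may be acceptable.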

See: Creating an LLM Judge That Drives Business Results

Check: Do LLM judge prompts target specific failure modes?

Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").

Finding if vague: Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use write-judge-prompt.

Check: Are code-based checks used where possible?

Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.

Finding if over-relying on judges: Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
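As an illustration of what a code-based check can look like, the sketch below validates format, required keys, and a forbidden pattern without any LLM call. The schema (`answer`, `sources`) and the phone-number rule are hypothetical stand-ins for whatever the user's pipeline actually requires.

```python
import json
import re

# Assumptions for illustration: the pipeline must return a JSON object
# with these keys and must never include a US-style phone number.
REQUIRED_KEYS = {"answer", "sources"}
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def check_output(raw: str) -> dict[str, bool]:
    """Run deterministic pass/fail checks on one raw model output."""
    results = {
        "valid_json": False,
        "has_required_keys": False,
        "no_phone_number": PHONE_RE.search(raw) is None,
    }
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return results
    results["valid_json"] = True
    results["has_required_keys"] = (
        isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)
    )
    return results


if __name__ == "__main__":
    print(check_output('{"answer": "42", "sources": []}'))
    print(check_output("Sure! Here is your answer."))
```

Each key in the result is its own binary evaluator, so a failure points at one specific problem.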

Check: Are similarity metrics used as primary evaluation?

Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.

Finding if present: These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
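To make the surface-overlap problem concrete, the sketch below computes a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation, and the two sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style F1 over whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the refund was approved"
candidate = "the refund was not approved"  # opposite meaning
score = unigram_f1(reference, candidate)

if __name__ == "__main__":
    print(f"unigram F1 = {score:.2f}")
```

The candidate contradicts the reference yet scores about 0.89, because four of its five tokens match. A binary evaluator for the specific failure ("states the wrong refund decision") would catch it.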

See: LLM Evals FAQ

3. Judge Validation

Check: Are LLM judges validated against human labels?

Look for: confusion matrices, TPR/TNR measurements, alignment scores. Judges running in production with no validation data are a critical finding.

Finding if unvalidated: An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use validate-evaluator.

See: Creating an LLM Judge That Drives Business Results

Check: Is alignment measured with TPR/TNR or with raw accuracy?

Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.

Finding if using accuracy: With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use validate-evaluator.
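The always-"Pass" example can be worked through in a few lines. The sketch below assumes "pass" is the positive class and that both classes appear in the human labels; the function names are hypothetical. The correction shown is the standard Rogan-Gladen estimator, offered as one plausible form of the bias correction mentioned above, not necessarily the exact procedure in validate-evaluator.

```python
def judge_rates(human: list[str], judge: list[str]) -> dict[str, float]:
    """Accuracy, TPR, and TNR of judge labels against human labels.

    Assumes "pass" is the positive class and both classes are present.
    """
    pairs = list(zip(human, judge))
    tp = sum(h == "pass" and j == "pass" for h, j in pairs)
    tn = sum(h == "fail" and j == "fail" for h, j in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / human.count("pass"),
        "tnr": tn / human.count("fail"),
    }


def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from the judge's observed pass rate."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed + tnr - 1) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid rate


human = ["pass"] * 90 + ["fail"] * 10
always_pass = ["pass"] * 100
rates = judge_rates(human, always_pass)

if __name__ == "__main__":
    print(rates)  # accuracy 0.9, TPR 1.0, TNR 0.0
    print(corrected_pass_rate(observed=0.75, tpr=0.9, tnr=0.8))
```

The always-"Pass" judge has TPR + TNR = 1, so its output carries no information and the correction is undefined, which accuracy alone does not reveal.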

Check: Is there a proper train/dev/test split?

Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.

Finding if leaking: Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use validate-evaluator.
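A leakage check reduces to a set intersection over trace IDs. The sketch below assumes each labeled trace has a stable ID; the split fractions and function names are illustrative choices, not values prescribed by this skill.

```python
import random


def split_labeled(ids, seed=0, train_frac=0.2, dev_frac=0.4):
    """Shuffle labeled trace IDs into disjoint train/dev/test splits.

    The fractions are assumptions; test receives whatever remains.
    """
    shuffled = sorted(ids)  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    return {
        "train": shuffled[:n_train],                # few-shot source
        "dev": shuffled[n_train:n_train + n_dev],   # prompt iteration
        "test": shuffled[n_train + n_dev:],         # final measurement
    }


def leaked_ids(few_shot_ids, eval_ids) -> set:
    """IDs used both as few-shot examples and for measuring the judge."""
    return set(few_shot_ids) & set(eval_ids)


if __name__ == "__main__":
    splits = split_labeled([f"t{i}" for i in range(100)])
    print({name: len(members) for name, members in splits.items()})
    print(leaked_ids(["t1", "t7"], ["t7", "t9"]))  # non-empty means leakage
```

Any non-empty result from `leaked_ids` against the dev or test split is worth reporting as a finding.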

4. Human Review Process

Check: Who is reviewing traces?

Determine whether domain experts or outsourced annotators are labeling data.

Finding if outsourced without domain expertise: General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.

See: A Field Guide to Improving AI Products

Check: Are reviewers seeing full traces or just final outputs?

Finding if output-only: Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.

Check: How is data displayed to reviewers?

Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.

Finding if raw format: Reviewers spend effort parsing data instead of judging quality. Display each piece of data in its natural representation: render markdown, syntax-highlight code, display tables as tables. Use build-review-interface.

See: LLM Evals FAQ

5. Labeled Data

Check: Is there enough labeled data?

For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR. If labeled data is sparse, collect more by sampling traces more effectively:

Random: Always include a random sample alongside other strategies to discover unknown issues.
Clustering: Group traces by semantic similarity and review representatives from each cluster.
Data analysis: Analyze statistics on latency, turns, tool calls, and tokens for outliers.
Classification: Use existing evals, a predictive model, or an LLM to surface problematic traces. Use with caution.
Feedback: Use explicit customer feedback (complaints, thumbs-down signals) to filter traces.
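Three of these strategies (random, data analysis, feedback) can be combined in a short script. The sketch below assumes each trace is a dict with hypothetical `id`, `latency_s`, and optional `thumbs_down` fields; the sample size and z-score cutoff are arbitrary starting points.

```python
import random
import statistics


def sample_for_review(traces, n_random=20, z_cutoff=2.0, seed=0):
    """Pick traces to label; returns {trace_id: reason}."""
    rng = random.Random(seed)
    k = min(n_random, len(traces))
    # Random sample first, so unknown issues can still surface.
    picked = {t["id"]: "random" for t in rng.sample(traces, k)}

    latencies = [t["latency_s"] for t in traces]
    mean = statistics.mean(latencies)
    sd = statistics.pstdev(latencies)
    for t in traces:
        # Data analysis: flag latency outliers by z-score.
        if sd > 0 and abs(t["latency_s"] - mean) / sd > z_cutoff:
            picked.setdefault(t["id"], "latency_outlier")
        # Feedback: include anything a customer explicitly disliked.
        if t.get("thumbs_down"):
            picked.setdefault(t["id"], "negative_feedback")
    return picked


if __name__ == "__main__":
    traces = [{"id": f"t{i}", "latency_s": 1.0} for i in range(50)]
    traces.append({"id": "slow", "latency_s": 30.0})
    traces.append({"id": "bad", "latency_s": 1.0, "thumbs_down": True})
    print(sample_for_review(traces, n_random=5))
```

Recording the reason each trace was picked keeps the random sample distinguishable from the targeted ones when failure rates are computed later.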

Finding if insufficient: Small datasets produce unreliable failure rates and wide confidence intervals. Use the sampling strategies above to collect more labeled data, or supplement with generate-synthetic-data.
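The effect of sample size on a failure-rate estimate can be shown with a Wilson score interval. This is a minimal sketch using the standard formula at roughly 95% confidence (z = 1.96); the counts are invented for illustration.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate of `failures` out of `n`."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Same observed 20% failure rate, different amounts of labeled data.
    for failures, n in [(2, 10), (20, 100)]:
        low, high = wilson_interval(failures, n)
        print(f"{failures}/{n}: {low:.2f} to {high:.2f}")
```

With 10 labeled traces the interval spans roughly 6% to 51%; with 100 it narrows substantially around the same 20% point estimate.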

6. Pipeline Hygiene

Check: Is error analysis re-run after significant changes?

Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.

Finding if stale: Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.
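If the user keeps a changelog, the staleness check is a date comparison. The sketch below assumes a hypothetical mapping from change descriptions to dates; where those dates come from (git history, release notes, incident logs) depends on the project.

```python
from datetime import date


def stale_changes(last_error_analysis: date, changes: dict[str, date]) -> list[str]:
    """Return pipeline changes made after the last error analysis."""
    return sorted(
        name for name, when in changes.items() if when > last_error_analysis
    )


if __name__ == "__main__":
    # Hypothetical change log for illustration.
    changes = {
        "prompt rewrite": date(2025, 11, 2),
        "model switch": date(2026, 1, 15),
    }
    print(stale_changes(date(2025, 12, 1), changes))
```

A non-empty result means error analysis predates at least one significant change and should be re-run.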

Check: Are evaluators maintained?

Look for periodic re-validation of judges or refreshed evaluation datasets.

Finding if set-and-forget: Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.

No Eval Infrastructure

If the user has no eval artifacts (no traces, no evaluators, no labeled data):

Start with error-analysis on a sample of real traces.
If no production data exists, use generate-synthetic-data to create test inputs, run them through the pipeline, then apply error-analysis to the resulting traces.
Do not recommend building evaluators, judges, or dashboards before completing error analysis.

Report Format

Present findings ordered by impact. For each:

### [Problem Title]
**Status:** [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
**Fix:** [Concrete action, referencing a skill or article]

Group under the six diagnostic areas. Omit areas where no problems were found.
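A sketch of how findings could be collected and rendered in this format follows. The `Finding` fields, the integer `impact` score, the `##` area heading level, and the choice to drop "OK" findings are all assumptions layered on top of the template above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    area: str         # one of the six diagnostic areas
    title: str
    status: str       # "Problem exists", "OK", or "Cannot determine"
    explanation: str
    fix: str
    impact: int       # hypothetical score; higher means more impactful


def render_report(findings: list[Finding]) -> str:
    """Render findings grouped by area, most impactful first."""
    reportable = [f for f in findings if f.status != "OK"]
    by_area: dict[str, list[Finding]] = {}
    # Sorting first means areas appear in order of their top finding.
    for f in sorted(reportable, key=lambda f: -f.impact):
        by_area.setdefault(f.area, []).append(f)

    lines: list[str] = []
    for area, items in by_area.items():
        lines.append(f"## {area}")
        for f in items:
            lines += [
                f"### {f.title}",
                f"**Status:** {f.status}",
                f.explanation,
                f"**Fix:** {f.fix}",
                "",
            ]
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_report([
        Finding("Judge Validation", "Unvalidated judge", "Problem exists",
                "The tone judge has no human-labeled test set.",
                "Use validate-evaluator.", impact=9),
        Finding("Labeled Data", "Enough labels", "OK", "", "", impact=1),
    ]))
```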

Anti-Patterns

Running the audit as a checklist without inspecting actual artifacts.
Reporting generic advice disconnected from what was found in the user's pipeline.
Recommending evaluators before error analysis is complete.
Suggesting LLM judges for failures that code-based checks can handle.
Treating this audit as a one-time event. Re-audit after significant pipeline changes.
