rag-evaluator

Updated May 25, 2026 at 17:23

Source: kumaran-is/claude-code-onboarding

name	rag-evaluator
description	Use when designing, building, or operating RAG evaluation — golden sets, retrieval metrics (Recall@K, NDCG, MRR), answer metrics (faithfulness, citation accuracy), LLM-as-judge setup, CI gates for RAG, eval at scale, drift detection. Triggers on phrases like "evaluate RAG", "golden set", "RAG metrics", "RAGAS", "faithfulness evaluation", "regression test RAG", "RAG CI".

RAG Evaluation

Evaluation is what separates a RAG demo from a RAG product. Build evaluation before tuning the system, not after.

Three levels of evaluation

Level	What it measures
Retrieval	Did we fetch the right evidence?
Answer	Did the model generate a correct, grounded response?
Operational	Latency, cost, abstention rate, drift

You need all three. Skipping retrieval evaluation is the most common mistake — teams measure answers and miss that retrieval was the actual problem.

v1 vs v2 evaluation maturity

Phase	Approach
v1 (launch)	50–200 hand-labeled golden queries; retrieval metrics (Recall@K, NDCG); human spot checks; simple faithfulness rubric
v2 (scale)	LLM-as-judge calibrated against human labels; production-trace eval; synthetic eval generation; drift detection

Critical: do not start with LLM-as-judge. Build the hand-labeled foundation first; an uncalibrated judge gives confident garbage.

Retrieval metrics

Metric	Measures	Use when
Recall@K	Fraction of all relevant docs that appear in the top-K	High-stakes / compliance
Precision@K	Fraction of top-K that are relevant	Reducing noise
Context Precision	Are relevant chunks ranked near the top of retrieved set?	As soon as you add a reranker
MRR	Mean over queries of 1 / rank of the first relevant result	Q&A where first hit matters
NDCG@K	Ranking quality with graded relevance	Search-style ordering
Hit@K	Any relevant doc in top-K?	Coarse baseline

Context Precision vs Precision@K: Precision@K asks "how many of the top-K are relevant?" — binary. Context Precision asks "are the relevant chunks ranked near position 1 or buried at position K?" A reranker can leave Precision@K unchanged while dramatically improving Context Precision. Add this metric as soon as you introduce a reranker — it's the clearest signal of whether the reranker is earning its keep.

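Formula: Context Precision@K = Σ_{i=1..K} (Precision@i × relevance_i) / total_relevant, where relevance_i is 1 if the chunk at position i is relevant and 0 otherwise, Precision@i is the precision over positions 1..i, and total_relevant is the number of relevant chunks in the top K (the normalization RAGAS uses).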
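
The metrics in the table above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, `rels` is assumed to be a ranked list of 0/1 relevance labels for one query, Context Precision is normalised by the relevant chunks found in the top K (one common reading of total_relevant), and NDCG takes its ideal ordering from the retrieved gains only.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant docs for the query that appear in the top-k."""
    return sum(rels[:k]) / total_relevant if total_relevant else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant result; average over queries to get MRR."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def context_precision_at_k(rels, k):
    """Sum of Precision@i at each relevant position i, divided by relevant hits in the top-k."""
    hits, score = 0, 0.0
    for i, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # Precision@i at a relevant position
    return score / hits if hits else 0.0

def ndcg_at_k(gains, k):
    """NDCG with graded gains; assumption: ideal order = retrieved gains sorted descending."""
    def dcg(gs):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gs, start=1))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0

# Same two relevant chunks, before and after a hypothetical reranker.
before = [0, 0, 1, 0, 1]  # relevant chunks buried at positions 3 and 5
after = [1, 1, 0, 0, 0]   # relevant chunks moved to positions 1 and 2

print(precision_at_k(before, 5), precision_at_k(after, 5))
print(context_precision_at_k(before, 5), context_precision_at_k(after, 5))
```

In this toy ranking Precision@5 is identical before and after, while Context Precision rises from about 0.37 to 1.0, which is the reranker effect described above.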

Eval Breakdown by Dimension

When Recall@K or another metric regresses — especially after a corpus growth event — never treat the aggregate metric as the diagnosis. Break it down first. The aggregate may hide a regression in one segment while other segments stay healthy.

Dimensions to slice by before touching the pipeline:

Dimension	How to slice	What a drop isolates
Document type	policy vs. ticket vs. manual vs. table	Chunking or parsing problem specific to one format
Recency	fresh docs (< 30 days) vs. stale docs (> 90 days)	Freshness/version filtering issue; stale docs flooding top-K
Query style	keyword-heavy vs. semantic / conceptual	Embedding model weakness on one query style; BM25 not covering the other
Query length	short (≤ 5 tokens) vs. long (≥ 15 tokens)	Short queries may need expansion; long queries may need decomposition
Corpus segment	department, product line, access tier	Index crowding in one segment; wrong routing; missing sub-index
Near-duplicate rate	queries where top-3 results are near-identical	Corpus has duplicate/stale documents polluting top-K

Rule: if one dimension shows a sharp drop while others are flat, the root cause is in that segment — fix that segment before changing the global pipeline. See rag-operations-guide.md §9 for corpus hygiene fixes when duplicates or stale docs are the culprit.
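
A rough sketch of the slicing step, assuming each eval row is a dict carrying a per-query `recall_at_10` score plus metadata fields such as `doc_type`. The field names, the helper names, and the 0.05 tolerance are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def recall_by_dimension(rows, dimension):
    """Mean per-query recall grouped by the value of one metadata field."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[dimension]].append(row["recall_at_10"])
    return {value: sum(scores) / len(scores) for value, scores in buckets.items()}

def regressed_segments(baseline_rows, current_rows, dimension, tolerance=0.05):
    """Segments whose mean recall dropped by more than `tolerance` versus baseline."""
    base = recall_by_dimension(baseline_rows, dimension)
    curr = recall_by_dimension(current_rows, dimension)
    return {
        seg: (base[seg], curr[seg])
        for seg in base
        if seg in curr and base[seg] - curr[seg] > tolerance
    }

# Toy data: the aggregate drops, but only the "table" segment actually regressed.
baseline_rows = [
    {"doc_type": "policy", "recall_at_10": 0.9},
    {"doc_type": "policy", "recall_at_10": 0.9},
    {"doc_type": "table", "recall_at_10": 0.8},
    {"doc_type": "table", "recall_at_10": 0.8},
]
current_rows = [
    {"doc_type": "policy", "recall_at_10": 0.9},
    {"doc_type": "policy", "recall_at_10": 0.9},
    {"doc_type": "table", "recall_at_10": 0.4},
    {"doc_type": "table", "recall_at_10": 0.6},
]
regressed = regressed_segments(baseline_rows, current_rows, "doc_type")
print(regressed)
```

Running the same function once per dimension (document type, recency bucket, query style, and so on) shows whether the drop is concentrated in one segment before any pipeline change is made.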

Answer metrics

Metric	Measures
Faithfulness	Every claim supported by retrieved context?
Answer relevance	Does the answer address the question?
Context relevance	Was retrieved context actually useful?
Citation accuracy	Do citations support the claims?
Citation quality (≠ presence)	Does the cited chunk contain the specific assertion?
Hallucination rate	Frequency of unsupported assertions
Abstention quality	Refusals when refusal was correct
False Abstention Rate	How often the system refused when retrieved context actually supported an answer
False Answer Rate	How often the system answered when it should have abstained (the dangerous failure)
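
The two abstention error rates can be computed directly from labeled eval records. The sketch below assumes each record has a golden `expected_abstention` label and an observed `abstained` flag (both field names are assumptions), and it uses answerable and unanswerable query counts as the denominators, which is one reasonable choice among several.

```python
def abstention_rates(records):
    """False abstention and false answer rates from labeled eval records."""
    answerable = [r for r in records if not r["expected_abstention"]]
    unanswerable = [r for r in records if r["expected_abstention"]]
    # Refused although an answer was expected.
    false_abstentions = sum(1 for r in answerable if r["abstained"])
    # Answered although a refusal was expected (the dangerous failure).
    false_answers = sum(1 for r in unanswerable if not r["abstained"])
    return {
        "false_abstention_rate": false_abstentions / len(answerable) if answerable else 0.0,
        "false_answer_rate": false_answers / len(unanswerable) if unanswerable else 0.0,
    }

records = [
    {"expected_abstention": False, "abstained": False},
    {"expected_abstention": False, "abstained": False},
    {"expected_abstention": False, "abstained": False},
    {"expected_abstention": False, "abstained": True},   # false abstention
    {"expected_abstention": True, "abstained": True},
    {"expected_abstention": True, "abstained": False},   # false answer
]
rates = abstention_rates(records)
print(rates)
```

The golden label is only a proxy for "retrieved context actually supported an answer"; a stricter version would also check that the supporting document was retrieved before counting a refusal as a false abstention.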

Citation quality is not citation presence

A common failure: 100% of answers have citations, but 30% of citations don't actually support the specific claim. Build evaluation that checks per-claim support:

- query: "What approval is required for external contractors?"
  claim: "External contractors need written approval and a valid certificate"
  citation_must_support:
    - contractor_type: external
    - requirement_type: written_approval
    - requirement_type: certificate
  failing_citations:
    - Generic contractor discussion without approval specificity
    - Discussion of approval without certificate requirement
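
One way to turn the per-claim spec above into a score, sketched under assumptions: each claim lists the facets its citation must support, and each cited chunk has been annotated (by a human or a calibrated judge) with the facets it actually states. The record layout and function names are hypothetical.

```python
def citation_supports_claim(required_facets, chunk_facets):
    """A citation passes only if the cited chunk covers every facet the claim asserts."""
    return all(facet in chunk_facets for facet in required_facets)

def citation_scores(claims):
    """Return (presence rate, quality rate) over a list of claim records."""
    present = [c for c in claims if c["cited_chunk_facets"] is not None]
    supported = [
        c for c in present
        if citation_supports_claim(c["required"], c["cited_chunk_facets"])
    ]
    return len(present) / len(claims), len(supported) / len(claims)

required = {
    ("contractor_type", "external"),
    ("requirement_type", "written_approval"),
    ("requirement_type", "certificate"),
}
claims = [
    # Cited chunk states all three required facets.
    {"required": required, "cited_chunk_facets": set(required)},
    # Cited chunk discusses approval but never mentions the certificate.
    {"required": required, "cited_chunk_facets": {
        ("contractor_type", "external"),
        ("requirement_type", "written_approval"),
    }},
]
presence, quality = citation_scores(claims)
print(presence, quality)
```

Here every claim has a citation (presence 1.0) but only half of the citations support the full claim (quality 0.5), which is the gap that presence-only checks miss.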

Golden set design

Build the golden set BEFORE building the system. Minimum viable golden set:

- question: What approval is required for external contractors?
  expected_documents:
    - contractor-policy.pdf#page=12
  expected_sections:
    - Approval Requirements
  expected_answer_contains:
    - written approval
    - certificate of compliance
  must_not_contain:
    - unsupported cost estimate
  expected_classification: policy_question
  expected_abstention: false
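
A golden entry like the one above can be checked mechanically. The sketch below assumes the entry has been loaded into a dict and that the pipeline returns retrieved document IDs, an answer string, and an abstention flag. `check_golden_entry` is a hypothetical helper, and plain substring matching is a crude stand-in: an entry such as "unsupported cost estimate" describes a kind of content, so in practice it needs a rubric or judge rather than a literal match.

```python
def check_golden_entry(entry, retrieved_docs, answer, abstained):
    """Compare one pipeline response against one golden-set entry."""
    answer_lc = answer.lower()
    return {
        # Every expected document must appear among the retrieved ones.
        "retrieval_hit": all(
            doc in retrieved_docs for doc in entry.get("expected_documents", [])
        ),
        # Required phrases that the answer failed to mention.
        "missing_phrases": [
            p for p in entry.get("expected_answer_contains", [])
            if p.lower() not in answer_lc
        ],
        # Forbidden phrases that leaked into the answer.
        "forbidden_phrases": [
            p for p in entry.get("must_not_contain", []) if p.lower() in answer_lc
        ],
        "abstention_ok": abstained == entry.get("expected_abstention", False),
    }

entry = {
    "question": "What approval is required for external contractors?",
    "expected_documents": ["contractor-policy.pdf#page=12"],
    "expected_answer_contains": ["written approval", "certificate of compliance"],
    "must_not_contain": ["unsupported cost estimate"],
    "expected_abstention": False,
}
report = check_golden_entry(
    entry,
    retrieved_docs=["contractor-policy.pdf#page=12"],
    answer="External contractors need written approval.",
    abstained=False,
)
print(report)
```

The report shows retrieval succeeded while the answer omitted "certificate of compliance", so the failure belongs to generation rather than retrieval.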

Coverage targets:

- ~60% happy path (typical queries with clear answers)
- ~20% edge cases (rare terms, multi-hop, comparisons)
- ~20% known unanswerable (verify abstention works)

Even 50 labeled queries is dramatically better than zero. Don't wait for perfection.

LLM-as-judge: dangers and mitigations

When you move to LLM-as-judge at scale (RAGAS, TruLens, custom), know the biases:

Bias	Mitigation
Position bias	Randomize position
Verbosity bias	Cap or normalize answer length
Self-preference bias	Use different model family for judge vs generator
Style bias	Score against rubric, not impression

Calibration is mandatory. Sample 10% of judge-graded responses, have humans grade them, and track judge–human agreement. If agreement drops below 80%, treat the judge as drifting or miscalibrated: revise the judge prompt or rubric, or re-train the judge, before trusting its scores.
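
The calibration loop can be sketched as follows, assuming judge and human labels are directly comparable values (for example "pass"/"fail"). The helper names are illustrative; the 10% sample and 80% threshold come from the guidance above. Raw agreement is used here for simplicity, and a chance-corrected statistic could be substituted.

```python
import random

def sample_for_human_review(judged, fraction=0.10, seed=0):
    """Pick a reproducible random sample of judge-graded items for human grading."""
    rng = random.Random(seed)
    k = max(1, round(len(judged) * fraction))
    return rng.sample(judged, k)

def agreement_rate(pairs):
    """Share of (judge_label, human_label) pairs where the two labels match."""
    return sum(1 for judge, human in pairs if judge == human) / len(pairs)

def judge_needs_recalibration(pairs, threshold=0.80):
    """True when judge-human agreement falls below the threshold."""
    return agreement_rate(pairs) < threshold

judged = [{"id": i, "judge_label": "pass"} for i in range(50)]
review_batch = sample_for_human_review(judged)

# Toy human grades: 7 of 10 agree with the judge.
pairs = [("pass", "pass")] * 7 + [("pass", "fail")] * 3
print(len(review_batch), agreement_rate(pairs), judge_needs_recalibration(pairs))
```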

Eval at scale

Approach	When
Hand-labeled golden set	Always, from day one
Synthetic eval generation	Expand coverage; verify 10% by hand
Production-trace eval	After 1+ month live; sample real queries, label, feed back
Drift detection	Monitor distribution shifts in queries, scores, abstention rate
Counterfactual eval	Test abstention works — remove answer doc, verify refusal
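
The counterfactual row can be sketched like this: drop the documents a golden entry depends on, rerun the pipeline, and expect a refusal. `answer_fn` stands in for the real RAG pipeline, and `toy_answer_fn`, the dict-based corpus, and the second document ID are all hypothetical, used only to make the example runnable.

```python
def counterfactual_abstention_check(entry, corpus, answer_fn):
    """Remove the entry's expected documents and report whether the pipeline refuses."""
    removed = set(entry["expected_documents"])
    reduced_corpus = {
        doc_id: text for doc_id, text in corpus.items() if doc_id not in removed
    }
    result = answer_fn(entry["question"], reduced_corpus)
    return result["abstained"]

def toy_answer_fn(question, corpus):
    """Stand-in pipeline: answers only if some document mentions 'approval'."""
    hits = [doc_id for doc_id, text in corpus.items() if "approval" in text.lower()]
    return {"abstained": not hits, "sources": hits}

corpus = {
    "contractor-policy.pdf#page=12": "External contractors need written approval and a certificate.",
    "travel-policy.pdf#page=3": "Economy class is required for flights under six hours.",
}
entry = {
    "question": "What approval is required for external contractors?",
    "expected_documents": ["contractor-policy.pdf#page=12"],
}

answers_normally = not toy_answer_fn(entry["question"], corpus)["abstained"]
refuses_without_doc = counterfactual_abstention_check(entry, corpus, toy_answer_fn)
print(answers_normally, refuses_without_doc)
```

A pipeline that still answers after the supporting document is removed is generating from something other than the evidence, which is exactly the False Answer Rate failure.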

CI gate pattern

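# Eval CI gate (sketch: run_eval, golden_set, and baseline come from your eval harness)
def rag_ci_gate(pull_request):
    results = run_eval(golden_set, pull_request.code)
    failures = []
    # Allow small noise: 0.02 on recall and faithfulness, 0.05 on citation quality.
    if results.recall_at_10 < baseline.recall_at_10 - 0.02:
        failures.append("Recall@10 regression")
    if results.faithfulness < baseline.faithfulness - 0.02:
        failures.append("Faithfulness regression")
    if results.citation_quality < baseline.citation_quality - 0.05:
        failures.append("Citation quality regression")
    if failures:
        # A non-zero exit fails the CI job and reports every regression at once.
        raise SystemExit("RAG eval gate failed: " + "; ".join(failures))
    return True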

Run this on every PR that touches the RAG pipeline.
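
A self-contained variant of the gate, assuming baseline and current scores are plain dicts (for instance loaded from a JSON baseline file). The metric keys and tolerances mirror the gate above; `find_regressions` and the sample scores are illustrative only.

```python
import json

# Allowed drop below baseline per metric, matching the gate thresholds.
TOLERANCES = {"recall_at_10": 0.02, "faithfulness": 0.02, "citation_quality": 0.05}

def find_regressions(results, baseline, tolerances=TOLERANCES):
    """Return one message per metric that fell below baseline minus its tolerance."""
    failures = []
    for metric, allowed_drop in tolerances.items():
        if results[metric] < baseline[metric] - allowed_drop:
            failures.append(
                f"{metric}: {results[metric]:.3f} vs baseline {baseline[metric]:.3f}"
            )
    return failures

# In a real run the baseline would be read from a file; inlined here for the example.
baseline = json.loads(
    '{"recall_at_10": 0.90, "faithfulness": 0.95, "citation_quality": 0.85}'
)
results = {"recall_at_10": 0.89, "faithfulness": 0.90, "citation_quality": 0.82}

failures = find_regressions(results, baseline)
print(failures)
```

In a CI job the script would exit with a non-zero status when `failures` is non-empty. Keeping tolerances in one table makes it easy to add a metric without adding another branch.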

Python + FastAPI eval scaffolding

When the user wants to set up evaluation in a Python/FastAPI project:

evals/
  golden_set.yaml          # Hand-labeled queries
  run_eval.py              # Eval runner
  metrics.py               # Retrieval + answer metrics
  judges.py                # LLM-as-judge (v2)
  baselines.json           # Baseline scores for CI comparison
  reports/                 # Per-run reports

Suggested libraries: ragas (with calibration), deepeval, or custom — depends on stack. The framework matters less than having the golden set and running it on every change.

Diagnostic Matrix

When a metric regresses, use this table to triage the root cause before touching any code. Fix the upstream layer first — don't patch generation when retrieval is broken.

Bad Metric	Likely Root Cause	Where to Look First
Low Recall@K	Bad chunking (chunks too large/small), weak embedding model, indexing gap, metadata filter too strict	Check chunking strategy, embedding model selection, pre-filter logic
Low Precision@K	Too many irrelevant chunks retrieved, reranker missing or weak, k too high	Reduce k, add/tune reranker, tighten hybrid retrieval
Low Context Precision	Reranker missing or underperforming — relevant chunks exist but are buried	Reranker not deployed, or `ef_search`/`nprobe` index params too loose; tune or add cross-encoder
Low Context Relevance	Semantic drift between query and document vocabulary, poor chunking, wrong retrieval strategy	Check query expansion, embedding model mismatch, chunking granularity
Low Answer Relevance	Prompt issue, generation model too weak, instructions unclear or conflicting	Audit system prompt, check for prompt injection, try stronger model
Low Faithfulness	Hallucination, citation enforcement absent in prompt, grounding prompt too weak	Add atomic-claim citation check, tighten grounding instructions, check context packing for noise

Key principle: trace backward from the bad metric to the pipeline layer that owns it. Low Faithfulness is almost always a generation-layer or context-packing problem, not a retrieval problem. Low Recall@K is almost always a retrieval or chunking problem — fixing the prompt won't help.

How to apply

When asked about RAG evaluation:

1. Find out their phase. v1 (no eval yet) needs a golden set first. v2 (has eval) needs scale and calibration.
2. Push back on "we'll evaluate later." No golden set means flying blind. Even 50 queries beats nothing.
3. Start with retrieval metrics, not just answer metrics. Recall@K catches retrieval failures that look like generation failures.
4. Warn against uncalibrated LLM-as-judge. It is one of the most common eval mistakes.
5. Recommend per-claim citation checking; citation presence alone is a weak signal.

Reference: full playbook §33–37 (evaluation), §35.2 (citation quality), §37.1 (eval at scale).