원클릭으로 Manus에서 모든 스킬 실행

$pwd:

llm-evaluation

Name: Llm Evaluation
Author: applied-artificial-intelligence

// LLM evaluation and testing patterns including prompt testing, hallucination detection, benchmark creation, and quality metrics. Use when testing LLM applications, validating prompt quality, implementing systematic evaluation, or measuring LLM performance.

Manus에서 실행

$ git log --oneline --stat

stars:60

forks:15

updated:2025년 11월 1일 23:59

SKILL.md

readonly

package.json

"author": "applied-artificial-intelligence"

"repository": "applied-artificial-intelligence/claude-code-toolkit"

$ gh browse

$ install --globalskills.sh

$ download --local

Manus에서 실행

[HINT] SKILL.md 및 모든 관련 파일을 포함한 전체 스킬 디렉토리를 다운로드합니다

related-imports.ts

// 관련 스킬

import feishu-wiki

from "openclaw"

361,765

import openai-whisper-api

import sherpa-onnx-tts

import openai-whisper

name	llm-evaluation
description	LLM evaluation and testing patterns including prompt testing, hallucination detection, benchmark creation, and quality metrics. Use when testing LLM applications, validating prompt quality, implementing systematic evaluation, or measuring LLM performance.

LLM Evaluation & Testing

Comprehensive guide to evaluating and testing LLM applications including prompt testing, output validation, hallucination detection, benchmark creation, A/B testing, and quality metrics.

Quick Reference

When to use this skill:

Testing LLM application outputs
Validating prompt quality and consistency
Detecting hallucinations and factual errors
Creating evaluation benchmarks
A/B testing prompts or models
Implementing continuous evaluation (CI/CD)
Measuring retrieval quality (for RAG)
Debugging unexpected LLM behavior

Metrics covered:

Traditional: BLEU, ROUGE, BERTScore, Perplexity
LLM-as-Judge: GPT-4 evaluation, rubric-based scoring
Task-specific: Exact match, F1, accuracy, recall
Quality: Toxicity, bias, coherence, relevance

Part 1: Evaluation Fundamentals

The LLM Evaluation Challenge

Why LLM evaluation is hard:

Subjective quality - "Good" output varies by use case
No single ground truth - Multiple valid answers
Context-dependent - Same output good/bad in different scenarios
Expensive to label - Human evaluation doesn't scale
Adversarial brittleness - Small prompt changes = large output changes

Solution: Multi-layered evaluation

Layer 1: Automated Metrics (fast, scalable)
  ↓
Layer 2: LLM-as-Judge (flexible, nuanced)
  ↓
Layer 3: Human Review (gold standard, expensive)

Evaluation Dataset Structure

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalExample:
    """Single evaluation example."""
    input: str  # User input / prompt
    expected_output: Optional[str]  # Gold standard (if exists)
    context: Optional[str]  # Additional context (for RAG)
    metadata: dict  # Category, difficulty, etc.

@dataclass
class EvalResult:
    """Evaluation result for one example."""
    example_id: str
    actual_output: str
    scores: dict  # {'metric_name': score}
    passed: bool
    failure_reason: Optional[str]

# Example dataset
eval_dataset = [
    EvalExample(
        input="What is the capital of France?",
        expected_output="Paris",
        context=None,
        metadata={'category': 'factual', 'difficulty': 'easy'}
    ),
    EvalExample(
        input="Explain quantum entanglement",
        expected_output=None,  # No single answer
        context=None,
        metadata={'category': 'explanation', 'difficulty': 'hard'}
    )
]

Part 2: Traditional Metrics

Metric 1: Exact Match (Simplest)

def exact_match(predicted: str, expected: str, case_sensitive: bool = False) -> float:
    """
    Binary metric: 1.0 if match, 0.0 otherwise.

    Use for: Classification, short answers, structured output
    Limitations: Too strict for generation tasks
    """
    if not case_sensitive:
        predicted = predicted.lower().strip()
        expected = expected.lower().strip()

    return 1.0 if predicted == expected else 0.0

# Example
score = exact_match("Paris", "paris")  # 1.0
score = exact_match("The capital is Paris", "Paris")  # 0.0

Metric 2: ROUGE (Recall-Oriented)

from rouge_score import rouge_scorer

def compute_rouge(predicted: str, expected: str) -> dict:
    """
    ROUGE metrics for text overlap.

    ROUGE-1: Unigram overlap
    ROUGE-2: Bigram overlap
    ROUGE-L: Longest common subsequence

    Use for: Summarization, translation
    Limitations: Doesn't capture semantics
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(expected, predicted)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Example
scores = compute_rouge(
    predicted="Paris is the capital of France",
    expected="The capital of France is Paris"
)
# {'rouge1': 0.82, 'rouge2': 0.67, 'rougeL': 0.82}

Metric 3: BERTScore (Semantic Similarity)

from bert_score import score as bert_score

def compute_bertscore(predicted: List[str], expected: List[str]) -> dict:
    """
    Semantic similarity using BERT embeddings.

    Better than ROUGE for:
    - Paraphrases
    - Semantic equivalence
    - Generation quality

    Returns: Precision, Recall, F1
    """
    P, R, F1 = bert_score(predicted, expected, lang="en", verbose=False)

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }

# Example
scores = compute_bertscore(
    predicted=["The capital of France is Paris"],
    expected=["Paris is France's capital city"]
)
# {'precision': 0.94, 'recall': 0.91, 'f1': 0.92}

Metric 4: Perplexity (Model Confidence)

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def compute_perplexity(text: str, model_name: str = "gpt2") -> float:
    """
    Perplexity: How "surprised" is the model by this text?

    Lower = More likely/fluent
    Use for: Fluency, naturalness
    Limitations: Doesn't measure correctness
    """
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    perplexity = torch.exp(loss).item()
    return perplexity

# Example
ppl = compute_perplexity("Paris is the capital of France")  # Low (fluent)
ppl2 = compute_perplexity("Capital France the is Paris of")  # High (awkward)

Part 3: LLM-as-Judge Evaluation

Pattern 1: Rubric-Based Scoring

from openai import OpenAI

client = OpenAI()

EVALUATION_PROMPT = """
You are an expert evaluator. Score the assistant's response on a scale of 1-5 for each criterion:

**Criteria:**
1. **Accuracy**: Is the information factually correct?
2. **Completeness**: Does it fully answer the question?
3. **Clarity**: Is it easy to understand?
4. **Conciseness**: Is it appropriately brief?

**Response to evaluate:**
{response}

**Expected answer (reference):**
{expected}

Provide scores in JSON format:
{{
  "accuracy": <1-5>,
  "completeness": <1-5>,
  "clarity": <1-5>,
  "conciseness": <1-5>,
  "reasoning": "Brief explanation"
}}
"""

def llm_judge_score(response: str, expected: str) -> dict:
    """
    Use GPT-4 as judge with rubric scoring.

    Pros: Flexible, nuanced, scales well
    Cons: Costs $, potential bias, slower
    """
    prompt = EVALUATION_PROMPT.format(response=response, expected=expected)

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    import json
    scores = json.loads(completion.choices[0].message.content)
    return scores

# Example
scores = llm_judge_score(
    response="Paris is the capital of France, located in the north-central part of the country.",
    expected="Paris"
)
# {'accuracy': 5, 'completeness': 5, 'clarity': 5, 'conciseness': 3, 'reasoning': '...'}

Pattern 2: Binary Pass/Fail Evaluation

PASS_FAIL_PROMPT = """
Evaluate if the assistant's response is acceptable.

**Question:** {question}
**Response:** {response}
**Criteria:** {criteria}

Return ONLY "PASS" or "FAIL" followed by a one-sentence reason.
"""

def binary_eval(question: str, response: str, criteria: str) -> tuple[bool, str]:
    """
    Simple pass/fail evaluation.

    Use for: Unit tests, regression tests, CI/CD
    """
    prompt = PASS_FAIL_PROMPT.format(
        question=question,
        response=response,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0  # Deterministic
    )

    result = completion.choices[0].message.content
    passed = result.startswith("PASS")
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return passed, reason

# Example
passed, reason = binary_eval(
    question="What is the capital of France?",
    response="The capital is Paris",
    criteria="Response must mention Paris"
)
# (True, "Response correctly identifies Paris as the capital")

Pattern 3: Pairwise Comparison (A/B Testing)

PAIRWISE_PROMPT = """
Compare two responses to the same question. Which is better?

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

**Criteria:** {criteria}

Return ONLY: "A", "B", or "TIE", followed by a one-sentence explanation.
"""

def pairwise_comparison(
    question: str,
    response_a: str,
    response_b: str,
    criteria: str = "Overall quality, accuracy, and helpfulness"
) -> tuple[str, str]:
    """
    A/B test two responses.

    Use for: Prompt engineering, model comparison
    """
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        response_a=response_a,
        response_b=response_b,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )

    result = completion.choices[0].message.content
    winner = result.split()[0]  # "A", "B", or "TIE"
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return winner, reason

# Example
winner, reason = pairwise_comparison(
    question="Explain quantum computing",
    response_a="Quantum computers use qubits instead of bits...",
    response_b="Quantum computing is complex. It uses quantum mechanics."
)
# ("A", "Response A provides more detail and explanation")

Part 4: Hallucination Detection

Method 1: Grounding Check

def check_grounding(response: str, context: str) -> dict:
    """
    Verify response is grounded in provided context.

    Critical for RAG systems.
    """
    GROUNDING_PROMPT = """
    Context: {context}

    Response: {response}

    Is the response fully supported by the context? Answer with:
    - "GROUNDED": All claims supported
    - "PARTIALLY_GROUNDED": Some claims unsupported
    - "NOT_GROUNDED": Contains unsupported claims

    List any unsupported claims.
    """

    prompt = GROUNDING_PROMPT.format(context=context, response=response)

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    result = completion.choices[0].message.content
    status = result.split("\n")[0]
    unsupported = [line for line in result.split("\n")[1:] if line.strip()]

    return {
        'grounding_status': status,
        'unsupported_claims': unsupported,
        'is_hallucination': status != "GROUNDED"
    }

Method 2: Factuality Check (External Verification)

def check_factuality(claim: str, use_search: bool = True) -> dict:
    """
    Verify factual claims using external sources.

    Options:
    1. Web search + verification
    2. Knowledge base lookup
    3. Cross-reference with trusted source
    """
    if use_search:
        # Use web search to verify
        from tavily import TavilyClient
        tavily = TavilyClient(api_key="your-key")

        # Search for evidence
        results = tavily.search(claim, max_results=3)

        # Ask LLM to verify based on search results
        VERIFY_PROMPT = """
        Claim: {claim}

        Search results:
        {results}

        Is the claim supported by these sources? Answer: TRUE, FALSE, or UNCERTAIN.
        Explanation:
        """

        prompt = VERIFY_PROMPT.format(
            claim=claim,
            results="\n\n".join([r['content'] for r in results])
        )

        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        result = completion.choices[0].message.content
        is_factual = result.startswith("TRUE")

        return {
            'claim': claim,
            'factual': is_factual,
            'evidence': results,
            'explanation': result
        }

Method 3: Self-Consistency Check

def self_consistency_check(question: str, num_samples: int = 5) -> dict:
    """
    Generate multiple responses, check for consistency.

    If model is confident, responses should be consistent.
    Inconsistency suggests hallucination risk.
    """
    responses = []

    for _ in range(num_samples):
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0.7  # Some randomness
        )
        responses.append(completion.choices[0].message.content)

    # Compute pairwise similarity
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(responses)
    similarities = cosine_similarity(vectors)

    # Average pairwise similarity
    avg_similarity = similarities.sum() / (len(responses) * (len(responses) - 1))

    return {
        'responses': responses,
        'avg_similarity': avg_similarity,
        'is_consistent': avg_similarity > 0.7,  # Threshold
        'confidence': 'high' if avg_similarity > 0.85 else 'medium' if avg_similarity > 0.7 else 'low'
    }

Part 5: RAG-Specific Evaluation

Retrieval Quality Metrics

def evaluate_retrieval(query: str, retrieved_docs: List[dict], relevant_doc_ids: List[str]) -> dict:
    """
    Evaluate retrieval quality using IR metrics.

    Precision: What % of retrieved docs are relevant?
    Recall: What % of relevant docs were retrieved?
    MRR: Mean Reciprocal Rank
    NDCG: Normalized Discounted Cumulative Gain
    """
    retrieved_ids = [doc['id'] for doc in retrieved_docs]

    # Precision
    true_positives = len(set(retrieved_ids) & set(relevant_doc_ids))
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0

    # Recall
    recall = true_positives / len(relevant_doc_ids) if relevant_doc_ids else 0.0

    # F1
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    # MRR (Mean Reciprocal Rank)
    mrr = 0.0
    for i, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_doc_ids:
            mrr = 1.0 / i
            break

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'mrr': mrr,
        'num_retrieved': len(retrieved_ids),
        'num_relevant_retrieved': true_positives
    }

End-to-End RAG Evaluation

def evaluate_rag_pipeline(
    question: str,
    generated_answer: str,
    retrieved_docs: List[dict],
    ground_truth: str,
    relevant_doc_ids: List[str]
) -> dict:
    """
    Comprehensive RAG evaluation.

    1. Retrieval quality (precision, recall)
    2. Answer quality (ROUGE, BERTScore)
    3. Answer grounding (hallucination check)
    4. Citation accuracy
    """
    # 1. Retrieval metrics
    retrieval_scores = evaluate_retrieval(question, retrieved_docs, relevant_doc_ids)

    # 2. Answer quality
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])

    rouge_scores = compute_rouge(generated_answer, ground_truth)
    bert_scores = compute_bertscore([generated_answer], [ground_truth])

    # 3. Grounding check
    grounding = check_grounding(generated_answer, context)

    # 4. LLM-as-judge overall quality
    judge_scores = llm_judge_score(generated_answer, ground_truth)

    return {
        'retrieval': retrieval_scores,
        'answer_quality': {
            'rouge': rouge_scores,
            'bertscore': bert_scores
        },
        'grounding': grounding,
        'llm_judge': judge_scores,
        'overall_pass': (
            retrieval_scores['f1'] > 0.5 and
            grounding['grounding_status'] == "GROUNDED" and
            judge_scores['accuracy'] >= 4
        )
    }

Part 6: Prompt Testing Frameworks

Framework 1: Regression Test Suite

class PromptTestSuite:
    """
    Unit tests for prompts (like pytest for LLMs).
    """

    def __init__(self):
        self.tests = []
        self.results = []

    def add_test(self, name: str, input: str, criteria: str):
        """Add a test case."""
        self.tests.append({
            'name': name,
            'input': input,
            'criteria': criteria
        })

    def run(self, generate_fn):
        """Run all tests with given generation function."""
        for test in self.tests:
            response = generate_fn(test['input'])
            passed, reason = binary_eval(
                question=test['input'],
                response=response,
                criteria=test['criteria']
            )

            self.results.append({
                'test_name': test['name'],
                'passed': passed,
                'reason': reason,
                'response': response
            })

        return self.results

    def summary(self) -> dict:
        """Get test summary."""
        total = len(self.results)
        passed = sum(1 for r in self.results if r['passed'])

        return {
            'total_tests': total,
            'passed': passed,
            'failed': total - passed,
            'pass_rate': passed / total if total > 0 else 0.0
        }

# Usage
suite = PromptTestSuite()
suite.add_test("capital_france", "What is the capital of France?", "Must mention Paris")
suite.add_test("capital_germany", "What is the capital of Germany?", "Must mention Berlin")

def my_generate(prompt):
    # Your LLM call
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

results = suite.run(my_generate)
print(suite.summary())
# {'total_tests': 2, 'passed': 2, 'failed': 0, 'pass_rate': 1.0}

Framework 2: A/B Testing Framework

class ABTest:
    """
    A/B test prompts or models.
    """

    def __init__(self, test_cases: List[dict]):
        self.test_cases = test_cases
        self.results = []

    def run(self, generate_a, generate_b):
        """Compare two generation functions."""
        for test in self.test_cases:
            response_a = generate_a(test['input'])
            response_b = generate_b(test['input'])

            winner, reason = pairwise_comparison(
                question=test['input'],
                response_a=response_a,
                response_b=response_b
            )

            self.results.append({
                'input': test['input'],
                'response_a': response_a,
                'response_b': response_b,
                'winner': winner,
                'reason': reason
            })

        return self.results

    def summary(self) -> dict:
        """Aggregate results."""
        total = len(self.results)
        a_wins = sum(1 for r in self.results if r['winner'] == 'A')
        b_wins = sum(1 for r in self.results if r['winner'] == 'B')
        ties = sum(1 for r in self.results if r['winner'] == 'TIE')

        return {
            'total_comparisons': total,
            'a_wins': a_wins,
            'b_wins': b_wins,
            'ties': ties,
            'a_win_rate': a_wins / total if total > 0 else 0.0,
            'statistical_significance': self._check_significance(a_wins, b_wins, total)
        }

    def _check_significance(self, a_wins, b_wins, total):
        """Simple binomial test for statistical significance."""
        from scipy.stats import binom_test
        # H0: Both equally good (p=0.5)
        p_value = binom_test(max(a_wins, b_wins), total, 0.5)
        return p_value < 0.05  # Significant at 95% confidence

Part 7: Production Monitoring

Continuous Evaluation Pipeline

import logging
from datetime import datetime

class ProductionMonitor:
    """
    Monitor LLM performance in production.
    """

    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate
        self.metrics = []
        self.logger = logging.getLogger(__name__)

    def log_interaction(self, user_input: str, model_output: str, metadata: dict):
        """Log interaction for evaluation."""
        import random

        # Sample traffic for evaluation
        if random.random() < self.sample_rate:
            # Run automated checks
            toxicity = self._check_toxicity(model_output)
            perplexity = compute_perplexity(model_output)

            metric = {
                'timestamp': datetime.now().isoformat(),
                'user_input': user_input,
                'model_output': model_output,
                'toxicity_score': toxicity,
                'perplexity': perplexity,
                'latency_ms': metadata.get('latency_ms'),
                'model_version': metadata.get('model_version')
            }

            self.metrics.append(metric)

            # Alert if anomaly detected
            if toxicity > 0.5:
                self.logger.warning(f"High toxicity detected: {toxicity}")

    def _check_toxicity(self, text: str) -> float:
        """Check for toxic content."""
        from detoxify import Detoxify
        model = Detoxify('original')
        results = model.predict(text)
        return max(results.values())  # Max toxicity score

    def get_metrics(self) -> dict:
        """Aggregate metrics."""
        if not self.metrics:
            return {}

        return {
            'total_interactions': len(self.metrics),
            'avg_toxicity': sum(m['toxicity_score'] for m in self.metrics) / len(self.metrics),
            'avg_perplexity': sum(m['perplexity'] for m in self.metrics) / len(self.metrics),
            'avg_latency_ms': sum(m['latency_ms'] for m in self.metrics if m.get('latency_ms')) / len(self.metrics),
            'high_toxicity_rate': sum(1 for m in self.metrics if m['toxicity_score'] > 0.5) / len(self.metrics)
        }

Part 8: Best Practices

Practice 1: Layered Evaluation Strategy

# Layer 1: Fast, cheap automated checks
def quick_checks(response: str) -> bool:
    """Run fast automated checks."""
    # Length check
    if len(response) < 10:
        return False

    # Toxicity check
    if check_toxicity(response) > 0.5:
        return False

    # Basic coherence (perplexity)
    if compute_perplexity(response) > 100:
        return False

    return True

# Layer 2: LLM-as-judge (selective)
def llm_evaluation(response: str, criteria: str) -> float:
    """Run LLM evaluation on subset."""
    scores = llm_judge_score(response, criteria)
    return sum(scores.values()) / len(scores)  # Average score

# Layer 3: Human review (expensive, critical cases)
def flag_for_human_review(response: str, confidence: float) -> bool:
    """Determine if human review needed."""
    return (
        confidence < 0.7 or
        len(response) > 1000 or  # Long responses
        "uncertain" in response.lower()  # Model uncertainty
    )

# Combined pipeline
def evaluate_response(question: str, response: str) -> dict:
    # Layer 1: Quick checks
    if not quick_checks(response):
        return {'status': 'failed_quick_checks', 'human_review': False}

    # Layer 2: LLM judge
    score = llm_evaluation(response, "accuracy and helpfulness")
    confidence = score / 5.0

    # Layer 3: Human review decision
    needs_human = flag_for_human_review(response, confidence)

    return {
        'status': 'passed' if score >= 3.5 else 'failed',
        'score': score,
        'confidence': confidence,
        'human_review': needs_human
    }

Practice 2: Version Your Prompts

from typing import Dict
import hashlib

class PromptVersion:
    """Track prompt versions for A/B testing and rollback."""

    def __init__(self):
        self.versions = {}
        self.active_version = None

    def register(self, name: str, prompt_template: str, metadata: dict = None):
        """Register a prompt version."""
        version_id = hashlib.md5(prompt_template.encode()).hexdigest()[:8]

        self.versions[version_id] = {
            'name': name,
            'template': prompt_template,
            'metadata': metadata or {},
            'created_at': datetime.now(),
            'metrics': {'total_uses': 0, 'avg_score': 0.0}
        }

        return version_id

    def use(self, version_id: str, **kwargs) -> str:
        """Use a specific prompt version."""
        if version_id not in self.versions:
            raise ValueError(f"Unknown version: {version_id}")

        version = self.versions[version_id]
        version['metrics']['total_uses'] += 1

        return version['template'].format(**kwargs)

    def update_metrics(self, version_id: str, score: float):
        """Update performance metrics for a version."""
        version = self.versions[version_id]
        current_avg = version['metrics']['avg_score']
        total_uses = version['metrics']['total_uses']

        # Running average
        new_avg = ((current_avg * (total_uses - 1)) + score) / total_uses
        version['metrics']['avg_score'] = new_avg

# Usage
pm = PromptVersion()

v1 = pm.register(
    name="question_answering_v1",
    prompt_template="Answer this question: {question}",
    metadata={'author': 'alice', 'date': '2024-01-01'}
)

v2 = pm.register(
    name="question_answering_v2",
    prompt_template="You are a helpful assistant. Answer: {question}",
    metadata={'author': 'bob', 'date': '2024-01-15'}
)

# A/B test
prompt = pm.use(v1, question="What is AI?")  # 50% traffic
score = llm_evaluation(response, criteria)
pm.update_metrics(v1, score)

Quick Decision Trees

"Which evaluation method should I use?"

Have ground truth labels?
  YES → ROUGE, BERTScore, Exact Match
  NO  → LLM-as-judge, Human review

Evaluating factual correctness?
  YES → Grounding check, Factuality verification
  NO  → Subjective quality → LLM-as-judge

Need fast feedback (CI/CD)?
  YES → Binary pass/fail tests
  NO  → Comprehensive multi-metric evaluation

Budget constraints?
  Tight → Automated metrics only
  Moderate → LLM-as-judge + sampling
  No limit → Human review gold standard

"How to detect hallucinations?"

Have source documents (RAG)?
  YES → Grounding check against context
  NO  → Continue

Can verify with search?
  YES → Factuality check with web search
  NO  → Continue

Check model confidence?
  YES → Self-consistency check (multiple samples)
  NO  → Flag for human review

Resources

ROUGE: https://github.com/google-research/google-research/tree/master/rouge
BERTScore: https://github.com/Tiiiger/bert_score
OpenAI Evals: https://github.com/openai/evals
LangChain Evaluation: https://python.langchain.com/docs/guides/evaluation/
Ragas (RAG eval): https://github.com/explodinggradients/ragas

Skill version: 1.0.0 Last updated: 2025-10-25 Maintained by: Applied Artificial Intelligence

name	llm-evaluation
description	LLM evaluation and testing patterns including prompt testing, hallucination detection, benchmark creation, and quality metrics. Use when testing LLM applications, validating prompt quality, implementing systematic evaluation, or measuring LLM performance.

LLM Evaluation & Testing

Comprehensive guide to evaluating and testing LLM applications including prompt testing, output validation, hallucination detection, benchmark creation, A/B testing, and quality metrics.

Quick Reference

When to use this skill:

Testing LLM application outputs
Validating prompt quality and consistency
Detecting hallucinations and factual errors
Creating evaluation benchmarks
A/B testing prompts or models
Implementing continuous evaluation (CI/CD)
Measuring retrieval quality (for RAG)
Debugging unexpected LLM behavior

Metrics covered:

Traditional: BLEU, ROUGE, BERTScore, Perplexity
LLM-as-Judge: GPT-4 evaluation, rubric-based scoring
Task-specific: Exact match, F1, accuracy, recall
Quality: Toxicity, bias, coherence, relevance

Part 1: Evaluation Fundamentals

The LLM Evaluation Challenge

Why LLM evaluation is hard:

Subjective quality - "Good" output varies by use case
No single ground truth - Multiple valid answers
Context-dependent - Same output good/bad in different scenarios
Expensive to label - Human evaluation doesn't scale
Adversarial brittleness - Small prompt changes = large output changes

Solution: Multi-layered evaluation

Layer 1: Automated Metrics (fast, scalable)
  ↓
Layer 2: LLM-as-Judge (flexible, nuanced)
  ↓
Layer 3: Human Review (gold standard, expensive)

Evaluation Dataset Structure

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalExample:
    """Single evaluation example."""
    input: str  # User input / prompt
    expected_output: Optional[str]  # Gold standard (if exists)
    context: Optional[str]  # Additional context (for RAG)
    metadata: dict  # Category, difficulty, etc.

@dataclass
class EvalResult:
    """Evaluation result for one example."""
    example_id: str
    actual_output: str
    scores: dict  # {'metric_name': score}
    passed: bool
    failure_reason: Optional[str]

# Example dataset
eval_dataset = [
    EvalExample(
        input="What is the capital of France?",
        expected_output="Paris",
        context=None,
        metadata={'category': 'factual', 'difficulty': 'easy'}
    ),
    EvalExample(
        input="Explain quantum entanglement",
        expected_output=None,  # No single answer
        context=None,
        metadata={'category': 'explanation', 'difficulty': 'hard'}
    )
]

Part 2: Traditional Metrics

Metric 1: Exact Match (Simplest)

def exact_match(predicted: str, expected: str, case_sensitive: bool = False) -> float:
    """
    Binary metric: 1.0 if match, 0.0 otherwise.

    Use for: Classification, short answers, structured output
    Limitations: Too strict for generation tasks
    """
    if not case_sensitive:
        predicted = predicted.lower().strip()
        expected = expected.lower().strip()

    return 1.0 if predicted == expected else 0.0

# Example
score = exact_match("Paris", "paris")  # 1.0
score = exact_match("The capital is Paris", "Paris")  # 0.0

Metric 2: ROUGE (Recall-Oriented)

from rouge_score import rouge_scorer

def compute_rouge(predicted: str, expected: str) -> dict:
    """
    ROUGE metrics for text overlap.

    ROUGE-1: Unigram overlap
    ROUGE-2: Bigram overlap
    ROUGE-L: Longest common subsequence

    Use for: Summarization, translation
    Limitations: Doesn't capture semantics
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(expected, predicted)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Example
scores = compute_rouge(
    predicted="Paris is the capital of France",
    expected="The capital of France is Paris"
)
# {'rouge1': 0.82, 'rouge2': 0.67, 'rougeL': 0.82}

Metric 3: BERTScore (Semantic Similarity)

from bert_score import score as bert_score

def compute_bertscore(predicted: List[str], expected: List[str]) -> dict:
    """
    Semantic similarity using BERT embeddings.

    Better than ROUGE for:
    - Paraphrases
    - Semantic equivalence
    - Generation quality

    Returns: Precision, Recall, F1
    """
    P, R, F1 = bert_score(predicted, expected, lang="en", verbose=False)

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }

# Example
scores = compute_bertscore(
    predicted=["The capital of France is Paris"],
    expected=["Paris is France's capital city"]
)
# {'precision': 0.94, 'recall': 0.91, 'f1': 0.92}

Metric 4: Perplexity (Model Confidence)

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def compute_perplexity(text: str, model_name: str = "gpt2") -> float:
    """
    Perplexity: How "surprised" is the model by this text?

    Lower = More likely/fluent
    Use for: Fluency, naturalness
    Limitations: Doesn't measure correctness
    """
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    perplexity = torch.exp(loss).item()
    return perplexity

# Example
ppl = compute_perplexity("Paris is the capital of France")  # Low (fluent)
ppl2 = compute_perplexity("Capital France the is Paris of")  # High (awkward)

Part 3: LLM-as-Judge Evaluation

Pattern 1: Rubric-Based Scoring

from openai import OpenAI

client = OpenAI()

EVALUATION_PROMPT = """
You are an expert evaluator. Score the assistant's response on a scale of 1-5 for each criterion:

**Criteria:**
1. **Accuracy**: Is the information factually correct?
2. **Completeness**: Does it fully answer the question?
3. **Clarity**: Is it easy to understand?
4. **Conciseness**: Is it appropriately brief?

**Response to evaluate:**
{response}

**Expected answer (reference):**
{expected}

Provide scores in JSON format:
{{
  "accuracy": <1-5>,
  "completeness": <1-5>,
  "clarity": <1-5>,
  "conciseness": <1-5>,
  "reasoning": "Brief explanation"
}}
"""

def llm_judge_score(response: str, expected: str) -> dict:
    """
    Use GPT-4 as judge with rubric scoring.

    Pros: Flexible, nuanced, scales well
    Cons: Costs $, potential bias, slower
    """
    prompt = EVALUATION_PROMPT.format(response=response, expected=expected)

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    import json
    scores = json.loads(completion.choices[0].message.content)
    return scores

# Example
scores = llm_judge_score(
    response="Paris is the capital of France, located in the north-central part of the country.",
    expected="Paris"
)
# {'accuracy': 5, 'completeness': 5, 'clarity': 5, 'conciseness': 3, 'reasoning': '...'}

Pattern 2: Binary Pass/Fail Evaluation

PASS_FAIL_PROMPT = """
Evaluate if the assistant's response is acceptable.

**Question:** {question}
**Response:** {response}
**Criteria:** {criteria}

Return ONLY "PASS" or "FAIL" followed by a one-sentence reason.
"""

def binary_eval(question: str, response: str, criteria: str) -> tuple[bool, str]:
    """
    Simple pass/fail evaluation.

    Use for: Unit tests, regression tests, CI/CD
    """
    prompt = PASS_FAIL_PROMPT.format(
        question=question,
        response=response,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0  # Deterministic
    )

    result = completion.choices[0].message.content
    passed = result.startswith("PASS")
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return passed, reason

# Example
passed, reason = binary_eval(
    question="What is the capital of France?",
    response="The capital is Paris",
    criteria="Response must mention Paris"
)
# (True, "Response correctly identifies Paris as the capital")

Pattern 3: Pairwise Comparison (A/B Testing)

PAIRWISE_PROMPT = """
Compare two responses to the same question. Which is better?

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

**Criteria:** {criteria}

Return ONLY: "A", "B", or "TIE", followed by a one-sentence explanation.
"""

def pairwise_comparison(
    question: str,
    response_a: str,
    response_b: str,
    criteria: str = "Overall quality, accuracy, and helpfulness"
) -> tuple[str, str]:
    """
    A/B test two responses.

    Use for: Prompt engineering, model comparison
    """
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        response_a=response_a,
        response_b=response_b,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )

    result = completion.choices[0].message.content
    winner = result.split()[0]  # "A", "B", or "TIE"
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return winner, reason

# Example
winner, reason = pairwise_comparison(
    question="Explain quantum computing",
    response_a="Quantum computers use qubits instead of bits...",
    response_b="Quantum computing is complex. It uses quantum mechanics."
)
# ("A", "Response A provides more detail and explanation")

Part 4: Hallucination Detection

Method 1: Grounding Check

def check_grounding(response: str, context: str) -> dict:
    """
    Verify response is grounded in provided context.

    Critical for RAG systems.
    """
    GROUNDING_PROMPT = """
    Context: {context}

    Response: {response}

    Is the response fully supported by the context? Answer with:
    - "GROUNDED": All claims supported
    - "PARTIALLY_GROUNDED": Some claims unsupported
    - "NOT_GROUNDED": Contains unsupported claims

    List any unsupported claims.
    """

    prompt = GROUNDING_PROMPT.format(context=context, response=response)

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    result = completion.choices[0].message.content
    status = result.split("\n")[0]
    unsupported = [line for line in result.split("\n")[1:] if line.strip()]

    return {
        'grounding_status': status,
        'unsupported_claims': unsupported,
        'is_hallucination': status != "GROUNDED"
    }

Method 2: Factuality Check (External Verification)

def check_factuality(claim: str, use_search: bool = True) -> dict:
    """
    Verify factual claims using external sources.

    Options:
    1. Web search + verification
    2. Knowledge base lookup
    3. Cross-reference with trusted source
    """
    if use_search:
        # Use web search to verify
        from tavily import TavilyClient
        tavily = TavilyClient(api_key="your-key")

        # Search for evidence
        results = tavily.search(claim, max_results=3)

        # Ask LLM to verify based on search results
        VERIFY_PROMPT = """
        Claim: {claim}

        Search results:
        {results}

        Is the claim supported by these sources? Answer: TRUE, FALSE, or UNCERTAIN.
        Explanation:
        """

        prompt = VERIFY_PROMPT.format(
            claim=claim,
            results="\n\n".join([r['content'] for r in results])
        )

        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        result = completion.choices[0].message.content
        is_factual = result.startswith("TRUE")

        return {
            'claim': claim,
            'factual': is_factual,
            'evidence': results,
            'explanation': result
        }

Method 3: Self-Consistency Check

def self_consistency_check(question: str, num_samples: int = 5) -> dict:
    """
    Generate multiple responses, check for consistency.

    If model is confident, responses should be consistent.
    Inconsistency suggests hallucination risk.
    """
    responses = []

    for _ in range(num_samples):
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0.7  # Some randomness
        )
        responses.append(completion.choices[0].message.content)

    # Compute pairwise similarity
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(responses)
    similarities = cosine_similarity(vectors)

    # Average pairwise similarity
    avg_similarity = similarities.sum() / (len(responses) * (len(responses) - 1))

    return {
        'responses': responses,
        'avg_similarity': avg_similarity,
        'is_consistent': avg_similarity > 0.7,  # Threshold
        'confidence': 'high' if avg_similarity > 0.85 else 'medium' if avg_similarity > 0.7 else 'low'
    }

Part 5: RAG-Specific Evaluation

Retrieval Quality Metrics

def evaluate_retrieval(query: str, retrieved_docs: List[dict], relevant_doc_ids: List[str]) -> dict:
    """
    Evaluate retrieval quality using IR metrics.

    Precision: What % of retrieved docs are relevant?
    Recall: What % of relevant docs were retrieved?
    MRR: Mean Reciprocal Rank
    NDCG: Normalized Discounted Cumulative Gain
    """
    retrieved_ids = [doc['id'] for doc in retrieved_docs]

    # Precision
    true_positives = len(set(retrieved_ids) & set(relevant_doc_ids))
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0

    # Recall
    recall = true_positives / len(relevant_doc_ids) if relevant_doc_ids else 0.0

    # F1
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    # MRR (Mean Reciprocal Rank)
    mrr = 0.0
    for i, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_doc_ids:
            mrr = 1.0 / i
            break

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'mrr': mrr,
        'num_retrieved': len(retrieved_ids),
        'num_relevant_retrieved': true_positives
    }

End-to-End RAG Evaluation

def evaluate_rag_pipeline(
    question: str,
    generated_answer: str,
    retrieved_docs: List[dict],
    ground_truth: str,
    relevant_doc_ids: List[str]
) -> dict:
    """
    Comprehensive RAG evaluation.

    1. Retrieval quality (precision, recall)
    2. Answer quality (ROUGE, BERTScore)
    3. Answer grounding (hallucination check)
    4. Citation accuracy
    """
    # 1. Retrieval metrics
    retrieval_scores = evaluate_retrieval(question, retrieved_docs, relevant_doc_ids)

    # 2. Answer quality
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])

    rouge_scores = compute_rouge(generated_answer, ground_truth)
    bert_scores = compute_bertscore([generated_answer], [ground_truth])

    # 3. Grounding check
    grounding = check_grounding(generated_answer, context)

    # 4. LLM-as-judge overall quality
    judge_scores = llm_judge_score(generated_answer, ground_truth)

    return {
        'retrieval': retrieval_scores,
        'answer_quality': {
            'rouge': rouge_scores,
            'bertscore': bert_scores
        },
        'grounding': grounding,
        'llm_judge': judge_scores,
        'overall_pass': (
            retrieval_scores['f1'] > 0.5 and
            grounding['grounding_status'] == "GROUNDED" and
            judge_scores['accuracy'] >= 4
        )
    }

Part 6: Prompt Testing Frameworks

Framework 1: Regression Test Suite

class PromptTestSuite:
    """
    Unit tests for prompts (like pytest for LLMs).
    """

    def __init__(self):
        self.tests = []
        self.results = []

    def add_test(self, name: str, input: str, criteria: str):
        """Add a test case."""
        self.tests.append({
            'name': name,
            'input': input,
            'criteria': criteria
        })

    def run(self, generate_fn):
        """Run all tests with given generation function."""
        for test in self.tests:
            response = generate_fn(test['input'])
            passed, reason = binary_eval(
                question=test['input'],
                response=response,
                criteria=test['criteria']
            )

            self.results.append({
                'test_name': test['name'],
                'passed': passed,
                'reason': reason,
                'response': response
            })

        return self.results

    def summary(self) -> dict:
        """Get test summary."""
        total = len(self.results)
        passed = sum(1 for r in self.results if r['passed'])

        return {
            'total_tests': total,
            'passed': passed,
            'failed': total - passed,
            'pass_rate': passed / total if total > 0 else 0.0
        }

# Usage
suite = PromptTestSuite()
suite.add_test("capital_france", "What is the capital of France?", "Must mention Paris")
suite.add_test("capital_germany", "What is the capital of Germany?", "Must mention Berlin")

def my_generate(prompt):
    # Your LLM call
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

results = suite.run(my_generate)
print(suite.summary())
# {'total_tests': 2, 'passed': 2, 'failed': 0, 'pass_rate': 1.0}

Framework 2: A/B Testing Framework

class ABTest:
    """
    A/B test prompts or models.
    """

    def __init__(self, test_cases: List[dict]):
        self.test_cases = test_cases
        self.results = []

    def run(self, generate_a, generate_b):
        """Compare two generation functions."""
        for test in self.test_cases:
            response_a = generate_a(test['input'])
            response_b = generate_b(test['input'])

            winner, reason = pairwise_comparison(
                question=test['input'],
                response_a=response_a,
                response_b=response_b
            )

            self.results.append({
                'input': test['input'],
                'response_a': response_a,
                'response_b': response_b,
                'winner': winner,
                'reason': reason
            })

        return self.results

    def summary(self) -> dict:
        """Aggregate results."""
        total = len(self.results)
        a_wins = sum(1 for r in self.results if r['winner'] == 'A')
        b_wins = sum(1 for r in self.results if r['winner'] == 'B')
        ties = sum(1 for r in self.results if r['winner'] == 'TIE')

        return {
            'total_comparisons': total,
            'a_wins': a_wins,
            'b_wins': b_wins,
            'ties': ties,
            'a_win_rate': a_wins / total if total > 0 else 0.0,
            'statistical_significance': self._check_significance(a_wins, b_wins, total)
        }

    def _check_significance(self, a_wins, b_wins, total):
        """Simple binomial test for statistical significance."""
        from scipy.stats import binom_test
        # H0: Both equally good (p=0.5)
        p_value = binom_test(max(a_wins, b_wins), total, 0.5)
        return p_value < 0.05  # Significant at 95% confidence

Part 7: Production Monitoring

Continuous Evaluation Pipeline

import logging
from datetime import datetime

class ProductionMonitor:
    """
    Monitor LLM performance in production.
    """

    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate
        self.metrics = []
        self.logger = logging.getLogger(__name__)

    def log_interaction(self, user_input: str, model_output: str, metadata: dict):
        """Log interaction for evaluation."""
        import random

        # Sample traffic for evaluation
        if random.random() < self.sample_rate:
            # Run automated checks
            toxicity = self._check_toxicity(model_output)
            perplexity = compute_perplexity(model_output)

            metric = {
                'timestamp': datetime.now().isoformat(),
                'user_input': user_input,
                'model_output': model_output,
                'toxicity_score': toxicity,
                'perplexity': perplexity,
                'latency_ms': metadata.get('latency_ms'),
                'model_version': metadata.get('model_version')
            }

            self.metrics.append(metric)

            # Alert if anomaly detected
            if toxicity > 0.5:
                self.logger.warning(f"High toxicity detected: {toxicity}")

    def _check_toxicity(self, text: str) -> float:
        """Check for toxic content."""
        from detoxify import Detoxify
        model = Detoxify('original')
        results = model.predict(text)
        return max(results.values())  # Max toxicity score

    def get_metrics(self) -> dict:
        """Aggregate metrics."""
        if not self.metrics:
            return {}

        return {
            'total_interactions': len(self.metrics),
            'avg_toxicity': sum(m['toxicity_score'] for m in self.metrics) / len(self.metrics),
            'avg_perplexity': sum(m['perplexity'] for m in self.metrics) / len(self.metrics),
            'avg_latency_ms': sum(m['latency_ms'] for m in self.metrics if m.get('latency_ms')) / len(self.metrics),
            'high_toxicity_rate': sum(1 for m in self.metrics if m['toxicity_score'] > 0.5) / len(self.metrics)
        }

Part 8: Best Practices

Practice 1: Layered Evaluation Strategy

# Layer 1: Fast, cheap automated checks
def quick_checks(response: str) -> bool:
    """Run fast automated checks."""
    # Length check
    if len(response) < 10:
        return False

    # Toxicity check
    if check_toxicity(response) > 0.5:
        return False

    # Basic coherence (perplexity)
    if compute_perplexity(response) > 100:
        return False

    return True

# Layer 2: LLM-as-judge (selective)
def llm_evaluation(response: str, criteria: str) -> float:
    """Run LLM evaluation on subset."""
    scores = llm_judge_score(response, criteria)
    return sum(scores.values()) / len(scores)  # Average score

# Layer 3: Human review (expensive, critical cases)
def flag_for_human_review(response: str, confidence: float) -> bool:
    """Determine if human review needed."""
    return (
        confidence < 0.7 or
        len(response) > 1000 or  # Long responses
        "uncertain" in response.lower()  # Model uncertainty
    )

# Combined pipeline
def evaluate_response(question: str, response: str) -> dict:
    # Layer 1: Quick checks
    if not quick_checks(response):
        return {'status': 'failed_quick_checks', 'human_review': False}

    # Layer 2: LLM judge
    score = llm_evaluation(response, "accuracy and helpfulness")
    confidence = score / 5.0

    # Layer 3: Human review decision
    needs_human = flag_for_human_review(response, confidence)

    return {
        'status': 'passed' if score >= 3.5 else 'failed',
        'score': score,
        'confidence': confidence,
        'human_review': needs_human
    }

Practice 2: Version Your Prompts

from typing import Dict
import hashlib

class PromptVersion:
    """Track prompt versions for A/B testing and rollback."""

    def __init__(self):
        self.versions = {}
        self.active_version = None

    def register(self, name: str, prompt_template: str, metadata: dict = None):
        """Register a prompt version."""
        version_id = hashlib.md5(prompt_template.encode()).hexdigest()[:8]

        self.versions[version_id] = {
            'name': name,
            'template': prompt_template,
            'metadata': metadata or {},
            'created_at': datetime.now(),
            'metrics': {'total_uses': 0, 'avg_score': 0.0}
        }

        return version_id

    def use(self, version_id: str, **kwargs) -> str:
        """Use a specific prompt version."""
        if version_id not in self.versions:
            raise ValueError(f"Unknown version: {version_id}")

        version = self.versions[version_id]
        version['metrics']['total_uses'] += 1

        return version['template'].format(**kwargs)

    def update_metrics(self, version_id: str, score: float):
        """Update performance metrics for a version."""
        version = self.versions[version_id]
        current_avg = version['metrics']['avg_score']
        total_uses = version['metrics']['total_uses']

        # Running average
        new_avg = ((current_avg * (total_uses - 1)) + score) / total_uses
        version['metrics']['avg_score'] = new_avg

# Usage
pm = PromptVersion()

v1 = pm.register(
    name="question_answering_v1",
    prompt_template="Answer this question: {question}",
    metadata={'author': 'alice', 'date': '2024-01-01'}
)

v2 = pm.register(
    name="question_answering_v2",
    prompt_template="You are a helpful assistant. Answer: {question}",
    metadata={'author': 'bob', 'date': '2024-01-15'}
)

# A/B test
prompt = pm.use(v1, question="What is AI?")  # 50% traffic
score = llm_evaluation(response, criteria)
pm.update_metrics(v1, score)

Quick Decision Trees

"Which evaluation method should I use?"

Have ground truth labels?
  YES → ROUGE, BERTScore, Exact Match
  NO  → LLM-as-judge, Human review

Evaluating factual correctness?
  YES → Grounding check, Factuality verification
  NO  → Subjective quality → LLM-as-judge

Need fast feedback (CI/CD)?
  YES → Binary pass/fail tests
  NO  → Comprehensive multi-metric evaluation

Budget constraints?
  Tight → Automated metrics only
  Moderate → LLM-as-judge + sampling
  No limit → Human review gold standard

"How to detect hallucinations?"

Have source documents (RAG)?
  YES → Grounding check against context
  NO  → Continue

Can verify with search?
  YES → Factuality check with web search
  NO  → Continue

Check model confidence?
  YES → Self-consistency check (multiple samples)
  NO  → Flag for human review

Resources

ROUGE: https://github.com/google-research/google-research/tree/master/rouge
BERTScore: https://github.com/Tiiiger/bert_score
OpenAI Evals: https://github.com/openai/evals
LangChain Evaluation: https://python.langchain.com/docs/guides/evaluation/
Ragas (RAG eval): https://github.com/explodinggradients/ragas

Skill version: 1.0.0 Last updated: 2025-10-25 Maintained by: Applied Artificial Intelligence