evaluating-llms
// Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
| Field | Value |
|---|---|
| name | evaluating-llms |
| description | Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment. |
Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.
Apply this skill when:
Common triggers:
By Task Type:
| Task Type | Primary Approach | Metrics | Tools |
|---|---|---|---|
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |
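Pass@K in the table above is usually computed with the standard unbiased estimator: generate n samples per problem, count c correct, and estimate the probability that at least one of k drawn samples passes, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of them correct) passes.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 passing: estimate pass@1 and pass@5
print(round(pass_at_k(10, 3, 1), 2))  # 0.3
print(round(pass_at_k(10, 3, 5), 2))  # 0.92
```

Note that pass@1 at temperature 0 degenerates to the plain pass rate; the estimator matters when sampling at higher temperatures.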
By Volume and Cost:
| Samples | Speed | Cost | Recommended Approach |
|---|---|---|---|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |
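The zero-cost automated tier in the table above can be as simple as regex and JSON-schema checks run on every raw output. A minimal sketch (the required keys are illustrative):

```python
import json
import re

def validate_output(raw: str, required_keys: set[str]) -> list[str]:
    """Cheap structural checks that scale to thousands of samples:
    valid JSON, required keys present. Returns a list of error strings;
    an empty list means the sample passed."""
    # Strip a common failure mode: model wraps its JSON in ``` fences
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    missing = required_keys - data.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []

print(validate_output('```json\n{"label": "positive"}\n```', {"label"}))  # []
print(validate_output('{"score": 5}', {"label"}))
```

Checks like this catch format regressions for free before any paid LLM-as-judge pass runs.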
Layered Approach (Recommended for Production):
Test single prompt-response pairs for correctness.
Methods:
Example Use Cases:
Quick Start (Python):
```python
import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
```
For complete unit evaluation examples, see examples/python/unit_evaluation.py and examples/typescript/unit-evaluation.ts.
Evaluate RAG systems using RAGAS framework metrics.
Critical Metrics (Priority Order):
Faithfulness (Target: > 0.8) - MOST CRITICAL
Answer Relevance (Target: > 0.7)
Context Relevance (Target: > 0.7)
Context Precision (Target: > 0.5)
Context Recall (Target: > 0.8)
Quick Start (Python with RAGAS):
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"],
}
dataset = Dataset.from_dict(data)

results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")
```
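Once scores are computed, the targets listed above can gate a CI run. A minimal sketch, assuming the results have been collected into a plain dict of metric name to score (the exact shape of the RAGAS results object varies by version):

```python
# Targets taken from this section: faithfulness is the most critical.
TARGETS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.7,
    "context_relevancy": 0.7,
}

def check_targets(scores: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, target in TARGETS.items():
        value = scores.get(metric)
        if value is None or value < target:
            failures.append(f"{metric}: {value} < target {target}")
    return failures

failures = check_targets(
    {"faithfulness": 0.85, "answer_relevancy": 0.65, "context_relevancy": 0.9}
)
print(failures)  # ["answer_relevancy: 0.65 < target 0.7"]
```

In CI, a non-empty failure list would fail the build before a regressed RAG pipeline ships.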
For comprehensive RAG evaluation patterns, see references/rag-evaluation.md and examples/python/ragas_example.py.
Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.
When to Use:
Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics
Best Practices:
Quick Start (Python):
```python
from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)."""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3,
    )
    content = result.choices[0].message.content
    # Parse defensively: the judge may add blank lines or preamble,
    # so scan for the labeled lines instead of assuming fixed positions.
    score, reasoning = 0, ""
    for line in content.strip().splitlines():
        if line.lower().startswith("score"):
            score = int(line.split(":", 1)[1].strip().split()[0])
        elif line.lower().startswith("reasoning"):
            reasoning = line.split(":", 1)[1].strip()
    return score, reasoning
```
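A well-documented judge failure mode is position bias: preferring whichever response appears first. A common mitigation for pairwise comparison is to judge each pair twice with the order swapped and count only consistent verdicts. A minimal sketch, with the judge call abstracted as a callable so the pattern can be tested without an API key:

```python
from typing import Callable

def debiased_pairwise(judge: Callable[[str, str, str], str],
                      prompt: str, resp_a: str, resp_b: str) -> str:
    """Judge A vs. B twice with positions swapped.
    judge(prompt, first, second) must return "first" or "second".
    Returns "A", "B", or "tie" (inconsistent verdicts count as a tie)."""
    v1 = judge(prompt, resp_a, resp_b)  # A shown first
    v2 = judge(prompt, resp_b, resp_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"  # A wins regardless of position
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"    # the verdict flipped with order: position bias

# Toy judge that always prefers the longer response (stand-in for an LLM call)
toy_judge = lambda p, first, second: "first" if len(first) > len(second) else "second"
print(debiased_pairwise(toy_judge, "q", "a much longer answer", "short"))  # A
```

In production the callable would wrap a GPT-4 or Claude call; a naive always-"first" judge collapses to all ties here, which is exactly the signal that the rubric needs work.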
For detailed LLM-as-judge patterns and prompt templates, see references/llm-as-judge.md and examples/python/llm_as_judge.py.
Measure hallucinations, bias, and toxicity in LLM outputs.
Methods:
Faithfulness to Context (RAG):
Factual Accuracy (Closed-Book):
Self-Consistency:
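Self-consistency checks sample the same question several times at non-zero temperature and treat low agreement as a hallucination signal. A minimal sketch of the agreement score (the sampling calls themselves are omitted):

```python
from collections import Counter

def agreement_rate(answers: list[str]) -> float:
    """Fraction of sampled answers matching the majority answer.
    Low agreement across samples suggests the model is guessing."""
    if not answers:
        return 0.0
    normalized = [a.strip().lower() for a in answers]
    majority_count = Counter(normalized).most_common(1)[0][1]
    return majority_count / len(normalized)

samples = ["Paris", "paris", "Paris", "Lyon", "Paris"]
print(agreement_rate(samples))  # 0.8
```

A threshold (e.g. flagging anything below 0.7 for review) is an assumption to tune per task; exact-match normalization also breaks down for free-form answers, where semantic similarity is a better comparator.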
Types of Bias:
Evaluation Methods:
Stereotype Tests:
Counterfactual Evaluation:
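Counterfactual evaluation swaps a protected attribute in an otherwise identical prompt and checks that the evaluator's score does not change. A minimal sketch with a hypothetical scalar scoring function:

```python
def counterfactual_gap(score_fn, template: str, groups: list[str]) -> float:
    """Fill the same prompt template with each group term and report the
    max score difference. score_fn is any scalar evaluator (e.g. sentiment
    score or approval probability); a large gap flags potential bias."""
    scores = [score_fn(template.format(group=g)) for g in groups]
    return max(scores) - min(scores)

# Toy evaluator: score is just prompt length (stand-in for a real model score)
gap = counterfactual_gap(len, "The {group} engineer asked for a raise.",
                         ["male", "female"])
print(gap)  # 2 (length difference only; a real evaluator scores the model output)
```

In a real run, `score_fn` would generate a completion for the filled template and score it; what counts as an acceptable gap is a policy decision, not a library default.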
Tools:
For comprehensive safety evaluation patterns, see references/safety-evaluation.md.
Assess model capabilities using standardized benchmarks.
Standard Benchmarks:
| Benchmark | Coverage | Format | Difficulty | Use Case |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school - professional | General intelligence |
| HellaSwag | Common-sense scenarios | Multiple choice (sentence completion) | Easy for humans, nontrivial for models | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
Domain-Specific Benchmarks:
When to Use Benchmarks:
Quick Start (lm-evaluation-harness):
```bash
pip install lm-eval

# Evaluate GPT-4 on MMLU
lm_eval --model openai-chat --model_args model=gpt-4 --tasks mmlu --num_fewshot 5
```
For detailed benchmark testing patterns, see references/benchmarks.md and scripts/benchmark_runner.py.
Monitor and optimize LLM quality in production environments.
Compare two LLM configurations:
Metrics:
Real-time quality monitoring:
Sample-based human evaluation:
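For A/B comparisons between two configurations, a raw win rate is noisy on small samples; a confidence interval shows whether the difference is meaningful before switching models. A minimal sketch using the normal approximation (z = 1.96 for 95%):

```python
from math import sqrt

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Win rate of config A over config B with a 95% normal-approximation CI.
    If the interval contains 0.5, the comparison is inconclusive."""
    p = wins / total
    margin = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - margin), min(1.0, p + margin)

rate, lo, hi = win_rate_ci(wins=58, total=100)
print(f"win rate {rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Interval includes 0.5 -> collect more samples before declaring a winner
```

The normal approximation is a simplification; for very small or lopsided samples a Wilson interval or exact binomial test is safer.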
For production evaluation patterns and monitoring strategies, see references/production-evaluation.md.
For tasks with discrete outputs (sentiment, intent, category).
Metrics:
Quick Start (Python):
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```
For complete classification evaluation examples, see examples/python/classification_metrics.py.
For open-ended text generation (summaries, creative writing, responses).
Automated Metrics (Use with Caution):
Limitation: Automated metrics correlate weakly with human judgment for creative/subjective generation.
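To make the limitation concrete, here is a toy pure-Python ROUGE-1 F1 (a simplified re-implementation for illustration, not the maintained `rouge-score` package): two texts can share nearly all unigrams yet mean opposite things, which is why lexical-overlap metrics need judge or human backup.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (toy ROUGE-1; use a maintained library in practice)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# High lexical overlap, opposite meaning: the metric can't tell
print(rouge1_f1("the deal was approved", "the deal was not approved"))  # ~0.89
```

A score near 0.9 for a summary that negates its reference is exactly the failure mode the caution above describes.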
Recommended Approach:
For detailed generation evaluation patterns, see references/evaluation-types.md.
| If Task Is... | Use This Framework | Primary Metric |
|---|---|---|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, Test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, Fact-check |
| Production monitoring | Online evaluation | User feedback, Business KPIs |
| Library | Use Case | Installation |
|---|---|---|
| RAGAS | RAG evaluation | pip install ragas |
| DeepEval | General LLM evaluation, pytest integration | pip install deepeval |
| LangSmith | Production monitoring, A/B testing | pip install langsmith |
| lm-eval | Benchmark testing (MMLU, HumanEval) | pip install lm-eval |
| scikit-learn | Classification metrics | pip install scikit-learn |
| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|---|---|---|---|---|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual Accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/Fluency, 2. Content Policy |
| Code Generation | Medium | Low | Low | 1. Functional Correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False Positives/Negatives |
For comprehensive documentation on specific topics:
- references/evaluation-types.md
- references/rag-evaluation.md
- references/safety-evaluation.md
- references/benchmarks.md
- references/llm-as-judge.md
- references/production-evaluation.md
- references/metrics-reference.md

Python Examples:
- examples/python/unit_evaluation.py - Basic prompt testing with pytest
- examples/python/ragas_example.py - RAGAS RAG evaluation
- examples/python/deepeval_example.py - DeepEval framework usage
- examples/python/llm_as_judge.py - GPT-4 as evaluator
- examples/python/classification_metrics.py - Accuracy, precision, recall
- examples/python/benchmark_testing.py - HumanEval example

TypeScript Examples:
- examples/typescript/unit-evaluation.ts - Vitest + OpenAI
- examples/typescript/llm-as-judge.ts - GPT-4 evaluation
- examples/typescript/langsmith-integration.ts - Production monitoring

Run evaluations without loading code into context (token-free):
- scripts/run_ragas_eval.py - Run RAGAS evaluation on dataset
- scripts/compare_models.py - A/B test two models
- scripts/benchmark_runner.py - Run MMLU/HumanEval benchmarks
- scripts/hallucination_checker.py - Detect hallucinations in outputs

Example usage:
```bash
# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json

# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval
```
Related Skills:
- building-ai-chat: Evaluate AI chat applications (this skill tests what that skill builds)
- prompt-engineering: Test prompt quality and effectiveness
- testing-strategies: Apply testing pyramid to LLM evaluation (unit → integration → E2E)
- observability: Production monitoring and alerting for LLM quality
- building-ci-pipelines: Integrate LLM evaluation into CI/CD

Workflow Integration (skills invoked in order): prompt-engineering → llm-evaluation → building-ai-chat → llm-evaluation → deploying-applications → llm-evaluation + observability

Common Pitfalls:
1. Over-reliance on Automated Metrics for Generation
2. Ignoring Faithfulness in RAG Systems
3. No Production Monitoring
4. Biased LLM-as-Judge Evaluation
5. Insufficient Benchmark Coverage
6. Missing Safety Evaluation