---
name: evaluate
description: "Evaluates RAG retrieval and LLM-as-judge metrics (faithfulness, relevancy, context precision). Triggers: measure RAG quality, knowledge gap, RAG eval, golden dataset."
effort: medium
disable-model-invocation: true
argument-hint: "[--threshold N]"
allowed-tools: Bash, Read
---
# RAG Evaluation
Evaluate RAG quality using LLM-as-a-Judge methodology.
## Usage

```
/evaluate [--threshold 0.7]
```
## Execution

### Direct Execution (recommended for most projects)

```bash
# Full evaluation with default thresholds
python3 scripts/evaluate_rag.py

# Per-metric thresholds
python3 scripts/evaluate_rag.py \
  --faithfulness 0.7 \
  --relevancy 0.7 \
  --context 0.6

# Detect knowledge gaps, then generate the report
python3 scripts/knowledge_gaps.py --detect
python3 scripts/knowledge_gaps.py --report
```
### Docker Execution (containerized projects)

```bash
# Same commands, run inside the API container
docker exec {api-container} python3 scripts/evaluate_rag.py

docker exec {api-container} python3 scripts/evaluate_rag.py \
  --faithfulness 0.7 \
  --relevancy 0.7 \
  --context 0.6

docker exec {api-container} python3 scripts/knowledge_gaps.py --detect
docker exec {api-container} python3 scripts/knowledge_gaps.py --report
```
## Metrics

| Metric | Description | Target |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | >70% |
| Relevancy | Does the answer address the question? | >70% |
| Context Precision | Is the retrieved context relevant to the query? | >60% |
## Evaluation Process

1. Generate test queries from the golden dataset
2. Execute the RAG pipeline for each query
3. Have the LLM judge score each response on the metrics
4. Report aggregate scores (a sketch of this loop follows)
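At its core this is a scoring loop. The sketch below is a minimal illustration, not `evaluate_rag.py`'s actual implementation: `rag_pipeline.answer` and `llm_client.judge_score` are hypothetical stand-ins for your project's pipeline and judge client.

```python
import json
import statistics

# Hypothetical imports: swap in your project's RAG pipeline and judge client.
from rag_pipeline import answer        # returns (answer_text, retrieved_contexts)
from llm_client import judge_score     # asks the judge LLM for a 0-1 score

JUDGE_PROMPTS = {
    "faithfulness": "Score 0-1: is the ANSWER fully grounded in the CONTEXT?",
    "relevancy": "Score 0-1: does the ANSWER address the QUERY?",
    "context_precision": "Score 0-1: is the CONTEXT relevant to the QUERY?",
}

def evaluate(dataset_path="scripts/golden_dataset.json", thresholds=None):
    thresholds = thresholds or {
        "faithfulness": 0.7, "relevancy": 0.7, "context_precision": 0.6,
    }
    with open(dataset_path) as f:
        queries = json.load(f)["queries"]

    scores = {metric: [] for metric in JUDGE_PROMPTS}
    failed = []
    for item in queries:
        answer_text, contexts = answer(item["query"])  # run the RAG pipeline
        for metric, prompt in JUDGE_PROMPTS.items():
            score = judge_score(prompt, query=item["query"],
                                answer=answer_text, context=contexts)
            scores[metric].append(score)
            if score < thresholds[metric]:
                failed.append((item["query"], metric, score))

    for metric, values in scores.items():
        print(f"Average {metric}: {statistics.mean(values):.2f}")
    # Never report aggregates without the failed queries alongside them.
    for query, metric, score in failed:
        print(f"FAILED {metric}={score:.2f}: {query}")
```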
## Golden Dataset

Located at `scripts/golden_dataset.json` (or a project-specific path):

```json
{
  "queries": [
    {
      "query": "How to configure rate limiting?",
      "expected_topics": ["nginx", "rate-limiting"],
      "expected_sources": ["kb/nginx/howto/rate-limiting.md"]
    }
  ]
}
```
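Since `expected_sources` entries go stale when the KB is reorganized (see Gotchas below), it is worth validating them before trusting a score drop. A minimal sketch; the `kb_root` parameter and dataset layout are assumptions based on the example above:

```python
import json
from pathlib import Path

def validate_dataset(dataset_path="scripts/golden_dataset.json", kb_root="."):
    """Flag golden-dataset entries whose expected_sources no longer exist."""
    with open(dataset_path) as f:
        queries = json.load(f)["queries"]
    stale = [
        (item["query"], src)
        for item in queries
        for src in item.get("expected_sources", [])
        if not (Path(kb_root) / src).exists()
    ]
    for query, src in stale:
        print(f"STALE: {src} (query: {query!r})")
    return not stale  # True when every expected source still exists
```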
## Output Example

```text
RAG Evaluation Results
======================
Total Queries: 50
Average Faithfulness: 0.82
Average Relevancy: 0.78
Average Context Precision: 0.71
Quality: GOOD

Failed Queries (faithfulness < 0.7):
  - Query: "How to backup PostgreSQL?"
    Score: 0.45
    Issue: No relevant documents found
```
## Knowledge Gaps

After evaluation, check for gaps:

```bash
python3 scripts/knowledge_gaps.py --detect
# or, in containerized projects:
docker exec {api-container} python3 scripts/knowledge_gaps.py --detect
```

Output:

```text
Knowledge Gaps Detected:
1. PostgreSQL backup procedures (5 failed queries)
2. Redis caching configuration (3 failed queries)
3. Ollama model selection (2 failed queries)
```
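A plausible shape for the `--detect` step, assuming failed queries keep their golden-dataset `expected_topics`; the real grouping logic in `knowledge_gaps.py` may differ:

```python
from collections import Counter

def detect_gaps(failed_queries):
    """Group failed queries by topic and rank topics by failure count.

    failed_queries: dicts with "query" and "expected_topics" keys, i.e. the
    golden-dataset entries that fell below threshold (assumed shape).
    """
    counts = Counter(
        topic
        for item in failed_queries
        for topic in item.get("expected_topics", ["uncategorized"])
    )
    for rank, (topic, n) in enumerate(counts.most_common(), start=1):
        print(f"{rank}. {topic} ({n} failed queries)")
```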
## Quality Gates

### Rules

- MUST use a golden dataset; never evaluate on synthetic queries only
- NEVER report a score without listing the failed queries alongside it
- CRITICAL: if the golden dataset is missing, stop and ask the user to provide one
- MANDATORY: read thresholds from project config when available instead of hardcoding defaults (see the sketch after this list)
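A minimal sketch of config-first threshold loading; the config filename and key names are hypothetical:

```python
import json
from pathlib import Path

# Fallback defaults, used only when the project provides no config.
DEFAULTS = {"faithfulness": 0.7, "relevancy": 0.7, "context": 0.6}

def load_thresholds(config_path="evaluate.config.json"):
    """Prefer project-config thresholds; fall back to defaults if absent."""
    path = Path(config_path)
    if not path.exists():
        return DEFAULTS
    config = json.loads(path.read_text())
    # Project values override defaults key by key.
    return {**DEFAULTS, **config.get("thresholds", {})}
```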
### Gotchas

- LLM-as-a-judge scores are non-deterministic; a single run fluctuates by ±10 percentage points even at temperature=0. Always report the mean and stddev over ≥3 runs, not a one-shot number (see the sketch after this list).
- The default threshold trio (0.7 / 0.7 / 0.6) was calibrated on English KBs. Multilingual corpora (e.g. Polish and English in the same index) score systematically 5-15 points lower; recalibrate per language, or split the golden dataset by language.
- Golden datasets drift: when the KB is reindexed or documents are renamed, `expected_sources` may point at moved or deleted paths. A sudden drop in context precision across unrelated queries usually means dataset rot, not RAG regression; validate the dataset paths first (see the validation sketch in the Golden Dataset section).
- Judges often reward verbose answers as "more faithful" because there is more text to ground. Tune the judge prompt to penalize padding, or cap answer length in the generator before evaluation.
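A minimal sketch of the repeated-runs reporting from the first gotcha; `run_once` is a hypothetical wrapper around a single evaluation pass:

```python
import statistics

def repeated_eval(run_once, n_runs=3):
    """Run the evaluation n times and report mean and stddev per metric.

    run_once: a zero-argument callable returning {metric: average_score},
    e.g. a wrapper around one full pass of evaluate_rag.py (hypothetical).
    """
    runs = [run_once() for _ in range(n_runs)]
    for metric in runs[0]:
        values = [r[metric] for r in runs]
        print(f"{metric}: mean={statistics.mean(values):.2f} "
              f"stddev={statistics.stdev(values):.2f}")
```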
## When NOT to Use

- For auditing skill quality (the 5-criteria check): that lives in `scripts/evaluate_skills.py`
- For general-purpose LLM output scoring without a KB: use `/review` or a tailored prompt
- For unit tests or code correctness: use `/test`
- For continuous evaluation without a golden dataset: build the dataset first