dspy-ragas

Stars: 6

Forks: 1

Updated: June 13, 2026, 13:41

Use Ragas to evaluate DSPy RAG pipelines with decomposed metrics. Use when you want to evaluate RAG quality, measure faithfulness, context precision, context recall, answer relevancy, or diagnose retriever vs generator issues. Also used for ragas, pip install ragas, ragas evaluate, RAG evaluation, faithfulness metric, context precision, context recall, answer relevancy, answer correctness, decomposed RAG metrics, ragas dspy, DSPyOptimizer ragas, ragas[dspy], EvaluationDataset, ragas vs dspy.Evaluate, which RAG metric, retriever vs generator quality.

Source: lebsral/DSPy-Programming-not-prompting-LMs-skills (author: lebsral)

Ragas — Decomposed RAG Evaluation for DSPy

Guide the user through evaluating DSPy RAG pipelines with Ragas, an evaluation framework that decomposes RAG quality into independent metrics for retriever and generator.

Step 1: Understand the evaluation need

Before setting up Ragas, clarify:

Do you have a RAG pipeline already? Ragas evaluates retriever + generator quality — you need a working pipeline first.
Do you have ground-truth answers? Some metrics (Faithfulness, AnswerRelevancy) are reference-free; others (ContextPrecision, ContextRecall) need reference answers.
What are you diagnosing? If you just need an accuracy score, use dspy.Evaluate. Ragas shines when you need to know whether the retriever or generator is the weak link.
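
The checklist above can be turned into a small selection helper. The sketch below is illustrative only: `choose_metrics` is a hypothetical function, not part of Ragas or DSPy, and it returns metric names as strings so that it runs without either library installed.

```python
# Hypothetical helper: map the Step 1 answers to Ragas metric names.
REFERENCE_FREE = ["Faithfulness", "AnswerRelevancy"]
NEEDS_REFERENCE = ["ContextPrecision", "ContextRecall", "AnswerCorrectness"]


def choose_metrics(has_reference: bool) -> list[str]:
    """Return the metric names that can be computed with the data available."""
    metrics = list(REFERENCE_FREE)
    if has_reference:
        metrics += NEEDS_REFERENCE
    return metrics


print(choose_metrics(has_reference=False))
print(choose_metrics(has_reference=True))
```

In a real script the strings would be replaced by the corresponding metric classes from ragas.metrics.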

What is Ragas

Ragas is an open-source evaluation framework (12.9k+ GitHub stars, Apache 2.0) purpose-built for RAG pipelines. Instead of a single accuracy score, it breaks evaluation into decomposed metrics:

Metric	What it measures	Needs ground truth?	Evaluates
Faithfulness	Is the answer grounded in retrieved context?	No	Generator
AnswerRelevancy	Does the answer address the question?	No	Generator
ContextPrecision	Are relevant docs ranked higher?	Yes (reference)	Retriever
ContextRecall	Did retrieval find all relevant info?	Yes (reference)	Retriever
AnswerCorrectness	Does the answer match the reference?	Yes (reference)	End-to-end

This decomposition tells you where your RAG pipeline fails — retriever or generator — so you know what to fix.
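
To make "are relevant docs ranked higher?" concrete, here is a toy rank-aware precision score: the mean of precision@k taken at each relevant position. It is only a sketch of the idea. Ragas uses an LLM to judge relevance and its exact formula may differ by version, and the name `rank_weighted_precision` is made up for this example.

```python
def rank_weighted_precision(relevance: list[int]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk.

    relevance[i] is 1 if the chunk at rank i+1 is relevant, else 0.
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at a relevant position
    return total / hits if hits else 0.0


# Same two relevant chunks, different ranking.
good_ranking = rank_weighted_precision([1, 1, 0])
poor_ranking = rank_weighted_precision([0, 1, 1])
print(good_ranking > poor_ranking)
```

Both rankings retrieve the same two relevant chunks, yet the second scores lower because the relevant chunks sit below an irrelevant one.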

When to use Ragas vs dspy.Evaluate

Use case	Tool
Diagnose retriever vs generator issues	Ragas — decomposed metrics isolate the problem
Measure overall pipeline accuracy	`dspy.Evaluate` with SemanticF1 or exact match
Optimization objective (BootstrapFewShot, MIPROv2)	`dspy.Evaluate` — Ragas metrics are too slow for inner-loop optimization
Evaluate before and after optimization	Both — use `dspy.Evaluate` for the score that was optimized, Ragas for deeper analysis
Reference-free evaluation	Ragas Faithfulness + AnswerRelevancy — no ground truth needed

Best practice: Use dspy.Evaluate with a fast metric (SemanticF1) as your optimization objective, then use Ragas for post-optimization analysis to understand why your pipeline performs the way it does.

Setup

# Core Ragas (evaluation only)
pip install ragas

# With DSPy optimizer support (uses MIPROv2 internally)
pip install "ragas[dspy]"

Ragas requires an LLM for its metrics. By default it uses OpenAI (OPENAI_API_KEY), but you can configure another provider with llm_factory(), as described under "Using a custom LLM with Ragas".

Evaluating a DSPy RAG pipeline with Ragas

Step 1: Collect predictions from your DSPy pipeline

Run your DSPy RAG pipeline on a set of questions and collect the inputs, retrieved contexts, and generated answers:

import dspy

# Your DSPy RAG pipeline
class RAG(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retrieve = retriever
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # The retriever is expected to return an object with a .passages list
        context = self.retrieve(question).passages
        return dspy.Prediction(
            answer=self.generate(context=context, question=question).answer,
            context=context,
        )

# Placeholders: `retriever` is your own retrieval module,
# `devset` is your list of dspy.Example objects
rag = RAG(retriever)

# Collect predictions
results = []
for example in devset:
    pred = rag(question=example.question)
    results.append({
        "user_input": example.question,
        "response": pred.answer,
        "retrieved_contexts": list(pred.context),  # Ragas expects a list of strings
        "reference": example.answer,  # ground truth, if available
    })
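
Before handing these records to Ragas, a quick shape check can catch common mistakes such as passing a single string instead of a list of passages. The helper below is a hypothetical sketch (`check_record` is not a Ragas or DSPy function); the key names mirror the dictionary built above.

```python
REQUIRED_KEYS = ("user_input", "response", "retrieved_contexts")


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one collected record (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in record]
    contexts = record.get("retrieved_contexts")
    if contexts is not None:
        is_sequence = isinstance(contexts, (list, tuple))
        if not is_sequence or not all(isinstance(c, str) for c in contexts):
            problems.append("retrieved_contexts must be a list of strings")
    return problems


ok = {"user_input": "q", "response": "a", "retrieved_contexts": ["passage 1"]}
bad = {"user_input": "q", "response": "a", "retrieved_contexts": "passage 1"}
print(check_record(ok))
print(check_record(bad))
```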

Step 2: Build a Ragas EvaluationDataset

from ragas import EvaluationDataset, SingleTurnSample

samples = [
    SingleTurnSample(
        user_input=r["user_input"],
        response=r["response"],
        retrieved_contexts=r["retrieved_contexts"],
        reference=r.get("reference"),  # optional for some metrics
    )
    for r in results
]
dataset = EvaluationDataset(samples=samples)

Step 3: Run evaluation

from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    AnswerCorrectness,
)

# Pick metrics based on what you have
# Without ground truth: Faithfulness + AnswerRelevancy
# With ground truth: add ContextPrecision, ContextRecall, AnswerCorrectness
result = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(),
        AnswerRelevancy(),
        ContextPrecision(),
        ContextRecall(),
        AnswerCorrectness(),
    ],
)

print(result)
# {'faithfulness': 0.87, 'answer_relevancy': 0.92, 'context_precision': 0.75,
#  'context_recall': 0.68, 'answer_correctness': 0.81}
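
Because every metric calls an LLM for every sample, it can be worth estimating the call volume before running a large evaluation. The sketch below is plain arithmetic; `estimate_llm_calls` is a hypothetical helper, and `calls_per_metric` is an assumed average that you would need to measure for your own metrics and Ragas version.

```python
def estimate_llm_calls(n_samples: int, n_metrics: int, calls_per_metric: int = 1) -> int:
    """Rough number of LLM calls: samples x metrics x assumed calls per metric."""
    return n_samples * n_metrics * calls_per_metric


# 100 samples and 5 metrics, at one call per metric (a lower bound)
print(estimate_llm_calls(100, 5))
# Same run if each metric needed 3 calls on average (assumed figure)
print(estimate_llm_calls(100, 5, calls_per_metric=3))
```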

Step 4: Interpret results

Faithfulness low (< 0.8)?
  → Generator is hallucinating beyond retrieved context
  → Fix: add assertions, use GroundedRAG pattern (/ai-stopping-hallucinations)

ContextPrecision low (< 0.7)?
  → Retriever returns relevant docs but ranks them poorly
  → Fix: tune k, try hybrid search, re-rank (/dspy-qdrant)

ContextRecall low (< 0.7)?
  → Retriever misses relevant documents entirely
  → Fix: improve chunking, add more docs, try different embeddings (/ai-searching-docs)

AnswerRelevancy low (< 0.8)?
  → Generator answers don't address the question
  → Fix: improve signatures, optimize with MIPROv2 (/dspy-miprov2)

AnswerCorrectness low but Faithfulness high?
  → Generator is faithful to context but context is wrong
  → Focus on retriever improvements
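
The decision guide above can be applied mechanically to a scores dictionary. This is a minimal sketch, assuming the aggregate scores are available as a plain dict keyed by the metric names printed in Step 3; `diagnose` and the hint strings are illustrative, not part of Ragas.

```python
# Thresholds copied from the interpretation guide above.
THRESHOLDS = {
    "faithfulness": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.7,
    "answer_relevancy": 0.8,
}

HINTS = {
    "faithfulness": "generator is hallucinating beyond retrieved context",
    "context_precision": "retriever ranks relevant docs poorly",
    "context_recall": "retriever misses relevant documents",
    "answer_relevancy": "generator answers do not address the question",
}


def diagnose(scores: dict[str, float]) -> list[str]:
    """Return one hint per metric that falls below its threshold."""
    return [
        f"{name}: {HINTS[name]}"
        for name, limit in THRESHOLDS.items()
        if name in scores and scores[name] < limit
    ]


# Scores taken from the example output in Step 3.
example = {"faithfulness": 0.87, "answer_relevancy": 0.92,
           "context_precision": 0.75, "context_recall": 0.68}
for line in diagnose(example):
    print(line)
```

With the Step 3 example scores, only context_recall falls below its threshold, which points at the retriever.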

Using a custom LLM with Ragas

By default Ragas uses OpenAI (OPENAI_API_KEY). Ragas v0.4+ supports multiple LLM backends via Instructor or LiteLLM adapters:

from anthropic import Anthropic
from ragas import evaluate
from ragas.llms import llm_factory
from ragas.metrics import AnswerRelevancy, Faithfulness

client = Anthropic()  # reads ANTHROPIC_API_KEY
evaluator_llm = llm_factory("claude-sonnet-4-5-20250929", provider="anthropic", client=client)
# or use provider="openai" with an OpenAI() client, etc.

result = evaluate(
    dataset=dataset,
    metrics=[Faithfulness(), AnswerRelevancy()],
    llm=evaluator_llm,
)

Note: LangchainLLMWrapper was deprecated in Ragas v0.3.8. If you see old examples using it, switch to llm_factory() instead.

Per-sample scores

Get scores for each sample to find problem areas:

result = evaluate(dataset=dataset, metrics=[Faithfulness(), ContextRecall()])

# Convert to pandas DataFrame
df = result.to_pandas()
print(df[["user_input", "faithfulness", "context_recall"]])

# Find worst-performing samples
worst = df.nsmallest(5, "faithfulness")
for _, row in worst.iterrows():
    print(f"Q: {row['user_input']}")
    print(f"  Faithfulness: {row['faithfulness']:.2f}")
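
The same per-sample view can separate retriever problems from generator problems. A minimal sketch, using plain dictionaries in place of DataFrame rows and made-up scores; `split_by_weak_link` is a hypothetical helper, and the 0.8 and 0.7 cut-offs reuse the thresholds from Step 4.

```python
def split_by_weak_link(rows, faith_min=0.8, recall_min=0.7):
    """Group per-sample rows by which component looks weak."""
    groups = {"generator": [], "retriever": []}
    for row in rows:
        if row["faithfulness"] < faith_min:
            groups["generator"].append(row["user_input"])
        if row["context_recall"] < recall_min:
            groups["retriever"].append(row["user_input"])
    return groups


# Made-up rows for illustration; real rows would come from the per-sample scores.
rows = [
    {"user_input": "q1", "faithfulness": 0.95, "context_recall": 0.40},
    {"user_input": "q2", "faithfulness": 0.50, "context_recall": 0.90},
]
print(split_by_weak_link(rows))
```

With pandas, `df.to_dict("records")` produces rows in this shape.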

DSPyOptimizer (advanced)

Ragas includes a DSPyOptimizer that uses MIPROv2 internally to optimize Ragas's own metric prompts. This can improve evaluation accuracy for domain-specific data.

pip install "ragas[dspy]"

from ragas.metrics import Faithfulness
from ragas.integrations.dspy import DSPyOptimizer

# Optimize the Faithfulness metric's internal prompts
metric = Faithfulness()
optimizer = DSPyOptimizer(metric=metric)

# Requires a labeled dataset where you know the correct faithfulness scores
optimized_metric = optimizer.optimize(dataset=labeled_eval_dataset)

# Use the optimized metric for more accurate evaluation
result = evaluate(dataset=dataset, metrics=[optimized_metric])

This is advanced — only needed if Ragas's default metrics don't align well with your domain's definition of faithfulness, relevancy, etc.

Ragas in a DSPy development workflow

1. Build RAG pipeline          → /ai-searching-docs or /dspy-retrieval
2. Create devset               → /dspy-data
3. Evaluate with dspy.Evaluate → /dspy-evaluate (SemanticF1 as optimization target)
4. Optimize with MIPROv2       → /dspy-miprov2
5. Deep analysis with Ragas    → this skill (diagnose retriever vs generator)
6. Fix weak components         → /ai-stopping-hallucinations, /dspy-qdrant, /ai-improving-accuracy
7. Re-evaluate with both       → confirm improvements
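
For step 7, comparing the two runs metric by metric makes regressions visible. The sketch below assumes you have saved the aggregate scores from each run as plain dicts; the function names and the numbers are invented for illustration.

```python
def score_deltas(before: dict, after: dict) -> dict:
    """Per-metric change (after minus before) for metrics present in both runs."""
    return {m: round(after[m] - before[m], 4) for m in before if m in after}


def regressions(before: dict, after: dict, tolerance: float = 0.0) -> list:
    """Metrics whose score dropped by more than the tolerance."""
    return [m for m, d in score_deltas(before, after).items() if d < -tolerance]


# Made-up scores for illustration only.
before = {"faithfulness": 0.80, "context_recall": 0.60}
after = {"faithfulness": 0.78, "context_recall": 0.72}
print(score_deltas(before, after))
print(regressions(before, after))
```

A small tolerance (for example 0.03) helps avoid flagging run-to-run noise from the LLM judge as a regression.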

Gotchas

Ragas metrics call an LLM: each metric makes multiple LLM calls per sample. A 100-sample evaluation with 5 metrics is 500 metric computations, so expect at least 500 LLM calls and usually several times that. Budget for the cost.
Don't use Ragas as an optimizer objective — it's too slow for inner-loop optimization. Use DSPy's built-in metrics for compile(), then Ragas for analysis.
ContextPrecision and ContextRecall need ground truth — if you don't have reference answers, use Faithfulness + AnswerRelevancy (reference-free).
Claude uses deprecated Ragas APIs from older tutorials. Ragas has had multiple breaking changes: v0.2 introduced EvaluationDataset/SingleTurnSample (replacing Dataset from datasets), v0.3.8 deprecated LangchainLLMWrapper, and v0.4 migrated metrics to a new BasePrompt architecture. If Claude generates code using Dataset, LangchainLLMWrapper, or ground_truths, it is using deprecated APIs. Always use EvaluationDataset, SingleTurnSample, and llm_factory().
Claude installs ragas without checking the version. Ragas v0.4+ has significant API changes from v0.2/v0.3. Set a minimum version in requirements (ragas>=0.4) to avoid mixing old and new APIs.
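
One way to guard against the version gotchas above is to check the installed Ragas version at the top of an evaluation script. This is a sketch using only the standard library; `parse_major_minor` and `ragas_is_current` are hypothetical helper names, and the (0, 4) minimum follows the recommendation above.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.4.1'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def ragas_is_current(minimum: tuple[int, int] = (0, 4)) -> bool:
    """True if ragas is installed and at least the given (major, minor)."""
    try:
        return parse_major_minor(version("ragas")) >= minimum
    except PackageNotFoundError:
        return False


print(ragas_is_current())
```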

Additional resources

Ragas documentation
Ragas GitHub
For API details, see reference.md
For worked examples, see examples.md

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

DSPy's built-in evaluation (SemanticF1, exact match, LM-as-judge) — /dspy-evaluate
Building RAG pipelines — /ai-searching-docs
Retrieval modules and vector DBs — /dspy-retrieval, /dspy-qdrant
Stopping hallucinations (when Faithfulness is low) — /ai-stopping-hallucinations
Optimizing RAG accuracy — /ai-improving-accuracy, /dspy-miprov2
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do