dspy-evaluate

Updated: June 13, 2026, 13:41

Use when you need to measure how well your DSPy program performs — writing metrics, scoring against a dev set, or comparing before/after optimization. Common scenarios - measuring accuracy before and after optimization, writing custom metrics for your task, scoring a program against a held-out dev set, comparing two prompt strategies, building a test suite for AI quality, or running regression tests on AI outputs. Related - ai-improving-accuracy, ai-scoring, ai-monitoring. Also used for dspy.Evaluate, dspy.evaluate, write DSPy metric function, measure AI accuracy, evaluate DSPy program, dev set evaluation, before and after optimization comparison, custom scoring function, test AI quality systematically, AI regression testing, metric-driven development, how to know if my DSPy program improved, score predictions against labels, evaluation harness for LLM, CI/CD for AI quality.

Source: lebsral/DSPy-Programming-not-prompting-LMs-skills

Evaluate Your DSPy Program

Guide the user through measuring AI quality with DSPy's Evaluate class. The pattern: pick a metric, prepare a devset, run the evaluator, interpret results, then feed the same metric into an optimizer.

What is dspy.Evaluate

dspy.Evaluate runs your program on every devset example, scores each prediction with a metric, and reports the aggregate score. It handles threading and progress display. It returns an EvaluationResult whose .score is the aggregate as a percentage (0-100).
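
As a rough mental model of the loop described above, the sketch below runs a program over a devset, scores each prediction, and reports a percentage. It is not DSPy's implementation: it leaves out threading and progress display, and it calls the program with the example object directly, which is a simplification. The names `evaluate_sketch`, `echo_program`, and `exact` are hypothetical stand-ins.

```python
from types import SimpleNamespace

def evaluate_sketch(program, devset, metric):
    """Run program on each example, score it, and return (percentage, per-example rows)."""
    results = []
    for example in devset:
        prediction = program(example)
        score = float(metric(example, prediction))
        results.append((example, prediction, score))
    total = sum(score for _, _, score in results)
    percentage = 100.0 * total / len(results) if results else 0.0
    return percentage, results

# Hypothetical stand-ins for a devset, a program, and a metric.
devset = [
    SimpleNamespace(question="2+2?", answer="4"),
    SimpleNamespace(question="Capital of France?", answer="Paris"),
]

def echo_program(example):
    # Always answers "4", so it gets exactly one of the two examples right.
    return SimpleNamespace(answer="4")

def exact(example, prediction, trace=None):
    return example.answer == prediction.answer

percentage, results = evaluate_sketch(echo_program, devset, exact)
print(percentage)  # 50.0
```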

Built-in metrics

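DSPy provides answer_exact_match (normalized string equality between the predicted and gold answer) and answer_passage_match (checks whether the gold answer appears in the passages stored in the prediction's context field). Both expect an answer field on the example. answer_exact_match also reads answer from the prediction, while answer_passage_match reads context from it.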
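
To make "normalized string equality" concrete, here is a minimal sketch that assumes SQuAD-style normalization (lowercase, strip punctuation, drop English articles, collapse whitespace). DSPy's own normalization may differ in details. `normalize` and `exact_match_sketch` are hypothetical names, not DSPy functions.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_sketch(gold, predicted):
    """Compare two answers after normalization."""
    return normalize(gold) == normalize(predicted)

print(exact_match_sketch("The Eiffel Tower", "eiffel tower."))  # True
print(exact_match_sketch("Paris", "Lyon"))  # False
```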

SemanticF1

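Uses an LM to estimate how much of the expected response the prediction covers (recall) and how much of the prediction is supported by the expected response (precision), then combines the two into an F1 score. More forgiving than exact match: it gives partial credit for answers that are close but not identical: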

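from dspy.evaluate import Evaluate, SemanticF1

semantic_f1 = SemanticF1()

# devset and my_program are assumed to be defined elsewhere
evaluator = Evaluate(devset=devset, metric=semantic_f1, num_threads=4)
result = evaluator(my_program)  # EvaluationResult; result.score is the percentage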

SemanticF1 is a good default metric for open-ended QA tasks where exact match is too strict. It expects a question field on the example plus a response field on both the example and prediction (not answer). Constructor: SemanticF1(threshold=0.66, decompositional=False). During evaluation (trace is None) the metric returns the F1 score; during optimization (trace is set) it returns whether the score meets the threshold.
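
If your program uses answer rather than response, one option is a thin wrapper that presents the fields under the expected names. The sketch below is an illustration under assumptions: `answer_to_response` and `response_metric` are hypothetical, `response_metric` stands in for a response-based metric such as SemanticF1, and it is assumed that the wrapped metric only reads attributes from the objects it receives.

```python
from types import SimpleNamespace

def answer_to_response(metric):
    """Wrap a metric that reads `.response` so it accepts objects with `.answer`."""
    def wrapped(example, prediction, trace=None):
        adapted_example = SimpleNamespace(
            question=example.question, response=example.answer
        )
        adapted_prediction = SimpleNamespace(response=prediction.answer)
        return metric(adapted_example, adapted_prediction, trace)
    return wrapped

# Hypothetical stand-in for a metric that expects `response` fields.
def response_metric(example, prediction, trace=None):
    return example.response.lower() == prediction.response.lower()

wrapped_metric = answer_to_response(response_metric)
example = SimpleNamespace(question="Capital of France?", answer="Paris")
prediction = SimpleNamespace(answer="paris")
print(wrapped_metric(example, prediction))  # True
```

In a real project the wrapped metric would be passed to the evaluator in place of the original. Renaming the signature field is the simpler fix when you control the program.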

CompleteAndGrounded

Checks whether the predicted answer is both complete (covers all key claims in the gold answer) and grounded (makes no claims that the retrieved context does not support):

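from dspy.evaluate import CompleteAndGrounded, Evaluate

complete_and_grounded = CompleteAndGrounded()

# devset and my_program are assumed to be defined elsewhere
evaluator = Evaluate(devset=devset, metric=complete_and_grounded, num_threads=4)
result = evaluator(my_program)  # EvaluationResult; result.score is the percentage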

This is an LM-based metric — it uses the configured LM to judge completeness and groundedness. It expects response and context fields on the prediction, and question and response on the example. Constructor: CompleteAndGrounded(threshold=0.66). Useful for RAG tasks where you care about both recall and precision of facts.

Custom metrics

A metric is def metric(example, prediction, trace=None) returning bool, int, or float. The trace parameter is None during evaluation but set during optimization (use this to apply stricter requirements during training).
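
Because the return type matters, it can help to fail loudly when a metric returns something else. The decorator below is a minimal sketch of that idea. `checked`, `good_metric`, and `bad_metric` are hypothetical names and not part of DSPy.

```python
import functools

def checked(metric):
    """Raise early if a metric returns something other than bool, int, or float."""
    @functools.wraps(metric)
    def wrapper(example, prediction, trace=None):
        score = metric(example, prediction, trace)
        if not isinstance(score, (bool, int, float)):
            raise TypeError(
                f"{metric.__name__} returned {type(score).__name__}; "
                "expected bool, int, or float"
            )
        return score
    return wrapper

@checked
def good_metric(example, prediction, trace=None):
    return 0.5

@checked
def bad_metric(example, prediction, trace=None):
    return "yes"  # a string: the kind of return value the guard is meant to catch

print(good_metric(None, None))  # 0.5
```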

Multi-field scoring

def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").strip().lower() == getattr(example, f, "").strip().lower()
    )
    return correct / len(fields)

LM-as-judge

For open-ended tasks (summaries, creative writing, complex QA), use an LM to judge quality. Define a signature for the judge, then call it inside your metric:

import dspy

class AssessAnswer(dspy.Signature):
    """Assess if the predicted answer correctly addresses the question."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField(desc="The reference answer")
    predicted_answer: str = dspy.InputField(desc="The answer to evaluate")
    is_correct: bool = dspy.OutputField(desc="True if the prediction is correct and complete")

def llm_judge_metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Use a separate LM for the judge

To avoid the model grading its own work, use a different (often stronger) LM for the judge:

judge_lm = dspy.LM("openai/gpt-4o")  # or "anthropic/claude-sonnet-4-5-20250929", etc.

def llm_judge_metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessAnswer)
    with dspy.context(lm=judge_lm):
        result = judge(
            question=example.question,
            gold_answer=example.answer,
            predicted_answer=prediction.answer,
        )
    return result.is_correct

Graded judge (float scores)

Return a float instead of a bool for partial credit:

class GradeAnswer(dspy.Signature):
    """Grade the predicted answer on a scale of 0 to 5."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    score: int = dspy.OutputField(desc="Score from 0 (completely wrong) to 5 (perfect)")
    justification: str = dspy.OutputField(desc="Why this score was given")

def graded_metric(example, prediction, trace=None):
    judge = dspy.ChainOfThought(GradeAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return max(0, min(5, result.score)) / 5.0  # clamp to 0-5, then normalize to 0.0-1.0

Composite metrics

Combine multiple signals into a single score with weights:

def composite_metric(example, prediction, trace=None):
    # Correctness (primary signal)
    correct = float(prediction.answer.strip().lower() == example.answer.strip().lower())

    # Conciseness (prefer shorter answers)
    concise = float(len(prediction.answer.split()) < 50)

    # Has reasoning (check that the model explained its thinking)
    has_reasoning = float(len(getattr(prediction, "reasoning", "")) > 20)

    # Weighted combination
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning

Mixing exact checks with LM judges

def hybrid_metric(example, prediction, trace=None):
    # Fast exact check
    if prediction.answer.strip().lower() == example.answer.strip().lower():
        return 1.0

    # Fall back to LM judge for partial credit
    judge = dspy.Predict(AssessAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return 0.5 if result.is_correct else 0.0

Debugging with per-example scores

Evaluate returns an EvaluationResult with .score (aggregate percentage) and .results (list of (example, prediction, score) tuples). Use .results to find failing examples and understand failure patterns.
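
A small helper for sifting through those tuples might look like the sketch below. The `results` list is a hypothetical stand-in for the .results of a real run, and `failing_examples` is an illustrative name.

```python
from types import SimpleNamespace

def failing_examples(results, threshold=1.0):
    """Return the (example, prediction, score) rows scoring below threshold, worst first."""
    failures = [row for row in results if float(row[2]) < threshold]
    return sorted(failures, key=lambda row: float(row[2]))

# Hypothetical stand-in for the per-example results of an evaluation run.
results = [
    (SimpleNamespace(question="q1", answer="a"), SimpleNamespace(answer="a"), 1.0),
    (SimpleNamespace(question="q2", answer="b"), SimpleNamespace(answer="x"), 0.0),
    (SimpleNamespace(question="q3", answer="c"), SimpleNamespace(answer="c d"), 0.5),
]

for example, prediction, score in failing_examples(results):
    print(f"{score:.2f} | {example.question} | "
          f"expected={example.answer!r} got={prediction.answer!r}")
```

Sorting worst-first puts outright failures ahead of partial-credit cases, which tends to be the quickest way to spot a shared failure pattern.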

Common patterns

Trace-aware metrics for optimization

The trace parameter is None during evaluation but set during optimization. Use this to apply stricter requirements during training:

def metric(example, prediction, trace=None):
    correct = prediction.answer.strip().lower() == example.answer.strip().lower()
    if trace is not None:
        # During optimization: also require good reasoning
        has_reasoning = len(getattr(prediction, "reasoning", "")) > 50
        return correct and has_reasoning
    # During evaluation: only check correctness
    return correct

This makes the optimizer filter for traces where the model both got the answer right and showed its work. The result is more robust few-shot demonstrations.

Before-and-after comparison

A common workflow for measuring the impact of optimization:

import dspy
from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_table=5)

# Baseline
baseline = evaluator(my_program)

# Optimize
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(my_program, trainset=trainset)

# Compare
optimized_result = evaluator(optimized)
print(f"Baseline:  {baseline.score:.1f}%")
print(f"Optimized: {optimized_result.score:.1f}%")
print(f"Delta:     {optimized_result.score - baseline.score:+.1f}%")
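
The aggregate delta can hide individual regressions, so it is often worth comparing the two runs example by example. The sketch below assumes both runs cover the same devset and that their per-example rows come back in the same order, which is an assumption to verify for your setup. `compare_runs` and the sample rows are hypothetical.

```python
def compare_runs(baseline_results, optimized_results):
    """Count examples whose score improved, regressed, or stayed the same."""
    if len(baseline_results) != len(optimized_results):
        raise ValueError("Both runs must cover the same devset")
    summary = {"improved": 0, "regressed": 0, "unchanged": 0}
    regressions = []  # indices of examples that got worse
    for index, (before, after) in enumerate(zip(baseline_results, optimized_results)):
        before_score, after_score = float(before[2]), float(after[2])
        if after_score > before_score:
            summary["improved"] += 1
        elif after_score < before_score:
            summary["regressed"] += 1
            regressions.append(index)
        else:
            summary["unchanged"] += 1
    return summary, regressions

# Hypothetical (example, prediction, score) rows; only the score is used here.
baseline_rows = [(None, None, 0.0), (None, None, 1.0), (None, None, 1.0)]
optimized_rows = [(None, None, 1.0), (None, None, 1.0), (None, None, 0.0)]

summary, regressions = compare_runs(baseline_rows, optimized_rows)
print(summary)      # {'improved': 1, 'regressed': 1, 'unchanged': 1}
print(regressions)  # [2]
```

In this toy case the aggregate score is unchanged even though one example regressed, which is exactly the situation a per-example comparison is meant to surface.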

Gotchas

Claude uses return_all_scores=True or return_outputs=True which no longer exist. Evaluate now returns an EvaluationResult object. Access .score for the aggregate percentage and .results for per-example (example, prediction, score) tuples. Do not pass return_all_scores or return_outputs.
Claude uses SemanticF1 with answer fields but it expects response. SemanticF1 looks for response on both the example and prediction, not answer. If your signature uses answer, either rename the field or write a wrapper metric.
Claude uses CompleteAndGrounded without providing context. CompleteAndGrounded expects response and context on the prediction, and question and response on the example. Without context, it cannot check groundedness.
Metrics must return a bool, int, or float, not a string. Returning a string silently breaks scoring.
Small dev sets (<30 examples) give unreliable scores. Results can swing 10-20 percentage points between runs. Aim for 50+ examples for stable evaluation.
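
To get a feel for why small dev sets are noisy, the sketch below estimates the 95% margin of error on an accuracy score. It uses the normal approximation for a proportion and treats each example as an independent pass/fail, both of which are simplifying assumptions. `accuracy_margin` is a hypothetical helper.

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error, in percentage points, for an accuracy in [0, 1]."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return 100.0 * z * standard_error

# Margin of error at 70% accuracy for a few dev set sizes.
for n in (30, 50, 200):
    print(n, round(accuracy_margin(0.7, n), 1))
```

At 70% accuracy this works out to roughly ±16 points with 30 examples, ±13 with 50, and ±6 with 200, so a measured improvement smaller than the margin may just be sampling noise.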

Additional resources

dspy.Evaluate API docs
dspy.SemanticF1 API docs
dspy.CompleteAndGrounded API docs
For constructor signatures and method reference, see reference.md
For worked examples (exact match, LM judge, composite), see examples.md

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

Need to prepare training and evaluation data? Use /dspy-data
Ready to optimize with few-shot examples? Use /dspy-bootstrap-few-shot
Want the best prompt optimization? Use /dspy-miprov2
For the full measure-improve-verify loop, see /ai-improving-accuracy
For decomposed RAG evaluation (faithfulness, context precision/recall) see /dspy-ragas
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do