dspy-evaluation-harness

Name: Dspy Evaluation Harness
Author: intertwine

Build DSPy evaluation harnesses with rich-feedback metrics that are essential for GEPA optimization. Use when writing a metric function, calling dspy.Evaluate, splitting dev/val sets, debugging "why is my optimizer not improving?", or designing CI-ready DSPy eval suites.


stars:245

forks:22

updated:May 25, 2026 at 05:28


name	dspy-evaluation-harness
description	Build DSPy evaluation harnesses with rich-feedback metrics that are essential for GEPA optimization. Use when writing a metric function, calling dspy.Evaluate, splitting dev/val sets, debugging "why is my optimizer not improving?", or designing CI-ready DSPy eval suites.
when_to_use	User mentions `dspy.Evaluate`, a "metric", a devset/valset/trainset, evaluation, scoring, or asks why their GEPA optimization isn't converging (almost always: their metric is too thin).

DSPy Evaluation Harness (3.2.x)

The metric is usually more important than the program. For dspy.GEPA especially, the quality of textual feedback in your metric determines whether optimization converges.

Two rules

Return a dspy.Prediction(score=..., feedback=...), not a dict. dspy.Evaluate's parallel executor aggregates scores via sum, which breaks on dict outputs (TypeError: unsupported operand type(s) for +: 'int' and 'dict'). dspy.Prediction supports __float__/__add__ and is what GEPA's adapter natively unwraps. A bare float still works for pure dspy.Evaluate scoring, but GEPA needs the score+feedback pair.
Separate valset. Never optimize and evaluate on the same examples. Optimizers overfit fast.
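The aggregation failure described in the first rule can be reproduced without DSPy. The sketch below uses a hypothetical `ScoredResult` class as a stand-in for a score-plus-feedback object (it is an assumption for illustration, not DSPy's implementation) to show why an object that defines `__float__` and `__radd__` survives a plain `sum`, while a dict does not.

```python
class ScoredResult:
    """Hypothetical stand-in for a score+feedback object; not DSPy's class."""

    def __init__(self, score: float, feedback: str):
        self.score = score
        self.feedback = feedback

    def __float__(self) -> float:
        return float(self.score)

    def __add__(self, other) -> float:
        return float(self) + float(other)

    def __radd__(self, other) -> float:
        # sum() starts from the int 0, so "0 + ScoredResult" lands here.
        return float(other) + float(self)


def total(outputs) -> float:
    """Aggregate metric outputs the way a plain sum() would."""
    return sum(outputs)


good = [ScoredResult(1.0, "Correct."), ScoredResult(0.5, "Too long.")]
bad = [{"score": 1.0, "feedback": "Correct."}]

good_total = total(good)

try:
    total(bad)
    dict_error = None
except TypeError as exc:
    # Same failure mode as the TypeError quoted in the first rule.
    dict_error = str(exc)
```

The dict fails on the very first addition because neither `int` nor `dict` knows how to add the other, which is why the error names `'int' and 'dict'`.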

Canonical rich-feedback metric

import dspy

def rich_metric(gold: dspy.Example, pred: dspy.Prediction, trace=None,
                pred_name: str | None = None, pred_trace=None):
    # 1. Compute sub-scores — multi-axis beats scalar
    correctness = 1.0 if _normalize(pred.answer) == _normalize(gold.answer) else 0.0
    cited = _has_citation(pred.answer, gold.sources) if hasattr(gold, "sources") else 1.0
    concise = 1.0 if len(pred.answer.split()) <= 50 else 0.5
    score = 0.6 * correctness + 0.25 * cited + 0.15 * concise

    # 2. Write feedback that teaches the optimizer
    parts = []
    if correctness < 1.0:
        parts.append(
            f"Answer mismatch. Predicted: {pred.answer!r}. Expected: {gold.answer!r}. "
            f"Likely cause: reasoning skipped the units/quantity in the question."
        )
    if cited < 1.0:
        parts.append("Did not ground the claim in the provided sources. Quote a source fragment.")
    if concise < 1.0:
        parts.append("Answer exceeded 50 words — tighten to one sentence.")
    if not parts:
        parts.append("Correct, grounded, and concise.")
    feedback = " ".join(parts)

    return dspy.Prediction(score=score, feedback=feedback)

Canonical harness

evaluator = dspy.Evaluate(
    devset=valset,
    metric=rich_metric,
    num_threads=8,
    display_progress=True,
    display_table=10,          # pretty-print first 10 rows
    provide_traceback=True,    # surface exceptions, don't swallow them
    max_errors=5,
    failure_score=0.0,
    save_as_json="eval_runs/baseline.json",
)
result = evaluator(program)
print("Overall:", result.score)
for example_result in result.results[:3]:
    print(example_result)

dspy.Evaluate returns an EvaluationResult with .score (the aggregate score as a float, reported as a percentage such as 67.30) and .results (a list of (example, pred, score) tuples).
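Because each row is an (example, pred, score) tuple, the failures worth reading first are easy to pull out. The helper below is a hypothetical convenience, not part of DSPy; it assumes only that `float(score)` works for each row's score, and the sample rows are made-up placeholders.

```python
def lowest_scoring(results, k=3):
    """Return the k rows with the lowest score, worst first.

    Assumes each row is an (example, pred, score) tuple and that
    float(score) works, as it does for plain floats.
    """
    return sorted(results, key=lambda row: float(row[2]))[:k]


# Illustrative rows; in practice pass result.results from the evaluator.
rows = [
    ({"question": "q1"}, {"answer": "a1"}, 1.0),
    ({"question": "q2"}, {"answer": "a2"}, 0.25),
    ({"question": "q3"}, {"answer": "a3"}, 0.6),
]
worst = lowest_scoring(rows, k=2)
```

Reading the worst few rows alongside their feedback strings is usually a faster way to spot a metric bug than staring at the aggregate score.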

Dataset hygiene

Size: 20–50 examples are enough for GEPA's reflective loop; plan on 100–500 for MIPROv2-style bootstrapping.
Split: hand-curate two disjoint sets: trainset (used by the optimizer) and valset (used to score the optimized program). A held-out test set that you never look at during development is the most trustworthy check of all.
Representativeness beats size. Include edge cases, ambiguity, and adversarial inputs.
Build examples with dspy.Example(...).with_inputs("question", "context"); the with_inputs call marks which fields are inputs and which are gold outputs.

trainset = [
    dspy.Example(question="…", answer="…").with_inputs("question"),
    ...
]
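Hand-curated splits can still leak when the same question lands in both sets with slightly different spacing or casing. The check below is a sketch of one way to catch that. `overlapping_inputs` is a hypothetical helper, and it assumes examples support `example["field"]` lookup, so plain dicts are used here for illustration.

```python
def overlapping_inputs(trainset, valset, input_keys=("question",)):
    """Return valset examples whose input fields also appear in trainset.

    Inputs are compared after lowercasing and collapsing whitespace, so
    trivially reformatted duplicates are still caught.
    """
    def fingerprint(example):
        return tuple(
            " ".join(str(example[key]).lower().split()) for key in input_keys
        )

    train_prints = {fingerprint(example) for example in trainset}
    return [example for example in valset if fingerprint(example) in train_prints]


# Illustrative data only.
train = [{"question": "What is 2+2?", "answer": "4"}]
val = [
    {"question": "  what is  2+2? ", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
leaks = overlapping_inputs(train, val)
```

Running a check like this as an assertion (`assert not leaks`) before any optimization keeps the disjointness requirement from depending on memory alone.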

Multi-axis metrics (recommended)

Combine correctness, faithfulness, format adherence, latency, and cost. Each axis should be a 0–1 float with a written definition. Weight them explicitly: rather than burying weights as magic numbers inside the scoring expression, declare them as named constants so the trade-off between axes is visible and can be adjusted deliberately.
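One way to keep the weights explicit is sketched below. The axis names and `combine_axes` are hypothetical, and the weight values simply reuse the 0.6 / 0.25 / 0.15 split from the canonical metric; none of this is a DSPy API.

```python
# Named weights: change the trade-off here, not inside the scoring expression.
AXIS_WEIGHTS = {
    "correctness": 0.6,
    "faithfulness": 0.25,
    "format": 0.15,
}


def combine_axes(axis_scores, weights=AXIS_WEIGHTS):
    """Weighted average of per-axis scores, each expected to lie in [0, 1]."""
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"Missing axis scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= axis_scores[name] <= 1.0:
            raise ValueError(f"Axis {name!r} must be in [0, 1]")
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * axis_scores[name] for name in weights)
    # Dividing by the total keeps the result in [0, 1] even if weights drift.
    return weighted / total_weight


combined = combine_axes({"correctness": 1.0, "faithfulness": 0.0, "format": 1.0})
```

Validating the range of each axis up front means a mis-scaled sub-score (for example, latency in milliseconds) fails loudly instead of silently dominating the total.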

CI-ready eval suite

# tests/test_dspy_eval.py
import dspy
import pytest

from my_program import program, valset, rich_metric

@pytest.fixture(scope="module")
def evaluator():
    # One evaluator per test module; tracebacks stay on so real bugs surface.
    return dspy.Evaluate(devset=valset, metric=rich_metric, num_threads=8,
                         display_progress=False, provide_traceback=True)

def test_program_meets_threshold(evaluator):
    result = evaluator(program)
    # result.score is reported as a percentage (0-100), so 75.0 means 75%.
    assert result.score >= 75.0, f"Regression: {result.score:.2f}"

Run offline in CI with a cached LM (dspy.LM(..., cache=True)) and a pre-populated DSPY_CACHEDIR.

Tracing & observability

track_usage=True on dspy.configure accumulates token counts on predictions (pred.get_lm_usage()).
MLflow: import mlflow; mlflow.dspy.autolog() → traces every prediction.
W&B: pass use_wandb=True to dspy.GEPA to log Pareto fronts.
Always log eval results to a versioned JSON file (save_as_json=...) so you can diff runs.
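For the versioned JSON file, a timestamped filename is one simple scheme. The helper below is a hypothetical sketch: the `eval_runs` directory matches the canonical harness, while the naming pattern is an assumption rather than anything DSPy prescribes.

```python
from datetime import datetime, timezone
from pathlib import Path


def versioned_run_path(label, root="eval_runs", now=None):
    """Build a path like eval_runs/<label>-<UTC timestamp>.json.

    `now` can be injected for reproducible names in tests.
    """
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return Path(root) / f"{label}-{stamp}.json"


example_path = versioned_run_path(
    "baseline", now=datetime(2026, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
```

The result can be passed as `save_as_json=str(versioned_run_path("baseline"))`; creating the parent directory beforehand is a safe precaution, since this sketch does not do it.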

Anti-patterns

Scalar-only metrics (float but no feedback) when using GEPA — wasted signal.
return {"score": s, "feedback": f} (dict) — crashes dspy.Evaluate's parallel aggregator. Use dspy.Prediction(score=s, feedback=f).
Exact-match metrics on open-ended generation tasks — use semantic or LM-as-judge scoring.
Evaluating on the trainset — optimistic by 10–30 points.
Silently swallowing exceptions (provide_traceback=False) — you'll blame the LM for a KeyError.
Changing the metric mid-experiment without re-baselining — prior numbers become incomparable.

Next

Feed this metric into dspy-gepa-optimizer.
Full harness pattern → reference.md.
Runnable example → example_metric.py.

Related skills (same repository)

dspy-advanced-workflow.md

from "intertwine/dspy-agent-skills"

Drive a complete DSPy 3.2.x project end-to-end — spec → program → metric → baseline → GEPA optimize → export → deploy. Orchestrates the other four DSPy skills (dspy-fundamentals, dspy-evaluation-harness, dspy-gepa-optimizer, dspy-rlm-module) in the correct order. Use this for any non-trivial DSPy build from scratch.

2026-05-25 (245 stars)

dspy-fundamentals.md

from "intertwine/dspy-agent-skills"

Write idiomatic DSPy 3.2.x programs — typed Signatures, dspy.Module subclasses, Predict/ChainOfThought/ReAct/ProgramOfThought, and save/load. Use this when starting any new DSPy project or when fixing non-idiomatic DSPy code (hard-coded prompts, ad-hoc string templates, untyped outputs, non-serializable classes).

2026-05-25 (245 stars)

dspy-gepa-optimizer.md

from "intertwine/dspy-agent-skills"

Optimize DSPy programs with dspy.GEPA — the reflective/evolutionary optimizer that is the 2026 gold standard for DSPy (beats MIPROv2 on complex tasks with far fewer rollouts when the metric returns rich feedback). Use when the user says optimize, compile, GEPA, reflective optimization, or "make this program better" and a DSPy program + metric + trainset exist.

2026-05-25 (245 stars)

dspy-rlm-module.md

from "intertwine/dspy-agent-skills"

Use dspy.RLM (Recursive Language Model) for reasoning over contexts too large to fit in an LLM's working window — entire codebases, long logs, massive documents, or multi-step data exploration that needs a sandboxed Python REPL. Use when the input is >100k tokens, needs recursive chunking, or benefits from the LLM writing and running code to probe data.

2026-04-21 (245 stars)

package.json

"author": "intertwine"

"repository": "intertwine/dspy-agent-skills"


