dspy-gepa-optimizer

Name: Dspy Gepa Optimizer
Author: intertwine

// Optimize DSPy programs with dspy.GEPA — the reflective/evolutionary optimizer that is the 2026 gold standard for DSPy (beats MIPROv2 on complex tasks with far fewer rollouts when the metric returns rich feedback). Use when the user says optimize, compile, GEPA, reflective optimization, or "make this program better" and a DSPy program + metric + trainset exist.

stars:245

forks:22

updated: 25 May 2026, 05:28

related-skills.json

same repository

dspy-advanced-workflow.md

from "intertwine/dspy-agent-skills"

Drive a complete DSPy 3.2.x project end-to-end — spec → program → metric → baseline → GEPA optimize → export → deploy. Orchestrates the other four DSPy skills (dspy-fundamentals, dspy-evaluation-harness, dspy-gepa-optimizer, dspy-rlm-module) in the correct order. Use this for any non-trivial DSPy build from scratch.

2026-05-25, 245 stars

dspy-evaluation-harness.md

from "intertwine/dspy-agent-skills"

Build DSPy evaluation harnesses with rich-feedback metrics that are essential for GEPA optimization. Use when writing a metric function, calling dspy.Evaluate, splitting dev/val sets, debugging "why is my optimizer not improving?", or designing CI-ready DSPy eval suites.

2026-05-25, 245 stars

dspy-fundamentals.md

from "intertwine/dspy-agent-skills"

Write idiomatic DSPy 3.2.x programs — typed Signatures, dspy.Module subclasses, Predict/ChainOfThought/ReAct/ProgramOfThought, and save/load. Use this when starting any new DSPy project or when fixing non-idiomatic DSPy code (hard-coded prompts, ad-hoc string templates, untyped outputs, non-serializable classes).

2026-05-25, 245 stars

dspy-rlm-module.md

from "intertwine/dspy-agent-skills"

Use dspy.RLM (Recursive Language Model) for reasoning over contexts too large to fit in an LLM's working window — entire codebases, long logs, massive documents, or multi-step data exploration that needs a sandboxed Python REPL. Use when the input is >100k tokens, needs recursive chunking, or benefits from the LLM writing and running code to probe data.

2026-04-21, 245 stars

package.json

"author": "intertwine"

"repository": "intertwine/dspy-agent-skills"

name	dspy-gepa-optimizer
description	Optimize DSPy programs with dspy.GEPA — the reflective/evolutionary optimizer that is the 2026 gold standard for DSPy (beats MIPROv2 on complex tasks with far fewer rollouts when the metric returns rich feedback). Use when the user says optimize, compile, GEPA, reflective optimization, or "make this program better" and a DSPy program + metric + trainset exist.
when_to_use	User asks to optimize/compile/tune a DSPy program, mentions GEPA or reflective optimization, or has a working program with a non-trivial metric and wants to improve it.

DSPy GEPA Optimizer (3.2.x)

GEPA (Genetic-Pareto) is a reflective optimizer: it mutates a program's instructions and few-shots using an LM that reads your metric's textual feedback and proposes improvements. It maintains a Pareto frontier across validation tasks and is the default recommendation for complex DSPy workloads in 2026.

The expansion "Genetic-Evolutionary Prompt Adaptation" that appears in some AI-generated summaries is an LLM-hallucinated backronym. The paper defines GEPA as Genetic-Pareto; the "Pareto" is load-bearing (GEPA keeps a frontier of candidates rather than collapsing to one).

Prerequisites — do these first or GEPA wastes rollouts

A dspy.Module that runs end-to-end (see dspy-fundamentals).
A rich-feedback metric returning dspy.Prediction(score=float, feedback=str) (see dspy-evaluation-harness). A float-only metric makes GEPA no better than MIPRO. A dict with the same fields still crashes dspy.Evaluate under DSPy 3.2.1 — use dspy.Prediction.
trainset and a separate valset. For GEPA, maximize training examples and keep validation just large enough to represent the downstream distribution; do not reuse the same examples for both.
A reflection_lm — a strong LM (often the same or stronger than the task LM) set to temperature=1.0 for creative proposals. Current DSPy docs use a GPT-5-class reflection model with a large output budget.

Canonical call

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-5-mini"))
reflection_lm = dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000)

optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="medium",                       # "light" / "medium" / "heavy"
    reflection_lm=reflection_lm,
    reflection_minibatch_size=3,
    candidate_selection_strategy="pareto",  # or "current_best"
    skip_perfect_score=True,
    use_merge=True,
    num_threads=8,
    track_stats=True,
    track_best_outputs=True,             # enables inference-time best-of selection
    log_dir="./gepa_logs",               # resume/checkpoint
    seed=0,
)

optimized = optimizer.compile(
    student=program,
    trainset=trainset,
    valset=valset,
)

# Inspect aggregate validation scores, one per candidate program (needs track_stats=True)
candidate_scores = optimized.detailed_results.val_aggregate_scores
print("Top candidate scores:", sorted(candidate_scores, reverse=True)[:5])

optimized.save("optimized_program.json", save_program=False)

Import paths

Either works; use the top-level in new code:

import dspy
dspy.GEPA(...)                              # preferred
# equivalently:
from dspy.teleprompt import GEPA

Metric contract (precise)

import dspy

def rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = ...      # 0.0..1.0
    feedback = ...   # detailed natural-language critique
    return dspy.Prediction(score=score, feedback=feedback)

Return dspy.Prediction, not a dict. Some upstream GEPA prose describes score/feedback as a dict-like shape, but dspy.Evaluate in DSPy 3.2.1 still crashes on a literal dict metric (TypeError: unsupported operand type(s) for +: 'int' and 'dict'). GEPA uses dspy.Evaluate internally for candidate scoring, so a dict return can fail inside GEPA too, not just in your explicit Evaluate(...) calls.

pred_name / pred_trace are set during reflection on a specific predictor inside your module — write per-predictor feedback when possible (credit assignment). If you cannot localize feedback, return program-level feedback rather than a vague score-only critique.
Feedback quality is the load-bearing part: specifics about why it failed and what good looks like are what the reflection LM acts on.
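As a minimal sketch of the contract above, assume a hypothetical task where `gold.keywords` lists the points an answer must mention and `pred.answer` holds the model output. Both field names and the keyword-coverage rule are illustrative assumptions, not part of DSPy. The scoring logic lives in a plain function so it can be tested without an LM, and a thin wrapper returns dspy.Prediction:

```python
def keyword_feedback(required_keywords, answer_text):
    """Score an answer by keyword coverage and explain what is missing.

    Hypothetical scoring rule for illustration; swap in your task's logic.
    """
    answer_lower = answer_text.lower()
    missing = [kw for kw in required_keywords if kw.lower() not in answer_lower]
    total = len(required_keywords)
    score = 1.0 if total == 0 else (total - len(missing)) / total
    if missing:
        feedback = (
            f"Covered {total - len(missing)} of {total} required points. "
            f"Missing: {', '.join(missing)}. "
            "A good answer mentions each of these explicitly."
        )
    else:
        feedback = "All required points are covered; keep this level of detail."
    return score, feedback


def rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Imported lazily so keyword_feedback stays usable without DSPy installed.
    import dspy

    # `keywords` and `answer` are assumed field names on the example and prediction.
    score, feedback = keyword_feedback(gold.keywords, pred.answer)
    if pred_name is not None:
        # During reflection a predictor name is passed in; name it in the critique.
        feedback = f"[{pred_name}] {feedback}"
    return dspy.Prediction(score=score, feedback=feedback)
```

The feedback string names what was missing and what a good answer looks like, which is the part the reflection LM can act on.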

Budget knobs

Use either auto=... or explicit budget — not both.

Mode	Rough rollouts	When to use
`auto="light"`	~20–40 full evals	Sanity-check GEPA works on your metric
`auto="medium"`	~80–150 full evals	Everyday optimization
`auto="heavy"`	~300–600 full evals	Final run before ship
`max_full_evals=N`	Explicit	Deterministic budget
`max_metric_calls=N`	Explicit	Hard cap on metric invocations (more predictable cost)

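Each "full eval" costs roughly len(trainset) + len(valset) metric calls, because DSPy converts max_full_evals=N into a budget of N × (len(trainset) + len(valset)) metric calls. Budget accordingly for cost.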
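The budget arithmetic can be sketched as below. The number of metric calls per full evaluation is passed in rather than hard-coded, and the per-call price is a hypothetical figure you would measure for your own task LM and metric. Both helper names are illustrative.

```python
def estimate_metric_calls(full_evals, calls_per_full_eval):
    """Convert a budget expressed in full evaluations into a metric-call count."""
    return full_evals * calls_per_full_eval


def estimate_cost_usd(metric_calls, usd_per_call):
    """Rough spend estimate; usd_per_call is a number you measure yourself."""
    return metric_calls * usd_per_call


# Hypothetical numbers: 100 full evals, 50 metric calls each, $0.002 per call.
calls = estimate_metric_calls(full_evals=100, calls_per_full_eval=50)
cost = estimate_cost_usd(calls, usd_per_call=0.002)
print(f"{calls} metric calls, about ${cost:.2f}")
```

Running this kind of estimate before a heavy run makes it easier to decide between an auto mode and an explicit max_metric_calls cap.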

Constructor parameters (every one, DSPy 3.2.x)

dspy.GEPA(
    metric,                                  # required
    auto=None,                               # Literal["light","medium","heavy"] | None
    max_full_evals=None,
    max_metric_calls=None,
    reflection_minibatch_size=3,
    candidate_selection_strategy="pareto",   # or "current_best"
    reflection_lm=None,                      # required in practice
    skip_perfect_score=True,
    add_format_failure_as_feedback=False,
    instruction_proposer=None,               # custom ProposalFn
    component_selector="round_robin",        # or a callable
    use_merge=True,
    max_merge_invocations=5,
    num_threads=None,
    failure_score=0.0,
    perfect_score=1.0,
    log_dir=None,
    track_stats=False,
    use_wandb=False,
    wandb_api_key=None,                      # overrides WANDB_API_KEY env var
    wandb_init_kwargs=None,                  # dict forwarded to wandb.init(...)
    track_best_outputs=False,
    warn_on_score_mismatch=True,
    use_mlflow=False,
    seed=0,
    gepa_kwargs=None,                        # e.g. {"use_cloudpickle": True} for dynamic signatures
)

.compile(student, *, trainset, valset=None, teacher=None) — teacher is not currently used.

Data split guidance

DSPy's general prompt-optimizer docs often recommend a validation-heavy split, such as 20% train / 80% validation, because small prompt optimizers can overfit tiny trainsets. GEPA is different: maximize the training set and reserve only enough validation examples to represent downstream behavior. The Pareto frontier still needs a real valset, but GEPA learns from traces and textual feedback on training examples, so starving trainset hurts.
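A train-heavy split along these lines might look like the sketch below. The function name and the fixed validation size are assumptions for illustration; integers stand in for dspy.Example objects so the snippet runs on its own.

```python
import random


def train_heavy_split(examples, val_size, seed=0):
    """Shuffle once, hold out `val_size` examples for validation, train on the rest."""
    examples = list(examples)
    if not 0 < val_size < len(examples):
        raise ValueError("val_size must leave at least one example on each side")
    random.Random(seed).shuffle(examples)
    valset = examples[:val_size]
    trainset = examples[val_size:]
    return trainset, valset


# Stand-in data: integers in place of dspy.Example objects.
trainset, valset = train_heavy_split(range(100), val_size=20)
print(len(trainset), len(valset))
```

Shuffling with a fixed seed keeps the split reproducible across runs, and slicing one shuffled list guarantees that no example lands in both sets.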

BetterTogether in DSPy 3.2.x

If you want a multi-stage optimizer loop, DSPy 3.2.0's BetterTogether now accepts arbitrary named optimizers instead of the older fixed prompt_optimizer / weight_optimizer pair:

optimizer = dspy.BetterTogether(
    metric=rich_metric,
    bootstrap=dspy.BootstrapFewShotWithRandomSearch(metric=rich_metric),
    gepa=dspy.GEPA(metric=rich_metric, auto="light", reflection_lm=reflection_lm),
)

optimized = optimizer.compile(
    student=program,
    trainset=trainset,
    valset=valset,
    strategy="bootstrap -> gepa",
)

Pass strategy= explicitly when you use named stages like bootstrap=... and gepa=.... DSPy 3.2.0's default strategy is still "p -> w -> p", which only works if your optimizer keys are literally p and w.

Keep plain GEPA as the default first pass. Reach for BetterTogether only when you have a specific reason to chain optimizers and want the valset to pick the best intermediate program.
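The naming rule for strategy strings can be checked before any optimizer runs. The helper below is a hypothetical pre-flight check written for this guide, not DSPy's own parser, and it assumes stages are separated by "->" as in the example above.

```python
def parse_strategy(strategy, optimizer_names):
    """Split a strategy string such as "bootstrap -> gepa" into stage names and
    check each stage against the names the optimizers were registered under."""
    stages = [stage.strip() for stage in strategy.split("->")]
    unknown = [stage for stage in stages if stage not in optimizer_names]
    if unknown:
        raise ValueError(
            f"strategy references unknown optimizers {unknown}; "
            f"known names are {sorted(optimizer_names)}"
        )
    return stages


names = {"bootstrap", "gepa"}
stages = parse_strategy("bootstrap -> gepa", names)

# The default strategy string fails here because the keys are not p and w.
try:
    parse_strategy("p -> w -> p", names)
    default_ok = True
except ValueError:
    default_ok = False
```

With optimizers named bootstrap and gepa, the explicit strategy passes and the default "p -> w -> p" is rejected, which mirrors the failure mode described above.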

When GEPA > MIPROv2

Your metric can produce specific, teachable critiques (GEPA's superpower).
The program has multiple predictors that need targeted improvements (GEPA gives per-predictor feedback; MIPRO doesn't).
Rollout budget is small (GEPA converges faster with rich feedback).

When MIPROv2 > GEPA

Metric is scalar-only (no signal to reflect on) — use dspy.MIPROv2.
You want pure few-shot bootstrapping with no instruction mutation.
Very large trainset (500+) where Bayesian search over demos pays off.

When SIMBA is worth trying

dspy.SIMBA is a lighter reflective optimizer. Try it when you want a cheaper reflective pass than GEPA, your program is simple, or you need quick exploration before committing to a full GEPA run. Keep GEPA as the default for multi-predictor programs where per-predictor feedback and Pareto candidate selection matter.

Resume & checkpointing

log_dir writes candidate programs + scores per round. To resume an interrupted run, point log_dir at the same directory — GEPA picks up from the last checkpoint. Inspect <log_dir>/candidates/ to see every proposed program.

Inference-time best-of with `track_best_outputs`

With track_best_outputs=True, GEPA records, per task, the best prediction seen across all candidates. At inference time on held-out data, you can ensemble or select among the top-Pareto programs for robustness. Access via optimized.detailed_results.best_outputs_valset.

Anti-patterns

Float-only metric ("score is 0.7") with no feedback — GEPA collapses to random search.
Same set used for train and val — Pareto selection overfits.
reflection_lm = small model — it can't critique; use the strongest LM you can afford for this role.
Running auto="heavy" on an untested metric — burn money to learn the metric was bugged. Run auto="light" first.
Ignoring log_dir — losing a 4-hour run to a disconnect is very painful.

Gotcha: `reflection_lm` is required at construction, not compile

dspy.GEPA(...) asserts at init time that either reflection_lm or a custom instruction_proposer is provided; you cannot defer it to .compile(). If you see

AssertionError: GEPA requires a reflection language model...

add reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000) to the constructor, or substitute the strongest instruction-following model available on your provider. dspy.LM(...) is a cheap stub until you actually call it, so constructing one doesn't hit the network.
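Both constructor rules from this guide (one budget knob, and a reflection LM or custom proposer) can be checked on a plain dict of arguments before calling dspy.GEPA. The helper below is a hypothetical sketch. It assumes exactly one of auto, max_full_evals, and max_metric_calls should be set, which is a stricter reading than the "not both" wording in the budget section.

```python
def check_gepa_kwargs(kwargs):
    """Return a list of problems found in a dict of dspy.GEPA constructor arguments."""
    problems = []
    budget_keys = ("auto", "max_full_evals", "max_metric_calls")
    set_budgets = [key for key in budget_keys if kwargs.get(key) is not None]
    if len(set_budgets) != 1:
        # Assumption: exactly one budget knob may be set at a time.
        problems.append(
            f"set exactly one of {budget_keys}; got {set_budgets or 'none'}"
        )
    if kwargs.get("reflection_lm") is None and kwargs.get("instruction_proposer") is None:
        problems.append("pass reflection_lm (or a custom instruction_proposer)")
    return problems


# Any non-None object stands in for a real dspy.LM here.
placeholder_lm = object()
ok = check_gepa_kwargs({"max_metric_calls": 500, "reflection_lm": placeholder_lm})
missing_lm = check_gepa_kwargs({"auto": "light"})
two_budgets = check_gepa_kwargs(
    {"auto": "light", "max_full_evals": 5, "reflection_lm": placeholder_lm}
)
```

An empty list means the arguments satisfy both rules; otherwise each entry describes one thing to fix before constructing the optimizer.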

Next

Build the metric → dspy-evaluation-harness.
End-to-end pipeline → dspy-advanced-workflow.
Parameter reference → reference.md.
Runnable example → example_gepa.py.
BetterTogether chaining example → example_bettertogether.py.
