| name | dspy-advanced-workflow |
| description | Drive a complete DSPy 3.2.x project end-to-end — spec → program → metric → baseline → GEPA optimize → export → deploy. Orchestrates the other four DSPy skills (dspy-fundamentals, dspy-evaluation-harness, dspy-gepa-optimizer, dspy-rlm-module) in the correct order. Use this for any non-trivial DSPy build from scratch. |
| when_to_use | User wants to build, optimize, and ship a new DSPy pipeline; says "full workflow" / "end to end" / "from scratch"; or needs the standard loop applied to a greenfield task. |
DSPy Advanced Workflow (2026)
This skill runs the seven-step loop that turns a natural-language task description into an optimized, saved, deployable DSPy program. Every step delegates to a specific skill — invoke them in order.
The seven steps
1. Spec
Rephrase the user's task in one sentence. Identify inputs, outputs, the quality axis that matters, and any constraints (latency, cost, tool access, context size). Pick predictor shape:
| Task shape | Predictor |
|---|---|
| Single-step structured I/O | dspy.Predict / dspy.ChainOfThought |
| Tool use / multi-step | dspy.ReAct |
| Code execution | dspy.ProgramOfThought |
| Long context / codebase | dspy.RLM → dspy-rlm-module |
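The spec can be written down as a small record before any DSPy code exists. The following is a minimal sketch, assuming a hypothetical `TaskSpec` dataclass and `choose_predictor` helper (neither is part of DSPy). The mapping mirrors the table above and returns predictor names as plain strings; the priority order when several flags are set is an assumption.

```python
from dataclasses import dataclass, field


# Hypothetical helper type for illustration; not part of DSPy.
@dataclass
class TaskSpec:
    summary: str                      # the task rephrased in one sentence
    inputs: list[str]
    outputs: list[str]
    quality_axis: str                 # e.g. "accuracy", "faithfulness"
    constraints: dict[str, str] = field(default_factory=dict)
    needs_tools: bool = False
    needs_code_execution: bool = False
    long_context: bool = False


def choose_predictor(spec: TaskSpec) -> str:
    """Return the predictor name suggested by the task-shape table."""
    # Assumed priority: long context first, then code execution, then tools.
    if spec.long_context:
        return "dspy.RLM"
    if spec.needs_code_execution:
        return "dspy.ProgramOfThought"
    if spec.needs_tools:
        return "dspy.ReAct"
    # Single-step structured I/O; dspy.Predict is the lighter alternative.
    return "dspy.ChainOfThought"
```

Keeping the spec as data makes it easy to store next to the run artifacts, so later readers can see which constraints drove the predictor choice.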
2. Program
Write the typed dspy.Signature + dspy.Module subclass per dspy-fundamentals. No hard-coded prompts. Keep predictors named so GEPA can target them.
3. Data
Build a trainset and a separate valset as lists of dspy.Example(...).with_inputs(...). For GEPA, maximize trainset size and keep the valset just large enough to represent downstream behavior. Also set aside a held-out testset and report on it only once, at the end. See dspy-evaluation-harness.
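One way to get the three sets is a single seeded shuffle with fixed-size validation and test slices, so everything left over goes to training. This is a sketch under assumptions: `split_examples` is a hypothetical helper, and the default sizes are placeholders rather than recommendations.

```python
import random


def split_examples(examples, val_size=50, test_size=50, seed=0):
    """Shuffle once, carve out fixed-size val/test sets, give the rest to train."""
    if val_size + test_size >= len(examples):
        raise ValueError("not enough examples for the requested val/test sizes")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    testset = shuffled[:test_size]
    valset = shuffled[test_size:test_size + val_size]
    trainset = shuffled[test_size + val_size:]
    return trainset, valset, testset
```

The helper works on any list, so it can be applied to raw rows before converting each one with dspy.Example(...).with_inputs(...), or to the converted examples directly.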
4. Rich metric
Write rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None) returning dspy.Prediction(score=0..1, feedback="natural-language critique"). The feedback is load-bearing — it's what GEPA's reflection LM learns from. A dict with the same fields crashes dspy.Evaluate; only dspy.Prediction aggregates correctly. See dspy-evaluation-harness.
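The scoring logic can live in a plain function that returns a score and a critique, which rich_metric then wraps in dspy.Prediction. Below is a minimal sketch, assuming a text answer compared by token-overlap F1. `score_with_feedback` is a hypothetical helper, and the scoring rule is only one possible choice for the quality axis.

```python
def score_with_feedback(expected: str, actual: str) -> tuple[float, str]:
    """Score an answer in [0, 1] and explain the score in plain language."""
    exp_tokens = expected.lower().split()
    act_tokens = actual.lower().split()
    if exp_tokens == act_tokens:
        return 1.0, "Correct: the answer matches the reference."

    exp_set, act_set = set(exp_tokens), set(act_tokens)
    overlap = exp_set & act_set
    if not overlap:
        return 0.0, f"No overlap with the reference. Expected {expected!r}, got {actual!r}."

    # Token-level F1 over unique tokens.
    precision = len(overlap) / len(act_set)
    recall = len(overlap) / len(exp_set)
    score = 2 * precision * recall / (precision + recall)

    missing = sorted(exp_set - act_set)
    extra = sorted(act_set - exp_set)
    feedback = (
        f"Partial token overlap (F1={score:.2f}). "
        f"Missing from the answer: {missing or 'nothing'}. "
        f"Not in the reference: {extra or 'nothing'}."
    )
    return score, feedback
```

Inside rich_metric, the two return values would be passed on as dspy.Prediction(score=score, feedback=feedback). Naming the missing and unneeded tokens gives the reflection LM something concrete to act on, which a bare number does not.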
5. Baseline
evaluator = dspy.Evaluate(devset=valset, metric=rich_metric,
                          num_threads=8, display_progress=True,
                          provide_traceback=True,
                          save_as_json="runs/baseline.json")
baseline = evaluator(program)
print("Baseline:", baseline.score)
6. GEPA optimize
reflection_lm = dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000)
optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="medium",
    reflection_lm=reflection_lm,
    candidate_selection_strategy="pareto",
    track_stats=True,
    track_best_outputs=True,
    log_dir="./gepa_logs",
    num_threads=8,
    seed=0,
)
optimized = optimizer.compile(student=program, trainset=trainset, valset=valset)
print("Optimized:", evaluator(optimized).score)
Run auto="light" first as a sanity check; move to auto="medium"/"heavy" for the final run. See dspy-gepa-optimizer.
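The light-then-heavier progression can be written as a small gate. This is a sketch under assumptions: `run_stage` is a hypothetical callable you supply that compiles with the given auto budget and returns the validation score. Stopping when the light pass shows no gain is an assumed policy, not something GEPA enforces.

```python
from typing import Callable


def staged_optimize(run_stage: Callable[[str], float], baseline_score: float,
                    final_budget: str = "medium", min_gain: float = 0.0) -> dict:
    """Run a cheap 'light' pass first; spend the final budget only if it helps."""
    if final_budget not in ("medium", "heavy"):
        raise ValueError("final_budget must be 'medium' or 'heavy'")

    light_score = run_stage("light")
    result = {"baseline": baseline_score, "light": light_score}

    if light_score - baseline_score <= min_gain:
        # No gain from the sanity check: inspect the metric, feedback,
        # and data before paying for a longer run.
        result["status"] = "stopped_after_light"
        return result

    result[final_budget] = run_stage(final_budget)
    result["status"] = "completed"
    return result
```

In a real project, `run_stage` would build a dspy.GEPA optimizer with the requested auto value, call compile, and return evaluator(optimized).score.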
If you need a deliberate multi-stage compile loop, DSPy 3.2.x also exposes dspy.BetterTogether(metric=..., bootstrap=..., gepa=...) for chaining named optimizers after you have a clean baseline GEPA setup.
7. Export & deploy
optimized.save("artifacts/program.json", save_program=False)
optimized.save("artifacts/program_dir/", save_program=True)
Deploy:
- Load with dspy.load("artifacts/program_dir/"), or reconstruct the module and call .load("artifacts/program.json").
- Wrap in a FastAPI service or a CLI.
- Enable track_usage=True in dspy.configure(...) for cost/latency observability.
- Log with MLflow (mlflow.dspy.autolog()) or W&B in CI.
- Keep an offline regression test that runs the evaluator against the saved program and fails CI below a threshold.
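The regression test in the last bullet comes down to a comparison against a stored reference score. Here is a minimal sketch, assuming a hypothetical `check_regression` helper and an arbitrary default tolerance; how the current score is obtained is left to the caller.

```python
def check_regression(current: float, reference: float, tolerance: float = 0.02) -> None:
    """Raise if the current score falls more than `tolerance` below the reference."""
    floor = reference - tolerance
    if current < floor:
        raise AssertionError(
            f"score {current:.3f} is below the allowed floor {floor:.3f} "
            f"(reference {reference:.3f}, tolerance {tolerance:.3f})"
        )
```

In CI, `current` would typically be the .score of the evaluator run on the loaded program, and `reference` the score recorded when the artifact was saved. A small tolerance absorbs run-to-run noise from the LM without hiding real drops.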
Full orchestration template
"""DSPy end-to-end pipeline — spec → optimize → deploy."""
import dspy
from pathlib import Path


class MyTask(dspy.Signature):
    """<one-line instruction from the spec>."""
    input_field: str = dspy.InputField()
    output_field: str = dspy.OutputField()


class MyProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.step = dspy.ChainOfThought(MyTask)

    def forward(self, **kw):
        return self.step(**kw)


# Placeholders: lists of dspy.Example(...).with_inputs(...) from step 3.
trainset = [...]
valset = [...]


def rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = ...     # placeholder: float in [0, 1]
    feedback = ...  # placeholder: natural-language critique for the reflection LM
    return dspy.Prediction(score=score, feedback=feedback)


dspy.configure(lm=dspy.LM("openai/gpt-4o"), track_usage=True)

# Baseline on the validation set before any optimization.
evaluator = dspy.Evaluate(devset=valset, metric=rich_metric, num_threads=8,
                          display_progress=True, provide_traceback=True,
                          save_as_json="runs/baseline.json")
program = MyProgram()
print("Baseline:", evaluator(program).score)

optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="medium",
    reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000),
    candidate_selection_strategy="pareto",
    track_stats=True, track_best_outputs=True,
    log_dir="./gepa_logs", num_threads=8, seed=0,
)
optimized = optimizer.compile(student=program, trainset=trainset, valset=valset)
print("Optimized:", evaluator(optimized).score)

# Save the optimized state as JSON.
Path("artifacts").mkdir(exist_ok=True)
optimized.save("artifacts/program.json", save_program=False)
Guardrails
- Never skip step 4 (rich metric). GEPA without feedback ≈ random search.
- Always baseline before optimizing — no baseline, no claim.
- Save both pre- and post-optimization metrics to JSON for auditability.
- If the held-out test score drops post-optimization, the program has likely overfit a valset that is too narrow. Expand the valset and re-run.
- Freeze the optimized program with module._compiled = True before multi-stage re-compilation.
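The auditability guardrail can be met with one small JSON record per run. This is a sketch, assuming a hypothetical `write_metrics_record` helper and an illustrative record layout; it is separate from whatever dspy.Evaluate writes through save_as_json.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_metrics_record(path, baseline_score, optimized_score, extra=None):
    """Write pre- and post-optimization scores to a JSON file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "baseline_score": baseline_score,
        "optimized_score": optimized_score,
        "delta": optimized_score - baseline_score,
    }
    if extra:
        record.update(extra)  # e.g. model names, seed, dataset sizes
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return record
```

A call such as write_metrics_record("runs/metrics.json", baseline.score, evaluator(optimized).score, extra={"seed": 0}) would leave a file that a later regression check can read its reference score from.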