ai-improving-accuracy

Stars: 6

Forks: 1

Updated: June 13, 2026 at 13:41

Measure and improve how well your AI works. Use when AI gives wrong answers, accuracy is bad, responses are unreliable, you need to test AI quality, evaluate your AI, write metrics, benchmark performance, optimize prompts, improve results, or systematically make your AI better. Also used for spent hours tweaking prompts, trial and error prompt engineering is not working, quality plateaued early, stale prompts everywhere in your codebase, my AI is only 60% accurate, how to measure AI quality, AI evaluation framework, benchmark my LLM, prompt optimization not working, systematic way to improve AI, AI accuracy plateaued, DSPy optimizer tutorial, MIPROv2 optimization, how to go from 70% to 90% accuracy.


Source

lebsral/DSPy-Programming-not-prompting-LMs-skills (GitHub repository)

Related occupations (SOC)

Based on the SOC occupational classification

Software Developers · Computer and Mathematical Occupations · SOC 15-1252

SKILL.md

More from this repository

ai-auditing-code

lebsral/DSPy-Programming-not-prompting-LMs-skills

Review DSPy code for correctness and best practices. Use when you want a code review of your DSPy program, need to check if your AI code follows best practices, want to find anti-patterns in your DSPy usage, or need a quality audit of your AI implementation. Also use for DSPy code review, is my DSPy code correct, review my AI code, best practices check, DSPy anti-patterns, code quality audit, am I using DSPy right, sanity check my AI code, peer review my DSPy program, does this follow DSPy conventions.

2026-06-13 · 6 stars

ai-checking-outputs

lebsral/DSPy-Programming-not-prompting-LMs-skills

Verify and validate AI output before it reaches users. Use when you need guardrails, output validation, safety checks, content filtering, fact-checking AI responses, catching hallucinations, preventing bad outputs, or quality gates. Also used for - AI output looks right but is wrong, how to validate JSON from LLM, LLM returns invalid data, catch bad AI outputs before users see them, output quality gate, AI guardrails for production, verify LLM did not hallucinate fields, post-processing LLM responses. Uses dspy.Refine (iterative with feedback) and dspy.BestOfN (sampling, pick best).

2026-06-13 · 6 stars

ai-choosing-architecture

lebsral/DSPy-Programming-not-prompting-LMs-skills

Pick the right DSPy module and architecture for your AI feature. Use when you are not sure whether to use Predict, ChainOfThought, ReAct, or a pipeline, need to choose between DSPy patterns, want architecture advice for your AI feature, or are deciding between a single module and a multi-step pipeline. Also use for which DSPy module should I use, Predict vs ChainOfThought, when to use ReAct, single module vs pipeline, DSPy architecture decision, CoT vs PoT vs ReAct, do I need a pipeline, module selection guide, DSPy pattern selection, how to structure my DSPy program.

2026-06-13 · 6 stars

ai-cleaning-data

lebsral/DSPy-Programming-not-prompting-LMs-skills

Normalize and fix messy data fields using AI. Use when normalizing addresses, standardizing company names, fixing inconsistent date formats, cleaning CSV data before import, correcting typos in bulk data, normalizing phone number formats, standardizing job titles, cleaning up free-text fields, data quality improvement with AI, fixing formatting inconsistencies, bulk data normalization, preparing messy data for analysis, AI-powered data wrangling.

2026-06-13 · 6 stars

ai-cutting-costs

lebsral/DSPy-Programming-not-prompting-LMs-skills

Reduce your AI API bill. Use when AI costs are too high, API calls are too expensive, you want to use cheaper models, optimize token usage, reduce LLM spending, route easy questions to cheap models, or make your AI feature more cost-effective. Also used for GPT-4 costs too much for production, AI bill keeps growing, how to reduce OpenAI costs, optimize LLM token usage, smart model routing saves money, prompt is too long and expensive, cheaper than GPT-4 with same quality.

2026-06-13 · 6 stars

ai-do

lebsral/DSPy-Programming-not-prompting-LMs-skills

Describe your AI problem and get routed to the right skill with a ready-to-use prompt. Use when you are not sure which ai- skill to use, want help picking the right approach, or just want to describe what you need in plain language. Also use this when someone says I want to build an AI that..., how do I make my AI..., or describes any AI/LLM task without naming a specific skill, I need AI but do not know where to start, which AI pattern should I use, what is the best way to add AI to my app, recommend an AI approach, AI feature discovery, too many AI options, overwhelmed by AI frameworks, just tell me what to build, new to DSPy, beginner AI project help, which LLM pattern fits my use case, confused about AI architecture, help me figure out my AI approach.

2026-06-13 · 6 stars

name: ai-improving-accuracy

Measure and Improve Your AI

Guide the user through measuring how well their AI works, then systematically improving it. This is a loop: define "good" -> measure -> improve -> verify.

The Workflow

1. Define what "good" means: write a metric
2. Measure current quality: run an evaluation
3. Improve: choose an optimizer, run it
4. Verify: re-evaluate to confirm improvement
5. Iterate or ship

Step 1: Understand the problem

Ask the user:

What does your AI get wrong? (wrong answers, wrong format, inconsistent, too slow?)
Do you have labeled examples? (how many? what format?)
How do you know when an answer is good? (exact match, partial credit, human judgment?)
Have you tried optimization before? (if yes, what and what happened?)

If the user does not have labeled data, point them to /ai-generating-data first.

Step 2: Define what "good" means (write a metric)

A metric takes the labeled example (with the expected answer) and the program's prediction, and returns a score: a bool, or a float between 0.0 and 1.0.

Exact match (simplest)

def metric(example, prediction, trace=None):
    return prediction.answer == example.answer

Normalized match (handles capitalization/whitespace)

def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

Partial credit (for multi-field outputs)

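def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]

    def norm(obj, field):
        # Treat a missing or None field as an empty string so .lower() cannot fail
        return str(getattr(obj, field, "") or "").strip().lower()

    # Fraction of fields where the prediction matches the expected value
    correct = sum(1 for f in fields if norm(prediction, f) == norm(example, f))
    return correct / len(fields)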

F1 score (for text overlap)

def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
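
As a quick sanity check of the F1 metric above, the sketch below runs it on one hand-made pair and spells out the arithmetic in comments. `SimpleNamespace` stands in for a `dspy.Example` and a prediction object (an assumption for illustration; any object with an `answer` attribute behaves the same here), and the example strings are made up.

```python
from types import SimpleNamespace

def f1_metric(example, prediction, trace=None):
    """Token-set F1 between the gold answer and the predicted answer."""
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    overlap = len(gold_tokens & pred_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical stand-ins for a labeled example and a program's prediction.
gold = SimpleNamespace(answer="a framework for LM programs")
pred = SimpleNamespace(answer="a framework for prompts")

# gold tokens: {a, framework, for, lm, programs} -> 5 tokens
# pred tokens: {a, framework, for, prompts}      -> 4 tokens, 3 shared
# precision = 3/4, recall = 3/5, F1 = 2 * 0.75 * 0.6 / 1.35 = 2/3
score = f1_metric(gold, pred)
print(round(score, 3))
```

Running a metric on a few pairs where the right score is known by hand is a cheap way to catch scoring bugs before a full evaluation.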

AI-as-judge (for open-ended tasks)

When exact match is too strict (summaries, creative tasks, open-ended Q&A):

import dspy

class AssessQuality(dspy.Signature):
    """Assess if the predicted answer is correct and complete."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

# Build the judge once and reuse it across metric calls
judge = dspy.Predict(AssessQuality)

def metric(example, prediction, trace=None):
    # Ask the judge LM whether the prediction matches the gold answer
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Training-aware metric

The trace parameter is not None during optimization. Use it for stricter requirements during training:

def metric(example, prediction, trace=None):
    correct = prediction.answer == example.answer
    if trace is not None:
        # During optimization, also require good reasoning.
        # The reasoning field only exists for modules that produce it (e.g. dspy.ChainOfThought).
        reasoning = getattr(prediction, "reasoning", "") or ""
        return correct and len(reasoning) > 50
    return correct

Step 3: Measure current quality (run evaluation)

import dspy
from dspy.evaluate import Evaluate

devset = [
    dspy.Example(question="What is DSPy?", answer="A framework for LM programs").with_inputs("question"),
    # 50-200+ examples for reliable evaluation
]

evaluator = Evaluate(
    devset=devset,
    metric=metric,
    num_threads=4,
    display_progress=True,
    display_table=5,   # show 5 example results
)

baseline_score = evaluator(my_program)
print(f"Baseline: {baseline_score}")
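
Conceptually, the evaluation above runs the program on every devset example, scores each output with the metric, and averages the results. The sketch below mimics that loop in plain Python so the arithmetic is visible. It is not DSPy's implementation: `toy_program`, the dict-based examples, and this `evaluate` function are hypothetical stand-ins, and reporting the mean as a percentage is an assumption chosen to match the `%` formatting used later in this guide.

```python
def exact_match(example, prediction):
    """Normalized exact match: ignore case and surrounding whitespace."""
    return prediction["answer"].strip().lower() == example["answer"].strip().lower()

def toy_program(question):
    """Hypothetical stand-in for an AI program: a fixed lookup table."""
    canned = {
        "2+2?": "4",
        "Capital of France?": "paris ",
        "Largest planet?": "Saturn",
    }
    return {"answer": canned.get(question, "")}

def evaluate(program, devset, metric):
    """Return (percentage score, per-example rows) so rows can be inspected."""
    rows = []
    for example in devset:
        prediction = program(example["question"])
        row_score = float(metric(example, prediction))
        rows.append((example["question"], prediction["answer"], row_score))
    score = 100.0 * sum(row[2] for row in rows) / len(rows)
    return score, rows

devset = [
    {"question": "2+2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Largest planet?", "answer": "Jupiter"},
    {"question": "Author of Hamlet?", "answer": "Shakespeare"},
]

score, rows = evaluate(toy_program, devset, exact_match)
print(f"Score: {score:.1f}%")
for question, answer, row_score in rows:
    print(f"{row_score:.0f}  {question!r} -> {answer!r}")
```

Printing the per-example rows next to the aggregate plays the same role as `display_table`: it shows which examples earned credit, not just how many.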

Step 4: Improve (choose an optimizer)

Training examples	Recommended optimizer	Expected improvement
<100	GEPA (instruction tuning, feedback-driven)	10-25%
20-50	BootstrapFewShot	5-20%
50-200	BootstrapFewShot, then MIPROv2	15-35%
200-500	MIPROv2 (auto="medium")	20-40%
500+	MIPROv2 (auto="heavy") or BootstrapFinetune	25-50%

Stacking tip: Run BootstrapFewShot first, then MIPROv2 on the result. Bootstrap finds good examples, then MIPRO refines the instructions.
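
The stacking tip can be sketched as a small helper, using the same optimizer arguments shown in the sections below. `stack_optimizers` is a hypothetical name, not part of DSPy, and the sketch assumes DSPy is installed, an LM has already been configured, and `program`, `trainset`, and `metric` are defined as elsewhere in this guide. The `auto="light"` setting is an arbitrary choice to keep the second stage cheap.

```python
def stack_optimizers(program, trainset, metric):
    """Run BootstrapFewShot first, then MIPROv2 on the bootstrapped program."""
    import dspy  # imported lazily so this helper can be defined without DSPy installed

    # Stage 1: collect few-shot demos from traces that pass the metric.
    bootstrap = dspy.BootstrapFewShot(
        metric=metric,
        max_bootstrapped_demos=4,
        max_labeled_demos=4,
    )
    bootstrapped = bootstrap.compile(program, trainset=trainset)

    # Stage 2: refine instructions (and demos), starting from the stage-1 result.
    mipro = dspy.MIPROv2(metric=metric, auto="light")
    return mipro.compile(bootstrapped, trainset=trainset)
```

A call would look like `optimized = stack_optimizers(my_program, trainset, metric)`. Evaluating after each stage shows whether the second stage actually adds anything over the first.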

BootstrapFewShot (start here)

Fast, cheap. Finds good examples by running your program and keeping successful traces.

optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,   # default is 16 — lower for small datasets
)
optimized = optimizer.compile(my_program, trainset=trainset)

MIPROv2 (recommended for most cases)

Optimizes both instructions and examples. Best general-purpose optimizer.

optimizer = dspy.MIPROv2(
    metric=metric,
    auto="medium",    # "light" (default), "medium", "heavy"
)
optimized = optimizer.compile(my_program, trainset=trainset)

BootstrapFinetune (maximum quality)

Fine-tunes model weights. Requires 500+ examples and a fine-tunable model:

optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
optimized = optimizer.compile(my_program, trainset=trainset)

For the full fine-tuning workflow, see /ai-fine-tuning.

When optimization plateaus

Symptom	Likely cause	Fix
Score stuck at 60-70%	Task too complex for single step	Break into subtasks — see `/ai-reasoning`
Optimizer overfits (train high, dev flat)	Too little training data	Generate more examples — see `/ai-generating-data`
Score varies wildly between runs	Non-deterministic metric or small devset	Increase devset to 100+, set temperature=0
Score high but users complain	Metric does not match real quality	Rewrite metric based on actual failure patterns
Score high on devset but poor on new data	Optimized prompts overfit to validation distribution	Hold out a separate test set. Research shows 10-15% degradation on unseen distributions (arxiv 2507.19457). Re-optimize if switching data domains
Optimizer returns same score as baseline	Task is saturated for this model	Try harder examples or a weaker task LM to create optimization signal. See `/dspy-gepa`

Optimized prompts are model-specific. If you change models, re-run your optimizer.
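
The plateau table recommends holding out a separate test set. One minimal way to do that, assuming the labeled examples fit in a single list, is a seeded shuffle followed by three slices. `split_examples` and its default fractions are hypothetical choices for illustration, not a DSPy API.

```python
import random

def split_examples(examples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test lists."""
    if not 0 < train_frac < 1 or not 0 < dev_frac < 1 or train_frac + dev_frac >= 1:
        raise ValueError("fractions must be in (0, 1) and leave room for a test set")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    trainset = shuffled[:n_train]
    devset = shuffled[n_train:n_train + n_dev]
    testset = shuffled[n_train + n_dev:]   # whatever remains is held out
    return trainset, devset, testset

examples = [f"example-{i}" for i in range(100)]  # placeholder items
trainset, devset, testset = split_examples(examples)
print(len(trainset), len(devset), len(testset))
```

The optimizer sees only `trainset`, iteration decisions use `devset`, and `testset` is scored once at the end so it stays an honest estimate of performance on new data.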

Step 5: Verify and ship

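def as_percent(result):
    # Depending on the DSPy version, the evaluator returns a float or a result object with a .score attribute
    return float(getattr(result, "score", result))

baseline = as_percent(baseline_score)
optimized_score = as_percent(evaluator(optimized))
print(f"Baseline: {baseline:.1f}%")
print(f"Optimized: {optimized_score:.1f}%")
print(f"Improvement: {optimized_score - baseline:.1f} percentage points")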

# Save
optimized.save("optimized_program.json")

# Load later
my_program = MyProgram()
my_program.load("optimized_program.json")

Gotchas

Claude writes metrics that return strings instead of floats or bools. DSPy metrics must return a numeric score (float 0.0-1.0) or a boolean. Returning a string like "correct" silently breaks evaluation — the score will be 0 for every example.
Claude forgets .with_inputs() on evaluation Examples. Every dspy.Example must call .with_inputs("field1", ...) to mark input fields. Without this, the evaluator passes all fields (including the expected output) to the program, inflating scores because the model sees the answer.
Claude uses the same data for training and evaluation. Always split into trainset and devset. Evaluating on training data gives misleadingly high scores — the optimizer may have memorized those exact examples.
AI-as-judge metrics are slow and expensive during optimization. Each training example triggers a separate LM call for the judge. For a 200-example trainset with MIPROv2 auto="medium", this can add thousands of extra LM calls. Use exact-match or F1 metrics during optimization, then validate with AI-as-judge on the final result.
display_table reveals metric bugs that the score hides. A 75% score looks reasonable, but display_table=10 might show the metric gives credit for completely wrong answers that happen to match on whitespace. Always inspect individual predictions before trusting aggregate scores.
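
To guard against the first gotcha, a metric can be wrapped so that a wrong return type fails loudly instead of distorting the score. This is a minimal sketch: `checked` is a hypothetical helper, not part of DSPy, and the toy metrics compare plain strings rather than DSPy objects.

```python
def checked(metric):
    """Wrap a metric so a non-numeric or out-of-range return value raises immediately."""
    def wrapper(example, prediction, trace=None):
        value = metric(example, prediction, trace)
        if isinstance(value, bool):
            return value
        if isinstance(value, (int, float)) and 0.0 <= value <= 1.0:
            return float(value)
        raise TypeError(
            f"metric returned {value!r} ({type(value).__name__}); "
            "expected a bool or a float in [0, 1]"
        )
    return wrapper

def bad_metric(example, prediction, trace=None):
    # Bug: returns a label string instead of a score.
    return "correct" if example == prediction else "wrong"

def good_metric(example, prediction, trace=None):
    return example == prediction

safe_good = checked(good_metric)
safe_bad = checked(bad_metric)

print(safe_good("4", "4"))
try:
    safe_bad("4", "4")
except TypeError as err:
    print(f"caught: {err}")
```

Passing `checked(metric)` wherever `metric` is expected keeps the rest of the workflow unchanged while surfacing the bug on the first example.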

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

See optimization progress live -- see /ai-watching-optimization
Cost reduction once quality is good -- see /ai-cutting-costs
Production monitoring to track quality after deployment -- see /ai-monitoring
Experiment tracking to log and compare optimization runs -- see /ai-tracking-experiments
Generating data when you need more training examples -- see /ai-generating-data
Fixing errors when the AI crashes or throws exceptions -- see /ai-fixing-errors
Signatures for defining typed input/output contracts -- see /dspy-signatures
Optimizers for detailed API on MIPROv2 and BootstrapFewShot -- see /dspy-optimizers
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Additional resources

For optimizer details and metric patterns, see reference.md