ai-fine-tuning

Stars: 6

Forks: 1

Updated: June 13, 2026, 13:41

Fine-tune models on your data to maximize quality and cut costs. Use when prompt optimization hit a ceiling, you need domain specialization, you want cheaper models to match expensive ones, you heard fine-tuning will make us AI-native, you have 500+ training examples, or you need to train on proprietary data. Also use when you have spent weeks of manual iteration with no systematic improvement path, or manual prompt tuning got you to a working system but quality plateaued. Covers DSPy BootstrapFinetune, BetterTogether, model distillation, and when to fine-tune vs optimize prompts, LoRA vs full fine-tune, when to fine-tune vs few-shot, distill GPT-4 into a smaller model, teacher-student model training, custom model training with DSPy, model distillation, make a cheap model as good as GPT-4.

Source: lebsral/DSPy-Programming-not-prompting-LMs-skills (creator: lebsral)

Related occupation (SOC classification): Data Scientists, Computer and Mathematical Occupations, SOC 15-2051

Fine-Tune Models on Your Data

Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.

Should you fine-tune?

Before writing any code, walk through these questions with the user:

Have you optimized prompts first? If not, use /ai-improving-accuracy — prompt optimization is 10x cheaper and often sufficient.
Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
What's the goal — quality or cost?
- Quality: You've maxed out prompt optimization and need more accuracy
- Cost: You want a small cheap model to match an expensive one
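The questions above can be read as an ordered set of gates. The following is a minimal sketch of that ordering, not part of DSPy; the `Readiness` class, the `next_step` function, and the `has_metric` field are hypothetical names introduced here for illustration, and the thresholds are the ones stated in this guide.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    prompts_optimized: bool   # has MIPROv2 or a similar optimizer been run?
    n_labeled: int            # number of labeled examples available
    baseline_accuracy: float  # prompt-optimized accuracy, 0.0 to 1.0
    has_metric: bool          # is there an automated metric?


def next_step(r: Readiness) -> str:
    """Return the first unmet prerequisite, or 'fine-tune' if all are met."""
    if not r.prompts_optimized:
        return "optimize prompts first"
    if not r.has_metric:
        return "define an automated metric"
    if r.n_labeled < 500:
        return "collect or generate more data"
    if r.baseline_accuracy < 0.50:
        return "fix task definition or data"
    return "fine-tune"


print(next_step(Readiness(True, 1200, 0.72, True)))  # fine-tune
print(next_step(Readiness(True, 300, 0.72, True)))   # collect or generate more data
```

The order of the checks mirrors the order of the questions, so the cheapest fix is always suggested first.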

When to fine-tune

You've already optimized prompts with MIPROv2 and hit a ceiling
You have 500+ labeled examples (1000+ is better)
Your baseline is >50% and you need to push higher
You want to distill an expensive model into a cheaper one (10-50x cost savings)
Your domain has specialized vocabulary or patterns the base model doesn't know
You need faster inference (smaller fine-tuned models are faster)

When NOT to fine-tune

You haven't tried prompt optimization yet — start with /ai-improving-accuracy
You have fewer than 500 examples. Use /ai-generating-data to bootstrap synthetic examples, or stay with BootstrapFewShot or MIPROv2 until you have more data
Your baseline is below 50% — your data or task definition needs work
You're still iterating on what the task is — fine-tuning locks you in
You don't have a clear metric — you can't evaluate fine-tuning without one
Your use case changes frequently — fine-tuned models don't adapt to new instructions easily

Prerequisites checklist

Before starting, confirm:

Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
Baseline: Prompt-optimized program with measured accuracy (use /ai-improving-accuracy)
Metric: Clear, automated metric that scores predictions
Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU
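For the budget line, a rough training-cost estimate can be worked out before committing. This sketch assumes billing is proportional to training tokens times epochs; the per-example token count and the epoch count are illustrative assumptions, and the price is the approximate figure quoted above.

```python
def estimate_training_cost(n_examples, avg_tokens_per_example, epochs, usd_per_1k_tokens):
    """Estimate fine-tuning cost as (total trained tokens / 1000) * price per 1K tokens."""
    total_tokens = n_examples * avg_tokens_per_example * epochs
    return total_tokens / 1000 * usd_per_1k_tokens


# Assumed workload: 1000 examples, ~400 tokens each, 3 epochs, ~$0.008 per 1K tokens
cost = estimate_training_cost(1000, 400, 3, 0.008)
print(f"Estimated training cost: ${cost:.2f}")  # 1,200,000 tokens -> $9.60
```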

Step 1: Prepare your data and baseline

Build a strong baseline first

Always compare fine-tuning against a prompt-optimized baseline:

import dspy

lm = dspy.LM("openai/gpt-4o")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# Define your program
class Classify(dspy.Signature):
    """Classify the support ticket."""
    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)

# Prepare data
import json
with open("labeled_data.json") as f:
    data = json.load(f)

examples = [dspy.Example(text=x["text"], category=x["category"]).with_inputs("text") for x in data]

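# Shuffle with a fixed seed so an ordered file does not bias the splits
import random
random.Random(0).shuffle(examples)

# Split: 80% train, 10% dev, 10% test
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]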

# Measure baseline
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")
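When categories are imbalanced, a plain 80/10/10 cut can leave a rare category out of the dev or test set. One possible alternative is a stratified split, sketched below in plain Python. The `stratified_split` helper and the toy `rows` data are hypothetical and not part of DSPy.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle within each label, then cut each label's items by the given fractions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        a = int(len(group) * fractions[0])
        b = int(len(group) * (fractions[0] + fractions[1]))
        train += group[:a]
        dev += group[a:b]
        test += group[b:]
    return train, dev, test


# Toy data: 25 "billing" tickets and 75 "bug" tickets
rows = [{"text": f"ticket {i}", "category": "billing" if i % 4 == 0 else "bug"} for i in range(100)]
train, dev, test = stratified_split(rows, lambda r: r["category"])
print(len(train), len(dev), len(test))
```

Each split keeps roughly the same category proportions as the full dataset, and every item lands in exactly one split.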

Optimize prompts first (your comparison point)

optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")

If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.

Step 2: BootstrapFinetune (core fine-tuning)

The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.

# BootstrapFinetune expects each predictor to have its LM set explicitly
program.set_lm(lm)

optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

# Evaluate the fine-tuned model
finetuned_score = evaluator(finetuned)
print(f"Baseline:         {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned:       {finetuned_score:.1f}%")

How it works

Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
Filter by metric: Only successful traces become training data
Fine-tune weights: Sends traces to the model provider's fine-tuning API
Return optimized program: The program now uses the fine-tuned model
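The first two steps can be illustrated without any LM calls. This is a conceptual sketch only, not DSPy's implementation: `toy_program` is a hypothetical keyword classifier standing in for the LM program, and the prompt/completion record layout is an assumption made for illustration.

```python
from types import SimpleNamespace


def toy_program(text):
    """Hypothetical stand-in for an LM program: returns a prediction with a reasoning trace."""
    category = "billing" if "invoice" in text else "bug"
    return SimpleNamespace(category=category, reasoning=f"Saw keywords in: {text!r}")


def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()


def bootstrap_traces(program, trainset, metric):
    """Run the program on each example and keep only traces the metric accepts."""
    kept = []
    for ex in trainset:
        pred = program(ex.text)
        if metric(ex, pred):
            kept.append({"prompt": ex.text, "completion": f"{pred.reasoning}\n{pred.category}"})
    return kept


trainset = [
    SimpleNamespace(text="invoice is wrong", category="billing"),
    SimpleNamespace(text="app crashes on login", category="bug"),
    SimpleNamespace(text="refund not received", category="billing"),  # toy program gets this wrong
]
traces = bootstrap_traces(toy_program, trainset, metric)
print(f"{len(traces)}/{len(trainset)} traces kept")  # 2/3 traces kept
```

Only the examples the program already solves become training data, which is why a weak base model yields a small fine-tuning set.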

Requirements

A fine-tunable model (OpenAI gpt-4o-mini, gpt-4o; or local open-source models)
500+ training examples (more traces bootstrapped = better fine-tuning)
A metric that reliably identifies good outputs

Step 3: Model distillation (expensive to cheap)

Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.

Teacher-student pattern

# Step 1: Teacher — expensive model, high quality
teacher_lm = dspy.LM("openai/gpt-4o")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=teacher_lm)

# Build and optimize the teacher
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)

teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

# Step 2: Student, a cheap model fine-tuned on the teacher's outputs
student_lm = dspy.LM("openai/gpt-4o-mini")  # or another fine-tunable model
dspy.configure(lm=student_lm)

# Pin each program to its own LM so the teacher keeps generating traces with GPT-4o
teacher_optimized.set_lm(teacher_lm)
student = dspy.ChainOfThought(Classify)
student.set_lm(student_lm)

ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)

student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")

Typical results

Model	Quality	Cost per 1M tokens
GPT-4o (teacher)	85%	~$5.00
GPT-4o-mini (no tuning)	70%	~$0.15
GPT-4o-mini (fine-tuned)	81%	~$0.15

The fine-tuned student costs 33x less and retains ~95% of teacher quality.
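The two figures in that sentence follow directly from the table. A short worked calculation, using only the illustrative numbers above:

```python
teacher_cost, student_cost = 5.00, 0.15    # USD per 1M tokens, from the table
teacher_quality, student_quality = 85, 81  # percent, from the table

cost_ratio = teacher_cost / student_cost
retention = student_quality / teacher_quality

print(f"Cost ratio: {cost_ratio:.1f}x")        # Cost ratio: 33.3x
print(f"Quality retained: {retention:.1%}")    # Quality retained: 95.3%
```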

Small models can dramatically outperform frontier models on narrow tasks. In a Yale project parsing 3.6M historical names, GPT-4 and Gemini achieved ~70% accuracy. Fine-tuned Qwen models (0.8B-4B parameters) hit 94-96%, beating the frontier models by roughly 25 points while running locally. The key insight: for well-defined extraction tasks with enough training data (500K+ synthetic examples in this case), small fine-tuned models can come out well ahead.

Step 4: BetterTogether (maximum quality)

BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.

optimizer = dspy.BetterTogether(
    metric=metric,
    p=dspy.MIPROv2(metric=metric),
    w=dspy.BootstrapFinetune(metric=metric),
)
best = optimizer.compile(program, trainset=trainset, strategy="p -> w -> p")

best_score = evaluator(best)
print(f"Prompt-only:    {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")

How it works

The strategy string "p -> w -> p" controls the sequence — p maps to MIPROv2 (prompt optimizer) and w maps to BootstrapFinetune (weight optimizer):

Round 1 (p): Optimize prompts (instructions + few-shot examples)
Round 2 (w): Fine-tune weights using the optimized prompts
Round 3 (p): Re-optimize prompts for the fine-tuned model
Each round builds on the previous, creating synergy between prompt and weight optimization

If you omit the optimizer kwargs, BetterTogether defaults to p=BootstrapFewShotWithRandomSearch and w=BootstrapFinetune.
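To make the strategy string concrete, the sketch below parses "p -> w -> p" and applies one step per token. It is a conceptual illustration, not DSPy's implementation: `run_strategy` is a hypothetical helper, and the two lambdas are stand-ins that only record which kind of optimization ran.

```python
def run_strategy(strategy, steps, program):
    """Apply each named step in order; every step receives the previous step's output."""
    for name in (part.strip() for part in strategy.split("->")):
        if name not in steps:
            raise ValueError(f"Unknown step {name!r}; expected one of {sorted(steps)}")
        program = steps[name](program)
    return program


# Hypothetical stand-ins that only record what was applied.
steps = {
    "p": lambda prog: prog + ["prompt-optimized"],
    "w": lambda prog: prog + ["weights-fine-tuned"],
}
history = run_strategy("p -> w -> p", steps, [])
print(history)  # ['prompt-optimized', 'weights-fine-tuned', 'prompt-optimized']
```

Because each step consumes the previous step's result, the final prompt optimization runs against the already fine-tuned model.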

When to use BetterTogether

You want the absolute best quality and have the compute budget
Fine-tuning alone didn't close the gap to your quality target
You have 500+ examples and a reliable metric

Step 5: Evaluate and deploy

Thorough evaluation

Always evaluate on the held-out test set (not dev set):

test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)

print("Test set results:")
print(f"  Baseline:         {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned:       {test_evaluator(finetuned):.1f}%")

Save and load for production

# Save
finetuned.save("finetuned_program.json")

# Load later: rebuild the same program structure, then load the saved state
production = dspy.ChainOfThought(Classify)
production.load("finetuned_program.json")
result = production(text="New support ticket...")

When fine-tuning goes wrong

Can't bootstrap enough traces

If the base model fails on most training examples, there aren't enough successful traces to fine-tune on.

Fixes:

Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
Relax your metric during bootstrapping (accept partial credit)
Simplify your task (break multi-step into single steps)
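As one way to relax a metric, the sketch below gives partial credit when only the top-level category matches. The "parent/child" label scheme (for example "billing/refund") is a hypothetical assumption for illustration, and how an optimizer treats fractional scores depends on the DSPy version, so check that behavior before relying on it.

```python
from types import SimpleNamespace


def relaxed_metric(example, prediction, trace=None):
    """Full credit for an exact match, half credit when the top-level category matches."""
    gold = example.category.lower().strip()
    pred = prediction.category.lower().strip()
    if pred == gold:
        return 1.0
    # Hypothetical label scheme "parent/child", e.g. "billing/refund"
    if pred.split("/")[0] == gold.split("/")[0]:
        return 0.5
    return 0.0


gold = SimpleNamespace(category="billing/refund")
print(relaxed_metric(gold, SimpleNamespace(category="Billing/Refund")))   # 1.0
print(relaxed_metric(gold, SimpleNamespace(category="billing/invoice")))  # 0.5
print(relaxed_metric(gold, SimpleNamespace(category="bug/crash")))        # 0.0
```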

Output format errors from small models

Small fine-tuned models (<4B params) often produce JSON syntax errors: unclosed braces, missing quotes, trailing commas. Switching to a YAML output format during fine-tuning avoids this class of error, because YAML has no closing braces or mandatory quoting for a small model to get wrong. Validate the parsed output either way.
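Before changing the output format, it can help to measure how often the model's output actually fails to parse. A minimal sketch using only the standard library; `json_error_rate` and the sample strings are illustrative, with one sample for each error type named above.

```python
import json


def json_error_rate(outputs):
    """Return the fraction of outputs that fail to parse as JSON."""
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs) if outputs else 0.0


samples = [
    '{"category": "billing"}',   # valid
    '{"category": "billing"',    # unclosed brace
    '{"category": billing}',     # missing quotes
    '{"category": "billing",}',  # trailing comma
]
print(f"{json_error_rate(samples):.0%} of outputs are invalid JSON")  # 75% of outputs are invalid JSON
```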

Model overfits (high train accuracy, low test accuracy)

Fixes:

Add more training data
Reduce fine-tuning epochs (if provider allows)
Use a larger base model (less prone to overfitting)
Simplify your output format
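To confirm overfitting before applying these fixes, compare accuracy on the training set with accuracy on the held-out test set. The sketch below flags a large gap; the `overfit_report` helper and the 10-point threshold are assumptions for illustration, not values from DSPy or this guide.

```python
def overfit_report(train_acc, test_acc, max_gap=10.0):
    """Flag a run when train accuracy exceeds test accuracy by more than max_gap points."""
    gap = train_acc - test_acc
    return {"gap": gap, "overfit": gap > max_gap}


print(overfit_report(train_acc=97.0, test_acc=78.0))  # {'gap': 19.0, 'overfit': True}
print(overfit_report(train_acc=86.0, test_acc=83.0))  # {'gap': 3.0, 'overfit': False}
```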

Fine-tuning didn't improve over prompt optimization

Fixes:

Check that bootstrapping produced enough successful traces (need 200+)
Try BetterTogether instead of BootstrapFinetune alone
Verify your metric actually correlates with quality
Try a different base model

Infrastructure choices

OpenAI API (easiest)

Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:

lm = dspy.LM("openai/gpt-4o-mini")  # or any fine-tunable model via API

Pros: No GPU needed, simple setup, fast
Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices

Local fine-tuning (own your model)

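For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA. Note that the together_ai/ prefix in the example below routes requests to Together AI's hosted API, so treat it as a placeholder and point dspy.LM at your own serving endpoint for a fully local setup: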

lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")

Pros: Data stays private, no per-token costs after training, full control
Cons: Needs GPU(s), more setup, slower iteration

Cloud GPU platforms

AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:

Pros: Scalable, no hardware to manage
Cons: Costs vary, setup per platform

Gotchas

Skipping prompt optimization and jumping straight to fine-tuning. Claude defaults to recommending fine-tuning when users mention quality issues. Always confirm the user has run MIPROv2 or similar prompt optimization first — fine-tuning without a prompt-optimized baseline wastes compute and makes it impossible to measure whether fine-tuning actually helped.
Using the dev set for final evaluation. Claude often evaluates the fine-tuned model on the same dev set used during optimization. Always evaluate on a held-out test set that was never seen during training or prompt optimization. Report both dev and test scores so the user can spot overfitting.
Passing teacher= without an optimized teacher program. When using BootstrapFinetune for distillation, Claude sometimes passes the unoptimized base program as the teacher. The teacher must be the prompt-optimized version — otherwise the student learns from mediocre traces and fine-tuning underperforms.
Forgetting that BootstrapFinetune needs a fine-tunable model. Not all models support fine-tuning via API. Claude sometimes configures dspy.LM("anthropic/claude-sonnet-4-5-20250929") for BootstrapFinetune, but Anthropic does not offer a fine-tuning API. Use OpenAI models or local open-source models for weight optimization.
Not checking how many traces were bootstrapped. If bootstrapping only produces 50 successful traces from 1000 examples, the fine-tuning data is too small. Check the bootstrap log output and aim for 200+ successful traces. If too few succeed, use a stronger teacher model or relax the metric.
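The trace-count check in the last gotcha is simple to automate once the success count has been read from the bootstrap log. A minimal sketch; `bootstrap_yield` is a hypothetical helper, and the default of 200 is the target stated above.

```python
def bootstrap_yield(n_success, n_total, min_traces=200):
    """Summarize how many traces passed the metric and whether that meets the target."""
    rate = n_success / n_total if n_total else 0.0
    return {"rate": rate, "enough": n_success >= min_traces}


print(bootstrap_yield(50, 1000))   # the case described above: 5% yield, too few traces
print(bootstrap_yield(640, 1000))  # comfortably above the 200-trace target
```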

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

Build a strong baseline before fine-tuning — see /ai-improving-accuracy
BootstrapFinetune API details — see /dspy-bootstrap-finetune
BetterTogether optimizer — see /dspy-better-together
Cost reduction beyond distillation — see /ai-cutting-costs
Generate synthetic training data — see /ai-generating-data
Fix fine-tuning or evaluation errors — see /ai-fixing-errors
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Additional resources

For worked examples (classification, distillation, BetterTogether), see examples.md
For BootstrapFinetune, BetterTogether, and MIPROv2 API details, see reference.md