| name | unsloth-training |
| description | Fine-tune LLMs with Unsloth using GRPO or SFT. Supports FP8, vision models, mobile deployment, Docker, packing, GGUF export, dataset preparation, synthetic data, MLX (Apple Silicon). Use when: train with GRPO, fine-tune, reward functions, SFT training, FP8 training, vision fine-tuning, phone deployment, docker training, packing, export to GGUF, prepare dataset, synthetic data, install unsloth, environment flags, MLX training. |
Guide LLM fine-tuning using Unsloth:
- GRPO - RL with reward functions (no labeled outputs needed)
- SFT - Supervised fine-tuning with input/output pairs
- Vision - VLM fine-tuning (Qwen3-VL, Gemma3, Llama 3.2 Vision)
Key capabilities:
- FP8 Training - 60% less VRAM, 1.4x faster (RTX 40+, H100)
- 3x Packing - Automatic 2-5x speedup for mixed-length data
- Docker - Official
unsloth/unsloth image
- Mobile - QAT → ExecuTorch → iOS/Android (~40 tok/s)
- Export - GGUF, Ollama, vLLM, LM Studio, SGLang
<quick_start>
GRPO with FP8 (60% less VRAM):
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1"
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B",
max_seq_length=2048, load_in_fp8=True, fast_inference=True,
)
model = FastLanguageModel.get_peft_model(
model, r=64,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
use_gradient_checkpointing="unsloth",
)
def correctness_reward(completions, answer, **kwargs):
return [2.0 if extract_answer(c) == a else 0.0
for c, a in zip(completions, answer)]
trainer = GRPOTrainer(
model=model,
args=GRPOConfig(num_generations=4, beta=0.04, learning_rate=5e-6),
train_dataset=dataset, reward_funcs=[correctness_reward],
)
trainer.train()
SFT with Packing (2-5x faster):
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model=model, train_dataset=dataset, processing_class=tokenizer,
args=SFTConfig(
per_device_train_batch_size=2, num_train_epochs=3,
learning_rate=2e-4, packing=True,
),
)
trainer.train()
</quick_start>
<success_criteria>
A training run is successful when:
- Model loads without OOM errors
- Reward (GRPO) or loss (SFT) shows improvement trend
- Generated outputs match expected format
- Model exported to desired format (LoRA, merged, GGUF)
- Test inference produces reasonable outputs
</success_criteria>
<activation_triggers>
Explicit triggers:
/unsloth grpo - GRPO (RL) training
/unsloth sft - SFT training
/unsloth fp8 - FP8 training setup
/unsloth vision - VLM fine-tuning
/unsloth mobile - Phone deployment (QAT)
/unsloth docker - Docker container setup
/unsloth troubleshoot - Debug issues
/unsloth install - Installation guide
/unsloth dataset - Dataset preparation
/unsloth mlx - Apple Silicon training
Natural language:
- "train with GRPO", "fine-tune", "reward functions"
- "FP8 training", "fp8", "less VRAM"
- "vision fine-tuning", "VLM", "image training"
- "phone deployment", "mobile LLM", "ExecuTorch"
- "docker training", "container", "unsloth docker"
- "packing", "faster training", "500k context"
- "export GGUF", "Ollama", "vLLM", "SGLang"
- "install unsloth", "pip install", "setup unsloth"
- "prepare dataset", "training data", "synthetic data", "ChatML", "ShareGPT"
- "environment flags", "UNSLOTH_RETURN_LOGITS"
- "MLX", "Apple Silicon", "Mac training", "unsloth-mlx"
</activation_triggers>
<file_locations>
Core references:
reference/reward-design.md - Reward function patterns
reference/domain-examples.md - Voice AI, Sales Agent examples
reference/hyperparameters.md - GRPOConfig reference
reference/troubleshooting.md - Common fixes
Setup and data references:
reference/installation.md - pip/uv install, CUDA versions, venv, Colab
reference/environment-flags.md - UNSLOTH_RETURN_LOGITS, COMPILE_DISABLE, etc.
reference/datasets-guide.md - Formats (ChatML/ShareGPT/Alpaca), chat templates, synthetic data
reference/mlx-training.md - Apple Silicon training with unsloth-mlx
Training feature references:
reference/fp8-training.md - FP8 setup, VRAM savings
reference/deployment.md - Docker, vLLM, LoRA hot-swap, SGLang
reference/export-formats.md - GGUF, Ollama, LM Studio, Dynamic 2.0
reference/advanced-training.md - 500K context, packing, checkpoints
reference/vision-training.md - VLM fine-tuning
reference/mobile-deployment.md - QAT, ExecuTorch, iOS/Android
Code examples: reference/grpo/, reference/sft/
</file_locations>
<core_concepts>
When to Use GRPO vs SFT
| Method | Use When | Data Needed |
|---|
| GRPO | Improving reasoning quality | Prompts + verifiable answers |
| GRPO | Aligning behavior with preferences | Reward functions |
| GRPO | When you can verify correctness | Verifiable outputs |
| SFT | Teaching specific output format | Input/output pairs |
| SFT | Following new instructions | Conversation examples |
| SFT | Learning domain knowledge | Labeled examples |
Model Selection
| Model | Size | VRAM | Use Case |
|---|
unsloth/Qwen2.5-0.5B-Instruct | 0.5B | 5GB | Mobile deployment (~200MB GGUF) |
unsloth/Qwen2.5-1.5B-Instruct | 1.5B | 5GB | Learning/prototyping |
Qwen/Qwen2.5-3B-Instruct | 3B | 8GB | Good balance (recommended start) |
unsloth/Qwen2.5-7B-Instruct | 7B | 16GB | Production quality |
unsloth/Phi-4 | 14B | 20GB | Strong reasoning |
Core Hyperparameters
GRPO (RL):
GRPOConfig(
num_generations=4,
beta=0.04,
learning_rate=5e-6,
max_completion_length=512,
max_steps=300,
)
SFT:
TrainingArguments(
learning_rate=2e-4,
num_train_epochs=3,
per_device_train_batch_size=2,
)
</core_concepts>
<reward_functions>
Reward Function Design
Reward functions are the core of GRPO. They return a list of floats for each completion.
Pattern 1: Correctness (Primary Signal)
def correctness_reward(completions, answer, **kwargs):
"""
+2.0 for correct answer, 0.0 otherwise.
This should be your highest-weighted reward.
"""
rewards = []
for completion, true_answer in zip(completions, answer):
extracted = extract_answer(completion)
try:
pred = float(extracted.replace(",", "").strip())
true = float(true_answer.replace(",", "").strip())
reward = 2.0 if abs(pred - true) < 0.01 else 0.0
except ValueError:
reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
rewards.append(reward)
return rewards
Pattern 2: Format Compliance
def format_reward(completions, **kwargs):
"""
+0.5 for proper XML structure with reasoning and answer tags.
"""
rewards = []
for completion in completions:
has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
if has_reasoning and has_answer:
rewards.append(0.5)
elif has_answer:
rewards.append(0.2)
else:
rewards.append(0.0)
return rewards
Pattern 3: Reasoning Quality
def reasoning_length_reward(completions, **kwargs):
"""
+0.3 for substantive reasoning (30-200 words).
"""
rewards = []
for completion in completions:
reasoning = extract_reasoning(completion)
word_count = len(reasoning.split()) if reasoning else 0
if 30 <= word_count <= 200:
rewards.append(0.3)
elif 15 <= word_count < 30:
rewards.append(0.1)
else:
rewards.append(0.0)
return rewards
Pattern 4: Negative Constraints
def no_hedging_reward(completions, **kwargs):
"""
-0.3 penalty for uncertainty language.
"""
hedging = ["i think", "maybe", "perhaps", "possibly", "i'm not sure"]
rewards = []
for completion in completions:
has_hedging = any(phrase in completion.lower() for phrase in hedging)
rewards.append(-0.3 if has_hedging else 0.0)
return rewards
Typical Reward Stack
reward_funcs = [
correctness_reward,
format_reward,
reasoning_length_reward,
no_hedging_reward,
]
For domain-specific rewards: See reference/domain-examples.md for Voice AI, Sales Agent, and Support patterns.
</reward_functions>
<prompt_format>
Prompt Structure
System Prompt with XML Tags
SYSTEM_PROMPT = """You are a helpful assistant that thinks step-by-step.
Always respond in this exact format:
<reasoning>
[Your step-by-step thinking process]
</reasoning>
<answer>
[Your final answer - just the number or short response]
</answer>
"""
Extraction Helpers
import re
def extract_answer(text: str) -> str:
"""Extract answer from XML tags"""
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
return match.group(1).strip() if match else ""
def extract_reasoning(text: str) -> str:
"""Extract reasoning from XML tags"""
match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
return match.group(1).strip() if match else ""
Dataset Format
GRPO (prompt-only):
dataset = dataset.map(lambda ex: {
"prompt": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": ex["question"]}
],
"answer": ex["answer"]
})
SFT (full conversations):
dataset = dataset.map(lambda ex: {
"conversations": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": ex["input"]},
{"role": "assistant", "content": ex["output"]}
]
})
</prompt_format>
<model_export>
Save and Deploy
Save LoRA Only (~100MB)
model.save_lora("grpo_lora")
Merge and Save Full Model
model.save_pretrained_merged(
"grpo_merged", tokenizer,
save_method="merged_16bit",
)
Export to GGUF for Ollama
model.save_pretrained_gguf(
"grpo_gguf", tokenizer,
quantization_method="q4_k_m",
)
Test with Ollama
cat > Modelfile << EOF
FROM ./grpo_gguf/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
PARAMETER temperature 0.7
EOF
ollama create my-model -f Modelfile
ollama run my-model "Solve: 15 + 27 = ?"
</model_export>
## Request Routing
GRPO training: → GRPOConfig, reward functions, dataset prep
→ Reference: reference/grpo/basic_grpo.py
SFT training: → SFTTrainer, dataset formatting
→ Reference: reference/sft/sales_extractor_training.py
Reward function design: → 4 patterns (correctness, format, quality, constraints)
→ Reference: reference/reward-design.md, reference/domain-examples.md
FP8 training: → 60% VRAM savings, env vars, pre-quantized models
→ Reference: reference/fp8-training.md
Docker setup: → Official image, volumes, Jupyter/SSH
→ Reference: reference/deployment.md
Vision fine-tuning: → FastVisionModel, VLM data format
→ Reference: reference/vision-training.md
Mobile deployment: → QAT, ExecuTorch, iOS/Android
→ Reference: reference/mobile-deployment.md
Long context / packing: → 500K context, 2-5x speedup
→ Reference: reference/advanced-training.md
Export formats: → GGUF methods, Ollama, vLLM, SGLang
→ Reference: reference/export-formats.md
Installation: → pip, uv, Colab, CUDA versions
→ Reference: reference/installation.md
Dataset preparation: → Formats, chat templates, synthetic data, multi-column
→ Reference: reference/datasets-guide.md
Environment flags: → UNSLOTH_RETURN_LOGITS, COMPILE_DISABLE, etc.
→ Reference: reference/environment-flags.md
Apple Silicon / MLX: → unsloth-mlx, Mac training
→ Reference: reference/mlx-training.md
Training issues: → reference/troubleshooting.md
<troubleshooting_quick>
Quick Troubleshooting
| Symptom | Fix |
|---|
| Reward not increasing | Wait 300+ steps, then increase learning_rate 2x |
| Reward spiky/unstable | Decrease learning_rate 0.5x, increase beta |
| Model outputs garbage | Increase beta 2-4x, check prompt format |
| Out of memory | Reduce max_completion_length, num_generations=2 |
| No reasoning appearing | Train 500+ steps, use model >= 1.5B |
For detailed troubleshooting: See reference/troubleshooting.md
</troubleshooting_quick>
<training_checklist>
Pre-Training Checklist
GRPO:
SFT:
<cost_estimates>
Cost Estimates
Training Data Generation:
- 50 seeds x 10 variations x ~1000 tokens = ~500K tokens
- Claude Sonnet: ~$1.50
- GPT-4o-mini: ~$0.30
Training Compute:
- Google Colab T4: Free (30-60 min for 1000 examples)
- RunPod A40 (48GB): ~$0.79/hr
Total per experiment: ~$0.50-$1.00 on RunPod
</cost_estimates>
<example_session>
Example Session
User: "I want to train a model that gives concise voice responses"
Claude:
- First, let's set up the model:
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen2.5-3B-Instruct",
max_seq_length=1024, load_in_4bit=True,
)
- For voice AI, you need brevity rewards. Here's a custom stack:
def brevity_reward(completions, **kwargs):
"""Voice responses under 50 words get +0.5"""
return [0.5 if len(c.split()) <= 50 else -0.3 for c in completions]
def speakable_reward(completions, **kwargs):
"""Penalize markdown that sounds bad spoken"""
bad = ["**", "- ", "```", "http"]
return [-0.2 * sum(1 for b in bad if b in c) for c in completions]
reward_funcs = [correctness_reward, brevity_reward, speakable_reward]
- See
reference/domain-examples.md for complete Voice AI reward patterns.
</example_session>