grpo-finetuning
// Implement GRPO (Group Relative Policy Optimization) fine-tuning for vision-language models on small datasets. Use when SFT underperforms or training data is limited (<1000 examples).
| name | grpo-finetuning |
| description | Implement GRPO (Group Relative Policy Optimization) fine-tuning for vision-language models on small datasets. Use when SFT underperforms or training data is limited (<1000 examples). |
This skill guides implementation of GRPO fine-tuning for vision-language models, particularly effective when training data is limited. GRPO uses reinforcement learning with custom reward functions instead of traditional supervised fine-tuning.
| Scenario | Recommended Approach |
|---|---|
| Large dataset (>5000 examples) | SFT (Supervised Fine-Tuning) |
| Small dataset (<1000 examples) | GRPO |
| Clear correctness criteria (JSON, format) | GRPO |
| Subjective quality (style, tone) | SFT |
| Need diversity in outputs | GRPO |
GRPO generates multiple completions per prompt, scores them with reward functions, and optimizes the policy to favor higher-reward outputs:
Prompt → Generate N completions → Score each with rewards → Policy gradient update
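
The "score, then update" step is group-relative: each completion is judged against the other completions sampled for the same prompt. A minimal sketch of that normalization, assuming the common mean/standard-deviation form; the helper name `group_relative_advantages` and the `eps` value are illustrative, and real implementations may differ in details such as the standard-deviation estimator:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards against its own group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one prompt, already scored by a reward function.
advantages = group_relative_advantages([0.0, 1.0, 1.0, 2.0])

# Identical rewards give zero advantage everywhere, so the group carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Completions that beat their group's average get a positive advantage and are reinforced; those below it are discouraged.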
Create reward functions that return scores between 0 and 1 (or custom ranges):

    import json

    # TRL passes each extra dataset column to reward functions as a keyword
    # argument, so the parameter name must match the column name ("answer").

    def formatting_reward_func(completions, **kwargs):
        """Reward valid JSON structure."""
        rewards = []
        for completion in completions:
            try:
                json.loads(completion)
                rewards.append(1.0)
            except json.JSONDecodeError:
                rewards.append(0.0)
        return rewards

    def correctness_reward_func(completions, answer, **kwargs):
        """Reward correct field values (expects one JSON object per example)."""
        rewards = []
        for completion, ref_text in zip(completions, answer):
            try:
                pred = json.loads(completion)
                ref = json.loads(ref_text)
                # Fraction of reference fields reproduced exactly
                matches = sum(1 for k in ref if pred.get(k) == ref[k])
                rewards.append(matches / len(ref) * 2.0)  # Scale to 0-2
            except (json.JSONDecodeError, AttributeError, TypeError, ZeroDivisionError):
                # Invalid JSON, non-object JSON, or an empty reference
                rewards.append(0.0)
        return rewards

    def anti_hallucination_reward(completions, answer, **kwargs):
        """Penalize extra items not in reference."""
        rewards = []
        for completion, ref_text in zip(completions, answer):
            try:
                pred_items = len(json.loads(completion))
                ref_items = len(json.loads(ref_text))
                extra = max(0, pred_items - ref_items)
                rewards.append(-0.5 * extra)  # Penalty for hallucination
            except (json.JSONDecodeError, TypeError):
                rewards.append(0.0)
        return rewards

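
When the reference answer is a JSON array of records rather than a single object, a field-by-field comparison on the top-level value will not apply. One possible variant gives partial credit per record. This is a sketch under assumptions: `record_match_reward` is a hypothetical name, each record is assumed to be a flat dict that must match exactly, and the `answer` parameter assumes a dataset column of that name.

```python
import json

def record_match_reward(completions, answer, **kwargs):
    """Fraction of reference records reproduced exactly (0.0 to 1.0)."""
    rewards = []
    for completion, ref_text in zip(completions, answer):
        try:
            pred = json.loads(completion)
            ref = json.loads(ref_text)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
            continue
        # Only score well-formed, non-empty lists of records.
        if not isinstance(pred, list) or not isinstance(ref, list) or not ref:
            rewards.append(0.0)
            continue
        matched = sum(1 for record in ref if record in pred)
        rewards.append(matched / len(ref))
    return rewards

reference = '[{"course_code": "CS101"}, {"course_code": "MATH201"}]'
scores = record_match_reward(
    ['[{"course_code": "CS101"}]', "not json", reference],
    [reference, reference, reference],
)
```

Because half-correct outputs score between 0 and 1, completions within a group are less likely to all receive the same reward.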
Key hyperparameters for GRPO (different from SFT):

    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(
        # GRPO-specific
        num_generations=2,  # Completions per prompt (2-8)
        # Lower learning rate than SFT
        learning_rate=5e-6,  # NOT 2e-4 like SFT
        # Standard training
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        # Generation settings
        max_completion_length=1024,
        max_prompt_length=512,
        # Optimization
        warmup_ratio=0.1,
        logging_steps=1,
        save_steps=50,
    )

GRPO expects prompts and reference answers:

    from datasets import Dataset
    from PIL import Image

    dataset = Dataset.from_dict({
        "prompt": [
            "Extract courses from this image...",
            "List all courses shown...",
        ],
        "image": [
            Image.open("page1.png"),
            Image.open("page2.png"),
        ],
        "answer": [
            '[{"course_code": "CS101", "title": "Intro to CS"}]',
            '[{"course_code": "MATH201", "title": "Calculus II"}]',
        ],
    })

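
Since the reward functions parse reference answers as JSON, a malformed reference silently turns into a zero reward for every completion of that prompt. A small pre-flight check can catch this before training starts. This is an illustrative sketch; `find_invalid_answers` is a hypothetical helper that works on a plain list of answer strings.

```python
import json

def find_invalid_answers(answers):
    """Return (index, error message) for every answer that is not valid JSON."""
    problems = []
    for i, text in enumerate(answers):
        try:
            json.loads(text)
        except (json.JSONDecodeError, TypeError) as exc:
            problems.append((i, str(exc)))
    return problems

problems = find_invalid_answers([
    '[{"course_code": "CS101", "title": "Intro to CS"}]',
    '[{"course_code": "MATH201", "title": "Calculus II"}',  # missing closing bracket
    None,  # missing reference
])
bad_indices = [i for i, _ in problems]
```

Running it over the answer column and refusing to train when the result is non-empty keeps bad references from flattening the reward signal.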

    from unsloth import FastVisionModel

    # Load model with 4-bit quantization
    model, tokenizer = FastVisionModel.from_pretrained(
        "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
        load_in_4bit=True,
    )

    # Apply LoRA
    model = FastVisionModel.get_peft_model(
        model,
        r=16,           # LoRA rank
        lora_alpha=32,  # Usually 2x rank
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
    )

    # Initialize trainer (GRPOTrainer takes the GRPOConfig through `args`)
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        args=config,
        train_dataset=dataset,
        reward_funcs=[
            formatting_reward_func,
            correctness_reward_func,
            anti_hallucination_reward,
        ],
    )
    trainer.train()

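
With several reward functions in play, each completion ends up with one combined score. A useful mental model, assumed here rather than taken from the trainer's source, is a weighted sum across functions. The helper `combine_rewards` is hypothetical and the numbers are made up to show how the three scales interact:

```python
def combine_rewards(per_function_rewards, weights=None):
    """Weighted sum across reward functions, one total per completion."""
    if weights is None:
        weights = [1.0] * len(per_function_rewards)
    return [
        sum(w * r for w, r in zip(weights, column))
        for column in zip(*per_function_rewards)
    ]

# Three completions scored by three reward functions (illustrative values).
formatting = [1.0, 1.0, 0.0]      # 0 or 1
correctness = [2.0, 1.0, 0.0]     # 0 to 2
hallucination = [0.0, -0.5, 0.0]  # 0 or negative

totals = combine_rewards([formatting, correctness, hallucination])
```

Because the correctness reward spans 0 to 2 while formatting spans 0 to 1, correctness dominates the total unless the weights are adjusted.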
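GRPO requires more memory than SFT because it generates multiple completions per prompt:
| Model Size | SFT Instance | GRPO Instance |
|---|---|---|
| 7-8B | ml.g5.2xlarge (24GB GPU) | ml.g5.4xlarge (24GB GPU, more host RAM) or ml.p4d.24xlarge |
| 13B | ml.g5.4xlarge | ml.p4d.24xlarge |
| 70B | ml.p4d.24xlarge | ml.p5.48xlarge |
Note that ml.g5.2xlarge and ml.g5.4xlarge each carry a single 24GB A10G GPU; the 4xlarge adds host memory and vCPUs, not VRAM. If the GPU itself runs out of memory, move to ml.p4d.24xlarge.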
Security Note: The example below pulls a third-party image. For production use, you should pin it to a specific version or, better, an immutable digest (@sha256:...) rather than a mutable tag such as :stable.

    # Pin to a specific version for reproducibility and security.
    # Check https://github.com/unslothai/unsloth/releases for latest stable versions.
    # Use a specific version, not :stable (Dockerfile comments must be on their own line).
    FROM ghcr.io/unslothai/unsloth:2024.12

    # Add AWS dependencies without breaking numpy/scipy.
    # --no-deps skips transitive packages, so boto3's own requirements are listed explicitly.
    RUN pip install --no-cache-dir --no-deps boto3 botocore s3transfer jmespath

    COPY grpo_train.py /opt/ml/code/
    ENTRYPOINT ["python3", "/opt/ml/code/grpo_train.py"]

For maximum security, mirror the image to your own ECR registry:

    # Pull, tag, and push to ECR (scan the pulled image before pushing)
    docker pull ghcr.io/unslothai/unsloth:2024.12
    docker tag ghcr.io/unslothai/unsloth:2024.12 ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/unsloth:2024.12
    docker push ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/unsloth:2024.12

GRPO is slower than SFT due to generation overhead:
Time per step ≈ 150-200 seconds (vs 30-50s for SFT)
Total time ≈ num_examples × time_per_step / batch_size
For 1000 examples: ~45-55 hours on ml.p4d.24xlarge
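
The estimate above is plain arithmetic and can be reproduced for other dataset sizes. A minimal sketch, using the per-step timings quoted above and assuming a batch size of 1; `estimate_training_hours` is a hypothetical helper:

```python
def estimate_training_hours(num_examples, seconds_per_step, batch_size=1):
    """Total time ≈ num_examples × time_per_step / batch_size, in hours."""
    return num_examples * seconds_per_step / batch_size / 3600

low = estimate_training_hours(1000, 150)
high = estimate_training_hours(1000, 200)
print(f"{low:.1f}-{high:.1f} hours")  # prints "41.7-55.6 hours"
```

The result lands in roughly the same range as the figure quoted above; measured step times on your own instance should replace the 150-200 second assumption.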
Symptom: Exit code 139 (SIGSEGV)
Fix: Reduce num_generations or max_completion_length, or use larger instance
Symptom: reward_std: 0.0 in logs
Fix: Reward functions may be too strict; add partial credit
Symptom: loss: 0.0 throughout training
Fix: Ensure reward variance between generations; check frac_reward_zero_std
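
The last two symptoms share a cause: when every completion for a prompt receives the same reward, that group contributes no learning signal. The check below estimates how often that happens in a batch of logged rewards. It is a sketch only: `frac_zero_std` is a hypothetical helper that mirrors the idea behind the frac_reward_zero_std metric, not its exact implementation.

```python
from statistics import pstdev

def frac_zero_std(reward_groups, tol=1e-8):
    """Share of prompts whose completions all received the same reward."""
    if not reward_groups:
        return 0.0
    flat = sum(1 for group in reward_groups if pstdev(group) <= tol)
    return flat / len(reward_groups)

# One inner list per prompt, one reward per sampled completion.
groups = [
    [0.0, 0.0],  # all failed: no signal
    [1.0, 0.0],  # mixed: useful signal
    [1.0, 1.0],  # all succeeded: no signal
    [0.5, 2.0],  # partial credit: useful signal
]
fraction = frac_zero_std(groups)
```

A value close to 1.0 suggests the rewards are too strict or too easy; partial credit or a larger num_generations are the usual remedies.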
Monitor these in CloudWatch/logs:
- reward/mean: Should increase over training
- reward_std: Non-zero indicates learning signal
- completions/clipped_ratio: High ratio suggests max_completion_length is too short
- kl: Should stay small (policy not diverging too far)