grpo-finetuning

Name: Grpo Finetuning
Author: aws-solutions-library-samples

Implement GRPO (Group Relative Policy Optimization) fine-tuning for vision-language models on small datasets. Use when SFT underperforms or training data is limited (<1000 examples).

Stars: 314
Forks: 121
Updated: January 27, 2026 at 18:47

Repository: aws-solutions-library-samples/guidance-for-claude-code-with-amazon-bedrock

When to Use GRPO vs SFT

| Scenario | Recommended Approach |
|---|---|
| Large dataset (>5000 examples) | SFT (Supervised Fine-Tuning) |
| Small dataset (<1000 examples) | GRPO |
| Clear correctness criteria (JSON, format) | GRPO |
| Subjective quality (style, tone) | SFT |
| Need diversity in outputs | GRPO |

1. Define Reward Functions

    import json


    def formatting_reward_func(completions, **kwargs):
        """Reward valid JSON structure."""
        rewards = []
        for completion in completions:
            try:
                json.loads(completion)
                rewards.append(1.0)
            except (json.JSONDecodeError, TypeError):
                rewards.append(0.0)
        return rewards


    # TRL passes extra dataset columns to reward functions as keyword
    # arguments named after the column, so the parameter is `answer`
    # to match the "answer" column of the dataset.
    def correctness_reward_func(completions, answer, **kwargs):
        """Reward correct field values, scaled to the range 0-2."""
        rewards = []
        for completion, ref_text in zip(completions, answer):
            try:
                pred = json.loads(completion)
                ref = json.loads(ref_text)
                # Compare fields. Assumes both values are flat JSON objects;
                # any other JSON shape (for example an array) scores 0.0.
                matches = sum(1 for k in ref if pred.get(k) == ref[k])
                rewards.append(matches / len(ref) * 2.0)
            except (json.JSONDecodeError, AttributeError, TypeError, ZeroDivisionError):
                rewards.append(0.0)
        return rewards


    def anti_hallucination_reward(completions, answer, **kwargs):
        """Penalize extra items not in the reference."""
        rewards = []
        for completion, ref_text in zip(completions, answer):
            try:
                pred_items = len(json.loads(completion))
                ref_items = len(json.loads(ref_text))
                extra = max(0, pred_items - ref_items)
                rewards.append(-0.5 * extra)  # Penalty for hallucination
            except (json.JSONDecodeError, TypeError):
                rewards.append(0.0)
        return rewards
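Before training, it can help to run a reward function by hand on a few hand-written completions, since a reward that is always zero gives the trainer nothing to optimize. The sketch below is illustrative only: `strip_code_fence` and `json_validity_reward` are hypothetical names, and the assumption that the model sometimes wraps its JSON in a markdown code fence is not stated in the source.

```python
import json
import re

# The fence marker is built programmatically so this example embeds cleanly.
FENCE = "`" * 3
FENCE_PATTERN = re.compile(
    r"\s*" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"\s*", re.DOTALL
)


def strip_code_fence(text):
    """Return the text inside a surrounding code fence, or the stripped text if none."""
    match = FENCE_PATTERN.fullmatch(text)
    return match.group(1) if match else text.strip()


def json_validity_reward(completions, **kwargs):
    """Score 1.0 for each completion that parses as JSON after fence stripping."""
    rewards = []
    for completion in completions:
        try:
            json.loads(strip_code_fence(completion))
            rewards.append(1.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards


samples = [
    '[{"course_code": "CS101"}]',                                   # plain JSON
    FENCE + 'json\n[{"course_code": "CS101"}]\n' + FENCE,           # fenced JSON
    "Here are the courses: CS101",                                  # prose, not JSON
]
sample_rewards = json_validity_reward(samples)
print(sample_rewards)
```

If the fenced sample scored 0.0 with a plain `json.loads` call, every completion in that style would be treated as a formatting failure, which is one plausible cause of all-zero rewards.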

2. Training Configuration

    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(
        # GRPO-specific
        num_generations=2,  # Completions per prompt (2-8)

        # Lower learning rate than SFT
        learning_rate=5e-6,  # NOT 2e-4 like SFT

        # Standard training
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,

        # Generation settings
        max_completion_length=1024,
        max_prompt_length=512,

        # Optimization
        warmup_ratio=0.1,
        logging_steps=1,
        save_steps=50,
    )
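The `num_generations` setting matters because GRPO scores each completion relative to the other completions sampled for the same prompt. A minimal sketch of that normalization follows. It uses the population standard deviation and a small `eps`, both of which are assumptions here; real implementations differ in these details, and `group_relative_advantages` is a hypothetical helper, not a TRL function.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_generations, eps=1e-4):
    """Normalize rewards within each group of completions from the same prompt.

    `rewards` is a flat list ordered prompt by prompt, with `num_generations`
    consecutive entries per prompt.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("rewards must contain a whole number of groups")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)
        # Completions above the group mean get a positive advantage.
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts, two completions each (num_generations=2).
advantages = group_relative_advantages([3.0, 1.0, 2.0, 2.0], num_generations=2)
print(advantages)
```

In the second group both completions earn the same reward, so both advantages are exactly zero and that prompt contributes no gradient. This is why reward functions that rarely distinguish between completions produce no learning signal, and why very small `num_generations` values make such ties more likely.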

3. Dataset Format

    from datasets import Dataset
    from PIL import Image

    dataset = Dataset.from_dict({
        "prompt": [
            "Extract courses from this image...",
            "List all courses shown...",
        ],
        "image": [
            Image.open("page1.png"),
            Image.open("page2.png"),
        ],
        "answer": [
            '[{"course_code": "CS101", "title": "Intro to CS"}]',
            '[{"course_code": "MATH201", "title": "Calculus II"}]',
        ],
    })
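The answers in this dataset are JSON arrays of objects, so a field-level comparison has to pair predicted items with reference items before counting matches. One way to sketch that is shown below. `score_items` is a hypothetical helper, and matching items on `course_code` is an assumption based on the example rows; it also assumes the key values are plain strings.

```python
import json


def score_items(completion, answer, key="course_code"):
    """Return (fraction of reference fields matched, count of extra predicted items)."""
    try:
        pred = json.loads(completion)
        ref = json.loads(answer)
    except (json.JSONDecodeError, TypeError):
        return 0.0, 0
    if not isinstance(pred, list) or not isinstance(ref, list):
        return 0.0, 0

    # Index predicted items by their key so order does not matter.
    pred_by_key = {item.get(key): item for item in pred if isinstance(item, dict)}
    ref_items = [item for item in ref if isinstance(item, dict)]

    total = 0
    matched = 0
    for ref_item in ref_items:
        pred_item = pred_by_key.get(ref_item.get(key), {})
        for field, value in ref_item.items():
            total += 1
            if pred_item.get(field) == value:
                matched += 1

    # Predicted items whose key never appears in the reference count as extras.
    ref_keys = {item.get(key) for item in ref_items}
    extra = sum(1 for k in pred_by_key if k not in ref_keys)
    return (matched / total if total else 0.0), extra


reference = '[{"course_code": "CS101", "title": "Intro to CS"}]'
exact_plus_extra = score_items(
    '[{"course_code": "CS101", "title": "Intro to CS"},'
    ' {"course_code": "CS999", "title": "Made Up"}]',
    reference,
)
wrong_title = score_items('[{"course_code": "CS101", "title": "Wrong"}]', reference)
print(exact_plus_extra, wrong_title)
```

The two return values map naturally onto a correctness reward (the matched fraction, optionally scaled) and a hallucination penalty (a negative multiple of the extra count).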

4. Trainer Setup

    from unsloth import FastVisionModel

    # Load model with 4-bit quantization
    model, tokenizer = FastVisionModel.from_pretrained(
        "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
        load_in_4bit=True,
    )

    # Apply LoRA
    model = FastVisionModel.get_peft_model(
        model,
        r=16,  # LoRA rank
        lora_alpha=32,  # Usually 2x rank
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
    )

    # Initialize trainer (GRPOTrainer and config come from the
    # training configuration step)
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        args=config,  # GRPOTrainer takes the GRPOConfig as `args`
        train_dataset=dataset,
        reward_funcs=[
            formatting_reward_func,
            correctness_reward_func,
            anti_hallucination_reward,
        ],
    )

    trainer.train()

AWS Infrastructure

| Model Size | SFT Instance | GRPO Instance |
|---|---|---|
| 7-8B | ml.g5.2xlarge (24GB GPU) | ml.g5.4xlarge (24GB GPU, 64GB host RAM) or ml.p4d.24xlarge |
| 13B | ml.g5.4xlarge | ml.p4d.24xlarge |
| 70B | ml.p4d.24xlarge | ml.p5.48xlarge |

Docker Container

    # Pin to a specific version for reproducibility and security
    # Check https://github.com/unslothai/unsloth/releases for latest stable versions
    # Use a specific version, not :stable
    FROM ghcr.io/unslothai/unsloth:2024.12

    # Add AWS dependencies without breaking numpy/scipy
    RUN pip install --no-cache-dir --no-deps boto3 botocore

    COPY grpo_train.py /opt/ml/code/
    ENTRYPOINT ["python3", "/opt/ml/code/grpo_train.py"]

    # Pull, scan, and push to ECR
    docker pull ghcr.io/unslothai/unsloth:2024.12
    docker tag ghcr.io/unslothai/unsloth:2024.12 ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/unsloth:2024.12
    docker push ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/unsloth:2024.12

grpo-finetuning

When to Use GRPO vs SFT

Core Concepts

Implementation Pattern

1. Define Reward Functions

2. Training Configuration

3. Dataset Format

4. Trainer Setup

AWS Infrastructure

SageMaker Training Job

Docker Container

Training Time Estimation

Common Issues

1. OOM During Generation

2. Zero Rewards

3. No Learning Signal

Success Metrics
