| name | training-optimization |
| description | Advanced techniques for optimizing LLM fine-tuning. Covers learning rates, LoRA configuration, batch sizes, gradient strategies, hyperparameter tuning, and monitoring. Use when fine-tuning models for best performance. |
Training Optimization
Master advanced techniques for efficient, high-quality LLM fine-tuning.
Overview
Fine-tuning results depend on a handful of interacting settings. This guide covers how to optimize:
- Learning rates - Schedulers, warmup, optimal values
- LoRA configuration - Rank, alpha, target modules
- Batch optimization - Size, accumulation, sequence length
- Precision - FP16, BF16, mixed precision
- Gradient strategies - Checkpointing, clipping, accumulation
- Hyperparameter tuning - Grid search, Bayesian optimization
- Monitoring - WandB, TensorBoard, loss curves
- Quality - Prevent overfitting, improve convergence
Quick Start
Optimal Default Configuration
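from unsloth import FastLanguageModel
import torch  # used below for the bf16 capability check
from trl import SFTTrainer
from transformers import TrainingArguments
# Llama 3.2 text models ship in 1B and 3B sizes (there is no 7B variant);
# substitute whichever 4-bit checkpoint you actually intend to train.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)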
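model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank
    lora_alpha=16,       # scaling is alpha / r = 1.0 here
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj"       # MLP projections
    ],
    bias="none",
    use_gradient_checkpointing="unsloth"
)
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size: 2 x 4 = 8
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    # Use bf16 where the GPU supports it, otherwise fall back to fp16
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    save_strategy="epoch",
    save_total_limit=3
)
# `dataset` is your prepared training set; it is not defined in this snippet.
# Argument names follow older TRL releases; recent versions may expect
# max_seq_length in SFTConfig and the tokenizer as processing_class.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    args=training_args
)
trainer.train()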
Learning Rate Optimization
Finding Optimal LR (Learning Rate Finder)
import matplotlib.pyplot as plt
import numpy as np
def find_learning_rate(model_init, tokenizer, dataset,
                       lrs=(1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3)):
    """Short LR sweep: train 100 steps per candidate and compare the losses.

    model_init is a zero-argument callable that returns a freshly loaded model,
    so every candidate starts from the same weights. Reusing one model object
    would let earlier runs contaminate later ones.
    """
    learning_rates = []
    losses = []
    for lr in lrs:
        print(f"Testing LR: {lr}")
        args = TrainingArguments(
            output_dir=f"./lr_test_{lr}",
            learning_rate=lr,
            max_steps=100,
            per_device_train_batch_size=2,
            logging_steps=10
        )
        trainer = SFTTrainer(
            model=model_init(),
            tokenizer=tokenizer,
            train_dataset=dataset.select(range(200)),  # small subset keeps the sweep cheap
            args=args
        )
        result = trainer.train()
        learning_rates.append(lr)
        losses.append(result.training_loss)
    # Plot loss against LR on a log axis
    plt.plot(learning_rates, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.savefig('lr_finder.png')
    # Pick the candidate where the loss falls most steeply between neighbours
    optimal_idx = int(np.argmin(np.gradient(losses)))
    optimal_lr = learning_rates[optimal_idx]
    print(f"Optimal LR: {optimal_lr}")
    return optimal_lr
Learning Rate Schedules
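# Each snippet shows only the scheduler-related arguments; pick one.
# Cosine: warm up, then decay smoothly toward zero (used in Quick Start)
training_args = TrainingArguments(
    lr_scheduler_type="cosine",
    learning_rate=2e-4,
    warmup_ratio=0.03
)
# Linear: warm up, then decay linearly toward zero
training_args = TrainingArguments(
    lr_scheduler_type="linear",
    learning_rate=2e-4,
    warmup_steps=100
)
# Constant with warmup: warm up, then hold the peak LR
training_args = TrainingArguments(
    lr_scheduler_type="constant_with_warmup",
    learning_rate=2e-4,
    warmup_ratio=0.05
)
# Polynomial: warm up, then polynomial decay
training_args = TrainingArguments(
    lr_scheduler_type="polynomial",
    learning_rate=2e-4,
    warmup_ratio=0.03
)
# Cosine with restarts: cosine decay that periodically resets to the peak LR
training_args = TrainingArguments(
    lr_scheduler_type="cosine_with_restarts",
    learning_rate=2e-4,
    warmup_ratio=0.03
)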
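To make the shape of the cosine option concrete, here is a minimal plain-Python sketch of linear warmup followed by cosine decay. The function name `cosine_lr` and its arguments are illustrative assumptions, not part of `transformers`; the library's own scheduler may differ in details.

```python
import math

def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: 1000 total steps, 3% warmup (30 steps), peak LR 2e-4.
for s in (0, 30, 515, 1000):
    print(s, cosine_lr(s, total_steps=1000, warmup_steps=30, peak_lr=2e-4))
```

The LR starts at zero, reaches the peak at the end of warmup, sits at half the peak midway through the decay phase, and ends at zero.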
LR Guidelines by Model Size
| Model Size | Learning Rate | Warmup Steps | Batch Size |
|---|---|---|---|
| 1-3B | 5e-4 to 1e-3 | 50-100 | 8-16 |
| 7B | 2e-4 to 5e-4 | 100-200 | 4-8 |
| 13B | 1e-4 to 2e-4 | 200-300 | 2-4 |
| 30B+ | 5e-5 to 1e-4 | 300-500 | 1-2 |
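The table gives warmup in steps, while most snippets in this guide set `warmup_ratio`. The sketch below converts between the two, assuming total optimizer steps are roughly `ceil(examples / effective batch) * epochs` and that the ratio is rounded up to whole steps. Both helper names are hypothetical, and the exact step count your trainer reports may differ slightly.

```python
import math

def total_optimizer_steps(num_examples: int, effective_batch_size: int, epochs: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch * epochs

def warmup_steps_from_ratio(total_steps: int, warmup_ratio: float) -> int:
    """Warmup length in steps implied by a warmup ratio (rounded up)."""
    return math.ceil(total_steps * warmup_ratio)

# Example: 10,000 examples, effective batch size 8, 3 epochs, 3% warmup.
total = total_optimizer_steps(num_examples=10_000, effective_batch_size=8, epochs=3)
print(total)                                 # 3750
print(warmup_steps_from_ratio(total, 0.03))  # 113
```

On small datasets a 3% ratio can yield only a handful of warmup steps, which is one reason to set `warmup_steps` explicitly instead.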
LoRA Configuration
LoRA Rank Optimization
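# Low rank: fewest trainable parameters, fastest; a fit for simple tasks
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
)
# Medium rank: the default used in Quick Start
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
)
# High rank: more adapter capacity for complex tasks, at higher memory cost
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
)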
LoRA Alpha Guidelines
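# Values below assume r=16
lora_alpha = 8    # conservative: alpha = r / 2
lora_alpha = 16   # standard: alpha = r
lora_alpha = 32   # aggressive: alpha = 2 * r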
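What matters in practice is the ratio of alpha to rank: in standard LoRA the low-rank update is multiplied by `lora_alpha / r`. The helper below is a small illustrative sketch (the function name is an assumption) showing the multiplier each setting implies.

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Multiplier applied to the low-rank update in standard LoRA: alpha / r."""
    if r <= 0:
        raise ValueError("rank must be positive")
    return lora_alpha / r

# The three alpha values above, paired with rank 16.
for alpha in (8, 16, 32):
    print(f"alpha={alpha}, r=16 -> scaling {lora_scaling(alpha, 16)}")
```

If you change the rank, consider changing alpha with it so the effective scaling stays where you tuned it.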
Target Modules Selection
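# Minimal: query and value projections only (fewest parameters)
target_modules = ["q_proj", "v_proj"]
# All attention projections
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Attention + MLP (used in Quick Start)
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"
]
# Attention + MLP + embeddings and output head (most parameters and memory)
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "embed_tokens", "lm_head"
]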
LoRA Parameter Count
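def calculate_lora_params(base_model_size, rank, num_target_modules,
                          hidden_size=4096, num_layers=32):
    """
    Estimate trainable parameters with LoRA

    base_model_size is in billions of parameters. hidden_size and num_layers
    default to typical 7B values; set them for your model.
    """
    # Each adapted matrix gets A (rank x d_in) and B (d_out x rank).
    # Approximating d_in = d_out = hidden_size; MLP projections are wider,
    # so this underestimates somewhat.
    params_per_module = rank * 2 * hidden_size
    # Adapters are added to every target module in every transformer layer
    total_lora_params = params_per_module * num_target_modules * num_layers
    return {
        "lora_params": total_lora_params,
        "base_params": base_model_size * 1e9,
        "trainable_percent": (total_lora_params / (base_model_size * 1e9)) * 100
    }
params = calculate_lora_params(
    base_model_size=7,
    rank=16,
    num_target_modules=7
)
print(f"Trainable: {params['trainable_percent']:.2f}%")  # Trainable: 0.42%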
Batch Size & Gradient Accumulation
Effective Batch Size
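# Effective batch size = per-device batch x accumulation steps x number of GPUs
# Low memory: 1 x 8 = 8
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
# More memory available: 4 x 2 = 8, with fewer accumulation steps
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
)
# 2 x 2 = 4 per GPU, which reaches 8 when run on 2 GPUs
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
)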
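The arithmetic behind these configurations can be written as a one-line helper. This is a small sketch; the function name is an assumption for illustration.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_gpus

print(effective_batch_size(1, 8))              # 8
print(effective_batch_size(4, 2))              # 8
print(effective_batch_size(2, 2, num_gpus=2))  # 8
```

Configurations with the same effective batch size behave similarly from the optimizer's point of view; they differ mainly in memory use and throughput.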
Optimal Batch Size Guidelines
| Model Size | Min Effective Batch | Recommended | Max (if possible) |
|---|---|---|---|
| 1-3B | 4 | 8-16 | 32 |
| 7B | 8 | 16-32 | 64 |
| 13B | 16 | 32-64 | 128 |
| 30B+ | 32 | 64-128 | 256 |
Sequence Length Optimization
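# Longer sequences need more memory per example; pick the shortest length that covers your data
max_seq_length = 512    # short inputs and outputs
max_seq_length = 2048   # general default (used in Quick Start)
max_seq_length = 4096   # long documents
max_seq_length = 8192   # very long context; expect to reduce batch size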
Dynamic Batch Size
def get_dynamic_batch_size(seq_length, gpu_memory=24):
    """Calculate optimal batch size for sequence length"""
    # gpu_memory is in GB; longer sequences get smaller batches
    if seq_length <= 512:
        return 8 if gpu_memory >= 16 else 4
    elif seq_length <= 1024:
        return 4 if gpu_memory >= 16 else 2
    elif seq_length <= 2048:
        return 2 if gpu_memory >= 24 else 1
    else:
        return 1
batch_size = get_dynamic_batch_size(
    seq_length=2048,
    gpu_memory=24
)
Mixed Precision Training
BF16 vs FP16
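import torch
# Preferred: BF16 on GPUs that support it (Ampere and newer); no loss scaling needed
training_args = TrainingArguments(
    bf16=torch.cuda.is_bf16_supported(),
)
# Fallback: FP16 on older GPUs
training_args = TrainingArguments(
    fp16=not torch.cuda.is_bf16_supported(),
    fp16_opt_level="O1",  # only used with the Apex AMP backend
)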
Gradient Scaler (for FP16)
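# With fp16=True the Trainer handles loss scaling internally,
# so there is no need to create a GradScaler yourself.
training_args = TrainingArguments(
    fp16=True,
    fp16_full_eval=True,   # also run evaluation in FP16 (saves memory, may shift metrics slightly)
    fp16_opt_level="O2",   # Apex AMP level; only used with the Apex backend
)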
Gradient Optimization
Gradient Checkpointing
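# Enabled: recompute activations in the backward pass (less memory, somewhat slower)
model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing="unsloth"
)
# Disabled: keep activations in memory (faster, needs more memory)
model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing=False
)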
Gradient Clipping
# Clip the global gradient norm to 1.0 (the usual default)
training_args = TrainingArguments(
    max_grad_norm=1.0,
)
# A value of 0 turns clipping off
training_args = TrainingArguments(
    max_grad_norm=0.0
)
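For intuition about what `max_grad_norm` does, the sketch below applies global-norm clipping to a flat list of gradient values in plain Python. It is a simplified illustration under the assumption that all gradients are treated as one vector; the function name is hypothetical and real frameworks operate on tensors.

```python
import math

def clip_by_global_norm(grads: list[float], max_norm: float, eps: float = 1e-6) -> list[float]:
    """Rescale grads so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm is above the limit; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> scaled to about [0.6, 0.8]
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 -> left unchanged
```

Clipping preserves the gradient direction and only shrinks its length, which is why it helps with occasional loss spikes without changing typical updates.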
Gradient Accumulation Steps
def calculate_accumulation(
    target_batch_size: int,
    per_device_batch: int,
    num_gpus: int = 1
):
    """Calculate gradient accumulation steps"""
    # Floor division, but never fewer than 1 step
    return max(1, target_batch_size // (per_device_batch * num_gpus))
accumulation_steps = calculate_accumulation(
    target_batch_size=32,
    per_device_batch=4,
    num_gpus=1
)  # 32 // (4 * 1) = 8
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=accumulation_steps
)
Hyperparameter Tuning
Grid Search
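from itertools import product
search_space = {
    'learning_rate': [1e-4, 2e-4, 5e-4],
    'lora_rank': [8, 16, 32],
    'lora_alpha': [8, 16, 32],
    'batch_size': [4, 8]
}
best_loss = float('inf')
best_params = None
# Full grid: 3 x 3 x 3 x 2 = 54 training runs
for lr, rank, alpha, batch in product(*search_space.values()):
    print(f"Testing: lr={lr}, rank={rank}, alpha={alpha}, batch={batch}")
    # configure_lora is a placeholder for your own helper: it should load a
    # fresh base model and attach LoRA adapters with this rank and alpha.
    model = configure_lora(rank=rank, alpha=alpha)
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,  # your prepared training set
        args=TrainingArguments(
            output_dir=f"./grid_lr{lr}_r{rank}_a{alpha}_b{batch}",
            learning_rate=lr,
            per_device_train_batch_size=batch,
            num_train_epochs=1
        )
    )
    result = trainer.train()
    # Compares training loss; prefer eval loss if you have a validation split
    if result.training_loss < best_loss:
        best_loss = result.training_loss
        best_params = (lr, rank, alpha, batch)
print(f"Best params: {best_params}")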
Bayesian Optimization (Optuna)
import optuna
def objective(trial):
    """Optuna objective function"""
    # Sample the LR on a log scale; rank and alpha in multiples of 8
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    rank = trial.suggest_int('lora_rank', 8, 64, step=8)
    alpha = trial.suggest_int('lora_alpha', 8, 64, step=8)
    batch = trial.suggest_categorical('batch_size', [2, 4, 8])
    # configure_lora is a placeholder for your own helper that returns a
    # fresh model with LoRA adapters attached.
    model = configure_lora(rank=rank, alpha=alpha)
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,  # your prepared training set
        args=TrainingArguments(
            output_dir=f"./optuna_trial_{trial.number}",
            learning_rate=lr,
            per_device_train_batch_size=batch,
            num_train_epochs=1
        )
    )
    result = trainer.train()
    return result.training_loss
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
print(f"Best params: {study.best_params}")
print(f"Best loss: {study.best_value}")
W&B Sweeps
import wandb
sweep_config = {
    'method': 'bayes',
    'metric': {
        # The Hugging Face W&B integration logs training loss as "train/loss"
        'name': 'train/loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            # Bounds are natural logs of the LR range
            'distribution': 'log_uniform',
            'min': -9.21,
            'max': -6.91
        },
        'lora_rank': {
            'values': [8, 16, 32, 64]
        },
        'batch_size': {
            'values': [2, 4, 8]
        }
    }
}
sweep_id = wandb.sweep(sweep_config, project="llm-tuning")
def train_sweep():
    wandb.init()
    config = wandb.config  # values chosen for this run, e.g. config.learning_rate
    trainer = SFTTrainer(...)  # build the trainer from config
    trainer.train()
wandb.agent(sweep_id, train_sweep, count=10)
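The `min` and `max` values for the `log_uniform` distribution are natural logarithms of the learning-rate bounds, which is easy to get wrong by hand. The sketch below derives them from the intended range; the helper name is an assumption for illustration.

```python
import math

def log_uniform_bounds(lr_min: float, lr_max: float) -> dict:
    """Build a log_uniform parameter entry from plain learning-rate bounds."""
    return {
        "distribution": "log_uniform",
        "min": round(math.log(lr_min), 2),
        "max": round(math.log(lr_max), 2),
    }

print(log_uniform_bounds(1e-4, 1e-3))
# {'distribution': 'log_uniform', 'min': -9.21, 'max': -6.91}
```

So the sweep above searches learning rates between roughly 1e-4 and 1e-3.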
Training Monitoring
Weights & Biases Integration
import wandb
# Record the run's hyperparameters so runs can be compared later
wandb.init(
    project="llm-finetuning",
    config={
        "learning_rate": 2e-4,
        "lora_rank": 16,
        "batch_size": 8,
        "model": "Llama-3.2-3B"
    }
)
training_args = TrainingArguments(
    output_dir="./outputs",
    report_to="wandb",
    logging_steps=10,
    run_name="medical-model-v1"
)
# Log any extra scalar you compute yourself; `value` is a placeholder
wandb.log({"custom_metric": value})
TensorBoard Integration
training_args = TrainingArguments(
    output_dir="./outputs",
    logging_dir="./logs",       # view with: tensorboard --logdir ./logs
    report_to="tensorboard",
    logging_steps=10
)
Loss Curve Interpretation
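import numpy as np
def analyze_training(log_history):
    """Analyze training progress"""
    # Pass trainer.state.log_history; entries without a 'loss' key are skipped
    losses = [log['loss'] for log in log_history if 'loss' in log]
    if len(losses) < 2:
        print("Not enough logged losses to analyze")
        return
    # Relative improvement across the last 10 logged losses
    recent_losses = losses[-10:]
    improvement = (recent_losses[0] - recent_losses[-1]) / recent_losses[0]
    if improvement < 0.01:
        print("⚠️ Training has plateaued")
        print("Consider: increase LR, train longer, add data")
    # Spread of the last 50 logged losses as a rough stability signal
    loss_std = np.std(losses[-50:])
    if loss_std > 0.1:
        print("⚠️ Training is unstable")
        print("Consider: reduce LR, clip gradients")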
Preventing Overfitting
Early Stopping
from transformers import EarlyStoppingCallback
training_args = TrainingArguments(
    evaluation_strategy="steps",   # spelled eval_strategy in newer transformers releases
    eval_steps=100,
    load_best_model_at_end=True,   # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False        # lower eval_loss is better
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # Stop after 3 consecutive evaluations without improvement
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
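The patience rule can be illustrated without any training at all. The sketch below is a simplified stand-in for the callback's logic, assuming a lower-is-better metric and a plain list of evaluation losses; the function name and the `min_delta` parameter are illustrative, and the real callback tracks state inside the trainer.

```python
from typing import Optional

def early_stop_index(eval_losses: list[float], patience: int = 3, min_delta: float = 0.0) -> Optional[int]:
    """Index of the evaluation at which training would stop, or None if it never stops."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            # New best: remember it and reset the counter.
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None

# Best loss (0.70) is reached at index 2; the next three evaluations do not beat it.
print(early_stop_index([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6], patience=3))  # 5
```

Note that the final value of 0.6 is never seen: a short patience can stop a run just before it would have improved, so pair it with an `eval_steps` interval that is not too noisy.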
Regularization Techniques
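# LoRA dropout: a small dropout rate on the adapter path
model = FastLanguageModel.get_peft_model(
    model,
    lora_dropout=0.05,
)
# Weight decay: penalizes large weights in the optimizer update
training_args = TrainingArguments(
    weight_decay=0.01,
)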
Data Augmentation
Validation Strategy
# Hold out 10% for validation using the Hugging Face datasets split helper
split = dataset.train_test_split(test_size=0.1, seed=42)
train_data, val_data = split["train"], split["test"]
training_args = TrainingArguments(
    evaluation_strategy="steps",   # spelled eval_strategy in newer transformers releases
    eval_steps=100,
    save_strategy="steps",         # must match the evaluation strategy
    save_steps=100,
    load_best_model_at_end=True
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
Advanced Techniques
Curriculum Learning
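def curriculum_training(model, tokenizer, dataset):
    """Train on easy examples first, then harder ones"""
    # Output length is a rough difficulty proxy; assumes an 'output' column.
    # Dataset.sort keeps a Dataset (Python's sorted() would return a plain list).
    with_len = dataset.map(lambda x: {"output_len": len(x["output"])})
    sorted_dataset = with_len.sort("output_len").remove_columns("output_len")
    n = len(sorted_dataset)
    stages = [
        (0, n // 3, 1),        # easiest third, 1 epoch
        (0, 2 * n // 3, 1),    # easiest two thirds, 1 epoch
        (0, n, 2)              # full dataset, 2 epochs
    ]
    for start, end, epochs in stages:
        print(f"Stage: examples {start}-{end}, {epochs} epochs")
        stage_dataset = sorted_dataset.select(range(start, end))
        # Each stage builds a new trainer, so the optimizer and LR schedule restart
        trainer = SFTTrainer(
            model=model,
            tokenizer=tokenizer,
            train_dataset=stage_dataset,
            args=TrainingArguments(output_dir="./curriculum", num_train_epochs=epochs)
        )
        trainer.train()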
Progressive Sequence Length
def progressive_training(model, tokenizer, dataset):
    """Increase sequence length during training"""
    stages = [
        (512, 1),    # (max_seq_length, epochs)
        (1024, 1),
        (2048, 2)
    ]
    for seq_len, epochs in stages:
        print(f"Training with max_seq_length={seq_len}")
        # Each stage builds a new trainer, so the optimizer and LR schedule restart
        trainer = SFTTrainer(
            model=model,
            tokenizer=tokenizer,
            train_dataset=dataset,
            max_seq_length=seq_len,
            args=TrainingArguments(output_dir="./progressive", num_train_epochs=epochs)
        )
        trainer.train()
Learning Rate Warmup + Decay
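# Recommended: short warmup (3% of steps), then cosine decay
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)
# Alternative: fixed warmup of 100 steps with the default linear decay
training_args = TrainingArguments(
    learning_rate=2e-4,
    warmup_steps=100,
)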
Memory Optimization
Reduce Memory Usage
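# Individual settings, each applied where it belongs
# (from_pretrained, get_peft_model, or TrainingArguments)
use_gradient_checkpointing="unsloth"   # recompute activations instead of storing them
per_device_train_batch_size=1          # smallest per-step batch
gradient_accumulation_steps=8          # keeps the effective batch size at 8
bf16=True                              # 16-bit compute on supported GPUs
target_modules=["q_proj", "v_proj"]    # adapt fewer modules
r=8                                    # lower LoRA rank
max_seq_length=1024                    # shorter sequences
optim="adamw_8bit"                     # 8-bit optimizer state
load_in_4bit=True                      # 4-bit base model weights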
Memory Calculation
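def estimate_memory(
    model_size_b: float,
    lora_rank: int,
    batch_size: int,
    seq_length: int,
    precision: str = "bf16",
    hidden_size: int = 4096,
    num_layers: int = 32
):
    """Rough estimate of GPU memory (GB) for 4-bit LoRA fine-tuning"""
    # 4-bit base weights: about 0.5 bytes per parameter
    model_memory = model_size_b * 0.5
    # Adapters on 7 modules in every layer, approximating each as hidden x hidden
    lora_params = lora_rank * 2 * hidden_size * 7 * num_layers
    bytes_per_param = 4 if precision == "fp32" else 2
    lora_memory = (lora_params * bytes_per_param) / 1e9
    # Crude activation heuristic; it tends to overestimate when gradient
    # checkpointing is enabled, so treat the total as an upper bound
    activation_memory = batch_size * seq_length * model_size_b * 0.002
    # Optimizer state, approximated as the same size as the adapters
    optimizer_memory = lora_memory * 1
    total = model_memory + lora_memory + activation_memory + optimizer_memory
    return {
        'model': model_memory,
        'lora': lora_memory,
        'activations': activation_memory,
        'optimizer': optimizer_memory,
        'total': total,
        'recommended_gpu': ('16GB' if total < 14 else '24GB' if total < 22
                            else '48GB' if total < 46 else '80GB')
    }
mem = estimate_memory(
    model_size_b=7,
    lora_rank=16,
    batch_size=2,
    seq_length=2048
)
print(f"Estimated memory: {mem['total']:.1f} GB")
print(f"Recommended GPU: {mem['recommended_gpu']}")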
Troubleshooting
Issue: Training is Slow
Solutions:
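use_gradient_checkpointing=False       # skip recomputation, if memory allows
per_device_train_batch_size=4          # larger batches improve GPU utilization
bf16=True                              # 16-bit compute on supported GPUs
eval_steps=500                         # evaluate less often
target_modules=["q_proj", "v_proj"]    # adapt fewer modules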
Issue: Out of Memory
Solutions:
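per_device_train_batch_size=1          # smallest per-step batch
gradient_accumulation_steps=8          # keep the effective batch size at 8
use_gradient_checkpointing="unsloth"   # recompute activations instead of storing them
max_seq_length=1024                    # shorter sequences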
Issue: Loss Not Decreasing
Solutions:
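learning_rate=5e-4           # raise the learning rate
r=32                         # give the adapters more capacity
num_train_epochs=5           # train longer
weight_decay=0.0             # remove regularization
lr_scheduler_type="linear"   # try a different schedule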
Issue: Model Overfitting
Solutions:
r=8                  # reduce adapter capacity
weight_decay=0.01    # add regularization
num_train_epochs=1   # train for fewer epochs
Best Practices
1. Start with Defaults
2. Monitor Everything
3. Save Checkpoints
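training_args = TrainingArguments(
    evaluation_strategy="epoch",   # must match save_strategy; spelled eval_strategy in newer transformers releases
    save_strategy="epoch",
    save_total_limit=3,            # keep only the 3 most recent checkpoints
    load_best_model_at_end=True    # needs an eval_dataset on the trainer
)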
4. Validate During Training
5. Document Experiments
Summary
Training optimization workflow:
- ✓ Start with optimal defaults
- ✓ Monitor training (W&B/TensorBoard)
- ✓ Adjust if needed (LR, batch size, LoRA rank)
- ✓ Prevent overfitting (validation, early stopping)
- ✓ Optimize memory (checkpointing, quantization)
- ✓ Fine-tune hyperparameters (grid search/Bayesian)
- ✓ Document everything
Remember: Most models work well with defaults. Only optimize if you have specific issues.
Default Recipe (a solid starting point for most runs):
- Learning rate: 2e-4
- LoRA rank: 16
- LoRA alpha: 16
- Batch size: 8 (effective)
- Scheduler: cosine with 3% warmup
- Precision: BF16
- Epochs: 3
- Target modules: All attention + MLP
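The recipe can be kept as plain dictionaries and unpacked into `TrainingArguments(**...)` and `get_peft_model(model, **...)`. The sketch below is one way to do that; the dictionary names and the `with_overrides` helper are assumptions for illustration, and precision is left out because it depends on the GPU.

```python
DEFAULT_TRAINING = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,  # 2 x 4 = effective batch size 8
    "num_train_epochs": 3,
}

DEFAULT_LORA = {
    "r": 16,
    "lora_alpha": 16,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
}

def with_overrides(base: dict, **overrides) -> dict:
    """Return a copy of `base` with selected values replaced; unknown keys are rejected."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"Unknown settings: {sorted(unknown)}")
    return {**base, **overrides}

# Example: same recipe with a lower rank, leaving the defaults untouched.
small_lora = with_overrides(DEFAULT_LORA, r=8, lora_alpha=8)
print(small_lora["r"], DEFAULT_LORA["r"])
```

Rejecting unknown keys catches typos such as `learning_rte` before a long run starts.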