
Name: Training Optimization
Author: ScientiaCapital

// Advanced techniques for optimizing LLM fine-tuning. Covers learning rates, LoRA configuration, batch sizes, gradient strategies, hyperparameter tuning, and monitoring. Use when fine-tuning models for best performance.


stars:1

forks:0

updated: 2025-11-07 18:01

SKILL.md

readonly

related-skills.json

Same repository

adaptive-workflows.md

from "ScientiaCapital/unsloth-mcp-server"

Self-learning workflow system that tracks what works best for your use cases. Records experiment results, suggests optimizations, creates custom templates, and builds a personal knowledge base. Use to learn from experience and optimize your LLM workflows over time.

2025-11-071

dataset-engineering.md

from "ScientiaCapital/unsloth-mcp-server"

Create, clean, and optimize datasets for LLM fine-tuning. Covers formats (Alpaca, ShareGPT, ChatML), synthetic data generation, quality assessment, and augmentation. Use when preparing data for training.

2025-11-071

model-deployment.md

from "ScientiaCapital/unsloth-mcp-server"

Export and deploy fine-tuned models to production. Covers GGUF/Ollama, vLLM, HuggingFace Hub, Docker, quantization, and platform selection. Use after fine-tuning when you need to deploy models efficiently.

2025-11-071

superbpe.md

from "ScientiaCapital/unsloth-mcp-server"

Train and use SuperBPE tokenizers for 20-33% token reduction across any project. Covers training, optimization, validation, and integration with any LLM framework. Use when you need efficient tokenization, want to reduce API costs, or maximize context windows.

2025-11-071

unsloth-tokenizer.md

from "ScientiaCapital/unsloth-mcp-server"

Analyze, compare, and work with tokenizers using Unsloth tools. Compare different tokenizers, analyze token efficiency, and integrate with Unsloth models. For SuperBPE training, see the 'superbpe' skill.

2025-11-071

unsloth-finetuning.md

from "ScientiaCapital/unsloth-mcp-server"

Fine-tune LLMs 2x faster with 80% less memory using Unsloth. Use when the user wants to fine-tune models like Llama, Mistral, Phi, or Gemma. Handles model loading, LoRA configuration, training, and model export.

2025-11-071

package.json

"author": "ScientiaCapital"

"repository": "ScientiaCapital/unsloth-mcp-server"


$ useful --forSOC

Data Scientist, Computer and Mathematical Occupations, 15-2051, L4


Training Optimization

Master advanced techniques for efficient, high-quality LLM fine-tuning.

Overview

Fine-tuning is an art. Optimize:

Learning rates - Schedulers, warmup, optimal values
LoRA configuration - Rank, alpha, target modules
Batch optimization - Size, accumulation, sequence length
Precision - FP16, BF16, mixed precision
Gradient strategies - Checkpointing, clipping, accumulation
Hyperparameter tuning - Grid search, Bayesian optimization
Monitoring - WandB, TensorBoard, loss curves
Quality - Prevent overfitting, improve convergence

Quick Start

Optimal Default Configuration

import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model (Llama 3.2 text models ship in 1B and 3B sizes; there is no 7B)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"  # Memory efficient
)

# Configure LoRA (optimal defaults)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank (8-64)
    lora_alpha=16,          # Alpha = rank typically works well
    lora_dropout=0,         # 0 for Unsloth (already efficient)
    target_modules=[        # All attention + MLP for best quality
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
    use_gradient_checkpointing="unsloth"
)

# Training arguments (optimal defaults)
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # Effective batch = 2*4 = 8
    num_train_epochs=3,
    learning_rate=2e-4,                   # 2e-4 is a sweet spot
    lr_scheduler_type="cosine",           # Cosine decay
    warmup_ratio=0.03,                    # 3% warmup
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),  # BF16 if available
    logging_steps=10,
    optim="adamw_8bit",                   # 8-bit AdamW (memory efficient)
    save_strategy="epoch",
    save_total_limit=3
)

# Train (`dataset` is your prepared training set, loaded beforehand)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    args=training_args
)

trainer.train()

Learning Rate Optimization

Finding Optimal LR (Learning Rate Finder)

import matplotlib.pyplot as plt
import numpy as np
from transformers import TrainingArguments
from trl import SFTTrainer

def find_learning_rate(load_model, dataset,
                       candidate_lrs=(1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3)):
    """Short LR sweep: train briefly at each candidate and compare losses.

    load_model is a user-supplied callable returning a fresh
    (model, tokenizer) pair, so every run starts from the same weights.
    """

    learning_rates = []
    losses = []
    subset = dataset.select(range(min(200, len(dataset))))  # Small subset

    # Try different learning rates
    for lr in candidate_lrs:
        print(f"Testing LR: {lr}")
        model, tokenizer = load_model()  # Fresh weights for a fair comparison

        args = TrainingArguments(
            output_dir=f"./lr_test_{lr}",
            learning_rate=lr,
            max_steps=100,
            per_device_train_batch_size=2,
            logging_steps=10,
            save_strategy="no",  # No checkpoints needed for a sweep
            report_to="none"
        )

        trainer = SFTTrainer(
            model=model,
            tokenizer=tokenizer,
            train_dataset=subset,
            args=args
        )

        result = trainer.train()
        learning_rates.append(lr)
        losses.append(result.training_loss)

    # Plot results
    plt.plot(learning_rates, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.savefig('lr_finder.png')

    # Pick the candidate where loss drops most steeply relative to its neighbours
    optimal_idx = int(np.argmin(np.gradient(losses)))
    optimal_lr = learning_rates[optimal_idx]

    print(f"Optimal LR: {optimal_lr}")
    return optimal_lr

Learning Rate Schedules

# 1. Cosine Decay (Recommended)
training_args = TrainingArguments(
    lr_scheduler_type="cosine",
    learning_rate=2e-4,
    warmup_ratio=0.03  # 3% warmup, then cosine decay
)

# 2. Linear Decay
training_args = TrainingArguments(
    lr_scheduler_type="linear",
    learning_rate=2e-4,
    warmup_steps=100
)

# 3. Constant with Warmup
training_args = TrainingArguments(
    lr_scheduler_type="constant_with_warmup",
    learning_rate=2e-4,
    warmup_ratio=0.05
)

# 4. Polynomial Decay
training_args = TrainingArguments(
    lr_scheduler_type="polynomial",
    learning_rate=2e-4,
    warmup_ratio=0.03
)

# 5. Cosine with Restarts
training_args = TrainingArguments(
    lr_scheduler_type="cosine_with_restarts",
    learning_rate=2e-4,
    warmup_ratio=0.03
)

LR Guidelines by Model Size

Model Size	Learning Rate	Warmup Steps	Batch Size
1-3B	5e-4 to 1e-3	50-100	8-16
7B	2e-4 to 5e-4	100-200	4-8
13B	1e-4 to 2e-4	200-300	2-4
30B+	5e-5 to 1e-4	300-500	1-2

LoRA Configuration

LoRA Rank Optimization

# Rank (r) controls capacity and parameter count

# Low rank (r=4-8): Fast, memory efficient, less capacity
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,  # Alpha typically 1-2x rank
    # Use for: Simple tasks, limited data
)

# Medium rank (r=16-32): Balanced (recommended)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    # Use for: Most tasks, default choice
)

# High rank (r=64-128): Max capacity, slower
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    # Use for: Complex tasks, lots of data
)

LoRA Alpha Guidelines

# Alpha controls the scaling of LoRA updates

# Conservative (alpha = rank/2)
lora_alpha = 8  # for r=16
# Result: Slow adaptation, stable

# Standard (alpha = rank)
lora_alpha = 16  # for r=16
# Result: Balanced (recommended)

# Aggressive (alpha = 2*rank)
lora_alpha = 32  # for r=16
# Result: Fast adaptation, may be unstable

Target Modules Selection

# Minimal (attention only): Fast, less capacity
target_modules = ["q_proj", "v_proj"]

# Standard (all attention): Good balance
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Extended (attention + MLP): Best quality (recommended)
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"
]

# Full (everything): Maximum capacity
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "embed_tokens", "lm_head"
]

LoRA Parameter Count

def calculate_lora_params(base_model_size, rank, num_target_modules,
                          hidden_size=4096, num_layers=32):
    """
    Estimate trainable parameters with LoRA
    """
    # Each adapted projection adds rank * (d_in + d_out) parameters, and the
    # adapter is repeated in every transformer layer.
    # For a 7B model (hidden size 4096, 32 layers) with r=16 and the
    # 4 attention projections: ~16.8M trainable parameters (~0.24% of base)

    # Approximation: treats every projection as hidden_size x hidden_size,
    # which underestimates the wider MLP projections
    params_per_module = rank * 2 * hidden_size
    total_lora_params = params_per_module * num_target_modules * num_layers

    return {
        "lora_params": total_lora_params,
        "base_params": base_model_size * 1e9,
        "trainable_percent": (total_lora_params / (base_model_size * 1e9)) * 100
    }

# Example
params = calculate_lora_params(
    base_model_size=7,      # 7B model
    rank=16,
    num_target_modules=7    # All attention + MLP
)
print(f"Trainable: {params['trainable_percent']:.2f}%")
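
For a tighter count than the estimate above, the per-module shapes can be spelled out. The following is a sketch under stated assumptions: it uses Llama-2-7B-style dimensions (hidden size 4096, MLP width 11008, 32 layers, no grouped-query attention), and the function name `lora_param_count` is hypothetical. Models with different shapes or grouped-query attention will give different totals.

```python
def lora_param_count(
    rank: int,
    hidden_size: int = 4096,        # Assumed: Llama-2-7B-style hidden size
    intermediate_size: int = 11008, # Assumed: Llama-2-7B-style MLP width
    num_layers: int = 32,           # Assumed: 32 decoder layers
    target_modules=("q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"),
) -> int:
    """Count LoRA parameters: each adapted projection adds r * (d_in + d_out)."""
    # (d_in, d_out) of each projection inside one decoder layer
    shapes = {
        "q_proj": (hidden_size, hidden_size),
        "k_proj": (hidden_size, hidden_size),
        "v_proj": (hidden_size, hidden_size),
        "o_proj": (hidden_size, hidden_size),
        "gate_proj": (hidden_size, intermediate_size),
        "up_proj": (hidden_size, intermediate_size),
        "down_proj": (intermediate_size, hidden_size),
    }
    per_layer = sum(rank * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


attention_only = lora_param_count(16, target_modules=("q_proj", "k_proj", "v_proj", "o_proj"))
all_linear = lora_param_count(16)
print(f"Attention only: {attention_only:,}")
print(f"Attention + MLP: {all_linear:,}")
```

Under these assumptions, adding the three MLP projections more than doubles the adapter size, because those projections are wider than the attention ones.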

Batch Size & Gradient Accumulation

Effective Batch Size

# Effective batch = per_device_batch * gradient_accumulation * num_gpus

# Example 1: Limited VRAM
training_args = TrainingArguments(
    per_device_train_batch_size=1,    # Very small
    gradient_accumulation_steps=8,    # Accumulate 8 steps
    # Effective batch = 1 * 8 = 8
)

# Example 2: More VRAM
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    # Effective batch = 4 * 2 = 8 (same effective batch)
)

# Example 3: Multi-GPU
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    # With 2 GPUs: 2 * 2 * 2 = 8
)

Optimal Batch Size Guidelines

Model Size	Min Effective Batch	Recommended	Max (if possible)
1-3B	4	8-16	32
7B	8	16-32	64
13B	16	32-64	128
30B+	32	64-128	256

Sequence Length Optimization

# Balance memory, speed, and quality

# Short sequences (faster, less memory)
max_seq_length = 512
# Use for: Short-form content, Q&A

# Medium sequences (balanced)
max_seq_length = 2048
# Use for: Most tasks (recommended)

# Long sequences (slower, more memory)
max_seq_length = 4096
# Use for: Long-form content, documents

# Very long (requires optimization)
max_seq_length = 8192
# Use for: Full documents, requires packing

Dynamic Batch Size

# Adjust batch size based on sequence length
def get_dynamic_batch_size(seq_length, gpu_memory=24):
    """Calculate optimal batch size for sequence length"""

    if seq_length <= 512:
        return 8 if gpu_memory >= 16 else 4
    elif seq_length <= 1024:
        return 4 if gpu_memory >= 16 else 2
    elif seq_length <= 2048:
        return 2 if gpu_memory >= 24 else 1
    else:  # > 2048
        return 1

# Usage
batch_size = get_dynamic_batch_size(
    seq_length=2048,
    gpu_memory=24  # GB
)

Mixed Precision Training

BF16 vs FP16

import torch

# BF16 (preferred if available)
training_args = TrainingArguments(
    bf16=torch.cuda.is_bf16_supported(),  # Auto-detect
    # Better numerical stability than FP16
    # Same memory savings as FP16
    # Recommended for: A100, H100, 4090
)

# FP16 (fallback)
training_args = TrainingArguments(
    fp16=not torch.cuda.is_bf16_supported(),
    # fp16_opt_level ("O0"-"O3") only applies to the NVIDIA Apex AMP
    # backend; the default native AMP backend ignores it
    # Good for: V100, older GPUs
)

# FP32 (full precision, not recommended)
# Use only for debugging
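
To make the stability difference concrete, the sketch below rounds a few values to each format using only the standard library. It is an illustration, not how GPUs are implemented: `to_bf16` (a hypothetical helper) truncates a float32 to its top 16 bits, whereas real hardware normally rounds to nearest.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision (FP16)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude exceeds the FP16 maximum (65504)
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Approximate BF16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# A large value, a tiny gradient-sized value, and an ordinary value
for value in (70000.0, 1e-8, 0.1):
    print(f"{value!r:>8}: fp16={to_fp16(value)!r}, bf16={to_bf16(value)!r}")
```

FP16 overflows on the large value and flushes the tiny one to zero, which is why FP16 training relies on loss scaling. BF16 keeps both within range at the cost of fewer mantissa bits.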

Gradient Scaler (for FP16)

# Prevent gradient underflow with FP16
# The Trainer applies dynamic loss scaling itself when fp16=True,
# so no GradScaler needs to be created by hand

training_args = TrainingArguments(
    fp16=True,
    fp16_full_eval=True,  # Also run evaluation in FP16 (saves memory, may shift metrics)
)

# Unsloth handles this automatically

Gradient Optimization

Gradient Checkpointing

# Trade computation for memory
model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing="unsloth"  # Unsloth-optimized
    # Saves ~30-50% memory
    # Adds ~10-20% training time
)

# Without gradient checkpointing (more memory, faster)
model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing=False
)

Gradient Clipping

# Prevent exploding gradients
training_args = TrainingArguments(
    max_grad_norm=1.0,  # Clip gradients above this norm
    # Lower (0.5): More conservative
    # Default (1.0): Standard (recommended)
    # Higher (2.0): Less clipping
)

# Disable clipping
training_args = TrainingArguments(
    max_grad_norm=0.0  # No clipping
)
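
The effect of `max_grad_norm` can be sketched in a few lines: the L2 norm is computed over all gradients together, and every gradient is scaled by the same factor when that norm exceeds the limit. This is a plain-Python illustration of global-norm clipping on a flat list of numbers; the function name `clip_grad_norm` and the `eps` default are chosen here for the example.

```python
import math


def clip_grad_norm(grads, max_norm: float, eps: float = 1e-6):
    """Scale a flat list of gradients so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients are left untouched
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before}, norm after: {norm_after:.4f}")
```

Because all gradients share one scale factor, clipping changes the step size but not the direction of the update.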

Gradient Accumulation Steps

# Accumulate gradients for larger effective batch

# Calculate accumulation steps
def calculate_accumulation(
    target_batch_size: int,
    per_device_batch: int,
    num_gpus: int = 1
):
    """Calculate gradient accumulation steps"""
    # Floor division; never return fewer than 1 accumulation step
    return max(1, target_batch_size // (per_device_batch * num_gpus))

# Example: Want batch=32, have 1 GPU, can fit batch=4
accumulation_steps = calculate_accumulation(
    target_batch_size=32,
    per_device_batch=4,
    num_gpus=1
)  # Returns 8

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8
)
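
Why accumulation gives the same update as a larger batch can be checked on a toy problem. The sketch below uses a one-parameter least-squares model in plain Python; `mse_gradient` and `accumulated_gradient` are illustrative names. It assumes equal-sized micro-batches, and with variable token counts per batch the match is only approximate unless the loss is normalized over all tokens.

```python
def mse_gradient(w: float, batch) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w: float, data, micro_batch_size: int) -> float:
    """Average micro-batch gradients, as gradient accumulation does."""
    micro_batches = [
        data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Each micro-batch gradient is divided by the number of accumulation steps
        total += mse_gradient(w, micro_batch) / steps
    return total


data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = mse_gradient(0.5, data)                                      # One batch of 8
accumulated = accumulated_gradient(0.5, data, micro_batch_size=2)   # 4 steps of 2
print(full, accumulated)
```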

Hyperparameter Tuning

Grid Search

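from itertools import product

from transformers import TrainingArguments
from trl import SFTTrainer

# Define search space (3 * 3 * 3 * 2 = 54 training runs)
search_space = {
    'learning_rate': [1e-4, 2e-4, 5e-4],
    'lora_rank': [8, 16, 32],
    'lora_alpha': [8, 16, 32],
    'batch_size': [4, 8]
}

# Grid search
best_loss = float('inf')
best_params = None

for lr, rank, alpha, batch in product(*search_space.values()):
    print(f"Testing: lr={lr}, rank={rank}, alpha={alpha}, batch={batch}")

    # Configure model
    # configure_lora is a user-defined helper that loads a fresh base model
    # and wraps it with FastLanguageModel.get_peft_model(r=rank, lora_alpha=alpha)
    model = configure_lora(rank=rank, alpha=alpha)

    # Train (train_dataset and val_dataset are prepared beforehand)
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        args=TrainingArguments(
            output_dir=f"./grid_lr{lr}_r{rank}_a{alpha}_b{batch}",
            learning_rate=lr,
            per_device_train_batch_size=batch,
            num_train_epochs=1,  # Quick test
            report_to="none"
        )
    )

    trainer.train()

    # Compare on held-out loss; training loss alone rewards overfitting
    eval_loss = trainer.evaluate()["eval_loss"]

    if eval_loss < best_loss:
        best_loss = eval_loss
        best_params = (lr, rank, alpha, batch)

print(f"Best params: {best_params}")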
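
Before launching a grid search it helps to see how many runs the grid implies. The sketch below expands the same search space into one dictionary per run without training anything. `expand_grid` is a helper name introduced here for illustration.

```python
from itertools import product


def expand_grid(search_space: dict) -> list:
    """Return one {name: value} dict per combination in the search space."""
    names = list(search_space)
    return [dict(zip(names, values)) for values in product(*search_space.values())]


search_space = {
    'learning_rate': [1e-4, 2e-4, 5e-4],
    'lora_rank': [8, 16, 32],
    'lora_alpha': [8, 16, 32],
    'batch_size': [4, 8],
}

configs = expand_grid(search_space)
print(f"{len(configs)} training runs required")
print(configs[0])
```

The run count is the product of the list lengths, so adding one more value to any parameter multiplies the total cost. That growth is the main reason to switch to random or Bayesian search for larger spaces.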

Bayesian Optimization (Optuna)

import optuna
from transformers import TrainingArguments
from trl import SFTTrainer

def objective(trial):
    """Optuna objective function"""

    # Suggest hyperparameters
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    rank = trial.suggest_int('lora_rank', 8, 64, step=8)
    alpha = trial.suggest_int('lora_alpha', 8, 64, step=8)
    batch = trial.suggest_categorical('batch_size', [2, 4, 8])

    # Configure and train
    # configure_lora is a user-defined helper that returns a fresh LoRA model;
    # train_dataset and val_dataset are prepared beforehand
    model = configure_lora(rank=rank, alpha=alpha)

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        args=TrainingArguments(
            output_dir=f"./optuna_trial_{trial.number}",
            learning_rate=lr,
            per_device_train_batch_size=batch,
            num_train_epochs=1,
            report_to="none"
        )
    )

    trainer.train()

    # Optimize held-out loss rather than training loss
    return trainer.evaluate()["eval_loss"]

# Run optimization
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

print(f"Best params: {study.best_params}")
print(f"Best loss: {study.best_value}")

W&B Sweeps

import wandb

# Define sweep config
sweep_config = {
    'method': 'bayes',  # or 'grid', 'random'
    'metric': {
        'name': 'train_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform',
            'min': -9.21,  # ln(1e-4)
            'max': -6.91   # ln(1e-3)
        },
        'lora_rank': {
            'values': [8, 16, 32, 64]
        },
        'batch_size': {
            'values': [2, 4, 8]
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="llm-tuning")

# Run sweep
def train_sweep():
    wandb.init()
    config = wandb.config

    # Train with config
    trainer = SFTTrainer(...)
    trainer.train()

wandb.agent(sweep_id, train_sweep, count=10)
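
The `min` and `max` values in the sweep config are natural-log exponents rather than raw learning rates, as the `ln(1e-4)` comments indicate. A small helper can derive them from the range you actually want. `log_uniform_bounds` is a hypothetical name used only for this sketch.

```python
import math


def log_uniform_bounds(low: float, high: float) -> dict:
    """Convert a value range into natural-log exponents, rounded to 2 places."""
    return {"min": round(math.log(low), 2), "max": round(math.log(high), 2)}


bounds = log_uniform_bounds(1e-4, 1e-3)
print(bounds)

# Sanity check: exponentiating the bounds recovers the intended range
print(math.exp(bounds["min"]), math.exp(bounds["max"]))
```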

Training Monitoring

Weights & Biases Integration

import wandb

# Initialize
wandb.init(
    project="llm-finetuning",
    config={
        "learning_rate": 2e-4,
        "lora_rank": 16,
        "batch_size": 8,
        "model": "Llama-3.2-3B"  # Llama 3.2 text models come in 1B and 3B sizes
    }
)

# Training args with W&B
training_args = TrainingArguments(
    output_dir="./outputs",
    report_to="wandb",  # Enable W&B logging
    logging_steps=10,
    run_name="medical-model-v1"
)

# W&B will automatically log:
# - Training loss
# - Learning rate
# - Gradient norms
# - System metrics (GPU, CPU, RAM)

# Custom logging
wandb.log({"custom_metric": value})

TensorBoard Integration

training_args = TrainingArguments(
    output_dir="./outputs",
    logging_dir="./logs",
    report_to="tensorboard",
    logging_steps=10
)

# View with: tensorboard --logdir=./logs

Loss Curve Interpretation

# Monitoring during training

# Good signs:
# ✓ Steady decrease in loss
# ✓ Smooth curve (no spikes)
# ✓ Validation loss tracks training loss
# ✓ Learning rate schedule working

# Warning signs:
# ✗ Loss plateaus early → increase LR or model capacity
# ✗ Loss spikes → reduce LR or clip gradients
# ✗ Val loss >> train loss → overfitting (see below)
# ✗ Loss explodes → reduce LR, check data

import numpy as np

def analyze_training(log_history):
    """Analyze training progress (pass trainer.state.log_history)"""

    losses = [log['loss'] for log in log_history if 'loss' in log]
    if len(losses) < 2:
        print("Not enough logged losses to analyze")
        return

    # Check convergence: relative improvement over the last 10 logged losses
    recent_losses = losses[-10:]
    improvement = (recent_losses[0] - recent_losses[-1]) / recent_losses[0]

    if improvement < 0.01:
        print("⚠️ Training has plateaued")
        print("Consider: increase LR, train longer, add data")

    # Check stability: spread of the last 50 logged losses
    loss_std = np.std(losses[-50:])
    if loss_std > 0.1:
        print("⚠️ Training is unstable")
        print("Consider: reduce LR, clip gradients")

    # Check overfitting
    # (Requires validation loss - see below)
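
The overfitting check left open above needs both loss series. One simple heuristic, sketched here in plain Python, compares the most recent window of each series with the window before it and flags the case where training loss keeps falling while validation loss rises. The function name, window size, and tolerance are assumptions for illustration. With a Hugging Face Trainer, the two lists could be collected from the `loss` and `eval_loss` entries of the log history.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def is_overfitting(train_losses, val_losses, window: int = 3,
                   tolerance: float = 0.0) -> bool:
    """Flag overfitting when validation loss rises while training loss falls.

    Compares the mean of the last `window` points with the mean of the
    `window` points before them, for each series.
    """
    if min(len(train_losses), len(val_losses)) < 2 * window:
        return False  # Not enough history to compare two windows
    train_delta = mean(train_losses[-window:]) - mean(train_losses[-2 * window:-window])
    val_delta = mean(val_losses[-window:]) - mean(val_losses[-2 * window:-window])
    return train_delta < 0 and val_delta > tolerance


train = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
val = [1.0, 0.95, 0.9, 0.95, 1.0, 1.05]
print(is_overfitting(train, val))
```

Averaging over a window keeps a single noisy evaluation from triggering the warning. A larger `tolerance` makes the check more conservative.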

Preventing Overfitting

Early Stopping

from transformers import EarlyStoppingCallback

# Stop if validation loss doesn't improve
training_args = TrainingArguments(
    output_dir="./outputs",
    evaluation_strategy="steps",  # Named eval_strategy in newer transformers releases
    eval_steps=100,
    save_strategy="steps",        # Saving must follow the evaluation schedule
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
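
The patience setting can be understood with a simplified model of the stopping rule: count consecutive evaluations that fail to beat the best loss so far, and stop once the count reaches the patience value. The sketch below is a plain-Python approximation for building intuition, not the library's implementation, and `early_stopping_index` is a name invented for the example.

```python
def early_stopping_index(eval_losses, patience: int = 3, threshold: float = 0.0):
    """Return the evaluation index at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(eval_losses):
        if loss < best - threshold:
            # New best: remember it and reset the patience counter
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return index
    return None


history = [1.0, 0.9, 0.95, 0.96, 0.97, 0.5]
print(early_stopping_index(history, patience=3))
```

In this history the run stops before the late improvement to 0.5 is ever seen, which shows the trade-off: a small patience saves compute but can cut off a run that would have recovered.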

Regularization Techniques

# 1. LoRA Dropout (if not using Unsloth optimization)
model = FastLanguageModel.get_peft_model(
    model,
    lora_dropout=0.05,  # 5% dropout
    # Note: Unsloth recommends 0 for optimal speed
)

# 2. Weight Decay
training_args = TrainingArguments(
    weight_decay=0.01,  # Decoupled weight decay (AdamW), similar in effect to L2 regularization
    # Default: 0.0
    # Typical range: 0.01-0.1
)

# 3. Gradient Noise
# Adds noise to gradients (reduces overfitting)
# Not directly supported, but can be implemented

Data Augmentation

# See dataset-engineering skill for:
# - Paraphrase augmentation
# - Back-translation
# - Synthetic data generation

# During training, use data augmentation to increase diversity

Validation Strategy

# Monitor validation loss

training_args = TrainingArguments(
    evaluation_strategy="steps",
    eval_steps=100,  # Evaluate every 100 steps
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True  # Load best checkpoint
)

# Train/val split (dataset is a Hugging Face datasets.Dataset)
split = dataset.train_test_split(test_size=0.1, seed=42)
train_data, val_data = split["train"], split["test"]

trainer = SFTTrainer(
    train_dataset=train_data,
    eval_dataset=val_data,
    # ...
)

Advanced Techniques

Curriculum Learning

def curriculum_training(model, tokenizer, dataset):
    """Train on easy examples first, then harder ones"""

    # Sort by difficulty (e.g., output length), keeping a datasets.Dataset
    # so each stage can be passed straight to SFTTrainer
    lengths = [len(output) for output in dataset['output']]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    sorted_dataset = dataset.select(order)

    # Train in stages
    stages = [
        (0, len(sorted_dataset) // 3, 1),      # Easy, 1 epoch
        (0, 2 * len(sorted_dataset) // 3, 1),  # Easy+Medium, 1 epoch
        (0, len(sorted_dataset), 2)             # All, 2 epochs
    ]

    for start, end, epochs in stages:
        print(f"Stage: examples {start}-{end}, {epochs} epochs")

        stage_dataset = sorted_dataset.select(range(start, end))

        # Note: a new trainer per stage restarts the optimizer and LR schedule
        trainer = SFTTrainer(
            model=model,
            tokenizer=tokenizer,
            train_dataset=stage_dataset,
            args=TrainingArguments(
                output_dir="./outputs/curriculum",
                num_train_epochs=epochs
            )
        )

        trainer.train()

Progressive Sequence Length

# Start with shorter sequences, gradually increase

def progressive_training(model, tokenizer, dataset):
    """Increase sequence length during training"""

    stages = [
        (512, 1),   # 512 tokens, 1 epoch
        (1024, 1),  # 1024 tokens, 1 epoch
        (2048, 2)   # 2048 tokens, 2 epochs
    ]

    for seq_len, epochs in stages:
        print(f"Training with max_seq_length={seq_len}")

        trainer = SFTTrainer(
            model=model,
            tokenizer=tokenizer,
            train_dataset=dataset,
            max_seq_length=seq_len,
            args=TrainingArguments(num_train_epochs=epochs)
        )

        trainer.train()

Learning Rate Warmup + Decay

# Optimal schedule for most cases
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,  # 3% of total steps
    # This creates:
    # 1. Linear warmup (0 → 2e-4) for 3% of steps
    # 2. Cosine decay (2e-4 → 0) for remaining 97%
)

# Manual warmup steps
training_args = TrainingArguments(
    learning_rate=2e-4,
    warmup_steps=100,  # Explicit number
    # Instead of warmup_ratio
)
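
The shape of this schedule is easy to reproduce by hand. The sketch below computes the learning rate at a given optimizer step for linear warmup followed by cosine decay to zero, using only the standard library. `lr_at_step` is an illustrative helper, and its rounding of the warmup length (ceiling of `total_steps * warmup_ratio`) is an assumption that may differ slightly from a given trainer version.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-4,
               warmup_ratio: float = 0.03) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print a few points of a 1000-step run with the default 3% warmup
for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, total_steps=1000):.2e}")
```

Plotting these values for your own `total_steps` is a quick way to confirm that the warmup is neither negligibly short nor a large share of a short run.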

Memory Optimization

Reduce Memory Usage

# 1. Gradient checkpointing
use_gradient_checkpointing="unsloth"

# 2. Smaller batch size + gradient accumulation
per_device_train_batch_size=1
gradient_accumulation_steps=8

# 3. Lower precision
bf16=True  # or fp16=True

# 4. Optimize target modules (fewer = less memory)
target_modules=["q_proj", "v_proj"]  # Instead of all 7

# 5. Lower LoRA rank
r=8  # Instead of r=16 or r=32

# 6. Reduce sequence length
max_seq_length=1024  # Instead of 2048

# 7. Use 8-bit optimizer
optim="adamw_8bit"

# 8. Quantized model
load_in_4bit=True  # Already using this with Unsloth

Memory Calculation

def estimate_memory(
    model_size_b: float,
    lora_rank: int,
    batch_size: int,
    seq_length: int,
    precision: str = "bf16",
    hidden_size: int = 4096,
    num_layers: int = 32
):
    """Rough lower-bound estimate of GPU memory (GB) for 4-bit LoRA training"""

    bytes_per_value = 4 if precision == "fp32" else 2

    # Model memory (4-bit quantized)
    model_memory = model_size_b * 0.5  # GB (4-bit = 0.5 bytes/param)

    # LoRA adapters on 7 modules in every layer (approximate)
    lora_params = lora_rank * 2 * hidden_size * 7 * num_layers
    lora_memory = (lora_params * 2) / 1e9  # FP16

    # Activations with gradient checkpointing: roughly one hidden state
    # is kept per layer. Real usage is higher (attention, MLP, logits).
    activation_memory = (
        batch_size * seq_length * hidden_size * num_layers * bytes_per_value
    ) / 1e9

    # Optimizer states (8-bit Adam: two 1-byte states per trainable param)
    optimizer_memory = (lora_params * 2) / 1e9

    total = model_memory + lora_memory + activation_memory + optimizer_memory

    return {
        'model': model_memory,
        'lora': lora_memory,
        'activations': activation_memory,
        'optimizer': optimizer_memory,
        'total': total,
        'recommended_gpu': '16GB' if total < 14 else '24GB' if total < 22 else '48GB'
    }

# Example
mem = estimate_memory(
    model_size_b=7,
    lora_rank=16,
    batch_size=2,
    seq_length=2048
)
print(f"Estimated memory: {mem['total']:.1f} GB")
print(f"Recommended GPU: {mem['recommended_gpu']}")

Troubleshooting

Issue: Training is Slow

Solutions:

# 1. Disable gradient checkpointing (if you have VRAM)
use_gradient_checkpointing=False

# 2. Increase batch size
per_device_train_batch_size=4  # Instead of 2

# 3. Use BF16/FP16
bf16=True

# 4. Reduce validation frequency
eval_steps=500  # Instead of 100

# 5. Use fewer target modules
target_modules=["q_proj", "v_proj"]  # Instead of all 7

# 6. Use Unsloth optimizations (already included)

Issue: Out of Memory

Solutions:

# See "Memory Optimization" section above

# Quick fix:
per_device_train_batch_size=1
gradient_accumulation_steps=8
use_gradient_checkpointing="unsloth"
max_seq_length=1024  # Instead of 2048

Issue: Loss Not Decreasing

Solutions:

# 1. Increase learning rate
learning_rate=5e-4  # Instead of 2e-4

# 2. Increase LoRA rank
r=32  # Instead of r=16

# 3. Train longer
num_train_epochs=5  # Instead of 3

# 4. Check data quality (see dataset-engineering skill)

# 5. Remove weight decay
weight_decay=0.0

# 6. Try different scheduler
lr_scheduler_type="linear"  # Instead of cosine

Issue: Model Overfitting

Solutions:

# 1. Add more training data
# See dataset-engineering skill

# 2. Reduce model capacity
r=8  # Instead of r=16

# 3. Add weight decay
weight_decay=0.01

# 4. Use validation set + early stopping
# See "Preventing Overfitting" section

# 5. Train for fewer epochs
num_train_epochs=1  # Instead of 3

# 6. Use data augmentation
# See dataset-engineering skill

Best Practices

1. Start with Defaults

# Use the "Optimal Default Configuration" from Quick Start
# Only tune if you have specific issues

2. Monitor Everything

# Use W&B or TensorBoard
# Watch: loss, LR, gradient norms, memory usage

3. Save Checkpoints

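training_args = TrainingArguments(
    output_dir="./outputs",
    evaluation_strategy="epoch",  # load_best_model_at_end needs evaluation on the same schedule as saving
    save_strategy="epoch",
    save_total_limit=3,  # Keep last 3 checkpoints
    load_best_model_at_end=True
)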

4. Validate During Training

# Always use validation set
# Catch overfitting early

5. Document Experiments

# Track what you tried
# Use W&B or experiment tracking tool
# Record: hyperparams, results, observations

Summary

Training optimization workflow:

✓ Start with optimal defaults
✓ Monitor training (W&B/TensorBoard)
✓ Adjust if needed (LR, batch size, LoRA rank)
✓ Prevent overfitting (validation, early stopping)
✓ Optimize memory (checkpointing, quantization)
✓ Fine-tune hyperparameters (grid search/Bayesian)
✓ Document everything

Remember: The defaults are a sound starting point for most LoRA fine-tunes. Tune further only when monitoring shows a specific problem.

Default Recipe (a reasonable starting point for most tasks):

Learning rate: 2e-4
LoRA rank: 16
LoRA alpha: 16
Batch size: 8 (effective)
Scheduler: cosine with 3% warmup
Precision: BF16
Epochs: 3
Target modules: All attention + MLP
