Run any Skill in Manus with one click

$pwd:

unsloth-finetuning

Name: Unsloth Finetuning
Author: ScientiaCapital

// Fine-tune LLMs 2x faster with 80% less memory using Unsloth. Use when the user wants to fine-tune models like Llama, Mistral, Phi, or Gemma. Handles model loading, LoRA configuration, training, and model export.

Run Skill in Manus

$ git log --oneline --stat

stars:1

forks:0

updated:November 7, 2025 at 17:32

SKILL.md

readonly

related-skills.json

same repository

adaptive-workflows.md

from "ScientiaCapital/unsloth-mcp-server"

Self-learning workflow system that tracks what works best for your use cases. Records experiment results, suggests optimizations, creates custom templates, and builds a personal knowledge base. Use to learn from experience and optimize your LLM workflows over time.

2025-11-071

dataset-engineering.md

from "ScientiaCapital/unsloth-mcp-server"

Create, clean, and optimize datasets for LLM fine-tuning. Covers formats (Alpaca, ShareGPT, ChatML), synthetic data generation, quality assessment, and augmentation. Use when preparing data for training.

2025-11-071

model-deployment.md

from "ScientiaCapital/unsloth-mcp-server"

Export and deploy fine-tuned models to production. Covers GGUF/Ollama, vLLM, HuggingFace Hub, Docker, quantization, and platform selection. Use after fine-tuning when you need to deploy models efficiently.

2025-11-071

training-optimization.md

from "ScientiaCapital/unsloth-mcp-server"

Advanced techniques for optimizing LLM fine-tuning. Covers learning rates, LoRA configuration, batch sizes, gradient strategies, hyperparameter tuning, and monitoring. Use when fine-tuning models for best performance.

2025-11-071

superbpe.md

from "ScientiaCapital/unsloth-mcp-server"

Train and use SuperBPE tokenizers for 20-33% token reduction across any project. Covers training, optimization, validation, and integration with any LLM framework. Use when you need efficient tokenization, want to reduce API costs, or maximize context windows.

2025-11-071

unsloth-tokenizer.md

from "ScientiaCapital/unsloth-mcp-server"

Analyze, compare, and work with tokenizers using Unsloth tools. Compare different tokenizers, analyze token efficiency, and integrate with Unsloth models. For SuperBPE training, see the 'superbpe' skill.

2025-11-071

package.json

"author": "ScientiaCapital"

"repository": "ScientiaCapital/unsloth-mcp-server"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Data ScientistsComputer and Mathematical Occupations15-2051L4

name	unsloth-finetuning
description	Fine-tune LLMs 2x faster with 80% less memory using Unsloth. Use when the user wants to fine-tune models like Llama, Mistral, Phi, or Gemma. Handles model loading, LoRA configuration, training, and model export.

Unsloth Fine-Tuning

Expert guidance for fine-tuning Large Language Models using Unsloth's optimized library.

Core Capabilities

Load models with 4-bit quantization and gradient checkpointing
Configure LoRA/QLoRA for efficient fine-tuning
Train on custom or Hugging Face datasets
Export models to GGUF, Ollama, vLLM, or Hugging Face formats
Monitor training with progress tracking
Optimize for different hardware configurations

Quick Start

1. Load a Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)

Supported Models:

Llama 3.3 (70B), 3.2 (1B, 3B), 3.1 (8B)
Mistral v0.3 (7B), Small Instruct
Phi 3.5 mini, Phi 3 medium
Gemma 2 (9B, 27B)
Qwen 2.5 (7B)

2. Apply LoRA

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank (8-64)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,          # Scaling factor
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=2048
)

3. Configure Training

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="./output",
        optim="adamw_8bit",
        seed=3407
    )
)

4. Train

trainer.train()

5. Export

# GGUF format
model.save_pretrained_gguf(
    "model",
    tokenizer,
    quantization_method="q4_k_m"
)

# Hugging Face format
model.save_pretrained("./hf_model")
tokenizer.save_pretrained("./hf_model")

Performance Optimization

Memory Optimization

Out of Memory? Try:

Reduce per_device_train_batch_size to 1
Increase gradient_accumulation_steps to 8
Reduce max_seq_length to 1024
Use smaller model (1B instead of 3B)

Speed Optimization

Training too slow? Check:

GPU is being used: nvidia-smi
Batch size isn't too small
Using load_in_4bit=True
Using use_gradient_checkpointing="unsloth"

Quality Optimization

Poor results? Adjust:

Increase max_steps to 500-1000
Try learning rates: 1e-4, 2e-4, 5e-4
Increase dataset quality/size
Use larger model if resources allow

Hardware Requirements

Minimum (1B models)

GPU: RTX 3060 (12GB VRAM)
RAM: 16GB
Training time: 20-40 min for 100 steps

Recommended (3B-7B models)

GPU: RTX 4090 or A100
RAM: 32GB+
Training time: 10-30 min for 100 steps

Budget (Small experiments)

GPU: RTX 3060 Ti (8GB)
Use: Llama-1B or Phi-3-mini
Reduce batch_size=1, max_seq_length=1024

Common Patterns

Pattern 1: Quick Prototype

# Minimal setup for fast experimentation
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=1024,    # Shorter for speed
    load_in_4bit=True
)

model = FastLanguageModel.get_peft_model(model, r=8)  # Lower rank

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        max_steps=50,        # Few steps
        learning_rate=2e-4,
        output_dir="./quick_test"
    )
)

Pattern 2: Production Quality

# Full setup for best results
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                   # Standard rank
    lora_alpha=16,
    use_gradient_checkpointing="unsloth"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=500,       # More steps
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_steps=100,
        output_dir="./production_model"
    )
)

Pattern 3: Large Model (70B)

# Special settings for very large models
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Fewer targets
    use_gradient_checkpointing="unsloth"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=1,    # Must be 1
        gradient_accumulation_steps=8,    # Compensate
        max_steps=200,
        learning_rate=1e-4,               # Lower LR
        output_dir="./large_model"
    )
)

Troubleshooting

Error: "CUDA out of memory"

Solution:

# Reduce memory usage
batch_size = 1
max_seq_length = 1024
gradient_accumulation_steps = 8
# Or use smaller model

Error: "Model not found"

Solution:

Check model name spelling
Verify internet connection
Try with Hugging Face token: export HF_TOKEN=your_token

Error: "Training loss not decreasing"

Solution:

# Adjust hyperparameters
learning_rate = 5e-4  # Try higher
max_steps = 500       # Train longer
# Or check dataset quality

Best Practices

Always use 4-bit quantization unless you have >80GB VRAM
Start with small models (1B) for experimentation
Monitor GPU usage with nvidia-smi
Save checkpoints every 100 steps
Validate on test set before exporting
Use appropriate LoRA rank: 8 for experiments, 16 for production, 32 for complex tasks

Dataset Format

Unsloth works with Hugging Face datasets. Example format:

{
  "text": "### Instruction: Explain quantum computing\n### Response: Quantum computing uses quantum bits..."
}

Or instruction format:

{
  "instruction": "Explain quantum computing",
  "input": "",
  "output": "Quantum computing uses quantum bits..."
}

Performance Benchmarks

Model	VRAM	Speed (vs standard)	Memory Reduction
Llama 3.2 1B	~2GB	2x faster	80% less
Llama 3.2 3B	~4GB	2x faster	75% less
Llama 3.1 8B	~6GB	2x faster	70% less
Llama 3.3 70B	~40GB	2x faster	75% less

Additional Resources

For more advanced topics, see:

ADVANCED.md - Multi-GPU, custom optimizers, advanced LoRA
DATASETS.md - Dataset preparation and formatting
EXPORT.md - Detailed export options and formats

Version Compatibility

Python: 3.10, 3.11, 3.12 (not 3.13)
CUDA: 11.8 or 12.1+
PyTorch: 2.0+
Transformers: 4.37+

unsloth-finetuning

More from this repository

Unsloth Fine-Tuning

Core Capabilities

Quick Start

1. Load a Model

2. Apply LoRA

3. Configure Training

4. Train

5. Export

Performance Optimization

Memory Optimization

Speed Optimization

Quality Optimization

Hardware Requirements

Minimum (1B models)

Recommended (3B-7B models)

Budget (Small experiments)

Common Patterns

Pattern 1: Quick Prototype

Pattern 2: Production Quality

Pattern 3: Large Model (70B)

Troubleshooting

Error: "CUDA out of memory"

Error: "Model not found"

Error: "Training loss not decreasing"

Best Practices

Dataset Format

Performance Benchmarks

Additional Resources

Version Compatibility

Unsloth Fine-Tuning

Core Capabilities

Quick Start

1. Load a Model

2. Apply LoRA

3. Configure Training

4. Train

5. Export

Performance Optimization

Memory Optimization

Speed Optimization

Quality Optimization

Hardware Requirements

Minimum (1B models)

Recommended (3B-7B models)

Budget (Small experiments)

Common Patterns

Pattern 1: Quick Prototype

Pattern 2: Production Quality

Pattern 3: Large Model (70B)

Troubleshooting

Error: "CUDA out of memory"

Error: "Model not found"

Error: "Training loss not decreasing"

Best Practices

Dataset Format

Performance Benchmarks

Additional Resources

Version Compatibility

More from this repository