| name | lora |
| description | Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base. |
Using LoRA for Fine-tuning
LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing pretrained weights and injecting small trainable matrices into transformer layers. This reduces trainable parameters to ~0.1% of the original model while maintaining performance.
Core Concepts
How LoRA Works
Instead of updating all weights during fine-tuning, LoRA decomposes weight updates into low-rank matrices:
W' = W + BA
Where:
W is the frozen pretrained weight matrix (d × k)
B is a trainable matrix (d × r)
A is a trainable matrix (r × k)
r is the rank, much smaller than d and k
The key insight: weight updates during fine-tuning have low intrinsic rank, so we can represent them efficiently with smaller matrices.
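To make the savings concrete, here is a quick back-of-the-envelope calculation in plain Python, using hypothetical dimensions for a single attention projection (d = k = 4096, as in many 7B-class models):

```python
# Hypothetical projection matrix: d = k = 4096, LoRA rank r = 16
d, k, r = 4096, 4096, 16

full_update = d * k        # parameters touched by a full fine-tuning update
lora_update = r * (d + k)  # parameters in B (d x r) plus A (r x k)

print(full_update)                # 16777216
print(lora_update)                # 131072
print(lora_update / full_update)  # 0.0078125 -> under 1% of the full update
```

The ratio r(d + k) / dk shrinks further as d and k grow, which is why the savings are largest on big models.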
Why Use LoRA
| Aspect | Full Fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |
Basic Setup
Installation
pip install peft transformers accelerate
Minimal Example
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Configuration Parameters
LoraConfig Options
from peft import LoraConfig, TaskType
config = LoraConfig(
    r=16,                          # rank of the update matrices
    lora_alpha=32,                 # scaling factor; effective scale is alpha / r
    target_modules=["q_proj", "v_proj"],  # modules to wrap with LoRA
    lora_dropout=0.05,             # dropout applied to the LoRA branch
    bias="none",                   # "none", "all", or "lora_only"
    task_type=TaskType.CAUSAL_LM,
    modules_to_save=None,          # extra modules to train fully (e.g. a task head)
    layers_to_transform=None,      # restrict LoRA to specific layer indices
    rank_pattern=None,             # per-layer rank overrides (dict)
    alpha_pattern=None,            # per-layer alpha overrides (dict)
    trainable_token_indices=None,  # train only selected embedding rows
    target_parameters=None,        # target nn.Parameter directly (e.g. MoE experts)
    use_rslora=False,              # scale by alpha / sqrt(r) instead of alpha / r
    use_dora=False,                # weight-decomposed LoRA (DoRA)
)
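As a sketch of the per-layer overrides, rank_pattern and alpha_pattern take dicts mapping layer-name patterns to values that override the defaults. The layer names below are assumptions for a Llama-style model; check your model's named_modules() for the real names:

```python
from peft import LoraConfig, TaskType

# Hypothetical: give q_proj in layers 20-31 a higher rank (and matching alpha)
# than the default r=8. Keys are matched against module names as patterns.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    rank_pattern={r"layers\.(2\d|3[01])\..*q_proj": 32},
    alpha_pattern={r"layers\.(2\d|3[01])\..*q_proj": 64},
    task_type=TaskType.CAUSAL_LM,
)
```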
Target Modules by Architecture
# Llama / Mistral / Qwen
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
# GPT-2
target_modules = ["c_attn", "c_proj", "c_fc"]
# BERT / RoBERTa
target_modules = ["query", "key", "value", "dense"]
# Falcon / GPT-NeoX
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
# Phi
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
# Mixture-of-Experts (expert weights stored as nn.Parameter, not nn.Linear)
target_parameters = ["feed_forward.experts.gate_up_proj", "feed_forward.experts.down_proj"]
For MoE models where expert weights are not nn.Linear modules, use target_parameters rather than target_modules. Merge adapters before latency-sensitive inference because materializing expert LoRA updates can add overhead.
Finding Target Modules
import torch

def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    return sorted(linear_modules)

print(find_target_modules(model))
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
Setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
Memory Requirements
| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
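The pattern behind the table can be cross-checked with a rough weight-storage calculation. This sketch covers only the memory for holding the weights themselves; the table's larger figures additionally include optimizer state for the trainable parameters, activations, and framework overhead:

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Memory to hold just the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# 7B model: weight storage alone, before optimizer state and activations.
print(weight_memory_gb(7, 16))  # 14.0 -> LoRA's ~16 GB adds a small training margin
print(weight_memory_gb(7, 4))   # 3.5  -> QLoRA's ~6 GB adds dequant/compute buffers
```

Full fine-tuning is far more expensive than either because gradients and optimizer moments must be kept for every parameter, not just the ~0.1-1% that LoRA trains.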
Training Patterns
With Hugging Face Trainer
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def format_prompt(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_prompt)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
With SFTTrainer (TRL)
from trl import SFTTrainer, SFTConfig
sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_length=1024,
    dataset_text_field="text",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=lora_config,
)
trainer.train()
Classification Task
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],
)
model = get_peft_model(model, lora_config)
Saving and Loading
Save Adapter
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
model.push_to_hub("username/my-lora-adapter")
Load Adapter
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
model.eval()
Switch Between Adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")
model.set_adapter("task1")
output = model.generate(**inputs)
model.set_adapter("task2")
output = model.generate(**inputs)
with model.disable_adapter():
    output = model.generate(**inputs)
Per-Sample Adapter Selection
For inference batches that mix tasks or languages, pass adapter_names aligned with each sample. Use "__base__" for rows that should run the base model without an adapter.
inputs = tokenizer(["Hello", "Bonjour", "Hallo"], return_tensors="pt", padding=True).to(model.device)
adapter_names = ["__base__", "french", "german"]
outputs = model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=32)
Merging Adapters
Merge LoRA weights into the base model for deployment without adapter overhead.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    dtype=torch.bfloat16,
    device_map="cpu",
)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
merged_model.push_to_hub("username/my-merged-model")
Best Practices
- Start with r=16: scale up to 32 or 64 if the model underfits, down to 8 if it overfits or memory is tight.
- Set lora_alpha = 2 × r: a common heuristic; the effective scaling is alpha/r.
- Target all attention and MLP layers: for best results on LLMs, include the gate/up/down projections:
  target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- Use a higher learning rate: 2e-4 is typical for LoRA vs 2e-5 for full fine-tuning.
- Enable gradient checkpointing: reduces memory at the cost of ~20% slower training:
  model.gradient_checkpointing_enable()
- Use QLoRA for large models: essential for fine-tuning 7B+ models on consumer GPUs.
- Keep dropout low: 0.05 is usually sufficient; higher values can hurt performance.
- Save checkpoints frequently: LoRA adapters are small, so save often.
- Evaluate on the base model too: ensure the adapter doesn't degrade base capabilities.
- Consider modules_to_save for task heads: for classification, train the classifier fully:
  modules_to_save=["classifier", "score"]
- Train new tokens selectively: after adding special tokens, prefer trainable_token_indices over training the whole embedding matrix when only a few tokens need adaptation.
- Merge adapters for high-throughput serving: merge_and_unload() removes PEFT runtime overhead, especially for MoE parameter-targeted adapters.
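The alpha heuristic above can be checked numerically. Standard LoRA scales the BA update by lora_alpha / r, while rsLoRA (use_rslora=True) scales by lora_alpha / sqrt(r), which keeps the effective scale from shrinking as r grows. A small pure-Python sketch:

```python
import math

def lora_scale(lora_alpha, r, use_rslora=False):
    """Scaling factor applied to the low-rank update BA before it is added to W."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# With the alpha = 2*r heuristic, the standard scale is constant:
for r in (8, 16, 64):
    print(lora_scale(2 * r, r))  # 2.0 every time

# Under rsLoRA the same alpha yields a larger scale that decays more slowly in r:
print(lora_scale(32, 16, use_rslora=True))  # 8.0
```

This is why raising r without raising lora_alpha effectively weakens a standard LoRA adapter, and why rsLoRA is often preferred at high ranks.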
References
See reference/ for detailed documentation:
advanced-techniques.md - DoRA, rsLoRA, adapter composition, and debugging
External documentation: