| name | lora |
| description | Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base. |
Using LoRA for Fine-tuning
LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing pretrained weights and injecting small trainable matrices into transformer layers. This reduces trainable parameters to ~0.1% of the original model while maintaining performance.
Core Concepts
How LoRA Works
Instead of updating all weights during fine-tuning, LoRA decomposes weight updates into low-rank matrices:
W' = W + BA
Where:
W is the frozen pretrained weight matrix (d × k)
B is a trainable matrix (d × r)
A is a trainable matrix (r × k)
r is the rank, much smaller than d and k
The key insight: weight updates during fine-tuning have low intrinsic rank, so we can represent them efficiently with smaller matrices.
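To make the savings concrete, here is a quick back-of-the-envelope calculation in plain Python, using hypothetical dimensions for a single attention projection (d = k = 4096, as in many 7B-class models):

```python
# Hypothetical projection matrix: d = k = 4096, LoRA rank r = 16
d, k, r = 4096, 4096, 16

full_update = d * k        # parameters touched by a full fine-tuning update
lora_update = r * (d + k)  # parameters in B (d x r) plus A (r x k)

print(full_update)                # 16777216
print(lora_update)                # 131072
print(lora_update / full_update)  # 0.0078125 -> under 1% of the full update
```

The ratio r(d + k) / dk shrinks further as d and k grow, which is why the savings are largest on big models.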
Why Use LoRA
| Aspect | Full Fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |
Basic Setup
Installation
pip install peft transformers accelerate
Minimal Example
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Configuration Parameters
LoraConfig Options
from peft import LoraConfig, TaskType
config = LoraConfig(
    r=16,                          # rank of the update matrices
    lora_alpha=32,                 # scaling factor; effective scale is alpha / r
    target_modules=["q_proj", "v_proj"],  # modules to wrap with LoRA
    lora_dropout=0.05,             # dropout applied to the LoRA branch
    bias="none",                   # "none", "all", or "lora_only"
    task_type=TaskType.CAUSAL_LM,
    modules_to_save=None,          # extra modules to train fully (e.g. a task head)
    layers_to_transform=None,      # restrict LoRA to specific layer indices
    rank_pattern=None,             # per-layer rank overrides (dict)
    alpha_pattern=None,            # per-layer alpha overrides (dict)
    trainable_token_indices=None,  # train only selected embedding rows
    target_parameters=None,        # target nn.Parameter directly (e.g. MoE experts)
    use_rslora=False,              # scale by alpha / sqrt(r) instead of alpha / r
    use_dora=False,                # weight-decomposed LoRA (DoRA)
)
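As a sketch of the per-layer overrides, rank_pattern and alpha_pattern take dicts mapping layer-name patterns to values that override the defaults. The layer names below are assumptions for a Llama-style model; check your model's named_modules() for the real names:

```python
from peft import LoraConfig, TaskType

# Hypothetical: give q_proj in layers 20-31 a higher rank (and matching alpha)
# than the default r=8. Keys are matched against module names as patterns.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    rank_pattern={r"layers\.(2\d|3[01])\..*q_proj": 32},
    alpha_pattern={r"layers\.(2\d|3[01])\..*q_proj": 64},
    task_type=TaskType.CAUSAL_LM,
)
```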
Target Modules by Architecture
# Llama / Mistral / Qwen
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
# GPT-2
target_modules = ["c_attn", "c_proj", "c_fc"]
# BERT / RoBERTa
target_modules = ["query", "key", "value", "dense"]
# Falcon / GPT-NeoX
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
# Phi
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
# Mixture-of-Experts (expert weights stored as nn.Parameter, not nn.Linear)
target_parameters = ["feed_forward.experts.gate_up_proj", "feed_forward.experts.down_proj"]
For MoE models where expert weights are not nn.Linear modules, use target_parameters rather than target_modules. Merge adapters before latency-sensitive inference because materializing expert LoRA updates can add overhead.
Finding Target Modules
import torch

def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    return sorted(linear_modules)

print(find_target_modules(model))
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
Setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
Memory Requirements
| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
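The pattern behind the table can be cross-checked with a rough weight-storage calculation. This sketch covers only the memory for holding the weights themselves; the table's larger figures additionally include optimizer state for the trainable parameters, activations, and framework overhead:

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Memory to hold just the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# 7B model: weight storage alone, before optimizer state and activations.
print(weight_memory_gb(7, 16))  # 14.0 -> LoRA's ~16 GB adds a small training margin
print(weight_memory_gb(7, 4))   # 3.5  -> QLoRA's ~6 GB adds dequant/compute buffers
```

Full fine-tuning is far more expensive than either because gradients and optimizer moments must be kept for every parameter, not just the ~0.1-1% that LoRA trains.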
Training Patterns
With Hugging Face Trainer
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def format_prompt(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_prompt)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
With SFTTrainer (TRL)
from trl import SFTTrainer, SFTConfig
sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_length=1024,
    dataset_text_field="text",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=lora_config,
)
trainer.train()
Classification Task
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],
)
model = get_peft_model(model, lora_config)
Saving and Loading
Save Adapter
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
model.push_to_hub("username/my-lora-adapter")
Load Adapter
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
model.eval()
Switch Between Adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")
model.set_adapter("task1")
output = model.generate(**inputs)
model.set_adapter("task2")
output = model.generate(**inputs)
with model.disable_adapter():
    output = model.generate(**inputs)
Per-Sample Adapter Selection
For inference batches that mix tasks or languages, pass adapter_names aligned with each sample. Use "__base__" for rows that should run the base model without an adapter.
inputs = tokenizer(["Hello", "Bonjour", "Hallo"], return_tensors="pt", padding=True).to(model.device)
adapter_names = ["__base__", "french", "german"]
outputs = model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=32)
Merging Adapters
Merge LoRA weights into the base model for deployment without adapter overhead.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    dtype=torch.bfloat16,
    device_map="cpu",
)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
merged_model.push_to_hub("username/my-merged-model")
Best Practices
- Start with r=16: scale up to 32 or 64 if the model underfits, down to 8 if it overfits or memory is tight.
- Set lora_alpha = 2 × r: a common heuristic; the effective scaling is alpha/r.
- Target all attention and MLP layers: for best results on LLMs, include the gate/up/down projections:
  target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- Use a higher learning rate: 2e-4 is typical for LoRA vs 2e-5 for full fine-tuning.
- Enable gradient checkpointing: reduces memory at the cost of ~20% slower training:
  model.gradient_checkpointing_enable()
- Use QLoRA for large models: essential for fine-tuning 7B+ models on consumer GPUs.
- Keep dropout low: 0.05 is usually sufficient; higher values can hurt performance.
- Save checkpoints frequently: LoRA adapters are small, so save often.
- Evaluate on the base model too: ensure the adapter doesn't degrade base capabilities.
- Consider modules_to_save for task heads: for classification, train the classifier fully:
  modules_to_save=["classifier", "score"]
- Train new tokens selectively: after adding special tokens, prefer trainable_token_indices over training the whole embedding matrix when only a few tokens need adaptation.
- Merge adapters for high-throughput serving: merge_and_unload() removes PEFT runtime overhead, especially for MoE parameter-targeted adapters.
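The alpha heuristic above can be checked numerically. Standard LoRA scales the BA update by lora_alpha / r, while rsLoRA (use_rslora=True) scales by lora_alpha / sqrt(r), which keeps the effective scale from shrinking as r grows. A small pure-Python sketch:

```python
import math

def lora_scale(lora_alpha, r, use_rslora=False):
    """Scaling factor applied to the low-rank update BA before it is added to W."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# With the alpha = 2*r heuristic, the standard scale is constant:
for r in (8, 16, 64):
    print(lora_scale(2 * r, r))  # 2.0 every time

# Under rsLoRA the same alpha yields a larger scale that decays more slowly in r:
print(lora_scale(32, 16, use_rslora=True))  # 8.0
```

This is why raising r without raising lora_alpha effectively weakens a standard LoRA adapter, and why rsLoRA is often preferred at high ranks.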
References
See reference/ for detailed documentation:
advanced-techniques.md - DoRA, rsLoRA, adapter composition, and debugging
External documentation: