| name | transformers |
| description | Loading and using pretrained models with Hugging Face Transformers. Use when working with pretrained models from the Hub, running inference with Pipeline API, fine-tuning models with Trainer, or handling text, vision, audio, and multimodal tasks. |
Using Hugging Face Transformers
Transformers is the model-definition framework for state-of-the-art machine learning across text, vision, audio, and multimodal domains. It provides unified APIs for loading pretrained models, running inference, and fine-tuning.
Core Concepts
The Three Core Classes
Every model in Transformers is built from three core classes: a configuration, the model itself, and a preprocessor (a tokenizer for text, or a processor for audio, vision, and multimodal inputs). The Auto classes resolve the correct concrete class from the checkpoint name:

from transformers import AutoConfig, AutoModel, AutoTokenizer, AutoProcessor

# Configuration: architecture hyperparameters (hidden size, layer count, ...)
config = AutoConfig.from_pretrained("bert-base-uncased")
# Model: pretrained weights, here without a task-specific head
model = AutoModel.from_pretrained("bert-base-uncased")
# Preprocessor for text models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Preprocessor for audio, vision, and multimodal models
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
The from_pretrained Pattern
All loading goes through from_pretrained(), which handles downloading, caching, and, when device_map is given, device placement:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
dtype=torch.bfloat16,
device_map="auto",
)
The examples in this document use the Transformers v5 argument name dtype; on Transformers v4, the equivalent argument is torch_dtype.
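If a script has to run on both major versions, one hedged option is to pick the argument name from the installed version; this is a sketch, assuming only the argument name differs between the two releases:

import torch
import transformers
from transformers import AutoModelForCausalLM

# Pick the dtype argument name based on the installed major version.
dtype_kwarg = "dtype" if int(transformers.__version__.split(".")[0]) >= 5 else "torch_dtype"
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    device_map="auto",
    **{dtype_kwarg: torch.bfloat16},
)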
Auto Classes
Use task-specific Auto classes for the correct model head:
from transformers import (
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForQuestionAnswering,
AutoModelForMaskedLM,
AutoModelForImageClassification,
AutoModelForSpeechSeq2Seq,
)
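For example, pairing a base checkpoint with a sequence-classification head attaches a freshly initialized classifier sized for your label set (the label count here is illustrative):

from transformers import AutoModelForSequenceClassification

# The classification head is randomly initialized and must be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,  # illustrative label count
)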
Pipeline API
The pipeline() function provides high-level inference with minimal code:
Text Tasks
from transformers import pipeline
# Text generation
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
output = generator("The secret to success is", max_new_tokens=50)

# Sentiment analysis (uses the pipeline's default model if none is given)
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York City.")

# Extractive question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")

# Summarization (long_text is your input document)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")
Chat/Conversational
from transformers import pipeline
import torch
pipe = pipeline(
"text-generation",
model="meta-llama/Llama-3.2-3B-Instruct",
dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."},
]
response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])
Vision Tasks
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
Audio Tasks
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
result = transcriber("path/to/audio.mp3")
text = result["text"]
Multimodal Tasks
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
Model Loading
Device Placement
from transformers import AutoModelForCausalLM
import torch
# Automatically spread the model across available devices
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    dtype=torch.bfloat16,
)

# Place the whole model on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="cuda:0",
)

# Manually assign modules to devices
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
Loading from Local Path
# Save a model and tokenizer to disk
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Load them back from the local path
model = AutoModelForCausalLM.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")
Trust Remote Code
Some models require executing custom modeling code from the Hub; only enable this for repositories you trust, since it runs arbitrary Python:
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-2",
trust_remote_code=True,
)
Prefer models with safetensors weights when available. Safetensors avoids pickle execution risks and typically loads faster than legacy .bin checkpoints.
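To enforce this, from_pretrained accepts a use_safetensors flag; loading then fails rather than falling back to pickle-based .bin weights:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    use_safetensors=True,  # error out if only .bin weights are available
)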
Inference Patterns
Text Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
dtype=torch.bfloat16,
device_map="auto",
)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.1,
)
Chat Templates
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
input_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
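Decoder-only models echo the prompt tokens in the output, so to keep only the assistant's reply you can slice off the prompt before decoding:

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)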
Getting Embeddings
from transformers import AutoModel, AutoTokenizer
import torch
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
def get_embeddings(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over non-padding tokens
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state
    mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    sum_embeddings = (embeddings * mask_expanded).sum(1)
    sum_mask = mask_expanded.sum(1).clamp(min=1e-9)
    return sum_embeddings / sum_mask
embeddings = get_embeddings(["Hello world", "How are you?"])
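A common follow-up is to L2-normalize the pooled vectors and compare them with cosine similarity:

import torch.nn.functional as F

normalized = F.normalize(embeddings, p=2, dim=1)
similarity = normalized @ normalized.T  # pairwise cosine similarities
print(similarity[0, 1].item())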
Classification
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("I love this movie!", return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
predictions = torch.softmax(outputs.logits, dim=-1)
labels = model.config.id2label
for idx, prob in enumerate(predictions[0]):
print(f"{labels[idx]}: {prob:.4f}")
Fine-tuning with Trainer
Basic Fine-tuning
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments,
)
from datasets import load_dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=2,
)
def tokenize(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized = dataset.map(tokenize, batched=True)
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
logging_steps=100,
save_strategy="epoch",
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
)
trainer.train()
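After training, you can evaluate on the held-out split and save the final artifacts (the output path is illustrative):

metrics = trainer.evaluate()
print(metrics)
trainer.save_model("./final_model")         # saves model weights and config
tokenizer.save_pretrained("./final_model")  # keep the tokenizer alongside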
Pushing to Hub
model.push_to_hub("my-username/my-fine-tuned-model")
tokenizer.push_to_hub("my-username/my-fine-tuned-model")
trainer.push_to_hub()
See reference/fine-tuning.md for advanced patterns including LoRA, custom data collators, and evaluation metrics.
Working with Modalities
Use AutoProcessor or modality-specific processors for non-text models. Processors handle images, audio, video, and multimodal chat formatting before tensors are sent to the model.
| Modality | Processor | Model Class |
|---|---|---|
| Vision | AutoImageProcessor | AutoModelForImageClassification, AutoModelForObjectDetection |
| Audio | AutoProcessor | AutoModelForSpeechSeq2Seq, AutoModelForAudioClassification |
| Vision-language | AutoProcessor | AutoModelForVision2Seq, task-specific VLM classes |
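As an illustration of the vision row, here is a minimal manual image-classification sketch (the image path is a placeholder):

from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("path/to/image.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])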
Memory and Performance
Quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 4-bit NF4 quantization (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
    dtype="auto",
)

# 8-bit quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
    dtype="auto",
)
Flash Attention
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
torch.compile
model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.bfloat16)
model = torch.compile(model, mode="reduce-overhead")
Batched Inference
texts = ["First prompt", "Second prompt", "Third prompt"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=50)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
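Decoder-only models should generally be padded on the left for batched generation, and many tokenizers ship without a pad token, so configure both before tokenizing the batch:

tokenizer.padding_side = "left"  # pad prompts on the left for causal LMs
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token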
Best Practices
- Use bfloat16 over float16: Better numerical stability on modern GPUs
- Use dtype="auto" when unsure: It follows model metadata or checkpoint weight dtype without wasting memory
- Set pad token for generation: tokenizer.pad_token = tokenizer.eos_token
- Use device_map="auto": Let Accelerate handle device placement
- Prefer safetensors checkpoints: Avoid pickle-based loading when possible
- Enable Flash Attention: Significant speedup for long sequences
- Batch when possible: Amortize fixed costs across multiple inputs
- Use pipeline for quick prototyping: Switch to manual control for production
- Cache models locally: Set the HF_HOME environment variable to control where models are cached (see the snippet after this list)
- Check model license: Verify usage rights before deployment
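For the cache-location tip above, HF_HOME can be set in the shell or in Python before importing transformers (the directory is illustrative):

import os

os.environ["HF_HOME"] = "/data/hf-cache"  # illustrative path; must be set before importing transformers

from transformers import AutoModel  # downloads now cache under /data/hf-cache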
References
See reference/ for detailed documentation:
fine-tuning.md - Advanced fine-tuning patterns with LoRA, PEFT, and custom training
External documentation:
Transformers documentation - https://huggingface.co/docs/transformers