| name | blip-2-vision-language |
| description | Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| tags | ["Multimodal","Vision-Language","Image Captioning","VQA","Zero-Shot"] |
| dependencies | ["transformers>=4.30.0","torch>=1.10.0","Pillow"] |
BLIP-2: Vision-Language Pre-training
Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
When to use BLIP-2
Use BLIP-2 when:
- Need high-quality image captioning with natural descriptions
- Building visual question answering (VQA) systems
- Require zero-shot image-text understanding without task-specific training
- Want to leverage LLM reasoning for visual tasks
- Building multimodal conversational AI
- Need image-text retrieval or matching
Key features:
- Q-Former architecture: Lightweight query transformer bridges vision and language
- Frozen backbone efficiency: No need to fine-tune large vision/language models
- Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- Zero-shot capabilities: Strong performance without task-specific training
- Efficient training: Only trains Q-Former (~188M parameters)
- State-of-the-art results: Beats larger models on VQA benchmarks
Use alternatives instead:
- LLaVA: For instruction-following multimodal chat
- InstructBLIP: For improved instruction-following (BLIP-2 successor)
- GPT-4V/Claude 3: For production multimodal chat (proprietary)
- CLIP: For simple image-text similarity without generation
- Flamingo: For few-shot visual learning
Quick start
Installation
pip install transformers accelerate torch Pillow
pip install salesforce-lavis
Basic image captioning
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
device_map="auto"
)
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
Visual question answering
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
Using LAVIS library
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_opt",
model_type="pretrain_opt2.7b",
is_eval=True,
device=device
)
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)
caption = model.generate({"image": image})
print(caption)
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
Core concepts
Architecture overview
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Q-Former │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Learned Queries (32 queries × 768 dim) │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Cross-Attention with Image Features │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Self-Attention Layers (Transformer) │ │
│ └────────────────────────┬────────────────────────────┘ │
└───────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ Frozen Vision Encoder │ Frozen LLM │
│ (ViT-G/14 from EVA-CLIP) │ (OPT or FlanT5) │
└─────────────────────────────────────────────────────────────┘
Model variants
| Model | LLM Backend | Size | Use Case |
|---|
blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
Q-Former components
| Component | Description | Parameters |
|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
Advanced usage
Batch processing
from PIL import Image
import torch
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
"What is shown in this image?",
"Describe the scene.",
"What colors are prominent?",
"Is there a person in this image?"
]
inputs = processor(
images=images,
text=questions,
return_tensors="pt",
padding=True
).to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
print(f"Q: {q}\nA: {a}\n")
Controlling generation
generated_ids = model.generate(
**inputs,
max_new_tokens=100,
min_length=20,
num_beams=5,
no_repeat_ngram_size=2,
top_p=0.9,
temperature=0.7,
do_sample=True,
)
generated_ids = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
do_sample=False,
)
Memory optimization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-6.7b",
quantization_config=quantization_config,
device_map="auto"
)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-flan-t5-xxl",
quantization_config=quantization_config,
device_map="auto"
)
Image-text matching
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_image_text_matching",
model_type="pretrain",
is_eval=True,
device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
Feature extraction
from lavis.models import load_model_and_preprocess
model, vis_processors, _ = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds
image_features = features.image_embeds_proj
Common workflows
Workflow 1: Image captioning pipeline
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from pathlib import Path
class ImageCaptioner:
def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
def caption(self, image_path: str, prompt: str = None) -> str:
image = Image.open(image_path).convert("RGB")
if prompt:
inputs = self.processor(images=image, text=prompt, return_tensors="pt")
else:
inputs = self.processor(images=image, return_tensors="pt")
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def caption_batch(self, image_paths: list, prompt: str = None) -> list:
images = [Image.open(p).convert("RGB") for p in image_paths]
if prompt:
inputs = self.processor(
images=images,
text=[prompt] * len(images),
return_tensors="pt",
padding=True
)
else:
inputs = self.processor(images=images, return_tensors="pt", padding=True)
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(**inputs, max_new_tokens=50)
return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
captioner = ImageCaptioner()
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
print(f"Image {i+1}: {cap}")
Workflow 2: Visual Q&A system
class VisualQA:
def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
self.current_image = None
self.current_inputs = None
def set_image(self, image_path: str):
"""Load image for multiple questions."""
self.current_image = Image.open(image_path).convert("RGB")
def ask(self, question: str) -> str:
"""Ask a question about the current image."""
if self.current_image is None:
raise ValueError("No image set. Call set_image() first.")
prompt = f"Question: {question} Answer:"
inputs = self.processor(
images=self.current_image,
text=prompt,
return_tensors="pt"
).to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def ask_multiple(self, questions: list) -> dict:
"""Ask multiple questions about current image."""
return {q: self.ask(q) for q in questions}
vqa = VisualQA()
vqa.set_image("scene.jpg")
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))
results = vqa.ask_multiple([
"What is the main subject?",
"What colors are dominant?",
"Is this indoors or outdoors?"
])
Workflow 3: Image search/retrieval
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess
class ImageSearchEngine:
def __init__(self):
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=self.device
)
self.image_features = []
self.image_paths = []
def index_images(self, image_paths: list):
"""Build index from images."""
self.image_paths = image_paths
for path in image_paths:
image = Image.open(path).convert("RGB")
image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
with torch.no_grad():
features = self.model.extract_features({"image": image}, mode="image")
self.image_features.append(
features.image_embeds_proj.mean(dim=1).cpu().numpy()
)
self.image_features = np.vstack(self.image_features)
def search(self, query: str, top_k: int = 5) -> list:
"""Search images by text query."""
text = self.txt_processors["eval"](query)
text_input = {"text_input": [text]}
with torch.no_grad():
text_features = self.model.extract_features(text_input, mode="text")
text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
similarities = np.dot(self.image_features, text_embeds.T).squeeze()
top_indices = np.argsort(similarities)[::-1][:top_k]
return [(self.image_paths[i], similarities[i]) for i in top_indices]
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
print(f"{path}: {score:.3f}")
Output format
Generation output
generated_ids = model.generate(**inputs, max_new_tokens=50)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
Feature extraction output
features = model.extract_features({"image": image}, mode="image")
features.image_embeds
features.image_embeds_proj
features.text_embeds
features.text_embeds_proj
Performance optimization
GPU memory requirements
| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
Speed optimization
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2",
device_map="auto"
)
model = torch.compile(model)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
Common issues
| Issue | Solution |
|---|
| CUDA OOM | Use INT8/INT4 quantization, smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase question, provide context |
References
Resources