| name | text-to-image-search |
| description | Use when user needs to search images using natural language descriptions. Triggers on: text to image, describe and find, natural language image search, image caption search, find image by description, describe to find. |
Text-to-Image Search
Search images using natural language descriptions — find visuals by describing what you're looking for.
When to Activate
Activate this skill when:
- User wants to find images by describing them in natural language
- User mentions "find image of", "search for pictures of", "describe and find"
- User has complex visual queries ("red car turning right at an intersection")
- User's queries are descriptive sentences, not just keywords
Do NOT activate when:
- User has an image to find similar ones → use
image-search
- User wants Q&A on documents with images → use
multimodal-rag
- User needs video search → use
video-search
Interactive Flow
Step 1: Assess Query Complexity
"What type of text queries will users make?"
A) Simple queries ("cat", "sunset beach", "red car")
- CLIP direct encoding works well
- Fast, no extra API costs
B) Complex queries ("a red car turning right at an intersection at night")
- May need VLM-generated captions
- Better semantic understanding
C) Domain-specific ("tumor in left lung lobe", "fault line in seismic data")
- May need specialized models or VLM descriptions
- Domain vocabulary matters
Which describes your queries? (A/B/C)
Step 2: Choose Architecture
"Based on query complexity, here are your options:"
| Option | Approach | Pros | Cons |
|---|
| A: CLIP Direct | Text → CLIP → Search | Fast, free | Weak on complex queries |
| B: VLM Captions | Image → VLM → Caption → Text embedding | Better semantics | Slow indexing, API cost |
Step 3: Confirm Configuration
"Based on your requirements:
- Architecture: [CLIP Direct / VLM Captions]
- Model: [clip-ViT-B-32 / BGE + GPT-4o]
- Index: AUTOINDEX with COSINE
Proceed? (yes / adjust [what])"
Core Concepts
Mental Model: Two Approaches
Think of text-to-image search as two different libraries:
Option A: CLIP (Shared Language)
- Both images and text are translated to the same "language" (vector space)
- Works because CLIP was trained on image-text pairs
- Like having a bilingual librarian who speaks both "image" and "text"
Option B: VLM Captions (Description Matching)
- Each image gets a detailed text description
- Search matches query to descriptions
- Like having someone describe every image in words first
┌─────────────────────────────────────────────────────────────┐
│ Option A: CLIP Direct │
│ │
│ Indexing: Search: │
│ Image → CLIP → Vector Text → CLIP → Vector │
│ │ │
│ Same vector space! ────────────→│ │
│ ▼ │
│ Find similar │
│ vectors │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Option B: VLM + Text Embedding │
│ │
│ Indexing: │
│ Image → VLM → "A red car..." → BGE → Vector │
│ │
│ Search: │
│ Query: "red vehicle" → BGE → Vector → Find similar │
│ │
│ Matching happens in text embedding space │
└─────────────────────────────────────────────────────────────┘
When to Use Each
| Scenario | Best Option | Why |
|---|
| Simple queries, high volume | CLIP Direct | Fast, no API cost |
| Complex descriptions | VLM Captions | Better understanding |
| Domain-specific (medical, legal) | VLM Captions | Can prompt for domain terms |
| Budget constrained | CLIP Direct | Free |
| Quality critical | VLM Captions | More accurate |
Implementation
Option A: CLIP Direct
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
from PIL import Image
class CLIPTextToImageSearch:
def __init__(self, uri: str = "./milvus.db"):
self.client = MilvusClient(uri=uri)
self.model = SentenceTransformer('clip-ViT-B-32')
self.dim = 512
self.collection_name = "clip_image_search"
self._init_collection()
def _init_collection(self):
if self.client.has_collection(self.collection_name):
return
schema = self.client.create_schema()
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
schema.add_field("image_path", DataType.VARCHAR, max_length=512)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=self.dim)
index_params = self.client.prepare_index_params()
index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
self.client.create_collection(
collection_name=self.collection_name,
schema=schema,
index_params=index_params
)
def add_images(self, image_paths: list):
"""Index images with CLIP embeddings."""
images = [Image.open(p).convert('RGB') for p in image_paths]
embeddings = self.model.encode(images).tolist()
data = [{"image_path": path, "embedding": emb}
for path, emb in zip(image_paths, embeddings)]
self.client.insert(collection_name=self.collection_name, data=data)
def search(self, text_query: str, limit: int = 10):
"""Search images with text description."""
embedding = self.model.encode(text_query).tolist()
results = self.client.search(
collection_name=self.collection_name,
data=[embedding],
limit=limit,
output_fields=["image_path"]
)
return [{"path": hit["entity"]["image_path"], "score": hit["distance"]}
for hit in results[0]]
search = CLIPTextToImageSearch()
search.add_images(["beach.jpg", "city.jpg", "forest.jpg"])
results = search.search("sunset over the ocean")
Option B: VLM Captions
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import base64
class VLMTextToImageSearch:
def __init__(self, uri: str = "./milvus.db"):
self.client = MilvusClient(uri=uri)
self.text_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
self.openai = OpenAI()
self.dim = 1024
self.collection_name = "vlm_image_search"
self._init_collection()
def _init_collection(self):
if self.client.has_collection(self.collection_name):
return
schema = self.client.create_schema()
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
schema.add_field("image_path", DataType.VARCHAR, max_length=512)
schema.add_field("caption", DataType.VARCHAR, max_length=4096)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=self.dim)
index_params = self.client.prepare_index_params()
index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
self.client.create_collection(
collection_name=self.collection_name,
schema=schema,
index_params=index_params
)
def _generate_caption(self, image_path: str) -> str:
"""Generate detailed caption using VLM."""
with open(image_path, "rb") as f:
b64_image = base64.standard_b64encode(f.read()).decode()
response = self.openai.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image in detail. Include objects, actions, colors, setting, and any text visible."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}}
]
}],
max_tokens=500
)
return response.choices[0].message.content
def add_images(self, image_paths: list):
"""Index images with VLM-generated captions."""
data = []
for path in image_paths:
caption = self._generate_caption(path)
embedding = self.text_model.encode(caption).tolist()
data.append({
"image_path": path,
"caption": caption,
"embedding": embedding
})
self.client.insert(collection_name=self.collection_name, data=data)
def search(self, text_query: str, limit: int = 10):
"""Search images with text description."""
embedding = self.text_model.encode(text_query).tolist()
results = self.client.search(
collection_name=self.collection_name,
data=[embedding],
limit=limit,
output_fields=["image_path", "caption"]
)
return [{
"path": hit["entity"]["image_path"],
"caption": hit["entity"]["caption"],
"score": hit["distance"]
} for hit in results[0]]
search = VLMTextToImageSearch()
search.add_images(["traffic.jpg"])
results = search.search("a red car turning right at an intersection")
Comparison Table
| Aspect | CLIP Direct | VLM Captions |
|---|
| Indexing speed | Fast (ms per image) | Slow (seconds per image) |
| Query speed | Fast | Fast |
| API cost | Free | ~$0.01 per image |
| Simple queries | ★★★★ | ★★★★★ |
| Complex queries | ★★★ | ★★★★★ |
| Domain-specific | ★★ | ★★★★ |
| Storage | 512d vector only | 1024d vector + text |
Common Pitfalls
❌ Pitfall 1: Expecting CLIP to Understand Complex Queries
Problem: "red car turning right at night" returns random cars
Why: CLIP wasn't trained on such specific scene descriptions
Fix: Use VLM captions for complex queries
❌ Pitfall 2: VLM Captions Too Generic
Problem: All captions say "This is an image of..."
Why: Default prompts generate generic descriptions
Fix: Use specific prompts
prompt = """Describe this image with:
1. Main objects and their colors
2. Actions or movements
3. Setting/environment
4. Time of day if visible
5. Any text in the image"""
❌ Pitfall 3: Mixing Vector Spaces
Problem: Search returns nothing
Why: Used BGE to embed query but CLIP for images
Fix: Use same model for query as for indexing
❌ Pitfall 4: High VLM Costs
Problem: $100+ API bill for 10K images
Why: Using GPT-4o for everything
Fix: Use cheaper models or local VLMs
When to Level Up
| Need | Upgrade To |
|---|
| Also search by image | Add image-search capability |
| Q&A on image content | multimodal-rag |
| Filter by metadata | Add filtered-search pattern |
| Video content | video-search |
References
- VLM model comparison: See OpenAI, Anthropic, Google docs
- CLIP variants:
image-search/references/image-embeddings.md
- Batch processing:
core:ray