text-to-image-search
| name | text-to-image-search |
| description | Use when user needs to search images using natural language descriptions. Triggers on: text to image, describe and find, natural language image search, image caption search, find image by description, describe to find. |
Search images using natural language descriptions — find visuals by describing what you're looking for.
Activate this skill when:
Do NOT activate when:
image-search, multimodal-rag, video-search
"What type of text queries will users make?"
A) Simple queries ("cat", "sunset beach", "red car")
B) Complex queries ("a red car turning right at an intersection at night")
C) Domain-specific ("tumor in left lung lobe", "fault line in seismic data")
Which describes your queries? (A/B/C)
"Based on query complexity, here are your options:"
| Option | Approach | Pros | Cons |
|---|---|---|---|
| A: CLIP Direct | Text → CLIP → Search | Fast, free | Weak on complex queries |
| B: VLM Captions | Image → VLM → Caption → Text embedding | Better semantics | Slow indexing, API cost |
"Based on your requirements:
Proceed? (yes / adjust [what])"
Think of text-to-image search as two different libraries:
Option A: CLIP (Shared Language)
Option B: VLM Captions (Description Matching)
┌─────────────────────────────────────────────────────────────┐
│  Option A: CLIP Direct                                      │
│                                                             │
│  Indexing:                     Search:                      │
│  Image → CLIP → Vector         Text → CLIP → Vector         │
│                                                │            │
│                Same vector space! ────────────→│            │
│                                                ▼            │
│                                          Find similar       │
│                                             vectors         │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  Option B: VLM + Text Embedding                             │
│                                                             │
│  Indexing:                                                  │
│  Image → VLM → "A red car..." → BGE → Vector                │
│                                                             │
│  Search:                                                    │
│  Query: "red vehicle" → BGE → Vector → Find similar         │
│                                                             │
│  Matching happens in text embedding space                   │
└─────────────────────────────────────────────────────────────┘
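Both options end with the same step: rank the stored vectors by similarity to the query vector. The sketch below illustrates that step with the standard library only. The 3-dimensional vectors and the names `toy_index` and `toy_query` are invented for illustration; real CLIP or BGE vectors have hundreds of dimensions, and in practice Milvus performs this ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, indexed, limit=10):
    """Return (path, score) pairs, most similar first."""
    scored = [(path, cosine_similarity(query_vec, vec))
              for path, vec in indexed.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy vectors, invented for illustration only.
toy_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.1],
    "forest.jpg": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # stands in for an embedded text query
top_hits = rank_by_similarity(toy_query, toy_index, limit=2)
```

The only difference between the two options is which model produces the vectors: CLIP for both images and text in Option A, a text embedding model for captions and queries in Option B.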
| Scenario | Best Option | Why |
|---|---|---|
| Simple queries, high volume | CLIP Direct | Fast, no API cost |
| Complex descriptions | VLM Captions | Better understanding |
| Domain-specific (medical, legal) | VLM Captions | Can prompt for domain terms |
| Budget constrained | CLIP Direct | Free |
| Quality critical | VLM Captions | More accurate |
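The decision table can be encoded as a small helper. This is a sketch, not part of the skill: the function name `recommend_option` and its parameters are hypothetical, and letting a budget constraint override the other rows is an assumption, since the table does not say which row wins when two apply.

```python
def recommend_option(query_type: str,
                     budget_constrained: bool = False,
                     quality_critical: bool = False) -> str:
    """Map the scenario table to an option name.

    query_type is one of "simple", "complex", or "domain",
    matching answers A, B, and C of the query-type question.
    """
    if query_type not in ("simple", "complex", "domain"):
        raise ValueError(f"Unknown query type: {query_type!r}")
    # Assumption: a hard budget limit takes precedence over everything else.
    if budget_constrained:
        return "CLIP Direct"
    if quality_critical or query_type in ("complex", "domain"):
        return "VLM Captions"
    return "CLIP Direct"


choice = recommend_option("complex")
```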
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
from PIL import Image


class CLIPTextToImageSearch:
    def __init__(self, uri: str = "./milvus.db"):
        # A local file path as the URI runs Milvus Lite, so no server is needed.
        self.client = MilvusClient(uri=uri)
        self.model = SentenceTransformer('clip-ViT-B-32')
        self.dim = 512  # embedding size of clip-ViT-B-32
        self.collection_name = "clip_image_search"
        self._init_collection()

    def _init_collection(self):
        # Create the collection only once; reuse it on later runs.
        if self.client.has_collection(self.collection_name):
            return
        schema = self.client.create_schema()
        schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
        schema.add_field("image_path", DataType.VARCHAR, max_length=512)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=self.dim)
        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def add_images(self, image_paths: list):
        """Index images with CLIP embeddings."""
        if not image_paths:
            return
        images = []
        for p in image_paths:
            # Close each file promptly; convert() returns an in-memory copy.
            with Image.open(p) as img:
                images.append(img.convert('RGB'))
        embeddings = self.model.encode(images).tolist()
        data = [{"image_path": path, "embedding": emb}
                for path, emb in zip(image_paths, embeddings)]
        self.client.insert(collection_name=self.collection_name, data=data)

    def search(self, text_query: str, limit: int = 10):
        """Search images with text description."""
        # CLIP encodes text into the same vector space as images
        embedding = self.model.encode(text_query).tolist()
        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=limit,
            output_fields=["image_path"]
        )
        # With the COSINE metric, a higher "distance" means a closer match.
        return [{"path": hit["entity"]["image_path"], "score": hit["distance"]}
                for hit in results[0]]


# Usage
search = CLIPTextToImageSearch()
search.add_images(["beach.jpg", "city.jpg", "forest.jpg"])
results = search.search("sunset over the ocean")
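`add_images` above opens and encodes every image in a single call, which can exhaust memory on a large collection. One way around this is to feed it fixed-size chunks. The helper name `batched_paths` and the batch size of 64 are assumptions chosen for illustration, not values from the skill.

```python
from typing import Iterator, Sequence


def batched_paths(paths: Sequence[str], batch_size: int = 64) -> Iterator[list]:
    """Yield consecutive slices of paths, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names, used only to show the chunking.
example_batches = list(batched_paths(["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
                                     batch_size=2))
```

With a `CLIPTextToImageSearch` instance named `search`, indexing would then look like `for batch in batched_paths(all_paths): search.add_images(batch)`.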
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import base64


class VLMTextToImageSearch:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        self.text_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.openai = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.dim = 1024  # embedding size of bge-large-en-v1.5
        self.collection_name = "vlm_image_search"
        self._init_collection()

    def _init_collection(self):
        # Create the collection only once; reuse it on later runs.
        if self.client.has_collection(self.collection_name):
            return
        schema = self.client.create_schema()
        schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
        schema.add_field("image_path", DataType.VARCHAR, max_length=512)
        schema.add_field("caption", DataType.VARCHAR, max_length=4096)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=self.dim)
        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def _generate_caption(self, image_path: str) -> str:
        """Generate detailed caption using VLM."""
        with open(image_path, "rb") as f:
            b64_image = base64.standard_b64encode(f.read()).decode()
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. Include objects, actions, colors, setting, and any text visible."},
                    # The data URL assumes JPEG input; adjust the MIME type for other formats.
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}}
                ]
            }],
            max_tokens=500
        )
        return response.choices[0].message.content

    def add_images(self, image_paths: list):
        """Index images with VLM-generated captions."""
        data = []
        for path in image_paths:
            # One VLM call per image: this is the slow, paid step.
            caption = self._generate_caption(path)
            embedding = self.text_model.encode(caption).tolist()
            data.append({
                "image_path": path,
                "caption": caption,
                "embedding": embedding
            })
        if data:
            self.client.insert(collection_name=self.collection_name, data=data)

    def search(self, text_query: str, limit: int = 10):
        """Search images with text description."""
        # The query is embedded with the same text model as the captions.
        embedding = self.text_model.encode(text_query).tolist()
        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=limit,
            output_fields=["image_path", "caption"]
        )
        return [{
            "path": hit["entity"]["image_path"],
            "caption": hit["entity"]["caption"],
            "score": hit["distance"]
        } for hit in results[0]]


# Usage
search = VLMTextToImageSearch()
search.add_images(["traffic.jpg"])
results = search.search("a red car turning right at an intersection")
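The captioning call above labels every file as `image/jpeg` in its data URL, which mislabels PNG or WebP inputs. A possible refinement is to derive the MIME type from the file extension with the standard `mimetypes` module. The helper name `build_image_data_url` is hypothetical, and guessing from the extension (rather than inspecting the file contents) is a simplifying assumption.

```python
import base64
import mimetypes


def build_image_data_url(filename: str, raw_bytes: bytes) -> str:
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Three placeholder bytes stand in for real image data.
example_url = build_image_data_url("chart.png", b"abc")
```

The returned string could replace the hard-coded `data:image/jpeg;base64,...` value passed as the `image_url`.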
| Aspect | CLIP Direct | VLM Captions |
|---|---|---|
| Indexing speed | Fast (ms per image) | Slow (seconds per image) |
| Query speed | Fast | Fast |
| API cost | Free | ~$0.01 per image |
| Simple queries | ★★★★ | ★★★★★ |
| Complex queries | ★★★ | ★★★★★ |
| Domain-specific | ★★ | ★★★★ |
| Storage | 512d vector only | 1024d vector + text |
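The API cost row scales linearly with collection size, so a rough budget is a single multiplication. The sketch below uses the table's approximate $0.01 per image as its default; the function name `estimate_caption_cost` is hypothetical, and actual pricing depends on the provider, the model, and image resolution.

```python
def estimate_caption_cost(num_images: int, cost_per_image: float = 0.01) -> float:
    """Estimated one-time captioning cost in dollars for Option B."""
    if num_images < 0 or cost_per_image < 0:
        raise ValueError("num_images and cost_per_image must be non-negative")
    return num_images * cost_per_image


# 10,000 images at about $0.01 each
cost_10k = estimate_caption_cost(10_000)
```

Option A has no such term: CLIP runs locally, so its cost is compute time only.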
Problem: "red car turning right at night" returns random cars
Why: CLIP captures the overall subject of a description well but is weak on compositional detail such as actions, spatial relations, and lighting
Fix: Use VLM captions for complex queries
Problem: All captions say "This is an image of..."
Why: Default prompts generate generic descriptions
Fix: Use specific prompts
prompt = """Describe this image with:
1. Main objects and their colors
2. Actions or movements
3. Setting/environment
4. Time of day if visible
5. Any text in the image"""
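For domain-specific collections, the same prompt can be extended with the vocabulary that users are expected to search for. This is a sketch under assumptions: the function name `build_caption_prompt`, the wording of the sixth item, and the medical terms in the example are all illustrative and not part of the skill.

```python
BASE_PROMPT_LINES = [
    "Describe this image with:",
    "1. Main objects and their colors",
    "2. Actions or movements",
    "3. Setting/environment",
    "4. Time of day if visible",
    "5. Any text in the image",
]


def build_caption_prompt(domain_terms=None) -> str:
    """Return the captioning prompt, optionally steering it toward domain terms."""
    lines = list(BASE_PROMPT_LINES)
    if domain_terms:
        # Hypothetical extra instruction; tune the wording for your domain.
        lines.append("6. Use these domain terms where they apply: "
                     + ", ".join(domain_terms))
    return "\n".join(lines)


medical_prompt = build_caption_prompt(["nodule", "left lung lobe"])
```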
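Problem: Search fails with a dimension error or returns unrelated images
Why: The query was embedded with BGE (1024-d) but the images were indexed with CLIP (512-d); vectors from different models are not comparable
Fix: Embed the query with the same model that was used for indexing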
# CLIP indexing → CLIP query
# BGE caption indexing → BGE query
# Never mix!
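A cheap guard against mixing models is to check the query vector's length against the collection it is about to hit. The sketch below uses the two collection names and dimensions from the classes in this document; the names `EXPECTED_DIMS` and `check_query_vector` are hypothetical.

```python
EXPECTED_DIMS = {
    "clip_image_search": 512,   # indexed with clip-ViT-B-32
    "vlm_image_search": 1024,   # indexed with BAAI/bge-large-en-v1.5
}


def check_query_vector(collection_name: str, query_vector: list) -> None:
    """Raise if the query vector cannot belong to this collection's model."""
    expected = EXPECTED_DIMS.get(collection_name)
    if expected is None:
        raise KeyError(f"No expected dimension recorded for {collection_name!r}")
    if len(query_vector) != expected:
        raise ValueError(
            f"{collection_name} expects {expected}-d vectors, "
            f"got {len(query_vector)}-d; was the query embedded with another model?"
        )


# A 512-d placeholder vector passes the CLIP collection check.
check_query_vector("clip_image_search", [0.0] * 512)
```

A length check only catches models with different output sizes. Two different models that share a dimension would pass it, so recording the model name alongside the collection is a stronger safeguard.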
Problem: $100+ API bill for 10K images
Why: Every image is captioned with GPT-4o at roughly $0.01 per image
Fix: Use cheaper models or local VLMs
# Cheaper options:
# - gpt-4o-mini (~$0.003 per image)
# - Local LLaVA (free, requires GPU)
# - Qwen-VL API (cheaper for Chinese)
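Independently of the model chosen, re-indexing should not pay for the same image twice. One possible approach is to cache captions by a hash of the image bytes. The class `CaptionCache` below is a hypothetical in-memory sketch; a real deployment would likely persist the mapping to disk or a database.

```python
import hashlib
from typing import Callable


class CaptionCache:
    """Remember captions by content hash so each unique image is captioned once."""

    def __init__(self):
        self._captions = {}

    def get_or_create(self, image_bytes: bytes,
                      caption_fn: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._captions:
            # Only cache misses reach the (paid) captioning function.
            self._captions[key] = caption_fn(image_bytes)
        return self._captions[key]


# A stand-in captioner that records how often it is called.
calls = []

def fake_captioner(image_bytes: bytes) -> str:
    calls.append(len(image_bytes))
    return "a placeholder caption"

cache = CaptionCache()
first = cache.get_or_create(b"same image bytes", fake_captioner)
second = cache.get_or_create(b"same image bytes", fake_captioner)
```

Hashing the bytes rather than the path means renamed or duplicated files still hit the cache.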
| Need | Upgrade To |
|---|---|
| Also search by image | Add image-search capability |
| Q&A on image content | multimodal-rag |
| Filter by metadata | Add filtered-search pattern |
| Video content | video-search |
image-search/references/image-embeddings.md
core:ray