| name | clip-aware-embeddings |
| description | Semantic image-text matching with CLIP and alternatives. Use for image search, zero-shot classification, similarity matching. NOT for counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries. Activate on "CLIP", "embeddings", "image similarity", "semantic search", "zero-shot classification", "image-text matching". |
| allowed-tools | Read,Write,Edit,Bash |
Smart image-text matching that knows when CLIP works and when to use alternatives.
| MCP | Purpose |
|---|---|
| Firecrawl | Research latest CLIP alternatives and benchmarks |
| Hugging Face (if configured) | Access model cards and documentation |
Your task:
├─ Semantic search ("find beach images") → CLIP ✅
├─ Zero-shot classification (broad categories) → CLIP ✅
├─ Counting objects → DETR, Faster R-CNN ✅
├─ Fine-grained ID (celebrities, car models) → Specialized model ✅
├─ Spatial relations ("cat left of dog") → GQA, SWIG ✅
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP ✅
✅ Use for:
❌ Do NOT use for:
pip install transformers pillow torch sentence-transformers --break-system-packages
Validation: Run python scripts/validate_setup.py
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
image_features = model.get_image_features(**inputs)
# Search with text
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt")
text_features = model.get_text_features(**text_inputs)
# Compute similarity
similarity = (image_features @ text_features.T).softmax(dim=0)
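The raw dot product above works, but CLIP similarity is conventionally computed on L2-normalized features (cosine similarity). A minimal pure-Python sketch of the ranking math, using small stand-in vectors in place of real CLIP outputs:

```python
import math

# Stand-in embeddings: in practice these come from
# model.get_image_features / model.get_text_features above.
image_vecs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]
query_vec = [1.0, 0.0]  # e.g. "a beach at sunset"

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # Dot product of unit vectors = cosine similarity
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Rank images by cosine similarity to the text query
scores = [cosine(v, query_vec) for v in image_vecs]
ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranked)  # [0, 2, 1] - best-matching image index first
```

The same normalize-then-dot pattern applies to the torch tensors above: divide each feature by its norm along the last dimension before taking `@`.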
❌ Wrong:
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give unreliable results
Why wrong: CLIP's architecture collapses spatial information into a single global vector, so it cannot enumerate objects.
✅ Right:
from transformers import DetrImageProcessor, DetrForObjectDetection
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
import torch
# Detect objects
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# Post-process raw logits/boxes into labeled detections above a confidence threshold
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=torch.tensor([image.size[::-1]])
)[0]
# Filter for cars and count
count = sum(1 for label in results["labels"] if model.config.id2label[label.item()] == "car")
How to detect: If query contains "how many", "count", or numeric questions → Use object detection
❌ Wrong:
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
Why wrong: CLIP was trained on coarse categories. Fine-grained faces, car models, and flower species require specialized models.
✅ Right:
# Use a fine-tuned face recognition model
from transformers import AutoFeatureExtractor, AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained(
"microsoft/resnet-50" # Then fine-tune on celebrity dataset
)
# Or use dedicated face recognition: ArcFace, CosFace
How to detect: If query asks to distinguish between similar items in the same category → Use specialized model
❌ Wrong:
# CLIP cannot understand spatial relationships
prompts = [
"cat to the left of dog",
"cat to the right of dog"
]
# Will give nearly identical scores
Why wrong: CLIP embeddings lose spatial topology; "left" and "right" are treated as interchangeable bag-of-words tokens.
✅ Right:
# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
from swig_model import SpatialRelationModel
model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.
How to detect: If query contains directional words (left, right, above, under, next to) → Use spatial model
❌ Wrong:
prompts = [
"red car and blue truck",
"blue car and red truck"
]
# CLIP often gives similar scores for both
Why wrong: CLIP cannot bind attributes to objects. It sees "red, blue, car, truck" as a bag of concepts.
✅ Right - Use PC-CLIP or DCSMs:
# PC-CLIP: Fine-tuned for pairwise comparisons
from pc_clip import PCCLIPModel
model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps)
How to detect: If query has multiple objects with different attributes → Use compositional model
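The four "How to detect" heuristics above can be sketched as a single routing function. This is a hypothetical illustration of the kind of keyword check `scripts/validate_clip_usage.py` might perform; the function name, keyword sets, and return labels are assumptions, not the actual script:

```python
# Hypothetical routing heuristic - names and keyword lists are
# illustrative, not taken from scripts/validate_clip_usage.py.
COUNT_WORDS = ("how many", "count", "number of")
SPATIAL_WORDS = ("left of", "right of", "above", "below", "under", "next to")
COLORS = ("red", "blue", "green", "black", "white")

def route_query(query: str) -> str:
    q = query.lower()
    if any(w in q for w in COUNT_WORDS):
        return "object-detection"   # DETR, Faster R-CNN
    if any(w in q for w in SPATIAL_WORDS):
        return "spatial-model"      # GQA / SWIG
    # Crude compositional check: two attribute colors joined by "and"
    if sum(c in q for c in COLORS) >= 2 and " and " in q:
        return "compositional"      # PC-CLIP / DCSMs
    return "clip"                   # broad semantic search is safe

print(route_query("how many cars are in this image?"))  # object-detection
print(route_query("find beach images"))                 # clip
```

A production check would also need fine-grained-ID detection (e.g. a named-entity pass for people or product names), which plain keyword matching cannot catch.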
LLM Mistake: LLMs trained on 2021-2023 data will suggest CLIP for everything because these limitations were not yet widely documented. This skill corrects that.
Before using CLIP, check if it's appropriate:
python scripts/validate_clip_usage.py \
--query "your query here" \
--check-all
Returns:
# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts
# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
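Zero-shot classification with CLIP typically wraps each category in a prompt template and softmaxes the image-text logits across categories. A pure-Python sketch of that scoring step, with stand-in logits in place of a real `model(**inputs)` call:

```python
import math

# Broad categories work well; the prompt template is the common
# "a photo of ..." convention. Logits below are stand-ins for the
# image-text scores a real CLIP forward pass would produce.
categories = ["indoor", "outdoor", "nature", "urban"]
prompts = [f"a photo of a {c} scene" for c in categories]

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.1, 0.3, 3.5, 1.0]        # stand-in image-text logits
probs = softmax(logits)
best = categories[max(range(len(probs)), key=probs.__getitem__)]
print(best)  # nature
```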
# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md
# Use specialized models
# See /references/fine_grained_models.md
# Use spatial relation models
# See /references/spatial_models.md
Check:
Validation:
python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"
Possible causes:
Solution: Try broader query or use alternative model
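The "try a broader query" fallback can be sketched as a loop from most specific to broadest phrasing. This is a hypothetical illustration: `score_query()` and its canned scores stand in for a real CLIP similarity call and are not part of this skill's scripts:

```python
# Hypothetical fallback - score_query() simulates a CLIP similarity
# call with canned scores; a real implementation would embed and rank.
def score_query(query: str) -> float:
    canned = {
        "a golden retriever puppy on a beach": 0.18,
        "a dog on a beach": 0.24,
        "a beach": 0.31,
    }
    return canned.get(query, 0.0)

def broaden_until_match(queries, threshold=0.25):
    # Walk from most specific to broadest; stop at the first query
    # whose similarity clears the threshold.
    for q in queries:
        if score_query(q) >= threshold:
            return q
    return None  # no variant matched: switch to an alternative model

print(broaden_until_match([
    "a golden retriever puppy on a beach",
    "a dog on a beach",
    "a beach",
]))  # a beach
```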
| Model | Best For | Avoid For |
|---|---|---|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |
CLIP models:
Inference time (single image, CPU):
/references/clip_limitations.md - Detailed analysis of CLIP's failures
/references/alternatives.md - When to use what model
/references/compositional_reasoning.md - DCSMs and PC-CLIP deep dive
/scripts/validate_clip_usage.py - Pre-flight validation tool
/scripts/diagnose_clip_issue.py - Debug unexpected results

See CHANGELOG.md for version history.