multimodal-rag
// Use when user needs RAG on documents with images and text. Triggers on: multimodal RAG, image-text mixed, document with images, PDF with charts, visual RAG, visual Q&A, documents with figures.
| name | multimodal-rag |
| description | Use when user needs RAG on documents with images and text. Triggers on: multimodal RAG, image-text mixed, document with images, PDF with charts, visual RAG, visual Q&A, documents with figures. |
Handle Q&A on documents containing images, tables, charts, and text: answer questions that require understanding both visual and textual content.
Activate this skill when:
Do NOT activate when:
- Only image search is needed: use image-search
- Documents are text only: use rag-toolkit:rag
- Content is video: use video-search

"What type of documents are you processing?"
A) Technical manuals (product docs, installation guides)
B) Reports with charts (financial, research, analytics)
C) Mixed content (presentations, marketing materials)
Which describes your documents? (A/B/C)
"How should we handle images?"
| Strategy | When to Use |
|---|---|
| VLM Description | Charts, diagrams, complex visuals |
| OCR Extraction | Screenshots with text, tables |
| Caption Only | Photos, simple images |
For most cases, VLM Description is recommended.
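The strategy table can be read as a simple routing rule. Below is a minimal sketch of that idea, assuming a hypothetical `image_kind` label is already available for each image (for example from a lightweight classifier or from the user's answer). The label names and the `choose_strategy` helper are illustrative and not part of the skill itself.

```python
# Hypothetical mapping from an image kind to the handling strategy in the table.
STRATEGY_BY_KIND = {
    "chart": "vlm_description",
    "diagram": "vlm_description",
    "screenshot": "ocr_extraction",
    "table": "ocr_extraction",
    "photo": "caption_only",
}


def choose_strategy(image_kind: str) -> str:
    """Return the handling strategy, defaulting to VLM description."""
    # Unknown kinds fall back to the strategy recommended for most cases.
    return STRATEGY_BY_KIND.get(image_kind.strip().lower(), "vlm_description")
```

Defaulting to VLM description mirrors the recommendation above; a real pipeline might instead ask the user when the kind is unknown.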
"Based on your requirements:
Proceed? (yes / adjust [what])"
Think of multimodal RAG as building a searchable illustrated encyclopedia:
┌─────────────────────────────────────────────────────────────┐
│                   Multimodal RAG Pipeline                   │
│                                                             │
│  Document (PDF with images)                                 │
│         │                                                   │
│         ├─────────────────────────┐                         │
│         │                         │                         │
│         ▼                         ▼                         │
│  ┌──────────────┐          ┌──────────────┐                 │
│  │ Text Chunks  │          │ Images       │                 │
│  │              │          │              │                 │
│  │ "Section 1   │          │ [diagram.png]│                 │
│  │ describes..."│          │ [chart.png]  │                 │
│  └──────┬───────┘          └──────┬───────┘                 │
│         │                         │                         │
│         │                         ▼                         │
│         │                  ┌──────────────┐                 │
│         │                  │ VLM Caption  │                 │
│         │                  │ "This chart  │                 │
│         │                  │ shows..."    │                 │
│         │                  └──────┬───────┘                 │
│         │                         │                         │
│         ▼                         ▼                         │
│  ┌──────────────────────────────────────────────┐           │
│  │ Unified Text Embeddings                      │           │
│  │ (both text chunks and image captions)        │           │
│  └──────────────────────┬───────────────────────┘           │
│                         │                                   │
│                         ▼                                   │
│                  ┌──────────────┐                           │
│                  │ Milvus       │                           │
│                  └──────┬───────┘                           │
│                         │                                   │
│                         ▼                                   │
│  Query: "What does the chart show?"                         │
│                         │                                   │
│                         ▼                                   │
│  Retrieved: [text chunk] + [image caption + image]          │
│                         │                                   │
│                         ▼                                   │
│  ┌──────────────────────────────────────────────┐           │
│  │ VLM Answer Generation (with images)          │           │
│  │ "Based on the chart, revenue increased..."   │           │
│  └──────────────────────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘
| Question Type | Text-Only RAG | Multimodal RAG |
|---|---|---|
| "What does paragraph 3 say?" | ✅ Works | ✅ Works |
| "What does the chart show?" | ❌ Can't see | ✅ Understands |
| "Summarize the diagram" | ❌ Can't see | ✅ Describes |
| "What's the value in the table?" | ⚠️ If OCR'd | ✅ Reads directly |
from pymilvus import MilvusClient, DataType
from openai import OpenAI
import fitz  # PyMuPDF
import base64
import mimetypes
import os
import uuid


class MultimodalRAG:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        self.openai = OpenAI()
        self.collection_name = "multimodal_rag"
        self._init_collection()

    def _init_collection(self):
        """Create the collection on first use; reuse it afterwards."""
        if self.client.has_collection(self.collection_name):
            return
        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("content_type", DataType.VARCHAR, max_length=16)  # text/image
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("image_path", DataType.VARCHAR, max_length=512)
        schema.add_field("source", DataType.VARCHAR, max_length=512)
        schema.add_field("page", DataType.INT32)
        # 1536 is the output dimension of text-embedding-3-small
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1536)
        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")
        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def _embed(self, text: str) -> list:
        """Generate embedding using OpenAI API."""
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=[text]
        )
        return response.data[0].embedding

    @staticmethod
    def _image_data_url(image_path: str) -> str:
        """Encode an image file as a data URL whose MIME type matches the file."""
        mime = mimetypes.guess_type(image_path)[0] or "image/png"
        with open(image_path, "rb") as f:
            b64_image = base64.b64encode(f.read()).decode()
        return f"data:{mime};base64,{b64_image}"

    def _describe_image(self, image_path: str) -> str:
        """Generate description of image using VLM."""
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. If it's a chart or diagram, explain what it shows. If there's text, include it."},
                    {"type": "image_url", "image_url": {"url": self._image_data_url(image_path)}}
                ]
            }],
            max_tokens=1000
        )
        return response.choices[0].message.content

    def ingest_pdf(self, pdf_path: str, image_output_dir: str = "./images"):
        """Process PDF and index text + images."""
        os.makedirs(image_output_dir, exist_ok=True)
        source = os.path.basename(pdf_path)
        data = []
        with fitz.open(pdf_path) as doc:
            for page_num, page in enumerate(doc):
                # Extract text
                text = page.get_text()
                if text.strip():
                    for chunk in self._chunk_text(text):
                        data.append({
                            "id": str(uuid.uuid4()),
                            "content_type": "text",
                            "content": chunk,
                            "image_path": "",
                            "source": source,
                            "page": page_num + 1,
                            "embedding": self._embed(chunk)
                        })
                # Extract images
                for img_idx, img in enumerate(page.get_images()):
                    xref = img[0]
                    base_image = doc.extract_image(xref)
                    # Save with the image's real extension (often jpeg, not png)
                    image_path = os.path.join(
                        image_output_dir,
                        f"{source}_p{page_num + 1}_img{img_idx}.{base_image['ext']}"
                    )
                    with open(image_path, "wb") as f:
                        f.write(base_image["image"])
                    # Generate description
                    description = self._describe_image(image_path)
                    data.append({
                        "id": str(uuid.uuid4()),
                        "content_type": "image",
                        "content": description,
                        "image_path": image_path,
                        "source": source,
                        "page": page_num + 1,
                        "embedding": self._embed(description)
                    })
        # Skip the insert for PDFs with no extractable text or images
        if data:
            self.client.insert(collection_name=self.collection_name, data=data)
        return len(data)

    def _chunk_text(self, text: str, chunk_size: int = 500) -> list:
        """Split text into chunks of roughly chunk_size characters, on word boundaries."""
        words = text.split()
        chunks = []
        current_chunk = []
        for word in words:
            current_chunk.append(word)
            if len(" ".join(current_chunk)) > chunk_size:
                chunks.append(" ".join(current_chunk))
                current_chunk = []
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks

    def retrieve(self, query: str, limit: int = 10) -> list:
        """Retrieve relevant text and images."""
        embedding = self._embed(query)
        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=limit,
            output_fields=["content_type", "content", "image_path", "source", "page"]
        )
        return [{
            "type": hit["entity"]["content_type"],
            "content": hit["entity"]["content"],
            "image_path": hit["entity"]["image_path"],
            "source": hit["entity"]["source"],
            "page": hit["entity"]["page"],
            "score": hit["distance"]
        } for hit in results[0]]

    def query(self, question: str, use_images: bool = True) -> dict:
        """Answer question using retrieved context."""
        contexts = self.retrieve(question, limit=10)
        text_contexts = [c for c in contexts if c["type"] == "text"]
        image_contexts = [c for c in contexts if c["type"] == "image"]
        # Build message with text and images
        messages = [{"role": "user", "content": []}]
        # Add text context
        context_text = "\n\n".join([
            f"[{c['source']} Page {c['page']}]\n{c['content']}"
            for c in text_contexts[:5]
        ])
        messages[0]["content"].append({
            "type": "text",
            "text": f"Context:\n{context_text}\n"
        })
        # Add images
        if use_images and image_contexts:
            for img in image_contexts[:3]:
                if os.path.exists(img["image_path"]):
                    messages[0]["content"].append({
                        "type": "image_url",
                        "image_url": {"url": self._image_data_url(img["image_path"])}
                    })
        # Add question
        messages[0]["content"].append({
            "type": "text",
            "text": f"\nQuestion: {question}\nAnswer based on the provided context and images:"
        })
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.3
        )
        return {
            "answer": response.choices[0].message.content,
            # Sorted so the output order is stable between runs
            "sources": sorted(set(c["source"] for c in contexts)),
            "pages": sorted(set(c["page"] for c in contexts))
        }


# Usage
rag = MultimodalRAG()
# Ingest PDF with images
rag.ingest_pdf("product_manual.pdf")
# Ask questions
result = rag.query("What are the installation steps shown in the diagram?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
| Document Type | Text Extraction | Image Processing | VLM Prompt |
|---|---|---|---|
| Technical manual | Full text | Diagram description | "Explain what this diagram shows step by step" |
| Financial report | Full text | Chart data extraction | "Extract all data points from this chart" |
| Medical report | Full text | Image analysis | "Describe any medical findings visible" |
| Presentation | Slide text | Screenshot description | "Summarize what this slide conveys" |
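One way to apply the table above is a small prompt lookup keyed by document type. This is a sketch only: the key names (`technical_manual` and so on) and the `prompt_for` helper are assumptions made for illustration, while the prompt strings are taken from the table and the fallback is the generic prompt used in the implementation.

```python
# Prompts copied from the table; the dictionary keys are hypothetical labels.
VLM_PROMPTS = {
    "technical_manual": "Explain what this diagram shows step by step",
    "financial_report": "Extract all data points from this chart",
    "medical_report": "Describe any medical findings visible",
    "presentation": "Summarize what this slide conveys",
}

# Generic prompt used when the document type is not in the table.
DEFAULT_PROMPT = (
    "Describe this image in detail. If it's a chart or diagram, "
    "explain what it shows. If there's text, include it."
)


def prompt_for(doc_type: str) -> str:
    """Return the VLM prompt for a document type, or the generic prompt."""
    key = doc_type.strip().lower().replace(" ", "_")
    return VLM_PROMPTS.get(key, DEFAULT_PROMPT)
```

The returned string could replace the hard-coded prompt in `_describe_image`, for example by passing it in as an extra argument.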
Problem: VLM can't read chart text
Why: PDF images extracted at low resolution
Fix: Extract at higher resolution
mat = fitz.Matrix(2.0, 2.0) # 2x zoom
pix = page.get_pixmap(matrix=mat)
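The zoom factor is easier to choose when it is tied to a target resolution. PDF page units are points (1/72 inch), so a zoom of 1.0 renders at 72 DPI and the 2x zoom above gives 144 DPI. The sketch below is plain arithmetic with hypothetical helper names; the resulting zoom would be passed to `fitz.Matrix(zoom, zoom)`, and the pixel size is approximate because the renderer may round differently.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_for_dpi(dpi: float) -> float:
    """Zoom factor that renders a page at roughly the requested DPI."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple:
    """Approximate pixel size of a page rendered with the given zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


# An A4 page is about 595 x 842 points.
zoom = zoom_for_dpi(144)
size = rendered_size(595, 842, zoom)
```

Higher zoom means larger images and more VLM input tokens, so it may be worth raising the resolution only for pages where chart text is unreadable.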
Problem: Image description doesn't mention surrounding context
Why: Image processed in isolation
Fix: Include nearby text in prompt
prompt = f"This image appears near the text: '{nearby_text}'. Describe the image and how it relates to this text."
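A slightly fuller version of this fix might normalize and truncate the nearby text before placing it in the prompt, so that a long page does not inflate the request. The helper name `build_image_prompt` and the 300-character limit are assumptions for illustration; how `nearby_text` is obtained (for example, the page text around the image) is left to the caller.

```python
GENERIC_PROMPT = (
    "Describe this image in detail. If it's a chart or diagram, "
    "explain what it shows. If there's text, include it."
)


def build_image_prompt(nearby_text: str, max_chars: int = 300) -> str:
    """Build a VLM prompt that includes a short snippet of surrounding text."""
    # Collapse newlines and repeated spaces left over from PDF extraction.
    snippet = " ".join(nearby_text.split())
    if not snippet:
        return GENERIC_PROMPT
    if len(snippet) > max_chars:
        snippet = snippet[:max_chars].rstrip() + "..."
    return (
        f"This image appears near the text: '{snippet}'. "
        "Describe the image and how it relates to this text."
    )
```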
Problem: Processing costs explode
Why: Calling VLM for every image
Fix: Batch requests, cache descriptions, or use a cheaper model
# Use gpt-4o-mini for initial pass
# Only use gpt-4o for complex charts
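Caching is often the simplest of these fixes, because logos and repeated figures appear many times in a document set. The sketch below keys descriptions by a SHA-256 hash of the image bytes. `DescriptionCache` is a hypothetical helper, the cache lives in memory only, and `fake_vlm` stands in for the real VLM call so the example runs without an API key.

```python
import hashlib
from typing import Callable, Dict


class DescriptionCache:
    """Cache image descriptions by content hash so repeated images cost one call."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def describe(self, image_bytes: bytes, describe_fn: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only call the (expensive) describer for images not seen before.
            self._store[key] = describe_fn(image_bytes)
        return self._store[key]


calls = []


def fake_vlm(image_bytes: bytes) -> str:
    """Stand-in for a VLM call; records each invocation."""
    calls.append(len(image_bytes))
    return f"description of {len(image_bytes)} bytes"


cache = DescriptionCache()
logo = b"same-logo-bytes"
first = cache.describe(logo, fake_vlm)
second = cache.describe(logo, fake_vlm)  # served from the cache
```

A persistent store (a file or a database table keyed by the same hash) would let the cache survive across ingestion runs.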
Problem: User asks about chart but can't see it
Why: Only returning text answer
Fix: Include image references in response
return {
    "answer": answer,
    "referenced_images": [c["image_path"] for c in image_contexts[:3]]
}
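As a standalone illustration of this fix, the response can be assembled from the retrieved contexts while skipping text hits and duplicate paths. The `build_response` helper is hypothetical; it assumes contexts shaped like the output of `retrieve` (dicts with `type` and `image_path` keys).

```python
from typing import Dict, List


def build_response(answer: str, contexts: List[Dict], max_images: int = 3) -> Dict:
    """Attach de-duplicated image paths to the answer, preserving rank order."""
    seen = set()
    images = []
    for c in contexts:
        path = c.get("image_path", "")
        if c.get("type") == "image" and path and path not in seen:
            seen.add(path)
            images.append(path)
    return {"answer": answer, "referenced_images": images[:max_images]}


sample = [
    {"type": "text", "image_path": ""},
    {"type": "image", "image_path": "./images/a.png"},
    {"type": "image", "image_path": "./images/a.png"},
    {"type": "image", "image_path": "./images/b.png"},
]
response = build_response("Revenue increased.", sample)
```

The client can then display the referenced images next to the text answer.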
| Need | Upgrade To |
|---|---|
| Better chunking | Add core:chunking |
| Higher precision | Add core:rerank |
| Video content | video-search |
| Pure text documents | rag-toolkit:rag |
core:ray