video-search
// Use when user needs to search video content by text or image. Triggers on: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, find moment in video.
| name | video-search |
| description | Use when user needs to search video content by text or image. Triggers on: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, find moment in video. |
Semantic search on video content — find specific moments by describing what you're looking for.
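Activate this skill when the user needs to search video content by text or image: video retrieval, meeting recordings, tutorial videos, surveillance playback, or finding a specific moment in a video.
Do NOT activate when one of the related skills is a better fit: image-search, text-to-image-search, or multimodal-rag.
"What type of videos are you searching?"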
A) Speech-heavy (tutorials, meetings, lectures)
B) Visual-heavy (surveillance, sports, vlogs)
C) Mixed (documentaries, how-to videos)
Which describes your videos? (A/B/C)
"How precise should search results be?"
| Granularity | Segment Length | Use Case |
|---|---|---|
| Coarse | 5-10 minutes | "Find the meeting about budget" |
| Medium | 30-60 seconds | "Find where they discuss pricing" |
| Fine | 5-15 seconds | "Find exactly when John mentioned the deadline" |
"Based on your requirements:
Proceed? (yes / adjust [what])"
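The granularity choice above can be turned into a concrete segment length. The sketch below is illustrative only: the dictionary name, the helper `segment_length_for`, and the decision to use the lower bound of each range are assumptions, not part of the skill itself.

```python
# Segment-length ranges in seconds, copied from the granularity table above.
GRANULARITY_SECONDS = {
    "coarse": (300, 600),  # 5-10 minutes
    "medium": (30, 60),    # 30-60 seconds
    "fine": (5, 15),       # 5-15 seconds
}


def segment_length_for(granularity: str) -> int:
    """Return a segment length in seconds for a granularity level.

    Using the lower bound of each range is an assumption made for this
    sketch; the table only gives ranges.
    """
    key = granularity.strip().lower()
    if key not in GRANULARITY_SECONDS:
        raise ValueError(f"unknown granularity: {granularity!r}")
    low, _high = GRANULARITY_SECONDS[key]
    return low


print(segment_length_for("Medium"))
```

The returned value could be passed as `segment_length` to a transcription step or as the keyframe interval, depending on the video type.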
Think of video processing as converting a video into a searchable book:
┌─────────────────────────────────────────────────────────┐
│                  Video Search Pipeline                  │
│                                                         │
│  Original Video (2 hours)                               │
│         │                                               │
│         ├────────────────────────────────────┐          │
│         │                                    │          │
│         ▼                                    ▼          │
│  ┌──────────────┐                     ┌──────────────┐  │
│  │ Audio Track  │                     │ Video Track  │  │
│  └──────┬───────┘                     └──────┬───────┘  │
│         │                                    │          │
│         ▼                                    ▼          │
│  ┌──────────────┐                     ┌──────────────┐  │
│  │   Whisper    │                     │   Keyframe   │  │
│  │     ASR      │                     │  Extraction  │  │
│  └──────┬───────┘                     └──────┬───────┘  │
│         │                                    │          │
│         ▼                                    ▼          │
│ [Transcript segments]                [Keyframe images]  │
│ "At 0:30, John said..."              [img1] [img2] ...  │
│         │                                    │          │
│         ▼                                    ▼          │
│  ┌──────────────┐                     ┌──────────────┐  │
│  │     BGE      │                     │     CLIP     │  │
│  │   Encoder    │                     │   Encoder    │  │
│  └──────┬───────┘                     └──────┬───────┘  │
│         │                                    │          │
│         └─────────────────┬──────────────────┘          │
│                           ▼                             │
│                    ┌──────────────┐                     │
│                    │    Milvus    │                     │
│                    │   Storage    │                     │
│                    └──────────────┘                     │
└─────────────────────────────────────────────────────────┘
| Approach | What it Searches | Best For |
|---|---|---|
| Transcript | Spoken words | "What did they say about X?" |
| Keyframes | Visual content | "Find the scene with Y" |
Both can be combined for comprehensive search.
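One way to combine the two approaches is to fuse the ranked transcript hits and the ranked keyframe hits by rank rather than by raw score, since scores from different encoders are not directly comparable. The sketch below uses reciprocal rank fusion (RRF), which is a technique chosen here for illustration and not something this skill prescribes. The function name `fuse_hits`, the 30-second bucketing, and the constant `k=60` are all assumptions.

```python
from collections import defaultdict


def fuse_hits(transcript_hits, frame_hits, bucket_seconds=30.0, k=60):
    """Merge two ranked hit lists with reciprocal rank fusion (RRF).

    Each hit is a dict with "video" and "start" keys, ordered best first.
    Hits are grouped into (video, time bucket) keys so that a transcript
    segment and a keyframe from the same moment reinforce each other.
    """
    scores = defaultdict(float)
    for hits in (transcript_hits, frame_hits):
        seen = set()
        for rank, hit in enumerate(hits, start=1):
            key = (hit["video"], int(hit["start"] // bucket_seconds))
            if key in seen:  # count each bucket at most once per list
                continue
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"video": video, "start": bucket * bucket_seconds, "score": score}
        for (video, bucket), score in ranked
    ]


# Hypothetical hits: both lists point at the 60-90 s window of a.mp4.
transcript = [{"video": "a.mp4", "start": 65.0}, {"video": "a.mp4", "start": 10.0}]
frames = [{"video": "a.mp4", "start": 300.0}, {"video": "a.mp4", "start": 60.0}]
fused = fuse_hits(transcript, frames)
print([hit["start"] for hit in fused])
```

A moment that appears in both lists accumulates two rank contributions and moves to the top, even when neither list ranked it first.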
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer


class VideoSearch:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        # bge-large-en-v1.5 produces 1024-dimensional vectors (matches dim=1024 below)
        self.text_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.collection_name = "video_search"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return

        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("video_path", DataType.VARCHAR, max_length=512)
        schema.add_field("content_type", DataType.VARCHAR, max_length=16)  # transcript/frame
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("start_time", DataType.FLOAT)
        schema.add_field("end_time", DataType.FLOAT)
        # NOTE: queries are encoded with the text model, so "frame" rows are only
        # searchable here if their vectors live in the same 1024-d space.
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)

        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def search(self, query: str, limit: int = 10, search_type: str = "all") -> list:
        """Search video clips

        search_type: "all" | "transcript" | "frame"
        """
        embedding = self.text_model.encode(query).tolist()

        filter_expr = ""
        if search_type == "transcript":
            filter_expr = 'content_type == "transcript"'
        elif search_type == "frame":
            filter_expr = 'content_type == "frame"'

        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            filter=filter_expr,  # an empty string means "no filter"
            limit=limit,
            output_fields=["video_path", "content", "start_time", "end_time", "content_type"]
        )

        return [{
            "video": hit["entity"]["video_path"],
            "type": hit["entity"]["content_type"],
            # Truncate long content to a 200-character preview
            "content": (hit["entity"]["content"][:200] + "...") if len(hit["entity"]["content"]) > 200 else hit["entity"]["content"],
            "start": hit["entity"]["start_time"],
            "end": hit["entity"]["end_time"],
            "score": hit["distance"]
        } for hit in results[0]]

    def format_timestamp(self, seconds: float) -> str:
        """Convert seconds to H:MM:SS (or M:SS when under one hour)"""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        if hours > 0:
            return f"{hours}:{minutes:02d}:{secs:02d}"
        return f"{minutes}:{secs:02d}"


# Usage
search = VideoSearch()
results = search.search("how to configure the database connection")

for r in results:
    start = search.format_timestamp(r['start'])
    end = search.format_timestamp(r['end'])
    print(f"[{start} - {end}] ({r['type']})")
    print(f"  {r['content']}")
    print(f"  Score: {r['score']:.3f}")
    print()
import subprocess

import whisper


def extract_audio(video_path: str, audio_path: str):
    """Extract the audio track as 16 kHz mono 16-bit PCM (WAV)."""
    subprocess.run([
        'ffmpeg', '-i', video_path, '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1', audio_path, '-y'
    ], check=True)


def transcribe_audio(audio_path: str, segment_length: int = 30):
    """Transcribe audio and merge Whisper segments into chunks of roughly segment_length seconds."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)

    segments = []
    current_segment = {"text": "", "start": 0, "end": 0}

    for segment in result["segments"]:
        text = segment["text"].strip()  # Whisper segment text usually starts with a space
        if segment["end"] - current_segment["start"] > segment_length:
            # Current chunk is full: flush it and start a new one
            if current_segment["text"]:
                segments.append(current_segment)
            current_segment = {
                "text": text,
                "start": segment["start"],
                "end": segment["end"]
            }
        else:
            current_segment["text"] = (current_segment["text"] + " " + text).strip()
            current_segment["end"] = segment["end"]

    # Flush the trailing chunk
    if current_segment["text"]:
        segments.append(current_segment)

    return segments
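The class above defines a schema but no ingestion step. A minimal sketch of how transcript segments could be shaped into rows for that schema follows. The helper name `build_transcript_rows`, the MD5-based ID scheme, and the injected `embed` callable are assumptions made for illustration; only the field names come from the schema shown earlier.

```python
import hashlib
from typing import Callable, Dict, List


def build_transcript_rows(
    video_path: str,
    segments: List[Dict],
    embed: Callable[[str], List[float]],
) -> List[Dict]:
    """Turn transcript segments into rows shaped like the collection schema."""
    rows = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # nothing to index for an empty segment
        # Hypothetical ID scheme: the same video and start time always
        # produce the same primary key.
        raw_id = f"{video_path}:transcript:{seg['start']:.2f}"
        rows.append({
            "id": hashlib.md5(raw_id.encode("utf-8")).hexdigest(),  # 32 chars, fits max_length=64
            "video_path": video_path,
            "content_type": "transcript",
            "content": text,
            "start_time": float(seg["start"]),
            "end_time": float(seg["end"]),
            "embedding": embed(text),
        })
    return rows


# Stand-in embedder so the sketch runs without downloading a model.
demo_rows = build_transcript_rows(
    "talk.mp4",
    [{"text": " hello world ", "start": 0, "end": 2.5}, {"text": "  ", "start": 3, "end": 4}],
    embed=lambda text: [0.0, 0.0, 0.0, 0.0],
)
print(len(demo_rows), demo_rows[0]["content"])
```

With the real model, `embed` would be something like `lambda t: text_model.encode(t).tolist()`, and the rows could then be passed to `MilvusClient.insert(collection_name=..., data=rows)`.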
import cv2


def extract_keyframes(video_path: str, interval_seconds: int = 30):
    """Extract keyframes at regular intervals."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    if not cap.isOpened() or fps <= 0:
        cap.release()
        raise ValueError(f"Cannot read video or FPS: {video_path}")
    # Guard against a zero interval, which would make the modulo below fail
    frame_interval = max(1, int(fps * interval_seconds))

    frames = []
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            timestamp = frame_count / fps
            frames.append({
                "frame": frame,
                "timestamp": timestamp
            })

        frame_count += 1

    cap.release()
    return frames
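Decoding every frame just to keep one per interval can be slow on long videos. A possible alternative is to compute the frame indices up front and seek to each one. The helper below is a hypothetical sketch of the index computation only; the name `keyframe_positions` is an assumption, and it is deliberately kept free of OpenCV so it can be checked in isolation.

```python
def keyframe_positions(total_frames: int, fps: float, interval_seconds: float):
    """Return (frame_index, timestamp_seconds) pairs to sample.

    Hypothetical helper: total_frames and fps would typically come from
    cap.get(cv2.CAP_PROP_FRAME_COUNT) and cap.get(cv2.CAP_PROP_FPS).
    """
    if fps <= 0 or total_frames <= 0:
        return []
    step = max(1, int(round(fps * interval_seconds)))
    return [(idx, idx / fps) for idx in range(0, total_frames, step)]


# A 4-second clip at 25 fps, sampled every 2 seconds.
print(keyframe_positions(100, 25.0, 2))
```

Each index could then be read with `cap.set(cv2.CAP_PROP_POS_FRAMES, idx)` followed by `cap.read()`. Seek accuracy depends on the codec and container, so this is worth validating on your own footage.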
| Video Type | ASR | Keyframes | Segment Length |
|---|---|---|---|
| Tutorials | ✅ Primary | Every 30s | 30s |
| Meetings | ✅ Primary | Every 60s | 60s |
| Surveillance | ❌ Skip | Every 5s | 10s |
| Movies/Shows | ✅ Subtitles | Every 10s | 30s |
| Sports | ⚠️ Commentary | Every 3s | 15s |
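The table above can be encoded as a lookup so that the pipeline settings follow from the video type. This is a sketch under assumptions: the `Profile` dataclass, the `PROFILES` mapping, and `profile_for` are illustrative names, and the table's "Subtitles" and "Commentary" notes are reduced to a plain boolean here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    use_asr: bool              # whether to run speech recognition at all
    keyframe_interval_s: int   # seconds between extracted keyframes
    segment_length_s: int      # transcript/clip segment length in seconds


# Values copied from the table above.
PROFILES = {
    "tutorials": Profile(True, 30, 30),
    "meetings": Profile(True, 60, 60),
    "surveillance": Profile(False, 5, 10),
    "movies/shows": Profile(True, 10, 30),  # table suggests subtitles as the text source
    "sports": Profile(True, 3, 15),         # table flags ASR as commentary only
}


def profile_for(video_type: str) -> Profile:
    """Look up processing settings for a video type (case-insensitive)."""
    key = video_type.strip().lower()
    if key not in PROFILES:
        raise KeyError(f"no profile for video type: {video_type!r}")
    return PROFILES[key]


print(profile_for("Surveillance"))
```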
Problem: Processing takes forever on 4K videos
Why: Video processing is compute-intensive
Fix: Downscale for processing
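# Extract at 720p for processing
# scale=-2:720 keeps the aspect ratio and rounds the width to an even number,
# which encoders such as libx264 require
subprocess.run([
    'ffmpeg', '-i', input_path, '-vf', 'scale=-2:720',
    output_path, '-y'
], check=True)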
Problem: Storage explodes with frequent keyframes
Why: Every 5 seconds on a 2-hour video = 1440 frames
Fix: Use scene detection or longer intervals
# Scene-change detection
from scenedetect import detect, ContentDetector
scenes = detect(video_path, ContentDetector())
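The frame count in the "Why" line above is simple arithmetic, and the same arithmetic gives a rough lower bound on vector storage. The sketch below assumes 1024-dimensional float32 vectors (matching `dim=1024` in the schema) and counts raw vector bytes only; index overhead and any stored frame images are not included. The function names are hypothetical.

```python
import math


def estimate_keyframes(duration_s: float, interval_s: float) -> int:
    """Number of keyframes when sampling one frame every interval_s seconds."""
    return math.ceil(duration_s / interval_s)


def embedding_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of n_vectors float32 embeddings (assumes 4 bytes per value)."""
    return n_vectors * dim * bytes_per_value


frames_5s = estimate_keyframes(2 * 3600, 5)    # the 2-hour, every-5-seconds case
frames_30s = estimate_keyframes(2 * 3600, 30)  # same video, every 30 seconds
print(frames_5s, frames_30s, embedding_bytes(frames_5s, 1024))
```

Moving from a 5-second to a 30-second interval cuts the frame count by a factor of six, which is the kind of saving longer intervals or scene detection aim for.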
Problem: Transcription errors make search miss results
Why: Speech recognition isn't perfect
Fix: Store both raw and corrected transcripts, or use phonetic search
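As a lightweight stand-in for phonetic search, query words can be expanded with close spellings that actually occur in the transcripts. Note that the sketch below uses fuzzy string matching from Python's standard `difflib`, not a true phonetic algorithm. The function name `expand_query_terms`, the 0.8 cutoff, and the sample vocabulary are assumptions.

```python
import difflib
from typing import Dict, List


def expand_query_terms(query: str, vocabulary: List[str], cutoff: float = 0.8) -> Dict[str, List[str]]:
    """Map each query word to similar spellings found in the transcript vocabulary.

    This is fuzzy string matching, used here in place of phonetic matching.
    """
    expanded = {}
    for word in query.lower().split():
        expanded[word] = difflib.get_close_matches(word, vocabulary, n=3, cutoff=cutoff)
    return expanded


# Hypothetical vocabulary containing a mis-transcribed "kubernetes".
vocab = ["kubernetties", "database", "connection"]
print(expand_query_terms("Kubernetes database", vocab))
```

The expanded terms could be appended to the query text before embedding, or used as a keyword filter alongside the vector search.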
Problem: Can't quickly seek to result in video player
Why: Only stored content, not timestamps
Fix: Always store start_time and end_time for each segment
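With `start_time` and `end_time` stored, a search hit can be turned directly into a playable clip. The sketch below only builds an ffmpeg argument list and does not run it. The function name `clip_command` and the 2-second padding are assumptions, and `-c copy` is chosen for speed, which means cut points snap to the nearest keyframes rather than exact timestamps.

```python
from typing import List


def clip_command(video_path: str, start: float, end: float,
                 out_path: str, padding: float = 2.0) -> List[str]:
    """Build an ffmpeg command that cuts [start, end] out of a video.

    padding (hypothetical default) adds context on both sides of the hit.
    """
    begin = max(0.0, start - padding)
    duration = (end + padding) - begin
    return [
        "ffmpeg",
        "-ss", f"{begin:.3f}",    # seek on the input before decoding
        "-i", video_path,
        "-t", f"{duration:.3f}",  # clip length in seconds
        "-c", "copy",             # stream copy: fast, but not frame-exact
        out_path, "-y",
    ]


print(clip_command("a.mp4", 30.0, 60.0, "clip.mp4"))
```

The resulting list can be passed to `subprocess.run(..., check=True)` in the same way as the audio extraction step.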
| Need | Upgrade To |
|---|---|
| Search by uploading image | Combine with image-search |
| Q&A on video content | Add RAG layer |
| Real-time streaming | Consider specialized tools |
| Speaker identification | Add speaker diarization |
references/frame-sampling.md
core:ray