video-search

Name: Video Search
Author: zilliztech

// Use when user needs to search video content by text or image. Triggers on: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, find moment in video.

stars:2

forks:2

updated: February 4, 2026, 02:42


package.json

"author": "zilliztech"

"repository": "zilliztech/milvus-marketplace"

name	video-search
description	Use when user needs to search video content by text or image. Triggers on: video search, video retrieval, video clips, meeting recordings, tutorial videos, surveillance playback, find moment in video.

Video Search

Semantic search on video content — find specific moments by describing what you're looking for.

When to Activate

Activate this skill when:

User wants to search within video content by description
User mentions "find in video", "video search", "meeting recordings"
User has tutorial videos, meetings, or surveillance footage to search
User needs to find specific scenes or discussions in long videos

Do NOT activate when:

User only needs image search → use image-search
User wants to search images by text → use text-to-image-search
User has static documents with images → use multimodal-rag

Interactive Flow

Step 1: Understand Video Type

"What type of videos are you searching?"

A) Speech-heavy (tutorials, meetings, lectures)

Primary content is spoken words
ASR (speech-to-text) is key

B) Visual-heavy (surveillance, sports, vlogs)

Actions and scenes matter more than speech
Keyframe extraction is key

C) Mixed (documentaries, how-to videos)

Both speech and visuals important
Need both approaches

Which describes your videos? (A/B/C)

Step 2: Determine Search Granularity

"How precise should search results be?"

Granularity	Segment Length	Use Case
Coarse	5-10 minutes	"Find the meeting about budget"
Medium	30-60 seconds	"Find where they discuss pricing"
Fine	5-15 seconds	"Find exactly when John mentioned the deadline"

Step 3: Confirm Configuration

"Based on your requirements:

Processing: ASR (Whisper) + Keyframes every 30s
Segment size: 30 seconds
Embedding: BGE for text, CLIP for frames

Proceed? (yes / adjust [what])"

Core Concepts

Mental Model: Video as Searchable Book

Think of video processing as converting a video into a searchable book:

Each chapter = video segment (30-60 seconds)
Each chapter has text (transcript) + illustrations (keyframes)
Search finds the right "chapter"

┌─────────────────────────────────────────────────────────┐
│                    Video Search Pipeline                 │
│                                                          │
│  Original Video (2 hours)                                │
│       │                                                  │
│       ├────────────────────────────────────┐            │
│       │                                    │            │
│       ▼                                    ▼            │
│  ┌──────────────┐                  ┌──────────────┐    │
│  │ Audio Track  │                  │ Video Track  │    │
│  └──────┬───────┘                  └──────┬───────┘    │
│         │                                  │            │
│         ▼                                  ▼            │
│  ┌──────────────┐                  ┌──────────────┐    │
│  │    Whisper   │                  │   Keyframe   │    │
│  │     ASR      │                  │  Extraction  │    │
│  └──────┬───────┘                  └──────┬───────┘    │
│         │                                  │            │
│         ▼                                  ▼            │
│  [Transcript segments]             [Keyframe images]    │
│  "At 0:30, John said..."           [img1] [img2] ...   │
│         │                                  │            │
│         ▼                                  ▼            │
│  ┌──────────────┐                  ┌──────────────┐    │
│  │     BGE      │                  │    CLIP      │    │
│  │   Encoder    │                  │   Encoder    │    │
│  └──────┬───────┘                  └──────┬───────┘    │
│         │                                  │            │
│         └────────────┬─────────────────────┘            │
│                      ▼                                   │
│              ┌──────────────┐                           │
│              │    Milvus    │                           │
│              │   Storage    │                           │
│              └──────────────┘                           │
└─────────────────────────────────────────────────────────┘
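The "chapter" idea can be sketched in plain Python. This is a minimal illustration, not part of the skill's code: the `Chapter` dataclass and `build_chapters` helper are hypothetical names, and the input shapes (segment dicts with `start`, `end`, `text`, plus a sorted list of keyframe timestamps) are assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Chapter:
    """One searchable 'chapter': a transcript span plus the keyframes inside it."""
    start: float
    end: float
    text: str
    keyframe_times: list[float] = field(default_factory=list)


def build_chapters(segments: list[dict], frame_times: list[float]) -> list[Chapter]:
    """Attach each keyframe timestamp to the transcript segment that contains it.

    `frame_times` must be sorted ascending. A keyframe at time t belongs to the
    segment where start <= t < end.
    """
    chapters = []
    for seg in segments:
        lo = bisect_left(frame_times, seg["start"])
        hi = bisect_left(frame_times, seg["end"])
        chapters.append(
            Chapter(seg["start"], seg["end"], seg["text"].strip(), frame_times[lo:hi])
        )
    return chapters


chapters = build_chapters(
    [
        {"start": 0, "end": 30, "text": " intro"},
        {"start": 30, "end": 60, "text": " pricing"},
    ],
    [0, 15, 30, 45],
)
```

Each chapter now carries both its text and the timestamps of its illustrations, which is the unit a search result would point back to.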

Two Search Approaches

Approach	What it Searches	Best For
Transcript	Spoken words	"What did they say about X?"
Keyframes	Visual content	"Find the scene with Y"

Both can be combined for comprehensive search.
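One possible way to combine the two approaches is reciprocal rank fusion (RRF), which merges ranked lists by position instead of comparing raw scores from different encoders. The sketch below is an assumption, not the skill's own method: `fuse_results`, the `segment_seconds` bucketing, and the constant `k = 60` are illustrative choices, and hits are assumed to be dicts with `video` and `start` keys like those returned by `search()` in the Implementation section.

```python
from collections import defaultdict


def fuse_results(transcript_hits, frame_hits, segment_seconds: float = 30.0, k: int = 60):
    """Merge two ranked hit lists with reciprocal rank fusion.

    Hits that fall into the same (video, time bucket) are treated as the same
    moment, so a moment found by both approaches rises to the top.
    """
    scores = defaultdict(float)
    first_hit = {}
    for hits in (transcript_hits, frame_hits):
        for rank, hit in enumerate(hits, start=1):
            key = (hit["video"], int(hit["start"] // segment_seconds))
            scores[key] += 1.0 / (k + rank)
            first_hit.setdefault(key, hit)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [{**first_hit[key], "fused_score": scores[key]} for key in ranked]


fused = fuse_results(
    transcript_hits=[{"video": "a.mp4", "start": 30.0}, {"video": "a.mp4", "start": 90.0}],
    frame_hits=[{"video": "a.mp4", "start": 90.0}, {"video": "a.mp4", "start": 0.0}],
)
```

Here the moment at 90 s appears in both lists, so it ranks first even though neither list placed it first on its own.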

Implementation

from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer

class VideoSearch:
    def __init__(self, uri: str = "./milvus.db"):
        self.client = MilvusClient(uri=uri)
        self.text_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.collection_name = "video_search"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return

        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("video_path", DataType.VARCHAR, max_length=512)
        schema.add_field("content_type", DataType.VARCHAR, max_length=16)  # transcript/frame
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("start_time", DataType.FLOAT)
        schema.add_field("end_time", DataType.FLOAT)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)

        index_params = self.client.prepare_index_params()
        index_params.add_index("embedding", index_type="AUTOINDEX", metric_type="COSINE")

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def search(self, query: str, limit: int = 10, search_type: str = "all") -> list:
        """Search video clips.

        search_type: "all" | "transcript" | "frame"
        """
        # The query is encoded with the text model only, so it is comparable
        # only to rows whose embeddings live in the same vector space.
        embedding = self.text_model.encode(query).tolist()

        # An empty filter expression means "no filter".
        filter_expr = ""
        if search_type == "transcript":
            filter_expr = 'content_type == "transcript"'
        elif search_type == "frame":
            filter_expr = 'content_type == "frame"'

        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            filter=filter_expr,
            limit=limit,
            output_fields=["video_path", "content", "start_time", "end_time", "content_type"]
        )

        hits = []
        for hit in results[0]:
            entity = hit["entity"]
            # Truncate long content for display.
            content = entity["content"]
            if len(content) > 200:
                content = content[:200] + "..."
            hits.append({
                "video": entity["video_path"],
                "type": entity["content_type"],
                "content": content,
                "start": entity["start_time"],
                "end": entity["end_time"],
                "score": hit["distance"]
            })
        return hits

    def format_timestamp(self, seconds: float) -> str:
        """Convert seconds to HH:MM:SS"""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        if hours > 0:
            return f"{hours}:{minutes:02d}:{secs:02d}"
        return f"{minutes}:{secs:02d}"

# Usage
search = VideoSearch()
results = search.search("how to configure the database connection")

for r in results:
    start = search.format_timestamp(r['start'])
    end = search.format_timestamp(r['end'])
    print(f"[{start} - {end}] ({r['type']})")
    print(f"  {r['content']}")
    print(f"  Score: {r['score']:.3f}")
    print()
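The class above only searches, so the collection must be filled first. A minimal sketch of preparing transcript rows that match the schema follows. `build_transcript_rows` and the injected `embed` callable are hypothetical names introduced for illustration, and deriving the primary key from a SHA-1 hash of path and start time is an assumption, chosen because its 40 hex characters fit the `max_length=64` of the `id` field.

```python
import hashlib
from typing import Callable, Sequence


def build_transcript_rows(
    video_path: str,
    segments: Sequence[dict],
    embed: Callable[[str], Sequence[float]],
) -> list[dict]:
    """Turn transcript segments into row dicts matching the collection schema."""
    rows = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip silent or empty segments
        raw_id = f"{video_path}:transcript:{seg['start']:.3f}"
        rows.append({
            "id": hashlib.sha1(raw_id.encode("utf-8")).hexdigest(),
            "video_path": video_path,
            "content_type": "transcript",
            "content": text,
            "start_time": float(seg["start"]),
            "end_time": float(seg["end"]),
            "embedding": list(embed(text)),
        })
    return rows


# Stand-in embedder for demonstration only; a real one must return 1024 floats.
rows = build_transcript_rows(
    "talk.mp4",
    [{"text": " hello", "start": 0.0, "end": 5.0}, {"text": "  ", "start": 5.0, "end": 6.0}],
    embed=lambda text: [0.0] * 4,
)
```

With the `VideoSearch` instance from the usage example, `embed` could plausibly be `lambda t: search.text_model.encode(t).tolist()`, and the rows could be written with `search.client.insert(collection_name=search.collection_name, data=rows)`.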

Video Processing Pipeline

Audio Processing (Transcription)

import whisper
import subprocess

def extract_audio(video_path: str, audio_path: str):
    """Extract audio track from video."""
    subprocess.run([
        'ffmpeg', '-i', video_path, '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1', audio_path, '-y'
    ], check=True)

def transcribe_audio(audio_path: str, segment_length: int = 30):
    """Transcribe audio and merge Whisper segments into chunks of about segment_length seconds."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)

    segments = []
    current_segment = None  # no chunk open until the first Whisper segment arrives

    for segment in result["segments"]:
        text = segment["text"].strip()
        # Start a new chunk when adding this segment would exceed segment_length.
        if current_segment is None or segment["end"] - current_segment["start"] > segment_length:
            if current_segment is not None and current_segment["text"]:
                segments.append(current_segment)
            current_segment = {
                "text": text,
                "start": segment["start"],
                "end": segment["end"]
            }
        else:
            current_segment["text"] = (current_segment["text"] + " " + text).strip()
            current_segment["end"] = segment["end"]

    if current_segment is not None and current_segment["text"]:
        segments.append(current_segment)

    return segments

Frame Extraction

import cv2

def extract_keyframes(video_path: str, interval_seconds: int = 30):
    """Extract keyframes at regular intervals."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")

    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps <= 0:
        cap.release()
        raise ValueError(f"Could not read FPS for: {video_path}")
    # At least 1, so the modulo below never divides by zero.
    frame_interval = max(1, int(fps * interval_seconds))

    frames = []
    frame_count = 0

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            timestamp = frame_count / fps
            frames.append({
                "frame": frame,
                "timestamp": timestamp
            })

        frame_count += 1

    cap.release()
    return frames
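The sampling schedule itself can be computed without decoding any frames, which helps when planning how many keyframes a video will produce. This is a sketch under assumptions: `keyframe_timestamps` is a hypothetical helper, and the video duration is assumed to be known in seconds.

```python
import math


def keyframe_timestamps(duration_seconds: float, interval_seconds: float = 30.0) -> list[float]:
    """Return the timestamps (in seconds) sampled at a fixed interval, starting at 0."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if duration_seconds <= 0:
        return []
    count = math.ceil(duration_seconds / interval_seconds)
    return [i * float(interval_seconds) for i in range(count)]


short_clip = keyframe_timestamps(90, 30)
two_hours_every_5s = keyframe_timestamps(2 * 60 * 60, 5)
```

Given such a list, one option is to seek directly with `cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)` before each `cap.read()` instead of reading every frame, although seek accuracy depends on the codec and container.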

Processing Strategy by Video Type

Video Type	ASR	Keyframes	Segment Length
Tutorials	✅ Primary	Every 30s	30s
Meetings	✅ Primary	Every 60s	60s
Surveillance	❌ Skip	Every 5s	10s
Movies/Shows	✅ Subtitles	Every 10s	30s
Sports	⚠️ Commentary	Every 3s	15s
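The table above can be expressed as a small lookup so that the confirmation step reads its defaults from one place. This is only a sketch: the `Strategy` dataclass, the lowercase keys, and the fallback to the tutorial settings for unknown types are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    asr: str                  # "primary", "skip", "subtitles", or "commentary"
    keyframe_interval_s: int  # seconds between keyframes
    segment_length_s: int     # seconds per searchable segment


STRATEGIES = {
    "tutorials": Strategy("primary", 30, 30),
    "meetings": Strategy("primary", 60, 60),
    "surveillance": Strategy("skip", 5, 10),
    "movies/shows": Strategy("subtitles", 10, 30),
    "sports": Strategy("commentary", 3, 15),
}


def strategy_for(video_type: str) -> Strategy:
    """Look up the processing strategy; unknown types fall back to tutorials (an assumption)."""
    return STRATEGIES.get(video_type.strip().lower(), STRATEGIES["tutorials"])


surveillance = strategy_for("Surveillance")
```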

Common Pitfalls

❌ Pitfall 1: Processing Full Resolution

Problem: Processing takes forever on 4K videos

Why: Video processing is compute-intensive

Fix: Downscale for processing

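# Downscale to 720p for processing.
# scale=-2:720 keeps the aspect ratio and rounds the width to an even number,
# which encoders such as libx264 require.
subprocess.run([
    'ffmpeg', '-y', '-i', input_path, '-vf', 'scale=-2:720',
    output_path
], check=True)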

❌ Pitfall 2: Too Many Keyframes

Problem: Storage explodes with frequent keyframes

Why: Every 5 seconds on a 2-hour video = 1440 frames

Fix: Use scene detection or longer intervals

# Scene-change detection
from scenedetect import detect, ContentDetector
scenes = detect(video_path, ContentDetector())
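Once scene boundaries are known, one keyframe per scene is often enough, with extra samples only inside very long scenes. The helper below is a hypothetical sketch: `scene_keyframe_times` and the `max_gap` parameter are assumptions, and scenes are assumed to be given as `(start_seconds, end_seconds)` pairs.

```python
import math


def scene_keyframe_times(scenes: list[tuple[float, float]], max_gap: float = 60.0) -> list[float]:
    """Pick keyframe timestamps from scene boundaries.

    Each scene is split into equal parts no longer than max_gap seconds, and
    the midpoint of each part is sampled, so short scenes yield one keyframe.
    """
    times = []
    for start, end in scenes:
        length = end - start
        if length <= 0:
            continue
        samples = max(1, math.ceil(length / max_gap))
        step = length / samples
        times.extend(start + step * (i + 0.5) for i in range(samples))
    return times


times = scene_keyframe_times([(0, 10), (10, 130)], max_gap=60)
```

With PySceneDetect, the pairs could be derived from the `detect` result as `[(s.get_seconds(), e.get_seconds()) for s, e in scenes]`.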

❌ Pitfall 3: Ignoring ASR Errors

Problem: Transcription errors make search miss results

Why: Speech recognition isn't perfect

Fix: Store both raw and corrected transcripts, or use phonetic search
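As a lightweight stand-in for phonetic search, approximate string matching can catch many misrecognized words. Note that this swaps in a different technique: the sketch uses character-level similarity from the standard library's `difflib`, not a phonetic algorithm, and `fuzzy_contains` and its 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_contains(query: str, transcript: str, threshold: float = 0.8) -> bool:
    """Return True if some window of transcript words is similar enough to the query."""
    q_words = query.lower().split()
    t_words = transcript.lower().split()
    if not q_words or len(t_words) < len(q_words):
        return False
    target = " ".join(q_words)
    width = len(q_words)
    # Slide a window of the same word count as the query across the transcript.
    for i in range(len(t_words) - width + 1):
        window = " ".join(t_words[i:i + width])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


# "milvis" is a plausible ASR misspelling of "milvus".
found = fuzzy_contains("milvus database", "we store vectors in the milvis database today")
missed = fuzzy_contains("budget review", "we store vectors in the milvus database")
```

A check like this could run as a fallback over stored raw transcripts when vector search returns nothing relevant.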

❌ Pitfall 4: No Timestamp Indexing

Problem: Can't quickly seek to result in video player

Why: Only stored content, not timestamps

Fix: Always store start_time and end_time for each segment
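Stored timestamps make it straightforward to jump to or cut out a result. As a sketch, the hypothetical helper `clip_command` below builds an ffmpeg argument list for extracting the matched span. The file names are placeholders, and stream copy (`-c copy`) is an assumption chosen for speed: it cuts on keyframes, so clip boundaries may be slightly off.

```python
def clip_command(video_path: str, start: float, end: float, output_path: str) -> list[str]:
    """Build an ffmpeg command that copies the [start, end] span into output_path."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",       # seek to the segment start
        "-i", video_path,
        "-t", f"{end - start:.3f}",  # keep only the segment duration
        "-c", "copy",                # no re-encoding
        output_path,
    ]


cmd = clip_command("talk.mp4", 30.0, 60.0, "clip.mp4")
```

The list can be passed to `subprocess.run(cmd, check=True)` with the `start` and `end` values from a search result.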

When to Level Up

Need	Upgrade To
Search by uploading image	Combine with `image-search`
Q&A on video content	Add RAG layer
Real-time streaming	Consider specialized tools
Speaker identification	Add speaker diarization

References

Frame extraction tools: references/frame-sampling.md
ASR models: Whisper, FunASR, Azure Speech
Batch processing: core:ray
