video-perception
// Use when the user mentions a video file (.mp4, .mov, .avi, .mkv, .webm), a YouTube URL, asks to watch/analyze/review a video, or references video content in conversation
| name | video-perception |
| description | Use when the user mentions a video file (.mp4, .mov, .avi, .mkv, .webm), a YouTube URL, asks to watch/analyze/review a video, or references video content in conversation |
You have access to video understanding tools via the claude-video-vision MCP server.
video_analyze: Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
video_watch: Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
video_detail: Drill into specific segments. Separates extraction from viewing: extract many frames, view few at a time.
video_info: Get video metadata without processing.
video_configure: Change settings (backend, resolution, enable_index, etc.).
video_setup: Check/install dependencies.

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
Always start with video_info to get duration, resolution, and audio presence.
If the user gives a YouTube URL, pass the URL directly as path.
The MCP server downloads it with yt-dlp, prefers YouTube subtitles/auto-captions
for transcription, and falls back to the configured audio backend only when
captions are missing, empty, or suspiciously incomplete.
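The fallback order described above can be sketched as a small decision function. This is only an illustration, not the server's actual logic: the function name, the `"audio_backend"` label, and the characters-per-minute threshold standing in for "suspiciously incomplete" are all assumptions. Only the `youtube_subtitles` and `youtube_auto_captions` labels come from this document.

```python
from typing import Optional


def choose_transcription_source(
    subtitles: Optional[str],
    auto_captions: Optional[str],
    duration_s: float,
    min_chars_per_minute: float = 100.0,  # hypothetical "incomplete" threshold
) -> str:
    """Pick a transcription source, preferring captions over the audio backend."""

    def usable(text: Optional[str]) -> bool:
        # Missing or empty captions are never usable.
        if text is None or not text.strip():
            return False
        # Assumed heuristic: very little text per minute looks incomplete.
        minutes = max(duration_s / 60.0, 1e-9)
        return len(text.strip()) / minutes >= min_chars_per_minute

    if usable(subtitles):
        return "youtube_subtitles"
    if usable(auto_captions):
        return "youtube_auto_captions"
    # Hypothetical label for "the configured audio backend".
    return "audio_backend"
```

With this sketch, a ten-minute video whose only caption text is `"hi"` would fall through to the audio backend, since two characters over ten minutes is far below the assumed threshold.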
REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames.
This is NOT optional — it gives you structural data to make smart extraction decisions.
Select filters relevant to the user's question:
| User intent | Filters to select |
|---|---|
| "What happens in this video?" | scene_changes, silence, transcription |
| "Find the scene transitions" | scene_changes, black_intervals |
| "Are there frozen/stuck parts?" | freeze, blur |
| "Is this a talking head or action?" | motion |
| "When does the music start?" | silence, loudness |
| "Analyze the lighting" | exposure |
| "Summarize this lecture" | transcription, scene_changes, silence |
| General / unclear intent | scene_changes, silence, transcription |
Always include transcription: true when the video has audio — the transcription
tells you WHERE to look visually.
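As a rough illustration, the table above can be expressed as a lookup. The intent keys (`"overview"`, `"transitions"`, and so on) and the flag-dictionary shape are assumptions made for this sketch; only the filter names and the `transcription: true` convention come from this document. Dropping `transcription` for videos without audio is likewise an assumption.

```python
from typing import Dict, List

# Filter names are taken from the table above; the intent keys are hypothetical.
INTENT_FILTERS: Dict[str, List[str]] = {
    "overview": ["scene_changes", "silence", "transcription"],
    "transitions": ["scene_changes", "black_intervals"],
    "frozen": ["freeze", "blur"],
    "motion": ["motion"],
    "music": ["silence", "loudness"],
    "lighting": ["exposure"],
    "lecture": ["transcription", "scene_changes", "silence"],
}
DEFAULT_FILTERS = ["scene_changes", "silence", "transcription"]


def select_filters(intent: str, has_audio: bool) -> Dict[str, bool]:
    """Return filter flags for an intent, falling back to the general set."""
    filters = list(INTENT_FILTERS.get(intent, DEFAULT_FILTERS))
    if has_audio and "transcription" not in filters:
        # Always include transcription when the video has audio.
        filters.append("transcription")
    if not has_audio:
        # Assumption: without an audio track there is nothing to transcribe.
        filters = [f for f in filters if f != "transcription"]
    return {name: True for name in filters}
```

For example, `select_filters("lighting", has_audio=True)` yields the `exposure` flag plus `transcription`, while an unrecognized intent falls back to the general set.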
Use the analysis results and transcription to plan your frame extraction strategy:
Call video_watch to extract frames:
Short videos: use fps: "auto" without view_sample. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
Longer videos: use segments based on analysis data with variable FPS, and view_sample to limit the initial frame count. You can always drill deeper with video_detail.

Use video_detail to drill into specific moments:
Use view_sample: 3 to preview (first, middle, last frame).
Use view if you need more detail.

When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.
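One way to turn analysis data into variable-FPS segments is sketched below: dense sampling in a window around each detected scene change, sparse sampling elsewhere. The function name, the window size, and the segment keys (`start_time`, `end_time`, `fps`) are assumptions for illustration. The real `video_watch` schema may differ, so check the tool description before relying on these names.

```python
from typing import Dict, List


def plan_segments(
    duration_s: float,
    scene_changes: List[float],
    window_s: float = 2.0,    # hypothetical half-width around each scene change
    dense_fps: float = 5.0,   # within the 5-10 range suggested for short moments
    sparse_fps: float = 0.5,  # within the 0.1-0.5 range suggested for long videos
) -> List[Dict[str, float]]:
    """Split a video into dense windows near scene changes and sparse gaps."""
    # Build windows around each scene change, merging any that overlap.
    windows: List[List[float]] = []
    for t in sorted(scene_changes):
        start = max(0.0, t - window_s)
        end = min(duration_s, t + window_s)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    # Fill the gaps between windows with sparse segments.
    segments: List[Dict[str, float]] = []
    cursor = 0.0
    for start, end in windows:
        if start > cursor:
            segments.append({"start_time": cursor, "end_time": start, "fps": sparse_fps})
        segments.append({"start_time": start, "end_time": end, "fps": dense_fps})
        cursor = end
    if cursor < duration_s:
        segments.append({"start_time": cursor, "end_time": duration_s, "fps": sparse_fps})
    return segments
```

For a 60-second video with scene changes at 10, 11, and 40 seconds, this produces five segments: sparse 0-8, dense 8-13 (the two nearby changes merge), sparse 13-38, dense 38-42, and sparse 42-60. The result could then be passed as `segments` together with a small `view_sample` to keep the first look cheap.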
fps: "auto" for general overview. Use the video's original fps (from video_info) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.
resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.
segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.
view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
skip_audio: Set to true when you only need visual analysis.
YouTube URLs: Pass supported YouTube URLs directly as path. Treat
transcription_source: "youtube_subtitles" as stronger than
youtube_auto_captions; auto-captions can still have recognition errors.
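The resolution guidance and the "evenly spaced" behaviour of view_sample can be illustrated with two small helpers. This is a sketch under assumptions: the task labels are invented, and the index formula is one plausible way to space frames evenly, not necessarily the one the server uses.

```python
from typing import Dict, List, Optional, Tuple

# Ranges copied from the guidelines above; the task labels are hypothetical.
RESOLUTION_GUIDE: Dict[str, Tuple[int, Optional[int]]] = {
    "quick_scan": (256, 512),
    "normal": (512, 768),
    "fine_detail": (1024, None),  # "1024+" has no stated upper bound
}


def resolution_range(task: str) -> Tuple[int, Optional[int]]:
    """Return the suggested (low, high) resolution range for a task."""
    return RESOLUTION_GUIDE.get(task, RESOLUTION_GUIDE["normal"])


def sample_indices(total: int, n: int) -> List[int]:
    """Indices of n evenly spaced frames out of total, including both ends."""
    if total <= 0 or n <= 0:
        return []
    if n >= total:
        return list(range(total))  # fewer frames than requested: return all
    if n == 1:
        return [0]
    return [round(i * (total - 1) / (n - 1)) for i in range(n)]
```

With nine extracted frames, `sample_indices(9, 3)` gives `[0, 4, 8]`, which matches the first, middle, and last frame preview described for `view_sample: 3`.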
You receive several sources: the analysis results, the transcription, and the extracted frames.
Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.