| name | media-processing |
| description | Ingest and process media files (video, audio, image) |
| compatibility | Designed for Vellum personal assistants |
| metadata | {"emoji":"🎬","vellum":{"display-name":"Media Processing","activation-hints":["User wants to ingest, process, or analyze a local video, audio, or image file","User wants to ask natural-language questions about a video's content (e.g. 'what happens at 3:14', 'find all the goals', 'who appears in this footage')","User wants to extract a clip or short segment from a video","User wants to run keyframe extraction, segmentation, or frame-by-frame analysis on a video"],"avoid-when":["User only wants a plain audio transcript — use the transcribe skill instead","User just wants to play or open a media file in their default system player — that does not need this pipeline"]}} |
Ingest and track processing of media files (video, audio, images) through a configurable 3-phase pipeline.
The processing pipeline follows a sequential 3-phase flow:
1. ingest_media - Register a media file, detect MIME type, extract duration, deduplicate by content hash.
2. extract_keyframes - Detect dead time, segment the video into windows, extract downscaled keyframes, build a subject registry, and write a pipeline manifest.
3. analyze_keyframes - Send each segment's frames to Gemini 2.5 Flash with assistant-provided extraction instructions and a JSON Schema for guaranteed structured output. Supports concurrency pooling, cost tracking, resumability, and automatic retries.
4. query_media - Send all map output to Claude for intelligent analysis and Q&A. Supports arbitrary natural language queries about video content.
5. generate_clip - Extract video clips around specific moments.

The processing pipeline service (services/processing-pipeline.ts) orchestrates phases 2-4 automatically with retries, resumability, and cancellation support.
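The orchestration behaviour described above (sequential stages, retries, resumability, cancellation between stages) might look roughly like the following. This is a sketch under assumptions, not the contents of services/processing-pipeline.ts: the Stage and PipelineState types, the runPipeline function, and the default retry count are all hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch; not the actual processing-pipeline.ts implementation.

type StageStatus = "pending" | "completed" | "failed";

interface Stage {
  name: string;
  run: () => Promise<void>;
}

interface PipelineState {
  status: Record<string, StageStatus>;
  cancelled: boolean;
}

async function runPipeline(
  stages: Stage[],
  state: PipelineState,
  maxRetries: number = 3, // assumed default: one attempt plus three retries
): Promise<"completed" | "cancelled" | "failed"> {
  for (const stage of stages) {
    // Cancellation is only checked between stages.
    if (state.cancelled) return "cancelled";
    // Resumability: stages that already completed are skipped.
    if (state.status[stage.name] === "completed") continue;

    let succeeded = false;
    for (let attempt = 0; attempt <= maxRetries && !succeeded; attempt++) {
      try {
        await stage.run();
        succeeded = true;
      } catch {
        // Ignore the error and retry until attempts are exhausted.
      }
    }
    state.status[stage.name] = succeeded ? "completed" : "failed";
    if (!succeeded) return "failed";
  }
  return "completed";
}
```

Because completed stages are recorded in the state object, calling runPipeline again with the same state after a failure only re-runs the stages that have not finished.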
Register a media file for processing. Accepts an absolute file path, validates the file exists, detects MIME type, extracts duration (for video/audio via ffprobe), and registers the asset with content-hash deduplication.
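Content-hash deduplication can be pictured as a registry keyed by a digest of the file bytes. The sketch below is illustrative only: AssetRegistry, its register method, and the asset ID format are assumptions, and the FNV-1a hash is a dependency-free stand-in for whatever digest the tool actually uses (a real implementation would more likely use a cryptographic hash such as SHA-256).

```typescript
// Hypothetical sketch of content-hash deduplication; names are illustrative.

interface Asset {
  id: string;
  path: string;
  contentHash: string;
}

// FNV-1a (32-bit) keeps the example self-contained.
// It is NOT collision-resistant and should not be used for real deduplication.
function fnv1a(bytes: Uint8Array): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

class AssetRegistry {
  private byHash = new Map<string, Asset>();
  private nextId = 1;

  // Returns the existing asset when identical bytes were registered before.
  register(path: string, bytes: Uint8Array): { asset: Asset; deduplicated: boolean } {
    const contentHash = fnv1a(bytes);
    const existing = this.byHash.get(contentHash);
    if (existing) return { asset: existing, deduplicated: true };
    const asset: Asset = { id: "asset-" + this.nextId++, path, contentHash };
    this.byHash.set(contentHash, asset);
    return { asset, deduplicated: false };
  }
}
```

Registering the same bytes under a second path returns the first asset rather than creating a new one, so a copy of a file never triggers a second processing run.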
Query the processing status of a media asset. Returns the asset metadata along with per-stage progress details. Use this to monitor pipeline progress.
Preprocess a video asset: detect dead time via mpdecimate, segment the video into windows, extract downscaled keyframes at regular intervals, build a subject registry, and write a pipeline manifest.
Parameters:
- asset_id (required) - ID of the media asset.
- interval_seconds - Interval between keyframes (default: 1s). Use 0.5s for sports/action content where frame density matters.
- segment_duration - Duration of each segment window (default: 15s).
- dead_time_threshold - Sensitivity for dead-time detection (default: 0.02).
- section_config - Path to a JSON file with manual section boundaries.
- detect_dead_time - Whether to detect and skip dead time (default: false). Dead-time detection can be too aggressive for continuous action video like sports - it may incorrectly skip live play. Enable only for content with clear idle periods (e.g., lectures, surveillance footage).
- short_edge - Short edge resolution for downscaled frames in pixels (default: 480).
- include_audio - Whether to extract and transcribe audio for each segment (default: false). When enabled, each segment's audio is transcribed using the configured STT service and stored alongside visual frames.

Map video segments through Gemini's structured output API. Supports two modes:
- keyframes (default) - Reads frames from the preprocess manifest, sends each segment's images to Gemini. Requires extract_keyframes to be run first. Best for longer videos (> 1 hour) or when you need fine-grained control over frame selection (interval, segment duration, dead-time skipping).
- direct_video - Uploads the video file directly to Gemini's Files API. Gemini sees actual motion and temporal context instead of static frames. Best for shorter videos (< 1 hour) where temporal context matters (detecting actions, transitions, motion patterns). Has a 2 GB file size limit. Does not require extract_keyframes preprocessing.

Both modes produce the same MapOutput format, so query_media works identically regardless of which mode was used.
Parameters:
- asset_id (required) - ID of the media asset.
- system_prompt (required) - Extraction instructions for Gemini.
- output_schema (required) - JSON Schema for structured output.
- mode - Analysis mode: 'keyframes' (default) or 'direct_video'.
- context - Additional context to include in the prompt.
- model - Gemini model to use (default: gemini-2.5-flash).
- concurrency - Maximum concurrent API requests (default: 10, keyframes mode only).
- max_retries - Retry attempts per segment on failure (default: 3).

Query video analysis data using natural language. Sends map output (from analyze_keyframes) to Claude for intelligent analysis and Q&A. Supports arbitrary questions about video content.
Parameters:
- asset_id (required) - ID of the media asset.
- query (required) - Natural language query about the video data.
- system_prompt - Optional system prompt for Claude.
- model - LLM model to use (default: claude-sonnet-4-6).

Extract a video clip from a media asset using ffmpeg. Applies configurable pre/post-roll padding (clamped to file boundaries), outputs the clip as a temporary file.
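The pre/post-roll clamping mentioned for generate_clip comes down to a little arithmetic. The helper below is a minimal sketch of that idea; paddedClipRange and its parameter names are hypothetical and are not the tool's actual parameters.

```typescript
// Hypothetical helper; parameter names are illustrative, not the tool's API.

interface ClipRange {
  start: number; // seconds from the beginning of the file
  end: number;   // seconds from the beginning of the file
}

function paddedClipRange(
  momentStart: number,
  momentEnd: number,
  fileDuration: number,
  preRoll: number,
  postRoll: number,
): ClipRange {
  if (momentEnd < momentStart) {
    throw new RangeError("momentEnd must not precede momentStart");
  }
  // Clamp both ends to [0, fileDuration] so padding never runs past the file.
  const start = Math.min(Math.max(0, momentStart - preRoll), fileDuration);
  const end = Math.min(fileDuration, Math.max(start, momentEnd + postRoll));
  return { start, end };
}
```

For a 12-second file, a moment spanning 2s to 10s with 5 seconds of padding on each side clamps to the whole file (0s to 12s) instead of requesting a negative start or an end beyond the last frame.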
Orchestrates the full processing pipeline with reliability features:
cancelled and the pipeline stops between stages.

Handles dead-time detection, video segmentation, keyframe extraction, and subject registry building. Writes a pipeline manifest consumed by the Map phase.
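To make the relationship between segment_duration and interval_seconds concrete, here is a small sketch that plans segment windows and keyframe timestamps for a given video length. It assumes keyframes are sampled per segment starting at each window's start; planSegments and the Segment shape are hypothetical, and the real preprocessor (which also handles dead time and section configs) may behave differently.

```typescript
// Hypothetical sketch of windowing and keyframe timestamps.

interface Segment {
  index: number;
  start: number;        // seconds, inclusive
  end: number;          // seconds, exclusive
  frameTimes: number[]; // timestamps of keyframes inside the window
}

function planSegments(
  durationSeconds: number,
  segmentDuration: number = 15, // documented default for segment_duration
  intervalSeconds: number = 1,  // documented default for interval_seconds
): Segment[] {
  if (segmentDuration <= 0 || intervalSeconds <= 0) {
    throw new RangeError("segmentDuration and intervalSeconds must be positive");
  }
  const segments: Segment[] = [];
  for (let start = 0, index = 0; start < durationSeconds; start += segmentDuration, index++) {
    const end = Math.min(start + segmentDuration, durationSeconds);
    // Count steps instead of accumulating floats to avoid drift at 0.5s intervals.
    const frameCount = Math.ceil((end - start) / intervalSeconds);
    const frameTimes: number[] = [];
    for (let i = 0; i < frameCount; i++) {
      frameTimes.push(start + i * intervalSeconds);
    }
    segments.push({ index, start, end, frameTimes });
  }
  return segments;
}
```

Under these assumptions a 40-second video with the defaults yields three segments and 40 frames in total; halving the interval to 0.5s doubles that to 80 frames, which is why the interval has a direct effect on Map-phase cost.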
Sends video segments to Gemini 2.5 Flash with structured output schemas. Handles concurrency pooling, cost tracking, resumability, and retries.
Sends Map output to Claude as text for analysis. Two modes:
Limits concurrent API calls during the Map phase to avoid rate limiting.
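A concurrency pool of this kind can be sketched as a fixed number of runners pulling work from a shared queue. The function below is a generic illustration, not the skill's implementation; mapWithConcurrency and its signature are assumptions.

```typescript
// Minimal promise pool; a sketch, not the skill's implementation.

async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the queue is drained.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

Results are written by index, so the output order matches the input order even when segments finish out of order. A rejected worker rejects the whole call, which is where per-segment retries would be layered on.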
Tracks estimated API costs during pipeline execution.
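As a rough illustration of cost tracking, the sketch below accumulates token counts and converts them to an estimate using caller-supplied rates. CostTracker, TokenRates, and the method names are hypothetical, and no real provider prices are assumed; pass in whatever rates apply to the model in use.

```typescript
// Hypothetical cost tracker; rates are caller-supplied, not real prices.

interface TokenRates {
  inputPerMillion: number;  // cost per 1M input tokens
  outputPerMillion: number; // cost per 1M output tokens
}

class CostTracker {
  private rates: TokenRates;
  private inputTokens = 0;
  private outputTokens = 0;

  constructor(rates: TokenRates) {
    this.rates = rates;
  }

  // Call once per API response with the token usage it reports.
  record(inputTokens: number, outputTokens: number): void {
    this.inputTokens += inputTokens;
    this.outputTokens += outputTokens;
  }

  estimatedCost(): number {
    return (
      (this.inputTokens / 1e6) * this.rates.inputPerMillion +
      (this.outputTokens / 1e6) * this.rates.outputPerMillion
    );
  }
}
```

Keeping the running totals as token counts rather than currency means the estimate can be recomputed if the rates change mid-run.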
When include_audio is enabled on extract_keyframes, the pipeline transcribes each segment's audio track using the configured STT service and attaches the transcript to the segment data. During the Map phase (analyze_keyframes), Gemini receives both the visual frames and the audio transcript for each segment, enabling multimodal analysis that combines what is seen with what is said.
This is useful for:
Audio transcription uses the STT service configured in assistant settings. If no STT service is configured or transcription fails for a segment (no audio track, service errors), the segment gracefully degrades to visual-only analysis.
The single most important insight: always use a broad, descriptive map prompt instead of a targeted one.
A targeted prompt like "find turnovers" locks you into one topic. If the user later wants to ask about defense, formations, or specific players, you'd need to reprocess the entire video. Instead, run a general-purpose descriptive prompt that captures everything visible, creating a rich, reusable dataset. Then all follow-up questions can be handled via query_media with no reprocessing.
One map run, many queries.
The map output will be larger (more tokens per segment), but Gemini Flash is cheap enough that this is a good tradeoff. Only use a targeted prompt if the user explicitly asks for something narrow.
Use this as a starting point for the system_prompt parameter in analyze_keyframes:
You are analyzing keyframes from a video. For each segment, describe everything you can observe:
- People visible: count, positions, identifying features (jersey numbers, clothing, names if visible)
- Actions and movements: what people are doing, direction of movement, interactions
- Objects of interest: ball location, equipment, vehicles, on-screen graphics
- Environment: setting, lighting, weather if outdoors
- Text on screen: scores, captions, titles, signs, timestamps
- Scene composition: camera angle, zoom level, any transitions between shots
- Any stoppages, pauses, or changes in activity
Be specific and factual. Describe what you see, not what you infer happened between frames.
{
  "type": "object",
  "properties": {
    "scene_description": { "type": "string" },
    "people": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "position": { "type": "string" },
          "action": { "type": "string" }
        }
      }
    },
    "objects_of_interest": { "type": "array", "items": { "type": "string" } },
    "on_screen_text": { "type": "array", "items": { "type": "string" } },
    "camera": { "type": "string" },
    "notable_events": { "type": "array", "items": { "type": "string" } }
  }
}
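When consuming map output in code, it can help to mirror this schema as a type and check each segment's payload before relying on it. The sketch below is one way to do that; SegmentObservation, PersonObservation, and isSegmentObservation are hypothetical names, and every field is treated as optional because the schema above declares no required properties.

```typescript
// TypeScript view of the schema above plus a structural check (hypothetical names).

interface PersonObservation {
  description?: string;
  position?: string;
  action?: string;
}

interface SegmentObservation {
  scene_description?: string;
  people?: PersonObservation[];
  objects_of_interest?: string[];
  on_screen_text?: string[];
  camera?: string;
  notable_events?: string[];
}

const optionalString = (v: unknown): boolean =>
  v === undefined || typeof v === "string";

const optionalStringArray = (v: unknown): boolean =>
  v === undefined || (Array.isArray(v) && v.every((x) => typeof x === "string"));

function isPerson(v: unknown): v is PersonObservation {
  if (typeof v !== "object" || v === null || Array.isArray(v)) return false;
  const p = v as Record<string, unknown>;
  return (
    optionalString(p["description"]) &&
    optionalString(p["position"]) &&
    optionalString(p["action"])
  );
}

function isSegmentObservation(v: unknown): v is SegmentObservation {
  if (typeof v !== "object" || v === null || Array.isArray(v)) return false;
  const s = v as Record<string, unknown>;
  const people = s["people"];
  return (
    optionalString(s["scene_description"]) &&
    (people === undefined || (Array.isArray(people) && people.every((p) => isPerson(p)))) &&
    optionalStringArray(s["objects_of_interest"]) &&
    optionalStringArray(s["on_screen_text"]) &&
    optionalString(s["camera"]) &&
    optionalStringArray(s["notable_events"])
  );
}
```

The guard only checks the shape. It says nothing about whether the described events actually occurred in the video.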
The generate_clip tool automatically opens clips in the user's default video player after extraction (handled internally - do not run open via host_bash). Clips are saved persistently in the asset's pipeline directory (pipeline/<assetId>/clips/), falling back to a temp directory when the source location is read-only. Each clip gets a unique filename so concurrent or repeated extractions at the same range never collide. The clipPath field in the tool response contains the absolute file path.
The tool handles high-bitrate and incompatible codec sources automatically - it tries stream copy first for speed, then falls back to H.264 re-encoding if needed. Always use generate_clip rather than manual ffmpeg commands.
Always provide a descriptive title parameter (e.g. "snow-dive-closeup", "goal-celebration") so clips get meaningful filenames instead of timestamp-based names.
Gemini performs well at spatial/descriptive analysis from static keyframes:
Gemini hallucinates when asked to detect fast temporal events from static frames (keyframes mode), regardless of frame density:
The model is good at describing what is there but bad at detecting what happened from static frames. For content where temporal context matters, consider using mode: 'direct_video' which lets Gemini see actual motion. For keyframes mode, structure your map prompts and queries accordingly - ask the model to describe scenes, then use query_media (Claude) to reason about patterns and events across the descriptive data.
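The choice between the two modes can be summarised as a small decision function. This is a sketch of the guidance in this document rather than logic the tool applies itself: chooseMode and its parameters are hypothetical, and the byte limit assumes "2 GB" means 2 GiB.

```typescript
// Hypothetical decision helper based on the guidance in this document.

type AnalysisMode = "keyframes" | "direct_video";

const ONE_HOUR_SECONDS = 3600;
// Assumption: the documented 2 GB limit is interpreted as 2 GiB.
const DIRECT_VIDEO_MAX_BYTES = 2 * 1024 * 1024 * 1024;

function chooseMode(
  durationSeconds: number,
  fileSizeBytes: number,
  needsTemporalContext: boolean,
): AnalysisMode {
  const fitsDirectVideo =
    durationSeconds < ONE_HOUR_SECONDS && fileSizeBytes <= DIRECT_VIDEO_MAX_BYTES;
  // Prefer direct_video only when motion matters and the file qualifies;
  // otherwise fall back to the default keyframes mode.
  return needsTemporalContext && fitsDirectVideo ? "direct_video" : "keyframes";
}
```

A 20-minute, 500 MB match highlight where the user asks about fast actions would resolve to direct_video, while a three-hour lecture stays on keyframes regardless of the question.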
Use media_status to check the current state of any asset:
The response includes per-stage progress (0-100%) so you can see exactly where processing stands.
Use media_status to check processing stages:
1. Check the stages array for any stage with status: "failed".
2. Read the lastError field for that stage to understand what went wrong.
3. Check durationMs to see if a stage timed out or ran unusually long.

After fixing the root cause, re-run the failed stage. The pipeline is resumable - it picks up from where it left off.
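The debugging checks above can be applied mechanically to a status response. The sketch below assumes a response shape: only the stages array and the status, lastError, and durationMs fields are named in this document, while the name field, the MediaStatus type, and the slow-stage threshold are hypothetical.

```typescript
// Hypothetical response shape; only stages, status, lastError, and
// durationMs are named in the documentation.

interface StageInfo {
  name: string;        // assumed field identifying the stage
  status: string;      // e.g. "failed"
  lastError?: string;
  durationMs?: number;
}

interface MediaStatus {
  stages: StageInfo[];
}

interface StageProblem {
  stage: string;
  reason: string;
}

function diagnose(status: MediaStatus, slowThresholdMs: number): StageProblem[] {
  const problems: StageProblem[] = [];
  for (const stage of status.stages) {
    if (stage.status === "failed") {
      // Surface the recorded error, if any, for the failed stage.
      problems.push({
        stage: stage.name,
        reason: stage.lastError ?? "failed without a recorded lastError",
      });
    } else if (stage.durationMs !== undefined && stage.durationMs > slowThresholdMs) {
      // Flag stages that ran longer than the caller considers normal.
      problems.push({ stage: stage.name, reason: "ran for " + stage.durationMs + " ms" });
    }
  }
  return problems;
}
```

An empty result means no stage has failed or exceeded the threshold, so a pipeline that still appears stuck is more likely a stage that crashed without updating its status.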
The Map phase (Gemini) is the primary cost driver - it scales with video duration and keyframe interval. The Q&A phase (Claude) is negligible per query.
Each ingest_media call processes one file. Batch ingestion is not yet supported.

| Symptom | Likely Cause | Fix |
|---|---|---|
| "No keyframes found" | extract_keyframes not run or failed | Check preprocess stage status; re-run if needed |
| "No map output found" | analyze_keyframes not run | Run analyze_keyframes with appropriate system_prompt and output_schema |
| "No LLM provider available" | API key not configured | Add one in Settings |
| Map phase slow | Large video, small interval | Increase interval_seconds or reduce concurrency |
| Gemini returns errors | Rate limits or schema issues | Check max_retries setting; simplify output_schema if needed |
| Pipeline stuck at "processing" | Stage crashed without updating status | Use media_status to check stage progress; re-run manually |
- The ingest_media tool requires an absolute path to a local file.
- The analyze_keyframes tool is marked as medium risk because it makes external API calls to Gemini, which incur costs.