transcribe-audio
// Transcribes video audio using WhisperX, preserving original timestamps. Creates JSON transcript with word-level timing. Use when you need to generate audio transcripts for videos.
Skill for processing footage (video clips, sounds, photos, etc). Use this when creating a new library, adding new footage (videos) to an existing library, or resuming processing on an existing library.
Build a cut from a library — scene, selects, roughcut, or custom task. Starts by asking what kind of cut the user wants, then works with them to determine what they want to create. Always exports a file for Final Cut, Premiere, or Resolve at the end. Use when the user asks for a "roughcut", "sequence", "scene", "selects", or any other cut-shaped output.
Full footage analysis pipeline — audio transcripts, contact sheets, and Sonnet-written summaries. Produces every artifact the cut skill reads. Orchestrated from the main thread.
Backs up user libraries and all their contents (external video excluded). This skill can also be useful when you need to restore a library.
Builds a contact sheet from a video clip — evenly spaced frames laid out in a single grid image, each with its hh:mm:ss timestamp burned in. Use when the user asks for a "contact sheet", "grid", "film strip", or wants a one-image overview of part of a clip.
Exports all dialogue from every clip in a library into a single text file. One clip per block — filename, then its spoken words. Use when the user asks for a "full transcript", "full script", or wants all the dialogue from a library in one place.
| name | transcribe-audio |
| description | Transcribes video audio using WhisperX, preserving original timestamps. Creates JSON transcript with word-level timing. Use when you need to generate audio transcripts for videos. |
Transcribes video audio using WhisperX and produces a clean JSON transcript with word-level timing.
SKILL.md is the parent's dispatch brief. The sub-agent's working prompt lives in agent_prompt.md — inline its contents when launching the Task agent. Don't pass SKILL.md.
Launch at most 2 in parallel. WhisperX is already multithreaded internally (~4 CPU threads via CTranslate2); 2 processes is the throughput-vs-RAM sweet spot on a 16GB Mac.
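The two-at-a-time cap can be sketched with a bounded worker pool. This is a minimal illustration, not the dispatcher this skill actually uses: `run_capped` and `FakeAgent` are hypothetical names, and `FakeAgent` only stands in for launching a real Task agent so the concurrency limit can be observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 2  # the cap stated in the brief


def run_capped(video_paths, transcribe_one, max_parallel=MAX_PARALLEL):
    """Call transcribe_one for each path, never more than max_parallel at once.

    Results are returned in the same order as video_paths.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(transcribe_one, video_paths))


class FakeAgent:
    """Hypothetical stand-in for a Task agent; records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, video_path):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.05)  # pretend to transcribe
        with self._lock:
            self._active -= 1
        # Assumed naming: same stem as the video, with a .json extension.
        return video_path.rsplit(".", 1)[0] + ".json"
```

Calling `run_capped(paths, FakeAgent())` on more than two paths queues the extras until a slot frees up, so a third WhisperX process never starts while two are still running.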
The parent reads library.yaml and settings.yaml and passes these values inline in each agent's prompt:
- video_path: absolute path to the video file
- transcript_output_dir: where to write the transcript JSON (e.g. libraries/<library>/transcripts)
- language_code: ISO 639-1 code (e.g. en, es); the parent maps it from library.yaml's language name
- whisper_model: model size from settings.yaml (e.g. small, medium, turbo)
- transcript_refinement: boolean from library.yaml. If true, also pass:
  - user_context (may be empty string)
  - footage_summary (may be empty string)

After the agent returns, update library.yaml with transcript: <filename>.json.
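As a rough sketch, the values listed above could be assembled from already-parsed config like this. Everything beyond the parameter names is an assumption: the function name `build_agent_inputs`, the `language`, `user_context`, and `footage_summary` keys in library.yaml, the `whisper_model` key in settings.yaml, and the two-entry language table are all hypothetical, and YAML loading is left out so that the example needs only the standard library.

```python
import os

# Partial, illustrative mapping from language name to ISO 639-1 code.
LANGUAGE_CODES = {"english": "en", "spanish": "es"}


def build_agent_inputs(video_path, library, settings, library_dir):
    """Collect the values the parent passes inline in one agent's prompt.

    library and settings are dicts parsed from library.yaml and
    settings.yaml; the key names used here are assumed, not confirmed.
    """
    language_name = library["language"].strip().lower()
    if language_name not in LANGUAGE_CODES:
        raise ValueError(f"no ISO 639-1 code known for {language_name!r}")

    inputs = {
        "video_path": os.path.abspath(video_path),
        "transcript_output_dir": os.path.join(library_dir, "transcripts"),
        "language_code": LANGUAGE_CODES[language_name],
        "whisper_model": settings["whisper_model"],
        "transcript_refinement": bool(library.get("transcript_refinement", False)),
    }

    # The two context fields are only passed when refinement is enabled,
    # and each falls back to an empty string.
    if inputs["transcript_refinement"]:
        inputs["user_context"] = library.get("user_context") or ""
        inputs["footage_summary"] = library.get("footage_summary") or ""
    return inputs
```

Raising on an unknown language name, rather than defaulting to en, makes a mapping gap visible before a transcription run is spent on the wrong language.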
Once all videos have audio transcripts, dispatch analyze-video for visual descriptions.
WhisperX must be installed. Use the setup skill to verify.