transcribe
| name | transcribe |
| description | Produce a word-level timestamped transcript from a video using OpenAI whisper-1. |
| when-to-use | When you need to make cut decisions on dialogue and don't already have transcript.json in the run dir. |
Word-level timing is what makes clean cuts possible — every cut boundary
should land between words, never mid-word. gpt-4o-transcribe produces
better text but does not support timestamp_granularities, so for
editing we always use whisper-1 with verbose_json.
If runs/<take>/transcript.json already exists, do not re-transcribe.
The harness pre-bakes it before handing off, and re-running costs about $0.006/min.
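The skip rule above can be sketched as a small guard. This is a minimal illustration, not part of the harness; the function name `needs_transcription` and the run directory layout are assumptions for the example.

```python
from pathlib import Path


def needs_transcription(run_dir: str) -> bool:
    """Return True only when the run directory has no transcript.json yet.

    `run_dir` is a hypothetical path such as "runs/<take>".
    """
    return not (Path(run_dir) / "transcript.json").is_file()
```

A caller would run the ffmpeg and API steps only when `needs_transcription(run_dir)` is true, and reuse the existing file otherwise.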
Extract audio with ffmpeg (mono, 16 kHz, 32 kbps mp3). At that bitrate roughly 100 minutes of audio fits under the 25 MB upload cap, and the quality is plenty for whisper:
ffmpeg -y -i raw.mp4 -vn -ac 1 -ar 16000 -b:a 32k audio.mp3
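The size budget behind those settings is simple arithmetic, sketched here so it can be rechecked for other bitrates. The helper names are hypothetical, the cap is assumed to mean 25 MiB, and container overhead is ignored, so treat the result as an upper bound.

```python
import os

# Assumption: the "25 MB" cap is read as 25 * 1024 * 1024 bytes.
UPLOAD_CAP_BYTES = 25 * 1024 * 1024


def max_duration_seconds(bitrate_kbps: float, cap_bytes: int = UPLOAD_CAP_BYTES) -> float:
    """Longest audio (in seconds) that fits under the cap at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return cap_bytes / bytes_per_second


def fits_upload_cap(path: str, cap_bytes: int = UPLOAD_CAP_BYTES) -> bool:
    """Check the extracted audio file on disk against the cap."""
    return os.path.getsize(path) <= cap_bytes
```

For example, `max_duration_seconds(32)` gives about 6553 seconds, a little under 110 minutes. Calling `fits_upload_cap("audio.mp3")` before uploading catches longer recordings, which would need a lower bitrate or splitting.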
Call the API (OpenAI key is in the env):
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header 'Content-Type: multipart/form-data' \
--form file=@audio.mp3 \
--form model=whisper-1 \
--form response_format=verbose_json \
--form 'timestamp_granularities[]=word' \
--form 'timestamp_granularities[]=segment' \
> transcript.json
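Because the command redirects whatever the API returns into transcript.json, a failed request would leave an error body in that file. A minimal check, assuming the usual OpenAI error shape of a top-level `error` object, could look like this; `load_transcript` is a hypothetical helper name.

```python
import json


def load_transcript(path: str) -> dict:
    """Load transcript.json and fail loudly if it is not a usable transcript."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("transcript.json is not a JSON object")
    if "error" in data:
        # Assumption: failed requests return {"error": {...}} as the body.
        raise ValueError(f"API returned an error: {data['error']}")
    if not data.get("words"):
        raise ValueError(
            "no words[] array; check that timestamp_granularities[]=word was sent"
        )
    return data
```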
transcript.json contains:
- text: the full transcript as a single string.
- segments[]: phrase-level groupings, {id, start, end, text, …}. Skim these first to map the territory.
- words[]: every word with {word, start, end} in seconds. Use these to lock exact cut times.

When picking cut times:
- Take start / end directly from the words array. Do not interpolate; the next word's start is the right boundary if you want to preserve a trailing breath.
- If the spoken language is known, pass --form language=<iso639> to bias the model.
- If a name or term comes out misspelled, pass --form prompt="..." containing the spelling. Whisper only reads the last 224 tokens of the prompt, so keep it short.
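The workflow of skimming segments and then locking cut times to words can be sketched as follows. This is an illustration under assumptions, not the harness's method: `outline` and `snap_cut` are hypothetical names, and the rough range is assumed to come from an editor's approximate in and out points in seconds.

```python
def outline(transcript: dict) -> list[str]:
    """One line per segment, for a quick skim of the territory."""
    return [
        f"{s['start']:.2f}-{s['end']:.2f} {s['text'].strip()}"
        for s in transcript.get("segments", [])
    ]


def snap_cut(words: list[dict], rough_start: float, rough_end: float,
             keep_breath: bool = False) -> tuple[float, float]:
    """Snap a rough time range to word boundaries so no word is split.

    Every word overlapping the rough range is kept whole. With keep_breath,
    the cut ends at the next word's start rather than the last word's end.
    """
    inside = [
        i for i, w in enumerate(words)
        if w["end"] > rough_start and w["start"] < rough_end
    ]
    if not inside:
        raise ValueError("no words overlap the requested range")
    first, last = inside[0], inside[-1]
    cut_start = words[first]["start"]
    cut_end = words[last]["end"]
    if keep_breath and last + 1 < len(words):
        cut_end = words[last + 1]["start"]
    return cut_start, cut_end
```

Both values come straight from the words array, so nothing is interpolated. If the selected word is the last one in the transcript, `keep_breath` has no effect and the cut ends at that word's end.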