| name | ppt-audio-to-video |
| description | Convert narration audio plus slide decks into a narrated video. Use when the user has an audio-only `mp4/m4a/mp3/wav` and a `ppt/pptx/pdf` deck, and needs slide images, transcript extraction, slide timing planning, or final `mp4` rendering with `whisper-cpp` and `ffmpeg`. |
PPT Audio To Video
Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.
Resolve bundled scripts relative to this skill directory. If the runtime has already opened this `SKILL.md`, prefer paths like `scripts/extract_slide_outline.py` and `scripts/render_from_timing_csv.py` instead of machine-specific absolute paths.
Core workflow
- Inventory inputs.
  - Confirm which of these exist: audio-only `mp4/m4a/mp3/wav`, `ppt/pptx`, `pdf`, and any pre-rendered slide images.
  - Prefer an existing `pdf` or image directory for rendering. Treat `pptx` as the source of slide text and as a fallback for export.
- Prepare tools.
  - Required for deterministic steps: `ffmpeg`, `ffprobe`, `pdftoppm`.
  - Required for transcription: `whisper-cli` from `whisper-cpp` plus a multilingual model such as `ggml-small.bin`.
  - If only `pptx` exists and no `pdf`/images exist, prefer Keynote or PowerPoint export on macOS. Use `soffice` only as a fallback because profile or rendering issues are common.
- Produce slide images.
- Extract slide text.
- Extract clean audio for ASR.
- Transcribe with `whisper-cli`.
- Build `slide_timings.csv`.
  - Do not average slide durations unless the user explicitly asks for it.
  - Read the transcript and slide outline together, then create a monotonic timing plan based on topic changes, section boundaries, and unique keywords.
  - Use this schema:

        slide,start_sec,end_sec,duration_sec,reason
        1,0.000,15.000,15.000,opening title and agenda
        2,15.000,100.000,85.000,architecture overview starts here

  - Keep slide numbers sequential and ensure `duration_sec = end_sec - start_sec`.
  - Validate that the last `end_sec` matches the audio duration or is within a small tolerance.
- Render the final video.
- Verify and iterate.
  - Check output duration with `ffprobe`.
  - If a slide cuts too early or too late, edit only the affected rows in `slide_timings.csv` and rerun the render script.
  - Keep the transcript, outline, and timing CSV as reproducible working files.
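
The timing-plan rules in the workflow (sequential slide numbers, `duration_sec = end_sec - start_sec`, monotonic times, last `end_sec` close to the audio duration) can be checked mechanically before rendering. The following is a minimal sketch, not the bundled script's logic: the function name `validate_timings`, the 0.5-second default tolerance, and the assumption that slide numbering starts at 1 are all illustrative choices.

```python
import csv
import io


def validate_timings(csv_text, audio_duration, tolerance=0.5):
    """Return a list of problems found in slide_timings.csv content.

    Assumptions (not from the skill itself): slides are numbered from 1,
    and `tolerance` is the allowed gap, in seconds, between the last
    end_sec and the audio duration.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["timing CSV has no rows"]

    problems = []
    prev_end = None
    for index, row in enumerate(rows, start=1):
        slide = int(row["slide"])
        start = float(row["start_sec"])
        end = float(row["end_sec"])
        duration = float(row["duration_sec"])

        if slide != index:
            problems.append(f"row {index}: expected slide {index}, got {slide}")
        if end <= start:
            problems.append(f"row {index}: end_sec must be greater than start_sec")
        if abs(duration - (end - start)) > 0.001:
            problems.append(f"row {index}: duration_sec != end_sec - start_sec")
        # Monotonic plan: a slide may not start before the previous one ends.
        if prev_end is not None and start < prev_end - 0.001:
            problems.append(f"row {index}: start_sec overlaps the previous slide")
        prev_end = end

    if abs(prev_end - audio_duration) > tolerance:
        problems.append("last end_sec does not match the audio duration")
    return problems
```

An empty list means the plan passed these checks. The sketch only rejects overlaps; whether gaps between slides should also be rejected depends on how the render step treats them.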
Heuristics for timing alignment
- Hold section-divider slides only briefly, usually 5-20 seconds.
- Use the first segment that clearly switches topic as the next slide start.
- Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
- Let the model infer timings, but keep the render step deterministic through `slide_timings.csv`.
- When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
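
As a rough aid for the "first segment that clearly switches topic" heuristic, a candidate slide start can be proposed by scanning transcript segments for keywords unique to the next slide. This is only a sketch: keyword matching is a crude stand-in for judging real topic transitions, and the function name `first_matching_segment` and the `(start_sec, text)` segment shape are assumptions made for illustration.

```python
def first_matching_segment(segments, keywords, after_sec=0.0):
    """Return the start time of the first segment that mentions a keyword.

    `segments` is an iterable of (start_sec, text) pairs in time order
    (a hypothetical shape; adapt it to the transcript format in use).
    Only segments starting at or after `after_sec` are considered.
    Returns None when nothing matches.
    """
    lowered = [keyword.lower() for keyword in keywords]
    for start, text in segments:
        if start < after_sec:
            continue
        haystack = text.lower()
        if any(keyword in haystack for keyword in lowered):
            return start
    return None
```

A `None` result, or a match that relies on a proper noun ASR may have distorted, is a reasonable signal to flag that boundary for user review rather than trust it.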
Common commands
Install dependencies on macOS if missing:
brew install ffmpeg poppler whisper-cpp
Typical multilingual model download:
mkdir -p .models
curl -L 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin' -o .models/ggml-small.bin
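
The per-step commands for slide images, ASR audio, transcription, and duration checks might be assembled as argument lists like the ones below. This is a sketch under stated assumptions: the helper names, the 150 DPI default, SRT as the transcript format, and all file names are illustrative and not taken from the bundled scripts. The flags themselves are standard `ffmpeg`, `ffprobe`, `pdftoppm`, and `whisper-cli` options.

```python
def asr_audio_cmd(src, wav_out):
    """ffmpeg: drop video, downmix to mono, resample to 16 kHz 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000",
            "-c:a", "pcm_s16le", wav_out]


def slide_images_cmd(pdf, out_prefix, dpi=150):
    """pdftoppm: render each PDF page to a PNG named after out_prefix."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf, out_prefix]


def transcribe_cmd(model, wav, out_prefix, language="auto"):
    """whisper-cli: transcribe wav and write <out_prefix>.srt with timestamps."""
    return ["whisper-cli", "-m", model, "-f", wav, "-l", language,
            "-osrt", "-of", out_prefix]


def duration_cmd(media):
    """ffprobe: print only the container duration in seconds."""
    return ["ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", media]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Building argument lists instead of shell strings avoids quoting problems with file names that contain spaces.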
Bundled scripts
- `scripts/extract_slide_outline.py`: Extract slide text from `pptx` into CSV or JSON for timing analysis.
- `scripts/render_from_timing_csv.py`: Validate a timing CSV, generate an ffconcat, and render the final video with `ffmpeg`.
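
To illustrate what generating an ffconcat from timing rows can look like, here is a minimal sketch. It is not the code in `scripts/render_from_timing_csv.py`: the function name `build_ffconcat` and the image naming scheme in the usage example are hypothetical. The last image is listed twice because the `ffmpeg` concat demuxer otherwise tends to ignore the final `duration` entry.

```python
def build_ffconcat(rows, image_for_slide):
    """Build ffconcat text from (slide_number, duration_sec) pairs.

    `image_for_slide` maps a slide number to an image path; it is a
    caller-supplied function, so no naming scheme is assumed here.
    """
    lines = ["ffconcat version 1.0"]
    last_image = None
    for slide, duration in rows:
        last_image = image_for_slide(slide)
        lines.append(f"file '{last_image}'")
        lines.append(f"duration {duration:.3f}")
    # Repeat the final image so its duration entry takes effect.
    if last_image is not None:
        lines.append(f"file '{last_image}'")
    return "\n".join(lines) + "\n"


# Usage with a hypothetical naming scheme (slides/slide-01.png, ...):
text = build_ffconcat([(1, 15.0), (2, 85.0)],
                      lambda n: f"slides/slide-{n:02d}.png")
```

The resulting text can be written to a file and passed to `ffmpeg -f concat -i <file>` together with the narration audio.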