| name | app-demo-agent |
| description | Use this skill to turn screen recordings into polished app demo videos with AI voiceover and music. Triggers: "app demo", "demo video from recording", "screen recording demo", "narrate this recording", "add voiceover to screen recording", "make a demo from this recording", "polish this screen recording", "app walkthrough video", "product demo video", "turn this recording into a demo", "screen recording with voiceover", "app showcase video", "demo with narration" Orchestrates: screen analysis, voiceover script, device framing, TTS, background music, and final assembly. NOTE: This skill requires an existing screen recording as input. It does NOT generate screen recordings.
|
App Demo Producer
Turn raw screen recordings into polished app demo videos with AI-generated voiceover, device frames, and background music.
This is an orchestrator skill that combines:
- Screen recording analysis (FFmpeg frame extraction + Claude vision)
- Voiceover script generation (Claude)
- Device framing (device-framer)
- Voice synthesis (Gemini TTS / OpenAI TTS / ElevenLabs)
- Background music (Lyria / Suno / Udio)
- Final assembly (FFmpeg via media-utils)
Workflow
Step 1: Gather Requirements (REQUIRED)
⚠️ DO NOT skip this step. DO NOT start processing until you have ALL answers.
Use interactive questioning — ask ONE question at a time, wait for the response, then ask the next.
Question Flow
⚠️ Use the AskUserQuestion tool for each question below.
Q1: Screen Recording
"I'll turn that into a polished demo! First — where's the screen recording?
(provide the file path, e.g., ~/Desktop/recording.mp4)"
Wait for response. Verify the file exists.
Q2: What does the app do?
"Give me a quick overview of what this app does and what the recording shows.
(e.g., 'It's a task management app. The recording shows creating a task, setting a due date, and marking it complete.')"
Wait for response. This context dramatically improves the voiceover script.
Q3: Device Frame
"Should I wrap it in a device frame?
- Yes — iPhone 16 Pro (default)
- Yes, specific device — (I'll show you options)
- No — Keep as-is"
Wait for response. If they want a specific device, run --list-devices and let them pick.
Q3b: Device Color (if device frame selected)
"What color for the [device name]?
(list the available colors for the selected device)"
Wait for response. Show the actual color options from the device registry.
Q3c: Background Color (if device frame selected)
"What background color for the video?
- Dark (
#0a0a0a) — cinematic, great for social media
- White (
#ffffff) — clean, good for presentations/websites
- Transparent — for compositing (note: output will be WebM, not MP4)
- Custom — specify a hex color"
Wait for response. If transparent, set output format to WebM with alpha. If custom, validate the hex color.
Q4: Voiceover
"How should we handle the voiceover?
- Generate — I'll analyze the recording and write a script that narrates what's on screen
- You provide — Give me the script text
- None — No voiceover"
Wait for response.
Q5: Voice Style (if voiceover enabled)
"What kind of voice?
Tone:
- Professional / polished
- Friendly / conversational
- Energetic / excited
- Calm / reassuring
- Or describe your own tone
Gender preference:
- Male
- Female
- No preference"
Wait for response.
Q6: Voice Selection (if voiceover enabled)
Based on the user's tone and gender preference, recommend a specific voice and let them confirm or pick another:
"Based on your preferences, I'd recommend [voice name] ([provider]). Here are some options:
| Voice | Provider | Tone |
|---|
| Kore | Gemini | Friendly, clear, female |
| Charon | Gemini | Professional, authoritative, male |
| Puck | Gemini | Upbeat, energetic, male |
| Aoede | Gemini | Breezy, warm, female |
| nova | OpenAI | Friendly, natural, female |
| onyx | OpenAI | Deep, professional, male |
Want to go with [recommendation], or pick a different one?"
Wait for response.
Q7: Background Music
"Want background music? If so, what vibe?
- Modern / minimal — clean, tech-forward
- Upbeat / positive — energetic, fun
- Corporate / professional — polished, confident
- Ambient / calm — soft, relaxed
- Custom — describe the vibe you want
- No music — voiceover only (or silent)"
Wait for response.
Q8: Output
"Where should I save the final video?
(default: same directory as the input, with _demo suffix)"
Wait for response.
Quick Reference
| Question | Determines |
|---|
| Screen Recording | Input file |
| App Overview | Context for voiceover script |
| Device Frame | Whether to use device-framer and which device |
| Device Color | Color variant for the selected device |
| Background Color | Background behind the device frame |
| Voiceover | Generate script vs user-provided vs none |
| Voice Style | Tone and gender preference |
| Voice Selection | Specific TTS voice and provider |
| Background Music | Whether to generate music and what style |
| Output | Where to save final video |
Step 2: Analyze the Screen Recording
Extract keyframes from the recording. Use scene detection to capture frames at actual screen transitions rather than fixed intervals — this gives you frames at the moments that matter.
python3 ${CLAUDE_PLUGIN_ROOT}/skills/app-demo-agent/scripts/extract_frames.py \
INPUT_VIDEO \
-o ~/demo_project/frames/ \
--scene-detect \
--timestamps
If scene detection doesn't find enough transitions (e.g., scrolling content), fall back to fixed intervals:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/app-demo-agent/scripts/extract_frames.py \
INPUT_VIDEO \
-o ~/demo_project/frames/ \
--interval 2 \
--max-frames 20 \
--timestamps
Options:
| Option | Default | Description |
|---|
--scene-detect | off | Detect actual screen transitions instead of fixed intervals |
--scene-threshold | 0.3 | Scene sensitivity 0.0-1.0, lower = more sensitive |
--interval | 2 | Seconds between frame captures (fixed interval mode) |
--max-frames | 20 | Maximum number of frames to extract |
-o, --output | ./frames/ | Output directory for frames |
--timestamps | off | Also output a timestamps.json mapping |
This creates numbered PNG screenshots: frame_001.png, frame_002.png, etc. The timestamps.json maps each frame to its exact timestamp in the video.
Now read each frame image using the Read tool to understand the screen flow. Build a mental timeline:
0s - App opens, home screen visible
2s - User taps "New Task" button
4s - Task creation form appears
6s - User types task name
8s - User sets due date
10s - User taps "Save"
12s - Task appears in list with checkmark
Use the user's app description (Q2) combined with what you see in the frames to understand the full flow.
Step 3: Write the Voiceover Script
⚠️ This is the most important step. The voiceover must match what's happening on screen at every moment. A generic script that doesn't align with the visuals will feel disconnected and unprofessional.
First, get the exact video duration:
ffprobe -v quiet -show_entries format=duration -of csv=p=0 INPUT_VIDEO
Then write a timed script that maps narration to the frame timeline you built in Step 2. Every line of the script should correspond to what's visible on screen at that moment.
Script Rules
- Content must match the screen — if the user is tapping a button at 4s, the narration at 4s should describe that action. Never narrate something that isn't visible.
- Fill the full video duration — the voiceover should naturally span the entire recording. Not too short (dead silence at the end), not too long (audio gets cut off). Aim for the TTS output to be within 2-3 seconds of the video length.
- Flow with transitions — when the screen transitions between views, use that moment for a brief pause or transitional phrase. Don't talk over a screen change.
- Be natural and conversational — not robotic marketing speak.
- Highlight key features — call out what makes the app special as those features appear on screen.
Write the script as a timed outline first
Map each line to the timestamp where it should be spoken:
[0-3s] "Meet TaskFlow — the simplest way to stay on top of your day."
[3-6s] "Tap the plus button to create a new task."
[6-10s] "Give it a name... set a due date... and you're done."
[10-14s] "Your tasks show up right on your home screen, organized by priority."
[14-17s] "One tap to mark it complete."
[17-20s] "That's it — no clutter, no complexity. TaskFlow."
Then convert to a single block for TTS
Remove the timestamps and join into flowing text. Use punctuation to control pacing:
Meet TaskFlow — the simplest way to stay on top of your day.
Tap the plus button to create a new task.
Give it a name... set a due date... and you're done.
Your tasks show up right on your home screen, organized by priority.
One tap to mark it complete.
That's it — no clutter, no complexity. TaskFlow.
Pacing guide
| Video Duration | Target Word Count | Pace |
|---|
| 15s | 30-40 words | Fast, punchy |
| 30s | 60-80 words | Standard |
| 45s | 90-120 words | Comfortable |
| 60s | 120-150 words | Relaxed |
| 90s+ | ~150 words/min | Natural conversational |
Tips:
- Short sentences work better for TTS pacing
- Periods create natural pauses (~0.5s)
- Ellipsis (...) creates longer pauses (~1s)
- Em dashes (—) create brief pauses (~0.3s)
- Front-load the hook — first sentence matters most
- End with the app name or a clear CTA
Present the timed script to the user for approval before generating audio. Show both the timed outline and the final text block. Ask if they want changes to wording, pacing, or emphasis.
Step 4: Frame the Video (Optional)
If the user wants a device frame, wrap the recording:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/device-framer/scripts/frame_video.py \
INPUT_VIDEO \
-o ~/demo_project/framed.mp4 \
-d iphone-16-pro --color natural-titanium \
--bg '#0a0a0a' --scale 0.5 --padding 100
Recommended combos for demos:
| Vibe | Device | Color | Background |
|---|
| Premium dark | iphone-16-pro | natural-titanium | #0a0a0a |
| Modern vibrant | iphone-17-pro | cosmic-orange | #1a1a2e |
| Clean light | iphone-air | cloud-white | #f0f4f8 |
| Bold | iphone-16 | ultramarine | #0a0a0a |
| Classic | iphone-16-pro-max | black-titanium | #111111 |
⚠️ Use --scale 0.5 for most demos — native resolution is very large.
The framed video becomes the input for the rest of the pipeline.
Step 5: Generate Voiceover Audio
Use one of the voice generation scripts:
Gemini TTS (recommended — free, high quality):
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
--text "Your voiceover script here..." \
--voice Orus \
--style "Clear, confident, app demo narration" \
-o ~/demo_project/voiceover.wav
OpenAI TTS:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/openai_tts.py \
--text "Your voiceover script here..." \
--voice nova \
-o ~/demo_project/voiceover.mp3
ElevenLabs:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/elevenlabs.py \
--text "Your voiceover script here..." \
--voice rachel \
-o ~/demo_project/voiceover.mp3
⚠️ You can also use --text-file instead of --text to read the script from a file:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
--text-file ~/demo_project/script.md \
--voice Kore \
--style "Friendly, clear, app demo narration" \
-o ~/demo_project/voiceover.wav
Duration Check (CRITICAL)
After generating the voiceover, immediately check if it matches the video duration:
VO_DUR=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 ~/demo_project/voiceover.wav)
VID_DUR=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 INPUT_VIDEO)
echo "Voiceover: ${VO_DUR}s, Video: ${VID_DUR}s"
If there's a mismatch greater than 3 seconds, fix it before proceeding:
| Situation | Fix |
|---|
| Voiceover too long | Shorten the script — remove filler words, tighten phrasing, cut a line. Regenerate. |
| Voiceover slightly long (< 5s over) | Extend the video with a freeze frame at the end: ffmpeg -i video.mp4 -vf "tpad=stop_mode=clone:stop_duration=5" -c:a copy extended.mp4 |
| Voiceover too short | Add more descriptive lines, slow the pace with pauses (ellipsis), or add a closing line. Regenerate. |
| Voiceover slightly short (< 3s under) | Acceptable — the video will have a brief silent outro which can feel natural. |
Always prefer adjusting the script over stretching the video. The script should be written to fit the video, not the other way around.
Voice Recommendations for Demos
| Tone | Gemini Voice | OpenAI Voice | ElevenLabs Voice |
|---|
| Default / Professional | Orus (recommended) | onyx | josh |
| Friendly | Kore | nova | rachel |
| Energetic | Puck | echo | domi |
| Calm | Aoede | shimmer | bella |
| Authoritative | Charon | alloy | adam |
Default voice: Orus (Gemini TTS) — firm, clear, professional. Works well for most app demos.
Step 6: Generate Background Music (Optional)
If the user wants music:
⚠️ Always match the video duration exactly. Get the video duration first, then pass it to the music generator:
DURATION=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 INPUT_VIDEO | cut -d. -f1)
python3 ${CLAUDE_PLUGIN_ROOT}/skills/music-generation/scripts/lyria.py \
--prompt "modern, minimal, tech product demo, clean, upbeat, positive" \
--duration $DURATION \
-o ~/demo_project/music.wav
Music prompts for demos:
| App Type | Music Prompt |
|---|
| Productivity | "minimal, modern, clean, focused, light electronic" |
| Social/Fun | "upbeat, playful, positive, acoustic, indie" |
| Finance/Business | "professional, confident, modern, ambient, corporate" |
| Health/Wellness | "calm, organic, warm, gentle, ambient" |
| Gaming | "energetic, electronic, dynamic, exciting, bass" |
| Creative tool | "inspiring, flowing, ambient, creative, modern" |
Step 7: Mix Audio
If you have both voiceover and music, mix them:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/media-utils/scripts/audio_mix.py \
--voice ~/demo_project/voiceover.wav \
--music ~/demo_project/music.wav \
-o ~/demo_project/final_audio.mp3 \
--music-volume 0.2 \
--fade-in 1.0 \
--fade-out 2.0
Recommended settings for demos:
--music-volume 0.15 to 0.25 — music should be subtle under narration
--fade-in 1.0 — gentle music intro
--fade-out 2.0 — clean ending
If voiceover only (no music), skip this step and use the voiceover file directly.
Step 8: Merge Audio with Video
Strip existing audio from the video (if any) and merge the new audio:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/media-utils/scripts/video_strip_audio.py \
-i ~/demo_project/framed.mp4 \
-o ~/demo_project/silent_video.mp4
python3 ${CLAUDE_PLUGIN_ROOT}/skills/media-utils/scripts/video_audio_merge.py \
--video ~/demo_project/silent_video.mp4 \
--audio ~/demo_project/final_audio.mp3 \
-o ~/demo_project/output/app_demo_final.mp4
If the video has no audio to strip, merge directly:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/media-utils/scripts/video_audio_merge.py \
--video ~/demo_project/framed.mp4 \
--audio ~/demo_project/final_audio.mp3 \
-o ~/demo_project/output/app_demo_final.mp4
Step 9: Deliver
Present the final video to the user:
- Tell them where the output file is
- Mention the duration and file size
- Offer adjustments:
- "Want me to change the voiceover script?"
- "Different voice or music style?"
- "Different device frame or background color?"
- "Adjust the music volume?"
Project Structure
demo_project/
├── frames/ # Extracted keyframes for analysis
│ ├── frame_001.png
│ ├── frame_002.png
│ └── timestamps.json # Frame timestamp mapping
├── framed.mp4 # Device-framed video (if applicable)
├── voiceover.wav # Generated voiceover
├── music.wav # Generated background music
├── final_audio.mp3 # Mixed voiceover + music
├── silent_video.mp4 # Video with audio stripped
├── script.md # Voiceover script (for reference)
└── output/
└── app_demo_final.mp4 # Final deliverable
Full Pipeline (Quick Reference)
For quick reference, here's the full pipeline in order:
SKILL="${CLAUDE_PLUGIN_ROOT}/skills"
PROJECT=~/demo_project
INPUT=recording.mp4
python3 $SKILL/app-demo-agent/scripts/extract_frames.py $INPUT -o $PROJECT/frames/ --timestamps
python3 $SKILL/device-framer/scripts/frame_video.py $INPUT -o $PROJECT/framed.mp4 \
-d iphone-16-pro --color natural-titanium --bg '#0a0a0a' --scale 0.5
python3 $SKILL/voice-generation/scripts/gemini_tts.py \
--text-file $PROJECT/script.md --voice Orus \
--style "Clear, confident, app demo narration" -o $PROJECT/voiceover.wav
python3 $SKILL/music-generation/scripts/lyria.py \
--prompt "minimal, modern, tech demo" --duration 30 -o $PROJECT/music.wav
python3 $SKILL/media-utils/scripts/audio_mix.py \
--voice $PROJECT/voiceover.wav --music $PROJECT/music.wav \
-o $PROJECT/final_audio.mp3 --music-volume 0.2
python3 $SKILL/media-utils/scripts/video_strip_audio.py -i $PROJECT/framed.mp4 -o $PROJECT/silent.mp4
python3 $SKILL/media-utils/scripts/video_audio_merge.py \
--video $PROJECT/silent.mp4 --audio $PROJECT/final_audio.mp3 \
-o $PROJECT/output/app_demo_final.mp4
What You Can Create
| Input | Output |
|---|
| Raw screen recording | Polished demo with voiceover |
| App walkthrough capture | Narrated product tour |
| Feature demo recording | Marketing-ready feature highlight |
| Tutorial screen capture | Professional tutorial with narration |
| Bug reproduction video | Annotated bug report video |
| Prototype recording | Investor demo with voiceover |
Prerequisites
- FFmpeg —
brew install ffmpeg (required for all media processing)
- Pillow —
pip install Pillow (required if using device framing)
- Voice generation — at least one of:
GOOGLE_API_KEY or GOOGLE_CLOUD_PROJECT (Gemini TTS — recommended)
OPENAI_API_KEY (OpenAI TTS)
ELEVENLABS_API_KEY (ElevenLabs)
- Music generation (optional) —
GOOGLE_API_KEY or GOOGLE_CLOUD_PROJECT for Lyria
Voice & Music Pairing Guide
| Demo Style | Voice (Gemini) | Music Prompt |
|---|
| General / default | Orus (firm) | "modern, clean, minimal, professional" |
| SaaS product tour | Orus (firm) | "modern, clean, minimal, professional" |
| Mobile app showcase | Puck (upbeat) | "upbeat, positive, mobile, fresh" |
| Enterprise demo | Charon (authoritative) | "corporate, confident, ambient, subtle" |
| Creative tool | Aoede (breezy) | "inspiring, creative, flowing, warm" |
| Developer tool | Orus (firm) | "tech, minimal, electronic, focused" |
| Consumer app | Kore (friendly) | "playful, modern, light, accessible" |
Limitations
- Requires existing screen recording — this skill does not capture screens
- Frame analysis depends on image clarity — blurry recordings produce weaker scripts
- Video duration limits TTS — very long recordings (5+ min) may need the script split into segments
- Device framer adds processing time — especially for long videos
- Music generation — Lyria clips max at ~30s, may need to loop for longer videos
Error Handling
| Error | Solution |
|---|
| "FFmpeg not found" | Install: brew install ffmpeg |
| "No frames extracted" | Check input video is valid, try lower --interval |
| "GOOGLE_API_KEY not set" | Set up API key per main README |
| "Pillow not installed" | Run: pip install Pillow |
| Audio/video duration mismatch | Voiceover script may be too long/short — adjust and regenerate |
| Device framer fails | Check frame PNG files exist in device-framer/frames/ |
Example Prompts
Simple:
"Here's a screen recording of my app — turn it into a polished demo video"
With details:
"I have a screen recording at ~/Desktop/myapp.mov. It's a fitness app showing the workout tracking feature. Put it in an iPhone 16 Pro frame, add a friendly voiceover, and some upbeat background music."
Minimal:
"Add voiceover narration to this screen recording: ~/recordings/demo.mp4"
Full control:
"Take ~/Desktop/recording.mp4, frame it in iPhone 17 Pro cosmic orange on dark background, generate a professional voiceover script, use Gemini TTS with Charon voice, add minimal electronic background music, save to ~/Desktop/final_demo.mp4"