| name | podcast-edit |
| description | Edit podcast audio — trim pre/post-show chat, remove filler words, cut silences, and enhance audio quality. Use when the user asks to edit a podcast, clean up audio, remove fillers, trim a recording, or improve voice quality. |
| user_invocable | true |
Podcast Edit Skill
Process raw podcast/meeting recordings into polished podcast episodes.
Capabilities
- Smart trimming — Find where the actual podcast starts/ends by transcribing and detecting intros/outros
- Filler word removal — Remove verbal tics: 嗯, 呃, 啊, 哦, 对对对, um, uh, etc.
- Silence trimming — Cut long dead air (>2s) down to natural pauses (~0.6s)
- Audio enhancement — Noise reduction, EQ, multi-speaker volume balancing, loudness normalization to podcast standard (−16 LUFS)
Prerequisites
- ffmpeg and ffprobe installed
- OPENAI_API_KEY in environment (for Whisper API transcription)
- Python 3 with stdlib only (no extra deps for the helper script)
Workflow
Step 1: Inspect the audio file
ffprobe -v quiet -print_format json -show_format -show_streams "INPUT_FILE"
Note: duration, sample rate, channels, codec, bitrate.
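As a rough illustration, the fields listed above can be pulled out of the ffprobe JSON with a few lines of stdlib Python. This is a sketch, not part of the shipped helper script: the function name `summarize_probe` and the sample values are made up, and it assumes the first audio stream is the one of interest.

```python
import json

def summarize_probe(probe_json):
    """Extract the Step 1 fields from `ffprobe -print_format json` output."""
    probe = json.loads(probe_json)
    # Assumption: the first stream tagged as audio is the one we care about.
    audio = next(s for s in probe["streams"] if s.get("codec_type") == "audio")
    fmt = probe["format"]
    return {
        "duration_s": float(fmt["duration"]),
        "sample_rate_hz": int(audio["sample_rate"]),
        "channels": int(audio["channels"]),
        "codec": audio["codec_name"],
        # bit_rate is not always reported, so fall back to None.
        "bitrate_bps": int(fmt["bit_rate"]) if "bit_rate" in fmt else None,
    }

# Illustrative ffprobe output, trimmed to the fields used above (made-up values).
SAMPLE = (
    '{"streams": [{"codec_type": "audio", "codec_name": "aac", '
    '"sample_rate": "44100", "channels": 2}], '
    '"format": {"duration": "3725.480000", "bit_rate": "128000"}}'
)

info = summarize_probe(SAMPLE)
print(info)
```

The sample rate matters later: it decides which lowpass cutoff the enhancement chain should use.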
Step 2: Find podcast start/end (if user says to trim front/back)
Split into 5-minute chunks and transcribe via OpenAI Whisper API with segment-level timestamps:
ffmpeg -y -i "INPUT_FILE" -ss OFFSET -t 300 -ar 16000 -ac 1 /tmp/chunk_OFFSET.mp3
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file="@/tmp/chunk_OFFSET.mp3" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F language="LANG" \
-F 'timestamp_granularities[]=segment' > /tmp/transcript_OFFSET.json
Scan transcriptions for:
- Start markers: "welcome", "hello everyone", "大家好", "欢迎", intro music, first substantive topic sentence
- End markers: "see you next time", "bye", "下期见", "感谢收听", followed by post-show chat
Do an initial trim with -ss START -to END and -c copy (no re-encode) to create a working file.
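The marker scan can be sketched roughly as follows. It assumes each chunk's transcript is the `verbose_json` response with a `segments` list (`start`, `end`, `text`), and that the chunk offset is added back to get full-file timestamps. The function name `find_bounds` and the sample transcript are hypothetical, and plain substring matching is only a first pass: a short marker like "bye" can fire mid-episode, so treat the result as a suggestion to confirm by ear.

```python
START_MARKERS = ("welcome", "hello everyone", "大家好", "欢迎")
END_MARKERS = ("see you next time", "bye", "下期见", "感谢收听")

def find_bounds(chunks):
    """chunks: list of (offset_seconds, transcript_dict) in playback order.

    Returns (start, end) in seconds on the full-file timeline. Either value
    is None when no marker was found.
    """
    start = end = None
    for offset, transcript in chunks:
        for seg in transcript.get("segments", []):
            text = seg["text"].lower()
            if start is None and any(m in text for m in START_MARKERS):
                # First start marker wins: keep audio from this segment on.
                start = offset + seg["start"]
            elif start is not None and any(m in text for m in END_MARKERS):
                # Last end marker wins: keep audio up to the end of it.
                end = offset + seg["end"]
    return start, end

# Made-up transcripts for two 5-minute chunks.
chunks = [
    (0, {"segments": [
        {"start": 0.0, "end": 4.0, "text": "Is this recording?"},
        {"start": 42.5, "end": 46.0, "text": "大家好，欢迎收听"},
    ]}),
    (300, {"segments": [
        {"start": 10.0, "end": 14.0, "text": "Thanks, see you next time!"},
        {"start": 20.0, "end": 25.0, "text": "OK, that went well."},
    ]}),
]
print(find_bounds(chunks))
```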
Step 3: Remove filler words
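Split the trimmed file into 5-minute chunks and transcribe each with word-level timestamps. DURATION is the trimmed file's length in whole seconds; if it is an exact multiple of 300, skip the final offset, because that chunk would start at the end of the file and be empty: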
for i in $(seq 0 300 DURATION); do
ffmpeg -y -i "TRIMMED_FILE" -ss $i -t 300 -ar 16000 -ac 1 /tmp/wchunk_${i}.mp3
done
for i in $(seq 0 300 DURATION); do
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file="@/tmp/wchunk_${i}.mp3" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F language="LANG" \
-F 'timestamp_granularities[]=word' \
-F 'timestamp_granularities[]=segment' > /tmp/wtranscript_${i}.json &
done
wait
Then run the filler removal script that ships with this skill:
python3 ./filler_removal.py \
--total-duration DURATION \
--end-at END_TIMESTAMP \
--cut START1:END1 --cut START2:END2 \
--chunk-offsets 0,300,600,900,...
Arguments:
--total-duration: Duration of the trimmed input file in seconds (required)
--end-at: Cut everything after this timestamp (e.g., post-show chat start)
--cut START:END: Cut a specific range. Can be repeated.
--chunk-offsets: Comma-separated chunk offsets (default: auto 0,300,600,…)
The script outputs /tmp/ffmpeg_filter.txt with an atrim+concat filter.
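For orientation, an atrim+concat filter of this kind can be generated by inverting the cut ranges into keep ranges and emitting one `atrim` per kept span. The sketch below is an illustration of the technique, not the code in filler_removal.py; the function names and the sample cuts are hypothetical. It ends in the `[out]` label that the `-map '[out]'` option expects.

```python
def keep_ranges(total, cuts):
    """Invert cut ranges into the spans to keep, merging overlapping cuts."""
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        # Clamp each cut to the file so stray ranges cannot go out of bounds.
        start = min(max(start, 0.0), total)
        end = min(max(end, 0.0), total)
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        keeps.append((cursor, total))
    return keeps

def build_filter(keeps):
    """Render keep ranges as an atrim+concat graph ending in [out]."""
    parts = [
        f"[0:a]atrim=start={s:.3f}:end={e:.3f},asetpts=PTS-STARTPTS[a{i}]"
        for i, (s, e) in enumerate(keeps)
    ]
    labels = "".join(f"[a{i}]" for i in range(len(keeps)))
    parts.append(f"{labels}concat=n={len(keeps)}:v=0:a=1[out]")
    return ";\n".join(parts)

# Made-up example: two filler cuts plus an --end-at style cut at 55 s.
keeps = keep_ranges(60.0, [(30.0, 31.0), (10.0, 10.5), (55.0, 60.0)])
graph = build_filter(keeps)
print(graph)
```

`asetpts=PTS-STARTPTS` resets each kept span's timestamps so that `concat` joins them back to back.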
Apply the filter in two passes:
ffmpeg -y -i "TRIMMED_FILE" \
-filter_complex_script /tmp/ffmpeg_filter.txt \
-map '[out]' -c:a pcm_s16le -ar 44100 /tmp/podcast_cut.wav
ffmpeg -y -i /tmp/podcast_cut.wav \
-af "ENHANCEMENT_CHAIN" \
-c:a libmp3lame -b:a 192k "OUTPUT_FILE"
Limitations: Whisper word-level timestamps for Chinese can miss fillers that are blended into adjacent speech. The script catches standalone fillers reliably but may miss ~10–20% of embedded ones.
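To make the standalone-versus-embedded distinction concrete, here is a minimal sketch of how standalone fillers might be located from word-level timestamps. It assumes each transcript carries a `words` list with `word`, `start`, and `end`. The matching rule (exact token match after stripping punctuation) is an assumption for illustration and may differ from what the shipped script does.

```python
# Filler list taken from the Capabilities section; extend it per show.
FILLERS = {"嗯", "呃", "啊", "哦", "对对对", "um", "uh"}
# Punctuation that may be attached to a word token (assumed set).
PUNCT = " ,.!?…，。！？、"

def filler_cuts(chunks):
    """chunks: list of (offset_seconds, transcript_dict).

    Returns (start, end) cut ranges on the full-file timeline.
    """
    cuts = []
    for offset, transcript in chunks:
        for word in transcript.get("words", []):
            token = word["word"].strip(PUNCT).lower()
            if token in FILLERS:
                cuts.append((offset + word["start"], offset + word["end"]))
    return cuts

# Made-up word timestamps for two chunks.
chunks = [
    (0, {"words": [
        {"word": "嗯", "start": 1.5, "end": 1.75},
        {"word": "今天", "start": 1.75, "end": 2.25},
    ]}),
    (300, {"words": [
        {"word": "Um,", "start": 4.0, "end": 4.5},
        {"word": "嗯嗯好的", "start": 5.0, "end": 5.5},
    ]}),
]
print(filler_cuts(chunks))
```

The last token shows the limitation: a filler fused into a longer token is not an exact match, so it passes through uncut.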
Step 4: Audio enhancement filter chain
Default chain (guest-friendly — handles multi-speaker volume imbalance). The biggest mistake in past runs is using a noise gate (agate) that silences the quieter guest entirely. Never add agate back to the default chain.
highpass=f=80, # Remove room rumble
lowpass=f=12000, # Remove hiss (use 7500 for 16kHz sources)
afftdn=nf=-25:nr=8:nt=w, # Gentle FFT noise reduction
equalizer=f=180:t=q:w=1.5:g=-2, # Cut mud
equalizer=f=2500:t=q:w=1.2:g=3, # Boost presence
equalizer=f=4500:t=q:w=1.5:g=1.5, # Boost clarity
dynaudnorm=f=200:g=5:p=0.95:m=5:s=0, # Rolling-window normalization — lifts the quieter speaker independently
acompressor=threshold=-20dB:ratio=2:attack=5:release=200:makeup=1, # Gentle glue
loudnorm=I=-16:TP=-1.5:LRA=13 # Podcast standard loudness
Why dynaudnorm is the star: it measures gain in 200 ms frames (f=200) and smooths that gain across a 5-frame Gaussian window (g=5), so when the guest is speaking, those frames get lifted independently of the host's louder frames. Order matters: run dynaudnorm BEFORE acompressor so the compressor sees a balanced signal.
Never add these to the default chain:
- agate (noise gate): cuts off any speaker quieter than the threshold; kills the guest.
- Heavy compression (ratio >3:1, makeup >2 dB): flattens dynamics and makes the guest sound pumped.
- Narrow LRA (<12) in loudnorm: crushes natural speech dynamics.
Adjust lowpass based on source sample rate:
- 16kHz source → lowpass=f=7500
- 44.1kHz+ source → lowpass=f=12000 (or skip)
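The annotated chain has to be passed to `-af` as a single comma-separated string with the comments stripped. One way to assemble it, sketched under the assumption that any source at or below 16 kHz takes the 7500 Hz cutoff (the document only names the 16 kHz and 44.1 kHz+ cases), is shown here. The function name `enhancement_chain` is hypothetical.

```python
def enhancement_chain(sample_rate_hz):
    """Build the default -af string, picking the lowpass cutoff by source rate."""
    # Assumption: rates at or below 16 kHz are treated as the low-rate case.
    lowpass_hz = 7500 if sample_rate_hz <= 16000 else 12000
    filters = [
        "highpass=f=80",
        f"lowpass=f={lowpass_hz}",
        "afftdn=nf=-25:nr=8:nt=w",
        "equalizer=f=180:t=q:w=1.5:g=-2",
        "equalizer=f=2500:t=q:w=1.2:g=3",
        "equalizer=f=4500:t=q:w=1.5:g=1.5",
        "dynaudnorm=f=200:g=5:p=0.95:m=5:s=0",
        "acompressor=threshold=-20dB:ratio=2:attack=5:release=200:makeup=1",
        "loudnorm=I=-16:TP=-1.5:LRA=13",
    ]
    names = [f.split("=", 1)[0] for f in filters]
    # Guard the two rules from this step: no gate, and dynaudnorm first.
    if "agate" in names:
        raise ValueError("agate must not be in the default chain")
    if names.index("dynaudnorm") > names.index("acompressor"):
        raise ValueError("dynaudnorm must run before acompressor")
    return ",".join(filters)

chain_16k = enhancement_chain(16000)
chain_44k = enhancement_chain(44100)
print(chain_44k)
```

The returned string is what replaces ENHANCEMENT_CHAIN in the second ffmpeg pass.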
Verify guest audibility after rendering: run ffmpeg -i OUTPUT -af "ebur128=peak=true" -f null - and check I: is near −16 LUFS and LRA: is 4–6 LU (tighter LRA is fine because dynaudnorm did per-window balancing first). If the output sounds like the guest was cut, suspect a gate or aggressive compressor crept back in.
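The check on the ebur128 output can be scripted once the ffmpeg log has been captured. The sketch below parses only the final "Summary:" block, because the per-frame log lines also contain `I:` and `LRA:` fields. The sample log is illustrative, the function names are hypothetical, and the ±1 LU tolerance around −16 LUFS is an assumption standing in for "near".

```python
import re

def parse_ebur128_summary(log_text):
    """Return integrated loudness and LRA from the ebur128 summary block."""
    # Keep only the text after the last "Summary:" so per-frame lines are ignored.
    summary = log_text.rsplit("Summary:", 1)[-1]
    i_match = re.search(r"\bI:\s*(-?\d+(?:\.\d+)?)\s*LUFS", summary)
    lra_match = re.search(r"\bLRA:\s*(\d+(?:\.\d+)?)\s*LU\b", summary)
    if not (i_match and lra_match):
        raise ValueError("no ebur128 summary found in log")
    return {
        "integrated_lufs": float(i_match.group(1)),
        "lra_lu": float(lra_match.group(1)),
    }

def on_target(stats, target=-16.0, tolerance=1.0):
    """True when I is near the target and LRA falls in the 4 to 6 LU range."""
    near = abs(stats["integrated_lufs"] - target) <= tolerance
    return near and 4.0 <= stats["lra_lu"] <= 6.0

# Illustrative log excerpt (made-up numbers).
SAMPLE_LOG = """
[Parsed_ebur128_0 @ 0x0] t: 1.0 TARGET:-23 LUFS M: -20.0 S:-120.7 I: -20.0 LUFS LRA: 0.0 LU
[Parsed_ebur128_0 @ 0x0] Summary:

  Integrated loudness:
    I:         -16.1 LUFS
    Threshold: -26.5 LUFS

  Loudness range:
    LRA:         5.2 LU
    Threshold: -36.5 LUFS
    LRA low:   -19.0 LUFS
    LRA high:  -13.8 LUFS
"""

stats = parse_ebur128_summary(SAMPLE_LOG)
print(stats, on_target(stats))
```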
Step 5: Verify output
ls -lh "OUTPUT_FILE"
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "OUTPUT_FILE"
Report: duration, file size, what was removed (filler count, silence count, time saved).
Output conventions
- Format: MP3, 192 kbps, mono (unless source is stereo with separate speakers per channel)
- Loudness: −16 LUFS (podcast standard)
- Always two-pass: cut to WAV first, then enhance to MP3
Show notes — bilingual writing (if applicable)
If the host is producing bilingual Chinese/English show notes, the Chinese section must be written in actual Chinese — not Chinese grammar with English verbs and nouns sprinkled in. Code-switching like "close 了一个 deal", "build 出来的 agent", or "PR 不是 buy 来的" reads like a draft and is the #1 mistake to avoid.
Translation rules
Translate these common startup/tech English loanwords into Chinese:
- close deal → 拿下订单 / 成交 / 签下
- build (a product) → 搭建 / 做出 / 打造
- integration → 集成
- view (video/page views) → 播放 / 浏览
- stack (tech stack) → 体系 / 技术栈
- category leader → 品类领导者
- front-end / front end (product sense) → 外壳 / 前端
- success story → 客户案例 / 成功故事
- SMB → 中小企业
- Enterprise (segment) → 大型企业 / 企业级
- aha moment → 顿悟时刻
- onboarding → 上手 / 入门
- retention → 留存
- churn → 流失
- pipeline → 销售漏斗 / 业务线
What to KEEP in English inside Chinese text
- Brand and product names: company / product / person names stay as-is
- Very common startup acronyms: CEO, CTO, CMO, PMF, ARR, MRR, PR, AI, AI Agent, SaaS, API
- Currency with numeric prefix: $20K, $200K, or 200 美金 (either form is fine when paired with a number)
Before finalizing
Re-read the Chinese section as a Chinese reader. If any sentence feels like it was half-translated — e.g., contains "build", "close", "deal", "view", "stack", "leader" as standalone English words — rewrite those words in Chinese. The only English that should survive a re-read is brand names and the acronyms above.
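A mechanical pre-check can catch the obvious cases before the human re-read. The sketch below flags any English word in the Chinese section that is not an allowed acronym, a confirmed brand name, or part of a dollar amount. The function name `stray_english` and the sample sentences are hypothetical; the allowlist is copied from the acronym list in this section, and it supplements the re-read rather than replacing it.

```python
import re

# Acronyms this section allows inside Chinese text.
ALLOWED = ["AI Agent", "CEO", "CTO", "CMO", "PMF", "ARR", "MRR", "PR", "AI", "SaaS", "API"]

def stray_english(chinese_text, brand_names=()):
    """Return English words that should probably be rewritten in Chinese."""
    # Drop dollar amounts such as $20K first, so the K is not flagged.
    scrubbed = re.sub(r"\$\d[\d,.]*[KkMmBb]?", " ", chinese_text)
    # Remove allowed terms, longest first, only when they stand alone.
    allowed = sorted(set(ALLOWED) | set(brand_names), key=len, reverse=True)
    pattern = (
        r"(?<![A-Za-z])(?:"
        + "|".join(re.escape(term) for term in allowed)
        + r")(?![A-Za-z])"
    )
    scrubbed = re.sub(pattern, " ", scrubbed)
    return re.findall(r"[A-Za-z][A-Za-z'-]*", scrubbed)

draft = "我们 close 了一个 deal，ARR 到了 $200K，用 AI Agent 做 onboarding"
print(stray_english(draft))
print(stray_english("Acme 的 CEO 说 PR 不是 buy 来的", brand_names=("Acme",)))
```

Brand names have to be passed in explicitly, which fits the rule that names are confirmed with the user rather than guessed.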
Name verification (CRITICAL)
Whisper frequently mangles company names, product names, and personal names. Before generating show notes or any output that includes names and links:
- After transcription, extract all proper nouns: company names, product names, personal names, URLs mentioned.
- Ask the user to confirm or correct them. Whisper often hears similar-sounding but wrong tokens for brand names.
- Never guess URLs from transcribed names. A name that sounds like "Acme" could be acme.com, acmehq.com, or something else entirely. Always ask.
- Use confirmed names consistently in show notes, titles, episode metadata, and all outputs.
This is especially important when generating backlinks or social posts — a misspelled domain is a wasted link.
Show notes structure (recommended)
Two separate sections — Chinese first, then English (or whichever languages the show targets). Do NOT interleave or put them side-by-side.
Heading rule: only use H2 (##). Avoid H3 or deeper — flatten all sub-sections to H2.
Timestamp format: always MM:SS with leading zeros (e.g., 08:25, 00:00, 42:10). Never 0:00 or 1:05.
EP{NNN}: {Episode title}
---
## 中文
**嘉宾:** {中文姓名 English Name}, {中文职位} {公司} (URL)
## 简介
{完整中文段落}
## 时间轴
- 00:00 — {中文描述}
- 08:25 — {中文描述}
## 核心要点
- {中文要点}
## 相关链接
- {品牌名}:{URL}
---
## English
**Guest:** {English Name}, {Title} at {Company} (URL)
## Summary
{Full English paragraph}
## Timestamps
- 00:00 — {English description}
- 08:25 — {English description}
## Key Takeaways
- {English takeaway}
## Links
- {Brand}: {URL}
Why two sections instead of bilingual bullets: Chinese readers want clean Chinese prose, English readers want clean English prose. Alternating "中文 / English" on every bullet makes both halves harder to read. Write each section as if it were the only one.
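The heading and timestamp rules for the show notes are easy to check mechanically. The sketch below formats seconds as MM:SS and lints a draft for H3-or-deeper headings and for timestamp bullets without leading zeros. The function names are hypothetical, and letting minutes run past 59 for long episodes is an assumption, since the rule only specifies MM:SS.

```python
import re

def mmss(seconds):
    """Format seconds as MM:SS with leading zeros (minutes may exceed 59)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"

def lint_show_notes(markdown):
    """Return a list of rule violations found in a show notes draft."""
    problems = []
    for n, line in enumerate(markdown.splitlines(), 1):
        # Heading rule: nothing deeper than H2.
        if re.match(r"#{3,}\s", line):
            problems.append(f"line {n}: heading deeper than H2")
        # Timestamp rule: bullets that start with a time must use MM:SS.
        found = re.match(r"-\s+(\d+:\d+)\b", line)
        if found and not re.fullmatch(r"\d{2,}:\d{2}", found.group(1)):
            problems.append(f"line {n}: timestamp {found.group(1)} is not MM:SS")
    return problems

# Made-up draft with one bad timestamp and one H3 heading.
draft = "## 时间轴\n- 0:00 intro\n- 08:25 topic\n### Extra"
print(mmss(505), lint_show_notes(draft))
```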
Quick trim (no filler removal)
If the user just wants a simple trim (e.g., "cut the first 3s"):
ffmpeg -y -i "INPUT" -ss 3 -c copy "OUTPUT"
Use -c copy for instant lossless trim when no audio processing is needed.