| name | whisper |
| description | OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| dependencies | ["openai-whisper","transformers","torch"] |
| metadata | {"hermes":{"tags":["Whisper","Speech Recognition","ASR","Multimodal","Multilingual","OpenAI","Speech-To-Text","Transcription","Translation","Audio Processing"]}} |
Whisper - Robust Speech Recognition
OpenAI's multilingual speech recognition model.
When to use Whisper
Use when:
- Speech-to-text transcription (99 languages)
- Podcast/video transcription
- Meeting notes automation
- Translation to English
- Noisy audio transcription
- Multilingual audio processing
Metrics:
- 72,900+ GitHub stars
- 99 languages supported
- Trained on 680,000 hours of audio
- MIT License
Use alternatives instead:
- AssemblyAI: Managed API, speaker diarization
- Deepgram: Real-time streaming ASR
- Google Speech-to-Text: Cloud-based
Quick start
Installation
uv venv /home/lxgxdx/whisper-venv --python 3.11
uv pip install --python /home/lxgxdx/whisper-venv/bin/python openai-whisper
source /home/lxgxdx/whisper-venv/bin/activate
Pitfall — venv pip module missing: If python3 -m pip fails with No module named pip, the venv was created without pip. Recreate with uv venv above. Do NOT try to pip install pip or modify the venv — just recreate it.
Pitfall — model download on first load: whisper.load_model() downloads the model (~461MB for small) on first run. This can take 1-2 minutes on slow connections — this is normal, not a hang.
Basic transcription
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
Model sizes
models = ["tiny", "base", "small", "medium", "large", "turbo"]
model = whisper.load_model("turbo")
| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|---|---|---|---|---|---|
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
Recommendation: Use turbo for best speed/quality, base for prototyping
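The table can also be read as a lookup. The following is a minimal sketch, assuming the approximate VRAM figures listed above. `pick_model`, `VRAM_GB`, and the preference order are hypothetical helpers for illustration and are not part of the whisper package.

```python
# Approximate VRAM needs in GB, copied from the model sizes table.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10}

# Assumed preference order: turbo first (per the recommendation),
# then progressively smaller models.
PREFERENCE = ["turbo", "medium", "small", "base", "tiny"]


def pick_model(available_vram_gb: float) -> str:
    """Return the first preferred model whose approximate VRAM need fits."""
    for name in PREFERENCE:
        if VRAM_GB[name] <= available_vram_gb:
            return name
    raise ValueError(f"No model fits in {available_vram_gb} GB of VRAM")


print(pick_model(8))  # turbo
print(pick_model(4))  # small
```

The returned name can be passed straight to whisper.load_model(). The VRAM numbers are rough, so leave headroom on shared GPUs.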
Transcription options
Chinese government meeting transcription
Government meetings involve multiple speakers, formal vocabulary, and policy terminology.
Supplying a domain-specific initial_prompt dramatically improves accuracy:
result = model.transcribe(
"meeting.wav",
language="zh",
initial_prompt=(
"这是一段政府部务会会议录音,与会人员包括多位领导干部,"
"讨论统战工作相关议题。"
)
)
Chinese meeting transcription workflow:
- Pre-check audio duration with ffprobe -v quiet -show_entries format=duration -of csv=p=0 file.wav
- Files <1 min may be truncated/incomplete; check size before processing
- Long files (>30 min) may degrade; consider splitting into 30-minute parts, e.g. ffmpeg -i in.wav -ss 0 -t 1800 part1.wav -ss 1800 -t 1800 part2.wav
- Use the small model for Chinese on clean audio (good quality/speed balance on CPU); low-quality recordings need medium
- Output raw text first, then use an LLM to structure into meeting minutes format
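The duration check and split steps of the workflow can be scripted. The following is a rough sketch, assuming ffprobe and ffmpeg are on PATH. `probe_duration`, `plan_chunks`, `split_command`, the 30-minute default, and the part{N}.wav naming are all illustrative assumptions.

```python
import subprocess


def probe_duration(path: str) -> float:
    """Return the duration in seconds using the ffprobe call from the workflow."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
         "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def plan_chunks(duration_s: float, chunk_s: float = 1800.0) -> list:
    """Split [0, duration_s) into (start, length) pairs of at most chunk_s seconds."""
    chunks = []
    start = 0.0
    while start < duration_s:
        chunks.append((start, min(chunk_s, duration_s - start)))
        start += chunk_s
    return chunks


def split_command(src: str, start: float, length: float, index: int) -> list:
    """Build one ffmpeg invocation that extracts a single chunk."""
    return ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-t", str(length),
            f"part{index}.wav"]


# Plan the split for a hypothetical 4000-second recording.
for i, (start, length) in enumerate(plan_chunks(4000.0), start=1):
    print(" ".join(split_command("in.wav", start, length, i)))
```

Fixed boundaries can cut a word in half, so overlapping chunks by a few seconds or splitting on silence may give cleaner joins.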
Task selection
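# Transcribe in the source language (default task)
result = model.transcribe("audio.mp3", task="transcribe")
# Translate speech to English; turbo is not trained for translation, so load a multilingual non-turbo model (e.g. medium or large)
result = model.transcribe("spanish.mp3", task="translate")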
Initial prompt
result = model.transcribe(
"audio.mp3",
initial_prompt="This is a technical podcast about machine learning and AI."
)
Quick file inspection before transcribing
ffprobe -v quiet -show_entries format=duration -of csv=p=0 file.wav
ffprobe -v quiet -show_entries stream=channels,sample_rate,codec_name -of csv=p=0 file.wav
ffmpeg -i file.wav -af volumedetect -f null /dev/null 2>&1 | grep -E "max_volume|mean_volume"
Heuristics:
- <5 seconds → truncated/invalid, skip
- ADPCM codec (adpcm_ms) → low quality, expect poor accuracy with small model
- mean_volume < -25 dB → too quiet, preprocess with volume boost
- Always run these checks before a long transcription; a 60-minute job that fails on a bad file wastes an hour
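The heuristics above can be collected into one pre-flight function. The following is a minimal sketch, assuming the duration, codec name, and volumedetect output have already been obtained with the commands shown. `parse_mean_volume` and `preflight` are hypothetical names, and the sample log line is made up for the demo.

```python
import re
from typing import List, Optional


def parse_mean_volume(ffmpeg_stderr: str) -> Optional[float]:
    """Extract mean_volume in dB from ffmpeg volumedetect output, if present."""
    match = re.search(r"mean_volume:\s*(-?\d+(?:\.\d+)?)\s*dB", ffmpeg_stderr)
    return float(match.group(1)) if match else None


def preflight(duration_s: float, codec: str,
              mean_volume_db: Optional[float]) -> List[str]:
    """Apply the thresholds from the heuristics list; an empty list means OK."""
    issues = []
    if duration_s < 5:
        issues.append("skip: shorter than 5 seconds, likely truncated")
    if codec.startswith("adpcm"):
        issues.append("warn: ADPCM codec, expect poor accuracy with small")
    if mean_volume_db is not None and mean_volume_db < -25:
        issues.append("warn: quiet audio, boost volume before transcribing")
    return issues


# Illustrative input only: a made-up volumedetect line for a one-hour file.
sample = "[Parsed_volumedetect_0 @ 0x55d] mean_volume: -31.2 dB"
for issue in preflight(3600.0, "adpcm_ms", parse_mean_volume(sample)):
    print(issue)
```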
Timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
Temperature fallback
result = model.transcribe(
"audio.mp3",
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
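Passing a tuple of temperatures means a segment is retried at the next temperature when the decoded text looks repetitive or low-confidence. The following is a simplified sketch of that idea, not the library's code. The threshold defaults mirror the transcribe() defaults (compression_ratio_threshold=2.4, logprob_threshold=-1.0), while `Attempt`, `decode_with_fallback`, and `fake_decode` are hypothetical names.

```python
from typing import Callable, NamedTuple, Sequence


class Attempt(NamedTuple):
    text: str
    avg_logprob: float
    compression_ratio: float


def decode_with_fallback(
    decode: Callable[[float], Attempt],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> Attempt:
    """Try each temperature in order; stop at the first acceptable attempt."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for temperature in temperatures:
        attempt = decode(temperature)
        too_repetitive = attempt.compression_ratio > compression_ratio_threshold
        too_uncertain = attempt.avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break
    # If every temperature failed, the last attempt is returned as-is.
    return attempt


def fake_decode(temperature: float) -> Attempt:
    """Stand-in decoder: greedy output loops, sampled output is fine."""
    if temperature == 0.0:
        return Attempt("la la la la la la", -0.3, 3.1)
    return Attempt("hello world", -0.4, 1.2)


print(decode_with_fallback(fake_decode).text)  # hello world
```

A single float such as temperature=0.0 disables the fallback, which is faster but more prone to repetition loops.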
Command line usage
whisper audio.mp3
whisper audio.mp3 --model turbo
whisper audio.mp3 --output_format txt
whisper audio.mp3 --output_format srt
whisper audio.mp3 --output_format vtt
whisper audio.mp3 --output_format json
whisper audio.mp3 --language Spanish
whisper spanish.mp3 --task translate
Batch processing
import os

audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
for audio_file in audio_files:
    print(f"Transcribing {audio_file}...")
    result = model.transcribe(audio_file)
    # Swap the extension for .txt regardless of the input format
    output_file = os.path.splitext(audio_file)[0] + ".txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])
CPU-optimized transcription (faster-whisper)
For CPU-based transcription (no GPU), prefer faster-whisper over openai-whisper.
It supports int8 quantization and runs up to 4× faster with comparable accuracy.
uv pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", language="zh")
# segments is a generator; transcription runs as it is iterated
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Recommended models by use case (CPU only)
| Scenario | Model | faster-whisper compute_type | Notes |
|---|---|---|---|
| Quick test / short clip | tiny or base | int8 | ~instant |
| English podcast/lecture | small | int8 | Good balance |
| Chinese meeting (low quality) | medium | int8 | Minimum for government/formal context |
| Chinese meeting (clean audio) | small | int8 | Acceptable with initial_prompt |
| Max accuracy (any language) | large-v3 | int8 | Slow but best |
Critical finding — Chinese government meeting audio: Low-bitrate WAV files (ADPCM, 32kHz mono) produce extremely poor results with small model — names, policy terms, and numbers are consistently misrecognized. medium int8 is the practical minimum for usable quality. Even with medium, expect ~70-80% accuracy; use an LLM pass to correct terminology afterward.
Audio preprocessing (required for low-quality recordings)
Before transcribing low-bitrate or noisy audio, always normalize and resample:
ffmpeg -y -i original.wav \
-af "highpass=f=200,lowpass=f=8000,volume=1.5,alimiter=limit=0.95" \
-ar 16000 -ac 1 -acodec pcm_s16le \
output_norm.wav
The filter chain: highpass removes rumble (f=200), lowpass removes high-frequency noise (f=8000), volume boosts quiet speech, alimiter prevents clipping. Output: 16kHz mono PCM — Whisper's optimal input.
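To run the same preprocessing from Python, the filter chain can be assembled from parameters. The following is a minimal sketch, assuming ffmpeg is on PATH. `build_preprocess_cmd` and `preprocess` are hypothetical helpers, and the defaults simply repeat the values from the command above.

```python
import subprocess


def build_preprocess_cmd(src: str, dst: str, highpass_hz: int = 200,
                         lowpass_hz: int = 8000, gain: float = 1.5,
                         limit: float = 0.95) -> list:
    """Build the ffmpeg command: band-limit, boost, limit, 16 kHz mono PCM."""
    filters = ",".join([
        f"highpass=f={highpass_hz}",   # remove low-frequency rumble
        f"lowpass=f={lowpass_hz}",     # remove high-frequency noise
        f"volume={gain}",              # boost quiet speech
        f"alimiter=limit={limit}",     # prevent clipping after the boost
    ])
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", dst]


def preprocess(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_preprocess_cmd(src, dst), check=True)


print(" ".join(build_preprocess_cmd("original.wav", "output_norm.wav")))
```

Keeping the command builder separate from the subprocess call makes it easy to log or adjust the filter values per recording.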
Real-time transcription
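from faster_whisper import WhisperModel
# Low-latency setup: faster-whisper on GPU with float16 (processes a file, not a live stream)
model = WhisperModel("base", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
# Segments are yielded as they are decoded, so output starts before the file is finished
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")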
GPU acceleration
import whisper
# Default: uses the GPU automatically when CUDA is available
model = whisper.load_model("turbo")
# Force CPU
model = whisper.load_model("turbo", device="cpu")
# Force GPU
model = whisper.load_model("turbo", device="cuda")
Integration with other tools
Subtitle generation
whisper video.mp4 --output_format srt --language English
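When transcription already happens in Python, the segments list can be turned into SRT text directly. The following is a small sketch, assuming segments shaped like result["segments"] (dicts with start, end, text). `srt_timestamp` and `segments_to_srt` are hypothetical helpers, not whisper functions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


# Made-up segments for illustration.
demo = [
    {"start": 0.0, "end": 2.5, "text": " Hello."},
    {"start": 3661.5, "end": 3663.0, "text": " Goodbye."},
]
print(segments_to_srt(demo))
```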
With LangChain
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Wrap the Whisper transcript in a LangChain Document
result = model.transcribe("audio.mp3")
docs = [Document(page_content=result["text"], metadata={"source": "audio.mp3"})]
# Index the transcript for retrieval
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
whisper audio.wav
Best practices
- Use turbo model - Best speed/quality for English
- Specify language - Faster than auto-detect
- Add initial prompt - Improves technical terms
- Use GPU - 10-20× faster
- Batch process - More efficient
- Convert to WAV - Better compatibility
- Split long audio - <30 min chunks
- Check language support - Quality varies by language
- Use faster-whisper - 4× faster than openai-whisper
- Monitor VRAM - Scale model size to hardware
Performance
| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|---|---|---|
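| tiny | ~0.03 | ~0.01 |
| base | ~0.06 | ~0.01 |
| turbo | ~0.12 | ~0.01 |
| large | ~1.0 | ~0.05 |
Real-time factor = processing time / audio duration; lower is faster (0.1 = 10× faster than real-time). CPU values are approximate, scaled from the relative speeds in the model sizes table with large taken as ~1.0.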
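Real-time factor is processing time divided by audio duration, so it is easy to measure on your own hardware. The following is a minimal sketch of the arithmetic; `real_time_factor` and `speedup` are hypothetical helper names, and the timings in the example are illustrative, not measurements.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = time spent transcribing / length of the audio."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def speedup(rtf: float) -> float:
    """How many times faster than real-time a given RTF is."""
    return 1.0 / rtf


# Example: a 60-minute file transcribed in 6 minutes.
rtf = real_time_factor(processing_s=360.0, audio_s=3600.0)
print(f"RTF={rtf:.2f}, speedup={speedup(rtf):.1f}x")  # RTF=0.10, speedup=10.0x
```

To measure it, wrap the transcribe call in time.perf_counter() and divide by the duration reported by ffprobe.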
Language support
Top-supported languages:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)
Full list: 99 languages total
Limitations
- Hallucinations - May repeat or invent text
- Long-form accuracy - Degrades on >30 min audio
- Speaker identification - No diarization
- Accents - Quality varies
- Background noise - Can affect accuracy
- Real-time latency - Not suitable for live captioning
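Repeated-text hallucinations often show up as the same segment emitted several times in a row. One possible post-processing heuristic is to collapse such runs. The following is only a sketch of that idea, assuming segments shaped like result["segments"]. `collapse_repeats` is a hypothetical helper, not a Whisper feature, and it can also drop genuine repetition in the speech.

```python
def collapse_repeats(segments, max_repeats: int = 1):
    """Keep at most max_repeats consecutive segments with identical text."""
    cleaned = []
    previous = None
    run_length = 0
    for seg in segments:
        # Compare case-insensitively and ignore surrounding whitespace.
        text = seg["text"].strip().lower()
        if text == previous:
            run_length += 1
        else:
            previous = text
            run_length = 1
        if run_length <= max_repeats:
            cleaned.append(seg)
    return cleaned


# Made-up segments for illustration.
demo = [
    {"text": " Thanks."},
    {"text": " Thanks."},
    {"text": " thanks. "},
    {"text": " Bye."},
]
print([seg["text"].strip() for seg in collapse_repeats(demo)])  # ['Thanks.', 'Bye.']
```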
Resources