| name | whisper |
| description | OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| dependencies | ["openai-whisper","transformers","torch"] |
| platforms | ["linux","macos"] |
| metadata | {"hermes":{"tags":["Whisper","Speech Recognition","ASR","Multimodal","Multilingual","OpenAI","Speech-To-Text","Transcription","Translation","Audio Processing"]}} |
Whisper - Robust Speech Recognition
OpenAI's multilingual speech recognition model.
When to use Whisper
Use when:
- Speech-to-text transcription (99 languages)
- Podcast/video transcription
- Meeting notes automation
- Translation to English
- Noisy audio transcription
- Multilingual audio processing
Metrics:
- 72,900+ GitHub stars
- 99 languages supported
- Trained on 680,000 hours of audio
- MIT License
Use alternatives instead:
- AssemblyAI: Managed API, speaker diarization
- Deepgram: Real-time streaming ASR
- Google Speech-to-Text: Cloud-based
Quick start
Installation
pip install -U openai-whisper
Basic transcription
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
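The segment list can be post-processed into other formats. As a minimal sketch, assuming each segment is a dict with `start`, `end`, and `text` keys as in the loop above, the helpers below (hypothetical names, not part of the whisper package) render segments as SRT text. The CLI's `--output_format srt` already covers the common case; this is only useful when custom formatting is needed.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Stand-in for result["segments"] so the sketch runs without a model.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 3661.042, "text": " Goodbye."},
]
print(segments_to_srt(example_segments))
```

With a real transcription, pass `result["segments"]` instead of `example_segments`.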
Model sizes
models = ["tiny", "base", "small", "medium", "large", "turbo"]
model = whisper.load_model("turbo")
| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|---|---|---|---|---|---|
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
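Recommendation: Use turbo for the best speed/quality trade-off and base for quick prototyping. turbo is not trained for translation, so use medium or large with task="translate".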
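The table and recommendation can be turned into a small selection helper. This is a sketch only: `pick_model`, the preference ordering, and the `need_translation` flag are assumptions made for illustration, and the VRAM figures are the approximate values from the table above.

```python
# Approximate VRAM requirements in GB, copied from the model table.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10}

# Hypothetical preference orders: turbo first for transcription,
# largest-first without turbo when translation to English is required.
TRANSCRIBE_ORDER = ["turbo", "medium", "small", "base", "tiny"]
TRANSLATE_ORDER = ["large", "medium", "small", "base", "tiny"]


def pick_model(available_vram_gb: float, need_translation: bool = False):
    """Return the first preferred model that fits in memory, or None."""
    order = TRANSLATE_ORDER if need_translation else TRANSCRIBE_ORDER
    for name in order:
        if VRAM_GB[name] <= available_vram_gb:
            return name
    return None


print(pick_model(8))                         # transcription on an 8 GB GPU
print(pick_model(8, need_translation=True))  # translation on the same GPU
```

The returned name can be passed straight to `whisper.load_model`.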
Transcription options
Language specification
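# Auto-detect the language from the start of the audio (default)
result = model.transcribe("audio.mp3")
# Specify the language to skip detection
result = model.transcribe("audio.mp3", language="en")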
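Language identification can also be run on its own, without transcribing. The sketch below follows the detection pattern shown in the Whisper README (`load_audio`, `pad_or_trim`, `log_mel_spectrogram`, `detect_language`); the wrapper names `detect_language` and `top_languages` are hypothetical helpers, and the probabilities in the demo line are illustrative rather than real model output.

```python
def detect_language(audio_path, model):
    """Return (best_code, probabilities) for the first 30 seconds of audio_path."""
    import whisper  # imported lazily so top_languages works without the package

    # Whisper scores language on a single 30-second window.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    return max(probs, key=probs.get), probs


def top_languages(probs, n=3):
    """Sort a {language_code: probability} mapping, most likely first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:n]


# Illustrative probabilities, not real model output.
print(top_languages({"en": 0.91, "de": 0.05, "nl": 0.03, "fr": 0.01}, n=2))
```

After a normal `model.transcribe(...)` call, the detected code is also available as `result["language"]`.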
Task selection
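# Transcribe in the spoken language (default)
result = model.transcribe("audio.mp3", task="transcribe")
# Translate speech to English (use medium or large; turbo is not trained for translation)
result = model.transcribe("spanish.mp3", task="translate")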
Initial prompt
# Prime the decoder with context so domain terms are spelled correctly
result = model.transcribe(
    "audio.mp3",
    initial_prompt="This is a technical podcast about machine learning and AI."
)
Timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
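Word-level timestamps make it possible to build caption lines of a chosen length. As a rough sketch, assuming each word is a dict with `word`, `start`, and `end` keys as in the loop above, the hypothetical helper `words_to_cues` groups words into cues and starts a new cue when the line gets too long or the speaker pauses. The `max_chars` and `max_gap` defaults are arbitrary illustrative values.

```python
def _make_cue(words):
    """Collapse a run of word dicts into one cue with start, end, and text."""
    text = "".join(w["word"] for w in words).strip()
    return {"start": words[0]["start"], "end": words[-1]["end"], "text": text}


def words_to_cues(words, max_chars=42, max_gap=0.8):
    """Group word dicts into caption cues, splitting on length or pauses."""
    cues, current = [], []
    for word in words:
        if current:
            length = len("".join(w["word"] for w in current).strip()) + len(word["word"])
            gap = word["start"] - current[-1]["end"]
            if length > max_chars or gap > max_gap:
                cues.append(_make_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(_make_cue(current))
    return cues


# Stand-in for the words of one or more segments.
example_words = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.5, "end": 0.9},
    {"word": " Next", "start": 2.5, "end": 2.9},
    {"word": " line", "start": 3.0, "end": 3.4},
]
for cue in words_to_cues(example_words):
    print(cue)
```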
Temperature fallback
# Retry at progressively higher temperatures when a decode looks unreliable
result = model.transcribe(
    "audio.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
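To illustrate what the temperature tuple does, here is a simplified sketch of the fallback idea: decode at the lowest temperature first and move to the next one only if the output looks too repetitive or too uncertain. The thresholds mirror the `transcribe` defaults (`compression_ratio_threshold=2.4`, `logprob_threshold=-1.0`), but `decode_with_fallback` and the `decode` callable are hypothetical stand-ins, and Whisper's real loop also accounts for silence detection.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, giving a large ratio."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """Try each temperature in order; stop at the first acceptable result.

    decode is a hypothetical callable: temperature -> (text, avg_logprob).
    """
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        result = (temperature, text)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break
    return result


def fake_decode(temperature):
    """Toy decoder: loops at temperature 0, recovers at higher values."""
    if temperature == 0.0:
        return "la " * 50, -0.2
    return "The quick brown fox jumps over the lazy dog.", -0.3


print(decode_with_fallback(fake_decode))
```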
Command line usage
whisper audio.mp3
whisper audio.mp3 --model turbo
whisper audio.mp3 --output_format txt
whisper audio.mp3 --output_format srt
whisper audio.mp3 --output_format vtt
whisper audio.mp3 --output_format json
whisper audio.mp3 --language Spanish
whisper spanish.mp3 --task translate
Batch processing
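import os
import whisper
# Load the model once and reuse it for every file
model = whisper.load_model("turbo")
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
for audio_file in audio_files:
    print(f"Transcribing {audio_file}...")
    result = model.transcribe(audio_file)
    # Write the transcript next to the audio file with a .txt extension
    output_file = os.path.splitext(audio_file)[0] + ".txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])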
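For larger jobs it can help to scan a directory, skip files that already have a transcript, and keep going when one file fails. The sketch below is one possible shape for that, not an established API: `transcribe_directory` and `AUDIO_SUFFIXES` are hypothetical names, and the `transcribe` argument is any callable that returns a dict with a `text` key, such as `model.transcribe`.

```python
from pathlib import Path

# Hypothetical set of extensions to treat as audio; adjust as needed.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac"}


def transcribe_directory(directory, transcribe, overwrite=False):
    """Transcribe every audio file in directory, returning the paths written."""
    written = []
    for audio_path in sorted(Path(directory).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        output_path = audio_path.with_suffix(".txt")
        if output_path.exists() and not overwrite:
            continue  # already transcribed on an earlier run
        try:
            result = transcribe(str(audio_path))
        except Exception as error:  # keep the batch going on a bad file
            print(f"Skipping {audio_path.name}: {error}")
            continue
        output_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
        written.append(output_path)
    return written


# Usage with a loaded model (not executed here):
# model = whisper.load_model("turbo")
# transcribe_directory("recordings", model.transcribe)
```

Passing the transcriber in as an argument keeps the model loaded once and makes the loop easy to test with a stub.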
Faster transcription with faster-whisper
# Requires: pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cuda", compute_type="float16")
# segments is a generator: transcription runs lazily as you iterate over it
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
GPU acceleration
import whisper
# Default: uses CUDA automatically when a GPU is available, otherwise CPU
model = whisper.load_model("turbo")
# Force CPU
model = whisper.load_model("turbo", device="cpu")
# Force GPU
model = whisper.load_model("turbo", device="cuda")
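When the same script has to run on machines with and without a GPU, the device and precision can be chosen at runtime. This is a minimal sketch: `pick_device` and `transcribe_kwargs` are hypothetical helpers, and the only library behavior assumed is `torch.cuda.is_available()` and the `fp16` option accepted by `model.transcribe`.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def transcribe_kwargs(device: str) -> dict:
    """Half precision only makes sense on GPU; request FP32 on CPU."""
    return {"fp16": device == "cuda"}


device = pick_device()
print(device, transcribe_kwargs(device))

# Usage with Whisper (not executed here):
# model = whisper.load_model("turbo", device=device)
# result = model.transcribe("audio.mp3", **transcribe_kwargs(device))
```

Setting `fp16=False` on CPU avoids the warning Whisper prints when it falls back to FP32.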
Integration with other tools
Subtitle generation
whisper video.mp4 --output_format srt --language English
With LangChain
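import whisper
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# Transcribe locally, then wrap the text in a LangChain Document
model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
docs = [Document(page_content=result["text"], metadata={"source": "audio.mp3"})]
# Index the transcript in a vector store for retrieval
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())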
Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
whisper audio.wav
Best practices
- Use turbo model - Best speed/quality trade-off for transcription
- Specify language - Faster than auto-detect
- Add initial prompt - Improves technical terms
- Use GPU - Often 10-20× faster than CPU
- Batch process - Load the model once and reuse it across files
- Convert to WAV - Better compatibility
- Split long audio - <30 min chunks
- Check language support - Quality varies by language
- Use faster-whisper - Up to 4× faster than openai-whisper
- Monitor VRAM - Scale model size to hardware
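The advice to split long audio can be sketched as two small helpers: one that computes chunk boundaries and one that builds the matching ffmpeg commands. The function names, the 5-second overlap, and the `chunk_NNN.wav` naming scheme are assumptions for illustration; the ffmpeg options used (`-ss`, `-t`, `-vn`, `-acodec pcm_s16le`) are standard.

```python
def chunk_ranges(total_seconds, chunk_seconds=1800.0, overlap_seconds=5.0):
    """Return (start, end) pairs covering the audio in overlapping chunks."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    ranges = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        ranges.append((start, end))
        if end >= total_seconds:
            break
        # Step back slightly so words cut at a boundary appear in both chunks.
        start = end - overlap_seconds
    return ranges


def ffmpeg_split_commands(input_path, ranges):
    """Build one ffmpeg argument list per chunk (for subprocess.run)."""
    commands = []
    for index, (start, end) in enumerate(ranges):
        commands.append([
            "ffmpeg", "-y", "-i", input_path,
            "-ss", str(start), "-t", str(end - start),
            "-vn", "-acodec", "pcm_s16le", f"chunk_{index:03d}.wav",
        ])
    return commands


# A 4000-second recording split into chunks of at most 30 minutes.
for command in ffmpeg_split_commands("long.wav", chunk_ranges(4000.0)):
    print(" ".join(command))
```

Each chunk can then be transcribed separately, adding the chunk's start time to its segment timestamps when merging results.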
Performance
| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|---|---|---|
| tiny | ~0.32 | ~0.01 |
| base | ~0.16 | ~0.01 |
| turbo | ~0.08 | ~0.01 |
| large | ~1.0 | ~0.05 |
Real-time factor = processing time ÷ audio duration, so 0.1 means 10× faster than real time. Values are approximate and depend on hardware.
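Because these figures vary with hardware, it is worth measuring the real-time factor on your own machine. A minimal sketch, assuming you know the audio duration in seconds; `real_time_factor` and `measure` are hypothetical helper names, and `transcribe` is any callable such as `model.transcribe`.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 beats real time."""
    return processing_seconds / audio_seconds


def measure(transcribe, audio_path, audio_seconds):
    """Time one transcription call and return (result, real-time factor)."""
    start = time.perf_counter()
    result = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return result, real_time_factor(elapsed, audio_seconds)


# Worked example: 60 seconds of audio processed in 6 seconds.
rtf = real_time_factor(6, 60)
print(f"RTF {rtf:.2f}, about {1 / rtf:.0f}x faster than real time")
```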
Language support
Top-supported languages:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)
Full list: 99 languages total
Limitations
- Hallucinations - May repeat or invent text
- Long-form accuracy - Degrades on >30 min audio
- Speaker identification - No diarization
- Accents - Quality varies
- Background noise - Can affect accuracy
- Real-time latency - Not suitable for live captioning
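Some hallucinations can be caught after the fact using the per-segment statistics Whisper returns. The sketch below is a heuristic, not a guaranteed fix: it assumes each segment dict carries `no_speech_prob`, `avg_logprob`, and `compression_ratio` alongside `text`, reuses the `transcribe` default thresholds as cutoffs, and the name `filter_segments` is hypothetical.

```python
def filter_segments(segments, max_no_speech_prob=0.6, min_avg_logprob=-1.0,
                    max_compression_ratio=2.4):
    """Drop segments that look like silence, loops, or exact repeats."""
    kept = []
    previous_text = None
    for segment in segments:
        text = segment["text"].strip()
        likely_silence = (segment["no_speech_prob"] > max_no_speech_prob
                          and segment["avg_logprob"] < min_avg_logprob)
        too_repetitive = segment["compression_ratio"] > max_compression_ratio
        repeated = text == previous_text
        if text and not (likely_silence or too_repetitive or repeated):
            kept.append(segment)
        previous_text = text
    return kept


# Illustrative segments, not real model output.
example = [
    {"text": " Hello.", "no_speech_prob": 0.01, "avg_logprob": -0.2, "compression_ratio": 1.1},
    {"text": " Hello.", "no_speech_prob": 0.02, "avg_logprob": -0.2, "compression_ratio": 1.1},
    {"text": " Thanks for watching!", "no_speech_prob": 0.9, "avg_logprob": -1.5, "compression_ratio": 1.0},
    {"text": " la la la la la la", "no_speech_prob": 0.1, "avg_logprob": -0.4, "compression_ratio": 5.0},
    {"text": " Goodbye.", "no_speech_prob": 0.05, "avg_logprob": -0.3, "compression_ratio": 1.0},
]
print([s["text"] for s in filter_segments(example)])
```

Filtering like this can also remove genuine repeated speech, so the thresholds are worth tuning against a sample of your own audio.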
Resources