| name | omnivoice |
| description | Local TTS, voice cloning, voice design, and video dubbing via the OmniVoice Studio MCP server (open-source ElevenLabs alternative; nothing leaves the machine, runs on MPS/CUDA/CPU). Use when: (1) generating speech from text in any of 646 languages, (2) cloning a voice from a 3-second reference clip, (3) designing a voice by gender/age/accent/pitch/style, (4) dubbing a video into another language, (5) listing voice profiles or personality presets, (6) producing narration where privacy, cost, or absent API keys matter, (7) non-English narration where Edge TTS/kokoro fall short, (8) batch audio for blog posts or content pipelines. Triggers: 'omnivoice', 'voice clone', 'clone this voice', 'tts', 'narrate', 'generate speech', 'voice synthesis', 'dub video', 'voice design', 'local tts', 'multilingual voice', 'narrate this post', 'elevenlabs alternative'. |
OmniVoice
Overview
Generate audio locally via the OmniVoice Studio MCP server. Tools: generate_speech, list_voices, list_personalities, list_languages, check_health. Resources: voice://{id}, history://recent.
Prerequisites — Backend Must Be Running
The MCP tools all hit $OMNIVOICE_API_URL (default http://localhost:3900). If the backend is down, every tool returns a connection error. Install + boot:
git clone https://github.com/debpalash/OmniVoice-Studio.git "$OMNIVOICE_HOME"
cd "$OMNIVOICE_HOME"
uv sync
VIRTUAL_ENV="$(pwd)/.venv" uv pip install 'mcp[cli]'
Then:
scripts/check-health.sh
scripts/start-backend.sh
The first synthesis call lazily downloads the k2-fsa/OmniVoice model (~2.4 GB) from Hugging Face; later boots reuse the cached copy.
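Because every tool fails with a connection error when the backend is down, a script may want to probe the port before issuing calls. The following is a minimal sketch, not part of OmniVoice: `backend_reachable` is a hypothetical helper that only checks whether something accepts TCP connections at `$OMNIVOICE_API_URL`. It does not replace `check_health`, which reports the actual backend status.

```python
import os
import socket
from urllib.parse import urlsplit


def backend_reachable(base_url: str = "", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend's host and port succeeds."""
    # Fall back to the documented default when OMNIVOICE_API_URL is unset.
    url = base_url or os.environ.get("OMNIVOICE_API_URL", "http://localhost:3900")
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout: treat as down.
        return False


if __name__ == "__main__":
    print("backend reachable:", backend_reachable())
```

If this returns False, run scripts/start-backend.sh before calling any MCP tool.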
Task Index — Pick the Right Tool
| Task | Tool | Notes |
|---|---|---|
| Verify backend is up | check_health | Returns `{"status":"ok","device":"mps |
| Text → audio with a saved voice | generate_speech(text, profile_id) | Returns base64 WAV. profile_id="demo0001" is the bundled demo voice |
| Text → audio without a clone (voice design) | generate_speech(text, instruct="…") | Omit profile_id; pass an instruct like "warm middle-aged female narrator, calm pace" |
| Multilingual narration | generate_speech(text, language="es") | Any ISO 639 code or "Auto" |
| List existing voices | list_voices | Returns id, name, type, personality |
| List personality presets | list_personalities | Returns narrator / casual / news-anchor / etc. with their instruct strings |
| List supported languages | list_languages | 646 total; returns 20 popular + the full count |
For non-trivial decisions (which engine to use, when to pick OmniVoice over kokoro / Edge TTS / ElevenLabs), see references/engines-comparison.md.
For MCP wiring details, backend lifecycle, troubleshooting, and a clean teardown, see references/mcp-setup.md.
Common Workflows
1. One-shot narration with the demo voice
result = generate_speech(
text="Hello — this is OmniVoice generating speech locally.",
profile_id="demo0001",
language="English",
steps=16,
)
Benchmark: 4.2 s of audio in ~24 s server-side on Apple Silicon MPS at 16 diffusion steps.
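The benchmark figure can be turned into a rough planning number. The sketch below computes the real-time factor from the two values quoted above and extrapolates to a longer clip. The linear scaling is an assumption of this example, not a measured property of the model, and `estimate_synthesis_seconds` is a hypothetical helper name.

```python
# Figures from the benchmark above: 4.2 s of audio took about 24 s to synthesize.
AUDIO_SECONDS = 4.2
SYNTH_SECONDS = 24.0

# Real-time factor: seconds of compute per second of audio (lower is faster).
rtf = SYNTH_SECONDS / AUDIO_SECONDS


def estimate_synthesis_seconds(audio_seconds: float, factor: float = rtf) -> float:
    """Rough estimate, assuming synthesis time scales linearly with audio length."""
    return audio_seconds * factor


print(f"RTF = {rtf:.2f}")  # RTF = 5.71
print(f"60 s of narration takes roughly {estimate_synthesis_seconds(60):.0f} s")  # roughly 343 s
```

On the benchmarked hardware and step count, that is roughly 5.7 s of compute per second of audio, so long narrations are better run as batch jobs than interactively.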
2. Save the WAV to disk and play
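The tool returns a base64-encoded PCM WAV (16-bit, mono, 24 kHz). Decode it and write it to disk:
import base64, json
payload = json.loads(result_text)  # result_text: the JSON string returned by generate_speech
with open("out.wav", "wb") as f:  # the context manager closes the file after writing
    f.write(base64.b64decode(payload["wav_base64"]))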
On macOS: afplay out.wav. Convert to MP3 with ffmpeg -i out.wav -codec:a libmp3lame -b:a 128k out.mp3.
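Before playing or converting the file, it can help to confirm that the decoded bytes match the stated format. This is a minimal sketch using only the standard library. `describe_wav` and `make_silent_wav_base64` are hypothetical helpers, and the silent clip stands in for a real `wav_base64` value so the example runs without the backend.

```python
import base64
import io
import wave


def describe_wav(wav_base64: str) -> dict:
    """Decode a base64 WAV and report its format and duration."""
    raw = base64.b64decode(wav_base64)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "seconds": wav.getnframes() / rate,
        }


def make_silent_wav_base64(seconds: float = 0.5, rate: int = 24000) -> str:
    """Build a stand-in payload (silence) in the documented format."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return base64.b64encode(buf.getvalue()).decode("ascii")


info = describe_wav(make_silent_wav_base64())
print(info)
```

With a real response, pass `payload["wav_base64"]` to `describe_wav` and check for 1 channel, 16 bits, and a 24000 Hz sample rate.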
3. Voice clone — end-to-end recipe
Cloning needs a 3-10 second reference clip, from which the model derives a speaker embedding. The MCP server does NOT expose profile creation; it only reads existing profiles. There are two ways to create one:
Path A — bundled helper (macOS, recommended for fresh clones):
scripts/record-reference.sh ~/Downloads/my-ref.wav 12 1
The script gives an audible countdown and start/stop cues via macOS say and /System/Library/Sounds/Ping.aiff, so the user knows when to speak; printed "speak now" prompts arrive too late because terminal stdout is buffered. It records a longer raw window, trims it to ~10 seconds of speech via silenceremove + atrim, plays the result back for verification, and prints the next-step curl command.
Path B — manual:
ffmpeg -f avfoundation -i ":1" -t 12 -ac 1 -ar 24000 raw.wav
ffmpeg -i raw.wav \
-af "silenceremove=start_periods=1:start_silence=0.05:start_threshold=-40dB,atrim=end=10" \
-ac 1 -ar 24000 ref.wav
ffmpeg -i ref.wav -af volumedetect -f null - 2>&1 | grep volume
afplay ref.wav
POST to /profiles (multipart/form-data — required fields: name, ref_audio):
curl -X POST http://127.0.0.1:3900/profiles \
-F "name=carlos-clone" \
-F "ref_audio=@ref.wav" \
-F "ref_text=The exact text spoken in the clip" \
-F "language=English" \
| python3 -m json.tool
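For pipelines that already run Python, the same upload can be built without curl. The sketch below assembles the multipart/form-data request by hand with the standard library and does not send it. `build_profile_request` is a hypothetical helper; the field names mirror the curl command above, and the `audio/wav` content type is an assumption that fits a WAV upload.

```python
import urllib.request
import uuid


def build_profile_request(base_url, name, ref_audio, filename="ref.wav",
                          ref_text="", language=""):
    """Build (without sending) a multipart POST to /profiles."""
    boundary = uuid.uuid4().hex
    fields = {"name": name}
    if ref_text:
        fields["ref_text"] = ref_text
    if language:
        fields["language"] = language

    parts = []
    for key, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"\r\n\r\n'
        parts.append((header + value + "\r\n").encode("utf-8"))
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="ref_audio"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8"))
    parts.append(ref_audio)                       # raw bytes of the reference clip
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url.rstrip('/')}/profiles",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes stand in for the contents of ref.wav.
req = build_profile_request("http://127.0.0.1:3900", "carlos-clone", b"RIFF",
                            ref_text="The exact text spoken in the clip",
                            language="English")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the upload; that call is left out so the sketch runs offline.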
Once created, pass profile_id to generate_speech (via MCP) or directly via POST /generate. Profiles persist in SQLite + reference-audio files at ~/Library/Application Support/OmniVoice/voices/<id>.<ext> (the backend preserves the uploaded extension — .wav if you uploaded a WAV, .mp3 if MP3, etc.). State persists across backend restarts.
Reference clip tips that materially affect quality:
| Factor | Why it matters |
|---|---|
| Single speaker | Mixed speakers blur the embedding |
| Clean speech, no music/noise | Model embeds the noise too |
| Natural prosody (avoid pangrams) | Diffusion samples replicate prosody, not just timbre |
| 3-10 sec is the sweet spot | < 3 s lacks information; > 10 s adds compute without quality gain |
| Match ref_text to what's spoken | Improves alignment, especially on noisy refs |
| language correct | Wrong language → cross-lingual transfer artifacts |
| Loudness peak ≥ -15 dB | Quiet refs work but normalize poorly |
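Two of the factors above, clip length and peak loudness, can be checked mechanically before uploading. This is a minimal sketch assuming a 16-bit PCM WAV like the one produced in Path B. `check_reference_clip` is a hypothetical helper, and its peak figure is a plain sample-peak measurement in dBFS, which should be close to what ffmpeg's volumedetect reports as max_volume but is not guaranteed to be identical.

```python
import array
import math
import sys
import wave


def check_reference_clip(source) -> list:
    """Return a list of problems found in a WAV reference clip (path or file object)."""
    problems = []
    with wave.open(source, "rb") as wav:
        width = wav.getsampwidth()
        seconds = wav.getnframes() / wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Length: the table recommends 3-10 seconds.
    if not 3.0 <= seconds <= 10.0:
        problems.append(f"duration {seconds:.1f} s is outside 3-10 s")

    if width != 2:
        problems.append("not 16-bit PCM; peak check skipped")
        return problems

    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    peak_db = 20 * math.log10(peak / 32768) if peak else float("-inf")

    # Loudness: the table recommends a peak of at least -15 dB.
    if peak_db < -15.0:
        problems.append(f"peak {peak_db:.1f} dBFS is below -15 dB")
    return problems
```

Calling `check_reference_clip("ref.wav")` returns an empty list when both checks pass. The remaining factors (single speaker, clean speech, natural prosody) still need a listen.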
4. Voice design (no reference clip)
Skip profile_id; provide an instruct string describing the desired voice:
generate_speech(
text="Welcome to the future of agentic systems.",
instruct="warm middle-aged female narrator, calm authoritative pace, documentary style",
)
Get pre-made instructs via list_personalities and copy the one matching the brief (narrator, casual, news-anchor, etc.).
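The clone and design workflows differ only in which argument selects the voice. The sketch below shows one way a caller might assemble the arguments. `build_speech_args` is a hypothetical helper, and treating `profile_id` and `instruct` as strict alternatives is a simplification of this example; the backend may accept other combinations.

```python
def build_speech_args(text, profile_id=None, instruct=None, language=None, steps=None):
    """Assemble generate_speech arguments for clone mode or voice-design mode."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    if profile_id and instruct:
        raise ValueError("pass profile_id (saved voice) or instruct (voice design), not both")
    if not profile_id and not instruct:
        raise ValueError("pass either profile_id or instruct")

    args = {"text": text}
    if profile_id:
        args["profile_id"] = profile_id   # saved or cloned voice
    else:
        args["instruct"] = instruct       # free-text voice description
    # Optional parameters are only included when the caller sets them.
    if language is not None:
        args["language"] = language
    if steps is not None:
        args["steps"] = steps
    return args


design = build_speech_args(
    "Welcome to the future of agentic systems.",
    instruct="warm middle-aged female narrator, calm authoritative pace, documentary style",
)
clone = build_speech_args("Hello.", profile_id="demo0001", steps=16)
print(design)
print(clone)
```

The resulting dictionary can be passed as keyword arguments to `generate_speech` in whichever MCP client is in use.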
5. Video dubbing (desktop UI or REST only)
The MCP server does not expose the dubbing endpoint. The full transcribe → translate → re-voice → mux pipeline lives behind the desktop UI (bun run desktop in $OMNIVOICE_HOME) and the /dub/* REST routes. When the user asks to dub a video, point them to the UI; surface this skill only for the synthesis primitives above.
When NOT to use OmniVoice
- Fast English-only narration on weak hardware → kokoro-tts is ~10× smaller and 2× realtime on CPU (see references/engines-comparison.md)
- Lowest-friction one-off TTS → Edge TTS needs no install or backend
- Highest possible quality regardless of cost → ElevenLabs still wins on English narration polish; OmniVoice ties or wins on multilingual + cloning
- Real-time streaming dictation → use the OmniVoice desktop widget (⌘+⇧+Space), not the MCP server
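The bullets above amount to a small decision rule. The sketch below encodes them as an ordered list of checks. `pick_engine` and its flag names are hypothetical, and the order in which the rules are tested is an assumption of this example; references/engines-comparison.md remains the place for the full comparison.

```python
def pick_engine(english_only=False, weak_hardware=False, one_off=False,
                max_quality=False, realtime=False) -> str:
    """Map the 'when NOT to use OmniVoice' bullets to an engine name."""
    if realtime:
        # Streaming dictation is handled by the desktop widget, not MCP.
        return "OmniVoice desktop widget"
    if max_quality and english_only:
        # Polish on English narration, regardless of cost.
        return "ElevenLabs"
    if english_only and weak_hardware:
        # Smaller model that runs faster than realtime on CPU.
        return "kokoro-tts"
    if one_off:
        # No install and no backend to boot.
        return "Edge TTS"
    # Default: multilingual, cloning, privacy-sensitive, or batch work.
    return "OmniVoice MCP"


print(pick_engine(english_only=True, weak_hardware=True))  # kokoro-tts
print(pick_engine())                                       # OmniVoice MCP
```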
Resources
Backend Swagger / OpenAPI: http://127.0.0.1:3900/docs (when backend is up).
Upstream: github.com/debpalash/OmniVoice-Studio — FSL-1.1-ALv2 (free for personal/internal/non-commercial; auto-converts to Apache-2.0 two years after each release).