audio-generation
Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends.
| name | audio-generation |
| description | Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends. |
Generate audio using generate_media with mode="audio". Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")
# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")
# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)
# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)
| Type | Backends | Description |
|---|---|---|
| "speech" (default) | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| "music" | ElevenLabs only | Music generation from text prompt |
| "sound_effect" | ElevenLabs only | Sound effect generation |
| "voice_conversion" | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| "audio_isolation" | ElevenLabs only | Remove background noise, isolate vocals |
| "voice_design" | ElevenLabs only | Create a new synthetic voice from text description |
| "voice_clone" | ElevenLabs only | Clone a voice from audio samples |
| "dubbing" | ElevenLabs only | Translate and dub audio to another language |
| Backend | Default Model | Supports | API Key |
|---|---|---|---|
| ElevenLabs (priority 1) | eleven_multilingual_v2 | Speech, music, SFX | ELEVENLABS_API_KEY |
| OpenAI (priority 2) | gpt-4o-mini-tts | Speech only | OPENAI_API_KEY |
If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.
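The priority and fallback behavior described above can be pictured with a small sketch. This is not MassGen's actual implementation: `candidate_backends`, `generate_with_fallback`, the `call_backend` callable, and the backend name `"elevenlabs"` are hypothetical names chosen for illustration. Only the environment variable names, the priority order, and the supported types come from the table.

```python
import os

# Priority order and capabilities, copied from the backend table above.
BACKENDS = [
    {"name": "elevenlabs", "key": "ELEVENLABS_API_KEY",
     "supports": {"speech", "music", "sound_effect"}},
    {"name": "openai", "key": "OPENAI_API_KEY", "supports": {"speech"}},
]


def candidate_backends(audio_type="speech", env=None):
    """Return backend names to try, in priority order (hypothetical helper)."""
    env = os.environ if env is None else env
    return [b["name"] for b in BACKENDS
            if env.get(b["key"]) and audio_type in b["supports"]]


def generate_with_fallback(audio_type, env, call_backend):
    """Try each candidate in turn.

    `call_backend(name)` is a stand-in for the real API call, not a MassGen function.
    """
    errors = []
    for name in candidate_backends(audio_type, env):
        try:
            return name, call_backend(name)
        except Exception as exc:  # sketch only: real code should catch narrower errors
            errors.append(f"{name}: {exc}")
    if errors:
        raise RuntimeError("no backend succeeded: " + "; ".join(errors))
    raise RuntimeError(f"no backend available for {audio_type!r}")
```

With both keys set, a speech request would try ElevenLabs first and OpenAI second, while a music request has no fallback because only ElevenLabs supports it.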
| Parameter | Description | Example |
|---|---|---|
| prompt | Text to speak (speech) or description (music/SFX) | "Hello world!" |
| voice | Voice name or ID | "Rachel", "nova", "alloy" |
| audio_type | Type of audio | "speech", "music", "sound_effect" |
| duration | Length in seconds (music/SFX only) | 30 |
| instructions | Speaking style (OpenAI gpt-4o-mini-tts only) | "warm, reflective tone" |
| audio_format | Output format | "mp3", "wav", "opus" |
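The constraints in the parameter table can be checked before making a call. The sketch below is an illustration only: `check_audio_args` is a hypothetical helper, not part of MassGen, and whether the real tool rejects or silently ignores a mismatched parameter is an assumption left open here.

```python
SPEECH = "speech"
TIMED_TYPES = {"music", "sound_effect"}  # types the table says accept `duration`


def check_audio_args(prompt, audio_type=SPEECH, duration=None, instructions=None):
    """Return a list of human-readable problems; an empty list means the call looks consistent."""
    problems = []
    if not prompt or not prompt.strip():
        problems.append("prompt is empty")
    if duration is not None and audio_type not in TIMED_TYPES:
        problems.append("duration applies to music/SFX only")
    if duration is not None and duration <= 0:
        problems.append("duration must be positive")
    if instructions and audio_type != SPEECH:
        problems.append("instructions describes speaking style, so it only fits speech")
    return problems
```

For example, `check_audio_args("Hello", duration=5)` flags the stray `duration`, while a music request with `duration=30` passes cleanly.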
ElevenLabs (top voices):
| Voice | Character |
|---|---|
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |
OpenAI voices: alloy, echo, fable, onyx, nova, shimmer, coral, sage
For speech, prompt is the literal text to speak. Style guidance goes in instructions:
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
prompt="Welcome to the annual report presentation.",
mode="audio",
voice="alloy",
instructions="warm, reflective tone with measured pacing",
backend_type="openai"
)
# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio") # Bad!
The instructions parameter only works with OpenAI's gpt-4o-mini-tts. With ElevenLabs, set the tone through voice selection instead.
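One way to keep style text out of the prompt is to assemble the call arguments in a helper. The following is a minimal sketch under stated assumptions: `speech_kwargs` is a hypothetical function, and it only forwards the style when `backend_type="openai"` is requested explicitly, since that is the only case where the guide says instructions takes effect.

```python
def speech_kwargs(text, style=None, backend_type=None, voice=None):
    """Build keyword arguments for a speech call, keeping style out of the prompt."""
    # The prompt is always the literal text to speak, never a styled request.
    kwargs = {"prompt": text, "mode": "audio"}
    if voice:
        kwargs["voice"] = voice
    if backend_type:
        kwargs["backend_type"] = backend_type
    # Forward style only when the OpenAI backend is explicitly requested.
    if style and backend_type == "openai":
        kwargs["instructions"] = style
    return kwargs
```

The result could then be passed along as `generate_media(**speech_kwargs(...))`. When the backend is auto-selected, the style is dropped here, so the tone has to come from the chosen voice.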
Use read_media (not generate_media) to analyze existing audio:
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")