videoagent-audio-studio
// Tired of juggling multiple audio APIs? This skill gives you one-command access to TTS, music generation, sound effects, and voice cloning. Use when you want to generate any audio without managing multiple API keys.
| name | videoagent-audio-studio |
| version | 3.0.0 |
| author | wells |
| emoji | 🎙️ |
| tags | ["video","audio","tts","music","sfx","voice-clone","elevenlabs","fal"] |
| description | Tired of juggling multiple audio APIs? This skill gives you one-command access to TTS, music generation, sound effects, and voice cloning. Use when you want to generate any audio without managing multiple API keys. |
| homepage | https://github.com/pexoai/audiomind-skill |
| metadata | {"openclaw":{"emoji":"🎙️","primaryEnv":"ELEVENLABS_API_KEY","requires":{"env":["ELEVENLABS_API_KEY"]},"install":[{"id":"elevenlabs-mcp","kind":"npm","package":"@elevenlabs/mcp","label":"Install ElevenLabs MCP server"}]}} |
Use when: User asks to generate speech, narrate text, create a voice-over, compose music, or produce a sound effect.
VideoAgent Audio Studio is a smart audio dispatcher. It analyzes your request, routes it to the best available model (ElevenLabs for speech, sound effects, and voice cloning; fal.ai for music), and returns a ready-to-use audio URL.
| Request Type | Best Model | Latency |
|---|---|---|
| Narrate text / Voice-over | elevenlabs-tts-v3 | ~3s |
| Low-latency TTS (real-time) | elevenlabs-tts-turbo | <1s |
| Background music | cassetteai-music | ~15s |
| Sound effect | elevenlabs-sfx | ~5s |
| Clone a voice from audio | elevenlabs-voice-clone | ~10s |
bash {baseDir}/tools/start_server.sh
This starts the ElevenLabs MCP server on port 8124. The skill uses it for all audio generation.
Analyze the user's request and call the appropriate tool via the MCP server:
Text-to-Speech (TTS)
When user asks to "narrate", "read aloud", "say", or "create a voice-over":
Use MCP tool: text_to_speech
text: "<the text to narrate>"
voice_id: "JBFqnCBsd6RMkjVDRZzb" # Default: "George" (professional, neutral)
model_id: "eleven_multilingual_v2" # Use "eleven_turbo_v2_5" for low latency
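The parameter choice above could be wrapped in a small helper. This is a minimal sketch, not the skill's actual code: `buildTtsArgs` and its `lowLatency` option are hypothetical names, while the voice ID and the two model IDs are the ones listed above.

```javascript
// Hypothetical helper: builds the argument object for the text_to_speech tool.
// The default voice and the model IDs come from the section above.
const DEFAULT_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"; // "George"

function buildTtsArgs(text, options = {}) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  const { voiceId = DEFAULT_VOICE_ID, lowLatency = false } = options;
  return {
    text: text.trim(),
    voice_id: voiceId,
    // Assumption: callers opt in to the turbo model explicitly.
    model_id: lowLatency ? "eleven_turbo_v2_5" : "eleven_multilingual_v2",
  };
}

console.log(buildTtsArgs("Welcome to our product launch"));
console.log(buildTtsArgs("Live caption test", { lowLatency: true }));
```

Keeping the defaults in one place means a request only has to state what differs from them, such as a custom voice or the low-latency model.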
Music Generation
When user asks to "compose", "create background music", or "make a soundtrack":
Use MCP tool: text_to_sound_effects (via cassetteai-music on fal.ai)
prompt: "<music description, e.g. 'upbeat lo-fi hip hop, 90 seconds'>"
duration_seconds: <duration>
Sound Effect (SFX)
When user asks for a specific sound (e.g., "a door creaking", "rain on a window"):
Use MCP tool: text_to_sound_effects
text: "<sound description>"
duration_seconds: <1-22>
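Since the duration must stay within 1 to 22 seconds, a caller may want to normalize it before invoking the tool. The sketch below is one way to do that; `buildSfxArgs` is a hypothetical name, and clamping (rather than rejecting) out-of-range values is an assumption, not documented behavior of the skill.

```javascript
// Bounds taken from the duration range stated above.
const SFX_MIN_SECONDS = 1;
const SFX_MAX_SECONDS = 22;

// Hypothetical helper: builds arguments for text_to_sound_effects,
// clamping the requested duration into the supported range.
function buildSfxArgs(description, seconds) {
  if (typeof description !== "string" || description.trim() === "") {
    throw new Error("description must be a non-empty string");
  }
  if (!Number.isFinite(seconds)) {
    throw new Error("seconds must be a finite number");
  }
  const clamped = Math.min(SFX_MAX_SECONDS, Math.max(SFX_MIN_SECONDS, seconds));
  return { text: description.trim(), duration_seconds: clamped };
}

console.log(buildSfxArgs("a door creaking", 30)); // duration_seconds is clamped to 22
```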
Voice Cloning
When user provides an audio sample and wants to clone the voice:
Use MCP tool: voice_add
name: "<voice name>"
files: ["<audio_file_url>"]
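The routing rules above can be approximated with ordered keyword patterns. This is only a rough sketch under stated assumptions: the skill relies on the agent's own analysis of the request, so `ROUTES` and `routeRequest` are hypothetical, and the patterns cover just the trigger phrases quoted above. The model names are the ones from the routing table.

```javascript
// Ordered rules: the first matching pattern wins.
// Patterns are illustrative and limited to the phrases quoted in this document.
const ROUTES = [
  { type: "clone", model: "elevenlabs-voice-clone", pattern: /\bclon(e|ing)\b/i },
  { type: "music", model: "cassetteai-music", pattern: /\b(compose|background music|soundtrack)\b/i },
  { type: "tts", model: "elevenlabs-tts-v3", pattern: /\b(narrate|read aloud|say|voice-?over)\b/i },
  { type: "sfx", model: "elevenlabs-sfx", pattern: /\b(sound effect|sfx)\b/i },
];

// Returns { type, model } for the first matching rule, or null when
// no keyword matches and the request needs the agent's judgment.
function routeRequest(request) {
  const match = ROUTES.find((route) => route.pattern.test(request));
  return match ? { type: match.type, model: match.model } : null;
}

console.log(routeRequest("Generate 60 seconds of relaxing background music for a podcast"));
console.log(routeRequest("Generate a sci-fi style door opening sound effect"));
```

Returning `null` instead of guessing a default keeps ambiguous requests visible, so they can be handled by a clarifying question or a richer classifier.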
User: "Voice this text for me: Welcome to our product launch"
→ Route to: text_to_speech
text: "Welcome to our product launch"
voice_id: "JBFqnCBsd6RMkjVDRZzb"
model_id: "eleven_multilingual_v2"
🎙️ Voiceover done! Listen here
User: "Generate 60 seconds of relaxing background music for a podcast"
→ Route to: cassetteai-music (fal.ai)
prompt: "relaxing lo-fi background music for a podcast, gentle piano and soft beats, 60 seconds"
duration_seconds: 60
🎵 Background music ready! Listen here
User: "Generate a sci-fi style door opening sound effect"
→ Route to: text_to_sound_effects
text: "a futuristic sci-fi door sliding open with a hydraulic hiss"
duration_seconds: 3
Set ELEVENLABS_API_KEY in ~/.openclaw/openclaw.json:
{
"skills": {
"entries": {
"videoagent-audio-studio": {
"enabled": true,
"env": {
"ELEVENLABS_API_KEY": "your_elevenlabs_key_here"
}
}
}
}
}
Get your key at elevenlabs.io/app/settings/api-keys.
For music generation through fal.ai, also add a FAL_KEY entry to the same env block:
"FAL_KEY": "your_fal_key_here"
Get your key at fal.ai/dashboard/keys.
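To check a configuration before starting the server, the env block can be pulled out of the parsed JSON and compared against the required variables. The helpers below are hypothetical and operate on an already parsed object; how OpenClaw itself loads and validates `openclaw.json` is not covered in this document.

```javascript
// Sample config mirroring the JSON shown above (placeholder key, not a real one).
const sampleConfig = {
  skills: {
    entries: {
      "videoagent-audio-studio": {
        enabled: true,
        env: { ELEVENLABS_API_KEY: "your_elevenlabs_key_here" },
      },
    },
  },
};

// Hypothetical helper: returns a copy of the env block for one skill, or {} if absent.
function getSkillEnv(config, skillName) {
  const entry = config?.skills?.entries?.[skillName];
  return entry && entry.env ? { ...entry.env } : {};
}

// Hypothetical helper: lists required variables that are missing or empty.
function missingKeys(env, required) {
  return required.filter((name) => !env[name]);
}

const skillEnv = getSkillEnv(sampleConfig, "videoagent-audio-studio");
console.log(missingKeys(skillEnv, ["ELEVENLABS_API_KEY"])); // prints []
console.log(missingKeys(skillEnv, ["ELEVENLABS_API_KEY", "FAL_KEY"])); // prints [ 'FAL_KEY' ]
```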
The cli.js connects to a hosted proxy by default. If you want full control — or need to serve users in regions where vercel.app is blocked — you can deploy your own instance from the proxy/ directory.
cd proxy
npm install
vercel --prod
Set these in your Vercel project (Dashboard → Settings → Environment Variables):
| Variable | Required For | Where to Get |
|---|---|---|
| ELEVENLABS_API_KEY | TTS, SFX, Voice Clone | elevenlabs.io/app/settings/api-keys |
| FAL_KEY | Music generation | fal.ai/dashboard/keys |
| VALID_PRO_KEYS | (Optional) Restrict access | Comma-separated list of allowed client keys |
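Because VALID_PRO_KEYS is a comma-separated list, an access check comes down to parsing it into a set and testing membership. The sketch below illustrates that idea only. `parseAllowedKeys` and `isClientAllowed` are hypothetical names, and treating an unset variable as "no restriction" is an assumption inferred from the variable being optional; the proxy's actual enforcement logic is not shown in this document.

```javascript
// Hypothetical helper: turns "key1, key2,key3" into a Set of trimmed, non-empty keys.
function parseAllowedKeys(raw) {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((key) => key.trim())
      .filter((key) => key.length > 0)
  );
}

// Hypothetical check: allows every client when no keys are configured
// (assumption), otherwise requires an exact match.
function isClientAllowed(raw, clientKey) {
  const allowed = parseAllowedKeys(raw);
  if (allowed.size === 0) return true;
  return allowed.has(clientKey);
}

console.log(isClientAllowed("alpha, beta", "beta")); // prints true
console.log(isClientAllowed("alpha, beta", "gamma")); // prints false
```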
Once deployed, point the skill at your own proxy by setting AUDIOMIND_PROXY_URL:
export AUDIOMIND_PROXY_URL="https://your-domain.com/api/audio"
Or set it in ~/.openclaw/openclaw.json:
{
"skills": {
"entries": {
"videoagent-audio-studio": {
"env": {
"AUDIOMIND_PROXY_URL": "https://your-domain.com/api/audio"
}
}
}
}
}
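With two places to set AUDIOMIND_PROXY_URL, a client needs some precedence rule. The sketch below assumes the shell environment wins over the config file, which in turn wins over the hosted default. That order and the `resolveProxyUrl` name are assumptions, not documented behavior of cli.js, and the hosted default is passed in as a parameter because its URL is not given here.

```javascript
// Hypothetical resolver: picks the first configured proxy URL and validates it.
// processEnv: environment variables (e.g. process.env)
// skillEnv:   the skill's env block from openclaw.json
// fallbackUrl: the hosted default (not specified in this document)
function resolveProxyUrl(processEnv, skillEnv, fallbackUrl) {
  const candidate =
    processEnv.AUDIOMIND_PROXY_URL || skillEnv.AUDIOMIND_PROXY_URL || fallbackUrl;
  if (!candidate) {
    throw new Error("no proxy URL configured");
  }
  // new URL() throws a TypeError on malformed input, so typos surface early.
  return new URL(candidate).href;
}

const resolved = resolveProxyUrl(
  {},
  { AUDIOMIND_PROXY_URL: "https://your-domain.com/api/audio" },
  "https://hosted-proxy.example/api/audio" // placeholder, not the real default
);
console.log(resolved); // prints https://your-domain.com/api/audio
```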
If your users are in mainland China, bind a custom domain in Vercel Dashboard → Settings → Domains to avoid DNS issues with vercel.app.
| Model ID | Type | Provider | Notes |
|---|---|---|---|
| eleven_multilingual_v2 | TTS | ElevenLabs | Best quality, supports 29 languages |
| eleven_turbo_v2_5 | TTS | ElevenLabs | Ultra-low latency, ideal for real-time |
| eleven_monolingual_v1 | TTS | ElevenLabs | English only, fastest |
| cassetteai-music | Music | fal.ai | Reliable, fast music generation |
| elevenlabs-sfx | SFX | ElevenLabs | High-quality sound effects (up to 22s) |
| elevenlabs-voice-clone | Clone | ElevenLabs | Clone any voice from a short audio sample |
ELEVENLABS_API_KEY is all you need to get started. FAL_KEY is now optional.
… cassetteai-music by default, which completes synchronously.
… cassetteai-music as a stable alternative for music generation.