一键导入
videoagent-audio-studio
// Tired of juggling multiple audio APIs? This skill gives you one-command access to TTS, music generation, sound effects, and voice cloning. Use when you want to generate any audio without managing multiple API keys.
// Tired of juggling multiple audio APIs? This skill gives you one-command access to TTS, music generation, sound effects, and voice cloning. Use when you want to generate any audio without managing multiple API keys.
AI video generation skill with auto model selection across Seedance 2, Kling 3.0, HappyHorse, and 10+ models. Produces finished multi-shot videos (5–120s) from text, images, URLs, scripts, or audio — including AI music, lip sync, and multi-shot sequencing. No prompts to write, no models to choose. USE FOR: video production, AI video, make a video, product video, brand video, promotional clip, explainer video, short video, TikTok video, Instagram Reel, YouTube Short, product ad, text-to-video, image-to-video, video generation, AI video agent.
Expert prompt engineering for Google Veo 3.2 (Artemis engine). Use when the user wants to generate a video with Veo 3.2, needs help crafting cinematic prompts, or mentions Veo, Google video generation, or Artemis engine.
AI creative director that turns a user's natural-language idea into a complete storyboard and generates all assets — images, video clips, and audio — automatically. The user only describes what they want; all prompt engineering is handled internally.
Generate short AI videos from text or images — text-to-video, image-to-video, and reference-based generation — with zero API key setup. Use when the user wants to create a video clip, animate an image, or generate video from a description.
Expert prompt engineering for Seedance 2.0. Use when the user wants to generate a video with multimodal assets (images, videos, audio) and needs the best possible prompt.
Tired of juggling 8 API keys? This skill gives you one-command access to Midjourney, Flux, Ideogram, and more, with zero setup. Use when you want to generate any image without worrying about API keys.
| name | videoagent-audio-studio |
| version | 3.0.0 |
| author | wells |
| emoji | 🎙️ |
| tags | ["video","audio","tts","music","sfx","voice-clone","elevenlabs","fal"] |
| description | Tired of juggling multiple audio APIs? This skill gives you one-command access to TTS, music generation, sound effects, and voice cloning. Use when you want to generate any audio without managing multiple API keys. |
| homepage | https://github.com/pexoai/audiomind-skill |
| metadata | {"openclaw":{"emoji":"🎙️","primaryEnv":"ELEVENLABS_API_KEY","requires":{"env":["ELEVENLABS_API_KEY"]},"install":[{"id":"elevenlabs-mcp","kind":"npm","package":"@elevenlabs/mcp","label":"Install ElevenLabs MCP server"}]}} |
Use when: User asks to generate speech, narrate text, create a voice-over, compose music, or produce a sound effect.
VideoAgent Audio Studio is a smart audio dispatcher. It analyzes your request and routes it to the best available model — ElevenLabs for speech and music, fal.ai for fast SFX — and returns a ready-to-use audio URL.
| Request Type | Best Model | Latency |
|---|---|---|
| Narrate text / Voice-over | elevenlabs-tts-v3 | ~3s |
| Low-latency TTS (real-time) | elevenlabs-tts-turbo | <1s |
| Background music | cassetteai-music | ~15s |
| Sound effect | elevenlabs-sfx | ~5s |
| Clone a voice from audio | elevenlabs-voice-clone | ~10s |
bash {baseDir}/tools/start_server.sh
This starts the ElevenLabs MCP server on port 8124. The skill uses it for all audio generation.
Analyze the user's request and call the appropriate tool via the MCP server:
Text-to-Speech (TTS)
When user asks to "narrate", "read aloud", "say", or "create a voice-over":
Use MCP tool: text_to_speech
text: "<the text to narrate>"
voice_id: "JBFqnCBsd6RMkjVDRZzb" # Default: "George" (professional, neutral)
model_id: "eleven_multilingual_v2" # Use "eleven_turbo_v2_5" for low latency
Music Generation
When user asks to "compose", "create background music", or "make a soundtrack":
Use MCP tool: text_to_sound_effects (via cassetteai-music on fal.ai)
prompt: "<music description, e.g. 'upbeat lo-fi hip hop, 90 seconds'>"
duration_seconds: <duration>
Sound Effect (SFX)
When user asks for a specific sound (e.g., "a door creaking", "rain on a window"):
Use MCP tool: text_to_sound_effects
text: "<sound description>"
duration_seconds: <1-22>
Voice Cloning
When user provides an audio sample and wants to clone the voice:
Use MCP tool: voice_add
name: "<voice name>"
files: ["<audio_file_url>"]
User: "Voice this text for me: Welcome to our product launch"
→ Route to: text_to_speech
text: "Welcome to our product launch"
voice_id: "JBFqnCBsd6RMkjVDRZzb"
model_id: "eleven_multilingual_v2"
🎙️ Voiceover done! Listen here
User: "Generate 60 seconds of relaxing background music for a podcast"
→ Route to: cassetteai-music (fal.ai)
prompt: "relaxing lo-fi background music for a podcast, gentle piano and soft beats, 60 seconds"
duration_seconds: 60
🎵 Background music ready! Listen here
User: "Generate a sci-fi style door opening sound effect"
→ Route to: text_to_sound_effects
text: "a futuristic sci-fi door sliding open with a hydraulic hiss"
duration_seconds: 3
Set ELEVENLABS_API_KEY in ~/.openclaw/openclaw.json:
{
"skills": {
"entries": {
"videoagent-audio-studio": {
"enabled": true,
"env": {
"ELEVENLABS_API_KEY": "your_elevenlabs_key_here"
}
}
}
}
}
Get your key at elevenlabs.io/app/settings/api-keys.
"FAL_KEY": "your_fal_key_here"
Get your key at fal.ai/dashboard/keys.
The cli.js connects to a hosted proxy by default. If you want full control — or need to serve users in regions where vercel.app is blocked — you can deploy your own instance from the proxy/ directory.
cd proxy
npm install
vercel --prod
Set these in your Vercel project (Dashboard → Settings → Environment Variables):
| Variable | Required For | Where to Get |
|---|---|---|
ELEVENLABS_API_KEY | TTS, SFX, Voice Clone | elevenlabs.io/app/settings/api-keys |
FAL_KEY | Music generation | fal.ai/dashboard/keys |
VALID_PRO_KEYS | (Optional) Restrict access | Comma-separated list of allowed client keys |
export AUDIOMIND_PROXY_URL="https://your-domain.com/api/audio"
Or set it in ~/.openclaw/openclaw.json:
{
"skills": {
"entries": {
"videoagent-audio-studio": {
"env": {
"AUDIOMIND_PROXY_URL": "https://your-domain.com/api/audio"
}
}
}
}
}
If your users are in mainland China, bind a custom domain in Vercel Dashboard → Settings → Domains to avoid DNS issues with vercel.app.
| Model ID | Type | Provider | Notes |
|---|---|---|---|
eleven_multilingual_v2 | TTS | ElevenLabs | Best quality, supports 29 languages |
eleven_turbo_v2_5 | TTS | ElevenLabs | Ultra-low latency, ideal for real-time |
eleven_monolingual_v1 | TTS | ElevenLabs | English only, fastest |
cassetteai-music | Music | fal.ai | Reliable, fast music generation |
elevenlabs-sfx | SFX | ElevenLabs | High-quality sound effects (up to 22s) |
elevenlabs-voice-clone | Clone | ElevenLabs | Clone any voice from a short audio sample |
ELEVENLABS_API_KEY is all you need to get started. FAL_KEY is now optional.cassetteai-music by default, which completes synchronously.cassetteai-music as a stable alternative for music generation.