speech-use
Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice.
| name | speech-use |
| description | Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice. |
Use this skill to perform Text-to-Speech (TTS), Speech-to-Text (STT), and Voice Cloning operations.
This skill uses portable Python scripts managed by uv.
Environment Variables:
- GOOGLE_API_KEY (for TTS via Gemini)
- GOOGLE_CLOUD_PROJECT (Required for STT and Voice Cloning)
- GOOGLE_APPLICATION_CREDENTIALS (Recommended for STT/Voice Cloning)

APIs Enabled:
- texttospeech.googleapis.com
- speech.googleapis.com

Generate audio from text using Gemini-TTS.
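Before running any of the commands that follow, a small pre-flight check can confirm that the environment variables listed above are set. The sketch below uses only the standard library. The variable names come from this page, but grouping them by operation (`tts`, `stt`, `clone`) and the helper name `missing_vars` are assumptions made for illustration, not something the skill's scripts provide.

```python
import os

# Variable names come from the list above. Grouping them per operation
# is an assumption made for this sketch, not something the scripts expose.
NEEDED = {
    "tts": ["GOOGLE_API_KEY"],
    "stt": ["GOOGLE_CLOUD_PROJECT"],
    "clone": ["GOOGLE_CLOUD_PROJECT"],
}
RECOMMENDED = {
    "tts": [],
    "stt": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "clone": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def missing_vars(operation, env=None):
    """Return (missing_needed, missing_recommended) for one operation."""
    env = os.environ if env is None else env
    # An empty string counts as unset, since it cannot authenticate anything.
    needed = [name for name in NEEDED[operation] if not env.get(name)]
    recommended = [name for name in RECOMMENDED[operation] if not env.get(name)]
    return needed, recommended
```

Calling `missing_vars("stt")` with no second argument inspects the real process environment. Passing a plain dict instead makes the check easy to exercise in isolation.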
Standard Voice:
uv run skills/speech-use/scripts/generate_speech.py "Hello world, this is a test." --voice Puck --output hello.wav
Custom Voice (Cloned):
uv run skills/speech-use/scripts/generate_speech.py "This is my custom voice speaking." --voice-cloning-key "YOUR_KEY_HERE" --output custom.wav
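When the two invocations above are driven from another Python program, it can help to build the argument list in one place. The following is a minimal sketch, assuming only the script path and the `--voice`, `--voice-cloning-key`, and `--output` flags shown on this page. The function names `build_tts_command` and `synthesize` are hypothetical, and whether the script accepts both voice options at once is not stated here, so the sketch does not enforce either rule.

```python
import subprocess

# Path as shown in the commands above, relative to the repository root.
TTS_SCRIPT = "skills/speech-use/scripts/generate_speech.py"


def build_tts_command(text, output, voice=None, voice_cloning_key=None):
    """Build the argv list for generate_speech.py without running it."""
    cmd = ["uv", "run", TTS_SCRIPT, text]
    if voice is not None:
        cmd += ["--voice", voice]
    if voice_cloning_key is not None:
        cmd += ["--voice-cloning-key", voice_cloning_key]
    cmd += ["--output", output]
    return cmd


def synthesize(text, output, **kwargs):
    """Run the script; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_tts_command(text, output, **kwargs), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the text contains quotes or other shell metacharacters.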
Generate a voiceCloningKey from a reference audio file and a consent file.
Requirements:
- reference.wav: 10-30s of clear speech (the voice to clone).
- consent.wav: The speaker saying: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."

uv run skills/speech-use/scripts/create_custom_voice.py --reference-audio reference.wav --consent-audio consent.wav
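The 10-30 second requirement on reference.wav can be checked locally before the command above is run. This sketch uses Python's standard `wave` module, which reads uncompressed PCM WAV files only. The helper names are hypothetical, and how the service itself measures clip length is not documented on this page, so treat the result as a rough guard rather than a guarantee.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration is the frame count divided by frames per second.
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(path, min_s=10.0, max_s=30.0):
    """True if the clip falls inside the 10-30 s window stated above."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A clip that fails this check is better trimmed or re-recorded before it is submitted.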
Save the output key to use with generate_speech.py.
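One way to save the key is in a small file that only the current user can read. The sketch below assumes the key is a single line of text and that it should be handled like a credential. Neither point is stated on this page, and `save_key` and `load_key` are hypothetical helper names.

```python
import os
from pathlib import Path


def save_key(key, path):
    """Write the key to `path` and return the path."""
    target = Path(path)
    # On POSIX a newly created file gets mode 0o600 (owner read/write only).
    # A file that already exists keeps whatever mode it had.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(key.strip() + "\n")
    return target


def load_key(path):
    """Read the key back, dropping surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()
```

The string returned by `load_key` can then be passed as the value of `--voice-cloning-key`.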
Transcribe audio files using Chirp 3.
uv run skills/speech-use/scripts/transcribe_audio.py audio.wav --language en-US --output transcript.txt
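To transcribe a folder of recordings, the command above can be generated once per file. This is a sketch under stated assumptions: it uses only the positional audio path and the `--language` and `--output` flags shown on this page, writes each transcript next to its audio file with a `.txt` suffix (a naming choice of this example), and the generator name `transcribe_commands` is hypothetical.

```python
from pathlib import Path

# Path as shown in the command above, relative to the repository root.
STT_SCRIPT = "skills/speech-use/scripts/transcribe_audio.py"


def transcribe_commands(audio_dir, language="en-US"):
    """Yield one argv list per .wav file, writing <name>.txt next to it."""
    # Sorting keeps the processing order stable across runs.
    for audio in sorted(Path(audio_dir).glob("*.wav")):
        transcript = audio.with_suffix(".txt")
        yield ["uv", "run", STT_SCRIPT, str(audio),
               "--language", language, "--output", str(transcript)]
```

Each yielded list can be handed to `subprocess.run(cmd, check=True)` so that a failed transcription stops the batch instead of passing silently.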
generate_speech.py
- --voice: Prebuilt voice (e.g., Kore, Puck, Fenrir, Aoede).
- --voice-cloning-key: Key from create_custom_voice.py.
- --model: Default gemini-2.5-flash-preview-tts.

transcribe_audio.py
- --model: Default chirp_3.
- --language: Default auto.
- --location: Cloud region (default us).

Before running scripts, review the reference guides for available voices and options.