| name | tts |
| description | Synthesize speech from text and play it through the macOS speakers.
Talks to a local LocalKin Service Audio Server (default :8001 / Kokoro).
When `record` is running with audio=true, the spoken audio is
captured into the recording; use this in place of `shell say` for
high-quality multilingual narration in demo videos. Set the
TTS_ENDPOINT env var to point at a different server.
|
| command | ["sh","-c","T=\"$1\"\nS=\"$2\"\nW=\"$3\"\n# Kernel strips unsubstituted {{vars}} to \"\", so empty == \"param\n# not passed\". Don't add `[ \"$X\" = \"{{name}}\" ]` sentinels; those\n# self-defeat when the caller passes a real value.\n# Default to a Chinese female voice when the text contains CJK\n# characters, otherwise let the server pick. The server silently\n# falls back to English-only Kokoro on missing speaker, which\n# mispronounces Chinese as the literal phrase \"Chinese letter\".\nif [ -z \"$S\" ] && printf '%s' \"$T\" | LC_ALL=C grep -q '[^[:print:][:space:]]'; then\n S=\"${TTS_DEFAULT_ZH_SPEAKER:-zf_xiaoxiao}\"\nfi\nif [ -n \"$S\" ]; then\n PAYLOAD=$(jq -nc --arg t \"$T\" --arg s \"$S\" '{text:$t,speaker:$s}')\nelse\n PAYLOAD=$(jq -nc --arg t \"$T\" '{text:$t}')\nfi\nOUT=$(mktemp -t kinclaw-tts).wav\nHTTP=$(printf '%s' \"$PAYLOAD\" \\\n | curl -sS -X POST \"${TTS_ENDPOINT:-http://localhost:8001}/synthesize\" \\\n -H 'Content-Type: application/json' \\\n --data-binary @- \\\n -o \"$OUT\" -w '%{http_code}')\nif [ \"$HTTP\" != \"200\" ]; then\n echo \"tts: server returned HTTP $HTTP\" >&2\n cat \"$OUT\" >&2\n rm -f \"$OUT\"\n exit 1\nfi\n# wait=false (default): play in background, return immediately so\n# the agent can continue acting while audio is still narrating.\n# During `record` this gives parallel narration + action without\n# the recording capturing dead air.\n# wait=true: block until afplay finishes; use only when the next\n# action visually depends on what was just said.\nif [ \"$W\" = \"true\" ]; then\n afplay \"$OUT\" || exit $?\n printf 'spoken: %s\\nspeaker: %s\\nmode: blocking\\npath: %s\\n' \"$T\" \"${S:-<server default>}\" \"$OUT\"\nelse\n ( afplay \"$OUT\" >/dev/null 2>&1 ) &\n printf 'spoken: %s\\nspeaker: %s\\nmode: background pid=%d\\npath: %s\\n' \"$T\" \"${S:-<server default>}\" \"$!\" \"$OUT\"\nfi\n","_"] |
| args | ["{{text}}","{{speaker}}","{{wait}}"] |
| schema | {"text":{"type":"string","description":"Text to speak. Required. Captured into video by `record` when audio=true.","required":true},"speaker":{"type":"string","description":"Kokoro speaker id. **The server's field name is `speaker`, not `voice`.** Passing the wrong key is silently ignored and falls back to the English model, which mispronounces Chinese as \"chinese letter\". Examples:\n- Chinese female: `zf_xiaoxiao` (default for CJK text), `zf_xiaobei`, `zf_xiaoni`\n- Chinese male: `zm_yunxi`, `zm_yunjian`\n- English female: `af_bella`, `af_sarah`\n- English male: `am_adam`, `am_michael`\nOmit to let the skill auto-pick: CJK text gets `zf_xiaoxiao`, ASCII gets the server default.\n","required":false},"wait":{"type":"string","description":"\"true\" or \"false\" (default false). Default false plays in the background and returns immediately, so the agent keeps acting while the narration plays; this is recommended during `record` to avoid burning recording time on dead air. Pass \"true\" only when the next action visually depends on what was just said (rare).\n","required":false}} |
| timeout | 120 |
tts: speech synthesis via LocalKin Service Audio (Kokoro)
Wraps the LocalKin Service Audio API at :8001/synthesize and plays
the returned WAV through the macOS default output (afplay).
Why this is a SKILL.md and not a native skill
It's three lines of curl + afplay. Pushing it into pkg/skill/ would
violate the "thin kernel + fat skill" thesis and make it harder for
users to fork. As an external SKILL.md it's also a forge template:
the next HTTP service that needs wrapping can be modeled on this file.
How record captures the narration
record action=start audio=true enables ScreenCaptureKit's system-audio
tap. afplay writes to the default output device, which the tap
captures. End result: the spoken text shows up on the video's audio
track without any extra plumbing.
Examples
tts text="接下来我会打开计算器" speaker="zf_xiaoxiao"
tts text="Now I'll open Safari and search for KinClaw"
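The second example passes no speaker, so the command's auto-pick decides. A minimal sketch of that check, lifted out of the command into a helper for illustration (`pick_speaker` is a hypothetical name, not part of the skill). Note that the byte-level test flags any non-ASCII input, so accented Latin text would also get the Chinese default:

```shell
#!/bin/sh
# pick_speaker TEXT [SPEAKER]  (hypothetical helper, for illustration only)
# Prints the speaker id that would be sent, or an empty line when the
# server default applies.
pick_speaker() {
  text=$1
  speaker=$2
  # An explicit speaker always wins. Otherwise look for any byte that is
  # neither printable ASCII nor whitespace; in the C locale that matches
  # every byte of a UTF-8 multibyte character.
  if [ -z "$speaker" ] && printf '%s' "$text" | LC_ALL=C grep -q '[^[:print:][:space:]]'; then
    speaker=${TTS_DEFAULT_ZH_SPEAKER:-zf_xiaoxiao}
  fi
  printf '%s\n' "$speaker"
}

pick_speaker "Now I'll open Safari" ""           # empty line: server default
pick_speaker "接下来我会打开计算器" ""           # the CJK default speaker
pick_speaker "接下来我会打开计算器" "zm_yunxi"   # explicit speaker is kept
```

Setting TTS_DEFAULT_ZH_SPEAKER changes which id the second call prints.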
Override the endpoint
TTS_ENDPOINT=http://otherbox:8001 kinclaw -soul souls/pilot.soul.md
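The override works through a plain shell default expansion in the command. A small sketch of how the request URL is resolved (`tts_url` is a hypothetical helper name; the expansion itself is the one the command uses):

```shell
#!/bin/sh
# tts_url  (hypothetical helper, for illustration only)
# Prints the URL the command would POST to.
tts_url() {
  # ${VAR:-default} falls back when TTS_ENDPOINT is unset OR empty.
  printf '%s/synthesize\n' "${TTS_ENDPOINT:-http://localhost:8001}"
}

tts_url
```

Because `/synthesize` is appended verbatim, a TTS_ENDPOINT value that ends in a slash yields a double slash in the URL; whether the server tolerates that is not stated here, so leaving the trailing slash off is the safer choice.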
Failure modes
tts: server returned HTTP 000 - curl got no HTTP response, typically because the server isn't running on the configured port.
tts: server returned HTTP 4xx/5xx - server is up but rejected the request; the body is echoed to stderr.
afplay: ... - playback failed (no audio device, sandboxed environment).
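For a caller that wants to react to the two HTTP cases, the status code in the error message is enough to tell them apart. A rough sketch, assuming the caller has already extracted the code from stderr (`explain_tts_status` and its wording are hypothetical, not part of the skill):

```shell
#!/bin/sh
# explain_tts_status CODE  (hypothetical helper, for illustration only)
# Maps the HTTP code reported by the tts command to a short diagnosis.
explain_tts_status() {
  case "$1" in
    200) echo "ok" ;;
    # curl reports 000 when it never received an HTTP response.
    000) echo "unreachable: check that the server is running and TTS_ENDPOINT is right" ;;
    4[0-9][0-9] | 5[0-9][0-9]) echo "rejected: read the response body on stderr" ;;
    *) echo "unexpected status: $1" ;;
  esac
}

explain_tts_status 000
explain_tts_status 422
```

An afplay failure does not go through this path: the audio was already synthesized, and the problem is on the playback side.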