discord-voice
Sutando joins a Discord voice channel and runs a 2-way Gemini Live conversation. Standalone TS process: discord.js + @discordjs/voice + bodhi VoiceSession.
| name | discord-voice |
| description | Sutando joins a Discord voice channel and runs a 2-way Gemini Live conversation. Standalone TS process — discord.js + @discordjs/voice + bodhi VoiceSession. |
| when_to_use | When the user (in a DM or task) asks Sutando to "join voice", "join the lounge", or generally to be present in a Discord voice channel for live conversation. |
Sutando joins a Discord voice channel and runs a real-time 2-way conversation via Gemini Live, reusing the same bodhi VoiceSession + tool wiring as skills/phone-conversation/scripts/conversation-server.ts (Twilio path).
<voice channel name>", or any equivalent.
NOT for: silent presence (no Gemini), text-only Discord channels (use discord-bridge.py), Zoom/Meet/phone (use the respective skills).
One process, all in TypeScript:
Discord user voice
↓
@discordjs/voice receiver (opus packets per speaking user)
↓ prism opus.Decoder → PCM s16le 48k stereo
↓ downsample48StereoTo16Mono
↓
bodhi VoiceSession.handleAudioFromClient (PCM 16k mono)
↓
Gemini Live
↓ base64 PCM 24k mono
↓ upsample24MonoTo48Stereo
↓
@discordjs/voice AudioPlayer → opus-encoded out to voice connection
↓
Discord channel audio out
@discordjs/voice handles Discord's DAVE (E2EE) via DAVESession first-party — no extra config.
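The two sample-rate conversions in the pipeline above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names come from the diagram, but the bodies are assumptions (box-filter averaging for the 48k to 16k decimation, linear interpolation for the 24k to 48k upsample). A production path might use a proper low-pass filter instead.

```typescript
// Assumed implementation: average L+R, then average each group of 3 frames (48k / 16k = 3).
function downsample48StereoTo16Mono(input: Int16Array): Int16Array {
  const framesIn = Math.floor(input.length / 2); // 2 interleaved samples per stereo frame
  const framesOut = Math.floor(framesIn / 3);
  const out = new Int16Array(framesOut);
  for (let i = 0; i < framesOut; i++) {
    let sum = 0;
    for (let k = 0; k < 3; k++) {
      const f = (i * 3 + k) * 2;
      sum += input[f] + input[f + 1]; // left + right
    }
    out[i] = Math.round(sum / 6); // 3 frames x 2 channels
  }
  return out;
}

// Assumed implementation: emit each input sample plus a midpoint to the next one
// (24k -> 48k), duplicating every output frame into both stereo channels.
function upsample24MonoTo48Stereo(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 4); // 2x rate, 2 channels
  for (let i = 0; i < input.length; i++) {
    const cur = input[i];
    const next = i + 1 < input.length ? input[i + 1] : cur;
    const mid = Math.round((cur + next) / 2);
    const o = i * 4;
    out[o] = cur;     // frame 0, left
    out[o + 1] = cur; // frame 0, right
    out[o + 2] = mid; // frame 1, left
    out[o + 3] = mid; // frame 1, right
  }
  return out;
}
```

Both functions operate on Int16Array views of the s16le PCM buffers; wrapping a Node Buffer in an Int16Array is left to the caller.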
bot scope with applications.commands + the voice perms (Connect, Speak, Use Voice Activity).
~/.claude/channels/discord/.env:
DISCORD_BOT_TOKEN=...
GEMINI_API_KEY in .env at the repo root.
DISCORD_VOICE_SERVER=1 \
npx tsx skills/discord-voice/scripts/discord-voice-server.ts \
--guild <GUILD_ID> \
--channel <VOICE_CHANNEL_ID>
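For readers wondering how the two flags in the command above might be consumed, here is a hedged sketch using Node's built-in `parseArgs`. The helper name `parseServerArgs` and the `ServerArgs` shape are hypothetical; the real server may parse its arguments differently.

```typescript
import { parseArgs } from "node:util";

// Hypothetical result shape; not taken from the actual server source.
interface ServerArgs {
  guildId: string;
  channelId: string;
}

function parseServerArgs(argv: string[]): ServerArgs {
  const { values } = parseArgs({
    args: argv,
    options: {
      guild: { type: "string" },
      channel: { type: "string" },
    },
  });
  // Both ids are required: the bot cannot pick a voice channel on its own.
  if (!values.guild || !values.channel) {
    throw new Error("usage: discord-voice-server.ts --guild <GUILD_ID> --channel <VOICE_CHANNEL_ID>");
  }
  return { guildId: values.guild, channelId: values.channel };
}
```

In a real entry point this would be called as `parseServerArgs(process.argv.slice(2))`.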
Optional env:
VOICE_MODEL / VOICE_NATIVE_AUDIO_MODEL: mirrors voice-agent.ts.
SUTANDO_WORKSPACE: workspace root for tasks/results/data/logs and the per-user config (see below).
DISCORD_VOICE_SERVER=1 flips the polymorphic dismiss tool (src/meeting-tools.ts) into "SIGTERM self" mode instead of its default Zoom AppleScript path. Without it, asking Sutando to "leave"/"dismiss" in the channel would try to leave a (non-existent) Zoom meeting.
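The env-driven switch in the dismiss tool can be pictured roughly like this. The sketch is an illustration only: `resolveDismissMode`, `dismiss`, and `DismissActions` are hypothetical names, and comparing against the exact string "1" is an assumption based on the run command.

```typescript
type DismissMode = "sigterm-self" | "zoom-applescript";

// Hypothetical hooks standing in for the real side effects.
interface DismissActions {
  signalSelf: () => void; // Discord mode: terminate this process
  leaveZoom: () => void;  // default mode: the Zoom AppleScript path
}

// Assumption: the flag counts as set only when it equals "1".
function resolveDismissMode(env: Record<string, string | undefined>): DismissMode {
  return env.DISCORD_VOICE_SERVER === "1" ? "sigterm-self" : "zoom-applescript";
}

function dismiss(env: Record<string, string | undefined>, actions: DismissActions): DismissMode {
  const mode = resolveDismissMode(env);
  if (mode === "sigterm-self") {
    actions.signalSelf();
  } else {
    actions.leaveZoom();
  }
  return mode;
}
```

In a live process, `signalSelf` could be `() => process.kill(process.pid, "SIGTERM")`, which hands control to the normal shutdown handler.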
This skill's config carries per-user data (your Discord channel ids, your owner-mode choices), so it does NOT live in the git repo. It lives in the workspace:
$SUTANDO_WORKSPACE/config/discord-voice.json
(default ~/.sutando/workspace/config/discord-voice.json; $SUTANDO_WORKSPACE is resolved by the canonical workspace helper).
The repo ships a committed template — skills/discord-voice/config.json.example — with the safe defaults. On first run, if the workspace config is missing, the server copies the template into place; you then edit the workspace copy. (If the copy can't happen, the server falls back to the built-in defaults — owner_mode: false, every channel read-only.) Never commit a live discord-voice.json back into the repo — it's per-user data, not code.
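The first-run behaviour described above (copy the template if the workspace config is missing, fall back to built-in defaults if that fails) might look roughly like the sketch below. `loadConfig` and `DiscordVoiceConfig` are hypothetical names, the default workspace path stands in for the canonical workspace helper, and falling back to defaults on malformed JSON is an assumption the source does not state.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface DiscordVoiceConfig {
  model: string;
  googleSearch: boolean;
  owner_mode: boolean;
  channels: Record<string, { owner_mode?: boolean }>;
}

// Safe defaults as documented: owner_mode off, no per-channel overrides.
const BUILT_IN_DEFAULTS: DiscordVoiceConfig = {
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  googleSearch: true,
  owner_mode: false,
  channels: {},
};

function loadConfig(
  templatePath: string,
  // Stand-in for the canonical workspace helper (assumption).
  workspaceRoot: string = process.env.SUTANDO_WORKSPACE ?? path.join(os.homedir(), ".sutando", "workspace"),
): DiscordVoiceConfig {
  const configPath = path.join(workspaceRoot, "config", "discord-voice.json");
  try {
    if (!fs.existsSync(configPath)) {
      fs.mkdirSync(path.dirname(configPath), { recursive: true });
      // COPYFILE_EXCL: never overwrite a config that appeared in the meantime.
      fs.copyFileSync(templatePath, configPath, fs.constants.COPYFILE_EXCL);
    }
    const parsed = JSON.parse(fs.readFileSync(configPath, "utf8")) as Partial<DiscordVoiceConfig>;
    return { ...BUILT_IN_DEFAULTS, ...parsed };
  } catch {
    // Copy or read failed: run with the read-only defaults.
    return { ...BUILT_IN_DEFAULTS, channels: {} };
  }
}
```

The failure path is deliberately the most restrictive one: any problem with the file leaves every channel read-only.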
Keys:
model / googleSearch: voice model + Web-grounding preference (defaults: gemini-2.5-flash-native-audio-preview-12-2025, true).
owner_mode: skill-wide owner-mode default (boolean). false by default.
channels: per-voice-channel override map of the form { "<voice_channel_id>": { "owner_mode": true } }. The channel entry is an object so it stays extensible.
Resolution for a given channel: channels[<channel_id>].owner_mode if that entry exists, else the skill-wide owner_mode, else false. A fresh config (owner_mode: false, channels: {}) runs every channel read-only.
{
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "googleSearch": true,
  "owner_mode": false,
  "channels": {
    "111111111111111111": { "owner_mode": true }
  }
}
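The resolution order for owner_mode (per-channel entry, then skill-wide value, then false) can be expressed in a few lines. This is a sketch under stated assumptions: `resolveOwnerMode` is a hypothetical name, and treating a channel entry without an owner_mode field as "no override" is an interpretation of the rule, not something the source spells out.

```typescript
interface ChannelOverride {
  owner_mode?: boolean;
}

interface OwnerModeConfig {
  owner_mode?: boolean;
  channels?: Record<string, ChannelOverride>;
}

function resolveOwnerMode(config: OwnerModeConfig, channelId: string): boolean {
  // 1. A per-channel override wins when it is present.
  const perChannel = config.channels?.[channelId]?.owner_mode;
  if (typeof perChannel === "boolean") return perChannel;
  // 2. Otherwise the skill-wide value; 3. otherwise false.
  return config.owner_mode === true;
}

// Mirrors the example config: one trusted channel, everything else read-only.
const exampleConfig: OwnerModeConfig = {
  owner_mode: false,
  channels: { "111111111111111111": { owner_mode: true } },
};
```

With `exampleConfig`, only channel 111111111111111111 resolves to owner mode; any other channel id falls through to the skill-wide false.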
owner_mode: false is the safe default: non-owner speakers in the voice channel get the read-only tool surface (current time, status checks, lookups) but NOT owner-tier work, file edits, or message sends.
owner_mode: true, whether set skill-wide or per-channel via channels, is the opt-in for single-operator personal-use mode: it grants owner-tier privileges to every speaker in the channel. It has a sharp edge: anyone who can speak in the same voice channel can delegate work, edit files, send messages, anything the proactive loop can do. Only enable it for voice channels whose membership is fully trusted (your own Lounge, never community/public). Prefer the per-channel channels override over the skill-wide owner_mode so a trusted-channel grant doesn't leak to every channel the bot joins. Set it in the workspace config ($SUTANDO_WORKSPACE/config/discord-voice.json), never the committed .example template.
Independently of owner_mode, owner-tier tools are gated per speaker, by Discord user id. Each turn is attributed to the speaker who started it, and tools are gated by that speaker's tier — read from the same ~/.claude/channels/discord/access.json the discord-bridge uses, so the two never drift:
owner: listed in the top-level allowFrom of access.json. Full tool surface: work, dismiss, screen-share, file edits, message sends.
team: listed in groups[*].allowFrom (per-channel trusted circle: peers, collaborators) and not also owner. Read-only inline tools + configurable tools + dismiss; no work / file edits. (dismiss is intentional: a teammate can end the bot's voice session, which is useful when the owner isn't present to close the room; the owner can rejoin via DM.)
owner_mode: true and the per-speaker tiers compose: a speaker gets the owner surface when owner_mode is on for the channel or their id resolves to the owner tier.
This is exactly the model discord-bridge.py uses (top-level allowFrom = owner, groups[*].allowFrom = team), so the same access.json is never read two ways. If allowFrom is empty, the gate falls back to the channel-wide owner_mode resolved from the workspace discord-voice.json (see Owner-mode config above).
This means the bot can sit safely in a shared/multi-person voice channel: a non-owner speaker physically cannot trigger owner-tier tools — the gate runs at tool-execution time, so even if the model tries, the call is denied.
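A rough sketch of the per-speaker gate described above: resolve the speaker's tier from access.json, then decide at tool-execution time. `resolveTier`, `isToolAllowed`, and the three-way `ToolKind` grouping are hypothetical simplifications (the "configurable tools" category is omitted), and denying dismiss to speakers in neither list is an assumption.

```typescript
type Tier = "owner" | "team" | "other";

// Only the fields of access.json that the text mentions.
interface AccessFile {
  allowFrom?: string[];
  groups?: Record<string, { allowFrom?: string[] }>;
}

function resolveTier(access: AccessFile, userId: string): Tier {
  // Owner is checked first, so an id in both lists resolves to owner.
  if (access.allowFrom?.includes(userId)) return "owner";
  const inTeam = Object.values(access.groups ?? {}).some(
    (group) => group.allowFrom?.includes(userId) ?? false,
  );
  return inTeam ? "team" : "other";
}

// Hypothetical grouping of the tool surface for illustration.
type ToolKind = "read_only" | "dismiss" | "owner_only";

// Called when a tool is about to execute, so a call the model attempts
// on behalf of a non-owner speaker is still denied.
function isToolAllowed(kind: ToolKind, tier: Tier, ownerMode: boolean): boolean {
  if (ownerMode || tier === "owner") return true; // the two grants compose with OR
  if (kind === "read_only") return true;
  if (kind === "dismiss") return tier === "team";
  return false; // work, file edits, message sends
}
```

Because the grants are OR-ed, an empty allowFrom leaves nobody in the owner tier, and the outcome then depends only on the channel's resolved owner_mode, which matches the fallback described above.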
The bot joins a voice channel when its owner DMs it "join the lounge voice channel in <server>" — the loop spawns the run command above as a subprocess. The task-bridge → proactive-loop → Bash pipeline handles it.
A join request is honored only when the originating task's access_tier is owner. access_tier is set by discord-bridge.py from access.json (owner = top-level allowFrom; team = the union of groups[*].allowFrom; other = neither). A team- or other-tier "join voice" request is declined — a non-owner cannot make the bot enter a voice channel. This holds at two layers: non-owner Discord tasks are already routed to a read-only sandbox (see CLAUDE.md "Discord access control") which cannot spawn the server, and the join request itself is owner-gated on top of that.
Inherits the full inlineTools + ownerOnlyTools set from src/inline-tools.ts (same surface as voice-agent.ts and conversation-server.ts). Notable Discord-relevant tools:
work: delegate non-trivial tasks to core (writes tasks/voice-task-{ts}.txt, blocks on result).
dismiss: leave the current voice presence. Polymorphic via DISCORD_VOICE_SERVER env: SIGTERMs self in Discord mode, runs Zoom AppleScript otherwise.
share_screen / stop_share_screen: drive Discord's screen-share picker. Has a hard dependency (see below).
summon: skill-local override redirecting "share my screen" to share_screen (the core summon opens Zoom, wrong app when user is in Discord).
get_current_time, get_core_status, join_zoom, join_gmeet, lookup_meeting_id, call_contact: all standard.
share_screen / stop_share_screen are NOT free: they CGEvent-click the Discord webapp's "Share Your Screen" button and the Chrome native share-picker. That means:
chrome-devtools-mcp Chrome profile specifically (at ~/.cache/chrome-devtools-mcp/chrome-profile), so the share happens as whoever is logged into THAT Chrome: not the bot, not necessarily your main Discord. Recommended: create a secondary ("alt") Discord account and log into the MCP-Chrome as that, so your primary Discord (in regular Chrome / desktop app) stays uninterrupted. The alt and the bot both join the voice channel; the alt's screen is what gets shared. The alt must be a member of the same Discord server, since voice channels are server-only (no DM voice for bots / no DM screen-share via this tool).
macos-use refresh_traversal on the MCP-Chrome main PID, then update COORDS in scripts/share-screen-modal.py.
If you don't want screen-sharing, the rest of the skill (voice conversation, tool delegation) works without any of this: share_screen will fail silently with no impact on voice.
SIGTERM/SIGINT triggers cleanupSession() which calls connection.destroy() (sends Discord voice-gateway disconnect frame) and voiceSession.close(). The handler then waits 1.5s before process.exit(0) so the disconnect frame actually flushes — without that delay, Discord pins the bot in-channel until its own 60-90s heartbeat timeout.
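The shutdown sequence above (clean up, wait for the disconnect frame to flush, then exit) could be structured like the sketch below. `makeShutdownHandler` and `ShutdownDeps` are hypothetical names, the dependencies are injected so the flow can be exercised without a live voice connection, and the guard against a second signal is an assumption rather than documented behaviour.

```typescript
interface ShutdownDeps {
  cleanupSession: () => void;   // e.g. connection.destroy() + voiceSession.close()
  exit: (code: number) => void; // process.exit in a real process
  flushDelayMs?: number;        // defaults to the documented 1.5s
}

function makeShutdownHandler(deps: ShutdownDeps): () => Promise<void> {
  let shuttingDown = false;
  return async () => {
    if (shuttingDown) return; // assumption: ignore repeated signals
    shuttingDown = true;
    deps.cleanupSession();
    // Give the voice-gateway disconnect frame time to leave the socket
    // before the process goes away.
    await new Promise<void>((resolve) => setTimeout(resolve, deps.flushDelayMs ?? 1500));
    deps.exit(0);
  };
}
```

In the server this would be wired as one handler registered for both signals, for example `process.on("SIGTERM", handler)` and `process.on("SIGINT", handler)`, with `exit` set to `(code) => process.exit(code)`.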
Transcripts + session metrics land in conversation.sqlite — the shared conversation and sessions tables (also used by voice + phone) — and are mirrored into the shared logs/conversation.log text log.
Operational/diagnostic output (the [Setup]/[Voice]/[Tool]/[VoiceSession]/[Dismiss] lines) is tee'd to $SUTANDO_WORKSPACE/logs/discord-voice.log — the discord-voice counterpart to logs/discord-bridge.log and logs/voice-agent.log — so the operational history survives a process exit.