| name | api |
| description | Deepgram API reference for speech-to-text, text-to-speech, voice agents, audio intelligence, and account management. Use whenever building with Deepgram APIs, REST or WebSocket. Covers authentication, all endpoints, query parameters, request/response schemas, and WebSocket message formats. Reference files are organized by domain: listen (STT), speak (TTS), agent (voice agents), read (text/audio intelligence), models, projects, auth, and self-hosted. |
Deepgram API
Build with Deepgram's speech-to-text, text-to-speech, voice agent, and audio intelligence APIs.
Getting Started
All API requests require authentication via API key or JWT:
- API Key: Authorization: Token <API_KEY>
- JWT: Authorization: Bearer <JWT>
Base servers:
- REST & STT/TTS WebSocket: https://api.deepgram.com
- Voice Agent WebSocket: https://agent.deepgram.com
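The two header schemes above can be sketched as a small helper. This is a minimal illustration, not part of any Deepgram SDK: the function name `auth_header` and the `kind` argument are hypothetical.

```python
def auth_header(credential: str, kind: str = "api_key") -> dict[str, str]:
    """Return the Authorization header for a Deepgram request.

    kind="api_key" uses the Token scheme; kind="jwt" uses the Bearer scheme.
    (Helper name and `kind` values are illustrative, not an official API.)
    """
    if not credential:
        raise ValueError("credential must be a non-empty string")
    schemes = {"api_key": "Token", "jwt": "Bearer"}
    if kind not in schemes:
        raise ValueError(f"unknown credential kind: {kind!r}")
    return {"Authorization": f"{schemes[kind]} {credential}"}
```

The same header applies to REST calls and to the WebSocket upgrade request.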
How Deepgram's APIs Fit Together
                        ┌──────────────────────────────┐
                        │       api.deepgram.com       │
                        └──────────────────────────────┘
                                       │
         ┌──────────────┬──────────────┼──────────────┬──────────────┐
         ▼              ▼              ▼              ▼              ▼
    /v1/listen     /v2/listen     /v1/speak      /v1/read       /v1/projects/*
    Nova - ASR     Flux - conv.   TTS            Text AI        Management
    REST or WSS    WSS only       REST or WSS    REST only      REST only

                        ┌──────────────────────────────┐
                        │      agent.deepgram.com      │
                        └──────────────────────────────┘
                                       │
                                       ▼
                              /v1/agent/converse
                                WebSocket only
                    audio ──▶ STT ──▶ LLM ──▶ TTS ──▶ audio
                   (Deepgram orchestrates the full pipeline)
Which API Should I Use?
Audio → text (transcription)?
├─ General-purpose transcription (captions, batch, call logs, live streams with custom turn logic)
│  └─ Nova models via /v1/listen
│     ├─ Pre-recorded file → REST POST https://api.deepgram.com/v1/listen?model=nova-3
│     └─ Live stream → WSS wss://api.deepgram.com/v1/listen?model=nova-3
│
└─ Conversational audio / voice-agent-style turn detection
   └─ Flux models via /v2/listen
      └─ Live stream → WSS wss://api.deepgram.com/v2/listen?model=flux-general-en
Text → audio?
├─ One-shot → REST POST /v1/speak
└─ Low-latency stream → WSS wss://api.deepgram.com/v1/speak
Full conversational voice agent (audio in, audio out)?
└─ WSS wss://agent.deepgram.com/v1/agent/converse
   Deepgram handles STT + your configured LLM + TTS internally
Analyze text for insights?
└─ REST POST /v1/read
   (summaries, sentiment, topics, intents)
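As a rough sketch of the pre-recorded branch above, the request can be assembled with the Python standard library. The function name `build_prerecorded_request`, the `audio/wav` Content-Type default, and the lowercase `true`/`false` encoding of booleans are assumptions made for this example; the request is built but not sent.

```python
import urllib.parse
import urllib.request


def build_prerecorded_request(
    api_key: str,
    audio: bytes,
    content_type: str = "audio/wav",  # assumption: caller knows the file type
    **options,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/listen with options on the URL."""

    def encode(value):
        # Assumption: boolean flags are written as lowercase true/false.
        if isinstance(value, bool):
            return "true" if value else "false"
        return value

    params = {"model": "nova-3"}
    params.update({key: encode(value) for key, value in options.items()})
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        url=f"https://api.deepgram.com/v1/listen?{query}",
        data=audio,  # the body carries only the audio bytes
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; response parsing is left to the schemas in listen.md.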
Speech-to-Text: Nova (/v1/listen) vs Flux (/v2/listen)
Both model families are actively maintained and industry-leading. They solve different problems, so pick the one that matches your use case.
| | Nova (/v1/listen) | Flux (/v2/listen) |
|---|---|---|
| Endpoint | /v1/listen | /v2/listen |
| Available models | nova-3, nova-2, nova, enhanced, base | flux-general-en |
| Best for | General transcription: captions, subtitles, call logs, batch | Conversational audio: voice agents, interactive assistants, turn-taking UIs |
| Output | Continuous transcript stream | Structured turn events + transcripts (built-in turn state machine) |
| Turn detection | Manual (utterance_end_ms, VAD events) | Built-in (EOT, eager-EOT, turn_index) |
| Transports | REST + WebSocket | WebSocket only |
| Intelligence overlays | Yes: summarize, sentiment, topics, intents, diarize, redact, etc. | No: smaller, focused param set; no smart_format / diarize / punctuate |
| Mid-session reconfig | No (reconnect to change) | Yes (Configure message updates EOT thresholds + keyterms live) |
Pick Nova (/v1/listen, model=nova-3) when:
- Generating captions, subtitles, or transcripts for recorded media
- Running batch transcription over files (REST)
- You need analytics overlays (summarize, sentiment, topics, intents, diarize, redact)
- You want WebSocket streaming with your own turn-detection logic
Pick Flux (/v2/listen, model=flux-general-en) when:
- Building an interactive voice agent or assistant
- You want end-of-turn detection handled for you
- You need low-latency turn signals and barge-in support
- You want to update EOT thresholds or keyterms mid-session without reconnecting
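The endpoint rules in this comparison can be captured in a small URL chooser. This is a sketch under the constraints stated above (Flux is WebSocket-only on /v2/listen, Nova-family models use /v1/listen); the function name `listen_url` and the `streaming` flag are hypothetical.

```python
from urllib.parse import urlencode

# Model names taken from the comparison table above.
NOVA_FAMILY = {"nova-3", "nova-2", "nova", "enhanced", "base"}
FLUX_FAMILY = {"flux-general-en"}


def listen_url(model: str, streaming: bool) -> str:
    """Return the listen URL for a model, rejecting unsupported combinations."""
    query = urlencode({"model": model})
    if model in FLUX_FAMILY:
        if not streaming:
            raise ValueError("Flux is WebSocket only; /v2/listen has no REST mode")
        return f"wss://api.deepgram.com/v2/listen?{query}"
    if model in NOVA_FAMILY:
        scheme = "wss" if streaming else "https"
        return f"{scheme}://api.deepgram.com/v1/listen?{query}"
    # Covers typos and shorthand such as model=flux, which is not a valid value.
    raise ValueError(f"unknown model: {model!r}")
```

Failing fast on an unknown model name keeps a typo from surfacing later as a connection error.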
Migrating from Nova 3 to Flux? See the official Nova 3 → Flux migration guide.
API Domains
| Domain | REST | WebSocket | Reference |
|---|---|---|---|
| Listen v1 (STT, Nova models) | POST /v1/listen | wss://api.deepgram.com/v1/listen | listen.md |
| Listen v2 (STT, Flux, conversational) | - | wss://api.deepgram.com/v2/listen | listen.md |
| Speak (TTS) | POST /v1/speak | wss://api.deepgram.com/v1/speak | speak.md |
| Voice Agent | GET /v1/agent/settings/think/models | wss://agent.deepgram.com/v1/agent/converse | agent.md |
| Read (Intelligence) | POST /v1/read | - | read.md |
| Models | GET /v1/models | - | models.md |
| Projects | /v1/projects/* | - | projects.md |
| Auth | POST /v1/auth/grant | - | auth.md |
| Self-Hosted | /v1/projects/*/selfhosted/* | - | self-hosted.md |
Common Mistakes to Avoid
All APIs
- Feature flags are query params, except for Voice Agent and Flux mid-session updates. For /v1/listen, /v2/listen, and /v1/speak, initial options go on the URL. The request body carries only audio data (REST) or audio frames (WebSocket). Two exceptions: /v1/agent/converse has no URL query params at all (all config goes in the Settings message); and /v2/listen supports a Configure message after connection to update EOT thresholds and keyterms mid-session. Also note that /v2/listen has a much smaller param set than /v1/listen: flags like smart_format, diarize, and punctuate are not available.
- Rate limits are concurrent connections, not total requests. A 429 means too many simultaneous open connections, not too high a request volume. Diarization and other compute-heavy features reduce your concurrency allowance further.
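Because the limit is on simultaneous connections, one way to stay under it is to cap in-flight work on the client side. The sketch below uses `asyncio.Semaphore`; the `ConnectionLimiter` class, the `fake_transcribe` stand-in, and the limit of 2 are illustrative only, since the real allowance depends on your plan and enabled features.

```python
import asyncio
from typing import Awaitable, Callable


class ConnectionLimiter:
    """Caps how many jobs (for example, open connections) run at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.open_now = 0
        self.peak = 0  # highest concurrency observed, for diagnostics

    async def run(self, job: Callable[[], Awaitable]):
        async with self._semaphore:  # waits here when the cap is reached
            self.open_now += 1
            self.peak = max(self.peak, self.open_now)
            try:
                return await job()
            finally:
                self.open_now -= 1


async def fake_transcribe() -> str:
    # Stand-in for a real request; replace with your own client call.
    await asyncio.sleep(0.01)
    return "ok"


async def demo(limit: int = 2, jobs: int = 6):
    limiter = ConnectionLimiter(limit)
    results = await asyncio.gather(
        *(limiter.run(fake_transcribe) for _ in range(jobs))
    )
    return limiter.peak, results
```

Queuing excess jobs locally avoids the 429 rather than reacting to it after the fact.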
STT WebSocket (/v1/listen)
- Send KeepAlive as a text frame, not binary. The connection closes after 10 seconds of no audio. Send {"type":"KeepAlive"} as a text (JSON) frame every 3-5 seconds during silence. Sending it as a binary frame is not a silent no-op: the audio pipeline chokes on it and transcription is delayed.
- Never send empty byte payloads. Sending a zero-length binary frame to /v1/listen is treated as a close and terminates the connection. Always check that your audio packet has length before sending.
- encoding must match the actual audio format. If encoding=linear16 but you're sending opus, you'll get a DATA-0000 error or garbled output. Omit encoding entirely when sending containerized formats (mp3, wav, ogg); Deepgram detects them automatically.
- Timestamps reset on reconnect. Each new WebSocket connection restarts timestamps at 00:00:00. For real-time apps, maintain a timestamp offset across reconnections or you'll silently corrupt your transcript timeline.
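Three of the points above (text-frame KeepAlive, no empty payloads, timestamp offsets) can be sketched without any WebSocket library. The helper names are hypothetical, and the `start`/`duration` arguments assume each transcript result reports its start time and length in seconds.

```python
import json

# A str, so a WebSocket client sends it as a text frame rather than binary.
KEEPALIVE_TEXT_FRAME = json.dumps({"type": "KeepAlive"}, separators=(",", ":"))


def should_send_audio(chunk: bytes) -> bool:
    """Skip zero-length chunks; an empty binary frame closes the stream."""
    return len(chunk) > 0


class TimelineOffset:
    """Maps per-connection timestamps onto one continuous timeline."""

    def __init__(self) -> None:
        self.offset = 0.0      # seconds to add on the current connection
        self._last_end = 0.0   # latest absolute end time seen so far

    def absolute(self, start: float, duration: float) -> float:
        """Return the absolute start time of a result and remember its end."""
        absolute_start = self.offset + start
        self._last_end = max(self._last_end, absolute_start + duration)
        return absolute_start

    def on_reconnect(self) -> None:
        """Call after opening a new connection, whose clock restarts at zero."""
        self.offset = self._last_end
```

This offset only accounts for audio that had already produced results; an application that needs exact alignment would track the amount of audio sent instead.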
TTS WebSocket (/v1/speak)
- Don't send empty text. A Speak message with an empty text field returns a 400 error. Always validate input before sending.
- Character rate limiting (DATA-0001) means slow down, not retry. If you hit this, reduce how fast you're submitting text chunks; don't retry immediately, or you'll compound the problem.
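Both TTS points can be sketched as small pure functions. The `{"type": "Speak", "text": ...}` shape is assumed from the message name used above; treating whitespace-only text as empty, and the doubling policy with its floor and ceiling, are choices made for this example rather than documented behavior.

```python
import json


def speak_message(text: str) -> str:
    """Build a Speak text frame, refusing empty input before it is sent."""
    # Assumption: whitespace-only text is treated the same as empty text.
    if not text or not text.strip():
        raise ValueError("refusing to send a Speak message with empty text")
    return json.dumps({"type": "Speak", "text": text})


def adjust_delay(
    delay_s: float,
    rate_limited: bool,
    floor_s: float = 0.05,
    ceiling_s: float = 2.0,
) -> float:
    """Return the pause to use between text chunks.

    When the server reports character rate limiting, widen the pause
    (hypothetical policy: double it, within bounds) instead of retrying.
    """
    if rate_limited:
        return min(ceiling_s, max(floor_s, delay_s * 2))
    return delay_s
```

A sender loop would sleep for the returned delay between chunks and call `adjust_delay(..., rate_limited=True)` whenever DATA-0001 is observed.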
Voice Agent (/v1/agent/converse)
- Send the Settings message before any audio. The agent ignores everything until it receives and acknowledges the Settings configuration. Message ordering is strictly required.
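One way to enforce this ordering is to gate audio behind the Settings handshake in a thin wrapper. In this sketch, `AgentSession` and its methods are hypothetical, `send` stands in for whatever your WebSocket client exposes, and the contents of the Settings payload are left to agent.md.

```python
import json
from typing import Callable, Union


class AgentSession:
    """Refuses to forward audio until Settings is sent and acknowledged."""

    def __init__(self, send: Callable[[Union[str, bytes]], None]) -> None:
        self._send = send
        self._settings_sent = False
        self._settings_acknowledged = False

    def send_settings(self, settings: dict) -> None:
        # Settings is a JSON text frame; "type" is set last so it cannot
        # be overridden by a key in the caller's dict.
        self._send(json.dumps({**settings, "type": "Settings"}))
        self._settings_sent = True

    def mark_settings_acknowledged(self) -> None:
        """Call when the server's acknowledgement of Settings arrives."""
        if not self._settings_sent:
            raise RuntimeError("no Settings message has been sent yet")
        self._settings_acknowledged = True

    def send_audio(self, chunk: bytes) -> None:
        if not self._settings_acknowledged:
            raise RuntimeError("send Settings and wait for the ack before audio")
        if chunk:  # never forward empty payloads
            self._send(chunk)
```

Raising instead of silently dropping early audio makes ordering bugs visible during development.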
Flux model
- Use /v2/listen and model=flux-general-en. /v1/listen does not support Flux. model=flux alone is not a valid value. Do not include language or encoding params for containerized audio.
- Use Configure to update EOT thresholds and keyterms mid-session. Unlike /v1/listen, Flux supports live reconfiguration after connection, so there is no need to reconnect to change turn detection sensitivity or boost new keyterms:
{ "type": "Configure", "thresholds": { "eot_threshold": "0.8", "eot_timeout_ms": "3000" }, "keyterms": ["Deepgram"] }
The server responds with ConfigureSuccess (echoing back applied values) or ConfigureFailure. Omitted threshold fields keep their current values.
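The Configure exchange can be sketched as a message builder plus a reply check. The function names are hypothetical; values are passed through unchanged, so the caller decides their representation, and leaving out the whole thresholds object when nothing changes is an assumption based on the omitted-fields rule above.

```python
import json
from typing import Optional, Sequence


def configure_message(
    eot_threshold=None,
    eot_timeout_ms=None,
    keyterms: Optional[Sequence[str]] = None,
) -> str:
    """Build a Configure text frame containing only the fields being changed."""
    message: dict = {"type": "Configure"}
    thresholds = {}
    if eot_threshold is not None:
        thresholds["eot_threshold"] = eot_threshold
    if eot_timeout_ms is not None:
        thresholds["eot_timeout_ms"] = eot_timeout_ms
    if thresholds:
        message["thresholds"] = thresholds  # omitted fields keep current values
    if keyterms is not None:
        message["keyterms"] = list(keyterms)
    return json.dumps(message)


def configure_applied(raw_reply: str) -> bool:
    """Return True for ConfigureSuccess, False for ConfigureFailure."""
    kind = json.loads(raw_reply).get("type")
    if kind == "ConfigureSuccess":
        return True
    if kind == "ConfigureFailure":
        return False
    raise ValueError(f"not a Configure reply: {kind!r}")
```

On `ConfigureFailure` the previous settings presumably remain in effect, so the caller can keep streaming and decide whether to try different values.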
Authentication
- JWT TTL applies only to the initial handshake. Tokens default to a 30-second TTL. Once the WebSocket connection is established, token expiry does not close it; the token is only needed for the upgrade request.
SDK-Specific Skills
This api skill covers the product contracts (endpoints, query params, message shapes) that are identical across SDKs. For language-idiomatic code (imports, async patterns, builder APIs, common errors), install the SDK-specific skills. Each Deepgram SDK publishes 7 product skills named deepgram-{lang}-{product} (e.g. deepgram-python-speech-to-text, deepgram-js-voice-agent) plus a maintainer skill deepgram-{lang}-maintaining-sdk. The deepgram-{lang}- prefix avoids collisions when you install skills from multiple SDKs.
npx skills add deepgram/deepgram-python-sdk
npx skills add deepgram/deepgram-js-sdk
npx skills add deepgram/deepgram-java-sdk
npx skills add deepgram/deepgram-go-sdk
npx skills add deepgram/deepgram-rust-sdk
npx skills add deepgram/deepgram-swift-sdk
npx skills add deepgram/deepgram-kotlin-sdk
npx skills add deepgram/deepgram-dotnet-sdk
npx skills add deepgram/deepgram-browser-sdk
npx skills add deepgram/deepgram-python-sdk --skill deepgram-python-speech-to-text
npx skills add deepgram/deepgram-js-sdk --skill deepgram-js-voice-agent
Related Deepgram skills
| Skill | Purpose |
|---|---|
| recipes | Minimal runnable snippets per feature per language |
| examples | Full integration examples with third-party platforms (Twilio, LiveKit, etc.) |
| starters | Runnable starter apps (framework × feature matrix) |
| docs | Navigate Deepgram documentation |
| setup-mcp | Install the Deepgram MCP server |
Documentation