| name | gradium-sdk |
| description | Teaches the agent how to use the Gradium AI real-time speech-to-text WebSocket API from TypeScript/JavaScript. Use when wiring voice transcription directly against the Gradium protocol. For browser use in this project, prefer the `@agsk/lib-gradium` wrapper — this skill documents the underlying protocol. |
Gradium AI (raw WebSocket protocol)
Use this skill when implementing a Gradium STT client from scratch, or when debugging one. In this repo the browser-side SDK at libs/lib-gradium/ already implements this — prefer importing it (@agsk/lib-gradium) unless you specifically need to talk to Gradium directly.
Endpoint
wss://us.api.gradium.ai/api/speech/asr
Authentication
Gradium requires an HTTP header on the WebSocket upgrade:
x-api-key: <GRADIUM_API_KEY>
Important: the browser WebSocket API cannot set custom headers; only Node or native clients can. From a browser, you must route through a Node proxy that accepts a local WebSocket connection and opens its own connection to Gradium with the header. See libs/lib-gradium/demo/vite.config.ts for a 40-line reference proxy.
Verified as the only working auth method — query param, subprotocol, and setup-message payloads are all rejected with "No authentication provided".
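As a rough illustration of what the proxy has to do on its upstream side, the sketch below builds the URL and header for the Gradium connection. The helper name `buildUpstreamRequest` and the `UpstreamRequest` shape are hypothetical and are not taken from the reference proxy; the commented usage assumes the `ws` package on Node, whose client constructor accepts a `headers` option.

```typescript
// Hypothetical helper: names are illustrative, not part of @agsk/lib-gradium.
const GRADIUM_ASR_URL = 'wss://us.api.gradium.ai/api/speech/asr';

interface UpstreamRequest {
  url: string;
  headers: Record<string, string>;
}

/** Build the upstream connection parameters, failing early on a missing key. */
function buildUpstreamRequest(apiKey: string): UpstreamRequest {
  if (apiKey.trim() === '') {
    throw new Error('GRADIUM_API_KEY is empty');
  }
  // The key travels only in the upgrade request header, never in the URL
  // or in the setup message.
  return { url: GRADIUM_ASR_URL, headers: { 'x-api-key': apiKey } };
}

// Assumed usage with the `ws` package on Node (not available in browsers):
//   const { url, headers } = buildUpstreamRequest(apiKey);
//   const upstream = new WebSocket(url, { headers });
```

Keeping the key on the proxy side also means it never ships in browser code.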
Audio format
- PCM 16-bit signed little-endian, mono, 24 kHz
- Frames must be exactly 1920 samples = 3840 bytes = 80 ms
- Each frame is base64-encoded and wrapped in a JSON message
Do not use MediaRecorder — it only emits containerized opus/webm, never raw PCM. Use AudioContext + AudioWorklet to capture Float32 samples and convert to Int16.
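The Float32-to-Int16 conversion and the frame arithmetic above can be sketched as follows. The helper names (`floatToPcm16`, `toBase64`) are illustrative assumptions, and the asymmetric scaling (0x8000 for negative samples, 0x7fff for positive) is one common convention rather than something Gradium specifies.

```typescript
// Illustrative helpers; names are assumptions, not part of any Gradium SDK.
const SAMPLE_RATE = 24000;
const FRAME_SAMPLES = 1920;             // 80 ms at 24 kHz
const FRAME_BYTES = FRAME_SAMPLES * 2;  // 16-bit samples -> 3840 bytes

/** Convert Float32 samples in [-1, 1] to signed 16-bit little-endian PCM bytes. */
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    const v = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(i * 2, Math.round(v), true); // true = little-endian
  }
  return out;
}

/** Base64-encode raw bytes for embedding in a JSON text message. */
function toBase64(bytes: Uint8Array): string {
  let bin = '';
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}
```

One full frame of 3840 bytes encodes to 5120 base64 characters, which makes a convenient sanity check before sending.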
Message shapes
Client → server
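Send exactly one `setup`, then `audio` frames paced at ~80 ms, then optionally `end_of_stream`. A `flush` message may also be sent during the stream; the server replies with a `flushed` message (see the server messages below):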
{ "type": "setup", "model_name": "default", "input_format": "pcm" }
{ "type": "audio", "audio": "<base64 pcm_s16le 24kHz mono, 1920 samples>" }
{ "type": "flush", "flush_id": "<client-chosen id>" }
{ "type": "end_of_stream" }
All messages, including audio, MUST be sent as WebSocket text frames, not binary.
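One way to keep these shapes straight in TypeScript is a discriminated union with small builder functions. This is a sketch: the type and function names are assumptions, and the length check simply encodes the 3840-byte frame rule as its base64 length.

```typescript
// Names are illustrative; field names follow the message shapes above.
type ClientMessage =
  | { type: 'setup'; model_name: string; input_format: 'pcm' }
  | { type: 'audio'; audio: string }
  | { type: 'flush'; flush_id: string }
  | { type: 'end_of_stream' };

const FRAME_BYTES = 3840;
// Base64 expands every 3 bytes to 4 characters: 3840 bytes -> 5120 characters.
const FRAME_BASE64_CHARS = Math.ceil(FRAME_BYTES / 3) * 4;

function setupMessage(modelName = 'default'): ClientMessage {
  return { type: 'setup', model_name: modelName, input_format: 'pcm' };
}

/** Wrap one base64-encoded frame, rejecting payloads of the wrong size. */
function audioMessage(base64Pcm: string): ClientMessage {
  if (base64Pcm.length !== FRAME_BASE64_CHARS) {
    throw new Error(
      `expected ${FRAME_BASE64_CHARS} base64 chars, got ${base64Pcm.length}`,
    );
  }
  return { type: 'audio', audio: base64Pcm };
}

/** Serialize to the JSON string that goes out as a WebSocket text frame. */
function serialize(message: ClientMessage): string {
  return JSON.stringify(message);
}
```

Passing the result of `serialize` to `ws.send` produces a text frame, because the argument is a string rather than an ArrayBuffer or Blob.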
Server → client
{ "type": "ready", "request_id": "...", "model_name": "default",
  "sample_rate": 24000, "frame_size": 1920, "delay_in_frames": 10,
  "text_stream_names": ["asr_0"] }
{ "type": "text", "text": "Hello", "start_s": 0.5, "stream_id": null }
{ "type": "step", "vad": [
    { "horizon_s": 0.5, "inactivity_prob": 0.02 },
    { "horizon_s": 1.0, "inactivity_prob": 0.05 },
    { "horizon_s": 2.0, "inactivity_prob": 0.12 },
    { "horizon_s": 3.0, "inactivity_prob": 0.20 }
  ], "step_idx": 5, "step_duration_s": 0.08, "total_duration_s": 0.4 }
{ "type": "end_text", "stop_s": 2.5, "stream_id": null }
{ "type": "flushed", "flush_id": "..." }
{ "type": "error", "message": "...", "code": 1011 }
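The server messages above map naturally onto a TypeScript discriminated union. The sketch below is an assumption-laden reading of those examples: the type names and `parseServerMessage` are hypothetical, and `stream_id` is typed `number | null` only because `null` is the sole value shown.

```typescript
// Types inferred from the example messages above; names are illustrative.
type VadPrediction = { horizon_s: number; inactivity_prob: number };

type ServerMessage =
  | { type: 'ready'; request_id: string; model_name: string; sample_rate: number;
      frame_size: number; delay_in_frames: number; text_stream_names: string[] }
  | { type: 'text'; text: string; start_s: number; stream_id: number | null } // non-null type is a guess
  | { type: 'step'; vad: VadPrediction[]; step_idx: number;
      step_duration_s: number; total_duration_s: number }
  | { type: 'end_text'; stop_s: number; stream_id: number | null }
  | { type: 'flushed'; flush_id: string }
  | { type: 'error'; message: string; code: number };

/** Parse one text frame; returns null for invalid JSON or unknown message types. */
function parseServerMessage(raw: string): ServerMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== 'object' || value === null) return null;
  const type = (value as { type?: unknown }).type;
  switch (type) {
    case 'ready':
    case 'text':
    case 'step':
    case 'end_text':
    case 'flushed':
    case 'error':
      // Only the discriminant is checked here; fields are trusted as documented.
      return value as ServerMessage;
    default:
      return null;
  }
}
```

Returning `null` for unrecognized types keeps a client tolerant of message kinds not covered here, at the cost of ignoring them silently.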
Critical sequencing rules
- Wait for `ready` before sending any audio. Audio sent before `ready` may be silently discarded or may cause the server to close the connection.
- Pace audio at ~80 ms per frame to match `frame_size=1920`. Do not send a burst of buffered chunks: Gradium validates frame arrival and responds with "Message validation error" (code 1011) if frames arrive too fast.
- There is no partial/final flag on `text`. Accumulate `text` events within a turn and treat `end_text` as the turn boundary that finalizes the utterance.
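The three rules above can be sketched as two small helpers: a sender that holds frames until `ready` and releases at most one per 80 ms tick, and an accumulator that joins `text` events until `end_text`. Both class names and their methods are hypothetical. Joining with a single space mirrors the minimal example in this document rather than any documented Gradium behavior.

```typescript
// Illustrative sketch; class and method names are assumptions, not a Gradium API.

/** Queues encoded audio messages and releases at most one per tick, only after ready. */
class PacedSender {
  private readonly queue: string[] = [];
  private readonly send: (message: string) => void;
  private readonly intervalMs: number;
  private ready = false;
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(send: (message: string) => void, intervalMs = 80) {
    this.send = send;
    this.intervalMs = intervalMs;
  }

  markReady(): void { this.ready = true; }
  enqueue(message: string): void { this.queue.push(message); }

  /** Send one queued message if ready; otherwise do nothing. */
  tick(): void {
    if (!this.ready) return;
    const next = this.queue.shift();
    if (next !== undefined) this.send(next);
  }

  start(): void {
    if (this.timer === undefined) {
      this.timer = setInterval(() => this.tick(), this.intervalMs);
    }
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}

/** Collects text events for the current turn; end_text finalizes and resets. */
class TurnAccumulator {
  private parts: string[] = [];

  onText(text: string): string {
    this.parts.push(text);
    return this.parts.join(' ');   // running partial transcript
  }

  onEndText(): string {
    const finalText = this.parts.join(' ');
    this.parts = [];
    return finalText;
  }
}
```

`setInterval` is not a precise clock, so this only approximates 80 ms pacing. A live microphone already produces frames at the right rate, so a queue like this matters mainly when replaying buffered or recorded audio.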
Minimal browser pipeline (via a proxy)
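// Local proxy that adds the x-api-key header; browsers cannot set it themselves.
const WS_URL = 'ws://127.0.0.1:5174';
const ws = new WebSocket(WS_URL);
let ready = false;
let accumulated = '';

ws.onopen = () => {
  // Exactly one setup message, sent first.
  ws.send(JSON.stringify({
    type: 'setup',
    model_name: 'default',
    input_format: 'pcm',
  }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'ready':
      // Only start capturing once the server is ready for audio.
      ready = true;
      startAudio().catch((err) => console.error('audio:', err));
      break;
    case 'text':
      // text events carry no partial/final flag; accumulate until end_text.
      accumulated = accumulated ? `${accumulated} ${msg.text}` : msg.text;
      console.log('partial:', accumulated);
      break;
    case 'end_text':
      // Turn boundary: the accumulated text is now final.
      console.log('final:', accumulated);
      accumulated = '';
      break;
    case 'step':
      // VAD and timing info; ignored in this minimal example.
      break;
    case 'error':
      console.error('gradium:', msg.message);
      break;
  }
};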
async function startAudio() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, sampleRate: 24000 },
  });
  // Run the AudioContext at 24 kHz so the worklet receives samples at Gradium's rate.
  const ctx = new AudioContext({ sampleRate: 24000 });
  await ctx.audioWorklet.addModule('./pcm-worklet.js');
  const node = new AudioWorkletNode(ctx, 'pcm-chunker', {
    processorOptions: { chunkSamples: 1920 },
    numberOfInputs: 1, numberOfOutputs: 0, channelCount: 1,
  });
  // Each worklet message is expected to be one 1920-sample frame of Int16 PCM (3840 bytes).
  node.port.onmessage = (e) => {
    if (!ready || ws.readyState !== WebSocket.OPEN) return;
    const bytes = new Uint8Array(e.data);
    // Base64-encode the raw bytes for the JSON text frame.
    let bin = '';
    for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
    ws.send(JSON.stringify({ type: 'audio', audio: btoa(bin) }));
  };
  ctx.createMediaStreamSource(stream).connect(node);
}
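The pipeline above loads `./pcm-worklet.js`, which is not shown here. The sketch below is a guess at the chunking logic such a worklet would need, written as a plain class so it can run outside the audio thread. `PcmChunker` is a hypothetical name, and the commented wrapper assumes the standard `AudioWorkletProcessor` API, which currently delivers 128-sample blocks (15 blocks per 1920-sample frame).

```typescript
// Hypothetical chunker; the real pcm-worklet.js may differ.
class PcmChunker {
  private readonly buf: Int16Array;
  private filled = 0;

  constructor(chunkSamples: number) {
    this.buf = new Int16Array(chunkSamples);
  }

  /** Append Float32 samples; returns every chunk completed by this block. */
  push(block: Float32Array): Int16Array[] {
    const chunks: Int16Array[] = [];
    for (let i = 0; i < block.length; i++) {
      const s = Math.max(-1, Math.min(1, block[i]));
      // Int16Array stores in platform byte order, which is little-endian
      // on the platforms browsers run on. Fractions are truncated.
      this.buf[this.filled++] = s < 0 ? s * 0x8000 : s * 0x7fff;
      if (this.filled === this.buf.length) {
        chunks.push(this.buf.slice()); // copy, so the internal buffer can be reused
        this.filled = 0;
      }
    }
    return chunks;
  }
}

// Inside pcm-worklet.js this could be wrapped roughly as:
//   class PcmChunkerProcessor extends AudioWorkletProcessor {
//     constructor(options) {
//       super();
//       this.chunker = new PcmChunker(options.processorOptions.chunkSamples);
//     }
//     process(inputs) {
//       const channel = inputs[0][0];
//       if (channel) {
//         for (const c of this.chunker.push(channel)) {
//           this.port.postMessage(c.buffer, [c.buffer]); // transfer, no copy
//         }
//       }
//       return true; // keep the processor alive
//     }
//   }
//   registerProcessor('pcm-chunker', PcmChunkerProcessor);
```

Posting the underlying `ArrayBuffer` matches the main-thread handler above, which wraps `e.data` in a `Uint8Array` before base64-encoding it.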
Common mistakes (all of these were in this repo's first attempt)
| Mistake | Correct |
|---|---|
| URL `wss://st.gradium.ai` | `wss://us.api.gradium.ai/api/speech/asr` |
| Auth in setup message `api_key` | `x-api-key` header (requires proxy from browser) |
| Sample rate 16 kHz | 24 kHz |
| `MediaRecorder` for PCM | `AudioContext` + `AudioWorklet` |
| Audio field name `data` | `audio` |
| Setup `audio_format: { encoding, sample_rate, channels }` | Just `model_name` + `input_format: 'pcm'` |
| Server event names `transcription` / `vad` | `text` / `end_text` / `step` |
| Sending audio before `ready` | Wait for `ready` first |