| name | ai-voice-agent |
| description | Build and deploy production-grade AI voice agents for businesses.
Use when: user asks about "voice AI," "AI phone agent," "IVR replacement,"
"automated calling," "Twilio voice bot," "AI receptionist," "voice assistant,"
"phone automation," "call center AI," or "voice conversational AI."
|
| metadata | {"author":"Cybrflux","version":"1.0.0","tags":["voice","ai","twilio","telephony","automation","conversational-ai"],"requires":["twilio","openai","deepgram"]} |
AI Voice Agent — Build Production Voice AI That Actually Works
This isn't a toy. This is how you build voice agents that handle real business calls—booking appointments, qualifying leads, providing support, and making sales—24/7 without human intervention.
Cybrflux has deployed voice agents for real estate (R7), healthcare, and e-commerce. This skill distills everything we learned building VoxKit into a battle-tested framework.
When to Use
- Inbound Call Handling — Replace IVRs, answer FAQs, route calls intelligently
- Outbound Sales/Prospecting — Cold calls that don't sound robotic
- Appointment Scheduling — Book, confirm, reschedule without human agents
- Lead Qualification — Pre-qualify leads before passing to sales
- Customer Support — Handle tier-1 support, escalate complex issues
- Follow-up Automation — Post-purchase calls, satisfaction surveys, renewal reminders
Trigger phrases: "voice AI," "AI phone agent," "automated calling system," "Twilio voice bot," "AI receptionist," "phone automation," "voice assistant for business," "IVR replacement," "call center AI"
Architecture Overview
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Caller    │────▶│   Twilio    │────▶│  WebSocket  │────▶│ Your Server │
│   (Phone)   │◄────│   (Voice)   │◄────│   (Media)   │◄────│(Node/Python)│
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬──────┘
                                                                   │
                    ┌─────────────┐     ┌─────────────┐            │
                    │     TTS     │◄────│     LLM     │◄───────────┘
                    │ (ElevenLabs │     │  (GPT-4o/   │
                    │ or Cartesia)│     │   Claude)   │
                    └──────▲──────┘     └──────▲──────┘
                           │                   │
                    ┌──────┴──────┐     ┌──────┴──────┐
                    │     STT     │────▶│Conversation │
                    │  (Deepgram  │     │   Manager   │
                    │ or Whisper) │     │             │
                    └─────────────┘     └─────────────┘
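Latency Budget (target <800ms end-to-end):
- STT: 150-300ms
- LLM: 200-500ms
- TTS: 150-300ms
- Network: 50-100ms
Run back to back, these ranges total 550-1,200ms, so the <800ms target holds only when each stage stays near the low end of its range or when stages overlap through streaming.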
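A minimal sketch for checking a budget like this against a target. The ranges are copied from the list above; `budgetReport` is a hypothetical helper, not part of any SDK, and it assumes the stages run strictly one after another.

```javascript
// Latency ranges in milliseconds, copied from the budget above.
const STAGES = {
  stt: [150, 300],
  llm: [200, 500],
  tts: [150, 300],
  network: [50, 100],
};

// Hypothetical helper: sums best- and worst-case latency and compares both to a target.
function budgetReport(stages, targetMs) {
  const ranges = Object.values(stages);
  const best = ranges.reduce((sum, [lo]) => sum + lo, 0);
  const worst = ranges.reduce((sum, [, hi]) => sum + hi, 0);
  return { best, worst, bestFits: best < targetMs, worstFits: worst < targetMs };
}

console.log(budgetReport(STAGES, 800));
```

Swapping the ranges for measured p50 and p95 values per stage gives a quick view of which stage is consuming the budget.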
Stack Recommendations
Telephony Layer
| Provider | Best For | Cost/Min | Notes |
|---|---|---|---|
| Twilio | General purpose | $0.0085/min | Best docs, most integrations |
| Vonage | International | $0.005-0.02/min | Better rates outside US |
| Telnyx | Cost optimization | $0.004/min | Lowest per-minute rates |
Speech-to-Text (STT)
| Provider | Latency | Accuracy | Cost/Hour | Best For |
|---|---|---|---|---|
| Deepgram Nova-2 | ~200ms | 95%+ | $0.75 | Production default |
| Deepgram Nova-3 | ~250ms | 97% | $1.25 | High-accuracy needs |
| Whisper API | ~400ms | 94% | $0.36 | Budget option |
| Whisper Local | Variable | 92% | $0 | On-premise requirements |
Text-to-Speech (TTS)
| Provider | Latency | Naturalness | Cost/1K chars | Best For |
|---|---|---|---|---|
| ElevenLabs Turbo v2.5 | ~150ms | ⭐⭐⭐⭐⭐ | $0.10 | Premium experience |
| Cartesia Sonic | ~100ms | ⭐⭐⭐⭐⭐ | $0.08 | Ultra-low latency |
| OpenAI TTS HD | ~200ms | ⭐⭐⭐⭐ | $0.03 | Budget-conscious |
| Azure Neural | ~250ms | ⭐⭐⭐⭐ | $0.016 | Enterprise fallback |
LLM (Conversation Brain)
| Model | Latency | Intelligence | Cost/1K tokens (input/output) | Best For |
|---|---|---|---|---|
| GPT-4o | ~300ms | ⭐⭐⭐⭐⭐ | $0.005/0.015 | Default choice |
| Claude 3.5 Sonnet | ~400ms | ⭐⭐⭐⭐⭐ | $0.003/0.015 | Complex reasoning |
| GPT-4o-mini | ~150ms | ⭐⭐⭐⭐ | $0.00015/0.0006 | High-volume, simple flows |
Core Framework: The VOX Pattern
Every production voice agent follows the VOX Pattern:
V - Voice (STT capture + audio streaming)
O - Orchestrate (conversation state + context management)
X - eXecute (LLM reasoning + function calling + TTS output)
Step 1: Voice Layer (STT + Audio Streaming)
const WebSocket = require('ws');
const { createClient, LiveTranscriptionEvents } = require('@deepgram/sdk');

class VoiceStreamHandler {
  // config: { deepgramKey, callSid, orchestrator }
  constructor(twilioWs, config) {
    this.twilioWs = twilioWs;
    this.callSid = config.callSid;
    this.orchestrator = config.orchestrator; // ConversationOrchestrator from Step 2
    this.deepgram = createClient(config.deepgramKey);
    this.transcriptionBuffer = [];
    this.isProcessing = false;
    this.dgConnection = this.deepgram.listen.live({
      model: 'nova-2',
      language: 'en-US',
      encoding: 'mulaw',   // Twilio Media Streams deliver 8 kHz mu-law audio
      sample_rate: 8000,
      smart_format: true,
      interim_results: true,
      utterance_end_ms: 1500,
      vad_events: true,
      endpointing: 400,
    });
    this.setupDeepgramHandlers();
  }

  setupDeepgramHandlers() {
    this.dgConnection.on(LiveTranscriptionEvents.Transcript, (data) => {
      const transcript = data.channel.alternatives[0].transcript;
      // Ignore interim results and empty finals (silence).
      if (!data.is_final || !transcript) return;
      this.transcriptionBuffer.push({
        text: transcript,
        confidence: data.channel.alternatives[0].confidence,
        timestamp: Date.now()
      });
      if (!this.isProcessing) {
        this.processTranscription();
      }
    });
    this.dgConnection.on(LiveTranscriptionEvents.UtteranceEnd, () => {
      if (this.transcriptionBuffer.length > 0 && !this.isProcessing) {
        this.processTranscription();
      }
    });
  }

  async processTranscription() {
    this.isProcessing = true;
    try {
      const utterance = this.transcriptionBuffer.map(u => u.text).join(' ');
      this.transcriptionBuffer = [];
      await this.orchestrator.handleUserInput(this.callSid, utterance);
    } finally {
      // Always release the flag, even if the orchestrator throws.
      this.isProcessing = false;
    }
  }

  // payload: base64 audio from a Twilio "media" message
  handleAudio(payload) {
    const audioBuffer = Buffer.from(payload, 'base64');
    this.dgConnection.send(audioBuffer);
  }
}
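`handleAudio` expects the base64 payload carried by Twilio Media Streams messages. The sketch below shows one way to dispatch those messages, assuming the `start`, `media`, and `stop` events; `routeTwilioMessage` and the `handlers` object are hypothetical names used for illustration only.

```javascript
// Hypothetical dispatcher for raw Twilio Media Streams WebSocket messages.
// `handlers` is any object exposing onStart, onAudio, and onStop callbacks.
function routeTwilioMessage(raw, handlers) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch (err) {
    return 'ignored'; // not JSON, nothing to do
  }
  switch (msg.event) {
    case 'start':
      // The start message carries both identifiers. Keep the streamSid,
      // because outbound media messages must reference it.
      handlers.onStart(msg.start.streamSid, msg.start.callSid);
      return 'start';
    case 'media':
      // Inbound audio is base64-encoded 8 kHz mu-law.
      handlers.onAudio(msg.media.payload);
      return 'media';
    case 'stop':
      handlers.onStop();
      return 'stop';
    default:
      return 'ignored'; // e.g. 'connected' or 'mark'
  }
}

// Usage with stub handlers that only record what they receive.
const seen = [];
const handlers = {
  onStart: (streamSid, callSid) => seen.push(['start', streamSid, callSid]),
  onAudio: (payload) => seen.push(['audio', payload]),
  onStop: () => seen.push(['stop']),
};
routeTwilioMessage(
  JSON.stringify({ event: 'start', start: { streamSid: 'MZ123', callSid: 'CA456' } }),
  handlers
);
routeTwilioMessage(JSON.stringify({ event: 'media', media: { payload: 'AAAA' } }), handlers);
console.log(seen);
```

In a real server, `onStart` would be the place to construct a `VoiceStreamHandler` and `onAudio` would forward to its `handleAudio` method.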
Step 2: Orchestration Layer (State Management)
class ConversationOrchestrator {
  // enrichCaller, executeLLM, executeFunction, speakResponse,
  // getAvailableFunctions, and getSystemPromptForNode are application-specific
  // methods; executeLLM is assumed to wrap LLMExecutor.generateResponse (Step 3).
  constructor(config) {
    this.sessions = new Map();
    this.config = config;
  }

  async createSession(callSid, fromNumber, toNumber) {
    const session = {
      callSid,
      fromNumber,
      toNumber,
      startTime: Date.now(),
      context: {
        userInfo: null,
        appointmentDate: null,
        qualificationScore: 0,
        intents: [],
        objects: {}
      },
      history: [],
      currentNode: 'greeting',
      transferRequested: false
    };
    session.context.userInfo = await this.enrichCaller(fromNumber);
    this.sessions.set(callSid, session);
    return session;
  }

  async handleUserInput(callSid, utterance) {
    const session = this.sessions.get(callSid);
    if (!session) return;
    session.history.push({ role: 'user', content: utterance, timestamp: Date.now() });
    const stateContext = this.buildStateContext(session);
    const response = await this.executeLLM(session, stateContext);
    let reply = response.content;
    if (response.functionCall) {
      const result = await this.executeFunction(response.functionCall, session);
      const followUp = await this.executeLLM(session, {
        ...stateContext,
        functionResult: result
      });
      reply = followUp.content;
    }
    // Record what was actually spoken, not the pre-function-call draft.
    if (reply) {
      await this.speakResponse(callSid, reply);
      session.history.push({ role: 'assistant', content: reply, timestamp: Date.now() });
    }
  }

  buildStateContext(session) {
    return {
      currentNode: session.currentNode,
      extractedInfo: session.context,
      availableFunctions: this.getAvailableFunctions(session.currentNode),
      systemPrompt: this.getSystemPromptForNode(session.currentNode),
      maxHistory: 10
    };
  }
}
Step 3: Execution Layer (LLM + Function Calling)
const OpenAI = require('openai');

class LLMExecutor {
  constructor(config) {
    this.openai = new OpenAI({ apiKey: config.openaiKey });
    this.model = config.model || 'gpt-4o';
  }

  async generateResponse(session, context) {
    const messages = [
      { role: 'system', content: context.systemPrompt },
      ...session.history.slice(-context.maxHistory).map(h => ({
        role: h.role,
        content: h.content
      }))
    ];
    // On the follow-up call, surface the function result to the model.
    // Simplification: a system message stands in for a full tool-message exchange.
    if (context.functionResult !== undefined) {
      messages.push({
        role: 'system',
        content: `Function result: ${JSON.stringify(context.functionResult)}`
      });
    }
    const tools = context.availableFunctions.map(fn => ({
      type: 'function',
      function: fn
    }));
    const response = await this.openai.chat.completions.create({
      model: this.model,
      messages,
      tools: tools.length > 0 ? tools : undefined,
      tool_choice: tools.length > 0 ? 'auto' : undefined,
      temperature: 0.7,
      max_tokens: 300
    });
    const choice = response.choices[0];
    const toolCall = choice.message.tool_calls && choice.message.tool_calls[0];
    if (toolCall) {
      let args = {};
      try {
        args = JSON.parse(toolCall.function.arguments);
      } catch (err) {
        // Malformed JSON from the model: fall back to empty arguments.
      }
      return {
        content: choice.message.content,
        functionCall: { name: toolCall.function.name, arguments: args }
      };
    }
    return { content: choice.message.content };
  }
}
const FUNCTIONS = {
bookAppointment: {
name: 'bookAppointment',
description: 'Book an appointment for the caller',
parameters: {
type: 'object',
properties: {
date: { type: 'string', format: 'date' },
time: { type: 'string', pattern: '^[0-9]{2}:[0-9]{2}$' },
service: { type: 'string' },
name: { type: 'string' },
phone: { type: 'string' }
},
required: ['date', 'time', 'service', 'name', 'phone']
}
},
checkAvailability: {
name: 'checkAvailability',
description: 'Check available time slots for a given date',
parameters: {
type: 'object',
properties: {
date: { type: 'string', format: 'date' },
service: { type: 'string' }
},
required: ['date']
}
},
transferToHuman: {
name: 'transferToHuman',
description: 'Transfer the call to a human agent',
parameters: {
type: 'object',
properties: {
reason: { type: 'string' },
priority: { type: 'string', enum: ['low', 'normal', 'high', 'urgent'] }
},
required: ['reason']
}
},
qualifyLead: {
name: 'qualifyLead',
description: 'Score the lead based on qualification criteria',
parameters: {
type: 'object',
properties: {
budget: { type: 'string' },
timeline: { type: 'string' },
authority: { type: 'boolean' },
need: { type: 'string' }
}
}
}
};
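Model-generated arguments can be incomplete or malformed, so it helps to check them against the schema before running a function. The sketch below is a minimal illustration rather than a full JSON Schema validator: `validateArgs` is a hypothetical helper that checks only `required`, primitive `type`, `pattern`, and `enum`.

```javascript
// Trimmed copy of the bookAppointment schema from FUNCTIONS above.
const bookAppointmentSchema = {
  type: 'object',
  properties: {
    date: { type: 'string' },
    time: { type: 'string', pattern: '^[0-9]{2}:[0-9]{2}$' },
    service: { type: 'string' },
    name: { type: 'string' },
    phone: { type: 'string' },
  },
  required: ['date', 'time', 'service', 'name', 'phone'],
};

// Hypothetical helper: returns a list of problems, empty when the args pass.
function validateArgs(schema, args) {
  const problems = [];
  for (const key of schema.required || []) {
    if (args[key] === undefined || args[key] === '') problems.push(`missing ${key}`);
  }
  for (const [key, value] of Object.entries(args)) {
    const rule = schema.properties[key];
    if (!rule) continue; // unknown keys are ignored in this sketch
    if (typeof value !== rule.type) {
      problems.push(`${key} should be ${rule.type}`);
    } else if (rule.pattern && !new RegExp(rule.pattern).test(value)) {
      problems.push(`${key} has invalid format`);
    } else if (rule.enum && !rule.enum.includes(value)) {
      problems.push(`${key} must be one of ${rule.enum.join(', ')}`);
    }
  }
  return problems;
}

console.log(validateArgs(bookAppointmentSchema, {
  date: '2025-03-04', time: '9am', service: 'cleaning', name: 'Ana',
}));
```

A non-empty result is a cue to ask the caller a follow-up question instead of calling the booking backend with partial data.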
Step 4: TTS Output Streaming
class TTSStreamer {
  constructor(config) {
    this.elevenLabsKey = config.elevenLabsKey;
    this.voiceId = config.voiceId;
  }

  // streamSid comes from the Media Streams "start" message. Twilio expects it,
  // not the callSid, on outbound media messages.
  async speakResponse(streamSid, text, twilioWs) {
    // output_format is sent as a query parameter; ulaw_8000 is the 8 kHz mu-law
    // format Twilio plays. Verify both against the current ElevenLabs API reference.
    const response = await fetch(
      `https://api.elevenlabs.io/v1/text-to-speech/${this.voiceId}/stream?output_format=ulaw_8000`,
      {
        method: 'POST',
        headers: {
          'Accept': 'audio/mulaw',
          'xi-api-key': this.elevenLabsKey,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          text,
          model_id: 'eleven_turbo_v2_5',
          voice_settings: {
            stability: 0.5,
            similarity_boost: 0.75,
            style: 0.3,
            use_speaker_boost: true
          }
        })
      }
    );
    if (!response.ok) {
      throw new Error(`TTS request failed with status ${response.status}`);
    }
    // Forward each audio chunk to Twilio as soon as it arrives.
    const reader = response.body.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      twilioWs.send(JSON.stringify({
        event: 'media',
        streamSid,
        media: {
          payload: Buffer.from(value).toString('base64')
        }
      }));
    }
  }
}
Conversation Design Framework
The SPICE Method for Voice Scripts
Every voice interaction follows SPICE:
| Element | Purpose | Example |
|---|---|---|
| Set Context | Who/what/where | "Hi, this is Sarah from Apex Dental. I'm calling about your appointment request." |
| Proceed with Purpose | Clear next step | "I can check our availability for this week. What day works best for you?" |
| Invite Response | Open-ended prompt | "Tell me a bit about what you're looking for." |
| Confirm Understanding | Verify extraction | "So you're looking for a cleaning on Tuesday afternoon—is that right?" |
| Exit or Advance | Move forward or close | "Perfect! I've got you booked. You'll receive a confirmation text. Anything else I can help with?" |
Voice Persona Template
name: "Sarah"
purpose: "Dental appointment scheduling and lead qualification"
tone: "warm, professional, efficient"
speaking_style: "natural pace, occasional brief pauses, friendly but not overly casual"
avoid:
- Filler words ("um", "uh", "like")
- Over-apologizing
- Robot-speak ("I am an AI assistant...")
- Rushing through important details
use_when:
- Inbound appointment calls
- Follow-up on web inquiries
- Rescheduling requests
greeting: "Hi, this is Sarah from Apex Dental. How can I help you today?"
Conversation Flow Templates
Template 1: Appointment Booking
const APPOINTMENT_FLOW = {
nodes: {
greeting: {
prompt: `You're Sarah from {businessName}. Answer the phone warmly.
If they want to book an appointment, proceed to qualification.
If they have other questions, answer helpfully or offer to transfer.`,
next: ['qualifyNeed', 'handleQuestion', 'transfer']
},
qualifyNeed: {
prompt: `Ask what service they need and when they'd prefer to come in.
Services: {availableServices}.
Extract: service type, preferred date range, urgency level.`,
extract: ['service', 'preferredDate', 'urgency'],
next: ['checkAvailability']
},
checkAvailability: {
function: 'checkAvailability',
prompt: `Share 2-3 available slots. Present them clearly with day and time.
Ask which they prefer or if they need different options.`,
next: ['confirmBooking', 'offerAlternatives']
},
confirmBooking: {
prompt: `Confirm the details: "Just to confirm, that's a {service} on {date} at {time} for {name}. Is that correct?"`,
function: 'bookAppointment',
next: ['bookingConfirmed', 'correctDetails']
},
bookingConfirmed: {
prompt: `Great! You're all set. Mention they'll get a confirmation text 24 hours before.
Ask if there's anything else they need.`,
next: ['additionalHelp', 'closeCall']
},
closeCall: {
prompt: `Thank them warmly and say goodbye. Keep it brief.`,
hangup: true
}
}
};
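The flow above relies on two mechanics: filling `{placeholder}` slots in prompts and moving only along the `next` edges. A minimal sketch of both follows; `renderPrompt`, `canTransition`, and `demoFlow` are hypothetical names for illustration and are not part of the template itself.

```javascript
// Replaces {placeholders} with values. Unknown placeholders are left intact
// so a missing variable shows up in logs instead of silently going blank.
function renderPrompt(template, vars) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? String(vars[key]) : match
  );
}

// Returns true only when `to` is listed in the `next` array of node `from`.
function canTransition(flow, from, to) {
  const node = flow.nodes[from];
  return Boolean(node && Array.isArray(node.next) && node.next.includes(to));
}

// Cut-down flow with the same shape as APPOINTMENT_FLOW.
const demoFlow = {
  nodes: {
    greeting: { prompt: "You're Sarah from {businessName}.", next: ['qualifyNeed', 'transfer'] },
    qualifyNeed: { prompt: 'Services: {availableServices}.', next: ['checkAvailability'] },
  },
};

console.log(renderPrompt(demoFlow.nodes.greeting.prompt, { businessName: 'Apex Dental' }));
console.log(canTransition(demoFlow, 'greeting', 'qualifyNeed'));
console.log(canTransition(demoFlow, 'greeting', 'checkAvailability'));
```

Rejecting transitions that are not listed in `next` keeps an LLM-suggested jump from skipping required steps such as confirmation.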
Template 2: Lead Qualification (Outbound)
const OUTBOUND_QUALIFICATION_FLOW = {
nodes: {
intro: {
prompt: `Quick intro: "Hi {name}, this is {agentName} from {company}.
I saw you recently {triggerEvent}. Do you have a quick minute?"
If YES → proceed to discovery
If NO → ask for better time, schedule callback
If voicemail → leave 15-second message with callback number`,
next: ['discovery', 'scheduleCallback', 'voicemail']
},
discovery: {
prompt: `Ask BANT questions naturally:
- Budget: "What's your budget range for this?" or "Are you the decision-maker for budget?"
- Authority: "Besides yourself, who else is involved in this decision?"
- Need: "What prompted you to look into this now?"
- Timeline: "When are you hoping to have this in place?"
Don't ask all at once—let it flow conversationally.
Score each response (1-10) and store in context.`,
extract: ['budgetRange', 'decisionMaker', 'painPoints', 'timeline', 'competitors'],
next: ['presentSolution', 'nurture', 'disqualify']
},
presentSolution: {
condition: 'qualificationScore >= 7',
prompt: `They're qualified! Briefly present the solution that matches their needs.
Focus on outcomes, not features.
End with: "Would it make sense to schedule a 15-minute demo this week?"`,
next: ['bookMeeting', 'handleObjection', 'nurture']
},
nurture: {
condition: 'qualificationScore >= 4 && qualificationScore < 7',
prompt: `Not ready to buy yet. Add value:
- Send relevant case study
- Offer to add them to monthly insights email
- Set follow-up in 30-60 days
Get permission and their preferred contact method.`,
function: 'addToNurtureSequence',
next: ['closeCall']
},
disqualify: {
condition: 'qualificationScore < 4',
prompt: `They're not a fit. Be polite but direct:
"Based on what you've shared, I don't think we're the right fit right now.
But if things change, feel free to reach out. Thanks for your time!"`,
next: ['closeCall']
}
}
};
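The `condition` strings in this flow route on `qualificationScore`. The sketch below shows one way to compute and apply that score. Averaging the per-criterion scores is an assumption, since the template only says to score each response from 1 to 10; both function names are hypothetical.

```javascript
// Hypothetical helper: averages per-criterion BANT scores (1-10 each).
function qualificationScore(scores) {
  const values = Object.values(scores);
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mirrors the `condition` thresholds in OUTBOUND_QUALIFICATION_FLOW.
function nextNodeForScore(score) {
  if (score >= 7) return 'presentSolution';
  if (score >= 4) return 'nurture';
  return 'disqualify';
}

const score = qualificationScore({ budget: 8, authority: 6, need: 9, timeline: 5 });
console.log(score, nextNodeForScore(score));
```

A weighted average, for example weighting budget and authority more heavily, would slot into `qualificationScore` without changing the routing.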
Voice Selection Guide
Matching Voice to Brand
| Industry | Voice Characteristics | Recommended Voices |
|---|---|---|
| Healthcare | Warm, calming, trustworthy | ElevenLabs: "Grace" (premade) or custom clone |
| Legal/Finance | Professional, measured, authoritative | ElevenLabs: "Adam" or "Daniel" |
| E-commerce | Energetic, helpful, efficient | ElevenLabs: "Bella" or "Rachel" |
| SaaS/Tech | Modern, confident, solution-oriented | Cartesia: "Sonic" upbeat variants |
| Luxury | Refined, sophisticated, unhurried | Custom clone of premium brand voice |
Creating Custom Voice Clones
"""
Requirements:
- Minimum 30 minutes of clean audio
- Consistent tone throughout samples
- No background music or noise
- Diverse sentence structures (questions, statements, emotions)
- Speaker should sound like they want the clone to sound
Upload via: https://elevenlabs.io/voice-lab
Training time: ~30 minutes
Cost: ~$5/month per custom voice
"""
SCRIPT = """
Welcome to our service. I'm here to help you find exactly what you need.
Did you know we offer same-day delivery in your area? That's right—order by 2 PM and it's at your door by evening.
Let me check that availability for you. One moment please... Okay, great news! We have three slots open tomorrow.
Are you looking for something specific, or would you like me to make some recommendations?
I understand this is frustrating. Let me see what I can do to make this right.
Congratulations! Your booking is confirmed. You'll receive a confirmation shortly.
Is there anything else I can help you with today?
"""
Error Handling & Edge Cases
Common Failure Modes
| Issue | Detection | Recovery Strategy |
|---|---|---|
| STT fails/no transcription | 5+ seconds silence with audio | "I'm having trouble hearing you. Could you speak up or call back from a quieter location?" |
| LLM hallucinates | Nonsensical response detected | Interrupt, apologize, and restate: "Let me rephrase that..." |
| User asks "Are you a robot?" | Keyword detection | "I'm an AI assistant helping {company} with scheduling. I'm real in the sense that I can actually book your appointment right now. How can I help?" |
| Angry caller | Sentiment analysis (tone + words) | "I understand you're frustrated. Let me get you to a supervisor right away." → immediate transfer |
| Confusing request | Low confidence on intent | "I want to make sure I help you correctly. Are you looking to [option A] or [option B]?" |
| Background noise | Audio level detection | "It sounds like there's some background noise. Could you move to a quieter spot?" |
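Several of the detections in this table can be approximated with simple per-turn checks. The sketch below covers three of them. The confidence cutoff and the keyword pattern are assumptions to tune against real calls, and `classifyTurn` is a hypothetical helper.

```javascript
// Hypothetical thresholds; tune them against real call recordings.
const SILENCE_LIMIT_MS = 5000; // "5+ seconds silence" from the table
const MIN_CONFIDENCE = 0.6;    // assumed cutoff for a "confusing request"
const ROBOT_QUESTION = /\b(robot|bot|real person|an ai)\b/i;

// Labels a single caller turn so the orchestrator can pick a recovery strategy.
function classifyTurn({ transcript = '', confidence = 1, silenceMs = 0 }) {
  if (!transcript.trim() && silenceMs >= SILENCE_LIMIT_MS) return 'no_transcription';
  if (ROBOT_QUESTION.test(transcript)) return 'identity_question';
  if (confidence < MIN_CONFIDENCE) return 'low_confidence';
  return 'ok';
}

console.log(classifyTurn({ transcript: 'Are you a robot?', confidence: 0.9 }));
console.log(classifyTurn({ transcript: '', silenceMs: 6000 }));
console.log(classifyTurn({ transcript: 'uh the thing', confidence: 0.4 }));
```

Sentiment and background-noise detection need audio-level signals and are left out of this sketch.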
Barge-In Handling
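class BargeInHandler {
  // speak: callback that sends a short phrase to TTS, e.g. (text) => tts.speakResponse(...)
  constructor(twilioWs, streamSid, speak) {
    this.twilioWs = twilioWs;
    this.streamSid = streamSid;
    this.speak = speak;
    this.isSpeaking = false;
    this.interruptThreshold = 0.3;
  }

  onUserSpeechDetected() {
    if (this.isSpeaking) {
      this.stopSpeaking();
      setTimeout(() => {
        this.speak(this.playAcknowledgment());
      }, 200);
    }
  }

  // Twilio's "clear" message drops any audio still buffered for playback.
  stopSpeaking() {
    this.twilioWs.send(JSON.stringify({ event: 'clear', streamSid: this.streamSid }));
    this.isSpeaking = false;
  }

  // Picks a short acknowledgment phrase at random.
  playAcknowledgment() {
    const acks = ['Yes?', 'Go ahead.', "I'm listening."];
    return acks[Math.floor(Math.random() * acks.length)];
  }
}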
Deployment Patterns
Pattern 1: Serverless (Vercel/Netlify Functions)
Best for: Low volume (<1,000 calls/day), simple flows, quick MVP
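export default async function handler(req, res) {
  // Twilio requests the voice webhook with a POST when a call arrives.
  if (req.method !== 'POST') {
    res.status(405).end();
    return;
  }
  // createSession is your own session-store helper.
  const session = await createSession(req.body.CallSid);
  // The function only returns TwiML. The wss:// endpoint below needs a
  // long-lived server, since it holds the media WebSocket for the whole call.
  res.setHeader('Content-Type', 'text/xml');
  res.send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-domain.com/ws" />
  </Connect>
</Response>`);
}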
Pattern 2: Dedicated Server (Docker)
Best for: High volume, complex state management, low latency requirements
# Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
# docker-compose.yml
version: '3.8'
services:
  voice-agent:
    build: .
    ports:
      - "3000:3000"
    environment:
      - TWILIO_ACCOUNT_SID=${TWILIO_ACCOUNT_SID}
      - TWILIO_AUTH_TOKEN=${TWILIO_AUTH_TOKEN}
      - DEEPGRAM_API_KEY=${DEEPGRAM_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ELEVENLABS_API_KEY=${ELEVENLABS_API_KEY}
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1'
          memory: 512M
Pattern 3: Kubernetes (Production Scale)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent
spec:
  replicas: 5
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent
    spec:
      containers:
        - name: voice-agent
          image: your-registry/voice-agent:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          envFrom:
            - secretRef:
                name: voice-agent-secrets
---
apiVersion: v1
kind: Service
metadata:
  name: voice-agent-service
spec:
  selector:
    app: voice-agent
  ports:
    - port: 80
      targetPort: 3000
  type: LoadBalancer
Pricing Calculator
Cost Per Minute Breakdown
| Component | Provider | Cost/Min | Notes |
|---|---|---|---|
| Telephony | Twilio | $0.0085 | Inbound US calls |
| STT | Deepgram Nova-2 | $0.0125 | Assuming ~150 words/min |
| LLM | GPT-4o | $0.015 | ~300 tokens avg response |
| TTS | ElevenLabs | $0.018 | ~180 chars avg response |
| Total | | ~$0.054/min | ~$3.24/hour of talk time |
Monthly Cost Projections
| Volume | Minutes/Month | Est. Cost | Use Case |
|---|---|---|---|
| Startup | 500 | $27 | Testing, 1-2 active clients |
| Small Biz | 3,000 | $162 | Single business inbound line |
| Growing | 15,000 | $810 | Multi-client, moderate volume |
| Scale | 50,000 | $2,700 | High-volume outbound + inbound |
| Enterprise | 200,000 | $10,800 | Call center replacement |
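The projections follow directly from the per-minute breakdown. A minimal sketch for recomputing them with your own rates or volumes follows; the rates are copied from the breakdown table, and `costPerMinute` and `monthlyCost` are hypothetical helper names.

```javascript
// Per-minute rates in USD, copied from the breakdown table above.
const RATES = { telephony: 0.0085, stt: 0.0125, llm: 0.015, tts: 0.018 };

function costPerMinute(rates) {
  return Object.values(rates).reduce((sum, r) => sum + r, 0);
}

// Rounds to whole cents to avoid floating-point noise in the output.
function monthlyCost(minutes, rates = RATES) {
  return Math.round(minutes * costPerMinute(rates) * 100) / 100;
}

for (const minutes of [500, 3000, 15000, 50000, 200000]) {
  console.log(minutes, monthlyCost(minutes));
}
```

Passing a different `rates` object, for example one with a lower `llm` rate for GPT-4o-mini, shows how a single component change moves the monthly total.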
Cost Optimization Tips
- Use GPT-4o-mini for simple flows (roughly 25-33x cheaper per token at the prices listed above, and often sufficient)
- Batch TTS when possible — Cache common phrases (greetings, closings)
- Implement smart call routing — Route simple requests to FAQ handler
- Use Deepgram's tiered pricing — Commit to volume for 20-40% discounts
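The phrase-caching tip can be sketched as a small wrapper around whatever TTS call you use. `PhraseCache` is a hypothetical class, and `synthesize` is an assumed async function of the form `(voiceId, text) => Buffer`, not a real SDK method.

```javascript
// Hypothetical cache wrapper around an injected TTS function.
class PhraseCache {
  constructor(synthesize, maxEntries = 200) {
    this.synthesize = synthesize;
    this.maxEntries = maxEntries;
    this.entries = new Map(); // insertion order doubles as eviction order
    this.hits = 0;
    this.misses = 0;
  }

  async get(voiceId, text) {
    const key = `${voiceId}:${text.trim()}`;
    if (this.entries.has(key)) {
      this.hits += 1;
      return this.entries.get(key);
    }
    this.misses += 1;
    const audio = await this.synthesize(voiceId, text);
    if (this.entries.size >= this.maxEntries) {
      // Evict the oldest entry (first key in insertion order).
      this.entries.delete(this.entries.keys().next().value);
    }
    this.entries.set(key, audio);
    return audio;
  }
}

// Usage with a stub synthesizer that just echoes the text as bytes.
const cache = new PhraseCache(async (voiceId, text) => Buffer.from(text));
cache.get('voice-1', 'Hi, this is Sarah from Apex Dental.')
  .then((audio) => console.log(audio.length, cache.misses));
```

Greetings and closings repeat on almost every call, so they are the phrases most likely to be served from the cache.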
Testing & Monitoring
Load Testing Script
import { VoiceTestClient } from './test-client';

const CONCURRENT_CALLS = 10;
const TEST_DURATION_MINUTES = 5; // reserved for a duration-based loop; unused here

async function runLoadTest() {
  // Start every call before awaiting any of them, so they run concurrently.
  const calls = Array.from({ length: CONCURRENT_CALLS }, () =>
    new VoiceTestClient({
      scenario: 'appointment_booking',
      audioSample: './test-audio/sample-conversation.wav'
    }).run()
  );
  const settled = await Promise.all(calls);
  const results = settled.map(result => ({
    latency: result.avgLatency,
    success: result.bookingCompleted,
    errors: result.errors,
    duration: result.callDuration
  }));
  console.log('Load Test Results:');
  console.log(`Success Rate: ${results.filter(r => r.success).length / results.length * 100}%`);
  console.log(`Avg Latency: ${results.reduce((a, r) => a + r.latency, 0) / results.length}ms`);
}

runLoadTest().catch(console.error);
Key Metrics to Track
const METRICS = {
'stt.latency': 'Time from speech end to transcript',
'llm.latency': 'Time from transcript to response text',
'tts.latency': 'Time from response text to audio playback',
'e2e.latency': 'Total round-trip time',
'stt.word_error_rate': 'Transcription accuracy',
'conversation.completion_rate': 'Successful call completions',
'transfer.rate': 'Percentage transferred to human',
'user.interruption_rate': 'How often users interrupt AI',
'booking.conversion': 'Appointment bookings / total calls',
'lead.qualification_rate': 'Qualified leads / total leads',
'call.duration': 'Average call duration'
};
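Latency metrics are more useful as percentiles than as averages, because a few slow turns dominate how a call feels. The sketch below is a minimal in-memory recorder keyed by the metric names above; `LatencyRecorder` is a hypothetical class, and a production setup would more likely export to a metrics backend.

```javascript
// Hypothetical in-memory recorder keyed by metric name.
class LatencyRecorder {
  constructor() {
    this.samples = new Map();
  }

  record(metric, ms) {
    if (!this.samples.has(metric)) this.samples.set(metric, []);
    this.samples.get(metric).push(ms);
  }

  // Nearest-rank percentile for p in (0, 100]; returns null with no samples.
  percentile(metric, p) {
    const sorted = [...(this.samples.get(metric) || [])].sort((a, b) => a - b);
    if (sorted.length === 0) return null;
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }
}

const rec = new LatencyRecorder();
[620, 710, 540, 980, 650].forEach((ms) => rec.record('e2e.latency', ms));
console.log(rec.percentile('e2e.latency', 50), rec.percentile('e2e.latency', 95));
```

With these five samples the median is 650ms and the 95th percentile is 980ms, which would miss an 800ms target even though the median looks healthy.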
Cross-References
- revenue-website — Integrate voice agents into high-converting landing pages
- lead-machine — Connect voice qualification to your lead pipeline
- ai-company-ops — Deploy voice agents as part of your AI workforce
Quick Start Checklist
Time to first call: 2-4 hours with this skill.