How to Build a Voice AI Agent in 2026: Step-by-Step Tutorial with Architecture, Tools, and Code
Build a production-ready AI voice agent from scratch in 2026. This step-by-step tutorial covers the complete stack: STT with Deepgram, LLM with GPT-4o, TTS with ElevenLabs, telephony with Twilio, and real-time orchestration with Pipecat — plus the no-code path if you want to skip the engineering.
Utkarsh Mohan
Published: Jun 10, 2026

Building a voice AI agent in 2026 is dramatically easier than it was 18 months ago. The open-source and commercial building blocks have matured: Deepgram's Nova-3 STT achieves sub-50ms transcription latency, ElevenLabs Turbo v2.5 streams audio with 75ms first-chunk delivery, GPT-4o's Realtime API enables true speech-to-speech conversation, and orchestration frameworks like Pipecat abstract the complex pipeline wiring into Python that a mid-level developer can understand in an afternoon. This tutorial walks through the complete process of building a voice AI agent, from architecture decisions to production deployment, with working code examples.
Before diving into the code path, a note on scope: the DIY stack is the right choice for teams with specific customization requirements, integration needs that off-the-shelf platforms don't support, or data sovereignty requirements that preclude using a managed platform. For 90% of business use cases — appointment booking, lead qualification, customer service, inbound reception — a platform like Ringlyn AI delivers the same result in one day versus the 2–6 weeks the custom build takes. The last section of this tutorial covers the no-code path for those who want the outcome without the engineering.
What You'll Build: Architecture Overview
The voice AI agent we're building handles inbound phone calls: answers when a phone number is called, listens to the caller, understands their request, generates a response, speaks it back, maintains conversation context, and can call external APIs (CRM, calendar, database) to take actions. The complete architecture:
- Telephony layer: Twilio (or Telnyx) receives the inbound call, converts audio to a WebSocket stream, and delivers it to your server.
- STT layer: Deepgram Nova-3 transcribes the caller's audio in real time, returning word-level transcripts with <50ms latency.
- LLM layer: GPT-4o or Claude 3.7 processes the transcript, maintains conversation history, executes tool calls, and generates a text response.
- TTS layer: ElevenLabs Turbo v2.5 or Cartesia Sonic converts the text response to audio and streams it back to the caller via Twilio.
- Orchestration layer: Pipecat (open-source Python framework) manages the pipeline — VAD (voice activity detection), turn-taking, interruption handling, and component coordination.
- Business logic layer: Tool definitions that allow the LLM to call external APIs — your CRM, calendar, database, or any webhook endpoint.
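The layers above can be pictured as stages in one loop per call. The following is a minimal sketch of that data flow, not the implementation built later in this tutorial: `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for the real services, and a `None` item on the input queue is an assumed end-of-call signal.

```python
import asyncio


# Hypothetical stand-ins for the real STT, LLM, and TTS services.
async def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


async def fake_llm(history: list) -> str:
    return f"You said: {history[-1]['content']}"


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the text is already audio


async def run_call(audio_in: asyncio.Queue, audio_out: asyncio.Queue) -> list:
    """Process one call: audio in -> STT -> LLM -> TTS -> audio out."""
    history = []  # conversation context kept across turns
    while True:
        audio = await audio_in.get()
        if audio is None:  # None marks the end of the call
            await audio_out.put(None)
            return history
        transcript = await fake_stt(audio)
        history.append({"role": "user", "content": transcript})
        reply = await fake_llm(history)
        history.append({"role": "assistant", "content": reply})
        await audio_out.put(await fake_tts(reply))


async def demo() -> list:
    audio_in, audio_out = asyncio.Queue(), asyncio.Queue()
    for utterance in (b"hello", b"book a cleaning", None):
        audio_in.put_nowait(utterance)
    await run_call(audio_in, audio_out)
    spoken = []
    while (chunk := await audio_out.get()) is not None:
        spoken.append(chunk)
    return spoken
```

In a real deployment each stage streams partial results rather than waiting for the previous stage to finish, which is the part an orchestration framework takes care of.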
Two Paths: Custom Build vs No-Code Platform
| Dimension | Custom Build (This Tutorial) | No-Code Platform (Ringlyn AI) |
|---|---|---|
| Time to first call | 2–6 weeks for a production-quality deployment | 1–2 hours from signup to live call |
| Engineering required | Python/Node developer with async experience; GPU/cloud infra knowledge | None — browser-based configuration |
| Customization ceiling | Unlimited — any model, any tool, any prompt | High but bounded — configure prompts, tools, voices; limited custom model swapping |
| Ongoing maintenance | Your team — model updates, API changes, infrastructure management | Platform handles — automatic model updates, infrastructure managed |
| Monthly cost at 1,000 calls | ~$184 in API usage and infrastructure (itemized in the cost table later in this tutorial) + developer time | $49–$99/month flat rate |
| Best for | Unique use cases, data sovereignty requirements, product companies building voice AI into their own SaaS | Businesses deploying voice AI for operations — appointment booking, lead qualification, customer service |
Choose Your Stack: STT, LLM, TTS, and Telephony Options in 2026
| Component | Recommended (Quality + Latency) | Budget Option | Self-Hostable Option |
|---|---|---|---|
| STT | Deepgram Nova-3 ($0.0043/min, <50ms latency) | AssemblyAI Universal-2 ($0.0055/min) | Whisper Large v3 (self-hosted, ~100ms on A10G GPU) |
| LLM | GPT-4o via OpenAI Realtime API (speech-to-speech) | GPT-4o-mini for cost-sensitive applications | Meta Llama 3.1 70B on private GPU infrastructure |
| TTS | ElevenLabs Turbo v2.5 (75ms latency, $0.003/1k chars) | Cartesia Sonic (50ms latency, comparable cost) | Kokoro (open source, ~200ms on A10G GPU) |
| Telephony | Twilio Voice with Media Streams ($0.0085/min) | Telnyx ($0.005/min, lower cost) | FreeSWITCH (self-hosted SIP, minimal per-minute cost) |
| Orchestration | Pipecat (Python, open source, highly recommended) | LiveKit Agents (Python, strong WebRTC support) | Custom asyncio pipeline (most control, most work) |
Voice AI stack options by component, 2026 — cost and latency tradeoffs
Step 1: Set Up Telephony with Twilio Media Streams
Twilio's Media Streams feature delivers inbound call audio to your server as a WebSocket stream, enabling real-time audio processing. Install the Twilio Python helper library and configure a webhook that points to your server when a call arrives:

# requirements: twilio, flask, websockets
from flask import Flask, request, Response
from twilio.twiml.voice_response import VoiceResponse, Connect, Stream

app = Flask(__name__)

@app.route('/incoming-call', methods=['POST'])
def incoming_call():
    # Build TwiML that tells Twilio to open a bidirectional media stream
    response = VoiceResponse()
    connect = Connect()
    # Point Twilio to your WebSocket server
    stream = Stream(url='wss://your-server.com/audio-stream')
    connect.append(stream)
    response.append(connect)
    return Response(str(response), mimetype='text/xml')

Set your Twilio phone number's incoming call webhook to `https://your-server.com/incoming-call`. When a call arrives, Twilio sends an HTTP POST to this endpoint and opens a WebSocket to `/audio-stream` for bidirectional audio.
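Once the WebSocket is open, Twilio sends JSON text messages rather than raw audio: each has an `event` field, and `media` events carry base64-encoded 8 kHz mu-law audio in `media.payload`. The sketch below shows one way to decode inbound messages and to wrap outbound audio in the same envelope. The function names are assumptions for illustration, and `encode_for_twilio` here takes the stream SID explicitly, which differs from the one-argument call used in Step 4.

```python
import base64
import json
from typing import Optional, Tuple


def parse_twilio_message(raw: str) -> Tuple[str, Optional[str], bytes]:
    """Return (event, stream_sid, audio) for one Media Streams message.

    audio holds raw mu-law bytes for "media" events and b"" otherwise.
    """
    msg = json.loads(raw)
    event = msg.get("event", "")
    stream_sid = msg.get("streamSid")  # absent on the initial "connected" event
    audio = b""
    if event == "media":
        audio = base64.b64decode(msg["media"]["payload"])
    return event, stream_sid, audio


def encode_for_twilio(audio: bytes, stream_sid: str) -> str:
    """Wrap outbound mu-law audio in the JSON "media" message Twilio expects."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

A WebSocket handler would call `parse_twilio_message` on every incoming text frame, remember the stream SID from the `start` event, and push the decoded audio onto the queue that feeds the STT step.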
Step 2: Real-Time STT with Deepgram Nova-3
Deepgram's streaming STT API receives audio chunks and returns word-by-word transcripts via WebSocket. The key configuration for voice AI: `interim_results=true` (partial transcripts for low latency), `endpointing=300` (detect when the speaker has paused for 300ms — end of utterance), and `vad_events=true` (voice activity detection):
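import asyncio
import os

from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

# Written against the v3 Deepgram Python SDK; other major versions expose a different interface.
async def transcribe_stream(audio_queue: asyncio.Queue, transcript_queue: asyncio.Queue):
    dg_client = DeepgramClient(api_key=DEEPGRAM_API_KEY)
    options = LiveOptions(
        model="nova-3",
        language="en-US",
        encoding="mulaw",      # Twilio Media Streams deliver 8 kHz mu-law audio
        sample_rate=8000,
        smart_format=True,
        interim_results=True,
        endpointing=300,
        vad_events=True,
    )
    connection = dg_client.listen.asyncwebsocket.v("1")

    # The SDK passes the client itself as the first positional argument
    async def on_transcript(self, result, **kwargs):
        # Only process final transcripts (endpointing triggered)
        if result.speech_final:
            transcript = result.channel.alternatives[0].transcript
            if transcript.strip():
                await transcript_queue.put(transcript)

    # Register the handler, then open the connection
    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await connection.start(options)

    # Feed audio chunks from Twilio WebSocket to Deepgram.
    # asyncio.Queue is not async-iterable, so read it in a loop; None signals end of call.
    while (chunk := await audio_queue.get()) is not None:
        await connection.send(chunk)
    await connection.finish()

Step 3: LLM Reasoning with GPT-4o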
The LLM receives the transcript, maintains the conversation history, and generates a response. Define tools (function calls) for any actions the AI should take — booking appointments, querying a CRM, checking availability. The system prompt is where you define the AI's persona, knowledge base, and behavioral guidelines:
import os
from datetime import date

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

SYSTEM_PROMPT = """
You are Aria, a friendly AI assistant for Acme Dental.
You help patients schedule appointments, answer questions about services,
and handle general inquiries. Be concise: phone conversations should
feel natural, not like reading a webpage.
Today's date: {date}. Available hours: Mon-Fri 8am-5pm, Sat 9am-1pm.
"""

async def get_llm_response(conversation_history: list, tools: list = None):
    """Async generator that yields response text as it streams from the model."""
    # Fill in the {date} placeholder on every call
    system = SYSTEM_PROMPT.format(date=date.today().isoformat())
    extra = {}
    if tools:
        # tool_choice is only valid when tools are supplied
        extra = {"tools": tools, "tool_choice": "auto"}
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system}] + conversation_history,
        stream=True,  # Stream for lower time-to-first-token
        **extra,
    )
    # An async generator cannot return a value, so callers that need the
    # full response should join the yielded pieces themselves.
    async for chunk in response:
        # Some chunks carry no choices or no text (for example tool-call deltas)
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content  # Stream to TTS

Step 4: Low-Latency TTS with ElevenLabs Turbo v2.5
Stream TTS output back to the caller as the LLM generates text — don't wait for the full response before starting audio playback. This reduces perceived latency from 2–3 seconds to under 400ms end-to-end:
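import os

import aiohttp

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]

async def stream_tts_to_twilio(text_stream, twilio_websocket, voice_id: str):
    """Stream ElevenLabs TTS audio chunks directly to Twilio WebSocket"""
    # ulaw_8000 matches the 8 kHz mu-law audio that Twilio Media Streams expect
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream?output_format=ulaw_8000"
    headers = {"xi-api-key": ELEVENLABS_API_KEY, "Content-Type": "application/json"}

    async def speak(session, text):
        async with session.post(url, headers=headers, json={
            "text": text,
            "model_id": "eleven_turbo_v2_5",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
        }) as resp:
            resp.raise_for_status()
            async for audio_chunk in resp.content.iter_chunked(1024):
                # Send audio to caller via Twilio WebSocket.
                # encode_for_twilio is a helper you supply: it must wrap the audio
                # in Twilio's JSON "media" message with a base64 payload.
                await twilio_websocket.send(encode_for_twilio(audio_chunk))

    # Buffer text until a clause or sentence boundary before sending to TTS
    # This improves prosody vs sending word-by-word
    text_buffer = ""
    async with aiohttp.ClientSession() as session:  # reuse one session for the whole response
        async for text_chunk in text_stream:
            text_buffer += text_chunk
            if any(p in text_buffer for p in ['.', '!', '?', ',']):
                await speak(session, text_buffer)
                text_buffer = ""
        if text_buffer.strip():
            await speak(session, text_buffer)  # flush text left over at end of stream

Step 5: Orchestration with Pipecat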
Pipecat (by Daily.co) is the recommended open-source orchestration framework for production voice AI in 2026. It handles the hardest parts: voice activity detection, barge-in/interruption detection (caller speaks while AI is talking), turn-taking logic, and pipeline state management. Instead of writing all the async coordination logic yourself, Pipecat provides pre-built processors you assemble into a pipeline:
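# These import paths and constructor arguments match Pipecat releases from early 2025
# and are an assumption about your installed version. Pipecat's module layout changes
# between releases, so verify them against its documentation.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)

async def build_voice_agent_pipeline(websocket, stream_sid: str):
    # Twilio audio arrives over a WebSocket; the serializer handles Twilio's message format
    transport = FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(
            audio_out_enabled=True,
            add_wav_header=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,  # Pass audio on to STT while VAD runs
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )
    stt = DeepgramSTTService(api_key=DEEPGRAM_API_KEY)
    llm = OpenAILLMService(api_key=OPENAI_API_KEY, model="gpt-4o")
    tts = ElevenLabsTTSService(
        api_key=ELEVENLABS_API_KEY,
        voice_id="your_voice_id",
        model="eleven_turbo_v2_5"
    )
    # The system prompt, conversation history, and tools live in a shared context
    context = OpenAILLMContext(
        messages=[{"role": "system", "content": SYSTEM_PROMPT}],
        tools=YOUR_TOOL_DEFINITIONS,
    )
    context_aggregator = llm.create_context_aggregator(context)
    pipeline = Pipeline([
        transport.input(),               # Twilio audio in
        stt,                             # Speech to text
        context_aggregator.user(),       # Add caller turns to the context
        llm,                             # LLM reasoning
        tts,                             # Text to speech
        transport.output(),              # Audio back to caller
        context_aggregator.assistant(),  # Add agent turns to the context
    ])
    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner(handle_sigint=False).run(task)

Step 6: Context, Memory, and Tool Use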
The most powerful voice AI agents combine conversation memory with tool use — allowing the AI to take actions (book appointments, look up customer records, send confirmations) based on what the caller says. Here's an example tool definition for appointment booking:
import json

BOOKING_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "check_availability",
            "description": "Check available appointment slots for a given date range",
            "parameters": {
                "type": "object",
                "properties": {
                    "date_range_start": {"type": "string", "description": "ISO date string"},
                    "date_range_end": {"type": "string", "description": "ISO date string"},
                    "service_type": {"type": "string", "description": "e.g. cleaning, checkup"}
                },
                "required": ["date_range_start", "service_type"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book an appointment for the caller",
            "parameters": {
                "type": "object",
                "properties": {
                    "datetime": {"type": "string"},
                    "patient_name": {"type": "string"},
                    "phone": {"type": "string"},
                    "service": {"type": "string"}
                },
                "required": ["datetime", "patient_name", "phone", "service"]
            }
        }
    }
]

# calendar_api, crm, and sms are placeholders for your own async client objects
async def execute_tool(tool_name: str, args: dict) -> str:
    if tool_name == "check_availability":
        # Call your calendar API
        slots = await calendar_api.get_available_slots(**args)
        return json.dumps(slots)
    elif tool_name == "book_appointment":
        result = await calendar_api.book(**args)
        # Also trigger CRM update and confirmation SMS
        await crm.create_contact(args["patient_name"], args["phone"])
        await sms.send_confirmation(args["phone"], result["confirmation_number"])
        return f"Booked! Confirmation #{result['confirmation_number']}"
    # Report unknown tools back to the model instead of returning None
    return json.dumps({"error": f"Unknown tool: {tool_name}"})
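The code above produces a tool result, but leaves implicit how that result gets back to the model. With OpenAI's Chat Completions API, the conversation history needs two entries per completed call: the assistant message that requested the tool, and a `tool` message carrying the result under the same call ID. The sketch below builds that pair; the helper name `tool_call_messages` and the example call ID are assumptions for illustration.

```python
import json


def tool_call_messages(call_id: str, name: str, args: dict, result: str) -> list:
    """Build the two history entries that record one completed tool call."""
    return [
        {   # what the model asked for
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": json.dumps(args)},
            }],
        },
        {   # what the tool returned, linked back by tool_call_id
            "role": "tool",
            "tool_call_id": call_id,
            "content": result,
        },
    ]


# Example: record an availability lookup, then ask the model to continue.
history = [{"role": "user", "content": "Do you have a cleaning slot on Friday?"}]
history.extend(tool_call_messages(
    "call_1", "check_availability",
    {"date_range_start": "2026-06-12", "service_type": "cleaning"},
    json.dumps(["2026-06-12T09:00", "2026-06-12T14:30"]),
))
```

After extending the history, a second request to the model lets it turn the raw result into a spoken reply such as offering the caller the two open slots.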
Step 7: Deploying to Production
A production voice AI agent has three deployment requirements: low latency (deploy in the same region as Twilio's media infrastructure, US East or US West), high availability (run on a managed container platform such as Kubernetes or ECS rather than a single EC2 instance), and observability (log every conversation turn, response latency, and tool call result).
- Compute: AWS EC2 c6i.xlarge (4 vCPU, 8 GB RAM) handles ~20 concurrent calls per instance. For 100 concurrent calls, 5 instances behind a load balancer. Estimated cost: ~$280/month at on-demand pricing.
- Region selection: Deploy in us-east-1 (N. Virginia) or us-west-2 (Oregon) to match Twilio's media processing hubs — this alone reduces audio round-trip latency by 50–100ms versus a distant region.
- WebSocket concurrency: Each call holds an open WebSocket connection for the duration of the call. Use uvicorn or hypercorn with asyncio for high connection concurrency.
- Monitoring: Track end-to-end latency (time from STT speech_final to first TTS audio byte) per call. Alert when median latency exceeds 500ms — this indicates a component is degraded.
- Graceful degradation: If ElevenLabs TTS latency spikes above 300ms, automatically switch to Cartesia Sonic. If GPT-4o latency spikes, fall back to GPT-4o-mini. Circuit breaker patterns prevent cascade failures.
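The monitoring and graceful-degradation points above can be combined into a small selector that watches recent latency and switches providers when the median crosses a threshold. This is a minimal sketch of a rolling-median fallback, not a full circuit breaker: the class name, window size, and provider labels are assumptions, and the 300ms threshold is taken from the ElevenLabs example in the list.

```python
from collections import deque
from statistics import median


class LatencyFallback:
    """Use the primary provider unless its recent median latency is too high."""

    def __init__(self, primary: str, fallback: str, threshold_ms: float, window: int = 20):
        self.primary = primary
        self.fallback = fallback
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # rolling window of primary latencies

    def record(self, latency_ms: float) -> None:
        """Record one observed latency for the primary provider."""
        self.samples.append(latency_ms)

    def choose(self) -> str:
        """Return the provider to use for the next request."""
        if self.samples and median(self.samples) > self.threshold_ms:
            return self.fallback
        return self.primary


tts_selector = LatencyFallback("elevenlabs", "cartesia", threshold_ms=300, window=3)
for observed_ms in (120, 350, 400):
    tts_selector.record(observed_ms)
provider = tts_selector.choose()  # median of the window is 350, above the threshold
```

Because the window only fills with primary-provider samples, a real deployment would keep sending occasional probe requests to the primary while on the fallback, so the selector can switch back once latency recovers.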
The No-Code Path: Ringlyn AI for Non-Engineers
If you want an AI voice agent handling your business calls without writing any of the above code, Ringlyn AI provides all of this functionality through a no-code configuration interface. You configure the AI's persona and knowledge base in a text editor, connect your CRM and calendar via pre-built integrations, and go live with a production-grade voice agent in hours rather than weeks. The underlying infrastructure is the same stack described above (Deepgram, ElevenLabs, GPT-4o, Twilio) — Ringlyn AI simply handles the orchestration, maintenance, and scaling for you.
The specific case for using a platform rather than building: if your use case is standard business voice AI (appointment booking, lead qualification, customer service, after-hours answering), the platform delivers identical outcomes in 1/20th the time. Build your own stack when you have use cases that standard platforms genuinely cannot support — specific model requirements, unusual integration needs, or data sovereignty requirements that preclude any managed service.
Cost at Scale: Budget Your Voice AI Deployment
| Component | Cost at 1,000 calls/month (3 min avg) | Cost at 10,000 calls/month |
|---|---|---|
| Deepgram STT (Nova-3) | $0.0043/min × 3,000 min = $12.90 | $129 |
| GPT-4o (LLM) | ~$0.02/min avg token cost × 3,000 min = $60 | $600 |
| ElevenLabs TTS (Turbo v2.5) | ~$0.01/min TTS cost × 3,000 min = $30 | $300 |
| Twilio telephony | $0.0085/min × 3,000 min = $25.50 | $255 |
| Compute (EC2 c6i.xlarge) | $56/month base | $280 (5 instances) |
| Total custom build cost | ~$184/month | ~$1,564/month |
| Ringlyn AI flat rate | $49–$99/month | $199/month (Pro plan) |
Voice AI cost comparison: custom build vs Ringlyn AI platform at 1,000 and 10,000 calls/month
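The custom-build totals in the table follow directly from its per-minute rates, so they are easy to recompute for other call volumes. The sketch below copies the rates and the $56 per-instance compute figure from the table; the function name and dictionary keys are illustrative, and the instance count is passed in by hand rather than derived from load.

```python
# Per-minute rates (USD) copied from the cost table above.
PER_MINUTE_USD = {
    "deepgram_stt": 0.0043,
    "gpt4o_llm": 0.02,
    "elevenlabs_tts": 0.01,
    "twilio": 0.0085,
}
INSTANCE_USD_PER_MONTH = 56.0  # one EC2 c6i.xlarge, per the table


def monthly_cost(calls: int, avg_minutes: float, instances: int) -> float:
    """Estimate monthly spend for a custom build at a given call volume."""
    minutes = calls * avg_minutes
    usage = sum(rate * minutes for rate in PER_MINUTE_USD.values())
    return round(usage + instances * INSTANCE_USD_PER_MONTH, 2)


small = monthly_cost(calls=1_000, avg_minutes=3, instances=1)
large = monthly_cost(calls=10_000, avg_minutes=3, instances=5)
```

With these inputs the function lands on the table's totals of roughly $184 and $1,564 per month, which works out to about six cents per call minute at the smaller volume.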
Frequently Asked Questions
The production-proven stack in 2026: Deepgram Nova-3 for STT (sub-50ms latency), GPT-4o via OpenAI API for LLM reasoning (or Claude 3.7 Sonnet for better instruction following), ElevenLabs Turbo v2.5 for TTS (75ms first chunk), Twilio Media Streams for telephony (most widely used and best documented), and Pipecat for orchestration (handles the hardest parts: VAD, barge-in, turn-taking). For lowest possible latency, replace the STT+LLM+TTS pipeline with OpenAI's Realtime API (speech-to-speech), which achieves under 200ms end-to-end.
A proof-of-concept voice AI agent using Pipecat with Deepgram + GPT-4o + ElevenLabs + Twilio can be built in 2–4 days by a developer with Python asyncio experience. A production-quality deployment with proper error handling, observability, graceful degradation, auto-scaling, and integration with a CRM or calendar system takes 2–6 weeks. If you're building a one-off business use case rather than a product, the cost of that build time usually outweighs the benefit of building your own stack over using a managed platform.
Core tools in 2026: Pipecat (Python orchestration framework — open source), Deepgram SDK (STT), OpenAI Python SDK (LLM), ElevenLabs Python SDK (TTS), Twilio Python Helper Library (telephony), FastAPI or Starlette (WebSocket server), Docker (containerization), and AWS ECS or EC2 (compute). For infrastructure as code: Terraform or AWS CDK. For monitoring: Datadog or Grafana with Prometheus. The Pipecat documentation at docs.pipecat.ai is the best starting point — it includes working examples for Twilio + Deepgram + ElevenLabs.
Tool use in voice AI works through OpenAI's function calling API. Define your tools as JSON schema objects describing the function name, description, and parameters, and pass the tool definitions in the LLM API call. When the LLM decides to call a tool, it returns a response containing `tool_calls` instead of text. Your orchestration layer intercepts this, calls the actual function (your CRM API, calendar API, database query), and passes the result back to the LLM for the next turn. Pipecat's LLMService handles this tool-call intercept automatically when you define function handlers in your pipeline configuration.
The lowest-cost custom build uses: Whisper Large v3 (self-hosted on a $0.30/hr spot GPU instance) for STT, Meta Llama 3.1 8B (self-hosted) for LLM, Kokoro (open source, self-hosted) for TTS, and Telnyx (cheaper per minute than Twilio) for telephony. This stack costs approximately $0.02–$0.05 per minute at scale versus $0.04–$0.08 for the commercial stack. The trade-offs: worse latency (self-hosted models are slower than cloud APIs unless you have dedicated GPU), more maintenance, and significantly more engineering to achieve production reliability. For most businesses, Ringlyn AI's $49/month Starter plan is more cost-effective than self-hosting everything once engineering time is factored in.