TL;DR
A SIP trunk connects your PBX to an external carrier. A SIP bridge registers directly on your PBX as a local extension. For voice AI, the bridge approach eliminates Twilio's roughly $0.02/min overhead, replaces the 95-115ms Twilio relay with 15-50ms of transport latency per audio direction, and sets up in 10 minutes instead of days. This is how we dropped response times below 500ms and cut telephony costs by 90%.
Two Ways to Connect AI to a Phone System
Every voice AI platform needs to get audio from a phone call into its processing pipeline (speech-to-text, language model, text-to-speech) and back out again. There are two different ways to do this, and they're not equivalent.
Option 1: SIP Trunk Routing. The standard approach. Your PBX sends calls over a SIP trunk to a carrier (usually Twilio), which converts the audio into a WebSocket stream and sends it to the AI. Twilio sits in the middle of every call, handling the telephony layer.
Option 2: Direct SIP Bridge. The AI registers directly on your PBX as a SIP extension, like a remote agent's phone. Audio flows via RTP from the PBX to the AI with no intermediary. No Twilio, no carrier relay, no middleware.
Most voice AI platforms use Option 1 because it's easier to build. VAPI, Retell, and Bland all route through Twilio (or similar carriers) for their telephony. But Option 1 comes with costs and latency that Option 2 eliminates.
How the SIP Trunk Approach Works
When a voice AI platform uses Twilio for telephony, here's the actual audio path:
Caller → Carrier Network → Twilio Media Edge → Twilio Core → WebSocket → AI Server
Each hop adds processing time. Twilio's own engineering team published these numbers in their latency guide (November 2025):
| Component | Latency Added |
|---|---|
| Audio transmission to Twilio media edge | ~40ms |
| Buffering at media edge | ~30ms |
| Decoding (mulaw to PCM or passthrough) | ~25ms |
| WebSocket transmission to AI server | ~10-20ms |
| Total Twilio overhead (before AI processing starts) | ~95-115ms |
That's 95-115ms of delay before your AI even begins processing the audio. And this happens in both directions -- caller-to-AI and AI-to-caller. So the round-trip overhead is roughly 190-230ms just from Twilio's relay.
On top of the latency, Twilio charges for every component:
| Twilio Component | Cost |
|---|---|
| Outbound voice (US local) | $0.014/min |
| Media Streams (WebSocket relay) | $0.004/min |
| Call recording | $0.0025/min |
| Total per minute | $0.0205/min |
At 50,000 minutes per month (for example, 100,000 calls averaging 30 seconds each), that's $1,025/month in Twilio costs alone, before your AI stack costs a penny.
How the SIP Bridge Approach Works
A SIP bridge skips Twilio entirely. The AI registers on your PBX (Asterisk, VICIdial, FreePBX) as a SIP extension. From the PBX's perspective, the AI is just another phone.
The audio path:
Caller → Carrier Network → Your PBX → Direct RTP → AI Bridge → AI Pipeline
Two fewer hops. No Twilio media edge, no Twilio core, no WebSocket relay. Audio arrives as raw RTP packets (mulaw, 8kHz, 20ms frames) directly from the PBX to the bridge.
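To make "raw RTP packets (mulaw, 8kHz, 20ms frames)" concrete, here is a minimal sketch of parsing one such datagram. It assumes the standard 12-byte RTP header and the static payload type 0 for G.711 mu-law. The `RtpPacket` and `parse_rtp` names are illustrative and not taken from our bridge.

```python
import struct
from dataclasses import dataclass

RTP_HEADER_LEN = 12      # fixed RTP header size in bytes
PCMU_PAYLOAD_TYPE = 0    # static payload type for G.711 mu-law
SAMPLES_PER_FRAME = 160  # 20 ms at 8 kHz, one byte per mu-law sample


@dataclass
class RtpPacket:
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    payload: bytes


def parse_rtp(datagram: bytes) -> RtpPacket:
    """Parse one RTP datagram. Header extensions and padding are not handled."""
    if len(datagram) < RTP_HEADER_LEN:
        raise ValueError("datagram shorter than the RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", datagram[:RTP_HEADER_LEN])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    if b0 & 0x10:
        raise ValueError("header extensions are not handled in this sketch")
    csrc_count = b0 & 0x0F                       # optional 4-byte CSRC entries
    offset = RTP_HEADER_LEN + 4 * csrc_count
    return RtpPacket(b1 & 0x7F, seq, ts, ssrc, datagram[offset:])


# Build a synthetic 20 ms PCMU packet and parse it back.
header = struct.pack("!BBHII", 0x80, PCMU_PAYLOAD_TYPE, 1, 160, 0x1234)
packet = parse_rtp(header + b"\xff" * SAMPLES_PER_FRAME)
print(packet.payload_type, packet.sequence, len(packet.payload))
```

Each 20ms frame carries 160 one-byte samples, so a call produces 50 packets per second in each direction.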
Measured latency:
| Component | Latency |
|---|---|
| PBX to bridge (same network or nearby VM) | ~10-40ms |
| Bridge to AI pipeline (WebSocket, local) | ~5-10ms |
| Total bridge overhead | ~15-50ms |
Compare that to Twilio's 95-115ms. Matching the low ends and the high ends of the two ranges, the bridge saves roughly 65-80ms per audio direction, or 130-160ms round-trip.
Cost:
| Component | Cost |
|---|---|
| GCP VM running the bridge | ~$0.002/min (amortized) |
| Twilio charges | $0.00 |
| Total per minute | ~$0.002/min |
That's a 90% reduction in telephony costs. At 50,000 minutes per month (for example, 100,000 calls at 30 seconds average), the bridge costs about $100/month vs $1,025/month for Twilio.
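The cost comparison is simple arithmetic, sketched below using the per-minute rates from the two cost tables. The volume of 50,000 minutes per month is an assumption: it is the volume at which those rates give $1,025 and $100. The function and variable names are illustrative.

```python
# Per-minute rates taken from the cost tables above (USD).
TWILIO_RATES = {
    "outbound_voice": 0.014,
    "media_streams": 0.004,
    "call_recording": 0.0025,
}
BRIDGE_RATE = 0.002  # amortized VM cost per minute


def monthly_cost(rate_per_minute: float, minutes: int) -> float:
    """Monthly telephony cost for a given per-minute rate and volume."""
    return rate_per_minute * minutes


minutes_per_month = 50_000  # assumed volume
twilio_rate = sum(TWILIO_RATES.values())
twilio_monthly = monthly_cost(twilio_rate, minutes_per_month)
bridge_monthly = monthly_cost(BRIDGE_RATE, minutes_per_month)
reduction = 1 - BRIDGE_RATE / twilio_rate

print(f"Twilio: ${twilio_monthly:.2f}/month")
print(f"Bridge: ${bridge_monthly:.2f}/month")
print(f"Reduction: {reduction:.0%}")
```

Because both costs scale linearly with minutes, the percentage reduction stays the same at any volume. Only the absolute dollar savings change.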
Why the Latency Difference Matters
In regular phone calls, 100ms of extra delay is barely noticeable. But voice AI has a unique problem: the audio must travel through three processing steps before a response plays back.
Caller speaks → STT (speech-to-text) → LLM (language model) → TTS (text-to-speech) → Response plays
Each step takes time. Published benchmarks from voice AI platforms:
| Step | Typical Latency |
|---|---|
| STT (Deepgram Nova) | 200-350ms |
| LLM (GPT-4o-mini) | 200-400ms |
| TTS (Cartesia) | 80-150ms |
| AI pipeline total | 480-900ms |
Now add the transport layer:
| Transport | Overhead | Total Response Time |
|---|---|---|
| Twilio relay | 190-230ms round-trip | 670-1,130ms |
| Direct RTP bridge | 30-100ms round-trip | 510-1,000ms |
The difference between 670ms and 510ms might not sound like much. But human conversational turn-taking averages about 200ms. Anything above 500ms starts to feel like a delay. Above 800ms, the conversation feels broken -- like talking to someone on a bad satellite connection.
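The totals in the transport table come from adding the pipeline ranges to the round-trip transport ranges. A small sketch of that budget follows, using the figures from the tables above and the 500ms and 800ms thresholds just described. The labels returned by `feel` are shorthand chosen for this example.

```python
# (low, high) latency ranges in milliseconds, from the tables above.
PIPELINE_MS = {"stt": (200, 350), "llm": (200, 400), "tts": (80, 150)}
TRANSPORT_ROUND_TRIP_MS = {
    "twilio_relay": (190, 230),
    "direct_rtp_bridge": (30, 100),
}


def add_ranges(*ranges):
    """Add (low, high) ranges end to end."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


def feel(total_ms: int) -> str:
    """Classify a response time using the thresholds in the text."""
    if total_ms <= 500:
        return "natural"
    if total_ms <= 800:
        return "noticeable delay"
    return "broken"


pipeline = add_ranges(*PIPELINE_MS.values())
totals = {
    name: add_ranges(pipeline, transport)
    for name, transport in TRANSPORT_ROUND_TRIP_MS.items()
}
for name, (low, high) in totals.items():
    print(f"{name}: {low}-{high}ms ({feel(low)} to {feel(high)})")
```

This treats the stages as strictly sequential, which is how the tables add them up.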
Published response times for Twilio-based platforms:
| Platform | Average Response Time | Notes |
|---|---|---|
| Retell AI | ~600ms | Their own published benchmarks |
| VAPI | ~700ms (465ms claimed optimal) | Real-world reports of 1.5s+ with default settings |
| Bland AI | ~800ms | User reports of inconsistency |
| Twilio ConversationRelay | ~500ms median | Twilio's managed AI orchestration ($0.07/min extra) |
With the SIP bridge approach, the best case with the figures above is about 510ms, so sub-500ms comes within reach of a slightly faster pipeline because most of the transport overhead is gone. The AI pipeline is the same: same STT, same LLM, same TTS. The difference is what happens before and after the pipeline.
The WebSocket vs RTP Problem
Beyond raw latency, there's a protocol-level difference that matters for real-time audio.
Twilio uses WebSocket (TCP) to relay audio. TCP guarantees that all data arrives in order. That sounds good, but for real-time audio it creates a problem called head-of-line blocking: if one packet is lost in transit, TCP holds all subsequent data until the lost packet is retransmitted. That retransmission takes at least one network round trip, and roughly 100ms is a reasonable working figure. During that time, the audio stream stalls.
On a stable network, this rarely matters. But phone calls aren't always on stable networks. Cell phone connections, VoIP over WiFi, and congested call center networks all have packet loss. When a Twilio WebSocket hits a lost packet, the AI hears silence for 100ms+ while TCP recovers.
Direct RTP uses UDP. UDP doesn't guarantee ordered delivery. If a packet is lost, it's gone -- the stream continues without it. For voice, this is a better tradeoff. A single lost 20ms audio frame is imperceptible. A 100ms stall while TCP retransmits is noticeable.
This is why WebRTC (the technology behind browser-based video calls) prefers UDP for media, not TCP. Real-time audio needs "good enough, right now" over "perfect, eventually."
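The difference can be illustrated with a toy model of when each 20ms frame reaches the application. This is a deliberately simplified sketch, not a model of real TCP: it ignores fast retransmit, timers, and congestion control. The 10ms one-way delay is an arbitrary assumption, and the 100ms retransmission delay is the rough figure used above.

```python
FRAME_MS = 20  # one audio frame is sent every 20 ms


def delivery_times(num_frames, lost, one_way_ms, retransmit_ms, in_order):
    """Return when each frame reaches the application (None if dropped)."""
    times = []
    latest = 0
    for i in range(num_frames):
        arrival = i * FRAME_MS + one_way_ms
        if i in lost:
            if not in_order:
                times.append(None)        # UDP-like: the frame is simply gone
                continue
            arrival += retransmit_ms      # TCP-like: wait for the retransmission
        if in_order:
            arrival = max(arrival, latest)  # held behind any earlier frame
            latest = arrival
        times.append(arrival)
    return times


def max_gap(times):
    """Longest pause between consecutive delivered frames, in ms."""
    delivered = [t for t in times if t is not None]
    return max(b - a for a, b in zip(delivered, delivered[1:]))


tcp_like = delivery_times(10, {2}, one_way_ms=10, retransmit_ms=100, in_order=True)
udp_like = delivery_times(10, {2}, one_way_ms=10, retransmit_ms=100, in_order=False)
print("TCP-like:", tcp_like, "max gap", max_gap(tcp_like))
print("UDP-like:", udp_like, "max gap", max_gap(udp_like))
```

In the ordered case, every frame sent after the lost one is held until the retransmission arrives, then delivered in a burst. In the unordered case, the stream skips one frame and carries on.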
Setup Comparison
| Step | SIP Trunk (Twilio) | SIP Bridge |
|---|---|---|
| Account setup | Create Twilio account, verify business, get SID/token | Get SIP credentials from AI platform |
| Phone numbers | Buy/port numbers on Twilio | Use your existing numbers (no change) |
| SIP configuration | Configure SIP trunk in Twilio + PBX carrier settings + dial plan | Add one SIP extension in PBX admin panel |
| Audio routing | Configure Twilio Media Streams webhook + WebSocket endpoint | Bridge auto-connects on SIP registration |
| Firewall rules | Allow Twilio IP ranges (dozens of IPs) | Allow one IP (the bridge VM) |
| Testing | End-to-end through Twilio's infrastructure | Direct call test |
| Typical time | 4-8 hours (experienced), days-weeks (first time) | 10 minutes |
The bridge setup is identical to adding a remote agent to your PBX. If your VICIdial admin has added a remote agent before, they can add the AI bridge.
When SIP Trunk Routing Makes Sense
The bridge approach isn't always the answer. SIP trunk routing is better when:
- You don't run your own PBX. If you use a cloud phone system (RingCentral, Vonage, etc.) that doesn't allow SIP extension registration, you need a trunk.
- You need Twilio-specific features. Call recording transcription, Studio flows, Flex contact center -- these require Twilio in the path.
- You're building a developer-facing product. If your customers are developers who want an API to build with, the Twilio-based approach is more familiar.
The bridge approach wins when:
- You run VICIdial, Asterisk, FreePBX, or any SIP-capable PBX. You have full control over extensions.
- Latency matters, as in BPO pre-qualification calls where the AI needs to sound human, not robotic.
- Cost matters at scale. A $0.02/min saving compounds fast at 50,000+ minutes/month.
- Setup speed matters. 10 minutes vs days.
What We Built
Our SIP bridge is roughly 1,000 lines of Python. It handles SIP REGISTER authentication (digest auth with qop and cnonce), receives RTP audio (mulaw 8kHz, 20ms frames), and converts it to a format compatible with our AI pipeline. The AI pipeline itself -- running on Cloudflare Workers -- doesn't know or care whether the audio came from Twilio or the bridge. The bridge just swaps the transport layer.
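For readers unfamiliar with digest authentication, the sketch below shows how a client computes the `response` value when the server offers `qop=auth`. It follows the MD5 scheme from RFC 2617, which SIP reuses. It is not our bridge code, and the username, realm, password, nonce, and URI in the usage example are hypothetical.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the digest 'response' field for qop=auth."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # identity hash
    ha2 = md5_hex(f"{method}:{uri}")                 # request hash
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


# Hypothetical values for a REGISTER challenge from a PBX.
response = digest_response(
    username="ai-bridge",
    realm="asterisk",
    password="example-secret",
    method="REGISTER",
    uri="sip:pbx.example.com",
    nonce="abc123",        # sent by the PBX in its 401 challenge
    nc="00000001",         # nonce count, incremented on each reuse of the nonce
    cnonce="0a4f113b",     # client-chosen random value
)
print(response)
```

The client sends this value in the `Authorization` header of a second REGISTER, along with the `nc` and `cnonce` it used. The function can be checked against the worked example in RFC 2617.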
We run it on a GCP e2-standard-2 VM ($48/month). It handles 50+ concurrent calls. The total per-minute cost for the bridge infrastructure is about $0.002 -- compared to $0.0205 through Twilio.
We're not sharing the code (it's our competitive moat), but the architecture is straightforward: SIP registration on the client's PBX, RTP receive threads, audio buffering, WebSocket output to the AI pipeline.
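As one illustration of the conversion step between the RTP side and the pipeline side, here is a sketch of decoding a G.711 mu-law payload into 16-bit linear PCM. The little-endian 16-bit PCM output is an assumption made for this example, and the function names are illustrative rather than taken from our bridge.

```python
import struct


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once so per-frame decoding is a table lookup.
DECODE_TABLE = [ulaw_to_pcm16(b) for b in range(256)]


def decode_frame(payload: bytes) -> bytes:
    """Convert a mu-law RTP payload to little-endian 16-bit PCM bytes."""
    samples = [DECODE_TABLE[b] for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


silence = decode_frame(b"\xff" * 160)  # 0xFF is mu-law silence
print(len(silence))                    # two bytes per sample
```

A 160-byte frame (20ms at 8kHz) becomes 320 bytes of PCM. Any resampling the pipeline needs would happen after this step.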
FAQ
Can I use this with VICIdial? Yes. VICIdial runs on Asterisk, which supports SIP extension registration natively. The bridge registers as an extension and receives calls from VICIdial's predictive dialer. Full guide: How to Add AI Agents to VICIdial.
Does this work with Trackdrive, Five9, or other dialers? Yes, it works with any dialer that supports SIP trunking or SIP extension registration. The bridge is protocol-level, not platform-specific.
What about call recording? The bridge records calls on dual channels (AI and caller separately) and stores recordings directly. No Twilio recording charges.
What about STIR/SHAKEN caller ID? STIR/SHAKEN is handled by your carriers, not by Twilio or the bridge. Your existing carrier authentication continues to work. The bridge doesn't touch the signaling path for outbound calls -- it only handles inbound answered calls routed to it by the PBX.
How much does it cost? The bridge itself has near-zero per-minute cost (~$0.002/min infrastructure). The AI processing on top is $0.10-0.15/min all-in. Compare that to Twilio-based platforms that charge $0.13-0.28/min after you add all the components.
Last updated: March 22, 2026. By Ansh Deb, Founder & CEO of Klariqo. We built a SIP bridge that registers as an agent on VICIdial and eliminated Twilio from our voice AI stack. Questions? [email protected]