TL;DR
A voice AI demo runs in ideal conditions: clean audio, a browser mic, one happy path. A real call runs over the public phone network, through carriers, codecs, and a dialer, and that is where most agents fall apart. The failures cluster in the telephony layer, not the model: latency that stacks past what callers tolerate, audio degraded by phone codecs, voicemail the agent talks to anyway, and warm transfers that lose the context. The fix is architectural, not a better prompt: you shorten the call path, handle the telephony events the demo never shows, and integrate with the dialer the team already runs.
A voice AI demo and a real phone call are not the same thing, and the gap between them is where most deployments quietly die. In a demo you speak into a laptop, the audio is clean, the path is short, and the script follows one route. It feels like magic. Then the same agent gets pointed at real outbound calls through a carrier and a dialer, and the magic turns into three seconds of silence, a confident conversation with a voicemail, and a transfer that hands a human a caller with no context.
None of that is the AI being dumb. It is the telephony layer the demo never had to deal with. Here is what actually changes between the demo and the call, and what it takes to survive the difference.
The demo is the easy 5%
A demo is a controlled environment. Clean speech, a quiet room, a direct browser connection, and a prompt path the builder already walked ten times. Under those conditions almost any modern voice stack performs well, which is exactly why demos sell and production disappoints.
The other 95% is the real call: a caller on a mobile in a moving car, a number that has to connect in-country, a carrier in the middle, an answering machine half the time, and a human who eventually needs the lead handed to them cleanly. The demo proves the model can talk. It proves almost nothing about whether the system can hold a real conversation over a real phone line.
Latency: the budget is gone before the model speaks
The single most common production failure is lag. Callers start to feel a pause as unnatural at around 1.5 seconds, and drop-off climbs sharply past 3 seconds (Hamming AI). Yet many production voice stacks land at 6 to 8 seconds of response time, and agentic flows with multiple tool calls run longer still.
The reason is architecture. A typical stack chains speech-to-text, the language model, and text-to-speech as separate services, often in different data centres, with the phone call routed through yet another network on top. Every hop spends part of the latency budget before the model generates a single token. By the time you have stitched the vendors together, the round trips between them, not the AI itself, are what the caller hears.
You cannot prompt your way out of that. The fix is to shorten the path: collapse the hops and keep the work close to the call instead of bouncing the audio across the internet between vendors (Pure IP).
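To make the budget concrete, here is a minimal sketch that adds up per-hop delays and compares the total with the 1.5 and 3 second thresholds quoted above. The function name and the stage names in the usage note are hypothetical, and any numbers you feed it should come from your own measurements; nothing here is a benchmark.

```python
# Thresholds taken from the figures quoted in this article (in milliseconds).
UNNATURAL_MS = 1500   # pauses start to feel unnatural around here
DROP_OFF_MS = 3000    # drop-off climbs sharply past this point


def budget_report(stages_ms: dict[str, int]) -> tuple[int, str]:
    """Sum the delay of every hop in the call path and label the result.

    `stages_ms` maps a stage name (hypothetical, e.g. "stt" or
    "carrier_to_relay") to the delay you measured for it.
    """
    total = sum(stages_ms.values())
    if total < UNNATURAL_MS:
        verdict = "natural"
    elif total <= DROP_OFF_MS:
        verdict = "noticeable"
    else:
        verdict = "drop-off risk"
    return total, verdict
```

Feed it one entry per hop, network legs included, for example `budget_report({"carrier_to_relay": 150, "stt": 600, "llm_first_token": 900})` with your own measured values. Listing the hops separately is the point: it shows how much of the budget is spent on transport between vendors rather than on the model.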
The audio your model hears is not the audio you tested
In the demo your model hears wideband studio-grade audio straight from a laptop mic. On a real call it hears something far worse. The moment a call crosses the public phone network it is compressed into a narrowband telephony codec built decades ago, roughly 300 to 3400 Hz, and most of the frequency range is thrown away before your model ever receives the signal.
That degradation is why speech-to-text that aced your test set suddenly mangles a spelled-out email or a string of digits on real calls. The carrier has already deleted the part of the sound that separates an F from an S. Most voice models are trained on clean, neutral, studio-quality speech, so they cope poorly with compressed, accented, noisy, real-world audio. The model did not get worse. The audio did.
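One way to close that gap in testing is to degrade your clean recordings before they reach speech-to-text. The sketch below is a rough approximation, not a real codec: it assumes 16 kHz float samples in the range -1 to 1, halves the sample rate by averaging pairs (a crude low-pass), and applies continuous mu-law companding at 8 bits. Real G.711 uses a segmented encoding and a proper anti-aliasing filter, and the function names are hypothetical.

```python
import math

MU = 255  # companding constant used by mu-law telephony encoding


def mulaw_roundtrip(x: float) -> float:
    """Compress one sample to an 8-bit mu-law code, then expand it again."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    code = round((compressed + 1) / 2 * 255)   # 8-bit code, 0..255
    y = code / 255 * 2 - 1                     # back to -1..1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def to_telephone_quality(samples_16k: list[float]) -> list[float]:
    """Approximate a phone line: 16 kHz -> 8 kHz, then 8-bit mu-law."""
    # Averaging neighbouring samples is a crude low-pass plus 2:1 decimation.
    pairs = zip(samples_16k[0::2], samples_16k[1::2])
    return [mulaw_roundtrip((a + b) / 2) for a, b in pairs]
```

Running your existing test set through something like this, and ideally through real recorded calls as well, tells you how the recogniser behaves on audio closer to what it will hear in production.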
The phone call has parts the demo never shows
Beyond latency and audio, a real outbound call contains a whole category of events a browser demo simply does not have:
Voicemail. A large share of outbound calls reach an answering machine. Without proper answering machine detection, the agent cheerfully qualifies a voicemail greeting and burns minutes on nobody. Transcription-based detection is far more accurate than the energy-and-silence heuristics bundled with most telephony providers, and it should hang up in a few seconds, not after a full monologue.
Interruptions. Real people talk over the agent. If the system cannot detect barge-in and stop talking instantly, it feels like a robot reading a script, and the caller hangs up.
The warm transfer. This is the classic breaking point. The moment the AI hands a qualified caller to a human is where automated systems historically fail: the context vanishes and the caller has to repeat everything. A transfer that works carries a summary of the conversation to the human, so the agent picks up a lead they already understand instead of a cold, confused caller.
Connectivity itself. Numbers that fail to connect in-country, dropped calls, and calls flagged by the carrier before the agent ever speaks. None of this exists in a demo. All of it decides whether a campaign works.
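As an illustration of the voicemail case, transcription-based detection can be sketched as a check on the first few seconds of transcript. This is a minimal sketch under stated assumptions: the phrase list, the four-second deadline, and the function name are all hypothetical, and a production detector would need a far richer phrase set or a trained classifier.

```python
# Hypothetical phrases that tend to appear in voicemail greetings.
VOICEMAIL_PHRASES = (
    "leave a message",
    "after the tone",
    "after the beep",
    "not available",
    "voicemail",
    "mailbox",
)


def classify_pickup(transcript: str, elapsed_s: float, deadline_s: float = 4.0) -> str:
    """Return "machine", "human", or "undecided" for the start of a call.

    `transcript` is whatever speech-to-text has produced so far and
    `elapsed_s` is the time since the call was answered.
    """
    text = transcript.lower()
    if any(phrase in text for phrase in VOICEMAIL_PHRASES):
        return "machine"    # hang up now instead of pitching a greeting
    if elapsed_s >= deadline_s:
        return "human"      # fail open: never hang up on a person by default
    return "undecided"      # keep listening until the deadline
```

Failing open at the deadline is a design choice: wasting a few seconds on an unrecognised greeting costs less than hanging up on a live caller.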
Why this is a telephony problem, not an AI problem
Look at that list again: latency from too many network hops, audio wrecked by carrier codecs, voicemail detection, barge-in, warm transfer, in-country connectivity. Not one of them is solved by a smarter language model. They are solved by understanding the phone network, the codecs, the SIP signalling, and the dialer.
That is the gap between voice AI and telephony, and it is wide. The teams sharp enough to build a great agent are often the least interested in the unglamorous plumbing underneath the call. So the model keeps getting better while the call keeps falling over, and the demo keeps looking nothing like production.
| | The demo | The real call |
|---|---|---|
| Audio | Clean, wideband, laptop mic | Narrowband, carrier-compressed, noisy |
| Path | Short, direct, one network | Caller, carrier, dialer, multiple hops |
| Latency | Sub-second | Often 6 to 8 seconds if the path is left unoptimised |
| Voicemail | Never happens | Half the calls, needs detection |
| Interruptions | Scripted, one path | Constant, needs barge-in |
| Handoff | Not shown | Warm transfer with context, or it breaks |
What actually fixes it
The pattern across all of it is the same: own more of the call path and handle the telephony, instead of treating the phone call as a solved API call.
In practice that means connecting directly over SIP into the dialer the team already runs, with no extra telephony relay sitting in the path adding latency and cost. It means real answering machine detection so the agent only spends time on live humans, barge-in so the conversation feels natural, and a warm transfer that hands the human a qualified lead with a summary attached. And it means running close to the call so the audio is not making a world tour between vendors before anyone speaks.
This is the approach Klariqo is built on: a self-hosted SIP bridge that registers straight into VICIdial and other dialers as a remote agent (the mechanics are in our VICIdial AI integration guide), no Twilio relay in the path, sub-500ms response, and warm transfer with full context into the team's existing setup. If you are weighing platforms on this, our rundown of the best voice AI for VICIdial covers who handles the telephony well and who only demos well. The AI is the easy part. The phone call is the part we actually engineered around.
FAQ
Why does my voice AI work in testing but fail on real phone calls? Because testing happens on a clean, short, direct path and real calls run over carriers, codecs, and a dialer. Latency stacks up across the network hops, the audio is compressed to narrowband, and the call includes voicemail, interruptions, and handoffs your test never had. The model is fine; the telephony around it is what breaks.
What is an acceptable response latency for a voice agent? Callers perceive pauses as unnatural at around 1.5 seconds and start dropping off past 3 seconds. Many production stacks sit at 6 to 8 seconds because of how many separate services and networks the call passes through. Getting under the 1.5-second mark is an architecture problem, not a prompt problem.
Why does speech-to-text get worse on phone calls? The public phone network compresses audio into a narrowband codec and discards most of the frequency range before your model receives it. That removes the detail that distinguishes similar sounds, like an F from an S, which is why spelled-out emails and digit strings get mangled on real calls but not in testing.
What is answering machine detection and why does it matter? Answering machine detection (AMD) decides whether a human or a voicemail picked up. Without it, the agent qualifies a voicemail greeting and wastes the call. Transcription-based detection is more accurate than the silence-based heuristics most providers ship, and it should end the call within a few seconds.
How should a warm transfer work? The AI should pass the human a summary of the conversation at the moment of transfer (who the caller is, what they want, and where they are in the script) so the human continues the call instead of starting over. A transfer that drops the context is where most automated systems lose the lead.
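As a rough sketch, the context handed over on transfer can be as small as the record below. The field names and the JSON format are assumptions for illustration only; how the payload actually reaches the human agent (a screen pop, a CRM note, a dialer field) depends on your dialer and is not shown.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TransferContext:
    """Hypothetical record passed to the human at the moment of transfer."""
    caller_name: str     # who the caller is
    caller_number: str
    intent: str          # what they want
    script_stage: str    # where they are in the script
    summary: str         # short recap of the conversation so far


def build_transfer_payload(ctx: TransferContext) -> str:
    """Serialise the context so it can travel alongside the transferred call."""
    return json.dumps(asdict(ctx))
```

The useful test is whether the human can start speaking from the summary alone, without asking the caller to repeat anything.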
Does this work with our existing dialer? It should. The right setup registers into the dialer you already run, such as VICIdial, over SIP as a remote agent, so the AI handles the front of the call and warm-transfers qualified leads to your team without replacing anything. On pricing, the headline rate matters less than the all-in cost of running it, which we break down in what voice AI really costs per minute.
See it on a real call
The difference between a demo and production is the whole game. If you run a dialer, the fastest way to judge a voice agent is to put it on your actual calls and listen to the transcripts.