TL;DR
Voice AI platforms advertise rates from $0.05-0.11/min. The real cost after adding speech-to-text, language model, text-to-speech, and telephony is $0.13-0.25/min. This guide breaks down the actual component costs for every major platform, shows what you're really paying, and compares it to human agent costs. If you're running a call center or BPO, the all-in number is what matters -- not the headline rate.
The Pricing Problem
Every voice AI platform has a pricing page. And every pricing page is designed to make the headline number look as low as possible.
VAPI says $0.05/min. Retell says $0.07/min. Bland says $0.11/min.
None of those numbers are what you actually pay.
Voice AI requires four components to work: speech-to-text (STT) to understand the caller, a language model (LLM) to decide what to say, text-to-speech (TTS) to say it, and telephony to carry the audio. Most platforms charge separately for each component, or bundle some while hiding others.
This guide breaks down what each component costs, what each platform charges when you add it all up, and what the real per-minute cost is for running voice AI at BPO scale.
The Four Components of Voice AI Cost
1. Speech-to-Text (STT): $0.003-0.008/min
STT converts the caller's voice into text. Costs have dropped sharply as competition has increased.
| Provider | Cost/Min | Notes |
|---|---|---|
| Deepgram Nova-3 | $0.0077 | Industry standard for voice AI. Good accuracy, low latency. |
| Cartesia Ink | $0.0022 | Newest entrant, aggressive pricing. |
| OpenAI Whisper (API) | $0.006 | Good accuracy but higher latency than Deepgram. |
| Google Cloud STT | $0.006-0.009 | Enterprise tier. |
| AWS Transcribe | $0.006 | Amazon's offering. |
| AssemblyAI | $0.0025-0.005 | Good for post-call transcription. |
Bottom line: STT is one of the cheapest components. At under $0.01/min for every provider in the table, it's a rounding error in total cost.
2. Language Model (LLM): $0.001-0.006/min
The LLM decides what the AI should say based on the transcript, script, and context. Raw API costs are very low -- platforms mark them up considerably.
| Model | Raw API Cost/Min (est.) | Notes |
|---|---|---|
| GPT-4o-mini | ~$0.0005 | Cheapest option with good quality. |
| GPT-4.1 | ~$0.002 | Better reasoning, higher cost. |
| GPT-4o | ~$0.003-0.005 | Good balance. |
| Claude Sonnet 4.6 | ~$0.004 | Strong reasoning. |
| Groq (Llama 3) | ~$0.0001-0.001 | Very fast inference, lowest cost. |
Bottom line: Raw LLM costs for voice AI are under $0.01/min for any model. When a platform charges $0.02-0.05/min for "LLM costs," most of that is markup.
3. Text-to-Speech (TTS): $0.01-0.25/min
TTS converts the AI's text response into spoken audio. This is the most expensive and variable component.
| Provider | Cost/Min | Notes |
|---|---|---|
| Deepgram Aura-2 | $0.036 | Low cost, decent quality. |
| OpenAI TTS | $0.015-0.036 | Multiple tiers. |
| Cartesia Sonic | $0.044 | Low latency, good for real-time. |
| PlayHT | $0.06 | Mid-range. |
| ElevenLabs | $0.12-0.25 | Highest quality, highest cost. Voice cloning available. |
Bottom line: TTS pricing has the widest spread of any component, from $0.015/min for OpenAI's cheapest tier to $0.25/min at the top of ElevenLabs' range. ElevenLabs sounds the best but costs roughly 3-7x more than Deepgram Aura-2 or Cartesia Sonic. For BPO pre-qualification, where the AI needs to sound natural but doesn't need a celebrity-quality voice, lower-cost providers like Cartesia or Deepgram are the right pick.
4. Telephony: $0.01-0.02/min (or $0.00 with SIP bridge)
Getting audio from a phone call to the AI and back. This is where the hidden costs live.
Twilio (what most platforms use):
| Component | Cost/Min |
|---|---|
| Outbound voice (US local) | $0.014 |
| Media Streams (WebSocket relay) | $0.004 |
| Call recording | $0.0025 |
| Total | $0.0205 |
Telnyx (Twilio alternative):
- ~$0.005-0.010/min for voice
- Claims "up to 70% cheaper than Twilio"
Direct SIP Bridge (no carrier middleman):
- ~$0.002/min (VM infrastructure cost only)
- No per-minute telephony charges
Bottom line: Twilio adds $0.02/min to every call. That's 35-50% of total COGS for the cheapest AI stacks. Eliminating Twilio is the single biggest cost reduction available.
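To sanity-check the Twilio math, here is a minimal sketch that sums the line items from the table and shows how telephony's share of total cost moves with the rest of the stack. The two "other components" totals are illustrative assumptions, not figures from any vendor.

```python
# Twilio per-minute line items from the table above (USD per minute).
TWILIO_LINE_ITEMS = {
    "outbound_voice_us_local": 0.014,
    "media_streams": 0.004,
    "call_recording": 0.0025,
}


def telephony_share(telephony: float, other_components: float) -> float:
    """Return telephony's fraction of the total per-minute cost."""
    return telephony / (telephony + other_components)


twilio_total = sum(TWILIO_LINE_ITEMS.values())

# Assumed STT + LLM + TTS totals for a lean and a slightly richer stack.
# These two figures are illustrative, not taken from any vendor invoice.
for label, other in [("lean stack", 0.02), ("richer stack", 0.04)]:
    share = telephony_share(twilio_total, other)
    print(f"{label}: Twilio is {share:.0%} of ${twilio_total + other:.4f}/min")
```

The cheaper the AI components get, the larger the telephony share becomes, which is why carrier cost matters most for lean stacks.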
Real Platform Pricing (What You Actually Pay)
VAPI
Advertised: $0.05/min (platform fee only)
What they don't include: STT, LLM, TTS, and telephony are all extra. You pick providers and pay their rates on top of VAPI's platform fee.
| Component | Cost/Min |
|---|---|
| VAPI platform fee | $0.05 |
| STT (Deepgram) | ~$0.008 |
| LLM (GPT-4o-mini) | ~$0.02 |
| TTS (ElevenLabs) | ~$0.12 |
| Telephony (Twilio) | ~$0.02 |
| Real total | $0.22/min |
With cheaper TTS (Cartesia instead of ElevenLabs): ~$0.14/min. Still nearly 3x the advertised rate.
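The VAPI totals above can be reproduced with a few lines of Python. This is a rough sketch using the approximate rates from the table; the dictionary keys and the `with_tts` helper are names made up for this example.

```python
# Per-minute line items from the VAPI table above (USD per minute, approximate).
VAPI_STACK = {
    "platform_fee": 0.05,
    "stt_deepgram": 0.008,
    "llm_gpt_4o_mini": 0.02,
    "tts_elevenlabs": 0.12,
    "telephony_twilio": 0.02,
}
ADVERTISED = 0.05
CARTESIA_TTS = 0.044  # Cartesia Sonic rate from the TTS table


def all_in(stack: dict) -> float:
    """Sum every per-minute component in the stack."""
    return sum(stack.values())


def with_tts(stack: dict, tts_rate: float) -> dict:
    """Return a copy of the stack with its TTS line replaced."""
    swapped = {k: v for k, v in stack.items() if not k.startswith("tts_")}
    swapped["tts_replacement"] = tts_rate
    return swapped


elevenlabs_total = all_in(VAPI_STACK)
cartesia_total = all_in(with_tts(VAPI_STACK, CARTESIA_TTS))

print(f"ElevenLabs stack: ${elevenlabs_total:.3f}/min")
print(f"Cartesia stack:   ${cartesia_total:.3f}/min "
      f"({cartesia_total / ADVERTISED:.1f}x advertised)")
```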
Retell AI
Advertised: $0.07/min (infrastructure only)
| Component | Cost/Min |
|---|---|
| Retell infra fee | $0.055 |
| TTS | ~$0.015 |
| LLM (GPT-4.1) | ~$0.045 |
| Telephony | ~$0.015 |
| Real total | $0.13/min |
Retell's pricing is more honest than VAPI's, but the LLM costs are marked up compared to raw API rates. Their published "real cost" estimate of $0.13/min is roughly accurate for their default stack.
Bland AI
Advertised: $0.11/min (recently raised from $0.09)
| Component | Cost |
|---|---|
| Bland base rate | $0.11-0.14 |
| TTS character fees | ~$0.02/set |
| Per-call minimum | $0.015/call |
| Monthly plan fee | $299-499/mo |
| Real total | $0.13-0.18/min + monthly fee |
Bland is closer to all-inclusive than VAPI, but the monthly plan fee and per-call minimums add up. At low volumes, the effective per-minute rate can be much higher than the headline.
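To see how much a monthly plan fee moves the effective rate, here is a small sketch. It assumes the $0.11 base rate and the $299 plan from the table, ignores character fees and per-call minimums, and uses volumes picked purely for illustration.

```python
def effective_rate(base_rate: float, monthly_fee: float, minutes: float) -> float:
    """Per-minute cost once a flat monthly fee is spread over actual usage."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return base_rate + monthly_fee / minutes


BASE_RATE = 0.11   # low end of the base rate in the table above
PLAN_FEE = 299     # low end of the monthly plan fee in the table above

# Illustrative monthly volumes, not figures from the source.
for minutes in (1_000, 5_000, 50_000):
    rate = effective_rate(BASE_RATE, PLAN_FEE, minutes)
    print(f"{minutes:>6} min/mo -> ${rate:.3f}/min")
```

At 1,000 minutes a month the plan fee alone adds about $0.30/min; at 50,000 minutes it adds well under a cent.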
Synthflow
Advertised: $0.09/min (voice engine only)
| Component | Cost/Min |
|---|---|
| Synthflow voice engine | $0.09 |
| LLM | ~$0.02-0.05 |
| Telephony | ~$0.02 |
| Optional add-ons | ~$0.04 each |
| Real total | $0.15-0.24/min |
Klariqo
Advertised: $0.10-0.15/min
| Component | Cost/Min |
|---|---|
| Everything (STT + LLM + TTS + telephony + recording + compliance) | $0.10-0.15 |
| Real total | $0.10-0.15/min |
The advertised rate IS the real rate. No add-ons, no per-call fees, no monthly platform fee. Volume pricing: $0.15 at 1-4K minutes, $0.12 at 4-10K, $0.10 at 10K+.
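The volume tiers can be expressed as a simple lookup. This sketch makes two assumptions the pricing line doesn't spell out: a volume that sits exactly on a boundary (4K or 10K) gets the lower rate, and the tier rate applies to every minute in the month rather than only the minutes above the threshold.

```python
def klariqo_rate(minutes: int) -> float:
    """Per-minute rate for a monthly volume, per the tiers quoted above."""
    if minutes < 1_000:
        # The quoted tiers start at 1K minutes; smaller volumes are not covered.
        raise ValueError("the quoted tiers start at 1,000 minutes")
    if minutes < 4_000:
        return 0.15
    if minutes < 10_000:   # assumption: exactly 4,000 falls in this tier
        return 0.12
    return 0.10            # assumption: exactly 10,000 falls in this tier


def monthly_bill(minutes: int) -> float:
    """Assumes the tier rate applies to all minutes, not just the marginal ones."""
    return minutes * klariqo_rate(minutes)


for volume in (2_000, 6_000, 50_000):
    print(f"{volume:>6} min -> ${klariqo_rate(volume):.2f}/min, "
          f"${monthly_bill(volume):,.0f}/mo")
```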
Side-by-Side: Advertised vs Real
| Platform | Advertised | Real Cost | Markup |
|---|---|---|---|
| VAPI | $0.05 | $0.14-0.22 | 2.8-4.4x |
| Retell AI | $0.07 | $0.13-0.19 | 1.9-2.7x |
| Bland AI | $0.11 | $0.13-0.18 + plan | 1.2-1.6x + monthly |
| Synthflow | $0.09 | $0.15-0.24 | 1.7-2.7x |
| Klariqo | $0.10-0.15 | $0.10-0.15 | 1.0x |
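The Markup column is just the real cost divided by the advertised rate. A quick sketch to reproduce it, using the numbers from the table (Klariqo is left out because its advertised and real figures are the same range):

```python
# (advertised, real_low, real_high) in USD per minute, from the table above.
PLATFORMS = {
    "VAPI": (0.05, 0.14, 0.22),
    "Retell AI": (0.07, 0.13, 0.19),
    "Bland AI": (0.11, 0.13, 0.18),   # excludes the monthly plan fee
    "Synthflow": (0.09, 0.15, 0.24),
}


def markup_range(advertised: float, real_low: float, real_high: float):
    """Return (low, high) multiples of the advertised rate, rounded to 0.1x."""
    return round(real_low / advertised, 1), round(real_high / advertised, 1)


for name, figures in PLATFORMS.items():
    low, high = markup_range(*figures)
    print(f"{name:<10} {low}x-{high}x")
```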
What This Means at Scale
For a BPO processing 50,000 outbound minutes per month:
| Platform | Monthly Cost | Annual Cost |
|---|---|---|
| VAPI (mid-range stack) | $8,500 | $102,000 |
| Retell AI | $7,500 | $90,000 |
| Bland AI (Enterprise plan) | $7,000 + $499/mo | $89,988 |
| Synthflow | $9,000 | $108,000 |
| Klariqo (at $0.10/min) | $5,000 | $60,000 |
| Human agents (offshore) | $15,000 | $180,000 |
| Human agents (US) | $40,000 | $480,000 |
The spread between the cheapest voice AI ($5,000/mo) and offshore human agents ($15,000/mo) is $10,000/month. That's $120,000/year in savings for a single operation.
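Each monthly figure in the table implies a blended per-minute rate at 50,000 minutes. The sketch below backs that rate out and recomputes the annual totals; the helper names are made up for this example.

```python
MINUTES_PER_MONTH = 50_000

# (monthly usage cost, flat monthly fee) in USD, from the table above.
OPTIONS = {
    "VAPI (mid-range stack)": (8_500, 0),
    "Retell AI": (7_500, 0),
    "Bland AI (Enterprise plan)": (7_000, 499),
    "Synthflow": (9_000, 0),
    "Klariqo": (5_000, 0),
    "Human agents (offshore)": (15_000, 0),
    "Human agents (US)": (40_000, 0),
}


def annual_cost(usage: float, flat_fee: float = 0) -> float:
    """Twelve months of usage plus any flat monthly fee."""
    return 12 * (usage + flat_fee)


def implied_rate(usage: float, minutes: int = MINUTES_PER_MONTH) -> float:
    """Per-minute rate implied by a monthly usage bill."""
    return usage / minutes


for name, (usage, fee) in OPTIONS.items():
    print(f"{name:<28} ${implied_rate(usage):.2f}/min  "
          f"${annual_cost(usage, fee):>9,.0f}/yr")

savings = annual_cost(15_000) - annual_cost(5_000)
print(f"Offshore agents vs cheapest AI: ${savings:,.0f}/yr")
```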
Why Platform Pricing Is So Confusing
It's not an accident. Here's how each cost obfuscation works:
Component-based pricing (VAPI, Synthflow): Show the platform fee as the headline number. Let the user discover STT, LLM, TTS, and telephony costs after they've already committed to the platform. By the time they see the full bill, switching costs are high.
Infrastructure-plus-AI pricing (Retell): Bundle some components, break out others. The "infrastructure" fee sounds all-inclusive but doesn't include the LLM, which at ~$0.045/min is the second-largest line item in Retell's stack.
Base-plus-add-on pricing (Bland): Show a simple per-minute rate but add per-call minimums, character fees for TTS, and monthly plan fees. The per-minute rate is technically accurate but incomplete.
All-inclusive pricing (Klariqo): One number. STT, LLM, TTS, telephony, recording, compliance disclosure -- everything included. The number you see on the pricing page is the number on the invoice.
Comparing to Human Agents
The real comparison for most BPOs isn't "which AI platform" but "AI vs our current floor."
| | US Agent | Offshore Agent | AI |
|---|---|---|---|
| Base cost/min | $0.48-0.70 | $0.13-0.27 | $0.10-0.15 |
| Fully loaded/min | $0.88-1.50 | $0.25-0.45 | $0.10-0.15 |
| Productive time | 55-66% | 55-66% | 100% |
| Effective cost/productive min | $1.33-2.73 | $0.38-0.82 | $0.10-0.15 |
| Training time | 2-4 weeks | 2-4 weeks | Upload script |
| Annual turnover | 30-40% | 30-40% | 0% |
| Turnover replacement cost | $10,000-13,000/agent | $3,000-5,000/agent | $0 |
| Works 24/7 | No (shifts) | No (shifts) | Yes |
The "effective cost per productive minute" row is the one that matters. US agents cost $1.33-2.73 per productive minute (because you're paying for breaks, training, idle time, and turnover). Offshore agents cost $0.38-0.82. AI costs $0.10-0.15 with 100% productive time.
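The effective-cost row is the fully loaded cost divided by the share of time an agent is actually productive. A minimal sketch of that arithmetic, pairing the lowest cost with the highest productivity for the best case and the reverse for the worst case:

```python
def effective_cost_range(loaded_low, loaded_high, productive_low, productive_high):
    """Return (best, worst) cost per productive minute, rounded to cents."""
    best = loaded_low / productive_high    # cheapest agent, most productive
    worst = loaded_high / productive_low   # priciest agent, least productive
    return round(best, 2), round(worst, 2)


# (fully loaded low, fully loaded high, productive low, productive high)
# taken from the table above; productive time is expressed as a fraction.
AGENTS = {
    "US agent": (0.88, 1.50, 0.55, 0.66),
    "Offshore agent": (0.25, 0.45, 0.55, 0.66),
    "AI": (0.10, 0.15, 1.00, 1.00),
}

for name, figures in AGENTS.items():
    best, worst = effective_cost_range(*figures)
    print(f"{name:<15} ${best:.2f}-${worst:.2f} per productive minute")
```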
How to Calculate Your Real Cost
If you're evaluating voice AI platforms, here's the formula:
Per minute cost = Platform fee + STT + LLM + TTS + Telephony + Recording
Ask every vendor:
- What's your platform fee per minute?
- What STT provider and cost?
- What LLM and cost? (Ask for the ACTUAL model, not just "AI processing")
- What TTS provider and cost?
- What telephony provider? Is Twilio included or extra?
- Is call recording included or extra?
- Are there per-call minimums?
- Are there monthly platform fees on top of per-minute rates?
If they can't give you a straight per-minute all-in number, they're hiding something.
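Once a vendor answers those questions, the formula is easy to turn into a small calculator. The sketch below is one way to do it, not a standard tool: `VendorQuote` and its fields are hypothetical names, and per-call minimums are left out because their impact depends on your call-length distribution.

```python
from dataclasses import dataclass


@dataclass
class VendorQuote:
    """Per-minute rates in USD, plus any flat monthly fee."""
    platform_fee: float
    stt: float
    llm: float
    tts: float
    telephony: float
    recording: float = 0.0
    monthly_fee: float = 0.0

    def per_minute(self) -> float:
        """Sum of the per-minute components (the formula above)."""
        return (self.platform_fee + self.stt + self.llm
                + self.tts + self.telephony + self.recording)

    def all_in(self, minutes_per_month: float) -> float:
        """Per-minute cost with the monthly fee spread over actual usage."""
        if minutes_per_month <= 0:
            raise ValueError("minutes_per_month must be positive")
        return self.per_minute() + self.monthly_fee / minutes_per_month


# Hypothetical quote: per-minute figures borrowed from the VAPI example
# with Cartesia TTS. The $299 monthly fee is an assumption added for
# illustration; VAPI's table lists no such fee.
quote = VendorQuote(platform_fee=0.05, stt=0.008, llm=0.02,
                    tts=0.044, telephony=0.02, monthly_fee=299)

for minutes in (1_000, 10_000, 50_000):
    print(f"{minutes:>6} min/mo -> ${quote.all_in(minutes):.3f}/min")
```

Running the same quote at several volumes makes it obvious whether a flat fee is noise or a real part of the bill.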
FAQ
Why is ElevenLabs TTS so much more expensive? ElevenLabs offers the highest quality voice synthesis with voice cloning capabilities. For consumer-facing applications where voice quality is the product (audiobooks, content creation), it's worth it. For BPO pre-qualification where the AI needs to sound natural but not cinematic, Cartesia or Deepgram at $0.03-0.04/min deliver good enough quality at a fraction of the cost.
Does cheaper TTS sound worse? Not in a way callers notice. Cartesia and Deepgram's latest models sound natural in phone conversations. The difference is audible in a side-by-side comparison on speakers, but over a phone line with standard audio quality, callers can't tell the difference.
Why doesn't VAPI just include everything in one price? VAPI is a developer platform. Their users want to choose their own providers and optimize each component. This makes sense for developers building custom applications. It doesn't make sense for BPO operators who just want to know "what does it cost per minute?"
How much does AI cost compared to an IVR? Traditional IVR (press 1, press 2) costs about $0.03-0.05/min for the technology, but IVR can't have a conversation. It can't qualify leads, handle objections, or warm-transfer. Comparing AI to IVR is comparing a qualification agent to a phone tree -- different tools for different jobs.
What's the cheapest possible voice AI stack? Groq (Llama 3) for LLM (~$0.001/min) + Deepgram for STT ($0.008/min) + Deepgram Aura for TTS ($0.036/min) + direct SIP bridge ($0.002/min) = roughly $0.05/min in raw component costs. Swapping in the cheapest STT and TTS rates from the tables above (Cartesia Ink, OpenAI's lowest TTS tier) would push the raw number lower still. But you still need orchestration, error handling, compliance, recording, and infrastructure. The raw component cost is the floor, not the price.
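For reference, the four-component stack named in that answer adds up like this. A quick sketch using the rates quoted there; the dictionary keys are labels made up for this example.

```python
# Raw component rates named in the answer above (USD per minute).
FLOOR_STACK = {
    "llm_groq_llama_3": 0.001,
    "stt_deepgram": 0.008,
    "tts_deepgram_aura": 0.036,
    "sip_bridge": 0.002,
}

raw_floor = sum(FLOOR_STACK.values())
largest = max(FLOOR_STACK, key=FLOOR_STACK.get)

print(f"Raw component floor: ${raw_floor:.3f}/min")
print(f"Largest component: {largest} "
      f"({FLOOR_STACK[largest] / raw_floor:.0%} of the total)")
```

Even in a stripped-down stack, TTS is most of the raw cost, which matches the component breakdown earlier in this guide.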
Last updated: March 22, 2026
By Ansh Deb, Founder & CEO of Klariqo. We run voice AI for BPOs at $0.10-0.15/min, all-in. No hidden components. Questions? [email protected]