Voice AI Cost Per Minute: What You Actually Pay in 2026

TL;DR

Voice AI platforms advertise rates of $0.05-0.11/min. Once you add speech-to-text, a language model, text-to-speech, and telephony, the real cost on those platforms is $0.13-0.24/min. This guide breaks down the actual component costs, shows what you're really paying on each major platform, and compares it to human agent costs. If you're running a call center or BPO, the all-in number is what matters -- not the headline rate.

The Pricing Problem

Every voice AI platform has a pricing page. And every pricing page is designed to make the headline number look as low as possible.

VAPI says $0.05/min. Retell says $0.07/min. Bland says $0.11/min.

None of those numbers are what you actually pay.

Voice AI requires four components to work: speech-to-text (STT) to understand the caller, a language model (LLM) to decide what to say, text-to-speech (TTS) to say it, and telephony to carry the audio. Most platforms charge separately for each component, or bundle some while hiding others.

This guide breaks down what each component costs, what each platform charges when you add it all up, and what the real per-minute cost is for running voice AI at BPO scale.

The Four Components of Voice AI Cost

1. Speech-to-Text (STT): $0.002-0.009/min

STT converts the caller's voice into text. Costs have dropped sharply as competition has increased.

Provider	Cost/Min	Notes
Deepgram Nova-3	$0.0077	Industry standard for voice AI. Good accuracy, low latency.
Cartesia Ink	$0.0022	Newest entrant, aggressive pricing.
OpenAI Whisper (API)	$0.006	Good accuracy but higher latency than Deepgram.
Google Cloud STT	$0.006-0.009	Enterprise tier.
AWS Transcribe	$0.006	Amazon's offering.
AssemblyAI	$0.0025-0.005	Good for post-call transcription.

Bottom line: STT is one of the two cheapest components, alongside the raw LLM. At under $0.01/min, it's a rounding error in total cost.

2. Language Model (LLM): $0.0001-0.005/min

The LLM decides what the AI should say based on the transcript, script, and context. Raw API costs are very low -- platforms mark them up considerably.

Model	Raw API Cost/Min (est.)	Notes
GPT-4o-mini	~$0.0005	Cheapest option with good quality.
GPT-4.1	~$0.002	Better reasoning, higher cost.
GPT-4o	~$0.003-0.005	Good balance.
Claude Sonnet 4.6	~$0.004	Strong reasoning.
Groq (Llama 3)	~$0.0001-0.001	Very fast inference, lowest cost.

Bottom line: Raw LLM costs for voice AI are under $0.01/min for any model. When a platform charges $0.02-0.05/min for "LLM costs," most of that is markup.

3. Text-to-Speech (TTS): $0.01-0.25/min

TTS converts the AI's text response into spoken audio. This is the most expensive and variable component.

Provider	Cost/Min	Notes
Deepgram Aura-2	$0.036	Low cost, decent quality.
OpenAI TTS	$0.015-0.036	Multiple tiers.
Cartesia Sonic	$0.044	Low latency, good for real-time.
PlayHT	$0.06	Mid-range.
ElevenLabs	$0.12-0.25	Highest quality, highest cost. Voice cloning available.

Bottom line: TTS cost varies roughly 7x between Deepgram Aura-2 ($0.036/min) and ElevenLabs' top rate ($0.25/min). ElevenLabs sounds the best but costs roughly 3-7x more than Cartesia or Deepgram. For BPO pre-qualification where the AI needs to sound natural but doesn't need a celebrity-quality voice, lower-cost providers like Cartesia or Deepgram are the right pick.

4. Telephony: $0.005-0.02/min (or ~$0.002 with a direct SIP bridge)

Telephony gets audio from the phone call to the AI and back. This is where the hidden costs live.

Twilio (what most platforms use):

Component	Cost/Min
Outbound voice (US local)	$0.014
Media Streams (WebSocket relay)	$0.004
Call recording	$0.0025
Total	$0.0205

Telnyx (Twilio alternative):

~$0.005-0.010/min for voice
Claims "up to 70% cheaper than Twilio"

Direct SIP Bridge (no carrier middleman):

~$0.002/min (VM infrastructure cost only)
No per-minute telephony charges

Bottom line: Twilio adds $0.02/min to every call. That's 35-50% of total COGS for the cheapest AI stacks. Eliminating Twilio is the single biggest cost reduction available.
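To see how the Twilio line items add up and how large a share they take, here is a minimal sketch. The Twilio figures come from the table above. The "lean stack" is an illustrative assumption (Deepgram Nova-3 STT, Groq at its upper estimate, OpenAI TTS at its lowest listed tier), and the exact share moves a lot depending on which stack you plug in.

```python
# Twilio per-minute line items from the table above (USD).
TWILIO_ITEMS = {
    "outbound_voice_us_local": 0.014,
    "media_streams": 0.004,
    "call_recording": 0.0025,
}
twilio_per_min = sum(TWILIO_ITEMS.values())


def telephony_share(telephony: float, ai_components: float) -> float:
    """Return telephony's fraction of the total per-minute cost."""
    return telephony / (telephony + ai_components)


# Assumed lean stack (illustrative, not a vendor quote):
# Deepgram Nova-3 STT + Groq LLM (upper estimate) + OpenAI TTS (lowest tier).
lean_ai_stack = 0.0077 + 0.001 + 0.015
share = telephony_share(twilio_per_min, lean_ai_stack)

print(f"Twilio total: ${twilio_per_min:.4f}/min")
print(f"Share of lean stack: {share:.0%}")
```

Swap in your own STT, LLM, and TTS rates for `lean_ai_stack` to see how the share changes for your stack.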

Real Platform Pricing (What You Actually Pay)

VAPI

Advertised: $0.05/min (platform fee only)

What they don't include: STT, LLM, TTS, and telephony are all extra. You pick providers and pay their rates on top of VAPI's platform fee.

Component	Cost/Min
VAPI platform fee	$0.05
STT (Deepgram)	~$0.008
LLM (GPT-4o-mini)	~$0.02
TTS (ElevenLabs)	~$0.12
Telephony (Twilio)	~$0.02
Real total	$0.22/min

With cheaper TTS (Cartesia instead of ElevenLabs): ~$0.14/min. Still nearly 3x the advertised rate.
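The arithmetic behind both VAPI totals can be sketched as below. The rates are the approximate figures from the table above and the Cartesia Sonic rate from the TTS table. The dictionary keys and helper names are made up for illustration and are not VAPI settings.

```python
ADVERTISED = 0.05  # VAPI headline rate, USD/min

# Approximate per-minute rates from the VAPI table above.
vapi_stack = {
    "platform_fee": 0.05,
    "stt_deepgram": 0.008,
    "llm_gpt_4o_mini": 0.02,
    "tts_elevenlabs": 0.12,
    "telephony_twilio": 0.02,
}


def all_in(stack: dict[str, float]) -> float:
    """Sum every per-minute component in the stack."""
    return sum(stack.values())


def swap_tts(stack: dict[str, float], name: str, rate: float) -> dict[str, float]:
    """Return a copy of the stack with its TTS entry replaced."""
    swapped = {k: v for k, v in stack.items() if not k.startswith("tts_")}
    swapped[name] = rate
    return swapped


cartesia_stack = swap_tts(vapi_stack, "tts_cartesia_sonic", 0.044)

for label, stack in (("ElevenLabs", vapi_stack), ("Cartesia", cartesia_stack)):
    total = all_in(stack)
    print(f"{label}: ${total:.3f}/min ({total / ADVERTISED:.1f}x advertised)")
```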

Retell AI

Advertised: $0.07/min (infrastructure only)

Component	Cost/Min
Retell infra fee	$0.055
TTS	~$0.015
LLM (GPT-4.1)	~$0.045
Telephony	~$0.015
Real total	$0.13/min

Retell's pricing is more honest than VAPI's, but the LLM costs are marked up compared to raw API rates. Their published "real cost" estimate of $0.13/min is roughly accurate for their default stack.

Bland AI

Advertised: $0.11/min (recently raised from $0.09)

Component	Cost
Bland base rate	$0.11-0.14
TTS character fees	~$0.02/set
Per-call minimum	$0.015/call
Monthly plan fee	$299-499/mo
Real total	$0.13-0.18/min + monthly fee

Bland is closer to all-inclusive than VAPI, but the monthly plan fee and per-call minimums add up. At low volumes, the effective per-minute rate can be much higher than the headline.
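A rough sketch of how a monthly plan fee inflates the effective rate at low volume. It assumes the $0.11 base rate pairs with the $299 plan (the source doesn't say which plan goes with which rate) and it ignores TTS character fees and per-call minimums.

```python
import math


def effective_rate(base_rate: float, monthly_fee: float, monthly_minutes: int) -> float:
    """Per-minute rate once the fixed monthly fee is spread over usage."""
    return base_rate + monthly_fee / monthly_minutes


def minutes_for_overhead(monthly_fee: float, max_overhead_per_min: float) -> int:
    """Minutes per month needed before the fee adds at most the given overhead."""
    return math.ceil(monthly_fee / max_overhead_per_min)


# Assumed pairing: $0.11/min base rate with the $299/mo plan.
for minutes in (1_000, 5_000, 50_000):
    rate = effective_rate(0.11, 299, minutes)
    print(f"{minutes:>6} min/mo -> ${rate:.3f}/min")

print("Minutes needed to keep plan overhead at or under $0.01/min:",
      minutes_for_overhead(299, 0.01))
```

Under these assumptions the plan fee dominates at 1,000 minutes a month and nearly disappears at BPO volumes.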

Synthflow

Advertised: $0.09/min (voice engine only)

Component	Cost/Min
Synthflow voice engine	$0.09
LLM	~$0.02-0.05
Telephony	~$0.02
Optional add-ons	~$0.04 each
Real total	$0.15-0.24/min

Klariqo

Advertised: $0.10-0.15/min

Component	Cost/Min
Everything (STT + LLM + TTS + telephony + recording + compliance)	$0.10-0.15
Real total	$0.10-0.15/min

The advertised rate IS the real rate. No add-ons, no per-call fees, no monthly platform fee. Volume pricing: $0.15 at 1-4K minutes, $0.12 at 4-10K, $0.10 at 10K+.
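The volume tiers can be written as a small lookup. This sketch makes three assumptions the pricing line above doesn't settle: exactly 4K and 10K minutes fall into the cheaper tier, the tier rate applies to every minute in the month rather than marginally, and anything under 1K minutes is billed at $0.15.

```python
def klariqo_rate(monthly_minutes: int) -> float:
    """Per-minute rate for a month's volume (tier boundaries are assumed)."""
    if monthly_minutes >= 10_000:
        return 0.10
    if monthly_minutes >= 4_000:
        return 0.12
    return 0.15


def monthly_bill(monthly_minutes: int) -> float:
    """Whole-volume billing: one tier rate applied to all minutes (assumed)."""
    return round(monthly_minutes * klariqo_rate(monthly_minutes), 2)


for minutes in (2_000, 6_000, 50_000):
    print(f"{minutes:>6} min -> ${klariqo_rate(minutes):.2f}/min, ${monthly_bill(minutes):,.2f}/mo")
```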

Side-by-Side: Advertised vs Real

Platform	Advertised	Real Cost	Markup
VAPI	$0.05	$0.14-0.22	2.8-4.4x
Retell AI	$0.07	$0.13-0.19	1.9-2.7x
Bland AI	$0.11	$0.13-0.18 + plan	1.2-1.6x + monthly
Synthflow	$0.09	$0.15-0.24	1.7-2.7x
Klariqo	$0.10-0.15	$0.10-0.15	1.0x
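The Markup column is just real cost divided by advertised rate. A quick sketch using the figures from the table above (Klariqo is left out because its advertised rate is itself a range):

```python
# platform: (advertised, real_low, real_high) in USD/min, from the table above.
PRICING = {
    "VAPI": (0.05, 0.14, 0.22),
    "Retell AI": (0.07, 0.13, 0.19),
    "Bland AI": (0.11, 0.13, 0.18),
    "Synthflow": (0.09, 0.15, 0.24),
}


def markup_range(advertised: float, real_low: float, real_high: float) -> tuple[float, float]:
    """Real cost as a multiple of the advertised rate, rounded to one decimal."""
    return round(real_low / advertised, 1), round(real_high / advertised, 1)


for platform, (advertised, low, high) in PRICING.items():
    lo_x, hi_x = markup_range(advertised, low, high)
    print(f"{platform}: {lo_x}-{hi_x}x")
```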

What This Means at Scale

For a BPO processing 50,000 outbound minutes per month:

Platform	Monthly Cost	Annual Cost
VAPI (mid-range stack)	$8,500	$102,000
Retell AI	$7,500	$90,000
Bland AI (Enterprise plan)	$7,000 + $499/mo	$89,988
Synthflow	$9,000	$108,000
Klariqo (at $0.10/min)	$5,000	$60,000
Human agents (offshore)	$15,000	$180,000
Human agents (US)	$40,000	$480,000

The spread between the cheapest voice AI ($5,000/mo) and offshore human agents ($15,000/mo) is $10,000/month. That's $120,000/year in savings for a single operation.
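The annual figures and the per-minute rate each row implies can be checked with a few lines. The monthly costs are taken from the table above; the function names are illustrative.

```python
MONTHLY_MINUTES = 50_000

# option: (monthly usage cost, fixed monthly plan fee) in USD, from the table above.
MONTHLY_COSTS = {
    "VAPI (mid-range stack)": (8_500, 0),
    "Retell AI": (7_500, 0),
    "Bland AI (Enterprise plan)": (7_000, 499),
    "Synthflow": (9_000, 0),
    "Klariqo": (5_000, 0),
    "Human agents (offshore)": (15_000, 0),
    "Human agents (US)": (40_000, 0),
}


def annual_cost(usage: int, plan_fee: int = 0) -> int:
    """Twelve months of usage plus any fixed plan fee."""
    return 12 * (usage + plan_fee)


def implied_rate(usage: int) -> float:
    """Per-minute rate implied by the monthly usage cost."""
    return usage / MONTHLY_MINUTES


for option, (usage, fee) in MONTHLY_COSTS.items():
    print(f"{option}: ${implied_rate(usage):.2f}/min, ${annual_cost(usage, fee):,}/yr")

annual_savings = annual_cost(15_000) - annual_cost(5_000)
print(f"Offshore agents vs cheapest AI: ${annual_savings:,}/yr")
```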

Why Platform Pricing Is So Confusing

It's not an accident. Here's how each pricing model works:

Component-based pricing (VAPI, Synthflow): Show the platform fee as the headline number. Let the user discover STT, LLM, TTS, and telephony costs after they've already committed to the platform. By the time they see the full bill, switching costs are high.

Infrastructure-plus-AI pricing (Retell): Bundle some components, break out others. The "infrastructure" fee sounds all-inclusive but doesn't include the LLM, which at Retell's billed rate (~$0.045/min) is the second largest line item.

Base-plus-add-on pricing (Bland): Show a simple per-minute rate but add per-call minimums, character fees for TTS, and monthly plan fees. The per-minute rate is technically accurate but incomplete.

All-inclusive pricing (Klariqo): One number. STT, LLM, TTS, telephony, recording, compliance disclosure -- everything included. The number you see on the pricing page is the number on the invoice.

Comparing to Human Agents

The real comparison for most BPOs isn't "which AI platform" but "AI vs our current floor."

	US Agent	Offshore Agent	AI
Base cost/min	$0.48-0.70	$0.13-0.27	$0.10-0.15
Fully loaded/min	$0.88-1.50	$0.25-0.45	$0.10-0.15
Productive time	55-66%	55-66%	100%
Effective cost/productive min	$1.33-2.73	$0.38-0.82	$0.10-0.15
Training time	2-4 weeks	2-4 weeks	Upload script
Annual turnover	30-40%	30-40%	0%
Turnover replacement cost	$10,000-13,000/agent	$3,000-5,000/agent	$0
Works 24/7	No (shifts)	No (shifts)	Yes

The "effective cost per productive minute" row is the one that matters. US agents cost $1.33-2.73 per productive minute (because you're paying for breaks, training, idle time, and turnover). Offshore agents cost $0.38-0.82. AI costs $0.10-0.15 with 100% productive time.
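The effective-cost row is the fully loaded cost divided by the productive share of time. A minimal sketch, pairing the lowest loaded cost with the highest productivity for the best case and the reverse for the worst case:

```python
def effective_cost_range(
    loaded_low: float,
    loaded_high: float,
    productive_low: float,
    productive_high: float,
) -> tuple[float, float]:
    """Cost per productive minute as a (best case, worst case) pair."""
    best = loaded_low / productive_high
    worst = loaded_high / productive_low
    return round(best, 2), round(worst, 2)


# Fully loaded cost/min and productive-time ranges from the table above.
us_agent = effective_cost_range(0.88, 1.50, 0.55, 0.66)
offshore_agent = effective_cost_range(0.25, 0.45, 0.55, 0.66)
ai = effective_cost_range(0.10, 0.15, 1.0, 1.0)

print("US agent:", us_agent)
print("Offshore agent:", offshore_agent)
print("AI:", ai)
```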

How to Calculate Your Real Cost

If you're evaluating voice AI platforms, here's the formula:

Per minute cost = Platform fee + STT + LLM + TTS + Telephony + Recording

Ask every vendor:

What's your platform fee per minute?
What STT provider and cost?
What LLM and cost? (Ask for the ACTUAL model, not just "AI processing")
What TTS provider and cost?
What telephony provider? Is Twilio included or extra?
Is call recording included or extra?
Are there per-call minimums?
Are there monthly platform fees on top of per-minute rates?

If they can't give you a straight per-minute all-in number, they're hiding something.
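Once a vendor answers those questions, the formula can be extended to cover per-call minimums and monthly fees. The sketch below is one way to do that, not any vendor's billing logic. The `VendorQuote` class and its field names are hypothetical, and it assumes a per-call minimum works as a floor on each call's charge.

```python
from dataclasses import dataclass


@dataclass
class VendorQuote:
    """Answers to the vendor questions above, in USD (hypothetical structure)."""
    platform_fee: float
    stt: float = 0.0
    llm: float = 0.0
    tts: float = 0.0
    telephony: float = 0.0
    recording: float = 0.0
    per_call_minimum: float = 0.0  # assumed: floor on each call's charge
    monthly_fee: float = 0.0

    def per_minute(self) -> float:
        """Platform fee + STT + LLM + TTS + Telephony + Recording."""
        return (self.platform_fee + self.stt + self.llm
                + self.tts + self.telephony + self.recording)

    def effective_per_minute(self, monthly_minutes: float, avg_call_minutes: float) -> float:
        """All-in rate after per-call minimums and the monthly fee."""
        calls = monthly_minutes / avg_call_minutes
        per_call = max(self.per_minute() * avg_call_minutes, self.per_call_minimum)
        return (per_call * calls + self.monthly_fee) / monthly_minutes


# Example: component-priced quote with a $500/mo fee (illustrative numbers).
quote = VendorQuote(platform_fee=0.05, stt=0.008, llm=0.02, tts=0.044,
                    telephony=0.02, monthly_fee=500)
print(f"Component sum: ${quote.per_minute():.3f}/min")
print(f"Effective at 50K min/mo: ${quote.effective_per_minute(50_000, 2.0):.3f}/min")
```

Running the same quote at a few different volumes and call lengths shows quickly whether fixed fees or minimums matter for your traffic.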

FAQ

Why is ElevenLabs TTS so much more expensive? ElevenLabs offers the highest quality voice synthesis with voice cloning capabilities. For consumer-facing applications where voice quality is the product (audiobooks, content creation), it's worth it. For BPO pre-qualification where the AI needs to sound natural but not cinematic, Deepgram at $0.036/min or Cartesia at $0.044/min delivers good enough quality at a fraction of the cost.

Does cheaper TTS sound worse? Not in a way most callers will notice. Cartesia's and Deepgram's latest models sound natural in phone conversations. The difference is noticeable in a side-by-side comparison on speakers, but over a phone line with standard audio quality, callers are unlikely to tell the difference.

Why doesn't VAPI just include everything in one price? VAPI is a developer platform. Their users want to choose their own providers and optimize each component. This makes sense for developers building custom applications. It doesn't make sense for BPO operators who just want to know "what does it cost per minute?"

How much does AI cost compared to an IVR? Traditional IVR (press 1, press 2) costs about $0.03-0.05/min for the technology, but IVR can't have a conversation. It can't qualify leads, handle objections, or warm-transfer. Comparing AI to IVR is comparing a qualification agent to a phone tree -- different tools for different jobs.

What's the cheapest possible voice AI stack? One low-cost combination: Groq (Llama 3) for LLM (~$0.001/min) + Deepgram for STT ($0.008/min) + Deepgram Aura for TTS ($0.036/min) + direct SIP bridge ($0.002/min) = roughly $0.05/min in raw component costs. Swapping in the cheapest STT and TTS rates from the component tables would lower that further. But you still need orchestration, error handling, compliance, recording, and infrastructure. The raw component cost is the floor, not the price.

Last updated: March 22, 2026
By Ansh Deb, Founder & CEO of Klariqo. We run voice AI for BPOs at $0.10-0.15/min, all-in. No hidden components. Questions? [email protected]