Stop Scripting Your AI Voice Agent: Why Briefs Beat Scripts in Production
Most teams building AI voice agents make the same mistake we did: they write scripts.
Not loose guidelines. Full word-for-word scripts. "When the caller says X, respond with exactly Y." Every rebuttal mapped. Every greeting hardcoded. Every objection handler written out in full sentences.
It makes sense on paper. If your best human agents use a script that works, why not give the same script to the AI?
Because the AI isn't reading the script. It's memorizing the script and then trying to recite it in a live conversation. And that's where everything breaks.
We ran this approach for months across SSDI, Final Expense, and ACA outbound campaigns. It worked. Transfer rates were decent. But callers could tell something was off. Not because of the voice quality. Because the responses were too perfect. Too rehearsed. Real humans don't deliver rebuttals word-for-word, even when they're reading from a script.
Last week, we changed one thing. We replaced the scripts with briefs.
Transfer rates nearly doubled.
The Numbers
We track every call through our production pipeline. Here's what happened after the prompt change on March 27, 2026:
Final Expense Outbound:
- Before: ~1.6% transfer rate across 2,215 calls (March 25-27)
- After: ~3.2% transfer rate across 1,409 calls (March 30+)
- Change: 2x improvement
SSDI Outbound:
- Before: ~3% transfer rate across 1,013 calls (March 25-27)
- After: ~5.4% transfer rate across 706 calls (April 2-3)
- Change: 1.8x improvement
These aren't A/B tests on 20 calls. These are thousands of real outbound calls to real people, measured by actual transfers confirmed by the buyer.
What We Actually Changed
The difference looks subtle on paper. In practice, it changed everything about how the AI handles conversations.
The Old Way: Hardcoded Scripts
If they say they already have life insurance: say "Final expense is
different from regular life insurance. Regular life insurance can take
60 to 90 days to pay out, but final expense pays your family within
24 to 48 hours specifically for funeral costs."
If they say money is tight: say "I completely understand. Plans start
at very affordable rates. The goal is to make sure your family isn't
left with thousands in funeral costs."
Every rebuttal was a complete sentence the AI had to deliver verbatim. 15+ explicit handlers. ~800 tokens of scripted phrases.
The New Way: Conceptual Briefs
If they already have life insurance: final expense is different — pays
in 24-48 hours for funeral costs vs 60-90 days for regular life insurance.
If money is tight or on Social Security: plans start very affordable.
Goal is protecting family from thousands in funeral costs.
Same information. Same intent. But now the AI knows what to convey and decides how to say it based on how the conversation is actually going.
Why Scripts Break in Production
When your prompt is 800 tokens of hardcoded instructions, four things go wrong:
1. Instruction Conflicts
"Always end your response with a question" conflicts with "Say exactly: I understand, have a great day." The AI has to choose which instruction to violate. In a 15-rebuttal script, these conflicts multiply fast.
2. Prompt Leaks
With dozens of exact phrases memorized, the AI sometimes blends two different rebuttals into one incoherent response. We call these "prompt leaks": the AI combines the "already have insurance" rebuttal with the "money is tight" rebuttal into a Frankenstein sentence that makes no sense.
3. Rigidity Cascade
When a caller says something the script doesn't cover, the AI freezes or defaults to repeating its last scripted line. Brief-style prompts give the AI principles to reason from, so it can handle situations the prompt author never anticipated.
4. Context Window Waste
800 tokens of scripted phrases leaves less room for actual conversation history. The AI starts forgetting what the caller said three turns ago because the prompt ate up the token budget. Shorter prompts mean more room for the conversation itself.
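The budget math is simple to sketch. This is an illustrative calculation only: the window size, reserve, and the rough 4-characters-per-token estimate are assumptions, not our production values (real systems count tokens with the model's own tokenizer).

```python
# Illustrative context-budget math: a fixed window shared between the
# system prompt, conversation history, and the model's reply.

CONTEXT_WINDOW = 4096   # assumed model context size, in tokens
RESPONSE_RESERVE = 256  # tokens held back for the model's reply

def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def history_budget(system_prompt: str) -> int:
    """Tokens left for conversation history after prompt + reply."""
    return CONTEXT_WINDOW - RESPONSE_RESERVE - estimate_tokens(system_prompt)

scripted_prompt = "x" * 3200  # stand-in for ~800 tokens of scripted rebuttals
brief_prompt = "x" * 1200     # stand-in for ~300 tokens of briefs

print(history_budget(scripted_prompt))  # → 3040 tokens for history
print(history_budget(brief_prompt))     # → 3540 tokens for history
```

Every token the prompt saves is a token of conversation history the model keeps.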
The Counterintuitive Part: Smaller Models Do This Better
Our voice AI runs on Llama 3.1 8B through Groq for inference (~100ms). Not GPT-4. Not Claude. An 8-billion parameter model.
Brief-style prompts actually work better with smaller models because:
- Less instruction-following overhead. The model isn't parsing 800 tokens of conditional logic.
- Fewer conflicting rules to resolve. Brief prompts have clear intent with minimal ambiguity.
- The model can focus on conversation flow instead of memorizing exact phrases.
- Hardcoded scripts are essentially asking the model to be a lookup table. That wastes the one thing LLMs are actually good at: generating natural language.
Would a larger model handle scripts better? Marginally. But you'd be paying 10-50x more in inference costs to solve a problem that prompt design solves for free.
What Can Go Wrong With Briefs
We'd be lying if we said briefs are perfect out of the box. Here's what we ran into and how we fixed it:
Over-enthusiasm. Without hard guardrails, the AI occasionally gets too conversational and forgets to ask a required qualification question. Fix: keep the qualification checklist explicit (ask age, ask if currently working, ask zip code) while making the delivery of each question flexible.
Premature transfers. Early brief-style prompts sometimes triggered transfers before all qualification criteria were met. Fix: code-level qualification markers that validate completion independently of the LLM's judgment. The AI can't trigger a transfer until the system confirms all required fields are collected.
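A minimal sketch of that kind of code-level gate, with the field names and return strings invented for illustration:

```python
# Hypothetical qualification gate: the transfer fires only when the code
# (not the LLM) confirms every required field has been collected.

REQUIRED_FIELDS = ("age", "currently_working", "zip_code")  # illustrative

def can_transfer(collected: dict) -> bool:
    """True only when every required qualification field has a value."""
    return all(collected.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def handle_transfer_request(collected: dict) -> str:
    # Even if the LLM emits a transfer intent, the code blocks it until
    # qualification is complete.
    if can_transfer(collected):
        return "TRANSFER"
    missing = [f for f in REQUIRED_FIELDS if not collected.get(f)]
    return f"CONTINUE: still need {', '.join(missing)}"

print(handle_transfer_request({"age": "67", "zip_code": "90210"}))
# → CONTINUE: still need currently_working
```

The point of the design: the LLM proposes, the deterministic layer disposes.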
Repeat questions. Occasionally the AI re-asks something the caller already answered. Fix: entity tracking that injects the caller's collected information before each LLM turn, so the model knows what's already been answered.
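A minimal sketch of that injection step, assuming a standard chat-message format (the entity names and note wording are illustrative, not our exact production strings):

```python
# Hypothetical entity-tracking injection: before each LLM turn, prepend a
# summary of what the caller already answered so the model doesn't re-ask.

def build_messages(system_prompt: str, entities: dict, history: list) -> list:
    known = "; ".join(f"{k}={v}" for k, v in entities.items() if v)
    state_note = (
        f"Already collected from the caller: {known}. Do not ask again."
        if known else "Nothing collected from the caller yet."
    )
    return (
        [{"role": "system", "content": system_prompt},
         {"role": "system", "content": state_note}]
        + history
    )

msgs = build_messages(
    "You are an outbound final expense agent.",  # brief-style prompt
    {"age": "67", "zip_code": None},             # tracked entities so far
    [{"role": "user", "content": "Hello?"}],
)
print(msgs[1]["content"])
# → Already collected from the caller: age=67. Do not ask again.
```

Because the note is rebuilt every turn, the model never has to remember the answers itself; the system remembers for it.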
These are engineering problems, not fundamental flaws with the approach. Every one of them was solved in production within days.
How to Apply This to Your AI Voice Agent
If you're running AI outbound calls with hardcoded scripts, try this:
Step 1: Audit your current prompt. How many tokens is it? If it's over 500 tokens of explicit scripted phrases, you probably have instruction conflicts and prompt leaks you haven't noticed.
Step 2: Extract the intent. For each scripted rebuttal, write down what the AI needs to convey, not what it needs to say. "Final expense pays in 24-48 hours vs 60-90 days for regular life insurance" is the intent. The exact wording is the AI's job.
Step 3: Keep qualification explicit. The checklist of what needs to be collected (age, location, eligibility) should stay hardcoded. The conversational delivery of those questions should be flexible.
Step 4: Measure transfer rate, not call quality scores. The only metric that matters is whether qualified leads are getting transferred. If transfers go up, the AI is doing its job regardless of whether it matches your ideal script.
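Step 4 reduces to simple arithmetic over call logs. A sketch with an invented log format, where `transferred` is set only on buyer-confirmed transfers:

```python
# Illustrative transfer-rate metric over a call log.

def transfer_rate(calls: list) -> float:
    """Confirmed transfers divided by total calls, as a percentage."""
    if not calls:
        return 0.0
    transfers = sum(1 for c in calls if c.get("transferred"))
    return 100.0 * transfers / len(calls)

# 32 confirmed transfers out of 1,000 calls:
calls = [{"transferred": True}] * 32 + [{"transferred": False}] * 968
print(f"{transfer_rate(calls):.1f}%")  # → 3.2%
```

Track this number per campaign, per day, and nothing else until it moves.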
The Management Analogy
Think of it like managing a human sales team. You wouldn't hand your top closer a word-for-word script. You'd give them talking points, objection themes, and qualification criteria, then let them read the room.
LLMs work the same way.
Scripts make bad agents. Briefs make great ones.
FAQ
Does this only work with large language models?
No. Our production system runs on Llama 3.1 8B, which is a relatively small model. Brief-style prompts actually perform better on smaller models because there's less instruction-following overhead and fewer conflicting rules to resolve.
Won't the AI say something wrong without a script?
The AI can still say something unexpected, but our experience shows it happens less often than with scripts. Hardcoded scripts create more failure modes (prompt leaks, Frankenstein responses, rigidity when callers go off-script) than brief-style prompts.
What about compliance? Don't regulated verticals need exact scripts?
Keep compliance requirements explicit and non-negotiable in the prompt. "You must disclose that this is a recorded call" stays hardcoded. "How you explain the product benefits" becomes a brief. Separate what must be said from how it's said.
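One way to keep that separation explicit is to assemble the prompt from two lists, verbatim disclosures and briefs. The section labels and example lines below are illustrative, not a compliance template:

```python
# Hypothetical prompt assembly: compliance lines stay word-for-word,
# product explanations stay as briefs.

COMPLIANCE = [  # must be delivered exactly as written
    'Open with: "This call may be recorded for quality purposes."',
]

BRIEFS = [  # intent only; the model chooses the wording
    "If they already have life insurance: final expense pays in 24-48 "
    "hours for funeral costs vs 60-90 days for regular life insurance.",
    "If money is tight: plans start very affordable; the goal is "
    "protecting family from thousands in funeral costs.",
]

def build_system_prompt() -> str:
    return "\n".join(
        ["REQUIRED DISCLOSURES (say these exactly):"]
        + COMPLIANCE
        + ["", "TALKING POINTS (convey the idea in your own words):"]
        + BRIEFS
    )

print(build_system_prompt())
```

When compliance language changes, you edit one list without touching the conversational half of the prompt.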
How long did it take to see results?
Transfer rate improvements were visible within the first day of calling after the prompt change. We waited 4-5 calling days before drawing conclusions to ensure the data was stable.
What's the cost difference?
Nothing. If anything, shorter prompts reduce inference costs slightly because there are fewer input tokens per call. The improvement is free.