#2183: Making Voice Agents Feel Natural

Turn-taking, interruptions, and latency are destroying voice AI UX—and the fixes are deeply technical. Here's what's actually happening underneath.

Episode Details

  • Episode ID: MWP-2341
  • Published:
  • Duration: 28:37
  • Audio: Direct link
  • Pipeline: V5
  • TTS Engine: chatterbox-regular
  • Script Writing Agent: claude-sonnet-4-6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Invisible Engineering Behind Natural Voice Conversations

Voice AI feels almost solved on the surface. Transcription is accurate. Synthesis sounds natural. But users consistently report that talking to voice agents feels "slightly wrong"—and the reason has nothing to do with voice quality.

The real problem lives in conversational dynamics: the split-second decisions that determine whether an agent keeps talking when you say "uh-huh," whether it cuts you off mid-thought, how long it waits before responding, and whether it can sense frustration in your tone. These invisible failure modes are where the actual engineering is happening in 2026.

The Backchannel Problem

Voice Activity Detection (VAD) is the naive solution: listen for audio energy from the user and stop when you detect it. The problem is VAD is completely dumb. It cannot distinguish a genuine interruption—the user actively trying to take over—from a backchannel acknowledgment like "mm-hmm" or "right." Both look identical to the system.

The result: the agent stops mid-sentence because you said "okay" to show you were listening. Then it waits for your next input while you wait for it to continue. Awkward silence.

The three major platforms handle this differently:

Vapi abstracts the problem with a stopSpeakingPlan that offers two modes. The first is VAD-based: fast but content-blind. The second is transcription-based: it waits for a configurable number of transcribed words (numWords) before yielding. Set it to two, and the user has to say two recognizable words before the agent stops. This trades 200-500ms of response speed for accuracy, reducing false positives dramatically. Vapi also maintains an acknowledgementPhrases list ("okay," "right," "uh-huh," "got it," "mm-hmm") that tells the system to treat those utterances as backchannels and keep talking.
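As a rough sketch of how those two settings interact, here is the decision logic in miniature. The field names (numWords, acknowledgementPhrases) follow the article's description of Vapi; the exact API schema may differ, and the helper function is purely illustrative:

```python
# Illustrative Vapi-style assistant config. Field names follow the article's
# description; treat this as a sketch, not the actual API schema.
assistant_config = {
    "stopSpeakingPlan": {
        # Wait for two transcribed words before yielding the turn,
        # trading ~200-500ms of delay for far fewer false interruptions.
        "numWords": 2,
        # Phrases treated as backchannels: the agent keeps talking.
        "acknowledgementPhrases": ["okay", "right", "uh-huh", "got it", "mm-hmm"],
    },
}

def should_yield_turn(transcribed_words: list, plan: dict) -> bool:
    """Decide whether the agent should stop speaking, per the plan above."""
    stop = plan["stopSpeakingPlan"]
    # A lone backchannel ("mm-hmm") is ignored entirely.
    if (len(transcribed_words) == 1
            and transcribed_words[0].lower() in stop["acknowledgementPhrases"]):
        return False
    # Otherwise yield only once enough real words have been transcribed.
    return len(transcribed_words) >= stop["numWords"]

print(should_yield_turn(["mm-hmm"], assistant_config))        # False: backchannel
print(should_yield_turn(["wait", "stop"], assistant_config))  # True: real interruption
```

Note the failure mode this still leaves open: a single non-backchannel word ("no") also fails to trigger a yield until a second word arrives, which is exactly the speed-for-accuracy trade the text describes.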

LiveKit takes the opposite approach: they expose the full framework and let developers configure turn-taking to their specific needs. You can set allow_interruptions directly, call interrupt() explicitly in code, and build custom logic around handoffs. Their smart endpointing uses a sigmoid-curve wait function—a mathematical formula that returns wait time in milliseconds based on speech-completion probability. At 50% confidence the user has finished, you might wait 200ms. At 90% confidence, you're down to 50ms. It's not a fixed threshold; it's a continuous function you can tune.
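The shape of that sigmoid curve is easy to see in code. The parameters below are one plausible tuning chosen to reproduce the article's example numbers (roughly 200ms at 50% confidence, roughly 50-60ms at 90%); they are not LiveKit's actual defaults:

```python
import math

def wait_ms(completion_prob: float,
            min_wait: float = 40.0,
            max_wait: float = 380.0,
            steepness: float = 7.0,
            midpoint: float = 0.5) -> float:
    """Sigmoid endpointing curve: map speech-completion probability to a
    wait time in milliseconds. Parameters are illustrative, not LiveKit's
    real defaults -- the point is the shape, not the exact numbers."""
    s = 1.0 / (1.0 + math.exp(-steepness * (completion_prob - midpoint)))
    # High completion probability -> short wait; low probability -> long wait.
    return max_wait - (max_wait - min_wait) * s

print(round(wait_ms(0.5)))  # 210 -- roughly the 200ms regime
print(round(wait_ms(0.9)))  # 59 -- roughly the 50ms regime
```

Raising `steepness` makes the configuration more aggressive (wait time collapses quickly once confidence climbs); lowering it gives the conservative behavior that adds wait time at middling confidence.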

Pipecat, the open-source option (BSD-2-clause license), uses SmartTurn version 3.2—a Whisper Tiny backbone with a linear classifier layer, about 8 million parameters. The CPU version is quantized to 8 megabytes. What makes it different from pure VAD is that it looks at prosodic cues—pitch, intonation, speaking rate—rather than just audio energy. It waits for 200ms of silence from Silero VAD, evaluates whether a turn shift should occur, and if confidence is too low, it defers. If silence persists for 3 seconds, it forces the transition anyway. That fallback is critical: you don't want the agent waiting forever.
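The control flow described above, with the classifier stubbed out, looks roughly like this. Thresholds mirror the article (200ms silence trigger, 3s forced fallback); the confidence threshold is illustrative, and in Pipecat the confidence would come from the ~8M-parameter SmartTurn model rather than a bare float:

```python
# Sketch of SmartTurn-style turn-taking: Silero VAD supplies the silence
# duration, a small prosody-aware classifier supplies turn-shift confidence.
SILENCE_TRIGGER_MS = 200     # evaluate only after 200ms of VAD silence
FORCE_TIMEOUT_MS = 3000      # force the turn transition after 3s regardless
CONFIDENCE_THRESHOLD = 0.6   # illustrative; the real threshold is configurable

def end_of_turn(silence_ms: float, turn_confidence: float) -> bool:
    """Return True if the agent should take the turn now."""
    if silence_ms < SILENCE_TRIGGER_MS:
        return False             # user audio too recent: keep listening
    if silence_ms >= FORCE_TIMEOUT_MS:
        return True              # fallback: never wait forever
    return turn_confidence >= CONFIDENCE_THRESHOLD

print(end_of_turn(100, 0.9))   # False: not enough silence yet
print(end_of_turn(400, 0.3))   # False: silent, but prosody says "thinking pause"
print(end_of_turn(400, 0.8))   # True: silent and confident the turn is over
print(end_of_turn(3200, 0.1))  # True: forced transition after 3s
```

The middle case is the one pure VAD gets wrong: 400ms of silence with low turn confidence is a thinking pause, not an end of turn.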

Krisp recently entered the space with their VIVA SDK, a 6-million-parameter audio-only turn-taking model optimized for CPU inference. Benchmarked against SmartTurn, Krisp achieves 0.82 balanced accuracy versus SmartTurn's 0.78, and more importantly, 0.9 seconds mean shift time versus 1.3 seconds for SmartTurn at the same false positive rate. Thirty percent faster at equivalent accuracy while being 5-10x smaller.

The Thinking Pause Problem

But backchannel confusion is just the tip of the iceberg. The deeper issue is what Speechmatics calls the "thinking pause" problem.

Consider someone saying "I understand your point, but..." and then pausing for a full second while they formulate the next thought. VAD-only systems call that an end of turn. A human listener intuitively keeps waiting. In production, the consequences are worse than awkwardness. In finance, customers spelling out account numbers pause between digits—the agent cuts them off mid-sequence. In healthcare, patients recalling a patient ID from memory pause—same problem. Every premature interruption also drives up LLM API costs because you're reprocessing a misinterpreted partial utterance.

The field has converged on three approaches to turn detection:

Audio-based approaches analyze prosodic features—pitch, energy, intonation. Fast and lightweight, works in real-time, but misses semantic context.

Text-based approaches analyze transcribed content for sentence boundaries, discourse markers, question markers. More accurate but adds latency.

Multimodal fusion is where Deepgram's Flux model lives. Launched late 2025, Flux's key innovation is architectural: the same model producing transcripts is also modeling conversational flow and turn detection. You're not running ASR, getting text, then running a separate turn-detection model. Turn detection happens in the same forward pass, eliminating significant sequential delay.

Deepgram's benchmark numbers: Flux cuts agent response latency by 200-600 milliseconds compared to pipeline approaches, reduces false interruptions by about 30%, achieves p90 latency of 1 second. They define two core conversation-native events—StartOfTurn and EndOfTurn—with a configurable confidence threshold called eot_threshold, defaulting to 0.7. You can drop to 0.5-0.6 for aggressive response (higher risk of cutting users off) or raise to 0.9-1.0 to wait longer. They also offer eager end-of-turn detection, which fires 150-250 milliseconds earlier than the standard event, allowing speculative LLM calls. The cost is 50-70% more LLM calls, but for latency-critical applications, that tradeoff often makes sense.

The Latency Budget

This is where engineering becomes genuinely unforgiving.

The magic number is 300 milliseconds. Human conversation has a natural inter-turn pause of 200-300ms. Research shows pauses above 400ms are perceptible, and beyond 1.5 seconds you've fundamentally shifted the user's mental model from "conversation" to "query-response." Once that shift happens, no voice quality improvement rescues the experience.

The latency budget breakdown from current production engineering:

  • STT finalization: 50-100ms
  • LLM time-to-first-token: 100-200ms
  • TTS time-to-first-byte: 50-80ms
  • WebRTC transport: 20-50ms

Total: 220-430 milliseconds. That's the window.
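A quick sanity check on the arithmetic, summing the per-stage minima and maxima from the list above (note the worst-case sum, 430ms, already exceeds the 400ms perceptibility threshold):

```python
# Stage latency budget as (low, high) ranges in milliseconds.
budget = {
    "stt_finalization": (50, 100),
    "llm_first_token":  (100, 200),
    "tts_first_byte":   (50, 80),
    "webrtc_transport": (20, 50),
}
low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())
print(low, high)  # 220 430 -- only the low end sits comfortably under 300ms
```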

The LLM piece shows the most variance by model choice:

  • Groq-hosted Llama variants: 50-100ms time-to-first-token
  • GPT-4o-mini: 120-200ms
  • Gemini 1.5 Flash: ~300ms (already at the edge)
  • GPT-4o: ~700ms
  • Frontier reasoning models (extended thinking): seconds

The capability-latency tradeoff is brutal. The models fast enough for voice tend to be smaller and less capable at complex reasoning.

Streaming architecture is non-negotiable. A naive sequential pipeline—wait for full STT, then run LLM, then run TTS—produces 600-2000ms of latency. The production solution streams across all three stages simultaneously. Streaming STT emits partial transcripts in 20ms audio chunks, so the LLM starts processing before the user even finishes speaking. Streaming LLM sends tokens to TTS as they arrive. Streaming TTS begins synthesizing audio from the first sentence fragment while the LLM is still generating later paragraphs. Combined savings: 300-600ms over batch processing.
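The overlap can be seen in a toy asyncio pipeline. All three stages here are stubs standing in for real streaming STT, LLM, and TTS clients; the point is that each stage yields output as soon as partial input arrives, rather than waiting for the previous stage to finish:

```python
import asyncio

async def stt(audio_chunks):
    # Emit a partial transcript per audio chunk.
    async for chunk in audio_chunks:
        yield f"word{chunk}"

async def llm(transcript_stream):
    # Start generating tokens as partial transcripts arrive.
    async for partial in transcript_stream:
        yield f"token-for-{partial}"

async def tts(token_stream):
    # Begin synthesis on the first token, not the full response.
    async for token in token_stream:
        yield f"audio({token})"

async def mic(n):
    # Stand-in for a microphone producing n audio chunks.
    for i in range(n):
        yield i

async def main():
    return [frame async for frame in tts(llm(stt(mic(3))))]

frames = asyncio.run(main())
print(frames[0])  # the first audio frame exists after the first word,
                  # not after the full utterance
```

In a batch pipeline the first audio byte waits for the entire chain to complete; here the first frame is available as soon as one chunk has flowed through all three stages, which is where the 300-600ms savings comes from.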

Transport also matters. PSTN—traditional phone calls—adds 150-700ms of network transit, a penalty no model optimization recovers. Geographic co-location matters too. A user in Australia calling a system hosted in Virginia pays a latency tax that can't be engineered away.

The Remaining Frontier

What about emotional and prosodic awareness? Can voice agents actually read the room in 2026?

The short answer: not reliably. Emotion recognition from voice is notoriously difficult and prone to false positives. Most production systems don't attempt it. Some platforms offer basic sentiment detection—flagging high-stress indicators in speech rate or pitch—but using that signal to adjust agent behavior remains mostly experimental.

The deeper issue is that emotion detection and turn-taking require different model architectures. You can't easily bolt emotion recognition onto a turn-detection pipeline without adding latency or complexity. And given that the latency budget is already razor-thin, most developers prioritize getting turn-taking right before attempting emotion.

Why This Matters

These aren't theoretical problems. They're the difference between a voice agent that feels conversational and one that feels like you're talking to a system. The engineering is invisible to users, but the experience isn't.

The good news: the solutions are getting better and more standardized. Turn detection is moving from heuristics to learned models. Latency budgets are becoming better understood. Streaming architectures are becoming default rather than optimization.

The bad news: most developers building on top of these platforms don't fully understand the failure modes. They're using Vapi or LiveKit or Pipecat with defaults set and they don't know what's actually happening underneath. That gap between surface quality and underlying dynamics is where most voice AI still feels slightly off.


Transcript

Corn
Here's what Daniel sent us this week. He wants to dig into the deeply technical side of voice agent UX — not TTS quality, not transcription accuracy, but the conversational dynamics underneath. Specifically: how agents handle interruptions when a user talks over them, how they detect when a user has actually finished speaking versus just pausing to think, what the latency budgets look like across the full pipeline and what happens when you blow them, how agents maintain conversation flow while they're off fetching data from some external system, and finally, the state of emotional and prosodic awareness in 2026 — whether voice agents can actually read the room. Five topics, all connected by the same core question: why does talking to a voice agent still feel slightly wrong, and what's actually being done about it?
Herman
This is one of those topics where the hard part is invisible to almost everyone using these systems. The perception is that voice AI is basically solved — transcription is good, voices sound great — but the failure modes Daniel is pointing at are where the real engineering is happening right now.
Corn
And I'd argue most developers building on top of these platforms don't fully understand the failure modes either. They're using Vapi or LiveKit or Pipecat and they've got the defaults set and they don't know what's actually going on underneath.
Herman
By the way, today's episode is powered by Claude Sonnet 4.6 — our script-writing AI of choice this week.
Corn
Alright, let's start with interruption handling because I think it's the most viscerally familiar problem. Everyone has had that experience of trying to cut off a voice agent and it just... keeps talking at you.
Herman
So the naive architecture is Voice Activity Detection — VAD — which listens for the presence of audio energy from the user and triggers a stop when it detects it. The detection itself is fast, fifty to a hundred milliseconds. The problem is VAD is completely dumb about what it's detecting. It cannot distinguish a genuine barge-in — the user actively trying to take over — from a backchannel acknowledgment like "uh-huh" or "right" or "okay." Those are fundamentally different conversational acts, and VAD treats them identically.
Corn
So the agent stops mid-sentence because you said "mm-hmm" to show you were listening.
Herman
Every time. And then it's waiting for your next input and you're waiting for it to continue and there's this awkward silence. That's the failure mode. Now, the three platforms Daniel mentioned handle this differently in ways that reveal a lot about their design philosophy. Vapi has the most publicly documented interruption system. They have a stopSpeakingPlan with two modes. The first is VAD-based — fast but dumb, as we said. The second is transcription-based, where the system waits for a configurable number of transcribed words before it decides to stop. So you set numWords to two, and the user has to actually say two recognizable words before the agent yields. That buys you two hundred to five hundred milliseconds of delay but it cuts false positives dramatically.
Corn
That's an interesting design choice. You're trading response speed for accuracy in the interruption detection itself.
Herman
And Vapi also has what they call acknowledgementPhrases — a list of words that, when detected, tell the system to ignore the interruption entirely. "Okay," "right," "uh-huh," "got it," "mm-hmm" — if you say any of those, the system treats it as a backchannel and keeps talking. That's a meaningful step toward solving the backchannel problem, even if it's a list-based heuristic rather than a learned model.
Corn
LiveKit takes a completely different approach, right? Less opinionated.
Herman
LiveKit is the opposite end of the spectrum. They describe themselves as the LEGOs of voice AI — they expose the full framework and let developers configure turn-taking to their specific needs. You can set allow_interruptions directly, you can call interrupt() explicitly in code, and you can build custom logic around when handoffs happen. They also use WebRTC Selective Forwarding Units rather than WebSockets, which gives you better packet loss handling and scalability. Their smart endpointing uses a sigmoid-curve wait function — a mathematical formula that returns wait time in milliseconds based on speech-completion probability. You can tune it aggressively for fast response or conservatively for careful response.
Corn
The sigmoid curve is interesting. It's not a fixed threshold, it's a continuous function. So at fifty percent confidence that the user has finished, you might wait two hundred milliseconds, but at ninety percent confidence you're already down to fifty milliseconds.
Herman
That's the aggressive configuration, yeah. The conservative version adds seconds of wait time at lower confidence levels. The point is that LiveKit gives you the knobs. Vapi abstracts them. And then Pipecat is the open-source option — BSD-2-clause license — and their approach is their SmartTurn model, now at version 3.2. It's a Whisper Tiny backbone with a linear classifier layer, about eight million parameters, available in a CPU version at eight megabytes quantized and a GPU version at thirty-two megabytes. It runs in as little as ten milliseconds on some CPUs.
Corn
Eight megabytes for turn detection. That's remarkably small.
Herman
It's small because the backbone is Whisper Tiny, which is already a tiny model. But the key thing SmartTurn does that pure VAD doesn't is it looks at prosodic cues — pitch, intonation, speaking rate — rather than just audio energy. It waits for two hundred milliseconds of silence from Silero VAD, evaluates whether a turn shift should occur, and if confidence is too low, it defers the decision. If silence persists for three seconds, it forces the transition anyway. That fallback is important — you don't want the agent waiting forever because it couldn't make up its mind.
Corn
So VAD is the trigger, SmartTurn is the judge.
Herman
Krisp has entered this space too with their VIVA SDK — a six-million-parameter audio-only turn-taking model optimized for CPU inference, operating on hundred-millisecond audio frames. They benchmarked it against SmartTurn versions one and two. Krisp's model achieves a balanced accuracy of 0.82 versus SmartTurn's 0.78, and more importantly, 0.9 seconds mean shift time versus 1.3 seconds for SmartTurn at the same false positive rate. Thirty percent faster at equivalent accuracy while being five to ten times smaller.
Corn
Okay so let's talk about why turn-taking is genuinely hard, because I think the backchannel problem is the tip of the iceberg. The deeper issue is what Speechmatics calls the "thinking pause" problem.
Herman
This is where it gets subtle. Consider someone saying "I understand your point, but..." and then pausing for a full second while they formulate the next thought. VAD-only systems call that an end of turn. A human listener intuitively keeps waiting. And the consequences in production are worse than just awkward. In finance, customers spelling out account numbers pause between digits — the agent cuts them off mid-sequence. In healthcare, patients recalling a patient ID from memory pause — same problem. Every premature interruption also drives up LLM API costs because you're reprocessing a misinterpreted partial utterance.
Corn
And the solution isn't just "wait longer," because if you wait too long you've got a different problem.
Herman
The field has converged on three approaches to turn detection. Audio-based approaches analyze prosodic features — pitch, energy, intonation. Fast and lightweight, works in real-time, but misses semantic context. Text-based approaches analyze the transcribed content for sentence boundaries, discourse markers, question markers. More accurate but more latency. And then the multimodal fusion approach, which is where Deepgram's Flux model lives.
Corn
Flux is interesting because it's architecturally different from bolting turn detection onto ASR after the fact.
Herman
The key innovation in Flux — launched late 2025 — is that the same model producing transcripts is also modeling conversational flow and turn detection. You're not running ASR, getting text, and then running a separate turn-detection model on that text. The turn detection is happening in the same forward pass. That eliminates a significant sequential delay. Their benchmark numbers: cuts agent response latency by two hundred to six hundred milliseconds compared to pipeline approaches, reduces false interruptions by about thirty percent, achieves p90 latency of one second. And they define two core conversation-native events — StartOfTurn and EndOfTurn — with a configurable confidence threshold called eot_threshold.
Corn
What's the default threshold?
Herman
0.7. You can go down to 0.5-0.6 for aggressive response — higher risk of cutting users off — or up to 0.9-1.0 if you want the agent to wait longer and be very sure the user is done. They also have something called eager end-of-turn detection, which fires an EagerEndOfTurn event 150 to 250 milliseconds earlier than the standard event, allowing speculative LLM calls. The cost is fifty to seventy percent more LLM calls. For latency-critical applications, that tradeoff often makes sense.
Corn
Let's talk latency budgets, because this is where the engineering gets genuinely unforgiving.
Herman
The magic number is three hundred milliseconds. Human conversation has a natural inter-turn pause of two hundred to three hundred milliseconds. Research shows pauses above about four hundred milliseconds are perceptible, and beyond 1.5 seconds you've fundamentally shifted the user's mental model from "conversation" to "query-response." Once that shift happens, no voice quality improvement rescues the experience. The budget breakdown from current production engineering: STT finalization takes fifty to a hundred milliseconds, LLM time-to-first-token takes a hundred to two hundred milliseconds, TTS time-to-first-byte takes fifty to eighty milliseconds, WebRTC transport adds twenty to fifty milliseconds. Total: two hundred twenty to four hundred thirty milliseconds. That's the window.
Corn
And the LLM piece is where you see the most variance by model choice.
Herman
Dramatically so. Groq-hosted Llama variants hit fifty to a hundred milliseconds time-to-first-token. GPT-4o-mini is a hundred twenty to two hundred milliseconds. Gemini Flash 1.5 is around three hundred milliseconds, which is already at the edge. GPT-4o is around seven hundred milliseconds. And frontier reasoning models — the extended thinking variants — are in seconds. So the capability-latency tradeoff is real and brutal. The models fast enough for voice tend to be smaller and less capable at complex reasoning.
Corn
Which is why streaming architecture is non-negotiable.
Herman
A naive sequential pipeline — wait for full STT, then run LLM, then run TTS — produces six hundred to two thousand milliseconds of latency. The production solution is streaming across all three stages simultaneously. Streaming STT emits partial transcripts in twenty-millisecond audio chunks, so the LLM starts processing before the user even finishes speaking. Streaming LLM sends tokens to TTS as they arrive. Streaming TTS begins synthesizing audio from the first sentence fragment while the LLM is still generating later paragraphs. Combined savings: three hundred to six hundred milliseconds over batch processing.
Corn
And then transport matters too. Phone calls are a different beast.
Herman
PSTN — traditional phone calls — adds a hundred fifty to seven hundred milliseconds of network transit. That's a penalty no model optimization recovers. You can have the fastest LLM on the planet and still blow your latency budget because the phone network ate it. Geographic co-location matters too. A user in Australia hitting a Virginia datacenter adds two hundred to three hundred milliseconds of round-trip before a single token is processed.
Corn
The FDB-v3 benchmark from April 2026 is worth getting into here because it shows the full picture across systems.
Herman
The Full-Duplex-Bench-v3 from National Taiwan University and NVIDIA tested six systems on multi-step tool use with real human audio. Not simple response latency — task completion latency including tool calls. GPT-Realtime completed tasks in 6.89 seconds. Gemini Live 2.5 at 7.26 seconds. Grok at 6.65 seconds. The cascaded pipeline — Whisper into GPT-4o into TTS — took 10.12 seconds, dominated by an 8.78-second first-word delay. And Gemini Live 3.1 was the fastest at 4.25 seconds.
Corn
But Gemini Live 3.1 has a problem.
Herman
A significant one. Despite being fastest at task completion, it has the worst turn-take rate — seventy-eight percent. It produces no speech at all in twenty-two percent of scenarios. And eighty-six percent of those silent cases still executed tool calls — the model found the right API to call but never generated speech. The paper calls it a "disconnect between reasoning and speech generation." It's concentrated in harder scenarios: zero percent of easy tasks, 23.5 percent of medium tasks, and 46.7 percent of hard tasks received no response.
Corn
So the fastest system is also the least reliable. Speed and reliability are directly in tension.
Herman
Which is the central tradeoff in architecture choice. And there's a counterintuitive finding that connects to this. The uncanny silence problem isn't about latency at all — it's about prosody. You can get latency under three hundred milliseconds, transcription accuracy is high, the voice sounds good in isolation, and users still report something feels off. Sesame AI's research paper from February 2025, "Crossing the Uncanny Valley of Conversational Voice," nails this. Their CMOS evaluation found that when human evaluators are shown generated versus real speech without any conversational context, they show no clear preference — naturalness is saturated, modern TTS matches human performance on that metric. But when you give evaluators ninety seconds of conversational context and ask which continuation feels more appropriate, they consistently favor the human recordings. The gap isn't in audio quality. It's in contextual prosodic appropriateness.
Corn
The model doesn't know how to speak a sentence given the emotional and conversational history.
Herman
That's the one-to-many problem. There are countless valid ways to speak any given sentence, but only some fit a given conversational moment. Without the emotional and conversational context, the model doesn't have the information to choose the right one. And there's also a counterintuitive point Speechmatics makes about response speed: agents that respond in two hundred milliseconds feel wrong — not impressive. Human conversations have a natural six hundred millisecond inter-turn pause. That slight delay signals that the listener is processing what was said. This is why Vapi's waitSeconds parameter exists — it's a deliberate artificial delay applied after all processing completes, before the assistant speaks. Default is 0.4 seconds. Healthcare applications push it to 0.6 to 0.8 seconds. Gaming applications go down to zero.
Corn
The field spent years optimizing for speed and is now deliberately adding latency back in. That's a great headline.
Herman
It really is. And the self-correction problem from FDB-v3 is equally sobering. They tested what happens when users correct themselves mid-utterance — "Book me a flight to New York — actually, make that Boston." Results across all systems: GPT-Realtime, the best performer, scored 0.588 pass rate on self-correction scenarios. Gemini Live 2.5 at 0.471. The cascaded pipeline at 0.176 — worse than most random baselines. The cascaded pipeline fails because Whisper finalizes the original transcription before the correction arrives, so the downstream LLM never receives the updated intent. Even the best end-to-end models fail on over forty percent of self-corrections. This is arguably the biggest unsolved problem in voice agent UX right now.
Corn
Okay, function calling. This is where I think most production deployments fall apart in ways that users notice but can't diagnose.
Herman
The core tension is that most useful voice agents need to call external systems mid-conversation — databases, CRMs, scheduling APIs. Those calls range from fifty milliseconds to five hundred milliseconds with high variance. The wrong pattern is treating that as a synchronous blocking operation. The right pattern is treating it as an event that needs to be masked. And the FDB-v3 paper gives us the most detailed empirical breakdown of how different systems actually handle this.
Corn
The filler rate numbers are striking.
Herman
Filler rate is the percentage of responses containing a content-free sentence before the substantive response — something like "Sure, let me look that up." GPT-Realtime: 16.9 percent filler rate, 96 percent turn-take rate, 13.5 percent interruption rate. That's the best overall balance — brief fillers used judiciously to cover tool-execution gaps. Gemini Live 2.5: 8.9 percent filler rate. Gemini Live 3.1: 31.7 percent. Grok: 44.3 percent. Cascaded pipeline: 26.9 percent. And then Ultravox at 88 percent.
Corn
Ultravox is a cautionary tale.
Herman
It's a perfect illustration of how a locally sensible heuristic becomes a global failure. Ultravox almost always emits a filler sentence before initiating the API call. First word latency looks decent — 3.88 seconds. But tool call latency is the worst of any system at 6.01 seconds, because the filler speech fires before the tool call even starts. And because it's speaking when users are still talking, it has a 47.9 percent interruption rate — nearly every other response is overlapping with a user utterance. Task completion ends up at 8.40 seconds, second worst overall.
Corn
So the filler speech is actively making the interruption problem worse.
Herman
Because the filler speech is happening during the window when users are still finishing their thought. You say "Let me check on that" and the user says "—oh, and also can you make it a window seat" and now you've got overlapping audio, the agent is confused about whether it's been interrupted, and the whole thing degrades. Grok's approach is the opposite extreme. It has the highest pre-emptive tool call rate — 41.6 percent — meaning it invokes APIs before the user finishes speaking. But it does this silently, while letting the user continue talking. So the tool call is running in the background, the user finishes their thought, and the agent already has the data it needs.
Corn
But pre-emptive tool calls have their own problem with self-correction.
Herman
That's exactly where Gemini Live 3.1's speed advantage becomes a liability. Its tool-call latency is negative 2.27 seconds — it invoked the API 2.27 seconds before the user finished speaking. If the user then corrects themselves, the API was already called with the original, uncorrected destination. The data is locked in. The fundamental tension is that the same early processing that makes agents fast frequently locks in outdated user intent. You can't update what you've already committed to.
Corn
What does good production practice actually look like here?
Herman
A few patterns that work. Pre-fetching predictable data — if you know that at call start you'll almost certainly need to load the customer's account information, fire that request at call initiation before the first user response arrives. Concurrent masking — acknowledge the request verbally while the API call runs in parallel, generating filler response audio to cover the wait. Threshold-based bridging — if a call exceeds a latency threshold, return a bridging statement rather than silence. Vapi also supports regex-based custom endpointing rules — so when the assistant has just asked for a phone number, you can extend the wait timeout to three seconds to give users time to recall it. That's a nice example of domain-specific configuration that generic defaults don't cover.
Corn
Let's close on emotional and prosodic awareness, because this is where the gap between "technically functional" and "actually feels good" lives.
Herman
Sesame's framework for this is useful. They call the goal "voice presence" — the quality that makes spoken interactions feel real, understood, and valued. Their breakdown: emotional intelligence, meaning reading and responding to emotional contexts; conversational dynamics, meaning natural timing and emphasis; contextual awareness, meaning adjusting tone to match the situation; and consistent personality. Their Conversational Speech Model — CSM — is a multimodal transformer processing interleaved text and audio tokens. Three sizes: one billion, three billion, and eight billion parameter backbones. Trained on roughly a million hours of predominantly English audio. Open-sourced under Apache 2.0.
Corn
And their key finding is that even with all of that, contextual prosodic appropriateness still falls short when evaluators have conversational history to compare against.
Herman
Their conclusion is honest about it: CSM can model text and speech content in a conversation, but not the structure of the conversation itself. Human conversations involve turn-taking, pauses, pacing, and dynamics that the model has to learn implicitly from data, and even a million hours of training data isn't enough to fully close that gap. Their view is that the future lies in fully duplex models that can learn these dynamics end-to-end.
Corn
Vapi's emotion detection layer is interesting in this context because it's a closed-box implementation.
Herman
Their proprietary Orchestration Layer includes emotion detection — analyzing emotional tone and passing it to the LLM as context. The LLM can then adjust its response tone based on that metadata. But developers can't inspect or customize it. That's a deliberate architectural choice. The Orchestration Layer — which also handles endpointing, interruption detection, backchanneling, and filler injection — is explicitly described as Vapi's core value proposition. It's the one component in their stack where you cannot bring your own infrastructure. Everything else in the Vapi stack is replaceable. The Orchestration Layer is not.
Corn
Which is interesting from a business perspective. The moat isn't the model or the voice or the transcription — it's the conversational orchestration.
Herman
And Speechmatics has a useful phrase for the failure mode that all of this is trying to prevent: the "uncanny valley of conversation." Where interactions feel just human enough to set expectations, but not sophisticated enough to meet them. The insight from their ML engineer Aaron Ng is that the most sophisticated thing a voice agent can learn isn't generating sub-two-hundred-millisecond responses — it's knowing when to stay silent. That's a reframing that I think the whole field is slowly converging on.
Corn
The backchannel problem is worth flagging as genuinely unsolved.
Herman
Krisp's roadmap explicitly lists backchannel prediction as a future release. Deepgram Flux's roadmap includes backchanneling identification. Vapi handles it with a static word list. Nobody has a fully learned, general solution for distinguishing "I'm listening, keep going" from "I want to speak now" across all the ways humans express those things. That's the frontier.
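The static word-list approach the hosts describe can be made concrete with a short sketch. The phrase list below mirrors the acknowledgement phrases mentioned earlier in the episode, but the function name, signature, and two-word threshold are purely illustrative, not any platform's actual API:

```python
# Minimal sketch of word-list backchannel filtering, in the spirit of
# what platforms like Vapi do today. Names and thresholds here are
# illustrative assumptions, not any vendor's real interface.

ACKNOWLEDGEMENT_PHRASES = {"okay", "right", "uh-huh", "got it", "mm-hmm"}

def should_yield(transcribed_words, num_words_threshold=2):
    """Decide whether the agent should stop speaking.

    transcribed_words: words recognized from the user while the agent
    is talking. We only yield once the user has said at least
    num_words_threshold recognizable words (trading response speed for
    accuracy), and never for a bare acknowledgement phrase.
    """
    utterance = " ".join(w.lower().strip(".,!?") for w in transcribed_words)
    # A pure backchannel ("mm-hmm", "got it") means keep talking.
    if utterance in ACKNOWLEDGEMENT_PHRASES:
        return False
    # Otherwise require enough transcribed words to call it a real
    # interruption; this filters out VAD false positives like coughs.
    return len(transcribed_words) >= num_words_threshold

print(should_yield(["mm-hmm"]))                 # False: backchannel
print(should_yield(["wait"]))                   # False: below word threshold
print(should_yield(["wait", "stop", "there"]))  # True: real interruption
```

The obvious limitation, and the point the hosts are making, is that this is string matching, not understanding: "got it, but actually" would need to be caught by the word threshold, not the list, and any acknowledgement phrasing outside the static set slips through entirely.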
Corn
And the TTS prosody piece — the one-to-many problem. There are countless valid ways to speak a sentence and the model doesn't have enough context to choose.
Herman
ElevenLabs frames it as a key trend for this year — AI voice agents trained to recognize emotions in speech and adjust delivery accordingly. Detecting urgency in a service request, picking up hesitation in a sales inquiry. But even their approach is about adjusting pitch and pacing parameters rather than fundamentally solving the contextual appropriateness problem. The homograph disambiguation issue alone — knowing whether to pronounce "lead" as "leed" or "led" based on conversational context — is still an active engineering problem.
Corn
What's the practical takeaway for someone building on top of these platforms today?
Herman
A few things. First, don't leave interruption handling on defaults. The difference between VAD-only and transcription-based interruption detection with a sensible acknowledgementPhrases list is the difference between an agent that frustrates users and one that feels conversational. Second, the latency budget is more unforgiving than most developers realize — and the constraint is often the LLM choice, not the STT or TTS. If you're using GPT-4o for voice, you're already over budget before the audio even starts. Third, filler speech is a double-edged sword. Used judiciously — GPT-Realtime's sixteen percent rate — it covers tool-execution gaps naturally. Used aggressively — Ultravox's eighty-eight percent — it creates more interruption problems than it solves.
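The latency point is easiest to see as back-of-envelope arithmetic. Every number below is an illustrative assumption, not a benchmark; the point is only how quickly a cascaded pipeline eats a human-feeling response window:

```python
# Back-of-envelope voice latency budget, in milliseconds.
# All component figures are illustrative assumptions.
BUDGET_MS = 800  # rough ceiling before a pause starts to feel unnatural

pipeline = {
    "stt_finalization": 150,   # streaming STT settling on a final transcript
    "llm_first_token": 350,    # time to first token; dominated by model choice
    "tts_first_audio": 150,    # time to first synthesized audio chunk
    "network_overhead": 100,   # round trips between the three services
}

total = sum(pipeline.values())
print(f"total: {total} ms, headroom: {BUDGET_MS - total} ms")
```

With these assumed figures the pipeline lands at 750 ms with only 50 ms of headroom, so swapping in a slower LLM — say 900 ms to first token — blows the budget on its own, before STT or TTS even enter the picture. That is the sense in which the LLM choice, not the audio stack, is usually the binding constraint.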
Corn
And the deliberate latency point. If your agent responds in under two hundred milliseconds, add a wait. It will feel better, not worse.
Herman
The waitSeconds parameter is one of the most counterintuitive features in voice agent engineering. You've done all this work to get fast and then you artificially slow it down. But the human conversation rhythm is real, and fighting it makes the interaction feel inhuman even when everything else is working.
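The mechanism behind a waitSeconds-style knob is simple enough to sketch. The wrapper below pads the response gap only when the pipeline finishes unnaturally fast; the function name and the 0.55-second default are illustrative assumptions, not a platform recommendation:

```python
import time

def respond_with_human_pacing(generate_reply, min_gap_seconds=0.55):
    """Sketch of deliberate response delay: if the pipeline produces a
    reply faster than a human turn-taking gap, pad the difference so
    the rhythm feels natural. The 0.55 s default is an illustrative
    assumption, not a vendor-recommended value.
    """
    start = time.monotonic()
    reply = generate_reply()
    elapsed = time.monotonic() - start
    if elapsed < min_gap_seconds:
        # Only pad when we were too fast; a slow reply gets no extra delay.
        time.sleep(min_gap_seconds - elapsed)
    return reply

reply = respond_with_human_pacing(lambda: "Sure, I can help with that.")
print(reply)
```

Note the asymmetry: the delay is a floor, not a fixed offset, so a genuinely slow response is never made slower. That is what makes the feature safe to leave on.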
Corn
The self-correction problem is the one I'd flag as most important for teams to understand going in. If your use case involves any scenario where users might revise what they're saying mid-sentence — which is most real conversations — you need to know that even the best systems fail on over forty percent of those cases. That's not a configuration problem, that's a fundamental architectural limitation right now.
Herman
And the cascaded pipeline's 0.176 pass rate on self-corrections is the strongest argument for moving toward end-to-end architectures, even with their other failure modes. The sequential bottleneck doesn't just add latency — it destroys correctness on the inputs that matter most.
Corn
Alright, that's a lot of ground covered. This topic is one of those where the more you dig in, the more you realize how much invisible engineering is holding every voice interaction together — or failing to.
Herman
Thanks as always to our producer Hilbert Flumingtop for keeping things running. Big thanks to Modal for providing the GPU credits that power this show. This has been My Weird Prompts. If you're enjoying the show, a quick review on your podcast app really does help us reach new listeners.
Corn
We'll see you on the next one.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.