#2781: When Voice AI Sounds Too Real

Voice AI platforms now let you simulate background noise, hesitation, and natural conversation — and that's a problem.

Featuring

Daniel

Corn

Herman

Listen

0:00

Episode Details

Episode ID: MWP-2944
Published: May 12
Duration: 32:46
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: voice-cloning ai-ethics financial-fraud

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Voice AI platforms have quietly crossed a threshold. Features that once seemed futuristic — natural disfluencies, variable pause lengths, ambient background sound — are now available behind dashboards with sliders. Vapi, Retell AI, and Bland AI all offer tools that make voice agents sound convincingly human. The problem is that these same tools are indistinguishable from a fraudster's toolkit.

The Stanford Internet Observatory found that variable latency — the thing that makes an agent sound thoughtful — dramatically increases persuasiveness in deception scenarios. Background noise simulation lets callers fabricate a setting. Prosody control allows hesitation to be weaponized. And in most jurisdictions, there's no blanket legal requirement to disclose that a caller is AI. The FCC's TCPA ruling on AI voices only covers unsolicited marketing robocalls, leaving a vast gray zone for one-to-one interactions.

Technical guardrails remain elusive. Audio watermarking is easily defeated by compression codecs. Protocol-level disclosure via SIP headers would require carrier cooperation and regulatory push, years away from reality. Meanwhile, platforms rely on acceptable use policies — policy guardrails, not technical ones. The constructive use cases — accessibility for visually impaired users, non-literate populations, and high-agency user-initiated interactions — show the technology's genuine potential. But the core tension remains: realism as a primary metric creates an architectural honesty problem that no terms of service can solve.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3

Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2781: When Voice AI Sounds Too Real

Daniel sent us this one — and I have to say, it's the kind of prompt that makes you realize voice AI has quietly sprinted past the point where most people's mental model of it lives. He's been playing around with Vapi, built a couple of voice agents, including one he used to prank a friend into thinking a wizard was inviting him to a barbecue. But the deeper he dug into the platform, the more he started noticing features that feel less like productivity tools and more like a fraudster's Christmas list. Background noise simulation, prosody control, latency dynamics, the fact that nothing compels you to disclose you're an AI. The question is basically — what are the genuinely good use cases here, and what do the actual safeguards look like right now?

The barbecue wizard is a perfect entry point, because it captures the whole tension. It's funny, it's harmless, it's technically impressive — and it uses exactly the same infrastructure that someone running a grandparent scam would use. There's no separate "prank stack" and "fraud stack." It's the same stack.

The wizard contains multitudes.

And the background noise thing — I want to start there, because it's more significant than it sounds. Most of the major voice AI platforms now offer ambient sound layering. Vapi has it, Retell AI has it, Bland AI has it. You can upload a looping audio file, or in some cases select from a library — coffee shop, office murmur, street traffic, airport lounge. It's marketed as "increase realism" and "improve caller comfort." Which, for legitimate use cases, makes sense. If you're calling to confirm a doctor's appointment and it sounds like the call is coming from an actual medical office rather than a silent void, you're less likely to hang up.

The flip side is that the same feature lets you fabricate a setting that exists purely to sell a lie. "I'm calling from the hotel in Milan" — and there's the espresso machine in the background.

And what makes this particularly interesting from a security standpoint is that ambient sound has historically been one of the few reliable heuristics humans use to verify context. If someone calls you and says they're at the office, but you hear seagulls and waves, you know something's off. Voice AI platforms are now letting you manufacture that verification signal. It's the audio equivalent of photoshopping a geotag.

The heuristic becomes useless the moment it's trivially fakeable. Which is the same story we've seen with every other verification signal in the last three years. Writing style, voice, now environmental audio.

The prosody control is the next layer. The old speech-to-text to text-to-speech pipeline — what people call the STT-LLM-TTS sandwich — that produced what you described as jagged results. And jagged is generous. It was terrible for conversation because you'd get these fixed latency gaps, no natural pausing, no ums or ahs, no sense that the other party was thinking. The new generation of voice-native models — GPT-4o's voice mode, ElevenLabs's conversational agent, Sesame's models — they're trained end-to-end on speech, not on text that gets converted to speech. So they produce natural disfluencies, variable pause lengths, the little micro-adjustments in timing that signal "I'm processing what you just said.

I assume the fraud potential there is that hesitation can be weaponized. If I'm running a scam and I need to sound like I'm uncertain about something — "let me just... check that account number" — the model can do that in a way that sounds completely unscripted.

And there's research on this. The Stanford Internet Observatory put out a paper last year looking at conversational agents used in social engineering. One of their findings was that variable latency — the thing that makes an agent sound thoughtful — also makes it dramatically more persuasive in deception scenarios. People rated AI callers with natural pausing as significantly more trustworthy than those with fixed response gaps, even when the actual content was identical.

Of course they did. Trust is carried on all these tiny signals we don't consciously register. And we've now productized the signals.

We've productized them and put them behind a dashboard with a slider. That's the part that should give people pause. Vapi's dashboard literally lets you adjust endpointing sensitivity, interrupt handling, how long the model waits before jumping in. These are presented as quality-of-life settings for developers, and they are — but they're also deception controls. Turn this knob to make the agent sound more hesitant. Turn this one to make it interrupt more naturally.

Let's talk about the disclosure question, because this is where I think a lot of the ethical rubber meets the road. You mentioned there's nothing that compels disclosure. Is that true across the board?

It depends on jurisdiction, but functionally yes — in most places, there is no blanket legal requirement to disclose that a caller is an AI agent. The FCC in the United States has been looking at this. They issued a declaratory ruling in early twenty twenty-four that AI-generated voices fall under the TCPA's restrictions on artificial or prerecorded voices when used in robocalls. But that's specifically for unsolicited marketing calls. If it's a one-to-one call — say, a customer service callback or a scheduled appointment reminder — the disclosure requirements get murky.

The TCPA ruling came after that fake Biden robocall in New Hampshire during the primary, right?

That was the catalyst, yes. January twenty twenty-four, a political consultant named Steve Kramer used an AI-generated voice of President Biden to discourage voters from going to the polls. The FCC moved unusually fast on that one. But the ruling is narrow. It says AI voices count as artificial voices for robocall purposes. It doesn't say anything about the broader universe of voice agent interactions.

If I'm a company and I want to replace my entire outbound customer service team with AI agents that don't disclose they're AI, there's no federal law stopping me.

And in many states, the laws are either non-existent or untested. California has some of the stronger disclosure requirements — their bolstered BOT disclosure law from twenty eighteen requires automated accounts to disclose their nature if they're being used to influence a commercial transaction or election. But enforcement has been spotty, and voice agents occupy this weird gray zone because the law was written primarily with text-based bots in mind.

What about Europe?

The EU AI Act, which started phased implementation in twenty twenty-five, has provisions around transparency for AI systems that interact with humans. Article fifty-two essentially says that users should be informed they're interacting with an AI unless it's "obvious from the circumstances." The problem is that "obvious" is doing an enormous amount of work there. As voices get better, as prosody gets more natural, as background noise gets synthesized — what's obvious?

This is the wizard problem again. The features that make the technology good are the same features that make it deceptive. There's no clean separation.

That brings us to what I think is actually the central question here — not whether individual features are good or bad, but whether the entire design philosophy of these platforms has an honesty problem baked in at the architectural level.

Say more about that.

When you build a platform where realism is the primary metric, where every slider and setting is optimized to reduce the uncanny valley gap, where the onboarding tutorials show you how to make your agent sound more human — you're not just building a tool. You're building a tool whose entire value proposition is indistinguishable from deception. The thing that makes it good at appointment scheduling is the same thing that makes it good at impersonating a family member. There's no "legitimacy mode" that's technically separate from "fraud mode.

It's like building a printing press that's really good at printing both currency and birthday cards, and then being surprised when someone prints currency.

And the platform companies are not oblivious to this. Vapi, Bland, Retell — they all have acceptable use policies that prohibit fraud and impersonation. Vapi's terms explicitly ban using the service for "deceptive or fraudulent purposes" and require compliance with applicable laws. But these are policy guardrails, not technical guardrails. They're the equivalent of a sign on the door that says "please don't rob the bank.

What would a technical guardrail even look like here? Is there a way to build disclosure into the stack so it can't be stripped out?

There are a few approaches being discussed. One is what some researchers call "audio watermarking" — embedding an inaudible signal in the audio output that identifies it as AI-generated. The problem is that this is trivially defeated by anyone who can re-encode the audio, and in a phone call context, the compression codecs would likely strip it anyway.

It works in the lab and nowhere else.

A more promising direction is mandatory disclosure at the protocol level. The IETF has an internet-draft for something called "AI-origin indication" in SIP headers — basically, a flag in the call setup that says "this call originates from an AI system." If that were widely adopted, downstream systems could display it, block it, or handle it differently. But that requires carrier cooperation, handset manufacturer buy-in, and some kind of regulatory push. We're years away from that being real.

In the meantime, the wizard is calling about the barbecue.

The wizard is calling, and nobody knows it's a wizard.

Let's pivot to the constructive side, because the prompt wasn't just about the problems — it was about where this technology actually does good work with informed consent. What are the use cases that don't make you feel like you need a shower afterward?

The accessibility space is, to me, the most unambiguously positive deployment. And I don't mean this in a "we should feel good about ourselves" way — I mean there are transformative applications. People with visual impairments using voice agents to navigate customer service phone trees that were designed by people who hate humanity. People with motor disabilities who can't easily type or tap through menus. Non-literate users who can speak but can't read. Voice is the natural interface for these populations, and AI agents that can actually hold a conversation rather than just recognize keywords change what's possible.

That's a use case where disclosure isn't just present — it's irrelevant, because the user knows exactly what they're interacting with. They initiated the interaction.

The power dynamic is completely different. Another category that I think holds up under scrutiny is what I'd call "high-agency, user-initiated" interactions. You call your bank because you want to dispute a charge — you know you're talking to an AI, you chose to make the call, and the AI can actually resolve your issue faster than waiting on hold for a human. The key distinction is whether the AI is replacing a friction you wanted to avoid, or inserting itself into an interaction you didn't ask for.

That's a useful lens. The outbound SDR replacement — the thing the prompt mentioned about eliminating sales development reps — that's the AI inserting itself. The inbound customer service call that you initiated — that's the AI replacing friction.

Even within outbound, there are gradations. An appointment reminder from your dentist's office? That's outbound, but it's low-stakes, it's expected, and the information content is minimal. Nobody's being deceived about anything meaningful. The problem is when outbound calls carry persuasive intent — sales, fundraising, political messaging — and the recipient doesn't know they're talking to a machine.

There was a piece in the Wall Street Journal a few months back about companies using AI voice agents for debt collection. That seems like the worst of all worlds.

It's a nightmare scenario. Debt collection is already a space with enormous power asymmetry, vulnerable populations, and legal complexity. Adding an AI that sounds human, can simulate empathy, and is designed to keep someone on the phone as long as possible to extract payment commitments — that's not efficiency, that's a weaponization of conversational design.

The agent never gets tired, never feels bad, never has a crisis of conscience about whether the person on the other end can actually afford to pay.

The persistence is a feature, and in that context, it's a feature that makes the interaction worse for the human being.

What does the safety conversation actually look like right now? Not the academic conversation, not the think-tank white papers — what are the platform companies actually doing?

It's a mix. The major voice AI platforms have all implemented some level of safety infrastructure, but it varies wildly in sophistication. Most have content moderation on the text side — the LLM that generates the agent's responses is typically filtered through the same safety classifiers that the underlying model provider offers. So if you're using OpenAI's real-time API as your voice engine, you get their moderation endpoint by default. If you're using an open-source model through a platform like Vapi, it depends on what you've configured.

Content moderation catches what the agent says, not what the agent is. It'll flag if the agent says something threatening or illegal, but it won't flag that the agent is pretending to be a human in the first place.

And that's the category error that most of the current safety approaches make. They're looking for bad content, not bad context. An agent that says "I'm calling from the fraud department at Wells Fargo" when it isn't — that's a bad context, but the individual words aren't problematic. The sentence structure isn't problematic. What's problematic is the entire premise of the call.

You'd need something that evaluates the agent's system prompt, not just its outputs.

Some platforms have started doing prompt-level review for certain categories of use. Vapi and Bland will review agent configurations if they're flagged, and they have automated systems that look for obviously problematic prompt patterns — "pretend to be," "don't disclose," "impersonate." But it's a cat-and-mouse game. You can write a system prompt that achieves the same deceptive outcome without using any of the obvious trigger words.

"You are a helpful financial services representative." Technically true, in the sense that a wolf is a helpful forest management consultant.

That's the fundamental challenge. The intent lives in the mind of the deployer, not in the prompt tokens. You can't classify your way out of this problem with better prompt scanning.

What about the model providers themselves? OpenAI, Anthropic, Google — they're the ones building the voice-native models that the platforms use. Do they have skin in this game?

They do, and they're approaching it differently. OpenAI's real-time API has a set of usage policies that prohibit impersonation, fraud, and undisclosed AI interaction. They also have what they call "safety by design" in the voice pipeline — the model is trained to resist certain kinds of adversarial prompting. But again, these are policy and training guardrails, not architectural ones. The model can be jailbroken, and once it's deployed through a third-party platform, the platform's own safety measures may or may not be layered on top.

It's a chain of custody problem. The model provider trusts the platform, the platform trusts the developer, the developer trusts their prompt, and somewhere in there, the end user's interests get lost.

That's exactly the right way to frame it. And the end user — the person receiving the call — is the one with the least visibility and the most at stake. They don't know what model is being used, what platform, what prompt, what safeguards. They just hear a voice.

What's the most promising thing on the horizon? If you had to point to one development that might actually move the needle on this, what would it be?

I think the most interesting work is happening around what's being called "attestation" — cryptographic proof that a voice interaction is AI-generated, attached to the call metadata in a way that can't be stripped. There's a group at MIT's Media Lab working on this, and a couple of startups that have come out of Y Combinator's most recent batch are building attestation layers specifically for voice. The idea is that instead of trying to detect AI voices — which is a losing battle — you make it trivially easy to prove that a voice is AI, and you make that proof available to anyone who wants to verify it.

Flip the burden. Instead of "prove this is a human," it's "prove this is an AI, and if you can't prove it, we assume it's human.

And that's a much more tractable technical problem, because the AI system can cryptographically sign its own output. A human can't do that. So you create a world where legitimate AI callers carry a verifiable credential, and anyone who doesn't carry one is presumed human — or at least, the absence of the credential becomes a signal that something might be off.

That's elegant. It also creates a market incentive for legitimate use cases to adopt the credential, because being verifiably AI becomes a trust signal rather than a liability. "We're an AI, we're proud of it, here's the proof.

It aligns with where regulation seems to be heading. The EU AI Act's transparency requirements are basically mandating something like this, even if they don't specify the technical mechanism. The question is whether the US will follow or whether we'll get a patchwork of state-level rules that make compliance a nightmare.

Given the current administration's posture on AI regulation, I'd bet on the patchwork.

Probably a safe bet. The Trump-Xi summit that's happening right now has AI governance on the agenda, but it's mostly about export controls and chip restrictions, not about consumer protection for voice agents. The domestic regulatory conversation is still mostly happening at the agency level — FCC, FTC — and in the states.

For someone like the person who sent this prompt — someone who's technically capable, who's built agents, who understands the stack — what should they actually do? If they want to use this technology responsibly, what does that look like in practice?

I'd say there are a few principles that are actionable right now. First, always disclose. Even if the law doesn't require it, even if the platform doesn't enforce it — just tell people they're talking to an AI. It costs you nothing, and it sets the interaction on honest footing.

"Hi, I'm an AI assistant calling on behalf of...

Second, don't simulate environments you're not in. If your agent is running on a server in Virginia, don't give it a coffee shop background to make it sound like it's calling from a local office. That's a small deception that serves no legitimate purpose.

Unless you're running a coffee shop. Then it's just method acting.

Method acting for automated appointment reminders. Third, think about the power dynamic of the interaction. Is this a call the recipient wants to receive? Did they opt in? Are they in a position to hang up without consequence? If the answers are no, maybe don't automate it.

That last one is going to eliminate about ninety percent of the current commercial use cases, and I'm fine with that.

I think we should be fine with that. The technology is impressive. The barbecue wizard is funny. But the fact that something is technically impressive doesn't mean it should be deployed everywhere it can be deployed. The history of telecommunications is basically a hundred and fifty years of finding new ways to bother people, and we finally have a chance to not do that.

"A hundred and fifty years of finding new ways to bother people" — that's the subtitle of the podcast.

It really is. And look, I don't want to be entirely negative about this. There are companies doing this right. There's a healthcare startup called Syllable that builds voice agents for hospital systems — they handle things like prescription refills and appointment scheduling — and they're obsessive about disclosure and consent. Their agents identify themselves immediately, they explain exactly what they can and can't do, and they offer an instant opt-out to a human. It can be done.

It just requires caring enough to do it.

It requires caring, and it requires a business model that doesn't depend on deception. Syllable gets paid by the hospital systems, not by maximizing call duration or conversion rates. The incentives are aligned with the patient's interests.

Which brings us back to the SDR replacement question. The reason companies are rushing to replace sales development reps with AI isn't that AI is better at building relationships — it's that AI is cheaper, and you can run ten thousand calls a day instead of fifty. The metric isn't quality, it's volume. And volume, in outbound sales, is just another word for spam.

The tragedy is that it's probably going to work, at least in the short term. If you can make ten thousand calls for the cost of one SDR's salary, and even a tiny fraction of those convert, the math works out. The externality is that everyone else's phone becomes unusable.

The tragedy of the commons, voiced by a very persuasive AI with variable latency and a coffee shop background.

That's the thing I keep coming back to. The individual features are not the problem. Background noise simulation is a neat technical trick. Prosody control is impressive engineering. MCP integrations for email follow-ups are useful. The problem is the combination, deployed at scale, without consent, in a regulatory vacuum.

To wrap this back to the core question — the safeguards. Where they are now is mostly policy-layer, easily circumvented, and uneven across jurisdictions. Where they need to go is toward cryptographic attestation, mandatory disclosure, and a regulatory framework that actually has teeth. And in the meantime, the best safeguard is the conscience of the person deploying the agent, which is not a safeguard at all.

That's a fair summary. I'd add one more thing, which is that the conversation is actually moving faster than I expected it to. The FCC's robocall ruling, the EU AI Act's transparency requirements, the attestation work coming out of MIT and Y Combinator — this is all happening in the last eighteen months. Two years ago, nobody was talking about this. Now it's a live policy debate. The question is whether the technology outpaces the safeguards, and right now, the technology is winning. But it's not an unbridgeable gap.

For the person who sent this in — the barbecue wizard deployer — I think the fact that he looked at this platform and immediately thought "wait, this is creepy" is actually the right instinct. The people building these tools should have that instinct more often.

The wizard was a canary in the coal mine.

The wizard knew too much.

And now: Hilbert's daily fun fact.

Hilbert: In the eighteen sixties, British colonial officers stationed in Eritrea documented a local variant of hurling in which players, after scoring a point, were required to immediately recite a short poem of self-praise — a practice that colonial administrators described in their reports as a "notable behavioural anomaly" and attempted to suppress, believing it encouraged insubordination among native troops who had adopted the game.

...right.

The British Empire, defeated by poetry and a stick.

This has been My Weird Prompts. Thanks to our producer, Hilbert Flumingtop. If you enjoyed this, leave us a review wherever you get your podcasts — it helps. We'll be back next time.

Same weird time, same weird feed.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.

#2781: When Voice AI Sounds Too Real

Downloads

You Might Also Like

#2781: When Voice AI Sounds Too Real