I was looking at some old clips of GPS navigation from about fifteen years ago the other day, and man, it is painful. That robotic, stilted cadence where every street name sounded like a question? "Turn left on... Main Street?" We’ve come such a long way, but it’s actually made the job harder for developers because now there are too many good choices. Today’s prompt from Daniel is about how to actually pick the right text-to-speech model in twenty twenty-six, because the landscape has just exploded.
It really has. We’ve moved past the era where you just picked the one API that didn’t sound like a blender. Now, you’re balancing model architecture, latency, V-RAM requirements, and prosody. By the way, fun fact for the listeners—today’s episode script was actually generated by Google Gemini three Flash. It’s fitting, considering we’re talking about the models that give these AI brains a voice. I’m Herman Poppleberry, and I’ve been diving into the documentation for everything from the heavy hitters like ElevenLabs to the ultra-efficient open-source stuff like Kokoro and Piper.
Usually when we talk about AI, people focus on the LLM, the "brain." But the TTS is the "face" of the product. If the voice is off, the whole user experience feels cheap. Daniel’s asking about the core technical parameters first—model size, sample rate, and latency. Herman, help me out here. If I’m a developer, why do I care if a model is a hundred million parameters versus a billion if they both sound "fine" in a demo?
Because "fine" in a demo doesn't scale to a production environment where you're paying for compute or fighting for milliseconds. Model size is your first big fork in the road. Larger models, like the ones powering the high-end ElevenLabs tiers or the massive MARS-eight family from CAMB AI, generally have a deeper understanding of context. They aren't just mapping characters to sounds; they’re predicting the emotional weight of a sentence. But the trade-off is compute. If you’re running a billion-parameter TTS model locally, you’re going to need a serious GPU, and your "time to first byte" is going to suffer. Think of it like a massive orchestra versus a solo violinist. The orchestra sounds richer, but it takes a lot longer to get everyone on stage and tuned up.
Right, and that leads directly into latency. If I’m building a voice assistant, I can’t have a three-second pause while the model "thinks" about how to say "Hello." That completely kills the "assistant" vibe. It feels like you're talking to someone on a satellite phone with a bad delay.
You're hitting the nail on the head. For real-time applications, you’re looking for models with a latency of under two hundred milliseconds. This is where smaller, optimized models like Piper or MeloTTS shine. They use architectures like VITS—Variational Inference with adversarial Learning for end-to-end Text-to-Speech. These models are designed to be fast, often running on a single CPU core. But the cost of that speed is usually "richness." You lose that breathy, human quality you get from the massive generative models. You get clarity, but you lose the "soul."
But wait, if I'm using a cloud API, isn't the latency mostly just the network trip? Or is the model inference still the bottleneck in twenty twenty-six?
It’s both. Even with the fastest fiber optics, if the model takes five hundred milliseconds to generate the first chunk of audio, the user is going to feel it. This is why "streaming" is so important. A good TTS model doesn't wait to finish the whole paragraph before it starts talking. It generates the first few words, streams them to the user, and continues generating the rest in the background. If your model architecture doesn't support efficient streaming, it doesn't matter how small it is—it's going to feel slow.
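The streaming idea Herman describes can be sketched in a few lines of Python. This is a toy illustration, not any real provider's API: the `synthesize_sentence` stand-in just fakes audio samples so the chunk-by-chunk flow is visible.

```python
import re

def synthesize_sentence(sentence):
    # Stand-in for a real TTS call: pretend each character
    # becomes one audio sample so the pipeline is testable.
    return [float(ord(c)) for c in sentence]

def stream_tts(text):
    """Yield audio sentence-by-sentence instead of waiting for
    the whole paragraph: the listener hears the first chunk
    while later ones are still being generated."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    for sentence in sentences:
        if sentence:
            yield synthesize_sentence(sentence)

chunks = list(stream_tts("Hello there. How are you today?"))
```

Because `stream_tts` is a generator, a caller can start playing the first chunk immediately while the loop keeps producing the rest in the background.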
And what about sample rate? I see twenty-two kilohertz, forty-four point one, forty-eight. Does the average person even hear the difference over a phone speaker?
On a phone speaker? Maybe not. But if your user is wearing headphones, the difference between twenty-two kilohertz and forty-eight kilohertz is the difference between a low-bitrate MP3 and sitting in the room with the person. Twenty-two kilohertz is the standard for most functional TTS. It’s clear, it’s intelligible, and it saves on bandwidth. But if you’re doing high-end content creation, like a narrated book or a premium podcast, you want forty-four point one or forty-eight. Higher sample rates mean more "air" in the voice, more high-frequency detail that makes it feel alive rather than reconstructed.
Is there a "sweet spot" for developers who want quality but need to be mindful of data costs?
Usually thirty-two kilohertz is that middle ground. It’s high enough to avoid that "telephone" compression sound but light enough that you aren't sending massive WAV files over the wire. But again, it depends on the model's native training. If a model was trained on twenty-two kilohertz data, upsampling it to forty-eight won't magically add detail. It just adds empty space.
It’s interesting because we’re seeing a split in the market. You have the "Cloud Giants" where you just send text and get back a high-quality file, and then you have this massive surge in "Edge TTS."
The Edge movement is fascinating right now. Look at something like Kokoro. It’s a relatively small model—around eighty million parameters—but the quality-to-size ratio is insane. It punches way above its weight class. For a developer, the choice often comes down to: Do I want to pay per character to a provider who handles the scaling, or do I want to host a model like Kokoro on a service like Modal and have total control over my costs and privacy?
Privacy is a huge one. If I'm building a medical app or a private banking assistant, I probably don't want to be sending sensitive user data to a third-party API every time the app needs to speak.
Precisely. That’s where local, on-device models are winning. If you can run a model like Piper directly on a user’s iPhone or Android device, the data never leaves the hardware. You eliminate the API cost, you eliminate the network latency, and you gain massive trust with the user. The trade-off is that you're limited by the device's NPU or GPU. You can't run a massive ElevenLabs-style model on a five-year-old budget phone.
Let’s talk about the "non-English" problem. Daniel mentioned multilingual versus language-specific models. I feel like for a long time, English was the only language that sounded human in TTS, and everything else sounded like a bad translation of a robot.
That was the reality for a long time. But in twenty twenty-six, the "Foundation Model" approach has changed that. Multilingual models are trained on dozens of languages simultaneously. The cool thing there is "cross-lingual transfer." If a model learns how to express excitement in English, it can often transfer that tonal quality to its Spanish or Hebrew output, even if it has less training data in those languages.
Is there a downside to the "one-size-fits-all" multilingual approach?
There is. It’s called "language interference." Sometimes a multilingual model will have a slight, ghost-like accent from its dominant training language—usually English—even when it's speaking Japanese or French. If you need absolute native-level perfection for a specific market, a language-specific model that was tuned exclusively on high-quality data from that region will usually win. But for most global apps, a robust multilingual model like those from CAMB AI is the way to go because it handles code-switching. If a user drops an English technical term in the middle of a German sentence, a multilingual model handles it gracefully. A language-specific model might just choke or pronounce the English word with a thick German phonology that sounds ridiculous.
I've heard that happen! It's like the AI is trying too hard to be local. It'll say "iPhone" but with such a heavy accent that you can barely recognize the brand name.
And that’s a huge immersion breaker. Another thing to consider is dialect. A "Spanish" model might be trained on European Spanish from Madrid, which sounds very different to a user in Mexico City or Buenos Aires. Modern models are getting better at "dialect conditioning," where you can pass a flag to tell the model, "Speak Spanish, but use a Colombian accent." It’s not perfect yet, but it’s miles ahead of where we were even two years ago.
That brings us to "Prosody." It’s one of those words that sounds like it belongs in a poetry class, but it’s actually the "secret sauce" for why some AI voices make you want to listen and others make you want to hit the mute button.
Prosody is effectively the rhythm, stress, and intonation of speech. It’s why a question sounds like a question. It’s the way your voice rises at the end of a sentence or how you pause for emphasis before a big reveal. In the old days, we tried to hard-code prosody using rules. "If there is a comma, pause for two hundred milliseconds." It sounded terrible because humans don’t follow rigid rules.
We’re messy. We speed up when we’re excited and slow down when we’re explaining something complex. We trail off when we're unsure.
Precisely. Modern generative TTS models learn prosody from the data. They look at the surrounding text to understand the intent. If a sentence ends in an exclamation point and starts with "I can't believe," the model knows to shift the pitch higher and increase the energy. This is where "Prosody Control" becomes a big differentiator for developers. Some models give you "style tags" or "emotion sliders." You can literally tell the model to be "ten percent more whispered" or "twenty percent more authoritative."
Wait, can you actually mix those? Like, could I have a "whispered" and "authoritative" voice at the same time? Like a spy giving orders in a library?
In theory, yes! With the latest "Flow-matching" architectures, the model treats these styles as vectors in a latent space. You can blend them. You can have a voice that is sixty percent "joyful" and forty percent "surprised." It allows for a level of nuance that makes the AI feel like it actually understands the subtext of what it's saying.
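The "styles as vectors" idea reduces to a weighted average in the latent space. A toy illustration, assuming the model exposes per-style embeddings; the three-dimensional vectors here are made up for demonstration:

```python
def blend_styles(styles, weights):
    """Weighted mix of style embeddings: e.g. 60% 'joyful'
    plus 40% 'surprised' yields one conditioning vector."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    dim = len(next(iter(styles.values())))
    mixed = [0.0] * dim
    for name, w in zip(styles, weights):
        for i, value in enumerate(styles[name]):
            mixed[i] += w * value
    return mixed

# Made-up 3-dimensional style embeddings for illustration.
styles = {"joyful": [1.0, 0.0, 0.5], "surprised": [0.0, 1.0, 0.5]}
vector = blend_styles(styles, [0.6, 0.4])
```

The resulting vector would then be passed to the model as its conditioning input in place of a single named style.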
I’ve seen some models where you can provide a "reference audio" purely for the prosody, not the voice. Like, "Use this voice, but say it with the cadence of this other person."
That’s the cutting edge. It’s called "Prosody Transfer." It separates the "timbre"—the actual sound of the vocal cords—from the "performance." It’s incredibly useful for things like dubbing movies. You can keep the original actor’s performance—their timing, their sighs, their emotional beats—but swap the language and the voice to a localized version. It keeps the "acting" intact while changing the "instrument."
One thing that drives me crazy with TTS is how it handles "raw" internet text. If I feed a model a tweet that has asterisks for emphasis or some weird markdown like underscores, half the models out there will literally say "asterisk word asterisk." It completely ruins the immersion. Daniel asked about input text handling—who is winning the battle against "imperfect text"?
This is a massive headache for developers. Most basic TTS models are "dumb." They take the string you give them and convert it to phonemes. If there’s an emoji or a markdown tag, they try to pronounce it. The "smart" way to handle this—and what we’re seeing in top-tier models now—is an integrated "text-normalization" front-end.
So like a mini-LLM that cleans the text before the voice engine sees it?
Or, even better, a model that was trained on "dirty" text. Some of the newer end-to-end models have seen enough markdown and informal punctuation in their training sets that they’ve learned to interpret an asterisk as "apply emphasis" rather than "say the word asterisk." But for developers using older or smaller models, you usually have to write a regex script or use an LLM to "sanitize" the input first. You have to turn "twenty-five m-p-h" into "twenty-five miles per hour" because if the TTS says "m-p-h," it sounds like a robot.
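A minimal sketch of that sanitization step. The rules here are illustrative only; a production front-end would cover far more cases (URLs, emoji, currency, units):

```python
import re

# Illustrative abbreviation table; real systems use much larger ones.
ABBREVIATIONS = {"mph": "miles per hour", "km": "kilometers"}

def sanitize_for_tts(text):
    # Strip markdown emphasis markers rather than letting the
    # engine read the word "asterisk" aloud.
    text = re.sub(r'[*_]+(.+?)[*_]+', r'\1', text)
    # Expand known abbreviations so "25 mph" is spoken in full.
    for abbr, spoken in ABBREVIATIONS.items():
        text = re.sub(r'\b' + re.escape(abbr) + r'\b', spoken, text)
    return text

clean = sanitize_for_tts("The limit is *25 mph* here")
```

An LLM-based sanitizer replaces the regex rules with a prompt, but the pipeline position is the same: clean text in, phoneme-friendly text out, before the voice engine ever sees it.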
Does that normalization handle things like dates and currency too? Because "01/02/2026" could be January 2nd or February 1st depending on where you are.
That is the ultimate test of a normalization engine. A "dumb" model will just say "zero one slash zero two." A "smart" model uses the locale context. If the system knows the user is in London, it says "the first of February." If the user is in New York, it says "January second." This is why choosing a TTS provider isn't just about the voice; it's about the "intelligence" of the text processing pipeline that sits in front of it.
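The locale-aware date case can be sketched directly. This is a simplified example covering only the two locales from the discussion; real normalization engines handle many more formats and languages:

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November",
          "December"]

def speak_date(raw, locale):
    """Expand an ambiguous numeric date using locale context:
    'en-GB' reads day/month, 'en-US' reads month/day."""
    a, b, year = (int(part) for part in raw.split("/"))
    d = date(year, b, a) if locale == "en-GB" else date(year, a, b)
    day = d.day
    if day in (11, 12, 13):
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{MONTHS[d.month - 1]} {day}{suffix}"

us = speak_date("01/02/2026", "en-US")
gb = speak_date("01/02/2026", "en-GB")
```

The same string yields "January 2nd" for a New York user and "February 1st" for a London one, which is exactly the ambiguity Herman describes.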
What about SSML? Speech Synthesis Markup Language. Is that still a thing in twenty twenty-six, or have we moved past it?
It’s still very much a thing, but its role is changing. SSML is like HTML for your voice. You use tags to say "Pause here for one second" or "Change pitch to high." Traditional providers like Amazon Polly or Google Cloud TTS rely heavily on it. If you want absolute, granular control over every syllable, SSML is your best friend. But, let’s be honest, it’s a pain to write.
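For reference, a small SSML fragment in the general style supported by providers like Amazon Polly or Google Cloud TTS; exact tag support and attribute values vary by provider:

```xml
<speak>
  And the winner is
  <break time="800ms"/>
  <prosody pitch="+15%" rate="slow">drumroll, please</prosody>
  <emphasis level="strong">the orchestra</emphasis>.
</speak>
```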
It’s tedious. No one wants to manually tag a thousand-word article. It feels like coding in 1995.
And that’s why "Plain Text Generation" with "Inference-time Conditioning" is winning. Instead of using XML tags, you just give the model a natural language instruction or a "style" parameter. You say, "Read this like a news anchor," and the model handles the prosody better than you could with manual SSML tags. SSML is becoming a "fallback" tool for when the AI gets it wrong, rather than the primary way we build voice applications. It’s like manual overrides on a self-driving car.
Let’s talk about the marathon runners of the TTS world—long-form content. If I’m trying to turn a thirty-thousand-word book into an audiobook, I can’t just shove the whole text into a single API call. Most models have a "context window" for audio, right?
They do. Usually, it’s limited by the attention mechanism in the transformer architecture. If you try to generate ten minutes of audio in one go, the model starts to "hallucinate" or lose the voice's consistency halfway through. It might start as a deep male voice and end up sounding like a chipmunk. Or it might just start repeating the same word over and over like a broken record.
So what’s the strategy? Just chop it up into sentences?
If you just chop it into sentences and stitch them together, it sounds like a "digital sandwich." There’s no flow between the sentences. The "breath" at the end of sentence one doesn't match the "start" of sentence two. The prosody feels disjointed. The professional way to do it is "Overlapping Chunking with Context."
Break that down for me. How does that work in practice?
You send, say, three sentences to the model, but you also include the "trailing text" from the previous chunk as a reference. This tells the model, "This is how the last sentence ended, so make sure this new sentence starts with the same energy and tone." Then, you use an "audio crossfade" to blend the chunks together seamlessly. Some modern long-form APIs, like the ones from ElevenLabs or Deepgram, handle this "stitching" internally, but if you’re building your own pipeline with open-source models, you have to be very careful about your concatenation logic. If you mess up the "silence" between chunks, the listener’s brain will immediately flag it as "fake."
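The crossfade half of that stitching logic can be shown with a toy sketch, using plain lists in place of real audio arrays; a real pipeline would also pass the trailing-text context in the generation request:

```python
def crossfade_concat(chunks, overlap):
    """Stitch audio chunks with a linear crossfade over
    `overlap` samples, instead of a hard cut between them."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        tail, head = out[-overlap:], chunk[:overlap]
        for i in range(overlap):
            t = (i + 1) / (overlap + 1)  # fade-in weight for new chunk
            out[-overlap + i] = (1 - t) * tail[i] + t * head[i]
        out.extend(chunk[overlap:])
    return out

# Two fake "audio" chunks with a 2-sample overlap region.
audio = crossfade_concat([[1.0, 1.0, 1.0, 1.0],
                          [0.0, 0.0, 2.0, 2.0]], overlap=2)
```

The overlap region blends smoothly from the old chunk's level toward the new one, which is what keeps the listener's brain from flagging the seam.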
Is there a limit to how long that context can be? Like, can the model remember how it pronounced a character's name three chapters ago?
Usually, no. That’s a different problem called "Global Consistency." If you have a character named "Siobhan," and the model pronounces it correctly in chapter one but then switches to a phonetic "Si-ob-han" in chapter ten, your audiobook is ruined. To solve that, developers use "Phoneme Lexicons." You basically create a dictionary that tells the model: "Whenever you see this word, use these specific phonetic sounds." It’s the only way to ensure one hundred percent consistency over long durations.
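In code, a phoneme lexicon is little more than a substitution pass before synthesis. A sketch with a hypothetical lexicon entry; how the phoneme string is actually consumed (a phoneme tag, a pronunciation dictionary upload) depends on the engine:

```python
import re

# Hypothetical lexicon: word -> IPA-style pronunciation the
# engine should use every single time, for global consistency.
LEXICON = {"Siobhan": "ʃɪˈvɔːn"}

def apply_lexicon(text, lexicon):
    """Replace lexicon entries with explicit phoneme strings,
    bracketed here as a placeholder for engine-specific markup."""
    for word, phonemes in lexicon.items():
        text = re.sub(r'\b' + re.escape(word) + r'\b',
                      f'[{phonemes}]', text)
    return text

line = apply_lexicon("Siobhan smiled.", LEXICON)
```

Because the substitution runs on every chunk of the book, chapter ten gets exactly the same pronunciation as chapter one.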
And then there’s the big one. The topic that gets all the headlines and the occasional lawsuit—Voice Cloning. Daniel wants to know about "single-shot inference" versus "fine-tuning." I remember when you needed hours of high-quality studio recording to clone a voice. Now it feels like I can do it with a five-second voice memo.
The progress there is staggering. "Single-shot" or "Zero-shot" cloning is what you see in most consumer apps today. You give the model a three-second clip, and it extracts a "vocal embedding"—a mathematical representation of that person’s unique voice characteristics. It then uses that embedding to "steer" the generation. The pro is that it’s instant and incredibly easy. The con is that it’s often a "shallow" clone. It gets the pitch and the general tone right, but it might miss the subtle "idiosyncrasies"—the specific way a person pronounces their "R"s or the unique rasp in their voice.
So that’s where "Fine-tuning" comes in. That’s the "deep dive."
With fine-tuning, you’re actually updating the weights of a small part of the model using thirty minutes or an hour of that person's specific audio. This creates a much more robust clone. It captures the "soul" of the voice—the emotional range, the unique vocal fry, the specific regional inflections. For a brand's "official" voice or a digital twin of a celebrity, you always go with fine-tuning. But for a "throwaway" use case, like a personalized greeting in a video game, zero-shot is more than enough.
What’s the compute trade-off there? I assume fine-tuning is expensive.
It’s expensive up front because you have to run a training job. But once the model is fine-tuned, the "inference cost"—the cost to actually generate speech—is usually the same as a standard model. The real "cost" is the data preparation. You need clean, transcribed audio. If your training data has background noise or someone else talking, the model will actually learn to "clone the noise" too. I’ve seen clones where the AI voice consistently makes a "keyboard clicking" sound because the original source audio was recorded near a laptop.
That’s hilarious and also kind of terrifying. It’s like the AI is a perfect mimic, even of the flaws. Does it pick up things like "ums" and "ahs" too?
If they are in the training data, absolutely. Some people actually want that! If you're building a highly realistic conversational AI, you want it to say "um" occasionally. It makes it feel more human and less like a scripted broadcast. But if you're trying to create a clean corporate narrator, you have to scrub all those verbal tics out of your training set or the AI will replicate them faithfully.
That brings up an interesting point about "Style Overlays." In twenty twenty-six, we’re seeing a move toward "Hybrid Cloning." You take a zero-shot clone of a user’s voice, but you overlay it on a "professional narrator" model. So you get the user’s "sound," but with the professional’s "acting ability." It solves the problem of "boring" clones. Most people aren't professional voice actors; if you clone their voice exactly, the output sounds as flat and boring as they do in real life. Hybrid models make us all sound like we’re reading for the Oscars.
I could use that. My "morning voice" is not something anyone wants to hear for an entire audiobook.
You and me both, Corn. But look, if you’re a developer listening to this, the "Golden Rule" for twenty twenty-six is: Don’t over-engineer for quality you don’t need. If you’re building a notification system for a factory floor, you don’t need a forty-eight kilohertz, fine-tuned, generative model with emotional prosody. You need a small, robust, low-latency model like Piper that can run on a Raspberry Pi and survive a lost internet connection.
But if you’re building a companion AI or a high-end storytelling app, cutting corners on the TTS is the fastest way to lose your audience. People have a very low tolerance for "Uncanny Valley" voices. As soon as the voice feels "wrong," the "suspension of disbelief" is gone, and they stop engaging with the content.
That’s the "Turing Trap" of voice. The closer you get to perfection, the more jarring the remaining flaws become. If a voice is obviously a robot, we forgive it. If it sounds ninety-nine percent human but mispronounces one word with a weird metallic glitch, it's creepy.
One area we haven't touched on much is the "Input Processing" for non-textual cues. Like, how do models handle "laughter" or "sighing" in twenty twenty-six? I’ve seen some demos where the AI actually chuckles mid-sentence.
That’s the "Non-Verbal Communication" frontier. This is largely handled through "Tokenization." Just like LLMs have tokens for words, advanced TTS models now have "vocal tokens" for things like breaths, laughs, clears-throat, or even "hesitation sounds" like "uh" and "um." If you’re using a model like Chatterbox, you can actually insert a tag for a "short laugh" and the model will synthesize it naturally into the vocal stream. It’s not just playing a sound effect; it’s generating the laugh using the same vocal cords as the speech.
That’s a huge deal for "naturalness." Because a real human doesn't just stop talking, play a "laugh dot w-a-v" file, and then start talking again. The laugh bleeds into the words. Your voice is "shaky" for a second after you laugh.
That "co-articulation" between speech and non-speech sounds is what makes the top-tier models like ElevenLabs feel so alive. They understand that a sigh isn't just a sound; it changes the "airiness" of the next three words. For developers, the takeaway is to look for models that support "Phonetic and Non-Verbal Tokens" if you’re doing anything conversational.
So, looking at the practical side for a second. If someone is starting a project today, what's the "stack" you recommend? Because you've mentioned a lot of names.
It depends on your "Budget-to-Quality" ratio. If you have "Infinite Budget" and want the "Best Quality," you go with ElevenLabs. Their API is the industry standard for a reason—the prosody is unmatched, and the voice library is massive. If you’re "Privacy-Conscious" or "Cost-Sensitive," you look at the open-source world. Kokoro is the "darling" of the community right now because it's so lightweight and sounds great. If you need "Massive Scale" with "Low Latency," something like Deepgram’s Aura is built specifically for that—it sacrifices some warmth for extreme speed.
And don't forget the "Middleware." There are services now that act as an "Aggregator" for TTS. You write your code once, and you can swap between twenty different providers with a single config change.
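The aggregator pattern boils down to a thin common interface over interchangeable backends. These classes are stubs standing in for real provider SDKs, purely to show the shape of the abstraction:

```python
class PiperBackend:
    """Stub for a local, on-device engine (names illustrative)."""
    def synthesize(self, text):
        return f"piper-audio({text})"

class CloudBackend:
    """Stub for a hosted API provider (names illustrative)."""
    def synthesize(self, text):
        return f"cloud-audio({text})"

# One registry: swapping providers becomes a config change,
# not an application rewrite.
BACKENDS = {"piper": PiperBackend(), "cloud": CloudBackend()}

def speak(text, provider="piper"):
    return BACKENDS[provider].synthesize(text)

local_out = speak("Hello", provider="piper")
cloud_out = speak("Hello", provider="cloud")
```

Application code only ever calls `speak`, so A/B testing two voices, or migrating off a provider entirely, is a one-line change to the registry or the config value.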
That’s a smart move for "future-proofing." The king of the hill today might be obsolete in six months. We saw it with the jump from the old concatenative models to neural models, and now we’re seeing it with the jump from "Auto-regressive" to "Flow-matching" architectures. Being able to swap your "Voice Engine" without rewriting your entire application is a massive competitive advantage. It also lets you A/B test different voices with real users to see which one converts better.
We’ve covered a lot of ground—from the "physics" of the audio like sample rates and latency to the "psychology" of it like prosody and emotional weight. It feels like the "Voice AI" world is finally catching up to the "Text AI" world. We’re moving from "making it work" to "making it beautiful."
It’s an exciting time to be an ear, Corn. We’re finally reaching the point where the "Digital Assistant" doesn't sound like a piece of software. It sounds like a person. And that changes everything about how we interact with technology. It becomes less of a "tool" and more of a "presence." Think about how much more likely you are to follow advice from a voice that sounds empathetic versus one that sounds like a calculator.
Hopefully a presence that knows how to handle a "m-p-h" abbreviation without sounding like it’s having a stroke.
One can only hope. But with the right model selection, that’s a problem we can finally leave in the past. We're getting to the point where the AI can infer the meaning of ambiguous abbreviations just by looking at the context of the sentence, which is a huge relief for developers everywhere.
Well, I think that’s a pretty solid roadmap for anyone trying to navigate the "Voice Jungle." Huge thanks to Daniel for the prompt—this is one of those topics that changes so fast that you really have to check the "state of the art" every few weeks.
It’s a full-time job just keeping up with the GitHub repos. But that’s what makes it fun. Every Tuesday there’s a new paper that claims to have solved some impossible latency problem or added a new layer of vocal realism.
If you’re out there building something with these tools, we’d love to hear about it. What are you using? What’s the "edge case" that’s driving you crazy? Are you struggling with technical pronunciations or trying to get the perfect emotional "sigh"? Drop us a line.
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power the generation of this show. Without that serverless compute, we’d just be two animals talking to ourselves in the dark.
Which, to be fair, we basically are anyway. This has been My Weird Prompts. If you’re enjoying the deep dives, a quick review on your podcast app really does help us reach more people who care about this weird intersection of tech and humanity. It helps the algorithm realize we aren't just robots ourselves.
Find us at myweirdprompts dot com for the full archive and all the ways to subscribe. We've got links to all the models we mentioned today in the show notes.
See ya.
Goodbye.