#2618: Fixing Acronyms in TTS Pipelines

How to handle acronyms in text-to-speech pipelines using BERT models, lexicons, and layered preprocessing.

Episode Details

Episode ID: MWP-2777
Duration: 37:44
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Acronym Problem in Text-to-Speech Pipelines

Text-to-speech engines are surprisingly bad at handling acronyms. When a script generator capitalizes a word for emphasis — like "WHAT!" — the TTS engine might try to spell it out letter by letter. This is the core challenge of text normalization for speech synthesis: raw text is full of ambiguities that need to be resolved before audio can be generated.

The Pipeline Problem

The typical TTS pipeline starts with a language model generating script text, then passes that text to a speech synthesis engine. The problem is that language models format things inconsistently — they capitalize for emphasis, use all-caps for acronyms, and sometimes do both. The downstream TTS engine has no way to distinguish between "FBI" (an acronym) and "WHAT" (emphasis) without additional context.

Approaches to Acronym Detection

There are several ways to solve this, each with different tradeoffs:

Rule-based regex: The simplest approach. Write patterns that detect consecutive capital letters, check sentence position, and apply transformations. This works for obvious cases but breaks on edge cases — like single-letter words or emphasis capitalization.
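
A minimal sketch of that rule-based pass is below; the stop list, length bounds, and spacing transformation are illustrative choices, not a fixed recipe:

```python
import re

# Stop list of common all-caps words that are emphasis, not acronyms
# (illustrative, not exhaustive).
EMPHASIS_WORDS = {"I", "A", "OK", "WHAT", "NO", "WOW"}
CANDIDATE_RE = re.compile(r"\b[A-Z]{2,5}\b")

def space_out_acronyms(text: str) -> str:
    def replace(match: re.Match) -> str:
        word = match.group(0)
        if word in EMPHASIS_WORDS:
            return word               # leave emphasis capitalization untouched
        return " ".join(word)         # "FDA" -> "F D A"
    return CANDIDATE_RE.sub(replace, text)

print(space_out_acronyms("The FDA said WHAT about the EU?"))
# -> "The F D A said WHAT about the E U?"
```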

BERT-based classifiers: Small fine-tuned models (~50MB) that look at context to determine whether an all-caps sequence is an acronym or emphasis. Fine-tuned BERT-base models have been reported to reach roughly 98% accuracy on acronym detection across multiple domains by recognizing syntactic patterns such as "the" before an acronym or a parenthetical definition after first use.
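
A rough sketch of calling such a classifier through a Hugging Face text-classification pipeline; the checkpoint name, label scheme, and input format are placeholders for whatever model is actually fine-tuned or downloaded:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint and label names; swap in whatever
# acronym-vs-emphasis classifier you actually train or host.
classifier = pipeline("text-classification", model="your-org/acronym-vs-emphasis")

def is_acronym(candidate: str, sentence: str) -> bool:
    # Feed the candidate with its surrounding sentence so the model sees context.
    prediction = classifier(f"{sentence} [SEP] {candidate}")[0]
    return prediction["label"] == "ACRONYM"

# is_acronym("FDA", "The FDA issued new guidance today.")  # expected: True
```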

Pronunciation lexicons: Maintain a lookup table of known acronyms with their phonetic spellings. When the pipeline sees "FDA," it tells the TTS engine to pronounce it as "ef dee ay." This is deterministic but requires ongoing maintenance.
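
In code, the lexicon can be as simple as a dictionary keyed by the raw acronym, applied before synthesis; the entries below are illustrative:

```python
import re

# A curated pronunciation lexicon; entries are illustrative. Initialisms get
# letter-by-letter spellings, word-style acronyms get word-like spellings.
LEXICON = {
    "FDA": "ef dee ay",
    "FBI": "ef bee eye",
    "NATO": "nay toh",
    "EU": "ee you",
}

def apply_lexicon(text: str) -> str:
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return LEXICON.get(word, word)   # unknown acronyms pass through unchanged
    return re.sub(r"\b[A-Z]{2,6}\b", replace, text)

print(apply_lexicon("The FDA and NATO issued a joint statement."))
# -> "The ef dee ay and nay toh issued a joint statement."
```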

The Layered Approach

The most robust solution combines multiple methods in a preprocessing stack:

  1. A BERT-based acronym detector labels each candidate sequence
  2. A rule-based fallback handles uncertain cases
  3. Transformations are applied based on labels
  4. A pronunciation lexicon serves as the backstop for high-frequency acronyms

This layered approach lets the ML model handle the bulk of straightforward cases, covers the most frequent acronyms with the lexicon, and logs edge cases for human review.
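
A compact sketch of how the four layers could compose, assuming a caller-supplied classify(word, context) function that returns a label and a confidence score; all names, labels, and thresholds here are illustrative:

```python
import logging
import re

log = logging.getLogger("acronym_preprocessor")

def normalize_acronyms(text, classify, lexicon, threshold=0.9):
    """Layered pass: ML detector, rule fallback, transformation, lexicon backstop.

    classify(word, context) is assumed to return a (label, score) pair such as
    ("ACRONYM", 0.97); lexicon maps acronyms to phonetic spellings.
    """
    def replace(match):
        word = match.group(0)
        label, score = classify(word, text)          # 1. ML detector labels the candidate
        if score < threshold:                        # 2. rule-based fallback when uncertain
            label = "EMPHASIS" if word in {"OK", "WHAT", "NO"} else "ACRONYM"
            log.info("low confidence on %r, fell back to %s", word, label)
        if label != "ACRONYM":
            return word                              # emphasis caps left as-is
        if word in lexicon:                          # 4. lexicon backstop for frequent acronyms
            return lexicon[word]
        return " ".join(word)                        # 3. default transformation: space the letters
    return re.sub(r"\b[A-Z]{2,6}\b", replace, text)
```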

The Industry Direction

ElevenLabs has pioneered speech-specific markup — a simplified format that lets content creators tag emphasis, acronyms, and pauses. But the industry remains fragmented, with different TTS engines supporting different markup conventions. The preprocessing approach (keeping scripts in plain text and applying engine-specific transformations downstream) offers the most portability across TTS engines.

Key Takeaways

  • Acronym handling is a text normalization problem that requires understanding context
  • BERT-based classifiers are surprisingly effective for this constrained task
  • A layered approach (ML + rules + lexicon) is more robust than any single method
  • Prompt engineering at the script generation level can help but isn't reliable enough for automated pipelines
  • The preprocessing approach offers more flexibility than vendor-specific markup

The fundamental tension is between deterministic systems (rules, lexicons) that are reliable but brittle, and probabilistic systems (ML models) that are flexible but occasionally wrong. For production pipelines without human review, the layered approach offers the best balance of accuracy and maintainability.


#2618: Fixing Acronyms in TTS Pipelines

Corn
Daniel sent us this one, and it's a bit different from our usual. He's not just asking a question — he's pulled back the curtain on how the sausage gets made and he's got a technical problem he wants to talk through. Specifically, the podcast has been running into acronym handling issues with Chatterbox, our text-to-speech engine. Things like F.D.A. coming out robotic, or the engine trying to pronounce all-caps words as words instead of letters. And he's asking, if you're building a TTS pipeline, how do you actually solve this deterministically? Do you bolt on specialized models as sidecars, or is there a cleaner approach?
Herman
This is one of those problems that sounds trivial until you actually sit down and try to solve it. And Daniel's right — you can't just say if it's all caps, it's an acronym. I mean, someone writes what with an exclamation mark, the model capitalizes it for emphasis, and suddenly your pipeline is trying to spell it out letter by letter. W-H-A-T. Which is nonsense.
Corn
Fun fact — DeepSeek V four Pro is writing our script today. Which is fitting, because we're about to get very deep into the weeds on a problem that script-writing models themselves create.
Herman
Oh, that's actually perfect. Because the script generator is the source of the ambiguity. It capitalizes things for emphasis, it formats acronyms inconsistently, and then the TTS engine downstream has to guess what was intended. So the problem starts before you even get to pronunciation.
Corn
Let's lay out the actual pipeline Daniel described, because I think the specifics matter here. The script comes from DeepSeek, then it goes to Chatterbox for text-to-speech. Chatterbox is open source — Daniel really likes it for the prosody and the overall quality — but it doesn't have built-in acronym handling the way something like ElevenLabs does with their speech-specific markup.
Herman
And ElevenLabs has actually been the leader here. They introduced a markup schema where content creators can tag things explicitly — emphasis, sarcasm, acronyms, pauses. It's almost like writing stage directions in a script. But that only works if the TTS engine can parse those tags. Chatterbox, out of the box, doesn't have that. So you're left solving it yourself.
Corn
Let me reframe the actual engineering question here, because I think it's more interesting than just acronyms. What Daniel is really describing is the general problem of text normalization for speech synthesis. You've got raw text coming from a language model, and it has all these ambiguities — abbreviations, numbers, symbols, acronyms, emphasis capitalization. The TTS engine needs to turn that into a phonetic representation, but it has to first figure out what each token actually means.
Herman
Text normalization is one of those areas where rule-based systems and machine learning models have been fighting it out for years. The classic approach was always hand-crafted regex rules. You write a pattern that says, if you see two or more consecutive capital letters, and it's not at the start of a sentence, and it's not the word I or A, then treat it as an acronym and insert periods or spaces between the letters.
Corn
That works until it doesn't. Which happens fast.
Herman
Daniel mentioned BERT models specifically for acronym detection. There are these tiny fine-tuned models — we're talking maybe fifty megabytes — that have been trained on labeled data to distinguish between all-caps for emphasis and genuine acronyms. They look at context. The word what followed by an exclamation mark in all caps? Probably not an acronym. A sequence like F-D-A in the middle of a medical discussion? Almost certainly an acronym.
Corn
The sidecar approach Daniel mentioned is actually a pretty common pattern in production TTS pipelines. You have your main model doing the heavy lifting — generating the audio — but before the text even reaches it, you run it through a series of small, specialized preprocessing models. One for acronym detection, one for number normalization, one for handling dates and times, maybe one for currency.
Herman
I've seen pipelines where they use a BERT-based sequence tagger that labels every token as either acronym, emphasis-caps, or regular text. And then a downstream script applies the appropriate transformation. For acronyms, it inserts spaces or periods between letters. For emphasis caps, it might add an S S M L emphasis tag if the engine supports it, or just leave it as-is since the capitalization itself doesn't actually hurt pronunciation.
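
One way that label-to-transformation step could look, assuming token/label pairs from a tagger and a downstream engine that accepts SSML emphasis tags (a sketch, not the show's actual pipeline):

```python
# tokens is assumed to be a list of (word, label) pairs from the tagger,
# e.g. [("The", "O"), ("FDA", "ACRONYM"), ("WHAT", "EMPHASIS")].
def apply_labels(tokens, engine_supports_ssml=True):
    out = []
    for word, label in tokens:
        if label == "ACRONYM":
            out.append(" ".join(word))                            # FDA -> F D A
        elif label == "EMPHASIS" and engine_supports_ssml:
            out.append(f'<emphasis level="strong">{word}</emphasis>')
        else:
            out.append(word)                                      # leave as-is
    return " ".join(out)
```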
Corn
Here's where Daniel's specific problem gets thornier. He mentioned that when they tried spacing the letters apart — so F D A instead of FDA — Herman's voice, meaning your voice model, reads it in a robotic way. Each letter pronounced individually with unnatural pauses.
Herman
That's a Chatterbox-specific issue, or really a voice-model-specific issue. Some TTS engines handle spaced letters naturally. Others need the letters to flow together more. The challenge is that the spacing approach is deterministic but sounds bad, and the dot approach — F period D period A period — might cause the engine to literally say the word period, which is even worse.
Corn
Let's talk about what actually does work. I've been looking into this, and there are a few approaches that production systems use. The cleanest one, if your TTS engine supports it, is to use the international phonetic alphabet or some kind of pronunciation hinting. You don't mess with the text at all — you pass a separate pronunciation dictionary alongside it.
Herman
Oh, this is the lexicon approach. You maintain a lookup table of known acronyms and their phonetic spellings. So when the pipeline sees F D A, it looks it up and tells the TTS engine, pronounce this as ef dee ay. The text on the page stays clean, but the pronunciation is explicit.
Corn
The beautiful thing about this is it's deterministic. Once you've defined the pronunciation for an acronym, it will always be read correctly. No guessing, no context-dependence, no fragile regex patterns.
Herman
The downside is maintenance. You need to build and update that lexicon. New acronyms appear all the time. Domain-specific ones, like medical acronyms or tech acronyms, can be completely different from what a general-purpose lexicon would cover. But for a podcast like ours, where we cover a finite set of topics and the acronyms tend to recur, a curated lexicon might actually be the most reliable approach.
Corn
There's another angle here that I think is worth pulling on. Daniel mentioned that the script-writing model itself could be part of the solution. Instead of fixing acronyms after the fact, you teach the model to format them correctly in the first place.
Herman
Right — prompt engineering at the script generation level. You tell the model, when you write an acronym, always format it with periods between letters. Or always format it with spaces. Or use a specific markup token. The model is smart enough to follow that instruction consistently, and then your downstream processing becomes trivial.
Corn
We've actually tried this, haven't we? The challenge is that the model sometimes forgets. You'll get ninety-five percent compliance, and then one acronym slips through unformatted, and your pipeline breaks. Or the model formats something that isn't an acronym because it's being cautious.
Herman
That's the fundamental tension with relying on the language model for formatting. Language models are probabilistic, not deterministic. You can get them to follow instructions most of the time, but most is not all. And in an automated pipeline where episodes are being generated without human review, a single failure creates a jarring listening experience.
Corn
Which brings me back to the sidecar approach Daniel was asking about. I think the most robust solution for a pipeline like this is actually a layered preprocessing stack. First, you run the text through a specialized acronym detection model — one of those small BERT classifiers. It labels each candidate sequence. Then you have a rule-based fallback for anything the model is uncertain about. Then you apply transformations based on the labels. And finally, you have a pronunciation lexicon as the ultimate backstop for high-frequency acronyms.
Herman
Here's the thing about BERT models for this task — they're shockingly good at it. Because acronym detection is actually a pretty constrained problem. The model doesn't need to understand the entire text. It just needs to look at a short sequence, check whether it's all caps, and then look at the surrounding context to determine if it's an acronym or emphasis. Context like nearby lowercase words, sentence position, punctuation patterns — these are strong signals.
Corn
There was a paper from a group at Carnegie Mellon a few years back that showed a fine-tuned BERT-base model achieving something like ninety-eight percent accuracy on acronym detection across multiple domains. The key insight was that the model learned to recognize the syntactic patterns around acronyms — things like the word the before an acronym, or parenthetical definitions after the first use.
Herman
Ninety-eight percent is high, but in a production pipeline, two errors per hundred acronyms is still a lot if you're generating thousands of episodes. That's why you need the layered approach. The BERT model handles the easy cases. The lexicon handles the frequent cases. And for the edge cases, you log them and have a human review queue.
Corn
Let me pull on a thread Daniel mentioned that I think is actually the deeper question here. He talked about ElevenLabs and their speech-specific markup as an emerging standard. And I think he's pointing at something real — the industry is slowly converging on the idea that text-to-speech needs a richer input format than plain text.
Herman
S S M L has been around forever — Speech Synthesis Markup Language. It's an XML-based standard from the early two thousands. But it's verbose and clunky, and most content creators don't want to write XML tags by hand. What ElevenLabs did was create a simplified version — almost like markdown for speech. You put asterisks around a word for emphasis, or you use a specific token for acronyms. It's human-readable and machine-parseable.
Corn
The key word there is standard. The problem right now is fragmentation. ElevenLabs has their format. Chatterbox might not support it. Other engines have their own conventions. If you're building a pipeline that might switch TTS engines in the future, you don't want to lock yourself into one vendor's markup.
Herman
Which is why I think the preprocessing approach Daniel is considering — the sidecar models — is actually the more portable solution. You keep the script in plain text or minimal markup. Then, depending on which TTS engine you're using downstream, you apply the appropriate transformations. If you switch from Chatterbox to something else, you only change the transformation layer, not the entire pipeline.
Corn
There's a practical implementation detail here that's worth spelling out. If you're going to use a BERT-based acronym detector, where does it sit in the pipeline? The cleanest architecture I've seen is to run it as a microservice — a small API endpoint that takes text in and returns labeled text out. The main pipeline calls it after script generation but before TTS synthesis.
Herman
The latency on these small BERT models is negligible. We're talking maybe fifty milliseconds for a full episode script. It's not going to add any meaningful delay to the generation process. The bigger challenge is deployment — you need to host the model somewhere, keep it running, monitor it. For a passion project like ours, that's real operational overhead.
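
A minimal sketch of that kind of sidecar service, assuming FastAPI and a hypothetical token-classification checkpoint:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Placeholder checkpoint name; use whatever acronym tagger you actually host.
tagger = pipeline("token-classification",
                  model="your-org/acronym-tagger",
                  aggregation_strategy="simple")

class Script(BaseModel):
    text: str

@app.post("/label")
def label_script(script: Script):
    # Returns labeled spans for the caller to transform before TTS synthesis.
    return {"entities": tagger(script.text)}
```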
Corn
Daniel's probably weighing exactly that trade-off. The regex approach is simple and doesn't require any infrastructure. It's just a few lines of code in the preprocessing script. But it's fragile. The BERT approach is robust but requires hosting a model. The lexicon approach is deterministic but requires ongoing curation.
Herman
I think there's a hybrid that might work well for our specific use case. The acronyms that appear in our episodes are actually pretty predictable. We talk about A I, F D A, N A T O, C I A, U N, E U — these come up over and over. A curated lexicon of maybe fifty to a hundred high-frequency acronyms would cover ninety percent of the cases. Then for the rest, you have a simple heuristic — if it's a sequence of two to five capital letters, and it's not at the start of a sentence, and it's not a common all-caps word like I or A or O K, treat it as an acronym and insert periods.
Corn
You'd still get edge cases. What about acronyms that are also words? Like W H O for the World Health Organization — the model might try to pronounce it as who. Or acronyms that people sometimes pronounce as words, like N A T O versus NATO.
Herman
That's actually a deeper distinction — initialisms versus acronyms. An initialism is pronounced letter by letter, like F B I or C I A. An acronym is pronounced as a word, like N A T O or U N E S C O. And some can go either way depending on context. A good preprocessing pipeline needs to handle both cases.
Corn
The lexicon approach handles this cleanly. You just specify the pronunciation for each entry. F B I is ef bee eye. N A T O is nay-toe. The model doesn't have to guess — it just looks it up.
Herman
For the BERT-based approach, you'd actually want a slightly different model — one that not only detects acronyms but also predicts whether they should be expanded or pronounced as words. That's a harder problem, but there are models that do it.
Corn
Let me bring this back to Daniel's actual question, because I think we've been circling it. What can you do if you're building a TTS pipeline and you want deterministic, reliable output? I think the answer is, you can't get perfect determinism from a single approach. You need layers. But you can get close enough that errors become rare enough to be acceptable.
Herman
Acceptable is the key word. For a podcast like ours, an occasional mispronounced acronym is noticeable but not catastrophic. Listeners like Dr. Schneiderman might pick up on it, but most people just hear it and move on. The question is whether the engineering effort to fix those edge cases is worth it.
Corn
Daniel seems to think it is, and I get why. He's been building this podcast as an experimental platform — a graph-backed knowledge base where the episodes connect to each other through topics. The production quality matters because the content is meant to have long-term value. An episode about sleep medicine from last week should still sound good when someone discovers it through the graph two years from now.
Herman
That's the real insight here. Most podcasts are disposable — you listen once and forget. But Daniel's vision for this show is different. The episodes are nodes in a knowledge graph. They're meant to be discovered, revisited, connected. That raises the bar for production quality. Mispronounced acronyms are like typos in a book — they undermine the credibility of the whole thing.
Corn
Let's get concrete. If I were building this pipeline tomorrow, here's what I'd do. Step one — add an acronym detection layer using a small fine-tuned BERT model. Host it as a lightweight microservice. Step two — maintain a curated pronunciation lexicon for the top hundred or so acronyms that appear in our content. Step three — for anything the BERT model flags with low confidence, fall back to the lexicon. Step four — for anything not in the lexicon, apply a deterministic rule, spacing out the letters with a very short pause marker if Chatterbox supports it, or periods if it doesn't.
Herman
Step five — log every acronym transformation so you can spot-check the output and improve the lexicon over time. That's the piece that takes this from a one-time fix to a continuously improving system. Every time an acronym slips through, you add it to the lexicon. Within a few months, you've covered virtually everything.
Corn
There's also a step zero here that I think is worth mentioning. Before you build any of this, you need to instrument your pipeline so you can actually see where the failures are happening. Daniel mentioned that Dr. Schneiderman pointed out the acronym issue — but how many other episodes had the same problem that nobody flagged? You need logging, monitoring, maybe even automated quality checks that scan the generated audio for common mispronunciations.
Herman
Automated quality checking for TTS output is a whole other rabbit hole. You'd essentially be running a speech recognition model on the generated audio, comparing the recognized text to the original script, and flagging discrepancies. If the script says F D A and the recognizer hears fda or fuh-duh, you know something went wrong.
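
A rough sketch of that round-trip check, assuming the open-source whisper package and a simple similarity threshold; the model size, file paths, and threshold are placeholders:

```python
import difflib

import whisper  # the open-source openai-whisper package

asr_model = whisper.load_model("base")

def audio_matches_script(audio_path: str, script_text: str, min_ratio: float = 0.9) -> bool:
    # Transcribe the rendered audio and compare it with the original script.
    recognized = asr_model.transcribe(audio_path)["text"]
    ratio = difflib.SequenceMatcher(None, recognized.lower(), script_text.lower()).ratio()
    return ratio >= min_ratio

# Example (paths are placeholders):
# audio_matches_script("episode_2618.mp3", open("episode_2618_script.txt").read())
```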
Corn
That's clever but expensive in terms of compute. For a passion project, probably overkill. I think the logging approach makes more sense — just log every acronym the pipeline encounters and how it was transformed, and review a sample periodically.
Herman
Let me circle back to something Daniel said about ElevenLabs and their markup approach, because I think there's an important industry trend here. The big TTS providers are all moving toward richer input formats. Google Cloud Text-to-Speech supports S S M L. Amazon Polly supports S S M L. ElevenLabs has their own simplified markup. Microsoft Azure has their own variant. The direction of travel is clear — plain text is not enough for high-quality speech synthesis.
Corn
The fragmentation is real, and it's not going away. Every provider has their own slightly different flavor of markup. If you're building a pipeline that might switch providers, you need an abstraction layer — some kind of intermediate representation that you can translate into whatever the current TTS engine expects.
Herman
This is actually where the open-source TTS engines like Chatterbox have an interesting advantage. Because the code is open, you can modify the text processing pipeline directly. You don't need to work around a black-box API. You can add custom preprocessing steps, custom pronunciation rules, custom everything. The trade-off is that you have to do the work yourself rather than relying on the provider.
Corn
That's exactly the trade-off Daniel described. He said using open-source AI is like building a car yourself. ElevenLabs is the car that comes fully assembled with all the features. Chatterbox is the kit car — you get more control and lower cost, but you have to solve problems like acronym handling yourself.
Herman
There's a middle ground that's emerging, by the way. Some open-source TTS engines are starting to adopt community-standard markup formats. Coqui TTS, before they shut down, had been working on S S M L support. Piper TTS has some basic SSML capabilities. The ecosystem is slowly converging.
Corn
Let's talk about the specific BERT models Daniel mentioned, because I think some listeners might be curious about what's actually available. Hugging Face has a handful of acronym detection models. There's one called acronym-identifier that's a fine-tuned DistilBERT — tiny, fast, decent accuracy. There are a few others trained on scientific text that handle domain-specific acronyms well.
Herman
The scientific domain models are particularly interesting for our use case, because a lot of the acronyms we deal with are technical or medical. These models have been trained on papers from PubMed and arXiv, so they've seen C T scan and M R I and P C R and all the rest. They're much better at distinguishing technical acronyms from emphasis capitalization than a general-purpose model would be.
Corn
The other approach Daniel didn't mention but might be worth considering is using the language model itself for preprocessing. Instead of a separate BERT classifier, you could prompt the same model that generates the script to also annotate it. Something like, here's the script, now go through and mark every acronym with special tokens.
Herman
That's clever but it doubles the inference cost. You're running the full language model twice — once for generation, once for annotation. And language models are much more expensive to run than tiny BERT classifiers. For a production pipeline where cost matters, the sidecar approach is almost certainly cheaper.
Corn
Unless you're already running the language model for other post-processing steps. If you're already doing fact-checking or style review with the same model, adding acronym annotation is marginal. It depends on the pipeline architecture.
Herman
I think the broader lesson here — and this is something Daniel clearly understands — is that building a production TTS pipeline is about solving a long tail of small problems. Acronyms are one. Numbers are another — should one thousand two hundred be read as twelve hundred or one thousand two hundred? Dates are another — is zero three slash zero five March fifth or May third? Abbreviations, units, symbols, special characters. Each one seems trivial on its own, but together they create a lot of friction.
Corn
The reason text normalization for TTS is hard is that it requires understanding. You can't just write regex patterns. You need to know that St. can be street or saint depending on context. You need to know that Dr. is usually doctor but sometimes drive. You need to know that twenty twenty can be the year or a number. These are semantic distinctions, not syntactic ones.
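
As a small illustration of that long tail, even plain number expansion needs a policy decision; the sketch below assumes the num2words package and a crude rule that treats four-digit values in a certain range as years:

```python
from num2words import num2words

def expand_number(token: str) -> str:
    n = int(token)
    if 1900 <= n <= 2099:                  # crude assumption: values in this range are years
        return num2words(n, to="year")     # 2020 -> "twenty twenty"
    return num2words(n)                    # 1200 -> "one thousand, two hundred"

print(expand_number("2020"), "|", expand_number("1200"))
```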
Herman
Which is why the industry has been moving toward end-to-end TTS models that take raw text as input and learn to handle normalization implicitly. Models like Tortoise TTS or Bark or the newer diffusion-based systems — they're trained on vast amounts of audio-text pairs, and they learn to handle many of these normalization issues automatically. They're not perfect, but they're getting better.
Corn
Chatterbox, if I understand correctly, is not an end-to-end model in that sense. It still has a separate text processing frontend and a waveform generation backend. So the normalization has to happen explicitly. That's why Daniel is running into these issues.
Herman
That's not a knock on Chatterbox — most production TTS systems still use a pipeline architecture rather than end-to-end. Pipeline systems give you more control and are easier to debug. End-to-end systems sound more natural but are harder to steer. It's a trade-off.
Corn
Let me try to summarize what I think the actionable advice is here, because Daniel asked a very practical question. If you're building a TTS pipeline and hitting acronym issues, the most robust approach is a layered preprocessing stack. Start with a curated pronunciation lexicon for your most frequent acronyms. Add a small BERT-based acronym detector for everything else. Use rule-based fallbacks for edge cases. Log everything so you can improve over time. And design the preprocessing layer to be TTS-engine-agnostic, so you can swap backends without rebuilding the whole thing.
Herman
That's a solid summary. I'd add one more thing — if you have the resources, invest in building a test suite. A set of scripts that contain every acronym you care about, every edge case you've encountered, every tricky normalization scenario. Run those through the pipeline after every change and listen to the output. It's tedious, but it's the only way to be confident that you haven't introduced regressions.
Corn
That test suite approach is something Daniel would appreciate, given his software development background. It's the same idea as unit tests for code — you encode your expectations and check them automatically.
Herman
There's an interesting parallel here to the broader conversation about A I quality assurance. As more content gets generated by A I pipelines, we're going to need new kinds of testing infrastructure. You can't manually review every episode of every A I generated podcast. You need automated quality checks that can catch the obvious errors — mispronounced acronyms, garbled sentences, unnatural pauses.
Corn
We're already seeing startups emerge in this space. Companies building A I observability tools that monitor the output of generative pipelines and flag anomalies. It's early days, but the need is real.
Herman
Daniel's podcast is actually a fascinating case study in this, because he's been running an almost fully automated production pipeline for years now. The script is A I generated, the voices are A I generated, the topic selection is driven by A I. And he's been iterating on the quality issues — first it was disfluencies, now it's acronyms, next it'll be something else. Each iteration makes the output a little bit better.
Corn
That iterative approach is the right one, I think. You can't solve all the text normalization problems at once. You tackle the highest-impact ones first, build infrastructure that makes it easy to add new rules and models, and keep improving over time.
Herman
The acronym issue is a perfect example of a problem that's high-impact but relatively contained. It annoys listeners like Dr. Schneiderman. It makes the podcast sound less professional. But the solution space is well-understood — there are established approaches, pretrained models, known patterns. It's not a research problem. It's an engineering problem.
Corn
Engineering problems have engineering solutions. Which is reassuring, in a way. Not everything in A I has to be a moonshot.
Herman
There's one more thing I want to touch on before we wrap up. Daniel mentioned that he envisions the podcast as a graph-backed knowledge base, where episodes are nodes connected by topics. And he said they haven't fully implemented that yet, but they've visualized the graph. I think that vision connects directly to the quality question. If the podcast is meant to function as a reference — something people discover and explore non-linearly — then production quality isn't just nice to have. It's essential. You can't have someone discover an episode from two years ago and hear garbled acronyms. It undermines the whole premise.
Corn
A linear podcast lives in the moment — listeners hear the latest episode, maybe a few recent ones, and that's it. Quality issues in old episodes don't matter much because nobody listens to them. But a graph-backed podcast is different. Every episode is a potential entry point. Every episode needs to hold up.
Herman
That raises the stakes for the acronym fix. It's not just about making future episodes better. It's about going back and fixing the old ones. If Daniel builds a robust preprocessing pipeline, he could theoretically re-render the entire back catalog with corrected pronunciation.
Corn
That's a big project. But if the pipeline is automated enough, it's just a matter of compute and time. Feed in the old scripts, run them through the new preprocessing stack, regenerate the audio, and swap the files. Two thousand five hundred episodes would take a while, but it's parallelizable.
Herman
That's the beauty of a fully automated pipeline. Once you've solved the problem once, you can apply the solution retroactively. It's not like a human narrator who would have to re-record everything. The A I voices are always available, always consistent.
Corn
Alright, I think we've given Daniel a pretty thorough answer. But let me ask you one more question, Herman. If you were starting a TTS pipeline from scratch today, would you choose an open-source engine like Chatterbox and deal with these normalization issues yourself, or would you go with a commercial provider like ElevenLabs that handles most of this out of the box?
Herman
That's a tough one. For a passion project where cost matters and you have the engineering skills, I'd go open-source. The control you get is worth the extra work. But for a commercial product where time-to-market matters and you need production quality from day one, I'd go with a managed provider. ElevenLabs has solved most of these problems already. You pay a premium, but you get to focus on content instead of infrastructure.
Corn
I think Daniel's case is interesting because it's somewhere in between. It's a passion project, but it has real listeners and real quality expectations. And he has the engineering skills to build the preprocessing pipeline. So open-source makes sense, but it also creates exactly the kind of problem he's asking about.
Herman
Which is why he's asking. And I think the answer is, yes, you build the sidecar models. You invest in the preprocessing infrastructure. You treat text normalization as a first-class engineering concern, not an afterthought. And over time, as the open-source TTS engines get better at handling these things natively, you can gradually simplify your pipeline.
Corn
The car analogy Daniel used is actually perfect. When you build the car yourself, you learn exactly how every part works. You can fix things when they break. You can upgrade individual components. But you also spend a lot of time under the hood. It's a trade-off between capability and convenience.

And now: Hilbert's daily fun fact.

Hilbert: The national animal of Scotland is the unicorn.
Corn
...right.
Corn
Here's the forward-looking thought I want to leave listeners with. As A I generated audio becomes more common — podcasts, audiobooks, voice assistants, video narration — the text normalization problem is going to become more visible. The companies that solve it well, whether through better end-to-end models or richer markup standards or smarter preprocessing, are going to have a real advantage. And the open-source community, projects like Chatterbox, are going to need to catch up or risk being left behind. It'll be interesting to see how this plays out over the next few years.
Herman
Thanks to our producer, Hilbert Flumingtop, for keeping this whole operation running. This has been My Weird Prompts. If you want to dig deeper into how we build this show, the technical posts and episode archive are at myweirdprompts.
Corn
We'll be back next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.