#2754: Why Your Dictation Setup Might Be Wrong

Modern ASR is shockingly robust. The biggest predictor of accuracy? How well your audio matches its training data.

Episode Details
Episode ID: MWP-2915
Published:
Duration: 36:10
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Modern automatic speech recognition (ASR) systems like Whisper have upended conventional wisdom about dictation accuracy. Where older systems were brittle — demanding pristine audio, close microphone placement, and slow, clear speech — today's end-to-end neural models thrive on the kind of messy, real-world audio they were trained on.

Research from Johns Hopkins shows that moderate background noise (café-level) barely moves word error rates, shifting them from roughly 3% to 4-5%. A Carnegie Mellon meta-analysis found that speaking rate has almost no effect within a normal conversational range of 140-180 words per minute. And a 2024 ETH Zurich study demonstrated that Whisper handles whispered speech with only a 15-20% relative increase in word error rate.

The key insight: the single biggest predictor of accuracy is domain match — how similar your audio is to the model's training distribution. Whisper's 680,000 hours of training data is dominated by slightly compressed, variable-distance, real-world speech. A pristine studio recording is actually somewhat out of distribution, while a phone recording in a café with background conversation in another language may be nearly ideal.

This leads to the concept of computer-directed speech — the learned adaptation where humans unconsciously modify their speech patterns for machines. While hyper-articulation and monotone delivery can backfire with models trained on natural prosody, the co-evolution between human and machine continues to reshape how we think about dictation.

#2754: Why Your Dictation Setup Might Be Wrong

Corn
Daniel's been running his own benchmarks on dictation accuracy — different microphones, different environments, whispering versus speaking normally, background noise, background conversations in different languages — and what he found is that a lot of the things you'd assume would wreck your word error rate just… didn't. Which honestly makes me wonder how much of the conventional wisdom around dictation setup is just audio superstition.
Herman
The thing is, he's not wrong to be surprised. I went through basically the same arc when I started looking at the actual research. Most people — and I include myself in this from my early days — assume that speech-to-text is like a microphone physics problem. Get closer, eliminate noise, speak clearly, and your accuracy goes up. And that's true for the acoustic signal. But modern ASR systems, especially end-to-end neural models like Whisper, are doing something fundamentally different than just matching waveforms to phonemes.
Corn
By the way, today's episode is powered by DeepSeek V four Pro. So if the discussion feels unusually precise, that might be why.
Herman
Fitting, given we're talking about precision. So let me start with the thing that I think most people get wrong, and it's the central finding in a lot of the recent literature. Modern ASR systems are robust to acoustic degradation in ways that would have seemed impossible ten years ago. There was a paper from a group at Johns Hopkins — they did a systematic evaluation of Whisper's robustness across different noise conditions, and what they found was that adding background noise at moderate levels, like café-level noise, barely moved the word error rate. We're talking a shift from maybe three percent to four or five percent. That's not nothing, but it's not the collapse people expect.
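As a rough illustration of the kind of robustness check Herman describes, here is a minimal Python sketch: mix café noise into a clean recording at a controlled signal-to-noise ratio, transcribe both with Whisper, and compare word error rates. The file names, the noise clip, and the choice of the openai-whisper and jiwer packages are all assumptions, not part of the study he cites.

```python
# Sketch: mix cafe-level noise into a clean recording at a target SNR,
# transcribe with Whisper, and compare word error rates against a reference.
# "clean.wav", "cafe_noise.wav", and "reference.txt" are placeholders.
import numpy as np
import whisper
import jiwer

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return (speech + scale * noise).astype(np.float32)

model = whisper.load_model("small")
speech = whisper.load_audio("clean.wav")              # 16 kHz mono float32
noise = whisper.load_audio("cafe_noise.wav")
reference = open("reference.txt").read()

for snr in (20, 10, 5):                               # 20 dB ~ quiet room, 5 dB ~ busy cafe
    noisy = mix_at_snr(speech, noise, snr)
    hyp = model.transcribe(noisy, language="en")["text"]
    print(f"SNR {snr:>2} dB  WER = {jiwer.wer(reference, hyp):.3f}")
```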
Corn
Daniel mentioned he was getting better results on his phone than on desktop microphones. So that tracks — the phone's mic array and the processing pipeline on modern smartphones are actually optimized for exactly this kind of far-field speech pickup. Meanwhile, a lot of desktop USB mics are designed for podcasting, where you want a wide, warm capture, which actually introduces room reflections that the ASR model then has to disentangle.
Herman
And this connects to something that I think is the real headline from the research. The single biggest predictor of word error rate isn't microphone proximity or background noise. It's domain match. What I mean by that is: how similar is the audio you're feeding the model to the kind of audio it was trained on? Whisper was trained on something like six hundred eighty thousand hours of labeled audio, and a huge chunk of that is — you guessed it — slightly compressed, variable-distance, real-world speech. Phone calls, YouTube videos, podcasts. So when you feed it a pristine studio recording, that's actually somewhat out of distribution.
Corn
You're saying the model is more comfortable with messy audio.
Herman
I'm saying the model was literally raised on messy audio. Its internal representations are built around the statistical regularities of slightly degraded speech. So when Daniel says he got his best benchmarks on a OnePlus phone speaking in what he calls a slightly clipped manner, he's not hallucinating. He's accidentally matching the training distribution.
Corn
This is where I want to dig into the rate of speech thing, because Daniel's finding — that speaking slowly didn't improve accuracy and speaking quickly didn't hurt it — that's counterintuitive but well-supported. There was a meta-analysis out of Carnegie Mellon, I think it was twenty twenty-three, that looked at speaking rate and ASR accuracy across about a dozen modern systems. The takeaway was that for neural end-to-end models, speaking rate has almost no effect within a pretty wide band. If you're speaking at a normal conversational pace — say, one hundred forty to one hundred eighty words per minute — the model doesn't care. The errors that do occur at very fast rates aren't from the rate itself. They're from coarticulation — you start blending phonemes together in ways that the model's training data might not have covered.
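One way to probe the speaking-rate claim on your own recordings is to time-stretch the same clip and see whether the error rate actually moves. A minimal sketch, assuming librosa plus the same placeholder files as above; note that time-stretching only changes tempo, so it does not reproduce the coarticulation effects of genuinely fast speech that Herman describes next.

```python
# Sketch: time-stretch one recording to simulate slower/faster delivery,
# then check whether Whisper's word error rate changes with tempo alone.
import librosa
import whisper
import jiwer

model = whisper.load_model("small")
speech, sr = librosa.load("clean.wav", sr=16000)      # Whisper expects 16 kHz mono
reference = open("reference.txt").read()

for rate in (0.8, 1.0, 1.25):                         # 0.8 = slower, 1.25 = faster delivery
    stretched = librosa.effects.time_stretch(speech, rate=rate)
    hyp = model.transcribe(stretched, language="en")["text"]
    print(f"rate x{rate}  WER = {jiwer.wer(reference, hyp):.3f}")
```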
Herman
Right, and coarticulation is the actual mechanism. When you speak quickly, you're not just playing the same phonemes faster. You're changing the acoustic realization of the phonemes. "Did you" becomes "didja." "Going to" becomes "gonna." And the model handles those just fine if they appeared in training, which for Whisper they absolutely did. What the model struggles with is coarticulation patterns that are specific to a dialect or an individual speaker that weren't well-represented in the training data. So if Daniel has a particular clipped style that he's developed from months of optimizing for the model, he's essentially trained himself into the model's comfort zone.
Corn
Which raises a slightly unsettling question. Are we training the humans or the machines here?
Herman
Both, and that's actually the state of the field. There's a concept called speaker adaptation, and it goes both ways. The system adapts to you — Whisper doesn't do real-time speaker adaptation in the classic sense, but the large context window on the newer versions lets it use your earlier utterances to inform its predictions for later ones. And you adapt to the system. You learn which words it consistently gets wrong, you learn to pause slightly before proper nouns, you develop a dictation voice. It's a co-evolution.
Corn
Daniel mentioned something that I think is worth pulling apart, which is that he tried whispering and reading the same passage out loud, and the word error rate difference was negligible. That one surprised me. Whispered speech is acoustically totally different from phonated speech. No vocal fold vibration, so no fundamental frequency, no harmonic structure. It's basically noise shaped by the vocal tract.
Herman
Yet, Whisper handles it. There's a paper from twenty twenty-four — a group at ETH Zurich, I believe — that specifically evaluated Whisper on whispered speech. They found that the word error rate was higher than normal speech, but not dramatically so. We're talking maybe a fifteen to twenty percent relative increase. So if your baseline is three percent WER, whispered speech might push you to around three and a half or three point six percent. That's usable. The model is picking up on the formant structure — the resonant frequencies of your vocal tract — which are preserved in whispered speech even without the vocal fold vibration. And it turns out formants carry a huge amount of the information needed to disambiguate phonemes.
Corn
The secret is that the model doesn't actually need your vocal cords. It just needs your mouth shape.
Herman
More or less. And this is a good example of why the "better microphone equals better accuracy" heuristic breaks down. A better microphone will give you a cleaner recording of the acoustic signal, but the acoustic signal is only the input. The model is doing inference over a learned distribution of what speech sounds like. If the model has seen enough whispered speech in training — and Whisper's training data includes a lot of lecture recordings, ASMR content, all sorts of non-standard vocalizations — then it can generalize.
Corn
Let's talk about the background conversation thing, because Daniel raised a really specific and interesting question. He lives in Israel, so the background conversations he picks up are in Hebrew, Arabic, French — languages other than the English he's dictating in. And he wondered whether that actually makes them less problematic because they're in a different language. My intuition says yes, but I want to know if the research backs that up.
Herman
It does, and the reason is pretty elegant. Modern multilingual ASR systems — and Whisper is multilingual, it was trained on ninety-seven languages — have what's essentially a language identification module baked into the encoder. When the model processes audio, it's not just transcribing. It's simultaneously inferring what language is being spoken. So when background speech in Hebrew enters the audio stream, the model's language identification head is saying, "That's not English," and the English decoder can effectively ignore those acoustic features. It's not perfect — if the background Hebrew is loud enough to mask the English foreground, you'll get errors. But for the kind of faint background conversation Daniel described, the language mismatch is actually a feature, not a bug.
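For what it's worth, the open-source Whisper package exposes both halves of what Herman describes: you can ask the model which language it thinks it is hearing, and you can pin the decoder to English so faint background Hebrew or French is less likely to be transcribed. A minimal sketch; the file name is a placeholder.

```python
# Sketch: inspect Whisper's language identification, then force English decoding
# so background speech in another language is less likely to leak into the output.
import whisper

model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("dictation.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The encoder's language-ID head: probabilities over all supported languages.
_, lang_probs = model.detect_language(mel)
top = sorted(lang_probs.items(), key=lambda kv: kv[1], reverse=True)[:3]
print("top language guesses:", top)

# Pinning the language keeps the decoder on the English pathway.
result = model.transcribe("dictation.wav", language="en")
print(result["text"])
```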
Corn
Whereas if he were dictating in English at Disneyland and someone behind him was also speaking English, the model would have a harder time separating the two streams because they'd be competing for the same decoder pathway.
Herman
This is the cocktail party problem, and it's one of the hardest problems in speech processing. Humans are remarkably good at it — we can focus on a single speaker in a room full of conversations, even when those conversations are in the same language. Machines are getting better, but they're not there yet. The language mismatch gives the model an extra dimension of separation. It's like having a color filter when you're trying to separate two overlapping images. If one is red and one is blue, it's easy. If they're both red, you need much more sophisticated processing.
Corn
The practical advice for someone who dictates in noisy environments is: hope the background noise is in a different language.
Herman
Or use a system that supports speaker diarization and target-speaker extraction. There are commercial systems now — Otter, Fireflies, some of the enterprise ASR platforms — that let you enroll your voice and then the model will specifically track your speech stream and suppress everything else. Whisper doesn't do that natively, but there are wrapper tools that add speaker diarization on top.
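A minimal sketch of the "wrapper" idea Herman mentions: run speaker diarization first, then transcribe only the dominant speaker's segments. The pyannote pipeline name, the access token, and the keep-the-longest-speaker heuristic are assumptions standing in for real speaker enrollment, not how any particular commercial product works.

```python
# Sketch: diarize with pyannote, then transcribe only the speaker who talks the most,
# as a crude stand-in for enrolled target-speaker extraction.
# The pipeline name and token are assumptions; check pyannote's documentation.
import whisper
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN_HERE")
model = whisper.load_model("small")

diarization = diarizer("dictation.wav")

# Total up speaking time per speaker and keep the dominant one.
totals = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    totals[speaker] = totals.get(speaker, 0.0) + (turn.end - turn.start)
target = max(totals, key=totals.get)

audio = whisper.load_audio("dictation.wav")           # 16 kHz float32
pieces = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    if speaker != target:
        continue
    segment = audio[int(turn.start * 16000):int(turn.end * 16000)]
    pieces.append(model.transcribe(segment, language="en")["text"].strip())

print(" ".join(pieces))
```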
Corn
Daniel also touched on something that I think is under-discussed, which is that he's developed a slightly clipped manner of speaking for dictation. This is the human adaptation side of the co-evolution you mentioned. And I've noticed this in myself too — when I'm dictating, I speak differently than when I'm talking to you. It's not quite robotic, but it's flatter, more deliberate about consonant boundaries.
Herman
There's actual research on this phenomenon. It's sometimes called "computer-directed speech" or "Lombard speech for machines." The Lombard effect is the involuntary tendency to speak louder and more clearly in noisy environments. Computer-directed speech is a learned version of that — people unconsciously or consciously modify their speech to optimize for the machine. They hyper-articulate, they reduce their pitch variation, they insert micro-pauses between words. And the fascinating thing is that this can actually backfire with modern systems. Whisper was trained on natural speech, not hyper-articulated speech. So if you overdo the clipping, you might be pushing yourself out of the training distribution again.
Corn
Which would explain why Daniel's trying to stop doing it for the podcast but keeps slipping back into it for dictation. He's found a local optimum that works for his specific setup, but it might not generalize.
Herman
This is where I want to bring up a finding that I think is genuinely underappreciated. There was a study — I want to say it was from Microsoft Research, maybe twenty twenty-five — that looked at the effect of prosody on ASR accuracy. Prosody is the rhythm, stress, and intonation of speech. And what they found was that natural prosody actually improves accuracy. When people speak with normal intonation contours — the pitch goes up at the end of a question, down at the end of a statement, stress falls on the important words — the model does better than when people speak in a monotone. The hypothesis is that prosody carries syntactic information that helps the model disambiguate sentence structure.
Corn
That's counterintuitive. You'd think a flatter delivery would be easier to parse because there's less variation to deal with.
Herman
It's easier for a simple signal processing system. For a neural network that's learned statistical regularities across hundreds of thousands of hours of natural speech, the variation is information. Removing it removes cues the model relies on.
Corn
The ideal dictation voice isn't robot-speak. It's just your normal voice, maybe slightly slowed down, with clear consonant articulation but natural intonation.
Herman
That's what the literature suggests. And I think this is actually liberating. It means you don't have to develop a special dictation persona. You can just talk.
Corn
Let's get into the microphone proximity thing because Daniel said he's settled on the headset being the recommendation, and that makes intuitive sense — get the mic as close to your mouth as possible, maximize signal-to-noise ratio. But then he also said his phone outperformed desktop mics. So there's something more complicated going on.
Herman
Signal-to-noise ratio matters, but it's not the whole story. A headset mic close to your mouth gives you a high SNR, absolutely. But modern phone microphone arrays are doing beamforming — they use multiple microphones to create a directional pickup pattern that focuses on your voice and suppresses sound from other directions. That's computational audio, and it's incredibly sophisticated. The iPhone has had multiple microphones since the iPhone four, and the signal processing pipeline has gotten better with every generation. Android flagships like the OnePlus are doing the same thing. So even though the phone is farther from your mouth, the effective SNR after beamforming can be comparable to or better than a close-talked headset mic, especially if the headset mic is omnidirectional and picks up room reflections.
Corn
The desktop mic problem is even worse. Most people have their desktop mic on a stand, maybe a foot or two from their mouth. That's far enough that the direct-to-reverberant ratio drops significantly. The mic is picking up as much room sound as direct sound. And room acoustics are terrible for ASR — you're adding a convolutive distortion that smears the temporal structure of the speech signal.
Herman
There's a classic paper from the eighties — this is pre-neural, obviously — that established the direct-to-reverberant ratio as the key predictor of ASR accuracy in enclosed spaces. And the basic finding still holds. You want the direct path from your mouth to the microphone to dominate. A headset achieves that by proximity. A phone achieves it through beamforming. A desktop mic on a boom arm can achieve it if it's positioned correctly, but most people don't position it correctly.
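The direct-to-reverberant ratio Herman refers to is straightforward to compute if you have a measured or simulated room impulse response: compare the energy in a short window around the direct-path arrival against everything that arrives later. A minimal sketch; the impulse response file and the 2.5 ms window are assumptions.

```python
# Sketch: direct-to-reverberant ratio (DRR) from a room impulse response.
# Energy in a short window around the direct-path peak counts as "direct";
# everything after counts as reverberant. "rir.wav" is a placeholder.
import numpy as np
import soundfile as sf

rir, sr = sf.read("rir.wav")
if rir.ndim > 1:
    rir = rir[:, 0]                                   # use the first channel

peak = int(np.argmax(np.abs(rir)))                    # direct-path arrival
window = int(0.0025 * sr)                             # ~2.5 ms around the peak

direct = np.sum(rir[max(0, peak - window):peak + window] ** 2)
reverberant = np.sum(rir[peak + window:] ** 2) + 1e-12

drr_db = 10 * np.log10(direct / reverberant)
print(f"DRR = {drr_db:.1f} dB")                       # higher = direct path dominates
```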
Corn
If you're going to use a desktop mic for dictation, the setup matters enormously. You need it close, you need it aimed at your mouth, and you ideally want some acoustic treatment in the room to reduce reflections.
Herman
Or you can just use a headset and not worry about any of that. Which is why Daniel's recommendation is, for most people, the right one. It's not that headset mics are inherently superior. It's that they're idiot-proof. You put them on, the mic is in the right place, and you're done.
Corn
I want to circle back to something Daniel mentioned about his own benchmarks, because he said he was using Whisper, and that's a specific choice with specific implications. Not all ASR systems are created equal, and the factors that affect accuracy can vary significantly depending on the architecture.
Herman
The distinction that matters most is between hybrid systems and end-to-end systems. Hybrid systems — which were the state of the art until maybe five years ago — combine an acoustic model, a language model, and a pronunciation lexicon. Each component is trained separately. The acoustic model maps audio to phonemes, the language model predicts word sequences, and the lexicon tells you how words are pronounced. End-to-end systems like Whisper collapse all of that into a single neural network that goes directly from audio to text.
Corn
The robustness properties are different.
Herman
Hybrid systems are more brittle to acoustic variation because the acoustic model was typically trained on clean, close-talked speech. If you feed it noisy or distant speech, the acoustic model produces garbage phoneme probabilities, and the language model can only do so much to clean it up. End-to-end systems, because they're trained on diverse real-world audio, are much more robust to acoustic variation. But they have their own failure modes. They can hallucinate — they'll sometimes output text that has nothing to do with what was said, especially in long silences or when the input is mostly noise. And they can be overconfident — they'll output fluent, grammatical text that's completely wrong.
Corn
The hallucination thing is worth flagging because it's a failure mode that's very different from what people expect. With a traditional ASR system, when it's uncertain, it'll output something garbled or it'll just leave a blank. With Whisper, it might output a whole sentence that sounds plausible but was never spoken. That's a much more dangerous error because it's harder to catch in proofreading.
Herman
There was a study from Cornell — not related to me, unfortunately — that found Whisper hallucinates in about one percent of utterances, and the hallucinations are often grammatically coherent and topically related to the surrounding context. So you're reading through your transcript and everything looks fine, but there's a sentence in there that you never said. It's a real problem for high-stakes applications like medical dictation.
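Whisper's own per-segment statistics give you a crude hallucination filter: segments the voice-activity head thinks are probably silence, or that decode with unusually low confidence or suspiciously repetitive text, are the ones worth re-reading. A minimal sketch using fields the open-source transcribe() call already returns; the thresholds below simply mirror the library's defaults and are not tuned.

```python
# Sketch: flag Whisper segments that look like likely hallucinations, using the
# per-segment statistics transcribe() already reports.
import whisper

model = whisper.load_model("small")
result = model.transcribe("dictation.wav", language="en")

for seg in result["segments"]:
    suspicious = (
        seg["no_speech_prob"] > 0.6        # model thinks this span is probably silence
        or seg["avg_logprob"] < -1.0       # very low decoding confidence
        or seg["compression_ratio"] > 2.4  # highly repetitive text, a common hallucination tell
    )
    flag = "CHECK" if suspicious else "     "
    print(f'{flag} [{seg["start"]:7.2f}-{seg["end"]:7.2f}] {seg["text"].strip()}')
```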
Corn
Which brings us to the practical question Daniel is really asking. If you're someone who's gone all-in on dictation, what should you actually optimize for?
Herman
I'd break it down into three tiers. Tier one, the things that matter the most: use a modern end-to-end ASR system — Whisper Large or the latest commercial equivalents. Keep your microphone reasonably close to your mouth, but don't obsess over it. Speak naturally with normal prosody. And if you're in a noisy environment, try to position yourself so the noise is behind you, not in front of you, so your phone's beamforming can reject it.
Herman
Tier two, the things that help at the margins: use a consistent setup so you're not constantly changing the acoustic conditions. If you're using a desktop mic, get it on a boom arm and position it six to eight inches from your mouth, slightly off-axis to reduce plosives. Consider a dynamic mic rather than a condenser — dynamics are less sensitive to room reflections. And if you're regularly dictating in noisy environments, look into systems with speaker enrollment.
Corn
Tier three, the things that probably don't matter as much as people think?
Herman
The exact microphone model, assuming it's not terrible. Expensive audio interfaces. Speaking in a special dictation voice — if anything, that might hurt. Ultra-quiet recording environments — a little bit of ambient noise is actually in-distribution for the model. And, surprisingly, speaking rate within normal bounds. Just talk like a person.
Corn
There's one more thing I want to dig into, and it's the thing Daniel didn't explicitly ask about but that I think is the real frontier here. All of this — microphone choice, environment, speaking style — is about optimizing the input to the ASR system. But the biggest gains in accuracy over the next few years aren't going to come from better input. They're going to come from better models.
Herman
Better models, yes, but more specifically, better integration with the applications you're using. This is something I've been thinking about a lot. Right now, ASR is mostly treated as a standalone component — you speak, it transcribes, the text goes into whatever you're working on. But the real accuracy improvements come when the ASR system has context about what you're doing. If you're dictating an email, the system should know it's an email and bias its language model toward email-appropriate vocabulary and syntax. If you're writing code, it should know the programming language and bias toward the relevant syntax.
Corn
Whisper doesn't do any of that. It's a general-purpose transcription model. It has no idea whether you're writing a text message or a legal brief.
Herman
Right, and that's why the hallucination problem is so tricky. Without context, the model is just guessing at the most probable sequence of words given the audio. With context, it could constrain those guesses in meaningful ways. Some of the newer commercial systems are starting to do this — they'll integrate with your calendar, your contacts, your previous messages, and use that to improve accuracy. Apple's on-device dictation has been doing a version of this for years, which is why it handles proper names reasonably well even though proper names are notoriously hard for ASR.
Corn
The dictation setup of the future isn't really about the microphone at all. It's about the software knowing who you are and what you're trying to do.
Herman
I think that's the right way to frame it. The hardware question is largely solved. A modern smartphone with a decent mic and a modern ASR system will give you usable accuracy in most real-world conditions. The remaining errors aren't acoustic — they're linguistic. They're about the model not knowing that you're talking about your friend Ezra, not the biblical Ezra, or that you're dictating a technical term that's not in the training data.
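A lightweight version of that contextual integration already exists in the open-source Whisper API: the initial_prompt argument primes the decoder with names and jargon so it is biased toward them. It is nowhere near true application awareness, but it targets exactly the "your friend Ezra" problem. A minimal sketch; the names and terms are illustrative only.

```python
# Sketch: bias Whisper toward your own proper nouns and jargon by priming the
# decoder with an initial prompt. The names and terms here are illustrative.
import whisper

model = whisper.load_model("small")

context = (
    "Email dictation. People mentioned: Ezra, Daniel, Herman. "
    "Terms: Whisper, word error rate, beamforming, diarization."
)

result = model.transcribe("dictation.wav", language="en", initial_prompt=context)
print(result["text"])
```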
Corn
Which brings me back to Daniel's point about the clipped speaking style. I think what he's doing, probably without fully realizing it, is providing the model with clearer phonemic boundaries, which reduces the ambiguity that leads to linguistic errors. It's a workaround for the model's lack of context. If the model knew what he was talking about, he wouldn't need to do it.
Herman
The clipped style is a form of hyper-articulation that reduces the search space for the model. By making each word more acoustically distinct, he's giving the model less room to hallucinate. But it's a band-aid. The real solution is better contextual integration.
Corn
Let's talk about one more thing that I think is under-discussed in the dictation community, which is the post-processing pipeline. Daniel's benchmarks are measuring raw word error rate from Whisper, but in practice, most people aren't using raw Whisper output. They're running it through some kind of cleanup — punctuation restoration, capitalization, formatting. And those post-processing steps can introduce their own errors.
Herman
Or they can fix errors. A good punctuation model can actually resolve ambiguities in the transcript. If the ASR output is "let's eat grandma" and the punctuation model inserts a comma — "let's eat, grandma" — it's effectively corrected a semantic error. But a bad punctuation model can turn a correct transcript into nonsense. And formatting is even trickier. If you're dictating a list, does the system know to format it as bullet points? If you're dictating a date, does it know your preferred date format? These are all places where the pipeline can degrade accuracy even if the raw ASR is perfect.
Corn
The word error rate you measure at the Whisper level isn't necessarily the word error rate you experience as a user.
Herman
And this is why I always tell people to benchmark their actual end-to-end workflow, not just the ASR component in isolation. Dictate a real email, with formatting and punctuation, and see what lands in your outbox. That's the metric that matters.
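Measuring that end-to-end number is mostly a matter of normalizing both texts the same way before comparing them, so punctuation and capitalization choices do not masquerade as word errors. A minimal sketch with jiwer; the file names are placeholders and the transform chain is one reasonable choice among several.

```python
# Sketch: compare what actually landed in your draft against what you meant to say,
# normalizing case and punctuation so only real word errors are counted.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

reference = open("what_i_meant_to_say.txt").read()    # your proofread ground truth
hypothesis = open("what_landed_in_the_draft.txt").read()

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"end-to-end WER: {wer:.3f}")
```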
Corn
Daniel also raised the question of whether background conversations in different languages are less problematic, and we established that they are. But there's a flip side to the multilingual thing that's worth mentioning. Whisper's multilingual training means it can sometimes switch languages on you. If you're dictating in English and you say a word that sounds like a word in another language, the model might transcribe it in that language. This is especially common with proper names and loanwords.
Herman
The language switching problem is real, and it's one of the trade-offs of multilingual models. A monolingual English model would never output Hebrew text, but it would also be worse at handling accented English. The multilingual model handles accents better because it's seen more variation, but it introduces the possibility of language confusion. You're trading one failure mode for another.
Corn
Which failure mode would you rather have?
Herman
For most users, the multilingual model is better. The accent robustness alone is worth it. And the language switching errors tend to be obvious — you'll see a word in a different script or a clearly non-English word, and you'll catch it in proofreading. The hallucination errors from a monolingual model are harder to catch because they look like valid English.
Corn
That's a good heuristic. Prefer errors that are easy to catch.
Herman
It sounds obvious, but it's actually a useful design principle for ASR systems. You want the error distribution to be skewed toward detectable errors rather than plausible-looking errors. A garbled word is annoying but fixable. A fluent hallucination is dangerous.
Corn
I want to pull on one more thread from Daniel's message. He mentioned that he's been doing this for a year and a half and has settled into a core setup that hasn't changed much. And I think that's actually the sign of a mature dictation practice. At some point, you stop optimizing the gear and start optimizing the workflow.
Herman
The gear acquisition phase is real, and I went through it too. You buy the gooseneck mic, the lav mic, the desktop condenser, the audio interface, the acoustic panels. And then at some point you realize that your phone with the default earbuds is giving you ninety-eight percent of the accuracy with zero setup time. And you stop caring about the last two percent.
Corn
Unless the last two percent matters for your use case. If you're a medical transcriber where an error could be life-threatening, you care about every fraction of a percent. If you're dictating emails, you probably don't.
Herman
Even in high-stakes domains, the research suggests that human review is still necessary. No ASR system is perfect, and the error modes of neural systems are unpredictable enough that you can't fully trust them for critical applications. The best practice is ASR plus human verification, not ASR alone.
Corn
Which is basically what Daniel's doing. He's dictating, then he's proofreading. The ASR is an efficiency tool, not a replacement for human judgment.
Herman
And I think that's the healthy way to think about it. Dictation isn't about never touching a keyboard again. It's about reducing the amount of keyboard time. If you can dictate a first draft at a hundred sixty words per minute and then spend two minutes cleaning it up, you're still way ahead of typing it from scratch at forty words per minute.
Corn
The numbers bear that out. Average typing speed is about forty words per minute. Average speaking speed is about a hundred fifty. Even with a five percent word error rate and some editing overhead, dictation is a huge net win for most people.
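The back-of-the-envelope math Corn cites is easy to make explicit. The numbers below are the same round figures from the conversation, not measurements, and the per-correction time is a guess.

```python
# Sketch: rough time-per-draft comparison, using the round numbers from the episode.
words = 300                  # a medium-length email
typing_wpm = 40
speaking_wpm = 150
wer = 0.05                   # ~5% of dictated words need fixing
seconds_per_fix = 4          # generous estimate per correction

typing_minutes = words / typing_wpm
dictation_minutes = words / speaking_wpm + (words * wer * seconds_per_fix) / 60

print(f"typing:    {typing_minutes:.1f} min")
print(f"dictation: {dictation_minutes:.1f} min (including cleanup)")
# With these assumptions dictation comes out roughly 2-3x faster even after editing.
```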
Herman
The editing overhead decreases over time as you learn the system's quirks. You learn which words it consistently gets wrong, and you either avoid them or correct them automatically in your head during proofreading. It becomes a kind of muscle memory.
Corn
To synthesize what we've covered: Daniel's findings are consistent with the literature. Microphone proximity matters but isn't everything. Background noise matters less than you'd think, especially with modern beamforming phones. Speaking rate is largely irrelevant within normal bounds. Whispering works surprisingly well. Background conversations in other languages are less problematic than same-language background speech. And the biggest accuracy gains going forward are going to come from contextual integration, not better hardware.
Herman
I'd add one more thing that we touched on but didn't fully unpack, which is that the training data distribution is the hidden variable behind all of this. When Daniel got better results on his phone than on his desktop mic, that's not because phone mics are inherently better. It's because the model was trained on phone-quality audio. When he got good results with his clipped speaking style, it's because that style happens to match patterns in the training data. The model is a mirror of its training distribution, and understanding that distribution is the key to understanding what will and won't work.
Corn
The advice is: be average. Speak the way people speak in the training data. Use the kind of microphone people used in the training data. Don't try to be too clever.
Herman
Be boringly normal, and the model will love you.
Corn
Which is maybe the most counterintuitive advice in a field where everyone is trying to optimize everything.
Herman
The optimization instinct is natural, but it often leads people to over-engineer their setup in ways that actually hurt. I've seen people build elaborate soundproof booths for dictation, and then they wonder why their accuracy dropped. It dropped because the model had never heard audio that clean before. It didn't know what to do with the silence between words.
Corn
The sound of silence as an out-of-distribution problem. That's going to be my new favorite example of how weird modern AI is.
Herman
It's a perfect illustration. The model learned that silence has a certain acoustic texture — the hum of a computer fan, distant traffic, room tone. When you remove all of that, the silence becomes unnatural, and the model's voice activity detection gets confused. It starts hallucinating words in the gaps.
Corn
The ideal recording environment isn't an anechoic chamber. It's a slightly noisy living room.
Herman
Which is, conveniently, where most people already are.
Corn
All right, I think we've covered the factors pretty thoroughly. Before we wrap, let me ask you one last thing. If you had to give someone who's just starting with dictation exactly three pieces of advice, based on everything we've discussed and the research, what would they be?
Herman
First, use a modern end-to-end system — Whisper or a good commercial equivalent. Don't use the built-in dictation on a ten-year-old device. Second, use a headset or your phone's built-in mic — either will work fine, and the headset is more consistent across environments. Third, speak naturally and proofread carefully. The errors you miss are the ones that matter, and the most dangerous errors are the ones that look correct.
Corn
That's solid. I'd maybe add a fourth: don't chase the last two percent. You can spend a thousand dollars and a hundred hours optimizing your setup for a two percent improvement, or you can just proofread for an extra thirty seconds per email.
Herman
The economics of perfectionism are brutal in dictation. The Pareto principle applies — eighty percent of the accuracy comes from twenty percent of the optimization effort. Get the basics right and move on with your life.
Corn
Now: Hilbert's daily fun fact.

Hilbert: In nineteen hundred, the British colonial administrator in Papua New Guinea attempted to standardize local trade by defining one pig as equal to three hundred coconuts, but the conversion collapsed because the islanders correctly identified that pig size varied while coconuts were roughly constant, making the exchange rate a one-way bet against anyone holding pigs.
Corn
That's a surprisingly sophisticated critique of fixed exchange rates from people who were supposedly being introduced to standardized trade.
Herman
The Papua New Guineans invented the short squeeze before Wall Street did.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. You can find us at myweirdprompts.com or wherever you get your podcasts. If you enjoyed this episode, leave us a review — it helps.
Herman
Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.