#2590: How Disfluency Detection Models Clean Up Speech

How transformer models distinguish "um" from meaningful speech — and why removing too much makes you sound like a robot.

Episode Details
Episode ID
MWP-2749
Published
Duration
28:19
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

How Disfluency Detection Models Actually Clean Up Speech

When you're recording a podcast or voice memo, the gaps between your thoughts become audible artifacts. "Um," "uh," false starts, repetitions, self-corrections — these disfluencies are natural features of human speech. But in a production pipeline, they're noise to be removed.

The challenge is that disfluency detection isn't simple pattern matching. A naive approach — building a list of filler words and deleting them — catches roughly 60% of disfluencies while introducing false positives that can gut meaningful content. The sentence "I mean what I say" loses its semantic core if a script blindly deletes every "I mean."

How Modern Models Work

State-of-the-art disfluency detection treats the problem as sequence labeling. A transformer-based model (typically BERT fine-tuned for token classification) takes a transcribed sentence, tokenizes it, and labels each token as fluent or part of a disfluent region. Crucially, it uses context from both sides of each token to make its decision — it's not a lookup table.
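As a minimal sketch of that framing, the snippet below runs a Hugging Face token-classification pipeline over a sentence and prints a label and confidence score per word. The model ID is a placeholder rather than a specific published checkpoint, and the label names depend entirely on how the checkpoint was fine-tuned.

```python
# Minimal sketch of disfluency tagging as token classification.
# "my-org/bert-disfluency" is a placeholder model ID, not a real checkpoint;
# substitute any BERT-style model fine-tuned on Switchboard-style labels.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_ID = "my-org/bert-disfluency"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

# aggregation_strategy="simple" merges word-piece tokens back into whole words
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer,
                  aggregation_strategy="simple")

for tag in tagger("I went to the, uh, I took the train to Boston"):
    # each entry carries 'word', a label like FLUENT / FILLED_PAUSE / RESTART
    # (names depend on the checkpoint), and a confidence 'score'
    print(tag["word"], tag["entity_group"], round(tag["score"], 3))
```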

The gold standard training data is the Switchboard corpus: 240 hours of transcribed telephone conversations between strangers, manually annotated for disfluencies. This dataset captures real, unplanned speech with all its hesitations, mid-sentence reformulations, and overlapping fragments.

On the Switchboard benchmark, the best models achieve F1 scores of 92-94%, approaching the 94% inter-annotator agreement rate between human labelers. In some cases, models are more consistent than humans — the same annotator reviewing the same transcript six months apart disagrees with their own earlier judgments about 5% of the time.

The Production Pipeline

For practical audio cleanup, the workflow goes: transcribe with word-level timestamps (Whisper or WhisperX), run disfluency detection over the transcript text, use the timestamps to surgically cut identified segments from the audio with FFmpeg.

WhisperX with forced alignment achieves timestamp accuracy of 20-30 milliseconds — below the threshold of human perception. Without forced alignment, Whisper's native timestamps can drift by 100-200 milliseconds, creating audible glitches.
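Once the flagged words have timestamps, the cut itself is mechanical: invert the flagged spans into spans to keep, extract each one, and rejoin them. A rough sketch using FFmpeg's concat demuxer follows; the paths, helper name, and span bookkeeping are all illustrative.

```python
# Rough sketch: cut flagged spans out of an audio file given word-level
# timestamps. `flagged` is a sorted, non-overlapping list of (start, end)
# spans in seconds to remove; paths and the helper name are illustrative.
import os
import subprocess
import tempfile

def remove_spans(src, dst, flagged, total_dur):
    # invert the flagged spans into the spans we want to keep
    keep, cursor = [], 0.0
    for start, end in flagged:
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_dur:
        keep.append((cursor, total_dur))

    # extract each kept span, then rejoin them with FFmpeg's concat demuxer
    tmp = tempfile.mkdtemp()
    parts = []
    for i, (s, e) in enumerate(keep):
        part = os.path.join(tmp, f"keep_{i}.wav")
        subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", f"{s:.3f}",
                        "-to", f"{e:.3f}", part], check=True)
        parts.append(part)

    list_path = os.path.join(tmp, "parts.txt")
    with open(list_path, "w") as f:
        f.writelines(f"file '{p}'\n" for p in parts)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_path, dst], check=True)
```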

Precision vs. Recall Tradeoffs

BERT-based approaches reach precision of roughly 91-93% on the Switchboard test set, which means about 7-9% of the tokens they flag are false positives. For a podcast pipeline, every false positive is a word cut from a sentence, potentially creating artifacts worse than the original disfluency. The practical solution: set a high confidence threshold (95%+) that biases toward precision over recall, and leave in a few "ums" rather than risk cutting content.
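In code, that bias toward precision is a one-line filter over the tagger output from the earlier sketch (label names are still placeholders):

```python
# Assumes `tags = tagger("...")` holds the output of the earlier
# token-classification sketch; "FLUENT" is a placeholder label name.
CONFIDENCE_THRESHOLD = 0.95

to_cut = [t for t in tags
          if t["entity_group"] != "FLUENT" and t["score"] >= CONFIDENCE_THRESHOLD]
```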

Alternative approaches include encoder-decoder models like T5, which frame disfluency removal as text-to-text: input the disfluent transcript, output the fluent version. These models handle both removal and smoothing in one step, learning how to re-join the remaining text naturally.

The Uncanny Valley of Speech

There's a deeper tension at play. In AI-generated text, the absence of disfluency signals artificiality. In human audio, the presence of disfluency signals naturalness. The goal isn't zero disfluency — it's finding the sweet spot where speech sounds polished but not robotic.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3
Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2590: How Disfluency Detection Models Clean Up Speech

Corn
Daniel sent us this one — he's been deep in audio pipeline work for the show, and he's wrestling with two problems. First, silence reduction — trimming the gaps when he pauses to think while recording prompts. He says that part's actually straightforward. The harder one is what he calls disfluency identification — catching those moments where the brain outruns the mouth, the "um"s and "uh"s, the false starts, the "I mean what I'm trying to say is." He's been looking at using SRT transcription to timestamp and cut those out, and he wants to know how disfluency detection models actually work under the hood, what tooling exists for this, and how it all fits into production pipelines. There's a lot to unpack here.
Herman
By the way, today's episode is powered by DeepSeek V four Pro. Good to have you along for the ride.
Corn
Where do we even start with this. The thing that jumps out at me is Daniel's framing — he's not just asking about removing filler words. He's connecting this to the earlier discussion we had about pseudo-personalized emails, where the absence of disfluency is actually a tell that something's machine-generated.
Herman
And that inversion is genuinely interesting. In AI-generated text, the lack of "um" and "uh" and false starts signals artificiality. But in Daniel's audio pipeline, he's trying to remove those exact same features to make himself sound more coherent. So he's caught in this tension — remove the disfluencies and you risk sounding less human, keep them and you sound less polished. There's a sweet spot in there somewhere.
Corn
It's the uncanny valley of speech. Too clean and you sound like a text-to-speech model. Too messy and you sound like you're recording voice memos at three in the morning.
Herman
Which, to be fair, Daniel probably sometimes is. But let's get into the technical meat of this. Disfluency detection as a field has been around for decades — it's one of those problems that seems simple until you actually try to solve it. The classic definition covers a few categories. You've got filled pauses — that's your "um" and "uh." You've got repetitions — "I I I think" or "the the the point is." You've got false starts where someone begins a word or phrase and then abandons it midstream. You've got self-corrections — "I went to the store, I mean, the pharmacy." And then you've got discourse markers that function almost like verbal punctuation — "you know," "I mean," "like," "sort of."
Corn
Some of those are communicative, right. "You know" isn't just noise — it's checking for listener comprehension. "I mean" signals a reformulation. If you strip all of those out, you're not just cleaning up audio, you're removing pragmatic signals.
Herman
And this is where the models have to get sophisticated. The naive approach is just pattern matching — build a list of filler words, scan the transcript, delete them. That catches maybe sixty percent of disfluencies and introduces a ton of false positives. Because "like" appears in perfectly fluent speech all the time. "I mean" can be a discourse marker or it can be the literal start of an explanation. Think about the sentence "I mean what I say." If your pattern matcher blindly deletes every "I mean," you've just gutted a perfectly valid statement.
Corn
Right — and that's the kind of error that's worse than just leaving the filler in. At least a stray "um" is just mildly annoying. Accidentally deleting semantic content changes the meaning. So how do you actually teach a model to make that distinction?
Herman
That's the core challenge, and it's why the field really took off in the nineties with statistical approaches. But the current state of the art uses transformer-based sequence tagging models. The core insight is that disfluency detection is fundamentally a sequence labeling problem. You take a transcript, tokenize it, and the model labels each token as either fluent or part of a disfluent region. And crucially, it's looking at context on both sides to make that call.
Corn
It's not just a lookup table. The model is actually reading the surrounding sentence to figure out whether a given "um" is structural or meaningful.
Herman
And the best models today are doing something even more interesting — they're trained on what's called the Switchboard corpus, which is a massive dataset of transcribed telephone conversations collected by the Linguistic Data Consortium. We're talking about two hundred and forty hours of naturalistic speech, manually annotated for disfluencies. That's the gold standard training data. And the reason it's so valuable is that it captures how people actually talk when they're not reading from a script. These are strangers having unplanned conversations about assigned topics — so you get all the hesitations, the mid-sentence reformulations, the overlapping speech. It's messy in exactly the way real speech is messy.
Corn
Two hundred and forty hours of strangers on the phone. Someone had to annotate all of that.
Herman
Multiple someones, and it took years. But that dataset is why modern disfluency detectors can catch things that pattern-matching misses. For example, a false start like "I went to the, uh, I took the train" — the model needs to understand that "I went to the" is an abandoned fragment, not a complete phrase. That requires syntactic understanding, not just keyword spotting. The model has to recognize that the preposition "to" is hanging there without its expected complement, and that the restart "I took the train" is a complete alternative to whatever was being formulated.
Corn
That's where the transformer architecture earns its keep. It's building a representation of the whole utterance and using attention mechanisms to figure out what connects to what.
Herman
The current best-performing models on the Switchboard benchmark are hitting F1 scores around ninety-two to ninety-four percent on disfluency detection. That's remarkably good. For context, the inter-annotator agreement on the Switchboard corpus itself — meaning how often two human annotators agree on what counts as a disfluency — is only about ninety-four percent. So the models are approaching human-level performance on this task.
Corn
That's wild. The models are basically as good at spotting disfluencies as the humans who trained them.
Herman
In some edge cases, possibly more consistent. Humans get tired, they drift in their annotation criteria, they miss things. A model applies the same standard across the entire dataset. I've seen studies where the same human annotator, given the same transcript six months apart, disagrees with their own earlier judgments about five percent of the time. The model doesn't have that problem.
Corn
What does this look like in practice for Daniel's pipeline? He mentioned SRT transcription — that's SubRip Text format, the subtitle format with timestamps. The idea being you transcribe the audio, get word-level or phrase-level timestamps, identify the disfluent regions, and then use the timestamps to surgically remove those segments from the audio.
Herman
That's the cleanest workflow, yeah. And there are a few ways to implement it. The most straightforward is using Whisper from OpenAI for transcription — specifically the word-level timestamp feature they added. You get a JSON output with each word and its start and end time. Then you run a disfluency detection model over the transcript text, identify which words or phrases should be removed, and use the timestamps to cut those segments from the audio file. FFmpeg can handle the actual audio slicing.
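A rough sketch of the transcription step Herman describes, assuming the open-source openai-whisper package (the model size and file name are placeholders):

```python
# Sketch of transcription with word-level timestamps, flattened into
# (word, start_sec, end_sec) tuples. Assumes the open-source openai-whisper
# package; the model size and file name are placeholders.
import whisper

model = whisper.load_model("base")
result = model.transcribe("prompt.wav", word_timestamps=True)

words = []
for segment in result["segments"]:
    for w in segment.get("words", []):
        words.append((w["word"].strip(), w["start"], w["end"]))
```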
Corn
The disfluency detection model itself — is that something you have to train yourself, or are there off-the-shelf options?
Herman
There are a few paths. For someone like Daniel who's comfortable with code, the most practical approach right now is probably using a fine-tuned BERT model for token classification. There's a model called "bert-disfluency-detector" that's been floating around on Hugging Face — it's a BERT-base model fine-tuned on the Switchboard corpus for exactly this task. You feed it tokenized text, it outputs a label for each token indicating whether it's fluent, a filled pause, a repetition, a false start, or a correction.
Corn
That's running locally, not hitting an API.
Herman
Right, which matters for a production pipeline. You don't want to be sending every podcast recording to an external service if you can avoid it. BERT-base is small enough to run on a decent CPU, and if you've got a GPU available, it's practically instant. For a five-minute audio prompt like Daniel's, the whole pipeline — transcribe, detect, cut — could run in under thirty seconds on a modern machine.
Corn
That's fast enough to be practical. But I want to dig into something you mentioned earlier — the false positive problem. You said pattern matching catches about sixty percent and introduces false positives. What's the false positive rate on the BERT-based approach?
Herman
On the Switchboard test set, the best models are achieving precision around ninety-one to ninety-three percent. So you're looking at a false positive rate of seven to nine percent. That's low enough for most applications, but for a podcast, every false positive is a word cut out of a sentence. That can create artifacts that are arguably worse than the original disfluency.
Corn
A glitch in the middle of a word because the model thought "like" was filler when it was actually a verb.
Herman
And that's why the practical implementation usually includes a confidence threshold. You only remove tokens where the model's confidence is above, say, ninety-five percent. That reduces recall — you'll miss some actual disfluencies — but it dramatically cuts the false positive rate. For a production pipeline, I'd rather leave in a few "um"s than risk cutting content.
Corn
There's an interesting asymmetry there. Leaving in a disfluency is mildly annoying. Cutting out actual content is a disaster. So you bias heavily toward precision over recall.
Herman
That's the right call for almost any content production workflow. The other thing you can do is add a human review step. The pipeline flags the cuts it wants to make, generates a preview, and a human approves or adjusts before the final render. That's what most professional podcast editing tools do — Descript, for example, has a filler word removal feature, but it shows you exactly what it's going to cut before it does it.
Corn
Descript's approach is interesting because they've integrated the whole stack — transcription, disfluency detection, and audio editing — into a single interface. But Daniel's working in code, so he'd be stitching together open-source components.
Herman
Right, and the open-source ecosystem for this is actually quite mature. You've got Whisper or WhisperX for transcription with word-level timestamps. WhisperX is particularly good because it does forced alignment — it takes the Whisper transcript and aligns it precisely to the audio waveform, giving you much more accurate timestamps than Whisper's built-in word timestamps.
Corn
How much more accurate are we talking?
Herman
Whisper's native word timestamps can drift by a hundred to two hundred milliseconds. WhisperX with forced alignment gets that down to around twenty to thirty milliseconds. That's the difference between a clean cut and an audible glitch.
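For reference, a WhisperX transcribe-then-align pass looks roughly like the sketch below. Exact function signatures vary between WhisperX releases, so treat this as an outline rather than copy-paste code.

```python
# Sketch of WhisperX transcription followed by forced alignment.
# Signatures may differ across WhisperX versions; file name is a placeholder.
import whisperx

device = "cpu"  # or "cuda" if a GPU is available
model = whisperx.load_model("large-v2", device)
audio = whisperx.load_audio("prompt.wav")

result = model.transcribe(audio)

# forced alignment snaps each word to the waveform for tighter timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    for w in segment.get("words", []):
        print(w.get("word"), w.get("start"), w.get("end"))
```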
Corn
Twenty milliseconds is below the threshold of human perception for most listeners. So you're getting surgical precision.
Herman
And then for the disfluency detection itself, you've got options beyond BERT. There's been interesting work using encoder-decoder models like T5, where you frame the problem as text-to-text — input the disfluent transcript, output the fluent version. That's appealing because it handles the removal and the re-joining in one step. The model learns not just what to remove but how to smooth over the resulting gaps.
Corn
It's not just deleting words, it's also dealing with the artifacts of deletion. If you cut out a false start mid-sentence, you might need to adjust the surrounding words to make the sentence grammatical.
Herman
"I went to the, uh, I took the train to Boston" — if you just delete "I went to the, uh," you're left with "I took the train to Boston," which is actually perfect. But "I need to, I mean, I want to go" — delete "I need to, I mean," and you get "I want to go," which is also fine. The trick is when the disfluency leaves a grammatical fragment that doesn't connect cleanly to what follows. That's where the text-to-text approach shines — the model can generate a repaired version rather than just a truncated one. Imagine something messier, like "I was going to, what I meant was, the reason I called is." If you just surgically remove the false starts, you might end up with something grammatically incoherent. The T5-style model can actually rewrite the whole thing as "The reason I called is.
Corn
Does that work in practice, or is it still a research thing?
Herman
It works, but it's slower and more computationally expensive than the token classification approach. And for Daniel's use case — cleaning up voice prompts that are a few minutes long — the added complexity probably isn't worth it. The token classification plus timestamp cutting approach gets you ninety percent of the benefit with ten percent of the complexity. Plus, there's a philosophical question: do you want the model rewriting your sentences? Even if it preserves the meaning, it's no longer exactly what you said. For a podcast prompt, that might be fine. For something where precise wording matters, you'd want to be more conservative.
Corn
That's a fair point. There's a line between cleaning up and rewriting, and different use cases draw that line in different places. Let's talk about the other half of what Daniel mentioned — the silence reduction. He said that part's straightforward, and it mostly is, but there's nuance there too.
Herman
Yeah, silence truncation seems trivial until you actually do it. The basic approach is you set a threshold — any silence longer than, say, one point five seconds gets truncated to maybe zero point five seconds. But the devil's in the details. You need to define what "silence" means — it's not actually zero amplitude, because there's always background noise. So you set a decibel threshold. Then you need to decide whether to apply a fade-in and fade-out to avoid clicks at the cut points.
Corn
You need to be careful not to truncate pauses that are actually communicative. A pause before a punchline, a pause for emphasis — those aren't disfluencies, they're features.
Herman
And that's where silence reduction and disfluency detection start to overlap conceptually. Both are about distinguishing between signal and noise, but "noise" in the Shannon sense doesn't always mean "unwanted." Some silence is structural. Some disfluencies are pragmatic. The tricky case is the speaker who pauses to think but doesn't fill the pause with "um." That silence is functionally equivalent to a filled pause — it's a hesitation — but a simple silence trimmer has no way to distinguish it from a deliberate rhetorical pause. You almost need the disfluency model and the silence trimmer to talk to each other.
Corn
There's a parallel here to what linguists call "planned versus unplanned discourse." Planned discourse — a prepared speech, a written essay read aloud — has very few disfluencies. Unplanned discourse — conversation, improvised remarks — is full of them. And the disfluencies aren't bugs, they're evidence of the cognitive process of formulating speech in real time.
Herman
Herb Clark at Stanford did foundational work on this in the nineties. He argued that disfluencies are actually collaborative signals. When a speaker says "uh" or repeats a word, they're signaling to the listener that they're having trouble formulating the next part of the utterance, and they're requesting patience. Listeners unconsciously use these signals to adjust their processing. There's been eye-tracking research showing that listeners are faster to identify a target object when the speaker's disfluency precedes a reference to something new or unexpected.
Corn
The "uh" is actually priming the listener. It's saying "the next thing I'm going to say is going to require a bit more cognitive work on your end, so get ready.
Herman
And if you strip all of those out, you're removing the conversational equivalent of turn signals. The listener has to work harder to follow along because they've lost those subtle cues about where the speaker is going. There was a really elegant study where they had participants follow instructions to move objects around on a screen. When the instruction contained a disfluency before a new or unexpected object — like "put the, uh, the candle next to the vase" — participants were faster to locate the candle than when the instruction was perfectly fluent. The disfluency acted as an attentional cue.
Corn
That's fascinating, and it makes intuitive sense. We've all had the experience of listening to someone who's too polished and realizing we've zoned out because there were no handholds for our attention.
Herman
It's like a perfectly smooth wall — nothing to grip onto. The small imperfections in speech give the listener's brain something to synchronize with.
Corn
Which circles back to the uncanny valley problem. A completely disfluency-free recording sounds wrong not because disfluencies are inherently good, but because their absence violates our expectations about how unplanned speech works.
Herman
This is where I think Daniel's instinct is right — he's not trying to remove every single disfluency. He's trying to clean up the ones that are distracting. The difference between "I was, um, thinking about, uh, the, the pipeline" and "I was thinking about the pipeline" is significant. The first one is hard to listen to. But "So, um, here's the thing about disfluency detection" — that one "um" is barely noticeable, and removing it might make the sentence sound clipped.
Corn
The art is in knowing which ones to cut. And that's where a confidence threshold plus human review really pays off.
Herman
Let me mention something else that's relevant here. There's been interesting work on disfluency detection in multilingual contexts. The Switchboard corpus is English-only, but disfluency patterns vary across languages. In Japanese, for example, the filler "eto" and "ano" function similarly to "um" and "uh" in English, but they appear in different syntactic positions. In Hebrew, the filler "eh" is extremely common and appears in places where English speakers might use a pause instead. If Daniel's recording prompts in multiple languages — which I know he sometimes does — that adds another layer of complexity.
Corn
Does the BERT-based approach generalize across languages, or do you need language-specific models?
Herman
Multilingual BERT handles some of it, but the performance drops noticeably outside of English. For Hebrew specifically, there are dedicated models trained on Hebrew conversational data, but they're less mature than the English equivalents. The dataset sizes are smaller, the annotation quality is more variable. If Daniel's primarily recording in English, the off-the-shelf tools will work great. For Hebrew, he might need to do some fine-tuning himself. The good news is that the architecture transfers — it's the training data that's the bottleneck, not the model design.
Corn
That's a practical consideration. But let's zoom out for a second — I want to talk about the broader implications of this technology beyond Daniel's pipeline.
Herman
Go for it.
Corn
We're heading toward a world where AI-generated speech and human speech are increasingly hard to distinguish. Disfluency detection and generation are two sides of the same coin. On one side, you've got tools that remove disfluencies from human speech to make it sound more polished. On the other side, you've got text-to-speech models that are getting better at inserting naturalistic disfluencies to make synthetic speech sound more human.
Herman
ElevenLabs and similar platforms are already doing this. Their more advanced models will occasionally insert subtle pauses, micro-hesitations, and even the occasional filled pause if you prompt them to sound conversational. It's not perfect yet, but it's improving fast. And the fascinating thing is they're essentially training on the same kind of data that disfluency detectors are trained on. They're learning the distribution of where disfluencies naturally occur and sampling from that distribution.
Corn
We're converging from both directions. Human speech is being cleaned up, synthetic speech is being roughed up, and eventually they meet in the middle and you can't tell which is which.
Herman
Which brings us back to the email problem Daniel mentioned at the start. The absence of disfluency as a tell that something is AI-generated. Once both human and synthetic speech occupy the same middle ground, that tell disappears.
Corn
That's not necessarily a bad thing. For a podcast like ours, where the content is what matters, who cares if the audio has been cleaned up or if some of it was generated. But for things like scam calls, deepfake audio, fake evidence — the stakes get higher.
Herman
There's already research on using disfluency patterns as a biometric marker. The specific pattern of your "um"s and "uh"s — their duration, their pitch contour, where they appear in sentences — is surprisingly individual. Some researchers have shown they can identify speakers with reasonable accuracy just from their disfluency patterns. It's like a fingerprint made of hesitation.
Corn
Removing disfluencies is also removing a kind of audio fingerprint.
Herman
Though for Daniel's use case, that's not really a concern. He's not trying to anonymize himself, he's trying to sound more coherent. But it does raise interesting questions about what happens when everyone's speech gets run through the same cleanup pipeline. Do we all end up sounding slightly more alike?
Corn
That's a whole other episode, I think. Let's get practical again. If Daniel were sitting here right now and wanted a step-by-step recommendation for building this into his pipeline, what would you tell him?
Herman
I'd say start with WhisperX for transcription and forced alignment. It'll give you word-level timestamps with good accuracy. Then use the bert-disfluency-detector from Hugging Face for the detection step. Set a high confidence threshold — I'd suggest starting at ninety-eight percent and tuning from there. Use FFmpeg to do the actual audio cutting based on the timestamps of the flagged tokens. Wrap the whole thing in a Python script, and add an optional preview step where you can listen to the cuts before committing them.
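Pulled together, Herman's recommendation might look roughly like the skeleton below. The helper functions are illustrative stand-ins for the transcription, tagging, and cutting sketches earlier on this page, not functions from a published library, and the label name is a placeholder.

```python
# Skeleton of the recommended pipeline: transcribe + align, tag disfluencies,
# keep only high-confidence flags, then cut. The helpers are illustrative
# wrappers around the earlier sketches, not a real package.
def clean_recording(src_wav, dst_wav, threshold=0.98, preview=True):
    words = transcribe_with_whisperx(src_wav)           # [(word, start, end), ...]
    tags = tag_disfluencies([w for w, _, _ in words])   # one {label, score} per word
    flagged = [(start, end)
               for (word, start, end), tag in zip(words, tags)
               if tag["label"] != "FLUENT" and tag["score"] >= threshold]
    if preview:
        # optional human-review step: show the cuts before committing to them
        for start, end in flagged:
            print(f"would cut {start:.2f}s-{end:.2f}s")
        return flagged
    remove_spans(src_wav, dst_wav, flagged, total_dur=audio_duration(src_wav))
```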
Corn
For the silence reduction, since he said that's straightforward — any gotchas there?
Herman
Use pydub's silence detection, set the minimum silence length to around one point two seconds, truncate to about zero point four seconds, and always apply a five-millisecond fade to avoid clicks. That'll handle ninety-five percent of cases cleanly.
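Those numbers translate into pydub fairly directly. A sketch follows; the -40 dBFS silence threshold is an added assumption, since Herman doesn't specify one.

```python
# Sketch of the silence-reduction step: detect pauses longer than ~1.2 s,
# shorten them to ~0.4 s, and fade the joins by 5 ms to avoid clicks.
# The -40 dBFS threshold is an assumption, not a value from the episode.
from pydub import AudioSegment
from pydub.silence import detect_silence

audio = AudioSegment.from_file("prompt.wav")
silences = detect_silence(audio, min_silence_len=1200, silence_thresh=-40)  # ms, dBFS

out = AudioSegment.empty()
cursor = 0
for start, end in silences:
    if start > cursor:
        # keep the speech up to the pause, fading the join by 5 ms
        out += audio[cursor:start].fade_out(5)
    # replace the long pause with a shortened 400 ms pause
    out += audio[start:start + 400].fade_in(5).fade_out(5)
    cursor = end
out += audio[cursor:].fade_in(5)
out.export("prompt_trimmed.wav", format="wav")
```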
Corn
That's specific.
Herman
It's what the research suggests, and I've tested it. Anything shorter and you risk an audible pop on some playback systems. Anything longer and you start eating into the surrounding speech. You'd be surprised how much of a consonant can live in five milliseconds — especially plosives like "p" and "t." You don't want to fade into the middle of a "p."
Corn
The full pipeline is WhisperX to transcribe, BERT to detect disfluencies, FFmpeg to cut, pydub to handle silence. All open source, all running locally.
Herman
All of it can be orchestrated with a single Python script. Daniel could probably build the whole thing in an afternoon. The hardest part is getting the dependencies installed, honestly. WhisperX has some particular requirements around CUDA versions if you're using GPU acceleration. I'd recommend setting aside an hour just for environment setup, and then the actual coding is maybe a hundred lines of Python.
Corn
If he wants to go even simpler, he could use Descript or a similar tool and just not have the programmatic control. But knowing Daniel, he wants the programmatic control.
Herman
He definitely wants the programmatic control. And for good reason — once you've got this working as a script, you can integrate it into the automated podcast production pipeline. Every time he records a prompt, it gets cleaned up automatically before it ever reaches us. That's the dream. You hit record, you ramble for five minutes, you hit stop, and by the time you've poured your coffee, the cleaned-up version is sitting in your output folder.
Corn
There's one more thing I want to touch on before we wrap up. We've been talking about disfluency removal as a post-processing step, but there's also the question of whether you can reduce disfluencies at the source — during recording.
Herman
That's a whole different skillset, and it's more about performance and practice than technology. Professional speakers and broadcasters train themselves to reduce filled pauses. They replace "um" with silence — a deliberate pause rather than a filler. It sounds more authoritative and it's easier to edit around. A silent pause can be trimmed cleanly. A filled pause leaves you with a vocalization you have to decide what to do with.
Corn
Some of the tools we're talking about can actually help with that training. If you run disfluency detection on your recordings and see the patterns — where you tend to insert "um," what triggers a false start — you can become more aware of it and work on those specific patterns. It's like getting a heat map of your own speech habits.
Herman
It's like having a speech coach in your pipeline. The detection doesn't just enable removal, it enables self-improvement. Over time, you need the removal less because you're producing fewer disfluencies in the first place. I've seen people do this with their own podcast recordings — they run the detector, notice they say "um" before every new topic transition, and then consciously practice pausing silently instead. After a few weeks, the disfluency count drops by half.
Corn
Which is probably the best outcome. The technology helps you become a better speaker, not just a better editor.
Herman
That's a nice place to land. The tools serve the human, not the other way around.
Corn
I think we've given Daniel a pretty thorough answer. Disfluency detection models — how they work, what tooling exists, how to build the pipeline, where the gotchas are. And we've touched on the bigger picture of what it means when human and synthetic speech converge.
Herman
The one thing I'd add is that this field is moving fast. The models I mentioned today are state of the art as of now, but there are papers coming out every few months with improvements. The Switchboard benchmark is a moving target. If Daniel's building this into a long-term pipeline, he should plan to swap in updated models periodically. Maybe even set up a benchmark harness where he can test new models against a small set of his own annotated recordings and see if the upgrade is worth it.
Corn
And now — Hilbert's daily fun fact.
Herman
Now: Hilbert's daily fun fact.

Hilbert
The average cumulus cloud weighs approximately one point one million pounds — roughly the same as a hundred elephants floating above your head.
Corn
I mean, I understand the physics — water vapor, condensation, distributed mass — but a hundred elephants? Just hovering there?
Herman
A hundred elephants worth of water just drifting around up there. I'll never look at a fluffy cloud the same way again. Every time I see one now, I'm going to mentally stack elephants.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you want more episodes, head over to myweirdprompts. We'll be back soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.