You're in a car, windows up, AC running, and you say "Hey, start navigation." Nothing happens. You say it again. The third time you practically shout it, and the assistant finally wakes up. That's not a speech recognition problem. That's a VAD problem. The system never even knew you were talking.
That's the part that almost never gets blamed, which is what makes it so frustrating. Voice activity detection is the gatekeeper. Before any transcription, before any language model, before any response generation, something has to decide: is there a human speaking right now, or is that just road noise? Get that wrong and the whole pipeline either mishears you constantly or ignores you entirely.
Daniel sent us this one, and it's a deliberate shift in focus. We've done the mechanics of VAD before, how it works under the hood. Today is the map, not the engine. He wants to know who the players are, what the actual differences are between Silero, WebRTC, Picovoice Cobra, Pyannote, the VADs baked into Whisper wrappers, and the proprietary stuff inside Deepgram, AssemblyAI, Google, and the OpenAI Realtime API. And then there's a second thread: why is VAD, unlike almost every other audio AI task, genuinely happy running on a CPU? What makes these architectures so lean?
By the way, today's script is being written by Claude Sonnet four point six, which feels appropriate for an episode about audio pipelines and inference efficiency.
It does feel on-brand. Anyway, the thing that strikes me about that car scenario is how invisible the failure is. You don't think "my voice activity detector misfired." You think "this thing is useless." VAD is load-bearing infrastructure that nobody sees until it collapses.
The failure modes are asymmetric, which is underappreciated. A false negative, where the system misses your speech, is annoying. A false positive, where it treats background noise as speech and starts transcribing your air conditioning, is potentially catastrophic in a production system. You're burning compute, you're feeding garbage into your ASR model, and in a real-time voice assistant, you're interrupting the user mid-sentence because you thought a truck passing outside was the end of their turn.
The endpointing piece is its own whole disaster category. Because knowing when someone stopped talking is not the same as knowing whether they were talking in the first place.
Completely different problem, and a lot of these systems conflate them or handle them with the same underlying model, which is part of why this field is still unsettled. People assume VAD is a solved problem. It is not. The 2025 Silero benchmark put their model at ninety-five percent accuracy in noisy environments, which sounds impressive until you think about what the other five percent means at scale across millions of voice interactions daily.
WebRTC VAD, which is from 2011, is still the baseline that everything gets compared to. That's not a sign of a solved problem. That's a sign of a field where the defaults are sticky and the improvements are hard to communicate to people who aren't deep in audio engineering.
Which is exactly why this episode exists.
What actually makes it an active research problem rather than just a legacy tooling problem? Because there's a difference. Some fields have old defaults because nobody bothered to replace them. VAD feels like it has old defaults because replacing them is hard.
The legacy stickiness is real, but the hard problems underneath are also real. The core challenge is that speech is not a clean signal. It never was. You've got coarticulation, where sounds blur into each other at the edges of words. You've got prosodic pauses mid-sentence that look like silence to a naive detector. You've got environments that change dynamically within a single utterance. A model that's calibrated for a quiet office falls apart in a kitchen. And the definition of what counts as speech is itself contested. Is a cough speech? A filled pause? A breath before a sentence?
VAD's job description sounds simple until you actually write it down.
Then there's the pipeline role, which is what makes errors so expensive. VAD isn't just a binary classifier sitting at the front of a system. It's setting the tempo for everything downstream. It's deciding what chunks get sent to the ASR model, when the system should be listening versus saving compute, and in real-time applications, it's essentially controlling turn-taking behavior. That's a lot of responsibility for what people dismiss as a preprocessing step.
Which is why the field keeps producing new approaches rather than just incrementally tuning the 2011 baseline. There's genuine pressure from applications that didn't exist when WebRTC VAD was designed. Streaming voice assistants, real-time meeting transcription, low-latency telephony on edge hardware. The use cases have outpaced the old assumptions.
The deployment environment shift is the biggest driver. When VAD was a server-side problem, you could afford more compute. Now people want it on a phone, on a Raspberry Pi, in a hearing aid. The constraints are completely different, and that's pulling the research in directions that wouldn't have been obvious a decade ago. Take WebRTC VAD, for example—it’s a perfect case study in how those constraints shape the solutions.
So let's walk through the players, because the differences are real and they matter. WebRTC VAD is a Gaussian mixture model running on handcrafted features, mostly energy and spectral characteristics. Released by Google in 2011 as part of the WebRTC project. It has three aggressiveness modes, zero through three, where zero is the most permissive and three filters most aggressively. In a quiet room, mode one or two is basically fine. You get low latency, very low CPU overhead, and it just works.
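One concrete detail worth knowing before you touch any frame-based VAD like WebRTC's: it only accepts fixed-size frames of 10, 20, or 30 milliseconds of 16-bit mono PCM. Here's a minimal sketch, in plain Python with no VAD library, of slicing a raw audio stream into those frames. The function name and constants are ours, just for illustration.

```python
# Sketch: slicing a 16 kHz, 16-bit mono PCM stream into the fixed-size
# frames a WebRTC-style VAD expects (10, 20, or 30 ms per frame).
# Pure Python; no VAD library required.

SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit PCM

def frame_generator(pcm: bytes, frame_ms: int = 30):
    """Yield fixed-length frames; a trailing partial frame is dropped,
    which is what most frame-based VADs require."""
    assert frame_ms in (10, 20, 30), "WebRTC VAD only accepts these sizes"
    frame_bytes = SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]

# One second of silence: 16,000 samples at 2 bytes each.
one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
frames = list(frame_generator(one_second, frame_ms=30))
# 30 ms at 16 kHz is 480 samples, i.e. 960 bytes per frame;
# one second yields 33 full frames, and the leftover is dropped.
```

Each frame then goes to the VAD one at a time, which is why the per-frame latency numbers discussed later matter so much.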
The moment you take it outside, though, you start to see the seams. Road noise, wind, cafes, any environment with broadband low-frequency noise, and the spectral features it relies on start to look ambiguous. Mode three, the aggressive setting, will suppress a lot of that background, but it also starts clipping the beginnings and ends of words. You lose onsets. And for endpointing especially, that's a real problem because you're now systematically cutting off the last syllable of sentences.
Which is the tradeoff you can't engineer away with that architecture. It's doing arithmetic on features that were designed in a world of relatively controlled audio.
Silero VAD is the interesting contrast. It's a small recurrent neural network, trained on a large and diverse dataset, and the approach is fundamentally different. Instead of handcrafted spectral features, it's learning representations directly from the raw audio, or very close to it. The 2025 benchmark puts it at ninety-five percent accuracy in noisy environments, and that number holds up in independent testing better than most. What's notable is how it handles overlapping speech. It doesn't do speaker diarization, it's not trying to separate two voices, but it doesn't catastrophically fail the way WebRTC does when two people talk simultaneously. It tends to stay latched on to speech rather than flickering.
That flickering behavior in WebRTC is actually one of the more annoying practical problems. You get these rapid on-off transitions when there's background noise, and then your endpointing logic is trying to make sense of a signal that looks like morse code.
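The standard fix for that morse-code flicker is a hangover counter: once the VAD fires, you keep reporting speech until some number of consecutive non-speech frames has passed. A minimal sketch, with a hypothetical function name and a hangover length chosen purely for illustration:

```python
def smooth(decisions, hangover_frames=8):
    """Debounce raw per-frame VAD decisions: once speech is detected,
    keep reporting speech until `hangover_frames` consecutive
    non-speech frames have passed. Suppresses rapid on-off flicker."""
    out, silence_run = [], hangover_frames  # start in the "silent" state
    for is_speech in decisions:
        if is_speech:
            silence_run = 0
        else:
            silence_run += 1
        out.append(silence_run < hangover_frames)
    return out

# Flickery raw output: brief dropouts inside speech, then real silence.
raw      = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
smoothed = smooth(raw, hangover_frames=4)
# The dropouts inside the speech region are bridged; speech only ends
# after four consecutive silent frames.
```

The tradeoff is latency on the trailing edge: a longer hangover means cleaner segments but a slower "speech ended" signal, which feeds directly into the endpointing problem.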
Silero's chunk-based processing, which works on thirty-millisecond windows, helps with that. The RNN carries state across chunks, so it has a short-term memory of what it just heard. That continuity smooths out a lot of the flicker.
Picovoice Cobra is a different category of thing, though. It's not open source, it's licensed, and it's explicitly designed for edge deployment. What's their pitch?
The pitch is deterministic low-latency performance on constrained hardware with a calibrated confidence score rather than a binary decision. Most VADs give you speech or not speech. Cobra gives you a probability between zero and one on every frame, which lets the application layer set its own threshold. That's actually a meaningful design choice because different applications have different cost functions. A voice assistant where false positives are catastrophic wants a high threshold. A transcription tool where you'd rather over-include can set it lower.
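What you typically do with a per-frame probability stream like Cobra's is apply hysteresis: one threshold to enter the speaking state and a lower one to leave it, so a probability hovering near a single cutoff doesn't chatter. This is a generic sketch in plain Python, not Picovoice's API; the names and threshold values are ours.

```python
def hysteresis_gate(probs, on_threshold=0.7, off_threshold=0.4):
    """Turn per-frame voice probabilities (the kind a confidence-based
    VAD emits) into binary decisions with hysteresis: speech begins
    only above on_threshold and ends only below off_threshold."""
    speaking, out = False, []
    for p in probs:
        if speaking and p < off_threshold:
            speaking = False
        elif not speaking and p > on_threshold:
            speaking = True
        out.append(speaking)
    return out

# Probabilities dip to 0.5 mid-utterance but never cross 0.4,
# so the gate stays open until the stream genuinely goes quiet.
probs = [0.1, 0.8, 0.6, 0.5, 0.45, 0.3, 0.2]
gated = hysteresis_gate(probs)
```

This is exactly the "different cost functions" point in practice: a cautious voice assistant raises `on_threshold`, a greedy transcription tool lowers it, and the VAD itself never has to know.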
They've demonstrated it running on a Raspberry Pi without breaking a sweat, which is the edge device benchmark that gets cited a lot.
Sub-millisecond latency per frame on a Pi 4, which is impressive. The tradeoff is the licensing model and the fact that you're in their ecosystem. You can't inspect the model, you can't fine-tune it on your own data.
Pyannote is a different beast entirely. It's not really trying to be a low-latency VAD in the same sense. It's a segmentation model.
Right, the framing is different. Pyannote Segmentation three point zero is doing something closer to full audio scene understanding. The diarization error rate on structured audio is around sixteen point six percent, which sounds worse than the VAD accuracy numbers until you realize it's solving a harder problem. It's not just detecting speech, it's labeling who is speaking and when, with a ten-second sliding window. The weights are about six megabytes, which is tiny for what it does. But the offline orientation means it's not built for real-time streaming the way Silero or Cobra are.
The window size alone tells you something. Ten seconds of latency is fine for transcribing a recorded meeting. It is not fine for a voice assistant where someone is waiting for a response.
That's where the Whisper-adjacent VADs come in, because they're occupying a middle space. Faster-whisper ships with a VAD filter that's essentially a bundled Silero instance. You enable it and it pre-segments the audio before feeding it to the ASR model. The benefit is you're not running Whisper on silence, which is where a lot of the wasted compute goes. The four to six times CPU speedup they claim over standard Whisper is partly the model quantization and partly the VAD filtering eliminating dead frames.
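The pre-segmentation step is conceptually simple: collapse per-frame decisions into speech segments and throw away anything too short to be real speech, so the ASR model only ever sees candidate utterances. A sketch of that shape, with names and durations that are ours rather than faster-whisper's internals:

```python
def speech_segments(decisions, frame_ms=30, min_speech_ms=90):
    """Collapse per-frame speech/non-speech decisions into
    (start_ms, end_ms) segments, discarding runs shorter than
    min_speech_ms. This is the shape of pre-segmentation a VAD
    filter performs before handing audio to an ASR model."""
    segments, start = [], None
    for i, is_speech in enumerate(decisions + [False]):  # sentinel flush
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if (i - start) * frame_ms >= min_speech_ms:
                segments.append((start * frame_ms, i * frame_ms))
            start = None
    return segments

# 600 ms of speech survives; a 60 ms blip is discarded as noise.
decisions = [False]*10 + [True]*20 + [False]*5 + [True]*2 + [False]*10
segs = speech_segments(decisions)
```

Everything outside the returned segments is audio Whisper never runs on, which is where the claimed speedup over transcribing the raw stream comes from.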
WhisperX takes a similar approach but leans harder into the preprocessing, right? It's doing word-level alignment on top of the VAD segmentation.
And the inference speedup numbers vary a lot depending on the audio content. Dense speech gets less benefit than audio with a lot of silence. Some of the third-party tooling around it claims thirty-five to over a hundred percent faster inference, but that range is wide enough that it's really a function of your specific audio characteristics.
Then there's the proprietary layer, which is where things get harder to evaluate because you're mostly reasoning from benchmarks rather than architecture details.
Deepgram Nova-3 and AssemblyAI Universal-2 both have VAD baked in as part of their streaming ASR pipeline. Sub three hundred millisecond latency, good noise robustness, and they've both clearly invested in domain-specific vocabulary handling which indirectly helps VAD because if your language model priors are better, your confidence on speech boundaries improves. The thing you can't easily do is decouple their VAD from their ASR to evaluate it independently.
The OpenAI Realtime API is the interesting outlier here because the public benchmark data is less flattering. The FLEURS multilingual benchmark puts it behind ElevenLabs Scribe and several others on word error rate, and there's reason to think some of that is VAD-level issues rather than pure transcription quality.
It's hard to attribute precisely, which is part of the problem with evaluating proprietary systems. You see the output quality, you don't see where in the pipeline the error originated. And Google's VAD, which is embedded in their Speech-to-Text and Chirp models, is similarly opaque. Strong performance on clean audio, less clear on how gracefully it degrades in difficult acoustic conditions.
The pattern across all of these is that the open-source tools give you transparency and control at the cost of integration work, and the proprietary ones give you convenience and often strong baseline performance—but you're flying blind on what's actually happening inside. That tradeoff applies to hardware too, especially when we talk about CPUs.
And I’d argue the CPU question is the most underappreciated part of this whole landscape.
It does seem like the obvious follow-up. You've got these systems ranging from a fifteen-year-old Gaussian mixture model to small transformers to RNNs, and almost all of them run fine on CPU hardware. That's unusual for audio AI. Most of what we talk about in this space needs a GPU or it's not viable.
The reason is architectural, and it comes down to what these models are actually doing computationally. A full ASR model like Whisper is doing sequence-to-sequence learning across long context windows. You're running attention mechanisms over hundreds or thousands of tokens, and the matrix multiplications stack up fast. That's where GPUs win, because they're built for massively parallel floating point operations on large tensors. VAD is not doing that. A small RNN like Silero is processing thirty millisecond chunks sequentially, carrying a hidden state forward. The computation per chunk is tiny. There's no large matrix multiplication bottleneck.
The chunk size is doing a lot of work there. Thirty milliseconds of audio at sixteen kilohertz is four hundred and eighty samples. That's the entire input.
Four hundred and eighty numbers. And then you're running a relatively shallow network over them. The hidden state in Silero's RNN is small enough that the entire model fits comfortably in L2 cache on a modern CPU core. When your working set fits in cache, memory bandwidth stops being the bottleneck, and CPUs become competitive.
WebRTC is even more extreme. There's no neural network at all. It's digital signal processing, energy computation, a Gaussian mixture model with fixed parameters. The compute per frame is almost negligible.
Which is why it still has a place in environments where you need essentially zero overhead. An always-on microphone on a battery-powered device, something like a hearing aid or a wearable sensor, WebRTC VAD is burning so little power that it barely registers in your energy budget. Compare that to running even a quantized Whisper model for continuous ASR, and you're looking at orders of magnitude difference in energy consumption.
The Picovoice Cobra numbers on a Raspberry Pi are a good concrete anchor for this. Sub-millisecond per frame on a Pi 4 is not a marketing claim, it's a consequence of the architecture. A Pi 4 has a one-and-a-half gigahertz ARM Cortex-A72 with no GPU to speak of. The fact that Cobra runs in real time on that hardware tells you something structural about what VAD requires computationally.
The Pi is a useful benchmark because it's roughly analogous to a lot of real edge deployment targets. Smart speakers, industrial sensors, embedded systems in vehicles. These devices have some CPU headroom but no dedicated neural processing unit in many cases. VAD being CPU-native means it drops into those environments without requiring hardware changes.
There's also a latency argument that's separate from raw compute. GPU inference has overhead from memory transfers and kernel launch times. For a model this small, that overhead can actually dominate the inference time. You'd spend more time moving data to the GPU than doing the computation.
That's a real effect. Silero's benchmarks show that on small chunk sizes, CPU inference is actually faster than GPU inference because you're not paying the transfer penalty. The crossover point where GPU wins only happens when you batch a lot of audio together, which defeats the purpose of low-latency real-time detection.
The architecture isn't just CPU-friendly by accident. It's CPU-optimal for the specific task structure.
The DSP hybrid approaches take this even further. Some of the embedded VADs use a pipeline where you do energy thresholding in pure DSP first, and only if that passes do you invoke the neural component. It's a cascade. You're spending neural compute only on audio that has already cleared a cheap heuristic filter. In a typical environment, that might mean the neural model runs on maybe twenty percent of frames.
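The cascade idea fits in a few lines. Here's a sketch with an RMS energy gate in front of a stand-in "neural" model; the threshold value is illustrative and would be tuned per device, and the neural component here is just a counter so you can see how rarely the expensive path fires.

```python
import array
import math

ENERGY_THRESHOLD = 500.0  # RMS gate; illustrative value, tune per device

def frame_rms(pcm_frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit mono PCM frame."""
    samples = array.array("h", pcm_frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def cascade_vad(pcm_frame: bytes, neural_vad) -> bool:
    """Cheap-then-expensive cascade: the DSP energy check runs on every
    frame; the neural model (any callable returning bool) runs only on
    frames loud enough to plausibly contain speech."""
    if frame_rms(pcm_frame) < ENERGY_THRESHOLD:
        return False                  # cheap rejection, no neural call
    return neural_vad(pcm_frame)      # expensive path, rarely taken

# A stand-in neural model that records how often it is actually invoked.
calls = []
def fake_neural(frame):
    calls.append(1)
    return True

quiet = array.array("h", [10] * 480).tobytes()     # low-energy frame
loud  = array.array("h", [4000] * 480).tobytes()   # high-energy frame
results = [cascade_vad(f, fake_neural) for f in (quiet, quiet, loud)]
# Two quiet frames rejected for free; the neural model ran exactly once.
```

In a mostly quiet environment, the expensive branch becomes the rare case, which is the whole point of the cascade.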
Which is an elegant design. You're not trying to make the neural model fast enough for everything. You're making it so the neural model rarely needs to run.
This is where the implications for edge AI broadly get interesting, because VAD is sort of a proof of concept for a design philosophy that doesn't get applied as often as it should. Task-specific architectures that are sized for their actual computational requirements rather than general-purpose models that happen to be fine-tuned. The field keeps reaching for transformer-based everything, and sometimes a small RNN with careful feature engineering is just the right tool.
Herman's been saying this about inference workloads for a while now.
I have, and VAD is one of the cleaner examples of it working out. The accuracy numbers for Silero, ninety-five percent in noise, are competitive with or better than approaches that use far more compute. You don't always need more parameters. You need the right inductive biases for the problem.
The practical upshot for someone deploying this is that VAD is one of the few components in a voice pipeline where you don't need to provision GPU infrastructure. You can run it on the same CPU that's handling your application logic, and in many cases that's exactly what you should do.
The caveat being that if you're already running GPU inference for your ASR model, you might as well use one of the GPU-optimized VAD options and keep everything on the same device to avoid data movement. The architecture decision follows from your broader infrastructure, not just the VAD requirements in isolation.
Which brings us to the practical question: how do you choose between these options? The landscape we've mapped out has a lot of possibilities, and the right answer isn't the same for every use case.
And the first question to ask yourself is whether you're in a streaming context or a batch context, because that cuts the decision tree roughly in half. Batch processing, recorded audio, offline transcription, Pyannote is suddenly much more viable. You can absorb that ten-second window latency and you get useful segmentation and diarization on top of VAD. But if someone is talking to your application right now and waiting for a response, that window closes immediately.
Within the streaming side, the next cut is whether you're on constrained hardware. If you're deploying to something Raspberry Pi-class or below, Cobra is the most proven option at that end of the spectrum, with the caveat that you're accepting a licensing cost. If you want to stay open source on constrained hardware, Silero is the call. The ninety-five percent noise accuracy holds up and the CPU footprint is small enough to share a core with other processes.
For most developers building a voice feature into a web or server application, I'd actually push back against jumping straight to a proprietary API. The instinct is to reach for Deepgram or AssemblyAI because the integration is fast and the baseline performance is good. But you're coupling your VAD decision to your ASR vendor, and you lose the ability to tune or swap components independently. Starting with Silero as a standalone VAD that feeds into whatever ASR you want gives you a lot more flexibility.
The integration pattern that tends to work well is treating VAD as a preprocessing gate. You run it continuously on your audio stream, and only when it signals speech onset do you open the buffer to your ASR model. The framing is: VAD is cheap, ASR is expensive, so let the cheap thing protect the expensive thing.
One thing that catches people is endpointing configuration. Most VAD libraries expose a speech-end threshold, a silence duration after which the model declares the utterance finished. The default values are usually tuned for clean lab audio. In a real environment with hesitations and background noise, you often need to extend that threshold or you'll clip the end of sentences. It's worth spending an hour with real audio from your deployment environment just tuning that one parameter before you do anything else.
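The logic behind that threshold is just a consecutive-silence counter, which makes it easy to replay real audio's VAD decisions through different settings and see what gets clipped. A sketch with hypothetical names and durations:

```python
def utterance_end(decisions, frame_ms=30, end_silence_ms=600):
    """Return the frame index at which the utterance is declared
    finished: the first point where end_silence_ms of consecutive
    non-speech has elapsed after some speech. Returns None if the
    speaker never pauses that long. Raising end_silence_ms is the
    usual fix when real-world hesitations get clipped."""
    needed = end_silence_ms // frame_ms
    silence_run, heard_speech = 0, False
    for i, is_speech in enumerate(decisions):
        if is_speech:
            heard_speech, silence_run = True, 0
        elif heard_speech:
            silence_run += 1
            if silence_run >= needed:
                return i
    return None

# A mid-sentence hesitation of 10 frames (300 ms) survives a 600 ms
# threshold but gets cut off by a 240 ms one.
decisions = [True]*5 + [False]*10 + [True]*5 + [False]*30
patient = utterance_end(decisions, end_silence_ms=600)  # fires in real silence
hasty   = utterance_end(decisions, end_silence_ms=240)  # fires mid-hesitation
```

Running exactly this kind of replay against an hour of audio from your actual deployment environment is the cheapest tuning work in the whole pipeline.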
Which is a very Herman tip. Spend an hour with real audio.
It's where most integration problems live. People test on clean recordings and then wonder why their production system keeps cutting off users mid-sentence. The other thing worth doing early is logging VAD decisions alongside your ASR output. When your transcription quality drops, you want to know whether the VAD is the culprit before you start blaming the language model.
The open-source first principle here is solid. WebRTC VAD is worth understanding even if you end up not using it, because it gives you a baseline that costs you nothing and runs anywhere. Silero is probably the right default for most production use cases today. Then you escalate to proprietary options if you have a specific requirement those don't meet, rather than starting there and working backward.
The escalation path is usually noise robustness in an extreme environment, or you need the VAD and ASR deeply integrated for latency reasons you can't close any other way. Both are real cases. They're just not the common case.
Where does this all go? Because the trajectory feels like it's accelerating.
The honest answer is that transformer-based VADs are coming, and some are already here in limited deployment. The question is whether they displace the RNN and DSP approaches or just occupy a different tier. My instinct is the latter. The efficiency argument for small RNNs doesn't disappear just because transformers get better. If anything, the pressure on edge devices increases as more voice features move to hardware without cloud connectivity.
The overlapping speech problem is the one I keep coming back to. The diarization error rates we talked about, sixteen-point-six percent on structured audio for Pyannote, those numbers get worse in group conversations. And group conversations are where a lot of the interesting real-time voice applications are headed. Multi-party meetings, ambient AI assistants, collaborative tools. The VAD layer has to get significantly better at separating concurrent speakers before those use cases really work.
That's probably where the next meaningful architectural shift happens. Not just detecting speech versus silence but attributing speech to sources in real time, which is a much harder problem. There are research groups working on joint VAD and diarization models that do both in a single pass. Whether those stay CPU-friendly is an open question.
The implication for real-time voice applications broadly is that VAD quality becomes a ceiling. You can have the most accurate ASR model available and the most capable language model behind it, and if the VAD is clipping utterances or firing on noise, the whole experience degrades. The gatekeeper metaphor runs deep.
And I think the field is starting to take that seriously in a way it didn't five years ago. VAD used to be the thing you bolted on and forgot about. Now people are publishing dedicated benchmarks, the Silero team is putting out noise-specific accuracy numbers, the proprietary vendors are competing on endpointing quality. That's a sign of a maturing problem space, not a solved one.
Good place to land. Big thanks to Hilbert Flumingtop for producing this one. And Modal is keeping our GPU inference running smoothly so we can spend our time thinking about the stuff that actually runs on CPUs. This has been My Weird Prompts. If the episode was useful, a review on Spotify goes a long way.
Until next time.