#speech-recognition
42 episodes
#3854: From Coos to Conversation: Baby's Hidden On-Ramp
How do babies go from babbling to real back-and-forth dialogue? The hidden architecture of early conversation.
#3493: Murmuring Scriptures and Wandering Wilds: Ancient Meditation
How "hagah" (murmuring scripture) and "hitbodedut" (wilderness solitude) reveal meditation hidden in the Bible.
#3443: What Makes a Pediatrician's Diagnostic Skill Unique
How pediatricians diagnose without a patient's own account of symptoms, reading cries, body language, and parent-child dynamics.
#3363: Why the Teletubbies Sun-Baby Makes Infants Cry
Teletubbies was engineered for pre-verbal brains. Here's why adult discomfort is a feature, not a bug.
#2801: Why Baby Babble Sounds Like Foreign Languages
Your baby isn't speaking Korean — but here's why the overlap isn't a coincidence.
#2754: Why Your Dictation Setup Might Be Wrong
Modern ASR is shockingly robust. The biggest predictor of accuracy? How well your audio matches its training data.
#2643: How Stenographers Type 300 Words Per Minute
Court reporters don’t type letters—they chord syllables at 300 words per minute. Here’s how it works and why AI can’t replace them yet.
#2618: Text Normalization's Hidden Complexity
How to handle acronyms in text-to-speech pipelines using BERT models, lexicons, and layered preprocessing.
#2590: The Uncanny Valley of Clean Speech
How transformer models distinguish "um" from meaningful speech — and why removing too much makes you sound like a robot.
#2582: What Your Browser Does to Mic Audio Before It Reaches Your Server
getUserMedia returns audio, but not raw audio. Here's what browsers actually do to your mic feed before it hits your server.
#2563: How Audio Fingerprinting Actually Works
Spectrogram peaks, constellation maps, and hash matching — the elegant mechanics behind identifying any song in seconds.
#2543: Why Base64 Adds 33% Overhead (And Why You Still Need It)
Base64 isn’t compression — it’s a safe transport encoding. Here’s how it works with audio APIs and where its limits are.
#2510: The Design That Makes Voice Agents Tolerable
Drive-thru accuracy, healthcare triage, and the design secret that makes people *want* to talk to a machine.
#2486: Why Noise Reduction Can Ruin Transcription Accuracy
Cleaning audio before transcription can increase errors by up to 46%. Here's the right approach for your voice app.
#2479: The Screaming Baby Stress Test
Choosing the right headset and control method for dictation when you're holding a baby who won't stop screaming.
#2443: How Podcast RSS Feeds Can Speak Every Language
One RSS feed, a transcript tag, and TTS voice cloning — the emerging standard for letting any podcast speak any language.
#2337: When Diarization Fails Silently
Discover how pyannote and other tools tackle the critical task of identifying "who spoke when" in audio, and why it's harder than it sounds.
#2311: Danish AI: Bridging the Localization Gap
How does AI handle Danish? Explore the challenges and progress in making AI tools work for speakers of smaller languages.
#2288: The Invisible Gatekeeper of Voice Tech
How voice activity detection shapes every step of the voice tech pipeline, and why it’s harder than it seems.
#2272: The AI Transcription Sweet Spot
Does higher-quality audio make AI transcription worse? New research reveals a surprising "sweet spot" for bitrate, challenging a core assumption of...
#2192: How We Built a Podcast Pipeline
Hilbert reveals the complete technical architecture behind 2,000+ episodes—from voice memos to GPU-powered TTS, with Claude models, LangGraph workf...
#2183: Making Voice Agents Feel Natural
Turn-taking, interruptions, and latency are destroying voice AI UX—and the fixes are deeply technical. Here's what's actually happening underneath.
#2027: The Missing Photoshop for Words
Why is editing text with AI so clunky? We explore the "TITO" paradigm—using small, local models for fast, private text transformation.
#1752: Whisper Small Beats Whisper Large in Speed & Accuracy
A 4GPU benchmark on Ubuntu shows the 1.5B-parameter Whisper Large is slower and less accurate than the much smaller Whisper Small.