Audio & Speech

Speech recognition, TTS, voice cloning, audio engineering

42 episodes

The technology of voice and sound. From text-to-speech systems and voice cloning to speech recognition and audio engineering, this channel covers the cutting edge of how machines learn to speak, listen, and sound convincingly human.

#2781: When Voice AI Sounds Too Real

Voice AI platforms now let you simulate background noise, hesitation, and natural conversation — and that's a problem.

voice-cloning, ai-ethics, financial-fraud

#2754: Why Your Dictation Setup Might Be Wrong

Modern ASR is shockingly robust. The biggest predictor of accuracy? How well your audio matches its training data.

automatic-speech-recognition, speech-recognition, audio-processing

#2707: The Perfect Dictation Trigger: Foot Pedals vs USB Buttons

Foot pedals, USB buttons, and under-desk macro pads for voice dictation — a deep dive into the hardware that makes AI dictation work.

ergonomics, audio-engineering, hardware-engineering

#2618: Fixing Acronyms in TTS Pipelines

How to handle acronyms in text-to-speech pipelines using BERT models, lexicons, and layered preprocessing.

text-to-speech, speech-recognition, audio-processing

#2602: Mastering Spoken Word Audio with AI Agents

How to use AI for podcast mastering — and why agentic AI works better for small tasks than big promises.

audio-engineering, conversational-ai, ai-agents

#2591: Can You Swap Our Podcast Voices?

How dynamic voice replacement could let listeners choose who narrates each host's lines.

voice-cloning, text-to-speech, audio-processing

#2590: How Disfluency Detection Models Clean Up Speech

How transformer models distinguish "um" from meaningful speech — and why removing too much makes you sound like a robot.

speech-recognition, audio-processing, automatic-speech-recognition

#2582: What Your Browser Does to Mic Audio Before It Reaches Your Server

getUserMedia returns audio, but not raw audio. Here's what browsers actually do to your mic feed before it hits your server.

audio-processing, speech-recognition, browser-audio-pipeline

#2563: How Audio Fingerprinting Actually Works

Spectrogram peaks, constellation maps, and hash matching — the elegant mechanics behind identifying any song in seconds.

audio-processing, signal-processing, speech-recognition

#2543: Base64 for Audio: What Developers Need to Know

Base64 isn’t compression — it’s a safe transport encoding. Here’s how it works with audio APIs and where its limits are.

audio-engineering, speech-recognition, api-integration

#2512: How Speech-to-Speech Models Eliminate the Robot Voice

Why AI voice agents sound robotic, and how natively integrated speech-to-speech models fix it.

speech-to-speech, audio-processing, latency

#2510: Where Voice AI Actually Works (Not Cold Calls)

Drive-thru accuracy, healthcare triage, and the design secret that makes people *want* to talk to a machine.

voice-first, accessibility, speech-recognition

#2486: Why Noise Reduction Can Ruin Transcription Accuracy

Cleaning audio before transcription can increase errors by up to 46%. Here's the right approach for your voice app.

speech-recognition, audio-processing, automatic-speech-recognition

#2479: Hands-Free Dictation with a Screaming Baby

Choosing the right headset and control method for dictation when you're holding a baby who won't stop screaming.

speech-recognition, voice-first, diy

#2443: How Podcast RSS Feeds Can Speak Every Language

One RSS feed, a transcript tag, and TTS voice cloning — the emerging standard for letting any podcast speak any language.

speech-recognition, voice-cloning, audio-processing

#2337: How Speaker Diarization Powers Everything From Call Centers to Courts

Discover how PyAnnote and other tools tackle the critical task of identifying "who spoke when" in audio—and why it’s harder than it sounds.

audio-processing, speech-recognition, automatic-speech-recognition

#2311: Danish AI: Bridging the Localization Gap

How does AI handle Danish? Explore the challenges and progress in making AI tools work for small-language populations.

speech-recognition, text-to-speech, large-language-models

#2288: The Invisible Gatekeeper of Voice Tech

How voice activity detection shapes every step of the voice tech pipeline, and why it’s harder than it seems.

speech-recognition, audio-processing, edge-computing

#2272: The AI Transcription Sweet Spot

Does higher-quality audio make AI transcription worse? New research reveals a surprising "sweet spot" for bitrate, challenging a core assumption of...

speech-recognition, audio-processing, ai-training

#2183: Making Voice Agents Feel Natural

Turn-taking, interruptions, and latency are destroying voice AI UX—and the fixes are deeply technical. Here's what's actually happening underneath.

speech-recognition, conversational-ai, latency