#multimodal-ai
17 episodes
#1964: AI Glasses That See Through Your Eyes
See a 3D arrow pointing to the exact bolt you need, or read a street sign translated in real time.
#1792: Google's Native Multimodal Embedding Kills the Fusion Layer
Google's new embedding model maps text, images, audio, and video into a single vector space, cutting latency by 70%.
#1724: YouTube's Invisible AI Dubbing Machine
How does YouTube translate a video with one click? We explore the tech behind auto-dubbing, from sandwich models to voice cloning.
#1592: Mastering Embedding Models: From Gemini 2 to Vector Debt
Stop treating embedding models like plumbing. Learn how to navigate vector debt, multimodal retrieval, and database configuration for RAG.
#1586: Whiteboard Notebooks: Bridging the Pen and AI
Bridge the gap between handwritten notes and AI. Discover the best whiteboard notebooks and markers for seamless digital transcription.
#1568: Is Your AI Listening or Just Lip-Reading?
Is Gemini a brilliant audio engineer or just a talented lip-reader? Explore the "signal vs. symbol" gap in AI audio processing.
#1564: Why AI is Trading Transcripts for Raw Audio
Forget basic transcription. Explore how native omnimodal models are capturing the "soul" of speech with near-instant latency.
#1482: The Multimodal Shift: Navigating the New Vector Landscape
From Matryoshka models to multimodal search, discover how the fundamental units of AI memory are being optimized for efficiency and scale.
#1085: The Tokenization Lie: How AI Actually Processes Media
Think 1,000 tokens equals 750 words? For audio and video, that rule is a lie. Discover the hidden math behind multimodal AI.
#786: Mastering the Hoard: AI-Powered Inventory Management
Learn how to manage thousands of parts without losing your mind using AI, QR codes, and professional logistics strategies.
#769: The Living Manual: AI and AR for High-Tech Repairs
Discover how AI and spatial computing are turning complex hardware repairs into real-time, interactive experiences.
#749: Breaking the Fourth Wall: Moving to Real-Time AI Audio
Can AI podcasts move from polished scripts to raw, real-time conversation? Explore the technical and financial shift to live multimodal models.
#132: Can AI Map Your House Just by Looking Around?
Discover how spatial-temporal tokenization and 3D world modeling are revolutionizing real-time video-to-video AI interaction.
#64: AI's Senses: Seeing, Hearing, Understanding
AI is evolving beyond text, learning to see, hear, and understand our world. Discover the future of human-AI interaction!
#54: Tokenizing Everything: How Omnimodal AI Handles Any Input
Omnimodal AI: How do models process images, audio, video, and text all at once? Discover the engineering behind AI that accepts anything.
#53: Instructional vs. Conversational AI: The Distinction Nobody Talks About
Instructional vs. conversational AI: a crucial distinction reshaping how AI is built. Discover why it matters for the future of AI development.
#46: Pixels, Prompts & Pseudo-Text: AI's Word Problem
AI paints stunning images, but can't spell "cat." Why do advanced models struggle with simple text? Dive into AI's weird word problem!