#hallucinations
17 episodes
#2213: Grading the News: Benchmarking RAG Search Tools
How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.
#2190: Simulating Extreme Decisions With LLMs
LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.
#2186: The AI Persona Fidelity Challenge
Advanced LLMs dominate benchmarks but fail at staying in character—especially when asked to play morally complex or antagonistic roles. What does t...
#2129: Building the Anti-Hallucination Stack
Stop hoping your AI doesn't lie. We explore the shift to deterministic guardrails, specialized judge models, and the tools making agents reliable.
#2046: AI Hallucinations Are Just How Brains Work
We asked an AI to curate films about AI and reality, exploring the psychedelic overlap between machine hallucinations and human perception.
#2007: AI Grading AI: The Snake Eating Its Tail
We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.
#1959: How Constrained AI Models Handle the Unexpected
Your AI assistant promised to only use your documents. Instead, it invented case law that doesn't exist. Here's why.
#1932: How Do You QA a Probabilistic System?
LLMs break traditional testing. Here’s the 3-pillar toolkit teams use to catch hallucinations and garbage outputs at scale.
#1914: Google Invented RAG's Secret Sauce
Before LLMs, Google solved the "hallucination" problem with a two-stage trick that's making a huge comeback.
#1762: Testing AI Truthfulness: Beyond Vibes
Stop trusting confident AI. We explore the formal science of testing LLMs for hallucinations and knowledge cutoffs.
#1735: The Agentic Stone Age: A Retrospective
We revisit the chaotic rise of BabyAGI and AutoGPT, exploring why their promise of total autonomy led to spectacular failure.
#1636: Agent Interview: Grok four point one Fast
Can Elon Musk’s newest AI model handle a time-traveling toaster, or is it just a glorified search bar with an attitude?
#1579: Weird AI Experiment: The Compliment Battle
What happens when two top-tier AI models are forced to out-compliment each other? Witness a chaotic, heartwarming battle of cosmic proportions.
#1568: Is Your AI Listening or Just Lip-Reading?
Is Gemini a brilliant audio engineer or just a talented lip-reader? Explore the "signal vs. symbol" gap in AI audio processing.
#136: The Ghost in the Machine: Why AI Voices Hallucinate
Why does your AI suddenly start shouting or whispering like Darth Vader? Herman and Corn dive into the glitchy world of TTS hallucinations.
#116: The Science of Lazy Prompting: Why AI Still Gets You
Ever wonder why AI understands your messy typos? Explore how models "denoise" chaotic input through tokenization and semantic context.
#83: Echoes in the Machine: When AI Talks to Itself
What happens when two AIs talk forever with no human input? Herman and Corn explore the weird world of digital feedback loops.