#hallucinations

20 episodes

#2736: Why AI Flagged Your Em Dash

Punctuation isn't a fixed system handed down by grammarians. It's a two-thousand-year story of contraction, invention, and now AI suspicion.

ai-detection hallucinations cultural-bias

Apr 25

#2412: When AI Caves: Progressive vs. Regressive Sycophancy

Why do LLMs agree with you even when you're wrong? We break down the SycEval benchmark and the 78% persistence problem.

ai-safety ai-alignment hallucinations

Apr 25

#2404: What Tool-Calling Benchmarks Miss About Production Failures

BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.

ai-agents benchmarks hallucinations

Apr 14

#2213: When Ground Truth Moves Hourly

How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.

rag benchmarks hallucinations

Apr 12

#2190: Simulating Extreme Decisions With LLMs

LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.

large-language-models ai-safety hallucinations

Apr 12

#2186: The AI Persona Fidelity Challenge

Advanced LLMs dominate benchmarks but fail at staying in character—especially when asked to play morally complex or antagonistic roles. What does t...

ai-safety ai-alignment hallucinations

Apr 9

#2129: Shifting Left on Hallucinations

Stop hoping your AI doesn't lie. We explore the shift to deterministic guardrails, specialized judge models, and the tools making agents reliable.

ai-agents hallucinations rag

Apr 5

#2046: The Cinema of Constructed Reality

We asked an AI to curate films about AI and reality, exploring the psychedelic overlap between machine hallucinations and human perception.

hallucinations generative-ai ai-ethics

Apr 4

#2007: AI Grading AI: The Snake Eating Its Tail

We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.

llm-as-a-judge hallucinations ai-ethics

Apr 3

#1959: How Constrained AI Models Handle the Unexpected

Your AI assistant promised to only use your documents. Instead, it invented case law that doesn't exist. Here's why.

ai-agents rag hallucinations

Apr 2

#1932: How Do You QA a Probabilistic System?

LLMs break traditional testing. Here’s the 3-pillar toolkit teams use to catch hallucinations and garbage outputs at scale.

ai-agents ai-safety hallucinations

Apr 2

#1914: Google Invented RAG's Secret Sauce

Before LLMs, Google solved the "hallucination" problem with a two-stage trick that's making a huge comeback.

rag hallucinations re-ranking

Mar 29

#1762: Testing AI Truthfulness: Beyond Vibes

Stop trusting confident AI. We explore the formal science of testing LLMs for hallucinations and knowledge cutoffs.

ai-safety hallucinations prompt-engineering

Mar 29

#1735: The Agentic Stone Age: A Retrospective

We revisit the chaotic rise of BabyAGI and AutoGPT, exploring why their promise of total autonomy led to spectacular failure.

ai-agents hallucinations agentic-workflows

Mar 28

#1636: The Mosh Pit Model: Can Chaos Train a Better Storyteller?

Can Elon Musk’s newest AI model handle a time-traveling toaster, or is it just a glorified search bar with an attitude?

ai-agents prompt-engineering hallucinations

Mar 26

#1579: When AI Flattery Breaks Reality

What happens when two top-tier AI models are forced to out-compliment each other? Witness a chaotic, heartwarming battle of cosmic proportions.

prompt-engineering conversational-ai hallucinations

Mar 26

#1568: The Signal Versus Symbol Gap

Is Gemini a brilliant audio engineer or just a talented lip-reader? Explore the "signal vs. symbol" gap in AI audio processing.

multimodal-ai audio-processing hallucinations

Jan 2

#136: The Ghost in the Machine: Why AI Voices Hallucinate

Why does your AI suddenly start shouting or whispering like Darth Vader? Herman and Corn dive into the glitchy world of TTS hallucinations.

text-to-speech hallucinations autoregressive-models audio-glitches latent-space

Dec 28

#116: How AI Deciphers Your Typo-Ridden Prompts

Ever wonder why AI understands your messy typos? Explore how models "denoise" chaotic input through tokenization and semantic context.

prompt-engineering large-language-models hallucinations

Dec 23

#83: Echoes in the Machine: When AI Talks to Itself

What happens when two AIs talk forever with no human input? Herman and Corn explore the weird world of digital feedback loops.

model-collapse semantic-bleaching ai-conversations digital-feedback-loops ai-safety