#hallucinations

17 episodes

#2213: Grading the News: Benchmarking RAG Search Tools

How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.

rag benchmarks hallucinations

#2190: Simulating Extreme Decisions With LLMs

LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.

large-language-models ai-safety hallucinations

#2186: The AI Persona Fidelity Challenge

Advanced LLMs dominate benchmarks but fail at staying in character, especially when asked to play morally complex or antagonistic roles.

ai-safety ai-alignment hallucinations

#2129: Building the Anti-Hallucination Stack

Stop hoping your AI doesn't lie. We explore the shift to deterministic guardrails, specialized judge models, and the tools making agents reliable.

ai-agents hallucinations rag

#2046: AI Hallucinations Are Just How Brains Work

We asked an AI to curate films about AI and reality, exploring the psychedelic overlap between machine hallucinations and human perception.

hallucinations generative-ai ai-ethics

#2007: AI Grading AI: The Snake Eating Its Tail

We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.

llm-as-a-judge hallucinations ai-ethics

#1959: How Constrained AI Models Handle the Unexpected

Your AI assistant promised to only use your documents. Instead, it invented case law that doesn't exist. Here's why.

ai-agents rag hallucinations

#1932: How Do You QA a Probabilistic System?

LLMs break traditional testing. Here’s the 3-pillar toolkit teams use to catch hallucinations and garbage outputs at scale.

ai-agents ai-safety hallucinations

#1914: Google Invented RAG's Secret Sauce

Before LLMs, Google solved the "hallucination" problem with a two-stage trick that's making a huge comeback.

rag hallucinations re-ranking

#1762: Testing AI Truthfulness: Beyond Vibes

Stop trusting confident AI. We explore the formal science of testing LLMs for hallucinations and knowledge cutoffs.

ai-safety hallucinations prompt-engineering

#1735: The Agentic Stone Age: A Retrospective

We revisit the chaotic rise of BabyAGI and AutoGPT, exploring why their promise of total autonomy led to spectacular failure.

ai-agents hallucinations agentic-workflows

#1636: Agent Interview: Grok 4.1 Fast

Can Elon Musk’s newest AI model handle a time-traveling toaster, or is it just a glorified search bar with an attitude?

ai-agents prompt-engineering hallucinations

#1579: Weird AI Experiment: The Compliment Battle

What happens when two top-tier AI models are forced to out-compliment each other? Witness a chaotic, heartwarming battle of cosmic proportions.

prompt-engineering conversational-ai hallucinations

#1568: Is Your AI Listening or Just Lip-Reading?

Is Gemini a brilliant audio engineer or just a talented lip-reader? Explore the "signal vs. symbol" gap in AI audio processing.

multimodal-ai audio-processing hallucinations

#136: The Ghost in the Machine: Why AI Voices Hallucinate

Why does your AI suddenly start shouting or whispering like Darth Vader? Herman and Corn dive into the glitchy world of TTS hallucinations.

text-to-speech hallucinations autoregressive-models audio-glitches latent-space

#116: The Science of Lazy Prompting: Why AI Still Gets You

Ever wonder why AI understands your messy typos? Explore how models "denoise" chaotic input through tokenization and semantic context.

prompt-engineering large-language-models hallucinations

#83: Echoes in the Machine: When AI Talks to Itself

What happens when two AIs talk forever with no human input? Herman and Corn explore the weird world of digital feedback loops.

model-collapse semantic-bleaching ai-conversations digital-feedback-loops ai-safety