#llm-as-a-judge
6 episodes
#2640: Batch Inference Use Cases and Instructional AI in 2026
Beyond cheaper tokens—how batch inference changes AI workflows and why instructional models beat conversational ones for automated jobs.
#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals
Why most benchmark claims in AI are statistically indefensible — and what to do about it.
#2007: AI Grading AI: The Snake Eating Its Tail
We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.
#2006: How Do You Measure an LLM’s “Soul”?
Traditional benchmarks can’t measure tone or empathy. Here’s how to evaluate whether an AI model truly “gets it right.”
#2005: Why Your GPU Changes LLM Output
Running the same LLM on different GPUs can produce different results. Here’s why that happens and how to test for it.
#81: The Reverse Turing Test: Can AI Spot Its Own Kind?
Can a robot tell if you’re human? Herman and Corn explore the "Reverse Turing Test" and why being "messy" might be our best defense.