#llm-as-a-judge
6 episodes
#2640: Why Instructional Models Beat Conversational for Batch AI
Beyond cheaper tokens: how batch inference changes AI workflows, and why instructional models beat conversational ones for automated jobs.
#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals
Why most benchmark claims in AI are statistically indefensible, and what to do about it.
#2007: AI Grading AI: The Snake Eating Its Tail
We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.
#2006: How Do You Measure an LLM's "Soul"?
Traditional benchmarks can't measure tone or empathy. Here's how to evaluate if an AI model truly "gets it right."
#2005: Beyond Vibes: The Hard Science of LLM Evaluation
Running the same LLM on different GPUs can produce different results. Here’s why that happens and how to test for it.
#81: When AI Judges Can't Tell Humans from Bots
Can a robot tell if you’re human? Herman and Corn explore the "Reverse Turing Test" and why being "messy" might be our best defense.