#llm-as-a-judge

6 episodes

Beyond cheaper tokens: how batch inference changes AI workflows, and why instructional models beat conversational ones for automated jobs.

Why most benchmark claims in AI are statistically indefensible — and what to do about it.

We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.

Traditional benchmarks can't measure tone or empathy. Here's how to evaluate whether an AI model truly "gets it right."

Running the same LLM on different GPUs can produce different results. Here’s why that happens and how to test for it.

Can a robot tell if you’re human? Herman and Corn explore the "Reverse Turing Test" and why being "messy" might be our best defense.
