So Daniel sent us this one, and it's a question that's been nagging at anyone who follows AI development seriously. He's asking about AI evaluations — the benchmarks used to measure large language model performance. How did they become the dominant way we talk about AI progress? What do they actually measure, and what do they miss? And maybe most importantly: when a lab announces their new model just crushed the competition on some benchmark, should we believe them? There's a real tension here between evaluation as a scientific tool and evaluation as a marketing instrument. Good place to dig in.
And it's one of the messiest corners of the field. I mean, I've read probably a dozen papers in the last few months specifically on benchmark contamination and evaluation methodology, and the more you read, the more you realize the whole ecosystem is kind of held together with tape.
Tape and press releases.
Ha. Not wrong. So let me just orient people on what we're actually talking about. The classic benchmarks — MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, HumanEval — these emerged over roughly a five-year window, most of them between 2019 and 2022. The idea was straightforward: you need a standardized way to compare models, so you build a test set, you run every model against it, you report accuracy. Clean, reproducible, comparable.
Which sounds totally reasonable until you think about it for more than thirty seconds.
Right, and the problems are layered. The first and most obvious one is contamination. MMLU, which stands for Massive Multitask Language Understanding, has around fourteen thousand questions across fifty-seven subject areas. It was released publicly. The training data for most frontier models is scraped from the internet. You do the math.
The test answers are on the internet.
The test answers, the questions, discussions of the questions, tutoring forums where people work through the exact problems. There was a paper — I want to say from the University of Washington and Allen AI, around 2023 — that found statistically significant evidence of benchmark contamination in several major models. The methodology was clever: they looked for whether models performed dramatically better on questions that appeared verbatim online versus paraphrased versions. And the gap was substantial.
So the model hasn't learned chemistry. It's learned that particular chemistry question.
Which is the core problem. And labs know this. Some of them now do internal contamination checks, but the incentive structure cuts the other way. If your model scores ninety-two on MMLU and your competitor scored eighty-nine, you're not rushing to publish a paper saying "actually we think some of that gap might be contamination."
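The verbatim-versus-paraphrase probe described here is simple to express. A minimal sketch, assuming a hypothetical `ask_model` callable and question dicts with `"text"` and `"answer"` keys (not the actual methodology of any specific paper):

```python
# Sketch of a contamination probe: compare accuracy on questions that
# appear verbatim online against paraphrased versions of the same questions.
# `ask_model` is a hypothetical callable returning the model's answer string.

def accuracy(ask_model, questions):
    """Fraction of questions the model answers correctly."""
    correct = sum(1 for q in questions if ask_model(q["text"]) == q["answer"])
    return correct / len(questions)

def contamination_gap(ask_model, verbatim, paraphrased):
    """A large positive gap suggests memorization rather than ability.

    A model that genuinely knows the material should score similarly on
    both sets; a model that memorized the test scores high only on verbatim.
    """
    return accuracy(ask_model, verbatim) - accuracy(ask_model, paraphrased)
```

A model that has "learned that particular chemistry question" shows a large gap; a model that learned chemistry shows a gap near zero.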
By the way, today's episode is being generated by Claude Sonnet four point six. Just flagging that. Carry on.
Ha, noted. So contamination is problem one. Problem two is what I'd call construct validity — whether the benchmark actually measures what it claims to measure. Take HellaSwag, which tests commonsense reasoning by asking models to pick the most plausible continuation of a scenario from four options. The task sounds natural, but the wrong answers were generated adversarially by a model, which means models can learn to detect artifacts of the generation process rather than actually reasoning about the scenario.
They're pattern-matching the wrongness of wrong answers.
Which is a very different skill than understanding context. And this came up explicitly in the original HellaSwag paper — the authors were actually documenting how hard it was to build a benchmark that models couldn't game, and the paper itself is a kind of arms-race diary. Every time they made the distractors harder, models found new shortcuts.
So the benchmark designers are in this perpetual cat-and-mouse with the models they're trying to evaluate.
And increasingly losing that race. GSM8K is a good example of the other end of the spectrum. It's eight thousand grade-school math problems, and when it launched it was hard for models — GPT-3 scored around thirty-five percent. Within two years, frontier models were above ninety percent. The benchmark saturated.
Which is actually a success story in one sense. Models got better at grade-school math.
It is! But now the benchmark tells you almost nothing about differentiating models at the frontier. You've got GPT-4, Claude, Gemini, all bunched above ninety, and the benchmark can't distinguish between them. So you need harder benchmarks. And then those saturate. And you keep escalating.
It's a treadmill. You're always running to stay in the same place.
MATH and MATH-500 were the next escalation — competition math, AMC and AIME level problems. Then FrontierMath, which involves research-level mathematics problems that were unsolved or at least not publicly worked through. The idea being that if the problems don't exist on the internet yet, you can't contaminate them.
That's a clever approach. Though presumably once you publish FrontierMath, it's on the internet, and you're back to square one for the next generation of models.
Which is exactly the arms race. And there's a deeper issue here that I think gets underappreciated, which is that these benchmarks test isolated capabilities. You give a model a multiple-choice question, it picks A, B, C, or D. That's not how anyone actually uses these models. The gap between "can answer a chemistry multiple-choice question" and "can help a chemist debug a synthesis procedure" is enormous, and MMLU doesn't bridge it.
So we've been measuring the wrong things, or at least incomplete things, this whole time.
Partially the wrong things. MMLU and ARC and HellaSwag were useful for catching major capability differences, especially in the earlier years. The problem is the field kept using them past their expiration date and started treating high scores as proof of general intelligence rather than proof of "good at this specific test."
Which is a distinction that matters enormously if you're a lab trying to deploy these models in the real world.
And it matters for users trying to make decisions. If you're choosing between models for, say, a customer service application, knowing that one scores ninety-three on MMLU and another scores ninety-one is almost useless information. What you actually want to know is: how do they perform on realistic customer queries, with tool use, with retrieval, with multi-turn dialogue?
Enter the newer generation of benchmarks.
LMSYS Chatbot Arena is probably the most interesting methodological departure. Instead of a fixed test set with right and wrong answers, it's a continuous crowdsourced evaluation where real users submit prompts, two models respond anonymously, and users pick which response they prefer. You accumulate enough of these pairwise comparisons and you get an Elo ranking. It's the same system chess uses.
Which is elegant because it's measuring something you actually care about — which model do people find more useful — rather than which model got more multiple-choice questions right.
And it's contamination-resistant by construction, because the prompts are live and diverse and coming from actual users. The problem is it has its own biases. Users tend to prefer longer, more confident-sounding answers. They prefer formatting. There are studies showing that adding bullet points and headers can improve Arena ratings even when the underlying content is identical.
So the benchmark adapts to measure something real, and then models learn to optimize for the superficial features of what humans say they prefer.
It's turtles all the way down. And the Elo system has another issue: the rating depends heavily on which other models you're being compared against. If the Arena population is mostly weak models, your Elo looks great. As stronger models enter, your rating can drop without your actual performance changing.
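The Elo mechanics being described are easy to state. A minimal sketch of the classic online update (the Arena's actual rankings use a more involved statistical fit over all comparisons, but the intuition is the same; the K-factor of 32 is a conventional choice, not the Arena's):

```python
# Classic online Elo update for pairwise comparisons.

def expected_score(r_a, r_b):
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, a_score, k=32):
    """Return new ratings after one comparison.

    a_score is 1.0 if A won, 0.0 if B won, 0.5 for a tie. A rating only
    moves by how much the outcome differed from expectation -- which is
    why a rating depends heavily on the strength of the opponent pool.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (a_score - e_a), r_b + k * ((1 - a_score) - (1 - e_a))
```

Beating an evenly-rated opponent moves both ratings by half the K-factor; beating a much weaker one moves them barely at all, which is the pool-dependence problem in miniature.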
There's also a question of who the users are. If the Arena is dominated by English-speaking tech workers in their twenties, the preferences you're measuring are those preferences, not some universal notion of quality.
Which is a real limitation. There's been work on multilingual Arena variants, but the English-language bias in the core rankings is substantial. And it affects which models look good — models that are strong on English creative and technical tasks do well, models that are strong on, say, low-resource languages or domain-specific professional tasks get undercounted.
Okay, so we've got the classic benchmarks that are contaminated and saturated, and the crowdsourced approach that has its own systematic biases. What's the current frontier for people actually trying to do this rigorously?
A few different threads. One is agentic evaluation — instead of asking a model to answer a question, you give it a task that requires multiple steps, tool use, error correction. SWE-bench is the canonical example: you give the model a GitHub issue and a codebase, and it has to write a patch that makes the failing tests pass. That's a real software engineering task with an objective outcome.
And you can't contaminate it easily because the answer isn't a multiple-choice letter, it's working code that either passes the tests or doesn't.
Right, the evaluation is executable. Though contamination is still possible in a different form — if a specific repository's issues are in the training data, the model might have seen the solution. SWE-bench had some controversy around this when researchers found that certain repositories appeared in common training datasets.
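Executable evaluation can be illustrated at toy scale. A sketch, with a candidate "patch" as a string of source code and the grading done by actually running tests (real harnesses like SWE-bench's run a repository's full test suite in a sandboxed environment, not `exec` in-process):

```python
# Executable evaluation in miniature: a candidate solution passes only if
# the accompanying tests run clean. The "answer" is verified by execution,
# not by matching a multiple-choice letter.

def passes_tests(candidate_source, test_source):
    """Exec the candidate, then the tests, in a shared namespace.

    Returns True only if both execute without raising -- i.e. the
    candidate code exists, compiles, and satisfies the assertions.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
        return True
    except Exception:
        return False
```

Two patches that look superficially similar get cleanly separated by whether the tests pass, which is the property that makes this style of benchmark harder to game.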
Of course they did.
But the methodology is much more resistant than MMLU-style benchmarks. And it's measuring something closer to real capability. When SWE-bench launched, the best models were solving maybe four percent of problems. We're now seeing numbers above fifty percent on the verified subset for the leading models. That's a meaningful signal of genuine progress.
What about evaluations for the things that are harder to measure? Reasoning about ethics, handling sensitive topics, calibration — knowing what you don't know?
TruthfulQA was an early attempt at measuring this — does the model say true things, and does it avoid confidently stating falsehoods? The results were humbling. GPT-3 was around fifty-eight percent truthful, which is worse than just saying "I don't know" to everything. But TruthfulQA has its own issues: the questions were human-curated and reflect particular assumptions about what counts as a misconception.
And there's something philosophically tricky about evaluating truthfulness by checking answers against a ground truth that the evaluators decided on.
The calibration problem is actually one I find interesting, because it's not just about whether a model gets the right answer, it's about whether its expressed confidence matches its actual accuracy. A well-calibrated model that says "I'm seventy percent confident" should be right about seventy percent of the time when it says that. Most models are overconfident, which is a serious problem for high-stakes use.
And measuring calibration requires you to get the model to express probabilities, which most chat interfaces don't surface.
It's a real deployment gap. The evaluations that researchers run and the conditions under which users actually interact with models are quite different. You can measure logit probabilities in a research setting; your average user is not reading logit probabilities.
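The calibration idea in this exchange has a standard quantitative form: expected calibration error. A sketch, assuming you can extract per-answer confidences (from logits or verbalized estimates) paired with correctness:

```python
# Expected calibration error (ECE): bin predictions by stated confidence,
# then compare average confidence to actual accuracy within each bin.
# `preds` is a list of (confidence, was_correct) pairs -- a hypothetical
# input format for this sketch.

def expected_calibration_error(preds, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # Weight each bin's confidence-accuracy gap by its share of predictions.
        ece += (len(b) / len(preds)) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated model, in the "seventy percent confident, right seventy percent of the time" sense, drives this number toward zero; an overconfident one inflates it.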
Let's talk about the gaming problem more directly. Because there's a version of this where benchmark optimization is just... rational behavior for a lab. You know you'll be evaluated on these metrics. You train to improve those metrics. That's not deception, that's just responding to incentives.
And it's not always distinguishable from genuine improvement. If a model improves on GSM8K because it got better at math, great. If it improved because the training data was enriched with GSM8K-style problems, that's more ambiguous. If it improved because the test set leaked into training, that's a problem. But from the outside, the score is the same.
Which is why the field has been moving toward held-out evaluations. Some labs now have private test sets they don't release publicly, and they run models against those internally.
Epoch AI and Scale AI both have internal evaluation programs. Metatransparency is the term that's been floated — you don't reveal the specific questions, but you commit to a methodology and have external auditors verify that you're following it. It's not perfect but it's better than self-reported scores on public benchmarks.
Though "external auditor" in AI evaluation is a pretty thin concept right now. The evaluators are often funded by or closely connected to the labs they're evaluating.
That's a fair criticism. HELM, the Holistic Evaluation of Language Models from Stanford, was an attempt to build something more independent — they ran dozens of scenarios across multiple dimensions, not just accuracy but also toxicity, bias, efficiency, robustness. The idea was to make gaming harder because you'd have to optimize across a wide surface simultaneously.
How'd that work out?
HELM is more informative than a single number, but it's also harder to communicate. You end up with a radar chart of twenty metrics and nobody knows how to aggregate them into a purchasing decision. The field has this tension: rigorous evaluation is complex, and complex evaluation doesn't make good headlines.
"Model X achieves state-of-the-art on MMLU" is a much cleaner press release than "Model X has a nuanced profile of strengths and weaknesses across heterogeneous evaluation dimensions."
And the press release version is what gets cited in funding rounds and newspaper articles. So the incentive to produce a single clean number is enormous even when everyone in the field knows that single number is inadequate.
I want to push on something. We've been talking about this as if the problem is primarily technical — better benchmarks, held-out test sets, executable evaluations. But there's a deeper question, which is: what are we actually trying to evaluate for? And who decides?
That's the question that doesn't get asked enough. Most benchmark construction implicitly encodes a set of values. MMLU privileges academic knowledge. Chatbot Arena privileges the preferences of people who use Chatbot Arena. SWE-bench privileges the kind of software engineering that shows up on GitHub. None of these are neutral.
And for certain applications — medical diagnosis, legal analysis, education — the relevant evaluation criteria are quite different from general-purpose benchmarks. A model that aces MMLU might still give dangerous medical advice in ways that MMLU would never catch.
There's been real work on domain-specific evaluation. MedQA, which is based on United States Medical Licensing Examination questions, is one example. LegalBench is another. But these still have the multiple-choice limitation. Passing USMLE-style questions is not the same as being a good clinical reasoner. I know this from my pediatrics days — the exam tests a specific kind of knowledge recall, not the full cognitive process of sitting with an uncertain diagnosis.
And that gap is where the real risk lives. Not in benchmark scores but in the deployment contexts where the benchmark never anticipated.
Which is why the most sophisticated evaluation programs now include red-teaming — adversarial probing specifically designed to find where models break. Anthropic has published extensively on their Constitutional AI approach and their internal safety evaluations. OpenAI has similar programs. But these are internal, they're not standardized, and you can't compare across labs.
So we've got this landscape where the public benchmarks are compromised or saturated, the private benchmarks are not comparable, the crowdsourced evaluations have their own biases, and the domain-specific evaluations don't generalize. Is there a version of this that actually works?
I think the honest answer is: partially, and for specific purposes. If you want to track broad capability trends over time — are models in general getting more capable? — the benchmarks, taken in aggregate, are actually reasonably informative. The noise is real but the signal is there. A model that scores well on MMLU and HumanEval and GSM8K and Arena simultaneously is almost certainly more capable than one that scores poorly on all of them.
So they work as a portfolio but fail as individual tests.
For deployment decisions, the answer is increasingly: run your own evaluations on your own data. If you're building a legal document analysis tool, build a test set of legal documents with known correct outputs, and run every candidate model against that. The generic benchmarks are a starting point for shortlisting, not a final answer.
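"Run your own evaluations" doesn't have to mean elaborate infrastructure. A minimal harness over a task-specific test set — `models` maps names to hypothetical callables, and `grade` is whatever acceptance check fits your task (exact match, a rubric, a judge model):

```python
# Minimal custom-eval harness: run every candidate model against your own
# labeled test set and report per-model scores.

def run_eval(models, test_set, grade):
    """models: {name: callable}; test_set: list of {"input", "expected"} dicts;
    grade: (output, expected) -> bool. Returns {name: fraction passing}."""
    results = {}
    for name, model_fn in models.items():
        correct = sum(
            1 for case in test_set
            if grade(model_fn(case["input"]), case["expected"])
        )
        results[name] = correct / len(test_set)
    return results
```

The point of the structure is that the test set and the grading function are yours, built from your actual task, which is exactly the information a generic benchmark can't give you.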
Which requires resources that most organizations don't have. Running serious evaluations is expensive and requires expertise.
It does, and that's a real access problem. Larger organizations can do this; smaller ones end up relying on the published benchmarks even knowing their limitations. And there's a startup ecosystem growing around this — companies building evaluation-as-a-service products, which is interesting but also creates its own alignment of incentives.
The evaluators need to be paid by someone.
Usually the people being evaluated, which is uncomfortable. Though some academic groups and nonprofits are trying to carve out independent evaluation space. The Allen Institute for AI has done some rigorous independent work. There's been discussion of government-funded evaluation infrastructure, particularly in the United States and European Union, though that's moved slowly.
Let me ask about the capability overhang problem. There's a version of this where benchmarks are actually concealing how capable models are, not just exaggerating it. You could imagine a model that has capabilities that no existing benchmark tests for.
This is real and it's probably underappreciated. The emergence of chain-of-thought reasoning, for example — the ability of models to solve problems better when prompted to think step by step — wasn't captured by standard benchmarks until people specifically designed evaluations to probe for it. The capability existed in the weights before anyone had the right eval to measure it.
So the map is always behind the territory.
Always. And for more exotic capabilities — long-horizon planning, genuine scientific hypothesis generation, the kind of reasoning that might constitute early research-level contributions — we don't really have good benchmarks yet. The benchmarks that exist were designed around capabilities that were already understood.
Which means the published capability rankings might be systematically misleading in both directions. Models look better than they are on things we've benchmarked to death, and we have no idea how they compare on things we haven't figured out how to test yet.
That's not a comfortable place for a field that's trying to make safety-relevant decisions based on capability assessments. If you're trying to determine whether a model has crossed some threshold of autonomous capability that requires additional safety measures, and your evaluation tools are built around known capabilities, you might miss the relevant threshold entirely.
This is where the evaluation problem stops being an academic question and becomes something with real stakes.
The frontier labs are aware of this. The model evaluation work that Anthropic has published around their responsible scaling policies — the idea that certain capability thresholds trigger additional safety requirements — requires them to have evaluations that can actually detect those thresholds. And they've been reasonably transparent about the difficulty of that problem. They don't claim to have solved it.
Okay, practical question. For a listener who is trying to make sense of model announcements and benchmark claims — what should they actually do with this information?
First, always look at what benchmarks are being cited. If a lab announces a new model and the headline number is on a benchmark that's three or four years old and well-known to be saturated, that's a yellow flag. They're choosing to highlight that benchmark for a reason.
Usually because it looks good.
Usually. Second, look for contamination disclosures. Some labs now include contamination analysis in their technical reports. If there's no contamination analysis at all, that's worth noting. Third, look at the breadth of evaluation. A model that scores well on one benchmark might be highly optimized for that specific benchmark. A model that performs consistently across a wide range of diverse evaluations is more credibly capable.
And fourth, I'd add: look at what's being evaluated versus what you actually need. If you're making a decision about a specific use case, the generic benchmarks are mostly irrelevant. You want to see evaluation on something resembling your actual task.
That's the most important one for practitioners. The MMLU score of a model you're deploying to help field customer support tickets is almost entirely uninformative. What you want is: how does this model perform on your customer support tickets?
There's also something to be said for skepticism about the evaluation ecosystem in general. Not nihilism — the benchmarks aren't meaningless — but recognizing that the people publishing the scores have strong incentives to publish scores that look good.
I'd put it this way: treat benchmark scores the way you'd treat a self-reported resume. Informative, worth reading, but verify independently before you rely on it for anything important.
That's probably the most useful frame for this whole conversation.
The deeper thing I keep coming back to is that we're trying to measure something we don't fully understand. What does it mean for a model to be "intelligent" or "capable"? We don't have a settled answer to that question. So our benchmarks are proxies for proxies, and the thing we actually care about keeps retreating as we try to pin it down.
Which is uncomfortable but also kind of the honest state of the field.
It is. And I think the researchers who are most thoughtful about this are the ones who hold their evaluation results loosely. The benchmark score is a data point, not a verdict.
Alright. Practically speaking, where does this leave us? If you had to say what the next meaningful development in evaluation methodology looks like, what would it be?
Honestly, I think it's a combination of things. More executable evaluations like SWE-bench, where the answer can be verified by running code or checking an objective outcome. More longitudinal evaluation — not just "can the model do this task" but "can the model do this task reliably across many attempts and contexts." And more evaluation of the things that actually matter for safety, which is a hard problem but one the field is starting to take seriously.
And probably some institutional infrastructure. Independent evaluation bodies, standardized methodologies, something that doesn't depend on labs grading their own homework.
That would be transformative if it happened. There are early signs — the National Institute of Standards and Technology in the United States has been working on AI evaluation frameworks, the European Union's AI Act has evaluation requirements baked in. Whether those become rigorous or end up as checkbox exercises remains to be seen.
My bet is checkbox exercises, but I'd be happy to be wrong.
You're probably right that the initial versions will be inadequate. But having the institutional expectation at all creates pressure over time. The financial auditing standards we have now look nothing like what existed a hundred years ago.
Fair point. Slow but real progress. Which is kind of the story of evaluation in general.
Yeah. It's a field that's hard, important, and not solved. The honest answer to "can we trust benchmark scores" is: partially, with caveats, and you need to understand the caveats to use the information well.
Which is also the honest answer to most questions in this space, if we're being real about it.
Ha. Unfortunately yes.
Thanks to Hilbert Flumingtop for producing the show, and to Modal for keeping the infrastructure running — serverless GPU compute that makes the whole pipeline possible. Find all two thousand one hundred and sixty-four episodes at myweirdprompts.com. This has been My Weird Prompts. Leave us a review if you want more people to find the show.
See you next time.