Daniel sent us this one — he wants us to get into the weeds on statistical rigor in LLM evaluations. Why your supposedly meaningful benchmark win might be nothing more than noise. He's asking us to cover power analysis, the right statistical tests for paired evaluations, bootstrapped confidence intervals, why almost every two-point accuracy claim is within noise, and the math underneath Chatbot Arena rankings. Basically, he wants us to explain why most of what you read in model release blog posts about benchmark scores is statistically indefensible. And he wants us to be unflinching about it.
Oh, this is my kind of episode. And fun fact — DeepSeek V four Pro is writing our script today, so let's see if it can keep up with the statistics.
Bold of you to challenge the script-writing model on statistical rigor, Herman.
I'm not challenging it, I'm setting expectations. I've been reading model evaluation papers for years, and the state of public benchmarking is honestly embarrassing. There was a review published last year that found eighty-four percent of LLM benchmark papers don't report any statistical significance testing at all. These are papers making claims about which model is better, and they're not even doing basic statistical checks.
When OpenAI or Anthropic or Google puts out a blog post saying their new model beats the previous one by two points on MMLU, the first question should be — was that two-point difference actually detectable?
The answer, more often than not, is no. Let's start with the fundamentals. When you're comparing two models on a benchmark, you have a set of test questions, you run both models on those same questions, and you count how many each gets right. The question is whether any difference you observe is real or just sampling noise.
Sampling noise here means what? If I flip a fair coin a hundred times, I might get fifty-three heads. That doesn't mean the coin is biased.
That's exactly the right intuition. Benchmarks are samples from some larger population of possible questions. Even if two models are truly equal in capability, you'll rarely get identical scores on a finite test set. The question is how large a difference needs to be before you can confidently say it's not just random fluctuation.
Walk me through power analysis. Daniel specifically asked about how many samples you need to detect a two-point accuracy difference.
Suppose you have two models, and you suspect one is genuinely two percentage points better. How many test questions do you need to reliably detect that? The answer depends on your desired statistical power, your significance threshold, and, for paired evaluations, on how often the two models actually disagree. For a two-point difference with typical disagreement rates, you need somewhere between fifteen hundred and three thousand test items to have an eighty percent chance of detecting it at standard significance levels.
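To make that concrete, here is a minimal Python sketch of the power calculation, using a standard normal-approximation formula for paired proportions. The ten percent disagreement rate is an assumed figure you would need to estimate from pilot data, and the z-values are hard-coded for a two-sided alpha of 0.05 and eighty percent power:

```python
from math import sqrt

def mcnemar_sample_size(diff=0.02, discordance=0.10,
                        alpha_z=1.96, power_z=0.8416):
    """Approximate number of paired test items needed for McNemar's test
    to detect a true accuracy gap of `diff`, assuming the two models
    disagree on a fraction `discordance` of items (normal approximation).
    alpha_z and power_z are the z-values for two-sided alpha = 0.05
    and power = 0.80."""
    n = (alpha_z * sqrt(discordance)
         + power_z * sqrt(discordance - diff**2))**2 / diff**2
    return int(n) + 1

print(mcnemar_sample_size())                  # ~2,000 items at 10% disagreement
print(mcnemar_sample_size(discordance=0.20))  # roughly doubles at 20%
```

The required sample size scales with the disagreement rate, which is why two models that mostly agree can be compared on fewer items than two models that disagree often.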
Most benchmarks people use for these comparisons — how many questions do they actually have?
MMLU has about fourteen thousand questions across all subjects, but individual subject categories might have only a hundred or two hundred. HellaSwag has about ten thousand. GSM8K has about thirteen hundred test questions. So for some of these, you might actually have enough items. But here's the problem — people don't use the full benchmark. They report aggregate scores, then slice and dice. They'll say model A beats model B on the chemistry subset of MMLU by three points. That subset might have only a hundred and fifty questions.
Which is nowhere near enough to detect a two or three point difference with any confidence.
Let me give you a concrete example. Suppose two models both have a true accuracy of seventy percent on some task. You test them on a hundred questions. The standard error on that accuracy estimate is about four point six percentage points. Your observed scores could easily differ by five or six points just by chance, even though the models are identical in capability.
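That noise-floor claim is easy to check by simulation. This sketch scores two identical hypothetical models independently on a hundred questions; real paired evaluations would show somewhat smaller gaps when the models' errors are correlated, which only strengthens the point:

```python
from math import sqrt
import random

random.seed(0)
p, n = 0.70, 100
se = sqrt(p * (1 - p) / n)           # standard error of one model's accuracy
print(f"SE of one score: {se:.3f}")  # about 0.046, i.e. 4.6 points

# Simulate two *identical* models scored on 100 questions each:
# how often do their observed scores differ by 5+ points purely by chance?
trials = 20_000
big_gaps = 0
for _ in range(trials):
    a = sum(random.random() < p for _ in range(n))
    b = sum(random.random() < p for _ in range(n))
    if abs(a - b) >= 5:
        big_gaps += 1
print(f"P(gap >= 5 points): {big_gaps / trials:.2f}")  # roughly half the time
```

Two truly equal models land five or more points apart on a hundred-question eval about half the time, which is exactly why a two-point "win" at that scale means nothing.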
When someone reports a two-point win on a hundred-question eval, they're reporting something well within the noise floor.
And this gets worse when you consider that people often run multiple benchmarks and cherry-pick the ones where their model looks good. If you run twenty benchmarks with the usual five percent significance threshold, you'd expect about one to show a statistically significant difference just by chance, even if the models are identical.
The multiple comparisons problem. File drawer effect, p-hacking — all the classic sins of sloppy science, now running rampant in AI.
What makes it especially egregious is that these are major labs with enormous compute budgets. They could run proper statistical analyses. They choose not to, or they relegate them to an appendix nobody reads. The blog post headline says "our new model outperforms the competition on twelve out of fifteen benchmarks," and the asterisk about confidence intervals is buried on page forty-seven of the technical report.
Alright, let's get into the specific tests Daniel mentioned. McNemar's test. Why is that the right one for paired evaluations, and not a chi-square?
This is where the structure of the data matters. When you evaluate two models on the same set of prompts, you have paired observations. For each question, you know whether model A got it right, whether model B got it right, and critically, whether they agreed or disagreed. A standard chi-square test treats the samples as independent, which they're not — you're testing the same questions on both models.
McNemar's test accounts for that correlation.
McNemar's test focuses only on the discordant pairs — the questions where one model got it right and the other got it wrong. If model A and model B are equally capable, you'd expect the number of questions where A wins and B loses to be roughly equal to the number where B wins and A loses. McNemar's test checks whether the observed split among those discordant pairs is statistically unusual.
You're ignoring all the questions where both models agree — both right or both wrong — because those don't tell you anything about which model is better.
And this is important because on many benchmarks, the agreement rate is very high. Models might agree on eighty or ninety percent of questions. If you used a chi-square test that treats the samples as independent, you'd be massively overstating your effective sample size and getting false positives all over the place.
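Here is a minimal, dependency-free sketch of the exact McNemar test — a two-sided binomial test on the discordant pairs. The counts in the example are hypothetical:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar's test p-value (two-sided binomial test).
    b: items model A got right and model B got wrong; c: the reverse.
    Concordant items (both right or both wrong) carry no information
    about which model is better, so they are ignored entirely."""
    n = b + c
    k = min(b, c)
    # Two-sided p: probability of a split at least this lopsided
    # under the null hypothesis that each discordant item is a 50/50 coin.
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2**(n - 1)
    return min(1.0, p)

# Example: 1,000 questions, models agree on 880. Of the 120 disagreements,
# A wins 70 and B wins 50 -- a 2-point lead in raw accuracy.
print(f"{mcnemar_exact(70, 50):.3f}")  # ~0.08, not significant at 0.05
```

Note what happens there: a two-point aggregate lead on a thousand questions, which looks impressive in a table, fails to clear the 0.05 bar once you test it properly.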
I want to pause on that because it has huge practical implications. If you use the wrong test, you'll think you've found a real difference when you haven't. And that's exactly what's happening across the industry.
McNemar's test isn't even complicated. It's been around since nineteen forty-seven. Any undergraduate statistics course covers it. There's no excuse for not using it.
Except that using it would often reveal that your exciting result isn't statistically significant, and that's bad for marketing.
The incentives in AI research right now reward overclaiming. If you're a lab releasing a model, you want to show improvement. If proper statistical testing shows your improvement isn't significant, you have a choice — be honest and get less attention, or sweep the statistics under the rug and put a bold number in your blog post.
The tech press amplifies this. Every model release gets covered as though a two-point difference on some benchmark is a meaningful leap forward.
The Allen Institute for AI published a piece on this called "Signal and Noise" that really laid it bare. They showed that for many common benchmarks, the difference between the top several models is smaller than the noise floor of the evaluation itself. So the rankings people obsess over are essentially meaningless. And AI2 is not exactly a minor player — they're one of the most respected research organizations in the field.
There was also that paper, "Measuring All the Noises of LLM Evals," that systematically cataloged all the sources of variance — prompt formatting, sampling temperature, even the order of multiple-choice options. They found that just changing the order of answer choices could swing scores by several percentage points.
The benchmark score isn't measuring some stable property of the model. It's measuring an interaction between the model, the prompt, the sampling parameters, and a bunch of other things, all of which introduce variance. And bootstrapping can help you quantify that. Daniel mentioned bootstrapped confidence intervals, and this is one of the most practical tools available. Bootstrapping is a resampling technique where you take your observed data and repeatedly sample from it with replacement to build up an empirical distribution of whatever statistic you're interested in.
Walk me through what that looks like for a benchmark evaluation.
Say you've tested your model on a thousand questions and got an accuracy of seventy-two percent. You take those thousand questions, randomly sample a thousand with replacement — some questions might appear multiple times, some might not appear at all. You compute the accuracy on that resampled set. Then you do it again, maybe ten thousand times. You now have ten thousand accuracy estimates. The two point fifth and ninety seven point fifth percentiles give you a ninety-five percent confidence interval.
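The procedure just described, as a short sketch using only the Python standard library:

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.
    `outcomes` is a list of 0/1 per-question correctness indicators."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = []
    for _ in range(n_resamples):
        resample = rng.choices(outcomes, k=n)   # sample with replacement
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# 1,000 questions, 720 correct: observed accuracy 72%
outcomes = [1] * 720 + [0] * 280
print(bootstrap_ci(outcomes))  # roughly (0.69, 0.75)
```

So even at a thousand questions, an honest headline would read "seventy-two percent, give or take about three points" rather than "seventy-two point zero."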
That interval might be quite wide.
It might be plus or minus five points, or more. And one caution — eyeballing whether two models' separate confidence intervals overlap is a crude heuristic, especially on paired data where the scores are correlated. The cleaner approach is to bootstrap the difference directly: resample the questions, compute the accuracy gap on each resample, and check whether that interval includes zero. If it does, you cannot claim one model is better than the other, regardless of what the point estimates say.
This is what drives me crazy about benchmark reporting. You'll see a table with accuracies reported to one decimal place — seventy-two point three percent versus seventy point eight percent — as if that precision means anything. No confidence intervals, no error bars, no mention of variance.
The decimal places are a tell. When someone reports accuracy to a tenth of a percent without any measure of uncertainty, they're either statistically unsophisticated or they're hoping you are. Either way, you shouldn't trust the comparison.
Alright, let's move to Chatbot Arena and the Elo stuff, because I think this is where the statistical misunderstandings get even more subtle.
This is a great case study. Chatbot Arena, run by LMSYS, is easily the most visible LLM evaluation platform. Users submit prompts, get responses from two anonymous models, vote on which is better, and those votes get fed into a Bradley-Terry model to produce Elo-style ratings.
Elo ratings — this comes from chess originally, right?
Yes, developed by Arpad Elo. The core idea is that each model has a latent skill level, and the probability that model A beats model B is a function of the difference in their skill levels. If two models have the same rating, the probability of either winning is fifty percent. If one model is a hundred points higher, it's expected to win about sixty-four percent of the time.
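That sixty-four percent figure falls out of the standard Elo formula with its 400-point logistic scale, which is essentially a one-liner:

```python
def elo_win_prob(rating_a, rating_b):
    """Expected probability that A beats B under the Elo / Bradley-Terry
    model with the conventional 400-point logistic scale."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

print(elo_win_prob(1300, 1300))            # 0.5 -- equal ratings
print(round(elo_win_prob(1300, 1200), 2))  # 0.64 -- 100-point edge
```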
Here's the thing people get wrong — an Elo rating isn't an absolute measure of capability. It's a relative ranking within a specific population of competitors.
This is the most fundamental misunderstanding. An Elo rating tells you how a model performs relative to other models in the Arena. It doesn't tell you anything about absolute capability. A model with an Elo of thirteen hundred isn't necessarily "better" at anything in particular — it's just better at winning Arena matchups than models with lower Elo.
The population of models changes over time. New models get added, old models drop out, the rating pool shifts.
This creates all sorts of problems. If you add a bunch of weak models, everyone else's Elo goes up because they're beating those weak models. If you add strong models, everyone's relative rating might drop. The absolute Elo number is not anchored to anything stable.
When someone says "this model gained thirty Elo points since last month," what are they actually saying?
They're saying that relative to the current pool of competitors and the current distribution of user prompts, this model is winning a higher proportion of its matchups. That could be because the model improved, or because the competitor pool changed, or because the types of prompts users submit shifted toward areas where this model is stronger.
There's also a selection effect in the prompts. People aren't submitting a representative sample of all possible queries.
Huge selection effect. Chatbot Arena users tend to submit challenging, interesting prompts — coding problems, reasoning tasks, creative writing challenges. They're not submitting "what's the capital of France." So the Arena ratings reflect performance on a distribution that skews toward harder tasks. Which might be exactly what you want if you're a power user, but it's not a general capability measurement.
There's another subtlety with the Bradley-Terry model itself. It assumes the probability of A beating B depends only on the difference in their latent scores. But model A might be better at coding and model B better at creative writing, and which one wins depends entirely on the prompt category.
You're collapsing a multidimensional capability space into a single number. That's inherently lossy. The Arena does provide category-specific ratings now, which helps. But even within a category, there's variance. And the confidence intervals on these ratings are wider than people think. For models with tens of thousands of votes, the confidence intervals can be plus or minus ten or fifteen points. For models with fewer votes, much wider. So when you see two models separated by five Elo points, that difference is almost certainly not statistically significant.
Yet the leaderboard presents them in strict rank order, and people treat that ordering as ground truth.
The LMSYS team is actually more careful about this than most. They publish confidence intervals. They note that small differences shouldn't be overinterpreted. But the community and the press ignore all those caveats and just look at the ranking number.
This is the core problem. The people building these evaluation frameworks often understand the limitations. But the incentives to extract simple, headline-friendly numbers are overwhelming.
The labs know this. They know a one-point lead on the Arena leaderboard will generate more coverage than a careful statistical analysis showing the top five models are essentially indistinguishable. So they optimize for the headline number.
Let's talk about what good evaluation practice would actually look like. If you're a listener who wants to evaluate models properly, or just wants to know when to trust a benchmark claim, what should you look for?
First, look for confidence intervals. If a paper or blog post reports accuracy without any measure of uncertainty, discount it. If they can't be bothered to compute a confidence interval, they're not serious about evaluation. Second, look at the sample size. If they're claiming a two-point difference on a hundred-question test, be very skeptical. Third, check whether they're using the right statistical test — McNemar's for paired comparisons, not chi-square.
What about bootstrapping? Is that something labs should be doing as standard practice?
Bootstrapping is computationally cheap and gives you honest confidence intervals without making strong parametric assumptions. There's no reason not to do it. Some of the better papers do include bootstrapped confidence intervals, but it's still not standard.
For Elo-style rankings, what should people understand?
Understand that Elo is a relative measure within a specific population, not an absolute capability score. A model with an Elo of thirteen hundred isn't ten percent better than a model with an Elo of eleven hundred and eighty — the scale isn't linear in that way. And small differences, especially under twenty points, are often within the noise.
I think there's also a more fundamental point about what benchmarks actually measure. Even if you do all the statistics perfectly, you're still measuring performance on a specific set of questions that may or may not represent the tasks you actually care about.
This is the validity problem, and it's arguably even harder than the statistical rigor problem. A benchmark can be perfectly reliable — giving consistent results with tight confidence intervals — and still not measure anything useful. If the benchmark questions are in the training data, you're measuring memorization. If the benchmark only covers multiple-choice factual questions, you're not measuring reasoning or creativity.
Even with perfect statistics, you still need to ask whether the benchmark is measuring what it claims to measure.
In AI, we have a particularly acute version of this because the benchmarks themselves become targets. Once a benchmark is widely used, labs optimize their models to score well on it. The benchmark stops being an independent measure of capability and becomes a training objective.
Goodhart's law in action — when a measure becomes a target, it ceases to be a good measure.
We've seen this cycle play out repeatedly. A new benchmark is introduced, it's useful for a while, then models saturate it or game it, and we need a new benchmark. The whole field is chasing a moving target.
Let me ask you something. You spent your career in medicine before this. How would the statistical standards in medical research compare to what you're seeing in AI?
There's no comparison. In medicine, if you tried to publish a clinical trial without confidence intervals, without a pre-registered analysis plan, without accounting for multiple comparisons, you'd be laughed out of the journal. The CONSORT guidelines for clinical trials are incredibly detailed about statistical reporting. In AI, we're basically in a pre-scientific era of evaluation.
The stakes in medicine are obviously higher — people's lives are on the line. But as AI systems get deployed in high-stakes settings, the stakes in AI evaluation are rising too.
If you're using an LLM to help with medical diagnosis or legal analysis or financial decisions, you need to know whether model A is actually better than model B, or whether the apparent difference is just noise. The sloppy evaluation practices that are standard today won't be acceptable when these systems are making consequential decisions.
What would it take to fix this? What would a mature evaluation culture look like?
A few things. First, pre-registration of evaluation plans. Before you run your model on a benchmark, specify what tests you'll run, what comparisons you'll make, and what statistical thresholds you'll use. This prevents p-hacking and cherry-picking after the fact. Second, standardized reporting requirements. Every model release should include confidence intervals, sample sizes, and the specific statistical tests used. Journals and conferences should require this. The major labs should adopt it voluntarily as a sign of rigor. Third, independent evaluation. Right now, most benchmark results are self-reported by the labs that built the models. That's an obvious conflict of interest. We need more third-party evaluation, like what LMSYS does with Chatbot Arena, but with proper statistical rigor applied throughout.
I'd add a fourth — humility about what the numbers mean. A benchmark score is not a model's IQ. It's a noisy measurement on a specific task distribution, and small differences are usually meaningless. If labs and the press would just internalize that, the whole discourse would improve.
There's a great line from a researcher I follow — the most thoughtful people in the field hold their evaluation results loosely, treating benchmark scores as data points rather than verdicts. That's the mindset we need.
Right now, the dominant mindset is the opposite. Every benchmark win is treated as a verdict. This model is better, period. The nuance gets stripped out.
Because nuance doesn't generate headlines. "New model achieves statistically indistinguishable performance on most benchmarks" is a true statement about many model releases, but nobody's going to write that story.
Alright, let's get practical. If someone's listening and they want to do their own model evaluations with proper statistical rigor, what does that workflow look like?
Step one — decide what you actually want to measure. Not just "which model is better," but better at what, on what kind of tasks, for what purpose. Step two — collect or create a test set that's representative of those tasks. Step three — run both models on the same prompts, with the same formatting, at the same temperature settings. Step four — compute your accuracy or whatever metric you care about. Step five — run McNemar's test if you're comparing accuracy on paired data. Step six — bootstrap confidence intervals for your metrics. Step seven — report everything transparently, including the confidence intervals and the test results. If the difference isn't statistically significant, say so.
How many samples do you need for this to be worth doing?
As a rough rule of thumb, if you have fewer than three hundred test items, be very cautious about claiming any difference under five percentage points. If you have fewer than a hundred, you probably can't detect anything but the largest effects.
Most of the benchmark subsets people get excited about are in that under-a-hundred range.
The MMLU professional law subset has about fifteen hundred questions, which is decent. But then people will report results on "high school physics" with a hundred and fifty questions and claim a two-point win. It's just not statistically meaningful.
What about the cost of doing this properly? Is there an argument that bootstrapping ten thousand resamples is computationally expensive?
It's trivial. Bootstrapping ten thousand resamples on a dataset of a few thousand questions takes seconds on a laptop. The expensive part is running the model inferences to get the predictions in the first place. Once you have those, the statistical analysis is essentially free. So there's really no excuse. And that's what makes the current state of affairs so frustrating. The fixes are well-known, computationally cheap, and have been standard practice in other fields for decades. The AI community just hasn't adopted them.
Do you think it's improving at all? Are you seeing more papers with proper statistics?
There's been more discussion of these issues in the past year. The AI2 "Signal and Noise" piece got a lot of attention. Some major conferences are starting to require more rigorous evaluation. But the dominant practice is still to report point estimates without uncertainty, and the dominant discourse still treats small benchmark differences as meaningful.
Part of the problem might be that statistical sophistication isn't evenly distributed. A lot of people working in AI come from computer science backgrounds where they might not have had much formal training in statistics.
That's fair, but it's not an excuse for the major labs. These labs employ hundreds of PhDs. They have the expertise in-house. The issue isn't capability, it's incentives. Evaluation is supposed to be about finding the truth. But in a competitive market, evaluation becomes a marketing tool. And when evaluation is marketing, rigor is the enemy.
That's a pretty bleak assessment.
It's realistic. But there's a countervailing force — as AI systems get deployed in higher-stakes settings, the demand for honest evaluation will increase. If you're a hospital choosing a model to help with diagnosis, you need to know which one is actually better, not which one had the better marketing team. The people making consequential decisions based on these evaluations have a strong incentive to demand rigor. The question is whether that demand will be enough to overcome the labs' incentive to overclaim.
I want to circle back to the multiple comparisons problem, because I think it's one of the most pervasive and least understood issues in benchmark reporting.
Here's the classic scenario — a lab releases a new model and evaluates it on twenty benchmarks. It outperforms the previous model on fourteen of them. They put that in the headline. But if you apply a Bonferroni correction or control the false discovery rate, how many of those fourteen wins are actually significant? Often fewer than half. And the lab almost never reports this. They don't adjust for multiple comparisons. They just count the number of benchmarks where the point estimate is higher and call it a win.
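To see how this plays out, here is a sketch with twenty hypothetical p-values — fourteen of them below 0.05 uncorrected — run through both a Bonferroni correction and Benjamini-Hochberg false discovery rate control:

```python
def bonferroni(p_values, alpha=0.05):
    """Which results survive a Bonferroni correction for m comparisons?"""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Which results survive Benjamini-Hochberg FDR control?
    Reject all hypotheses up to the largest rank k with p_(k) <= k*alpha/m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            max_k = rank
    survives = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            survives[i] = True
    return survives

# Hypothetical p-values from 20 benchmark comparisons:
# 14 are "wins" at the uncorrected 0.05 level.
pvals = [0.001, 0.003, 0.008, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.04,
         0.04, 0.045, 0.048, 0.049, 0.06, 0.2, 0.3, 0.5, 0.7, 0.9]
print(sum(p < 0.05 for p in pvals))    # 14 uncorrected "wins"
print(sum(bonferroni(pvals)))          # 1 survives at alpha/20 = 0.0025
print(sum(benjamini_hochberg(pvals)))  # 4 survive FDR control
```

Fourteen headline wins collapse to one under Bonferroni and four under the less conservative FDR correction — which is exactly the gap between a release blog post and an honest analysis.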
This is the kind of thing that would get a paper rejected in psychology or medicine.
In psychology, the replication crisis forced a reckoning with these exact issues. Pre-registration, correction for multiple comparisons, reporting of effect sizes and confidence intervals — all of this became standard because the field realized it had been generating false positives for decades.
AI is heading for its own replication crisis if it doesn't clean up its act.
I think it's already happening. People are increasingly skeptical of benchmark results. There's a growing recognition that many claimed improvements aren't reproducible or aren't meaningful. The question is whether the field will respond by improving its methods or by finding new ways to generate exciting-looking numbers.
Let's end with something forward-looking. If you could wave a magic wand and change one thing about how LLM evaluation is done and reported, what would it be?
I'd require every benchmark result to be reported with a confidence interval, and I'd require every claim of superiority to be backed by a proper statistical test with correction for multiple comparisons. That one change would eliminate maybe eighty percent of the misleading claims out there.
For listeners who want to be smarter consumers of these claims?
Be skeptical of any comparison that doesn't report uncertainty. Be especially skeptical of small differences on small test sets. And remember that a benchmark score is a measurement, not a truth. It's noisy, context-dependent, and only as good as the evaluation methodology behind it.
Treat benchmark scores as data points, not verdicts.
If the lab releasing the model isn't doing that, you should do it for them.
This has been a thoroughly satisfying dive into the statistical sins of AI evaluation. Thanks to our producer Hilbert Flumingtop for keeping us on track, and thanks to Modal for powering our pipeline. This has been My Weird Prompts. Find us at myweirdprompts dot com or on Spotify. We'll be back with more.
Until next time.