Hey everyone, welcome back to My Weird Prompts! I am Corn, and I am so glad you are hanging out with us today in our little corner of Jerusalem. As always, I am joined by my brother and resident expert on just about everything.
Herman Poppleberry, at your service. It is good to be here, Corn. Although, I have to say, the weather today makes me want to just stay inside and read research papers all day.
Well, that is perfect, because our housemate Daniel sent us a prompt that is right up your alley. It is all about the world of artificial intelligence benchmarks. And honestly, it is something I have been seeing all over my feed lately, but I feel like I am only scratching the surface of it.
It is a massive topic right now, especially as we wrap up twenty twenty-five. The landscape of AI has changed so much just in the last twelve months, and the way we measure these models is becoming a bit of a controversial subject. Being a donkey, I can be a bit stubborn about wanting the actual data instead of just the marketing hype, so I have been digging into this quite a bit.
I remember you mentioned you were looking at some new models recently. Daniel specifically mentioned Claude Opus four point five and this new GLM four point seven model from Z dot A-I. He was asking about how these things actually perform when it comes to coding, not just solving math puzzles.
That is such a great point from Daniel. And it touches on a real frustration in the industry right now. We are seeing these huge announcements where a company says, "Our new model scored ninety-eight percent on this specific test!" And everyone cheers, but then people actually try to use it to write software and they find it still makes the same silly mistakes.
It is like that kid in school who memorizes the practice test but does not actually understand the subject, right?
Exactly! That is the perfect analogy, Corn. You are a natural at this, even if you are a bit of a slow-moving sloth sometimes. But seriously, that is what we call data contamination or benchmark gaming. If the questions from the benchmark are included in the AI's training data, which is basically the entire internet, then the AI is not reasoning through the problem. It is just remembering the answer it saw during training.
So, it is basically cheating?
In a way, yes. Though the developers might not always be doing it on purpose. When you are scraping billions of pages of data, it is hard to make sure you have not accidentally sucked in the answers to the most popular tests. But Daniel’s point about the focus on math is really interesting. Why do you think they focus so much on these complex mathematical puzzles, Corn?
I mean, I guess because math has a right or wrong answer? It is not like asking it to write a poem where everyone has a different opinion.
That is a big part of it. Math is objective. It is easy to grade. You can run a script that checks if the model got the number forty-two or not. But there is another reason. Mathematics is often seen as a proxy for raw reasoning capability. The idea is that if a model can solve a high-level calculus problem or a complex logic puzzle, it must be smart enough to handle other things, like coding or legal analysis.
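For anyone who likes to see just how mechanical that grading is, here is a tiny sketch for the show notes. The question, the reference answer, and the model reply are all made up, and real harnesses do extra normalization of fractions, units, and formatting, but the core check really is this simple: there is exactly one right answer to compare against.

```python
# Made-up example of objective grading: compare the model's final answer
# to a single reference answer. Real math benchmarks also normalize
# fractions, units, and formatting, but the core check looks like this.

def grade_exact(model_output: str, expected: str) -> bool:
    """Return True if the model's answer matches the reference exactly."""
    return model_output.strip() == expected.strip()

# Hypothetical test item and model reply.
question, reference = "What is 6 * 7?", "42"
model_reply = "42"

print(grade_exact(model_reply, reference))  # True
```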
But is that actually true? Does being good at math puzzles make you a good programmer?
Not necessarily. Coding is about more than just logic. It is about understanding context, following style guidelines, and managing how different parts of a large system interact. A model might be a genius at an isolated math problem but completely fall apart when you ask it to edit a three-thousand-line script without breaking the existing features.
That is what Daniel was getting at with the skepticism. He mentioned that manufacturers might be targeting benchmark performance rather than making the most useful models. Is that actually happening in twenty twenty-five?
Oh, absolutely. There is a lot of pressure to be at the top of the leaderboards. It helps with funding, it helps with sales, and it creates headlines. But we are starting to see a pushback. There was a report recently called The State of AI Coding twenty twenty-five that looked at actual productivity gains instead of just test scores. They found that the median size of a pull request—that is basically a set of code changes—increased by thirty-three percent between March and November of this year. It went from about fifty-seven lines to seventy-six lines.
Wait, so people are actually getting more done?
They are. But the interesting thing is that the models that score the highest on the math-heavy benchmarks are not always the ones driving that productivity. Some models are better at what we call "in-context" work. That means they are better at looking at your specific project and understanding how you personally write code, rather than just knowing the general rules of Python or Java.
That makes sense. I would rather have a tool that knows my specific mess than a tool that knows every math formula but has no idea what I am trying to build. But what about those specific models Daniel mentioned? Claude Opus four point five and GLM four point seven. What is the deal there?
Well, Claude has been a favorite for a long time because it tends to be very careful. It does not hallucinate as much as some other models. But then you have these new challengers like GLM four point seven. The big story there is often the price-to-performance ratio. You might get ninety-five percent of the capability of a top-tier model for a fraction of the cost.
I love a good bargain. But I worry that if I go with the cheaper one, it is going to break my website.
And that is why we need better benchmarks! Daniel asked for recommendations for benchmarks that are objective and free from vendor bias. And honestly, the old ones are starting to fail us. But there are a few new ones that I really trust right now.
Before we get into the nitty-gritty of those benchmarks, I think we should take a quick break for our sponsors.
Larry: Are you tired of your garden looking like a boring collection of plants? Do you want your backyard to reflect the true mystery of the universe? Introducing Chronos-Seeds! These are not your grandmother’s petunias. Chronos-Seeds have been exposed to high-frequency tachyon bursts in a basement in New Jersey. We cannot guarantee what will grow, but we can guarantee it will be "interesting." Some customers report flowers that bloom yesterday. Others report vines that hum in a language that sounds suspiciously like ancient Aramaic. Do they need water? Maybe. Do they need your secret thoughts whispered to them at midnight? Definitely. Chronos-Seeds - because the linear flow of time is just a suggestion. BUY NOW!
...Alright, thanks Larry. I think I will stick to my regular tomatoes, thank you very much. Anyway, Herman, back to the world of AI benchmarks that actually mean something.
Right. So, if we are looking for things that are hard to "game" and actually represent real-world coding ability, I have a few favorites. The first one I want to mention is called LiveBench.
LiveBench? Like, it is happening live?
Exactly! That is the whole point. One of the biggest problems with benchmarks is that they become stagnant. Once a test is released, it is only a matter of weeks before it ends up in the training data for the next AI model. LiveBench tries to solve this by constantly releasing new problems that are based on very recent information—stuff that happened after the models were already trained.
Oh, that is clever. It is like a surprise quiz that changes every day so you cannot memorize the answers.
Precisely. It is designed with test set contamination in mind. They use objective evaluation, meaning they have very strict ways of measuring if the answer is correct, but the problems are fresh. If a model scores well on LiveBench, it is a much better indicator that it can actually think on its feet.
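And to make that concrete for the show notes, here is a rough sketch of the freshness idea. This is not LiveBench's actual code; the question pool, the schema, and the training-cutoff date are all hypothetical. It just shows the core trick: only score a model on questions that were published after it finished training.

```python
from datetime import date

# Hypothetical question pool; the real LiveBench data and schema will differ.
questions = [
    {"id": "q1", "published": date(2025, 10, 1)},
    {"id": "q2", "published": date(2024, 1, 15)},
]

# Assumed training cutoff for the model under test.
model_training_cutoff = date(2025, 3, 1)

# Only keep questions the model could not have memorized during training.
fresh = [q for q in questions if q["published"] > model_training_cutoff]
print(f"Scoring on {len(fresh)} question(s) published after the cutoff.")
```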
Okay, that sounds great for general intelligence. But what about the coding stuff Daniel was asking about?
For coding, the gold standard right now is something called SWE-bench. That stands for Software Engineering Benchmark. Instead of asking the AI to solve a tiny puzzle, SWE-bench gives it a real-world issue from a popular open-source project on GitHub.
Like a real bug that a human had to fix?
Yes. The AI is given the entire codebase, a description of the bug, and it has to actually write the code to fix it. Then, the benchmark runs the project's existing tests to see if the AI actually fixed the problem without breaking anything else.
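If you want a picture of that evaluation loop, here is a simplified sketch for the show notes. It is not the real SWE-bench harness; the repository path, patch file, and test command are placeholders, but the shape is the same: apply the model's patch to a checked-out repo, then run the project's own test suite and see if everything still passes.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str) -> bool:
    """Apply a model-generated patch and return True if the test suite passes."""
    # Try to apply the patch the model wrote for the GitHub issue.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch did not even apply cleanly

    # Run the project's existing tests; fixing the bug while breaking
    # another feature still counts as a failure.
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return tests.returncode == 0

# Hypothetical usage:
# print(evaluate_patch("/tmp/checked-out-repo", "/tmp/model_fix.patch"))
```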
That sounds incredibly hard.
It is! For a long time, even the best models were scoring less than ten percent on this. But in late twenty twenty-four and throughout twenty twenty-five, we have seen those numbers start to climb. Anthropic’s Claude models and Google’s Gemini models have been doing some really impressive work here. It is a much better reflection of what a software engineer actually does all day. It is not just writing code; it is navigating a complex system.
So if I am a developer and I want to know which model to use, I should look at the SWE-bench scores?
That is one of the best places to look. Another one that Daniel might find useful is the Aider leaderboard. Aider is a popular tool that people use to code with AI inside their actual projects. They maintain a leaderboard that specifically tests how well models can perform "refactoring" and "editing" tasks.
Refactoring... that is just a fancy word for cleaning up code, right?
You got it. It is about changing the structure of the code without changing what it does. It is one of the most common tasks for a programmer, and it is something that models often struggle with. They might fix the thing you asked for, but accidentally delete three other things in the process. The Aider leaderboard is great because it is based on real-world usage of these models in a very popular coding tool. It is much harder to "game" because it is testing the model's ability to follow complex editing instructions.
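Here is a little sketch of how that kind of editing benchmark can be graded, again just for the show notes. The function, the refactor, and the test cases are invented, not taken from Aider's actual suite; the point is that the model's rewritten code has to behave identically to the original on every input you throw at it.

```python
# Invented example: an original function, a model's refactor of it, and a
# check that both behave identically on every test case.

original_code = """
def total(prices):
    result = 0
    for p in prices:
        result = result + p
    return result
"""

refactored_code = """
def total(prices):
    return sum(prices)
"""

def behaviour(source, test_inputs):
    """Execute the source and collect total()'s output for each input."""
    scope = {}
    exec(source, scope)
    return [scope["total"](x) for x in test_inputs]

cases = [[1, 2, 3], [], [10]]
print(behaviour(original_code, cases) == behaviour(refactored_code, cases))  # True
```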
That sounds way more useful than a math puzzle. Why are we still even talking about the math puzzles then?
Well, to be fair to the math puzzles, they do show us something about the "ceiling" of a model's logic. If a model cannot solve a high-school level math problem, it is probably going to struggle with complex logic in a program. But you are right, we are seeing a shift. There is a growing sentiment that "benchmarking is broken" if we only focus on those static, academic tests.
I saw a headline about that. It said something about AI reviewing AI? That sounds like a recipe for disaster.
It can be! That is another trend in twenty twenty-five. Because these tasks are getting so complex, humans sometimes have a hard time grading them quickly. So developers use a "stronger" AI to grade the "weaker" AI's work.
Wait, so the teacher and the student are both robots?
Exactly. And you can see the problem there. If the teacher-AI has the same biases or the same gaps in knowledge as the student-AI, it might give it a passing grade even if the answer is wrong. There was a case study recently about "Oracle validity"—basically, how do we know the person or thing giving the answer actually knows the truth? In coding, we are lucky because we can run the code and see if it works. In other fields, it is much harder.
This is all making me a bit skeptical of any number I see now. If Daniel is looking at a new model like GLM four point seven and he sees a high score, what should his first question be?
His first question should be: "Was this an open-ended benchmark or a closed one?" And then: "How does it perform on SWE-bench or LiveBench?" If a company only reports scores on something like HumanEval—which is a very old and very famous benchmark—you should be a bit suspicious.
Why HumanEval?
HumanEval was released years ago. It is a collection of about one hundred and sixty-four coding problems. Because it is so old and so famous, it is almost certain that every single one of those problems is in the training data of every new model. It is like taking a history test when you already have the answer key in your pocket. It does not prove you know history; it just proves you can read the answer key.
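For the curious, here is a toy sketch of what a contamination check can look like. Real checks run n-gram or embedding overlap across billions of scraped documents; this miniature version with stand-in strings just shows the idea of searching the training corpus for verbatim chunks of a benchmark problem.

```python
def contains_verbatim(problem: str, corpus: str, window: int = 30) -> bool:
    """Return True if any window-sized slice of the problem appears verbatim in the corpus."""
    for i in range(max(1, len(problem) - window + 1)):
        if problem[i:i + window] in corpus:
            return True
    return False

# Stand-in strings for a HumanEval-style problem and a scraped web page.
benchmark_problem = "def has_close_elements(numbers, threshold):"
scraped_corpus = "blog post: ... def has_close_elements(numbers, threshold): ..."

print(contains_verbatim(benchmark_problem, scraped_corpus))  # True
```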
So if a model says it got one hundred percent on HumanEval, it is basically just saying "I have read the internet."
Pretty much! It is still a useful baseline—if a model fails HumanEval, it is definitely not ready for prime time. But a high score there does not mean it is a great coding assistant. For that, you really want to see how it handles those larger-scale tasks.
You mentioned Google’s Gemini earlier. I feel like they were a bit behind for a while, but I have been hearing more about them lately. How are they doing in the coding world at the end of twenty twenty-five?
Gemini has actually made a huge comeback. They have a massive "context window," which means the model can "remember" or "look at" a huge amount of information at once—sometimes millions of tokens. For a programmer, that means you can feed the AI your entire documentation, your entire codebase, and all your style guides, and ask it a question. It does not have to guess; it can actually see your whole project.
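Here is a back-of-the-envelope sketch of what that looks like in practice, for the show notes. The four-characters-per-token estimate and the one-million-token budget are rough assumptions, not any vendor's real numbers; the point is simply that with a big enough context window you can concatenate an entire project into one prompt instead of guessing which file matters.

```python
from pathlib import Path

def build_repo_prompt(repo_dir: str, question: str, token_budget: int = 1_000_000) -> str:
    """Concatenate a project's Python files into one long-context prompt."""
    parts = [f"Question about this codebase: {question}\n"]
    used = 0
    for path in sorted(Path(repo_dir).rglob("*.py")):
        text = path.read_text(errors="ignore")
        estimated_tokens = len(text) // 4  # crude rule of thumb: about 4 characters per token
        if used + estimated_tokens > token_budget:
            break  # stop once the assumed context window is full
        parts.append(f"\n# File: {path}\n{text}")
        used += estimated_tokens
    return "".join(parts)

# Hypothetical usage:
# prompt = build_repo_prompt("./my_project", "Where is the retry logic configured?")
```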
That sounds like a game changer. I mean, I can barely remember what I had for breakfast, let alone a whole codebase.
Exactly. And that is a different kind of "smart" than being good at a math puzzle. It is about information retrieval and synthesis. So, when Daniel asks which benchmarks to recommend, I would say look for those that test "long-context" reasoning too.
Okay, so let me see if I can summarize this, because my sloth brain is starting to get full. We have LiveBench for fresh, non-contaminated problems. We have SWE-bench for real-world software engineering tasks on GitHub. And we have the Aider leaderboard for practical, day-to-day code editing.
You nailed it, Corn. And I would add one more thing for Daniel: personal benchmarking. In twenty twenty-five, the most sophisticated users are not just trusting the online scores. They have their own little "test set" of problems that they know are hard or specific to their work. When a new model comes out, they run it through their own five or ten problems to see how it handles them.
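In fact, for the show notes, here is the kind of throwaway harness I mean. The ask_model function is a stand-in for whatever API or coding tool you actually use, and the tasks and pass checks are deliberately crude examples, not a real suite; the value is that the problems are yours and no model has ever seen them.

```python
# A personal, five-minute benchmark harness. ask_model() is a placeholder for
# whatever model API or coding tool you actually use; the tasks and checks
# below are simple illustrative examples, not a real suite.

personal_suite = [
    {"prompt": "Rewrite my CSV loader to use pathlib", "must_contain": "pathlib"},
    {"prompt": "Give me a regex for ISO 8601 dates", "must_contain": r"\d{4}-\d{2}-\d{2}"},
]

def ask_model(prompt: str) -> str:
    """Placeholder: call your model of choice here."""
    raise NotImplementedError

def run_suite(suite) -> float:
    """Return the fraction of personal tasks the model's reply passes."""
    passed = sum(1 for case in suite if case["must_contain"] in ask_model(case["prompt"]))
    return passed / len(suite)

# print(f"Pass rate: {run_suite(personal_suite):.0%}")
```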
That makes so much sense. Like, if I have a specific way I like my Python scripts to look, I should see if the new model can actually follow my style.
Exactly. No benchmark is going to be as perfect as your own experience. But the industry is moving in the right direction. We are moving away from those "look how smart I am at puzzles" tests toward "look how much work I can actually help you get done" tests.
It feels like the "hype" phase of AI is finally starting to settle into a "utility" phase.
I think that is a very astute observation. We are seeing a lot more focus on reliability. People are realizing that a model that is eighty percent accurate but tells you when it is unsure is actually much more useful than a model that is ninety percent accurate but lies to you the other ten percent of the time.
Oh, I hate it when they lie. It is so confident too! It will give you a piece of code that looks perfect, and then you run it and your computer starts smoking.
Hopefully not literally smoking! But yes, that "hallucination" problem is why benchmarks like SWE-bench are so important. They require the code to actually pass a test. You cannot just look good; you have to work.
So, looking forward to twenty twenty-six, do you think we will ever have a "perfect" benchmark? One that everyone agrees on?
Probably not. As the models get smarter, the benchmarks have to get harder. It is a constant game of cat and mouse. But I do think we will see more "dynamic" benchmarks. Instead of a static list of questions, we might see benchmarks that are generated by other AIs in real-time to test specific weaknesses. It is going to be a very interesting year for AI evaluation.
Well, I feel a lot better about this now. It is not just about the numbers; it is about what those numbers are actually measuring. Daniel, I hope that helps you navigate the sea of AI announcements. It sounds like Claude and Gemini are still the big players, but keep an eye on those newer models like GLM if they start showing up on the more rigorous leaderboards.
And don't be afraid to try them out! A lot of these newer models offer free trials or very cheap API access. The best way to see if a model is "gaming" the benchmark is to give it a task it has definitely never seen before—something unique to your own life or your own project.
That is great advice, Herman. Even for a donkey, you are pretty sharp.
Hey, I take that as a compliment! Donkeys are very intelligent and hardworking animals, you know.
I know, I know. And I am just here to keep us relaxed and ask the questions everyone else is thinking. This has been such a great deep dive. I feel like I actually understand why those math puzzles are everywhere, even if I still think they are a bit silly for testing a coding bot.
They have their place, but they are definitely not the whole story. I am glad we could clear that up. It is always fun to dig into the data and see what is actually happening behind the marketing curtain.
Absolutely. Well, I think that is all the time we have for today. Thank you so much for joining us for this episode of My Weird Prompts.
Yes, thank you everyone. And thank you to Daniel for such a thoughtful and timely prompt. It is exactly the kind of thing we love to explore.
If you have a prompt you want us to tackle, whether it is about AI, history, or why sloths are clearly the superior species, head over to our website at myweirdprompts dot com. We have a contact form there, and you can also find our RSS feed and links to all our episodes on Spotify.
We love hearing from you, so don't be shy.
Until next time, stay curious, and maybe don't buy those time-traveling seeds Larry was talking about.
Definitely don't do that. Goodbye, everyone!
Bye! This has been My Weird Prompts. See you in the next one!