You know, Herman, I was looking at some old folders on my hard drive the other day, and I found a script I wrote in early twenty twenty-four. It was this convoluted Python mess just to automate sorting my downloads. It took me three hours to debug a simple regex string. I remember staring at the screen, questioning my entire career because I couldn't remember if the parentheses needed to be escaped or not.
I remember that. You almost threw your monitor out the window because of a misplaced backslash. It’s that classic "regex blindness" where the more you look at it, the less it looks like language and the more it looks like a cat walked across your keyboard.
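For anyone else who has frozen at that exact question: in Python's re module, bare parentheses are grouping metacharacters, so matching a literal parenthesis in a filename needs a backslash. A tiny sketch (filenames invented for illustration):

```python
import re

# Unescaped parentheses are metacharacters: they create a capture group
# rather than matching "(" literally.
grouped = re.search(r"report (\d+)", "report 42")
assert grouped.group(1) == "42"

# Escaped parentheses match literal "(" and ")", as in the "photo (1).jpg"
# style of duplicate filenames a downloads folder accumulates.
literal = re.search(r"photo \((\d+)\)\.jpg", "photo (1).jpg")
assert literal.group(1) == "1"
```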
I did. But today, I could probably just whisper that requirement into a terminal and a model would not only write it but handle the edge cases I didn't even think of—like what to do with duplicate filenames or hidden system files. It’s wild how fast the floor has moved. Today’s prompt from Daniel is about exactly that—the state of autonomous coding as measured by the SWE-bench Verified leaderboard. We are looking at the gold standard for how well AI can actually navigate a real codebase and fix real bugs.
This is essentially the Olympics for AI software engineers. And the numbers coming out of late March twenty twenty-six are staggering. We’ve gone from models barely being able to find the right file to models like Claude four point five Opus hitting nearly eighty percent accuracy on human-verified GitHub issues. By the way, fun fact—Google Gemini three Flash is actually writing our script today, which is fitting since Gemini is currently a massive player on these leaderboards. It’s like the model is reporting on its own sports highlights.
It’s a bit meta, isn't it? An AI writing a script about how good AI is at writing code. But let’s look at that eighty percent number. Claude four point five Opus is sitting at seventy-nine point two percent on the Verified set as of December twenty twenty-five. On the surface, that looks like we’re five minutes away from the AI just taking over the entire Sprint board. But Daniel’s prompt points out that there is a massive amount of nuance behind that single percentage. For context, a human software engineer—given the same environment and time constraints—typically scores around ninety to ninety-five percent on these specific tasks. We are getting remarkably close to the human baseline.
There really is a lot under the hood there. If you look at the historical progression, it’s a vertical line that’s suddenly starting to curve. In April twenty twenty-four, GPT-four with the original SWE-agent scaffold was scoring twenty-two point four percent. People thought that was a miracle at the time. Then Claude three point five Sonnet came out and jumped it to thirty-three percent. By October twenty twenty-four, we crossed the fifty percent threshold. Now we are knocking on the door of eighty percent. But the leap from seventy-three to seventy-nine took almost twice as long as the leap from thirty to fifty.
We’re hitting the wall of diminishing returns. It’s like a sprinter trying to shave a millisecond off a world record versus a middle schooler improving their hundred-meter dash by three seconds. The easy bugs—the "low hanging fruit" typos and simple logic errors—are gone. What’s left on that leaderboard are the weird, deep-seated architectural bugs in Django or Matplotlib that require a genuine understanding of how a thousand different files interact.
But wait, how does the benchmark actually work? Is the AI just guessing?
Not exactly. It’s given a folder containing the entire repository and a text description of the issue. It has to explore the files, reproduce the bug by writing a test case, modify the code, and then verify the fix. It’s an end-to-end process. If it fails to find the file, it fails the task. If it fixes the bug but breaks three other things, it fails the task.
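That pass/fail gate is simple enough to sketch. This is a toy stand-in, not the official SWE-bench harness (the real one applies the patch in a container and runs two test sets: tests that reproduce the bug and tests that guard existing behavior); the callables here are hypothetical stand-ins for those stages:

```python
from typing import Callable

def grade_submission(
    apply_patch: Callable[[], bool],
    fixes_the_bug: Callable[[], bool],    # tests that reproduce the issue
    breaks_nothing: Callable[[], bool],   # tests guarding existing behavior
) -> bool:
    """A task counts as 'resolved' only if the patch applies cleanly, the
    previously failing tests now pass, and no previously passing test
    regresses. Fixing the bug while breaking three other things scores zero."""
    if not apply_patch():
        return False              # patch did not even apply
    if not fixes_the_bug():
        return False              # the reported bug is still there
    return breaks_nothing()       # collateral damage also fails the task

# A clean fix resolves the task; a fix with regressions does not.
assert grade_submission(lambda: True, lambda: True, lambda: True)
assert not grade_submission(lambda: True, lambda: True, lambda: False)
```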
And that brings us to the most important realization of the twenty twenty-five coding boom: the model is only half the battle. The "scaffold" or the agent framework is the other half. You can take the exact same model, say Claude four Sonnet, and its performance swings wildly depending on who built the "harness" it’s sitting in. With the standard SWE-Agent scaffold from May twenty twenty-five, it scored sixty-six point six percent. But when the team at EPAM AI put their own custom agentic wrapper around it in August, that same model jumped to seventy-six point eight percent.
That is a ten-point spread on the same brain. It’s like giving two people the same IQ but one of them has a library card and a high-speed internet connection while the other is trapped in a dark room with a single textbook. What are these scaffolds actually doing, Herman? Is it just better prompting, or is there something more mechanical happening?
It’s highly mechanical. A good scaffold like Live-SWE-agent or OpenHands isn't just sending a long prompt. It’s managing a file-searching tool, a linting tool, and a test-running environment. It’s essentially a loop: "Search for the bug, propose a fix, run the tests, see the failure, read the traceback, try again." The scaffolds that are winning right now are the ones that have perfected the "inner loop" of debugging. They know how to prune the search space so the model doesn't get lost in a massive repository like scikit-learn. Think of it as a specialized OS for the AI.
I love the idea of "pruning the search space." Because if you just dump ten thousand lines of code into a context window, the model gets "lost in the middle." It’s that classic needle-in-a-haystack problem. The best agents act like a senior developer who says, "Don't look at the database logic, the bug is definitely in the routing middleware." They guide the LLM’s attention. But how does the agent know where to look without reading everything first?
It usually starts with a semantic search or a "grep" on the keywords in the issue description. A sophisticated scaffold will look at the file tree, identify relevant-looking filenames, and then "peek" at the imports. If models.py imports validators.py, the agent follows that trail. It's mimicry of how you or I would navigate a new codebase. We don't read every line; we look for the "scent" of the bug.
And if you look at the Live-SWE-agent results from late twenty twenty-five, you see Claude four point five Opus hitting seventy-nine point two percent. But look at Gemini three Pro—it’s right there at seventy-seven point four percent. The gap between the absolute top-tier Western models is shrinking to a margin of error. And what’s even more fascinating is the cost-to-performance ratio. In the mini-v-two controlled comparisons, Claude four point five Opus costs about three hundred and seventy-seven dollars to run across the benchmark. Gemini three Flash, which is a much smaller, faster model, scores seventy-five point eight percent—only a tiny bit lower—but it costs a hundred and seventy-eight dollars.
Wait, hold on. So you’re telling me that for roughly half the price, you get about ninety-six percent of the performance? That’s the real story for engineering managers. If I’m running a fleet of these agents to maintain a corporate codebase, I don't care about the few "ego" points at the top of the leaderboard if it doubles my compute bill.
It gets even more extreme. There’s a model from a Chinese lab called MiniMax M-two point five. It’s a "high reasoning" model that scored seventy-two point four percent on that same controlled test. That’s only about seven points behind the world-leading Opus model. But the cost? Thirty-six dollars.
Thirty-six bucks versus nearly four hundred? That is a ten-x difference in efficiency. That basically means the "intelligence" is being commoditized at a rate that should make the big labs very nervous. If a "budget" model can solve seventy percent of real-world GitHub issues for the price of a lunch, the era of the "expensive AI developer" might be shorter than we thought. I mean, why would I pay for the Ferrari to go to the grocery store when the Honda does it for a fraction of the cost?
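As a back-of-the-envelope check on that ten-x claim, here is each run's total benchmark cost divided by its score, using the figures as quoted in this episode:

```python
# Figures quoted above (mini-v-two controlled comparison).
runs = {
    "Claude 4.5 Opus": {"score_pct": 79.2, "cost_usd": 377},
    "Gemini 3 Flash":  {"score_pct": 75.8, "cost_usd": 178},
    "MiniMax M2.5":    {"score_pct": 72.4, "cost_usd": 36},
}

# Dollars per percentage point resolved: roughly 4.76, 2.35, and 0.50.
for name, r in runs.items():
    per_point = r["cost_usd"] / r["score_pct"]
    print(f"{name}: ${per_point:.2f} per point")
```

By this crude metric, the budget model is nearly ten times cheaper per point than the leader, which is the commoditization story in one number.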
It also highlights the "Chinese lab" factor that Daniel mentioned. We’ve seen Doubao Seed Code from ByteDance hitting seventy-eight point eight percent. That’s third place globally. Kimi K-two, Qwen-three-Coder from Alibaba—these models are not just "catching up," they are effectively at parity with Anthropic and OpenAI in the coding domain. Coding is a very structured, logical task where you have a "ground truth" in the form of a compiler or a test suite. It turns out that’s the perfect environment for these labs to optimize their training. They can generate millions of synthetic coding problems, run them through a compiler, and use that as a perfect feedback loop for Reinforcement Learning.
It’s also an environment where you can't really "fake" it. You either pass the tests or you don't. But I want to push back on what SWE-bench actually represents. Because when I’m working on a project, I’m not just fixing a bug in one file. I’m thinking about how the new feature affects the API design, whether it breaks the documentation, and if the security team is going to scream at me. Does SWE-bench capture any of that "big picture" stuff? Like, does it know if the fix is "elegant" or just a hack?
Not really, and that’s the big caveat. SWE-bench tasks are almost entirely single-file patches. They are "micro-tasks." You are given a specific issue description—usually a bug report—and you have to find the tiny logic error and flip a bit or add a check. It doesn't measure architectural vision. It doesn't measure the ability to refactor a whole system to use a new design pattern. It’s essentially a very sophisticated version of LeetCode for agents. It’s testing the "mechanic" skills, not the "architect" skills.
So it’s the world’s best junior developer. It can fix the "broken button" or the "null pointer exception," but it’s not going to sit in a meeting and tell you that you should migrate from REST to GraphQL because your mobile latency is too high.
Precisely. And there’s another issue that’s been popping up lately: data contamination. OpenAI actually did an audit recently on GPT-five-point-two and Claude four point five Opus. They found that these models could sometimes reproduce the "gold patches"—the actual human-written solutions—verbatim from their training data. Because these are real GitHub issues from famous projects, they were likely in the training set. If the model has seen the commit that fixed the bug in twenty twenty-two, it’s not solving it; it’s just recalling it.
Oh, that’s a massive "asterisk" on those seventy-nine percent scores. If the model has already seen the answer key, it’s not "solving" the problem; it’s just remembering it. That explains why the "Verified" set had to be created. It was an attempt to filter out the most obvious examples of contamination, right?
The "Verified" set involved human engineers looking at the tasks to make sure they were actually solvable and that the tests were robust. But even then, if the code is on public GitHub, it’s likely in the training data. It’s why the industry is moving toward SWE-bench Pro. That’s a newer benchmark where they use tasks that were published after the models were trained, or they use private repositories that haven't been scraped. And on SWE-bench Pro, the scores crater. Claude four point five Opus, which gets an eighty percent on Verified, drops to about forty-six percent on Pro.
Wow. That is a thirty-four-point drop. That tells you that a big chunk of the "intelligence" we’re seeing on the main leaderboards might just be very high-end memorization. It’s still impressive that it can retrieve the right memory and apply it to the code, but it’s not "reasoning" from first principles in the way a human engineer does. It’s more like a student who memorized the practice exam and then struggles when the teacher changes the numbers on the real test.
And yet, even that forty-six percent on "Pro" is still transformative. If an AI can autonomously fix nearly half of the non-memorized, real-world bugs in a codebase it’s never seen before, that is still a massive productivity multiplier. We’re moving into this world where the "human in the loop" becomes an editor rather than a writer. You aren't writing the fix; you're just approving the PR.
I think about the "agent scaffold" effect again. If the scaffold is what’s driving these jumps, maybe the future isn't a "smarter" model, but a "smarter" environment. Like, if the AI has access to a live debugger, or if it can talk to other agents, maybe that’s how we break the eighty percent ceiling. Imagine an agent that can actually spin up a Docker container, run the app, and "see" the UI failing.
That’s already happening. There’s a system called Auggie that currently tops the SWE-bench Pro leaderboard at fifty-one point eight percent. It uses a Claude four point five Opus backbone, but the way it interacts with the code is much more "human-like." It doesn't just guess; it builds a mental model of the execution flow. It uses a "plan-act-reflect" cycle. It writes a plan, executes a small part, checks if it worked, and then adjusts the plan. It’s essentially "Chain of Thought" but for software architecture. It’s slow—sometimes taking thirty minutes to solve one bug—but it’s thorough.
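The plan-act-reflect cycle described here is a recognizable pattern you can sketch generically. The method names (model.plan, model.reflect, env.execute) are invented stand-ins, not Auggie's actual internals:

```python
def plan_act_reflect(model, env, goal: str, max_steps: int = 20):
    """Write a plan, execute one small step, observe, revise the remaining
    plan. Slow and thorough by design. All names are hypothetical."""
    plan = model.plan(goal)                    # ordered list of small steps
    history = []
    for _ in range(max_steps):
        if not plan:
            return history                     # plan exhausted: done
        step = plan.pop(0)
        observation = env.execute(step)        # act on one step only
        history.append((step, observation))
        plan = model.reflect(goal, plan, observation)  # revise what's left
    return history
```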
It’s interesting to see who’s missing from the top of these lists. Daniel mentions that GPT-five is underperforming expectations. On the OpenHands leaderboard, GPT-five is at seventy-one point eight percent. That’s good, but it’s not the "god-model" everyone was predicting a year ago. It feels like OpenAI is focusing more on multimodal "omni" features—voice, video, emotion—while Anthropic and the Chinese labs are obsessing over raw coding logic.
It’s a divergence in strategy. Anthropic has clearly doubled down on the "Computer Use" and "Claude Code" ecosystem. They want the model to be a literal drop-in replacement for a dev. And it shows. They hold seven of the top eleven spots on the Verified leaderboard. If you are a developer today, you are likely using a Claude-based tool like Cursor or Claude Code because the "vibe" of the code it produces just feels more "engineered." It handles edge cases like null checks and asynchronous race conditions that GPT-five sometimes glosses over.
"Engineered" is the right word. When I use some of the older models, the code looks like it was copied from a Stack Overflow answer from two thousand twelve. It’s technically correct but it’s ugly—no type hints, weird variable names like data1 and temp_var. The newer models—especially with these high-end scaffolds—are starting to write code that actually follows modern best practices. They’re using type hints, they’re writing docstrings, they’re thinking about edge cases. They write code that looks like it came from a Senior Dev at Google.
But we have to talk about the "plateau" Daniel mentioned. We went from twenty-two to seventy-nine in two years. That’s an insane vertical climb. But if you look at the last six months, the gains are measured in half-percentages. Are we reaching the limits of what LLM architecture can do for coding? Is there a ceiling where a transformer-based model just can't "understand" a complex system?
I think we’re reaching the limits of what "text-in, text-out" can do. To get to ninety-five percent, the AI probably needs to be able to actually "run" the software in a persistent environment and observe it over days, not seconds. It needs a "long-term memory" of the codebase. Right now, every time you start a SWE-bench task, the agent is essentially "born" into a brand new world. It has no memory of the last time it fixed a bug in Django. A human developer gets faster over time because they learn the "quirks" of the system. They remember that "Oh, this module always has issues with the database connection pool."
That’s a brilliant point. The benchmarks are "stateless." Each task is an isolated event. In the real world, coding is "stateful." You know that Dave in DevOps always writes weird networking code, so you look there first. You know the legacy "auth" module is a house of cards, so you avoid touching it. Until agents have that "historical context," they will always be capped at a certain level of performance. They are constantly reinventing the wheel for every single ticket.
It’s also about the "verification" part. Even on the "Verified" leaderboard, who is verifying the AI’s fix? It’s a test suite. But we all know that you can pass a test suite while still introducing a "time bomb" of a bug that only shows up in production under high load. SWE-bench doesn't measure performance regressions or security vulnerabilities that don't trigger a specific test failure. It doesn't check if the AI just increased the memory usage by four hundred percent to solve a simple logic error.
Right. You could "fix" a bug by just commenting out the line that throws the error. Technically, the test passes! A human reviewer would catch that in a heartbeat and call you an idiot, but a benchmark might give you a gold star. This is why the "Agentic Harness" is so important. Companies like Augment and Cognition are building systems that don't just "pass tests," they actually try to "break" their own solutions before submitting them. They run a separate "adversarial" agent that acts like a hostile QA engineer.
It’s like the AI needs a "critic" personality. One agent writes the code, and another agent—maybe a "mean" one—tries to find every reason why that code is garbage. "This won't scale," "This is a security risk," "This is unreadable." That "adversarial" approach is how you get from eighty percent to ninety percent. It forces the model to justify its choices.
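That writer-versus-critic pattern can be sketched as a simple revision loop. The method names (author.draft, author.revise, critic.objections) are invented for illustration, not any vendor's API:

```python
def propose_with_critic(author, critic, task: str, max_rounds: int = 3) -> str:
    """One agent drafts a patch, an adversarial agent attacks it, and the
    draft is revised until the critic runs out of objections or the round
    budget is spent. All names are hypothetical stand-ins."""
    patch = author.draft(task)
    for _ in range(max_rounds):
        complaints = critic.objections(task, patch)  # "won't scale",
        if not complaints:                           # "security risk", ...
            break                                    # critic is satisfied
        patch = author.revise(task, patch, complaints)
    return patch
```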
And that brings us back to the cost. If you’re running two or three high-end models in an adversarial loop, your cost per bug fix goes from four hundred dollars to twelve hundred dollars. At that point, is it cheaper than a human? A senior dev in the U-S might cost a hundred and fifty dollars an hour. If they can fix that same bug in two hours, the human is actually "cheaper" than a high-end agentic swarm. And the human can also explain the fix to the rest of the team in the morning stand-up.
That is the "A-ha" moment for me. We always assume AI is the "cheap" alternative. But at the bleeding edge of reasoning, compute is incredibly expensive. We might see a future where the AI does the "boring" seventy percent for pennies, and the humans are brought in not because they are "smarter," but because they are more "cost-effective" for the really complex, multi-hour debugging sessions. The "AI tax" for high-level reasoning is real.
It’s a weird reversal of the "AI will replace us" narrative. It might be that "AI will assist us until it becomes too expensive to compute the answer." But let’s look at the practical takeaways for people listening. If you’re a developer or a tech lead, what do these leaderboards tell you to do today? How do you actually use this information?
For me, the first takeaway is: don't get married to a single model. If a Chinese lab like MiniMax can give you ninety percent of the performance for ten percent of the cost, you need to be using an agnostic framework. You want to be able to "swap the brain" out as the leaderboard shifts every month. If you're locked into the OpenAI API and Anthropic drops a "GPT-killer" for coding, you're at a competitive disadvantage.
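The "swap the brain" takeaway maps onto a well-known pattern: have your agent code depend on a minimal interface and hide each vendor behind an adapter. A minimal sketch of that pattern (the class and function names here are illustrative, not a specific framework):

```python
from typing import Protocol

class CodingModel(Protocol):
    """The only surface your agent logic is allowed to touch. Swapping
    vendors then becomes a config change, not a rewrite."""
    def complete(self, prompt: str) -> str: ...

class AnthropicAdapter:
    def complete(self, prompt: str) -> str:
        ...  # vendor SDK call would go here

class MiniMaxAdapter:
    def complete(self, prompt: str) -> str:
        ...  # vendor SDK call would go here

def fix_issue(model: CodingModel, issue: str) -> str:
    # Agent logic talks only to the Protocol, never to a vendor SDK.
    return model.complete(f"Propose a patch for: {issue}")
```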
Second takeaway: focus on your "test culture." The only reason these agents work is because there is a "ground truth" to aim for. If your company’s codebase has zero tests and a messy structure, an AI agent is going to be useless. It will just hallucinate in circles because it has no way to verify its own work. The "AI-ready" codebase of twenty twenty-six is one with high test coverage and clean interfaces. You're basically writing the "instruction manual" for your future AI coworkers.
It’s the "data labeling" of the coding world. If you want the AI to help you, you have to give it a map. And third, I’d say: start looking at the "scaffolds." Don't just "chat" with a model in a browser. That's like trying to build a house with a Swiss Army knife. Use tools like OpenHands, Claude Code, or Trae that actually "wrap" the model in a professional engineering environment. The "harness" is where the magic happens. It provides the terminal, the file system, and the browser that the model needs to actually be an engineer.
I’m also keeping a very close eye on the "Pro" leaderboards. The gap between "Verified" and "Pro" is the gap between "memorization" and "intelligence." As that gap closes, that’s when we’ll know we’ve actually cracked the nut of autonomous engineering. We're looking for the moment when the "Pro" score hits seventy percent. That's the tipping point.
It’s a fascinating time. It feels like we’re watching a child grow up in fast-forward. Two years ago, it couldn't tie its shoes. Now it’s passing the bar exam, even if it’s "cheating" a little bit by looking at the person next to it. But eventually, it won't need to look at the person next to it. It'll just know.
But hey, even a "cheating" genius is still a genius you can use to get your work done faster. I’ll take a seventy-nine percent success rate over my own "misplaced backslash" rate any day of the week. I'd rather spend my time thinking about the architecture than worrying about regex syntax.
Fair point. I think we’ve thoroughly dissected the leaderboard for today. It’s a reminder that even in a world of "super-intelligence," the details—the scaffolds, the costs, the test suites—still matter more than the hype. The "who" and the "how" are just as important as the "what."
This has been a deep dive into the state of AI coding. Thanks as always to our producer Hilbert Flumingtop for keeping the gears turning behind the scenes and making sure our own "scaffold" is running smoothly.
And a big thanks to Modal for providing the GPU credits that power this show—including the ones Gemini is using to process all this leaderboard data right now. This has been My Weird Prompts.
If you’re finding these deep dives useful, leave us a review on your favorite podcast app. It really helps the algorithm find other nerds like us who want to talk about agentic harnesses at two in the morning.
Catch you in the next one.
See ya.