So Daniel sent us this one, and it's a meaty one. He's asking: how do you actually measure whether your AI agent is good? His point is that the LLM world has its benchmarks — MMLU, Chatbot Arena, HumanEval — but agents are fundamentally different. You're not evaluating a single answer. You're evaluating dozens or hundreds of actions, tool calls, state management, and a final outcome that might be right even if the path was broken, or wrong even if the path looked reasonable. He wants a practical guide covering the major benchmarks — SWE-bench, AgentBench, GAIA, TaskBench, WebArena — plus the gotchas, and the emerging evaluation approaches like LLM-as-judge and custom harnesses. Basically: how do I know if version two of my agent is actually better than version one?
This is such an important question right now because the field has built a lot of impressive-looking infrastructure around agent evaluation, and if you don't know the gotchas, you can make genuinely bad decisions based on the numbers. The evaluation landscape has matured enormously, but the ways to be fooled by it have multiplied just as fast.
And by the way, today's script is powered by Claude Sonnet 4.6. Just noting that for the record.
Right, the friendly AI down the road. Okay, so let's start with the foundational question: why is agent evaluation fundamentally harder than LLM evaluation?
I mean, the obvious answer is that with a single-turn LLM you have one output. You can check it. With an agent you have a trajectory — potentially hundreds of tool calls, branching decisions, state accumulation. The final answer might be correct because the agent got lucky, or wrong because one tool call in step fourteen returned a bad result.
And the scoring problem compounds that. With something like MMLU you have multiple choice — ground truth is trivially defined. With an agent doing a software engineering task, you need to decide: do you score the final patch? Do you score the trajectory? Do you use the existing test suite? Each of those choices encodes different assumptions about what "good" means.
So let's get into the benchmarks. Where does the field actually start?
SWE-bench is the canonical starting point. Princeton and the University of Chicago released it in 2023. The setup is elegant: take real GitHub issues from popular open-source Python repositories — Django, Flask, scikit-learn, SymPy — and ask an agent to fix them. The agent navigates a real codebase, writes a patch, and the benchmark runs the existing test suite. Pass the tests, you get credit. Fail, you don't. No partial credit, no LLM judge. It's execution-based.
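To make "execution-based" concrete, the scoring core is basically this — a minimal Python sketch where the repo checkout, patch file, and test command are placeholders, not SWE-bench's actual harness:

```python
# Minimal sketch of execution-based scoring in the SWE-bench style.
# The repo checkout, patch file, and test command are placeholders,
# not the benchmark's real internals.
import subprocess

def run(cmd: str, cwd: str) -> bool:
    """Run a shell command and report whether it exited cleanly."""
    return subprocess.run(cmd, shell=True, cwd=cwd).returncode == 0

def score_task(repo_dir: str, patch_file: str, test_cmd: str) -> int:
    """Binary score: 1 if the agent's patch applies and the suite passes, else 0."""
    if not run(f"git apply {patch_file}", cwd=repo_dir):
        return 0  # patch didn't even apply
    return 1 if run(test_cmd, cwd=repo_dir) else 0

# e.g. score_task("./django_checkout", "agent.patch", "python -m pytest tests/ -x")
```

The important property is that there's no judge anywhere in that loop — the repo's own test suite is the arbiter, which is exactly why a flawed test suite poisons the score.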
Which sounds clean until you realize the test suite itself might be wrong.
That's the first big gotcha, and it's not a minor one. The SWE-bench Verified contamination story is probably the most important development in agent evaluation in the past year. Verified was the gold standard through most of 2025. OpenAI ran an audit and found that every frontier model they tested — GPT-5.2, Claude Opus 4.5, Gemini 3 Flash — could reproduce verbatim gold patches for certain tasks. The models had seen the test data during training. On top of that, fifty-nine percent of the hardest unsolved problems had flawed test cases.
So the test suite was grading on a curve, and the curve was wrong.
The numbers make it undeniable. Claude Opus 4.5 scores eighty point nine percent on Verified. On SWE-bench Pro — which is the replacement benchmark — that same model scores forty-five point nine percent. Same model. Thirty-five points of difference. That gap is almost entirely contamination and task difficulty.
Thirty-five points. That's not noise. That's the benchmark lying to you.
OpenAI has stopped reporting Verified scores entirely and now recommends SWE-bench Pro. Pro was built by Scale AI's SEAL lab — eighteen hundred sixty-five tasks across forty-one repositories in Python, Go, TypeScript, and JavaScript. The average task requires a hundred and seven lines of changes across four point one files. Compare that to the median of four lines in Verified. It uses GPL-licensed and proprietary codebases specifically to resist contamination.
And the Pro leaderboard looks very different from the Verified leaderboard.
Completely different. On Pro with standardized scaffolding, Claude Opus 4.5 is at forty-five point nine percent. Claude Sonnet 4.5 at forty-three point six. Gemini 3 Pro at forty-three point three. GPT-5 High at forty-one point eight. The models are much more tightly clustered, which is what you'd expect from a harder, cleaner benchmark.
There's also SWE-bench Live, which Microsoft keeps refreshing monthly.
Right, that's the contamination-prevention approach through continuous updates. Live has fifteen hundred sixty-five tasks as of mid-2025, supports eight languages including Rust and C-sharp, and the top score is SWE-agent with Claude 3.7 Sonnet at seventeen point six seven percent. Dramatically lower than Verified. That's what genuine difficulty on unseen code looks like.
Okay, one thing I want to flag before we move on — the scaffolding gap. Because this is something that I think trips people up constantly.
This is critical. On SWE-bench Pro, the same Opus 4.5 model scores forty-five point nine percent on SEAL's standardized scaffold, fifty point two percent with Cursor's scaffolding, fifty-one point eight with Augment Code's Auggie, and fifty-five point four with Claude Code. That's a ten-point swing from the same base model depending entirely on how the agent manages context and tool calls.
Which means when you see leaderboard comparisons between different agent systems, you're often comparing scaffolds, not models.
That's exactly the trap. SEAL's value is precisely that it holds the scaffold constant. When someone self-reports a score with their custom harness, you have no idea how much of that score is the model and how much is their engineering.
Let's talk about GAIA, because the original result there was genuinely shocking.
GAIA came out of Meta and HuggingFace in 2023, and the setup was deliberately designed to expose the gap between what frontier models score on academic benchmarks and what they can actually do. Four hundred sixty-six questions that are conceptually simple for humans — things that require multi-step reasoning, web browsing, code execution, handling PDFs and images — but that AI systems consistently fail at. The original result: GPT-4 with plugins scored fifteen percent. Human participants scored ninety-two percent.
Which was the benchmark that made everyone realize that passing the bar exam doesn't mean you can do what an office worker does.
And the leaderboard situation on GAIA is a perfect illustration of the self-reporting problem. On the HuggingFace leaderboard, which accepts self-reported results, the top score is ninety-two point three six percent from an Alibaba Cloud agent. On Princeton's HAL leaderboard, which uses standardized scaffolding and independent verification, the top score is seventy-four point five five percent — the HAL Generalist Agent with Claude Sonnet 4.5.
Eighteen points of difference between self-reported and verified. That's not a rounding error.
And HAL does something that no other major benchmark does: it tracks cost. Claude Opus 4.1 High costs five hundred sixty-two dollars to run on GAIA. o4-mini Low costs seventy-three dollars for similar accuracy. Gemini 2.0 Flash costs seven dollars eighty for thirty-two point seven three percent accuracy. HAL plots a Pareto frontier — accuracy versus cost — and labels which models are actually Pareto optimal. Claude Opus 4.1 High is not Pareto optimal because Claude Sonnet 4.5 achieves higher accuracy at a third of the cost.
This is the thing that drives me slightly crazy about most benchmark discussions. Accuracy in isolation is almost meaningless for production deployment.
Because the question isn't "what's the most accurate agent?" It's "what's the most accurate agent I can afford to run at the scale I need?" And those are very different questions. An agent that costs five hundred dollars per run on a research benchmark might be technically impressive and practically useless.
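If you want that mental model in code, the Pareto check is only a few lines — the agent names and numbers below are made up for illustration, not HAL's actual data:

```python
# Rough sketch of the HAL-style framing: which agents are Pareto optimal
# on accuracy vs. cost? All figures here are invented placeholders.
def pareto_frontier(runs):
    """Keep runs that no other run beats on both accuracy (higher) and cost (lower)."""
    frontier = []
    for name, acc, cost in runs:
        dominated = any(a >= acc and c <= cost and (a > acc or c < cost)
                        for n, a, c in runs if n != name)
        if not dominated:
            frontier.append(name)
    return frontier

runs = [
    ("agent-big",   0.74, 562.0),   # high accuracy, very high cost per run
    ("agent-mid",   0.76, 180.0),   # slightly higher accuracy, a third of the cost
    ("agent-small", 0.33,   7.8),   # cheap, much less accurate
]
print(pareto_frontier(runs))  # ['agent-mid', 'agent-small'] — agent-big is dominated
```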
Latency too. No major public benchmark reports wall-clock time.
Right. HAL uses completion tokens as a proxy, which is imperfect. But some o3-based agents on GAIA take forty-five minutes to complete a task. That might be technically correct and operationally a non-starter. The benchmark doesn't tell you that.
Okay, let's move through some of the other major benchmarks quickly. AgentBench?
AgentBench from Tsinghua is interesting because it's explicitly multi-domain. Eight distinct environments — bash commands, SQL queries, knowledge graph navigation, a digital card game, lateral thinking puzzles, household simulation, web shopping, and web browsing. The original 2023 results showed GPT-4 at a score of four point two seven, dramatically ahead of everything else. Claude 2 at three point five one. And then open-source models mostly below one point zero.
The gap between commercial and open-source was enormous at that point.
The main failure modes they identified were poor long-term planning and inability to maintain context across many turns. What's useful about AgentBench is the cross-domain insight: an agent that excels in web tasks often fails badly in code and database tasks. Never characterize an agent's overall capability from a single domain benchmark.
TaskBench is one I want to make sure we cover because it's testing something different from the others — it's not "did you complete the task" but "did you decompose and sequence the tools correctly."
TaskBench from Microsoft Research Asia evaluates the full pipeline of task automation: decomposition, tool selection, and parameter prediction. It's built around what they call a Tool Graph — a formal representation of tools and their dependencies. The key insight is that they generated the dataset using a Back-Instruct method: start from a subgraph of the tool graph, then use GPT-4 to generate a natural language instruction that would require exactly those tools in that dependency order. So you're testing whether models can invert that process.
And the finding about edge prediction versus node prediction is striking.
Edge prediction — understanding tool dependencies, the sequencing — is consistently about twenty percent harder than node prediction, which is just identifying the right tools. GPT-4 scored eighty-one point five four on node F1 but fifty-four point seven on edge F1. Models that can select the right tools often can't correctly sequence them. And code pre-training improves tool prediction by four point four five percent and parameter prediction by twelve point seven six percent, which tells you something about why coding-focused models tend to do better on agentic tasks generally.
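A toy example makes the node-versus-edge distinction obvious — the tools and the gold graph here are invented, not from the TaskBench dataset:

```python
# Sketch of the node-F1 vs edge-F1 distinction TaskBench draws.
# Tool names and the gold dependency graph are invented for illustration.
def f1(pred: set, gold: set) -> float:
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_nodes = {"search", "download", "summarize"}
gold_edges = {("search", "download"), ("download", "summarize")}

pred_nodes = {"search", "download", "summarize"}                    # right tools...
pred_edges = {("search", "summarize"), ("download", "summarize")}   # ...wrong sequencing

print(f1(pred_nodes, gold_nodes))  # 1.0 — node prediction looks perfect
print(f1(pred_edges, gold_edges))  # 0.5 — edge prediction exposes the ordering error
```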
Let's talk about WebArena and OSWorld because those are the benchmarks that feel most directly tied to the stuff people are actually trying to build.
WebArena from CMU gives you a self-hosted web environment — replicas of an e-commerce site, a Reddit-like forum, a GitLab-like coding platform, a content management system, and a Wikipedia-style knowledge base. Eight hundred twelve tasks, binary pass/fail, evaluated programmatically without an LLM judge. The human baseline is seventy-eight percent. GPT-4's baseline score in 2023 was fourteen point nine percent. The current top on the steel.dev leaderboard is DeepSeek v3.2 at seventy-four point three percent, third-party verified.
That's a remarkable trajectory in about two years.
And then WebVoyager, which uses live websites and a GPT-4V judge, has self-reported scores from H Company's Surfer 2 at ninety-seven point one percent. But the caveat from steel.dev is important here: a seventy percent on WebArena and a seventy percent on WebVoyager are not equivalent. Different tasks, different environments, different graders, different difficulty levels. You can't compare them directly.
OSWorld is the desktop automation one.
OSWorld gives you a full virtual computer environment — Ubuntu and Windows — with three hundred sixty-nine tasks ranging from single-app operations to multi-app workflows. Human baseline is seventy-two percent. GPT-4o's baseline was seventeen point eight percent. The current top is GPT-5.4 self-reporting seventy-five percent, Claude Opus 4.6 at seventy-two point seven. Agent S3 from Simular AI is at sixty-nine point nine percent third-party verified and open source, which is notable.
There's also TAU-bench for customer service scenarios, which feels very relevant for anyone building support agents.
TAU-bench from Sierra Research simulates multi-turn customer service with domain-specific API tools. The HAL verified top score on the airline domain is fifty-six percent with o4-mini High at about eleven dollars per run, which is Pareto optimal. Claude 3.7 Sonnet and Opus 4.1 both reach fifty-two percent but at much higher cost. For anyone building a customer service agent, this is probably the benchmark to pay closest attention to because the task structure most closely matches what they're actually deploying.
Okay, I want to get into the evaluation approaches for people actually building agents, because the public benchmarks are useful for understanding where the field is, but they don't directly answer "is my agent getting better?"
Right, and this is where the practical toolkit diverges significantly from the leaderboard world. Let's start with LLM-as-judge because that's where most teams are actually operating. The evolution here has gone through roughly four generations. Traditional metrics like ROUGE and BLEU have poor correlation with quality for open-ended tasks. Single LLM-as-judge using GPT-4 to score outputs gets you Spearman correlation of about zero point seven to zero point nine with human preferences, which sounds good until you realize the failure modes.
Length bias being the big one.
Length bias, self-model bias — judges tend to favor outputs from architecturally similar models — style bias, and adversarial vulnerability. You can craft a response that's essentially nonsense but that exploits the evaluation prompt to get a high score. The multi-agent judge frameworks try to address this. ChatEval uses multiple LLM agents with distinct personas debating response quality and improved correlation with human judgments by ten to sixteen percent over single-agent prompting. CourtEval uses a courtroom structure — judge, prosecutor, defense attorney — where the judge revises the score after hearing both sides.
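For anyone rolling their own judge, the cheapest defense against ordering effects is to ask twice with the candidates swapped and only accept a verdict that survives the swap. A minimal sketch, assuming a placeholder call_llm client you'd replace with your own:

```python
# Minimal pairwise LLM-as-judge sketch with a position-swap check.
# `call_llm` is a placeholder, and the prompt and vote parsing are simplified.
JUDGE_PROMPT = """You are grading two agent responses to the same task.
Task: {task}
Response A: {a}
Response B: {b}
Reply with exactly "A" or "B" for the better response. Ignore length and style;
judge only correctness and completeness."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual model client here.
    raise NotImplementedError

def pairwise_judge(task: str, resp_1: str, resp_2: str) -> str:
    # Ask twice with the order swapped to control for position bias.
    first = call_llm(JUDGE_PROMPT.format(task=task, a=resp_1, b=resp_2)).strip()
    second = call_llm(JUDGE_PROMPT.format(task=task, a=resp_2, b=resp_1)).strip()
    if first == "A" and second == "B":
        return "resp_1"
    if first == "B" and second == "A":
        return "resp_2"
    return "tie"  # the verdict flipped with order, so don't trust either answer
```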
The Agent-as-a-Judge result from ICML 2025 is the one that really jumped out at me.
This is from Zhuge et al., and the idea is to use an agent to evaluate another agent's entire trajectory, not just the final output. They applied it to DevAI, which is fifty-five realistic AI development tasks. The agent judge's decisions differed from human majority vote only zero point three percent of the time. A single LLM judge disagreed thirty-one percent of the time. The agent judge even exceeded individual human evaluators in consistency.
Zero point three versus thirty-one. That's the difference between trajectory evaluation and output-only evaluation.
And it points to why output-only evaluation is often insufficient for agents. If your agent makes ten tool calls and the final answer happens to be correct, output-only evaluation will rate that highly. But if step four used the wrong tool and step seven hallucinated a result that happened to cancel out step four's error, you have no idea your agent is doing something broken. Trajectory evaluation catches that.
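Here's a stripped-down illustration of that gap — the per-step checks are toy rules; a real harness would use domain validators or an agent judge reading the whole trace:

```python
# Sketch of why trajectory scoring catches what output-only scoring misses.
# The Step fields and the example trace are invented for illustration.
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    ok: bool          # did the tool call succeed / return something sane?
    grounded: bool    # is the agent's claim supported by the tool output?

def outcome_only(final_correct: bool, steps: list[Step]) -> bool:
    return final_correct  # a broken step four is invisible here

def trajectory_score(final_correct: bool, steps: list[Step]) -> bool:
    return final_correct and all(s.ok and s.grounded for s in steps)

steps = [
    Step("search", ok=True, grounded=True),
    Step("calculator", ok=False, grounded=True),   # bad tool call
    Step("report", ok=True, grounded=False),       # claim not backed by any output
]
print(outcome_only(True, steps))      # True — the final answer happens to be right
print(trajectory_score(True, steps))  # False — the path was broken twice
```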
LangChain's framework breaks this into three dimensions, right?
Grounding and context use — did the agent retrieve and reason over the right information, measured through faithfulness, context precision, and context recall. User experience quality — did the conversation achieve the user's goal, measured through topic relevancy and outcome success. And security and safety — did the agent stay within policy boundaries, checking for prompt injection and policy compliance. That third dimension is one that a lot of teams ignore until something goes wrong in production.
Let's talk about the custom harnesses because this is where most teams are going to spend their actual evaluation budget.
Braintrust and LangSmith are the two dominant platforms right now. Braintrust's key differentiator is their Loop feature — an AI assistant that writes custom scorers from natural language descriptions. You describe what you want to measure, it generates the scorer. They also have remote evals in playgrounds for no-code agent testing. The pricing is free tier for small teams — five users, a million spans per month, ten thousand scores — and then two hundred forty-nine dollars a month for pro.
The impact numbers they cite are striking. Thirty percent accuracy improvements within weeks, one customer service app reducing escalations by three thousand per day after systematic evaluation.
Those are self-reported, so take them with appropriate skepticism. But the underlying mechanism is real. When you're running production traffic through an evaluation framework and routing failures to domain experts for annotation, you're creating a feedback loop that genuinely accelerates improvement. The LangSmith approach is similar — native LangChain and LangGraph integration, automatic tracing, annotation queues, online evaluations running judges on sampled production traffic.
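Under all the product features, the bones of a custom harness are small. A rough sketch with placeholder data, an inline fake agent, and a single exact-match scorer — not the Braintrust or LangSmith API, just the shape of the loop they wrap:

```python
# Bare-bones eval harness of the kind Braintrust or LangSmith wrap for you.
# The dataset rows, the agent callable, and the scorer are all placeholders.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset, agent, scorers):
    results = []
    for row in dataset:
        output = agent(row["input"])
        scores = {name: fn(output, row["expected"]) for name, fn in scorers.items()}
        results.append({"input": row["input"], "output": output, **scores})
    return results

dataset = [{"input": "refund order #123", "expected": "refund_issued"}]
report = run_eval(dataset, agent=lambda x: "refund_issued", scorers={"exact": exact_match})
print(report)
```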
The human evaluation piece is worth dwelling on for a moment because I think there's a temptation to think you can automate your way out of it entirely.
You can't. Human evaluation remains the gold standard. The GDPval project from OpenAI is the most ambitious recent effort — thirteen hundred twenty tasks across forty-four occupations created by domain experts with an average of fourteen years of experience. Blind review by industry experts comparing model outputs to human professional outputs. The key finding: Claude Opus 4.1's outputs were rated equal to or better than those of human professionals on just under half of tasks. Models complete tasks about a hundred times faster and a hundred times cheaper than human experts, though that doesn't account for oversight and iteration costs.
The hundred times faster and cheaper number is the one that'll get quoted in board meetings. The oversight and iteration costs caveat is the one that'll get ignored.
The practical human evaluation protocol for agent builders: annotation queues where complex traces get routed to domain experts, calibration sessions before annotation starts to align reviewers on scoring criteria, multi-annotator consensus using majority vote of three to five experts as ground truth, and blind review mixing model outputs with human outputs to prevent rater bias. And critically — the production-to-dataset loop. When monitoring identifies a failure in production, convert that trace into a test case with the corrected trajectory as ground truth. Your production failures are your best source of evaluation data.
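In code, that loop and the consensus step might look something like this sketch — the trace fields and labels are assumptions, not a real schema:

```python
# Sketch of the production-to-dataset loop: a failing trace plus an expert-corrected
# trajectory becomes a regression test. Field names are assumptions, not a real schema.
from collections import Counter

def trace_to_test_case(trace: dict, corrected_trajectory: list[dict]) -> dict:
    """Turn a production failure into an eval-suite entry with expert ground truth."""
    return {
        "input": trace["user_request"],
        "expected_trajectory": corrected_trajectory,
        "source": "production_failure",
        "trace_id": trace["id"],
    }

def consensus_label(annotations: list[str]) -> str:
    """Majority vote across three to five expert annotators."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else "needs_adjudication"

print(consensus_label(["pass", "fail", "fail"]))  # fail
```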
So if I'm an agent builder trying to answer Daniel's core question — how do I know if version two is better than version one — what's the actual workflow?
A few things. First, don't rely on a single public benchmark. AgentBench showed that cross-domain performance is highly variable. Pick two or three benchmarks that reflect your actual use case. If you're building a coding agent, SWE-bench Pro with standardized scaffolding is your primary signal. If you're building a web agent, WebArena with third-party verification. If you're building a general assistant, GAIA via HAL.
And on the cost dimension — always include cost in your evaluation. Not just accuracy.
Always. HAL's Pareto frontier approach is the right mental model. The question isn't "which agent scores higher?" It's "which agent is on the Pareto frontier of accuracy and cost for my deployment constraints?" Those are often different agents.
The scaffolding point matters here too. If you're comparing version one to version two of your own agent, you can hold the scaffold constant. But if you're comparing your agent to a competitor's published score, you often have no idea what scaffold they used.
Which is why third-party verified scores matter. Steel.dev distinguishes between self-reported and third-party verified scores throughout their leaderboard, and the gap is consistently large. On GAIA, eighteen points. On SWE-bench Pro, GPT-5.3-Codex CLI self-reports fifty-seven percent while the SEAL standardized score for GPT-5.2 Codex is forty-one percent. Sixteen points of scaffold and reporting methodology.
There's also the question of what you're actually scoring. Binary pass/fail misses partial progress.
This is a real limitation of SWE-bench's approach. An agent that fixes nine of ten bugs in a task scores the same as one that fixes zero. You can't measure incremental improvement in the binary regime. SWE-bench Pro's three-stage human augmentation helps — they add problem statement, requirements, and interface specification — but the fundamental binary nature remains. The emerging solution is process reward models and trajectory scoring, where you score the path not just the outcome. That's where the Agent-as-a-Judge work is heading.
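The difference between the two regimes is one line of aggregation. A toy sketch, with invented test names:

```python
# Sketch of partial credit vs. binary scoring over a task's test results.
# Test names are invented; the point is the difference between the aggregations.
def binary_score(results: dict[str, bool]) -> float:
    return 1.0 if all(results.values()) else 0.0

def partial_credit(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)

results = {f"test_{i}": (i != 9) for i in range(10)}  # fixes 9 of 10 failing tests
print(binary_score(results))    # 0.0 — same as fixing nothing
print(partial_credit(results))  # 0.9 — incremental progress is visible
```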
What about ARC-AGI-2 and some of the other benchmarks designed to be intentionally hard?
ARC-AGI-2 from 2025 is worth mentioning because the scores are deliberately low — o3 at high compute gets about four percent, Gemini 2.5 Pro at about two percent. The point isn't to measure absolute capability, it's to maintain a benchmark that resists memorization and signals where genuine reasoning improvement is happening. Low scores on ARC-AGI-2 aren't embarrassing — they're the point. It's a leading indicator, not a performance metric.
There's also BrowseComp from OpenAI for deep web research.
BrowseComp is interesting because it's explicitly designed so most agents fail. Multi-hop web research requiring synthesizing information across many pages. Kimi K2 Thinking from Moonshot AI is at sixty point two percent, OpenAI Deep Research at fifty-one point five, and then a dramatic dropoff — WebSailor-72B at twelve percent, GPT-4o with browsing at one point nine percent. The distribution tells you this is measuring something qualitatively different from standard web browsing tasks.
Okay, practical takeaways. If someone is building agents today and wants to set up a serious evaluation practice, what's the actual guidance?
Start with the benchmark selection question. Be honest about what domain you're in. SWE-bench Pro for code agents, GAIA via HAL for general assistants, WebArena for web agents, TAU-bench for customer service. Use HAL-verified scores, not self-reported scores. Always track cost alongside accuracy.
For your own internal evals, the production-to-dataset loop is probably the highest-leverage thing most teams aren't doing.
Every production failure is a test case waiting to happen. When your agent fails in the wild, capture that trace, have a domain expert annotate the corrected trajectory, and add it to your eval suite. That's how you build evaluation coverage that's actually calibrated to your real failure modes rather than the failure modes that benchmark authors happened to anticipate.
And the scaffolding discipline matters enormously. If you're running A/B tests between agent versions, hold the scaffold constant. If you change the model and the scaffold at the same time, you have no idea which change drove the improvement.
The Scale AI failure mode analysis on SWE-bench Pro is instructive here. For the strongest models, context overflow is the dominant failure — thirty-five point six percent of Sonnet 4 failures. For smaller models, tool-use inefficiency dominates at forty-two percent. Semantic understanding failures account for thirty-five point nine percent of Opus 4.1 failures. Knowing which failure mode is your bottleneck tells you where to invest — better context management, better tool selection logic, or better reasoning.
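A rough way to operationalize that on your own traces — the categories and thresholds below are assumptions for illustration, not Scale AI's taxonomy:

```python
# Sketch of a failure-mode breakdown: tag each failed trace with a coarse category
# and count where the bottleneck is. Categories and thresholds are assumptions.
from collections import Counter

def classify_failure(trace: dict) -> str:
    if trace["prompt_tokens"] > trace["context_window"]:
        return "context_overflow"
    if trace["tool_errors"] / max(trace["tool_calls"], 1) > 0.3:
        return "tool_use_inefficiency"
    return "semantic_misunderstanding"

failed_traces = [
    {"prompt_tokens": 210_000, "context_window": 200_000, "tool_calls": 40, "tool_errors": 2},
    {"prompt_tokens": 80_000,  "context_window": 200_000, "tool_calls": 30, "tool_errors": 15},
]
print(Counter(classify_failure(t) for t in failed_traces))
# Counter({'context_overflow': 1, 'tool_use_inefficiency': 1})
```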
And don't sleep on the latency dimension even though the benchmarks mostly ignore it.
Completion tokens as a proxy is imperfect but better than nothing. HAL tracks this. If you're deploying an agent where a user is waiting for a response, a forty-five minute completion time that scores slightly higher on accuracy is almost certainly the wrong trade-off.
The field is in a genuinely interesting place right now. The benchmarks are getting more rigorous — the move from Verified to Pro on SWE-bench is a good example — but the gap between benchmark performance and what you'd actually want to deploy is still significant.
The CUB benchmark, which tests a hundred and six end-to-end workflows across seven industries, has a top score of ten point four percent. Ten percent. For end-to-end real-world workflows, even the best agents are failing ninety percent of the time. That's not a reason to panic — it's a reason to be precise about what you're deploying and what your fallback mechanisms are.
Alright. Big thanks as always to our producer Hilbert Flumingtop. And a genuine thank you to Modal for providing the GPU credits that keep this whole operation running. This has been My Weird Prompts. If you want to find us, search for My Weird Prompts on Telegram and you'll get notified when new episodes drop. We'll see you next time.
See you then.