So Daniel sent us this one, and it's a question I think a lot of people building with AI agents are quietly wrestling with. He's asking what actually distinguishes one agentic harness from another, and where the real engineering creativity lives in building them. The framing is this: LangGraph, CrewAI, AutoGen, Semantic Kernel, Claude Code — they all orchestrate LLM calls with tools to accomplish multi-step tasks. But underneath, they encode radically different philosophies about how agents should think and operate. And Daniel's specific provocation is whether the future of agentic development is less about picking the framework and more about remixing your own stack of opinions on top of a flexible base. There's a lot here.
There really is. And I think the entry point that makes everything else click is a framing LangChain published in March: Agent equals model plus harness. If you're not the model, you're the harness. That's it. That's the whole thesis.
Which sounds deceptively simple.
It does, but follow the logic. The harness is every piece of code, configuration, and execution logic that isn't the model itself — system prompts, tool definitions, orchestration, state management, memory, retry strategies, context management, guardrails. Two developers using the same underlying model with different harness designs produce wildly different results. The model is increasingly the commodity. The harness is where engineering taste shows up.
And that claim about commoditization — is that actually borne out in the data?
In February of this year, all three major providers — Anthropic, OpenAI, Google — hit near-parity on SWE-bench Verified, scoring within a percentage point of each other. Sajal Sharma gave a talk at Yale where he put it bluntly: swapping models without rethinking the harness rarely produces proportional gains. The performance ceiling you're hitting is almost never the model. It's the environment you've put the model in.
So the model is the engine and the harness is... everything else that makes the car driveable.
That's actually Evangelos Pappas's framing almost verbatim. The industry spent years arguing about who had the best engine. Almost nobody was building a car that could stay on the road.
Alright, by the way — today's script is coming to us courtesy of Claude Sonnet 4.6, which is a fun little detail given we're about to spend twenty-five minutes talking about agent architecture. The AI writing our podcast is itself an example of a model sitting inside someone's harness. Anyway. Let's get into the frameworks, because I think the philosophy differences here are genuinely interesting and not obvious from the outside.
Right. So the way I'd frame these five frameworks is that each one enforces a different mental model on the developer. Not just technically — philosophically. LangGraph forces you to think in state machines. You're modeling your agent as a directed graph where nodes represent actions — LLM calls, tool executions, conditional routing — and edges define control flow. The graph supports cycles, so agents can loop, retry, and self-correct. And crucially, there's built-in human-in-the-loop checkpointing — you can interrupt the graph at any node and inject human judgment.
Which sounds powerful but also like a lot of surface area to manage.
The learning curve is real. And there's a genuine over-engineering risk for simple use cases. But the payoff is that complex multi-step tasks with explicit branching and error handling become expressible in a way that's hard to achieve otherwise. LangSmith gives you production observability on top of that.
Then CrewAI is the team dynamics one.
Team dynamics thinking, yes. Each agent has a role, a goal, and a backstory. A manager agent can delegate and coordinate. The framework encourages you to think about specialization and collaboration rather than about graph topology. What's interesting is that CrewAI's abstraction is closer to how humans naturally think about dividing work — you have a researcher, an analyst, a writer — rather than thinking about state transitions.
Though that persona-first approach presumably has costs.
A five-agent CrewAI crew costs roughly five times what a single LangChain agent costs per task. The multi-agent message passing overhead is real. And for non-standard patterns, the framework opinions can feel constraining. You're paying for the abstraction in both dollars and flexibility.
AutoGen is Microsoft Research, and that's the conversation-centric one.
Conversation-centric, and importantly asynchronous. Agents communicate through structured message passing. GroupChat enables multi-agent discussions. What distinguishes AutoGen is that human-in-the-loop is a first-class pattern — humans are just another participant in the conversation, not an afterthought bolted on. Code execution sandboxing is built in, Docker and local. And the Azure ecosystem integration is deep, which matters for enterprise shops already in that stack.
Semantic Kernel is also Microsoft, right? So you have two Microsoft frameworks with pretty different philosophies.
Which is itself interesting. Semantic Kernel is enterprise-first in a way AutoGen isn't — .NET and C# and Java support, dependency injection, middleware, telemetry. The mental model it enforces is skills and plugins. You define prompt templates as skills, code as plugins, and an AI-powered planner automatically decomposes complex goals into action sequences. The pitch is that you can embed AI capabilities into existing enterprise codebases without rearchitecting everything.
But the planner reliability is the problem.
Complex plans can hallucinate steps. The abstraction layer is heavier than the others. And the community is smaller, which means when you hit an edge case you're more on your own.
And then Claude Code, which is the philosophical outlier in this group.
It really is. Claude Code's approach is what I'd call simplicity thinking. The model controls the loop. The harness provides the environment. The core mechanism is a while loop — if the model produces a tool call, execute it and feed results back; if not, stop and wait for user input. No explicit termination tool. No critic pattern. No sophisticated memory system baked in.
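The loop described here is simple enough to sketch in a few lines of Python. This is an illustration of the pattern, not Claude Code's actual source; `model` and the tool registry are hypothetical stand-ins.

```python
def agent_loop(model, tools, user_message):
    """Minimal model-in-control agent loop: if the model emits a tool
    call, execute it and feed the result back; otherwise stop and
    return control to the user. A sketch of the pattern only."""
    history = [{"role": "user", "content": user_message}]
    while True:
        response = model(history)  # hypothetical model call returning a dict
        history.append({"role": "assistant", "content": response})
        tool_call = response.get("tool_call")
        if tool_call is None:
            # No tool call: the turn is over, hand back to the user.
            return response["text"]
        # Execute the requested tool and feed the result back in.
        result = tools[tool_call["name"]](**tool_call["args"])
        history.append({"role": "tool", "content": result})
```

Note there is no explicit termination tool: stopping is just the absence of a tool call, exactly as described.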
How many tools does it actually expose?
Fourteen total. Four CLI tools — bash, glob, grep, ls. Six file operations — read, write, edit, multi-edit, notebook read and edit. Two web tools — search and fetch. Two control flow tools — TodoWrite and Task. That's it. And there's a specific reason the first tool call is almost always TodoWrite — the system injects the current TODO list after key steps specifically to fight what the "Lost in the Middle" research identified: LLMs attend strongly to the beginning and end of context but poorly to the middle. The TODO list keeps the objective visible.
So the todo list is doing real architectural work, not just being a nice feature.
It's a harness design decision masquerading as a productivity feature. And the security model is interesting too — Claude Haiku does the security checks before bash commands execute. The main model is too expensive to run on every permission decision, so Haiku gives you structured output on whether user approval is needed. Conscious speed-accuracy tradeoff.
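The cheap-model permission gate can be sketched as follows. The prompt and JSON schema here are invented for illustration; Anthropic's actual check is not public in this form.

```python
import json

def needs_approval(command: str, cheap_model) -> bool:
    """Ask a small, fast model for a structured verdict before a shell
    command runs. `cheap_model` is a hypothetical callable that takes a
    prompt and returns a JSON string; the schema is invented here."""
    prompt = (
        "Does this shell command need explicit user approval before "
        'running? Answer as JSON: {"approve": true} or {"approve": false}.\n'
        f"Command: {command}"
    )
    verdict = json.loads(cheap_model(prompt))
    return verdict["approve"]
```

The design point is that the gate runs on every bash call, so it has to be cheap and fast even at the cost of some accuracy, which is exactly the speed-accuracy tradeoff described above.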
I want to dig into the data that actually proves the harness matters more than the model, because there are some numbers here that are pretty striking. The Vercel case especially.
The Vercel case is the one I keep coming back to. They had a text-to-SQL agent — they called it d0 — with fifteen specialized tools. Things like GetEntityJoins, LoadCatalog, RecallContext, SearchSchema, GenerateAnalysisPlan. Very thoughtfully designed, very specific. They deleted over eighty percent of the tools. New architecture: two tools — ExecuteCommand, which is bash in a Vercel sandbox, and ExecuteSQL.
And then what happened?
Execution time dropped from two hundred and seventy-four seconds to seventy-seven seconds. Three and a half times faster. Success rate went from eighty percent to one hundred percent. Token usage dropped thirty-seven percent. Steps dropped forty-two percent. And the worst case under the old system — seven hundred and twenty-four seconds, a hundred and forty-five thousand tokens, a hundred steps — it still failed. The new system didn't fail.
Why does that happen? Because intuitively, more specialized tools sounds like it should be better.
The mechanism is attention saturation. Each tool schema is roughly one to two kilobytes of JSON. Fifteen tool schemas means you're putting around twenty kilobytes of tool definitions into the context before the actual task even appears. That competes with the task tokens. The model is spending more attention choosing between tools than doing the work. General-purpose tools like bash map directly to how models are trained — they've seen millions of bash commands. They haven't seen your custom GetEntityJoins schema.
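The arithmetic behind attention saturation is easy to make concrete. Using the rough figures above, plus the common back-of-envelope approximation of about four bytes per token, which is an assumption, not a measured value:

```python
def schema_overhead_tokens(num_tools, bytes_per_schema=1500, bytes_per_token=4):
    """Rough context cost of tool definitions, using the figures from
    the discussion: ~1-2 KB of JSON per schema (1.5 KB assumed here)
    and ~4 bytes per token as a crude approximation."""
    total_bytes = num_tools * bytes_per_schema
    return total_bytes // bytes_per_token

# Fifteen specialized tools put roughly 22 KB of schemas into context
# before the task appears; two general tools put in roughly 3 KB.
```

Thousands of tokens of schema text sit at the front of every request, competing with the actual task for attention.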
So the counterintuitive design principle is: strip your agent to bash plus file access, run your eval suite, see if performance improves. If it does, your specialized tools were net-negative.
That's exactly the testable hypothesis. And the LangChain terminal bench experiment reinforces it from a different angle. Their coding agent scored fifty-two point eight percent on Terminal Bench 2.0. Same model throughout — they only changed the harness. They added a build-and-self-verify loop, a pre-completion checklist middleware that intercepts the agent before it exits and forces a verification pass, a local context middleware that runs on agent start to map directory structure, and a loop detection middleware that tracks per-file edit counts and injects a "consider reconsidering your approach" prompt after N edits to the same file.
And the result?
Sixty-six point five percent. Moved from outside the top thirty to top five on the leaderboard. Same model. Different harness.
That's a thirteen-point-seven percentage point jump from middleware.
And the framing they used for it is good: the goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about. The model has capability that isn't reliably accessible. The harness creates the conditions under which that capability expresses consistently.
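The loop-detection middleware they describe can be sketched as a small stateful wrapper that counts per-file edits and injects a nudge past a threshold. The class and message here are illustrative, not LangChain's actual middleware API.

```python
from collections import Counter

class LoopDetector:
    """Tracks per-file edit counts and produces a reconsideration
    prompt once the same file has been edited N times. A sketch of
    the idea, not LangChain's actual implementation."""

    def __init__(self, threshold=3):
        self.edits = Counter()
        self.threshold = threshold

    def on_edit(self, path):
        """Call after each file edit. Returns a prompt string to
        inject into context, or None if no loop is suspected."""
        self.edits[path] += 1
        if self.edits[path] >= self.threshold:
            self.edits[path] = 0  # reset so the nudge can fire again later
            return (f"You have edited {path} {self.threshold} times. "
                    "Consider reconsidering your approach.")
        return None
```

Each middleware in their stack follows this shape: observe the agent's behavior from outside, then steer it with injected text rather than hard control flow.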
The APEX-Agents benchmark from Mercor is the one that makes me slightly terrified, though.
For good reason. Four hundred and eighty tasks across thirty-three worlds simulating real professional work — investment banking, consulting, law. Average of a hundred and sixty-six files per world. Tasks simulating five to ten day client engagements. Best frontier model pass at one: twenty-four percent. With eight attempts, best model climbed to around forty percent. Zero-score rates — agent failed every rubric criterion — between forty and sixty-two percent across configurations. Timeout rates, meaning exceeding two hundred and fifty steps without finishing, up to thirty percent for some models.
And critically, these weren't knowledge failures.
The models had the information. The failures were execution and orchestration problems. Agents getting lost after too many steps, looping back to failed approaches, losing track of objectives mid-task. Which is a precise description of harness failures, not model failures. The model knew what to do. The system around it didn't keep it on track.
Let's talk about Manus, because the context management story there is remarkable and the four rebuilds detail is the kind of thing that sounds like hyperbole until you understand what they were actually learning each time.
Each rebuild followed the same pattern: removing complexity that seemed necessary but was degrading performance. They removed a complex document retrieval system. They removed fancy routing logic between specialized sub-agents. They removed specialized tools for each operation. What they kept: filesystem-as-memory, the todo-list mechanism, a context compaction hierarchy — raw context, then compaction, then summarization — and KV-cache optimization. That last one is financial: cached tokens cost thirty cents per million tokens versus three dollars per million uncached. Ten times cheaper.
And their agents average fifty tool calls per task. So that cost difference compounds fast.
Their input-to-output ratio is approximately a hundred to one. Meaning context management is directly cost management. Every architectural decision about what stays in the context window is also a financial decision. Meta apparently agreed with their conclusions enough to acquire them for around two billion dollars in December.
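The cost math is worth writing out. Using the prices quoted above, three dollars per million uncached input tokens versus thirty cents cached, and a hypothetical task shape:

```python
def task_input_cost(input_tokens, cache_hit_rate,
                    uncached_per_m=3.00, cached_per_m=0.30):
    """Input-token cost for one task at a given KV-cache hit rate,
    using the per-million-token prices quoted in the discussion."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (uncached * uncached_per_m + cached * cached_per_m) / 1_000_000

# A task that reads 1M input tokens costs $3.00 with no cache reuse,
# but $0.57 if 90% of the prefix hits the KV cache.
```

At a hundred-to-one input-to-output ratio and fifty tool calls per task, almost all spend is on re-read context, so keeping the prefix stable enough to cache is directly a cost-control decision.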
Which is a pretty strong market signal about where the value actually sits.
It is. And it connects to the broader taxonomy Sajal Sharma laid out. He distinguishes four layers: raw API calls, where you control everything; frameworks like LangChain and CrewAI, where you're making architectural decisions about memory, tools, and orchestration; the runtime layer, which is LangGraph, handling execution, state management, and durability; and then harnesses — Claude Code, LangChain's Deep Agents — which are maximally opinionated. Memory, context management, agent loop, tool access, safety checks all baked in.
And the LangChain ecosystem is interesting because it spans three of those four layers simultaneously. LangChain the framework, LangGraph the runtime, Deep Agents the harness.
Which makes it a useful lens for thinking about what the "remix" question actually means in practice. Because when Daniel asks whether the future is less about picking the framework and more about assembling your own stack of opinions — I think the answer is yes, and the NLAH paper from Tsinghua University is the scientific basis for why.
Walk me through that.
The paper treats harness modules as composable and ablatable objects. They ran an ablation study on SWE-bench Verified. File-backed state added one point six percent. Evidence-backed answering added one point six percent. A verifier actually subtracted point eight percent — more structure hurt. Multi-candidate search subtracted two point four percent. But self-evolution, which tightens the solve loop, added four point eight percent. Dynamic orchestration added nothing.
So more structure doesn't reliably mean better performance.
The modules that help are the ones that tighten the path from intermediate behavior to the evaluator's acceptance condition. The ones that add branching and optionality — multi-candidate search, complex verification — often hurt because they add tokens and decision overhead without proportional benefit. Which maps exactly to the Vercel finding. The harness components that work are the ones that keep the agent on the critical path, not the ones that give it more options.
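The ablation method itself is easy to operationalize: rerun your eval suite with each module toggled off and measure the delta against the full configuration. A sketch, where `run_eval_suite` is a hypothetical callable that takes the set of enabled modules and returns a pass rate:

```python
def ablation_report(modules, run_eval_suite):
    """For each harness module, measure the pass-rate delta from
    removing it. A positive delta means the module was helping;
    negative means it was hurting. Sketch of the methodology only."""
    baseline = run_eval_suite(set(modules))
    deltas = {}
    for m in modules:
        ablated = run_eval_suite(set(modules) - {m})
        deltas[m] = baseline - ablated
    return deltas
```

This is the whole discipline in miniature: a module earns its place in the harness by its measured delta, not by how sophisticated it looks.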
This is where the opinionated-versus-unopinionated tension gets interesting to me. Because Rails made web development accessible to millions precisely by making strong choices. Next.js made React development accessible by prescribing file-based routing. The framework opinion is a feature, not a limitation — until you're the person who needs to do something the framework didn't anticipate.
The framing I find useful is Martin Fowler's — he describes the harness as a cybernetic governor. You have feedforward controls he calls guides: AGENTS.md files, coding conventions, architecture docs — things that anticipate the agent's behavior and steer it before it acts. And you have feedback controls he calls sensors: linters, test runners, AI code review — things that observe after the agent acts and help it self-correct.
And he distinguishes between computational controls and inferential controls.
Computational controls are deterministic and fast — tests, linters, type checkers, running in milliseconds. Inferential controls are slower, more expensive, non-deterministic — semantic analysis, AI code review — but they allow semantic judgment that computational controls can't make. An opinionated harness makes choices about which controls to include, when to run them, and what to do when they fire.
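The two tiers can be sketched as a single dispatch function: run the fast deterministic checks first, and spend on the slow model-based checks only if those pass. This is an illustration of the idea, not an API Fowler defines.

```python
def run_controls(artifact, computational, inferential):
    """Run fast deterministic checks (tests, linters, type checkers)
    before slow, non-deterministic ones (AI review). Each check is a
    callable returning (ok, message). Sketch of the two-tier idea."""
    for check in computational:   # milliseconds, deterministic
        ok, msg = check(artifact)
        if not ok:
            return ("computational", msg)
    for check in inferential:     # slower, costlier, semantic judgment
        ok, msg = check(artifact)
        if not ok:
            return ("inferential", msg)
    return ("pass", None)
```

The ordering encodes an opinion: never pay for inference to catch what a linter would have caught for free.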
And his concept of harnessability is the one that I think has the most practical implications for teams right now.
Because it creates a new category of technical debt. Not every codebase is equally amenable to harnessing. Strongly typed languages, clearly definable module boundaries, frameworks like Spring — these increase harnessability. Legacy codebases with accumulated debt face the hardest problem: the harness is most needed precisely where it's hardest to build. You could call it harness debt. And it's real.
The security surface expansion is also something that doesn't get talked about enough in the framework comparison conversations.
In a workflow-based architecture, the attack surface is isolated within individual nodes. In a harness where a single orchestrating agent has access to the entire environment — file system, code execution, external APIs — a successful prompt injection has a much larger blast radius. Claude Code's Haiku-based security checks are a specific, observable design response to this. Before a bash command executes, Haiku evaluates whether user approval is needed. It's not foolproof, but it's an explicit acknowledgment that the "give it bash" approach has a security surface that has to be managed architecturally.
So what does a well-crafted opinionated remix actually look like in practice? Because I think that's the question builders are actually asking when they read all of this research.
The convergence across OpenAI Codex, Claude Code, and Manus is instructive — three independent architectures arriving at the same principles. Fewer general-purpose tools over many specialized ones. External state management — git, progress files, filesystem memory. Error retention, meaning you keep stack traces and failed approaches in context so the agent doesn't retry the same thing. Context discipline — forward-only layers, compaction hierarchies, progressive disclosure. And extensibility via sub-agents or MCP rather than monolithic capability expansion.
So a remix that draws on those convergent principles would take LangGraph's state management and durability, Claude Code's minimal tool philosophy, Manus's filesystem-as-memory pattern, and LangChain's middleware hooks for loop detection and pre-completion verification.
And the Tsinghua paper gives you the ablation framework for evaluating whether each component is actually helping. You don't have to guess. You add the module, run your eval suite, measure the delta. If self-evolution adds four point eight percent and multi-candidate search subtracts two point four, you add self-evolution and skip the search expansion.
Pappas's "build for deletion" principle is the one that I think should be the north star for anyone building harness components right now.
Every piece of harness logic should have an expiration date. If the next model can handle something without your scaffolding, delete the scaffolding. Manus rebuilt four times as models improved. LangChain's evolution from heavily abstracted chains to simpler LangGraph is another instance. The question for builders is: are you building infrastructure that gets simpler as models improve, or more complex? If your harness keeps getting more complicated as models improve, you're swimming against the current.
Which brings up the Bitter Lesson tension. Rich Sutton's argument from 2019 is that general methods leveraging computation always beat methods encoding human knowledge — eventually and by a large margin. A strict reading of that would predict that harness engineering itself gets obsoleted by sufficiently capable models.
Pappas has a good response to this. Multi-step execution tasks have irreducible coordination requirements — context management, state persistence, error recovery — that are not reasoning problems for the model to solve but infrastructure problems for the system to handle. The model can't fix context rot from inside the context window. It can't persist state across sessions by thinking harder. Those are environmental problems that require environmental solutions.
And Sajal Sharma's version of this is that the workflow logic doesn't disappear when you move to a harness. It moves into skills, system prompts, and tool parsing. The structure is still there — it's just expressed as instructions to the agent, and the agent decides how to apply them.
Which is a more sophisticated version of the Bitter Lesson than the naive "harnesses will become unnecessary" reading. The lesson isn't that scaffolding disappears. It's that the scaffolding should move toward general mechanisms and away from domain-specific hand-crafting. Fewer specialized tools, more capable general tools. Less explicit routing logic, more expressive system prompts. The harness gets simpler in structure but more powerful in what it enables.
Let's do practical takeaways, because I think there are some concrete things people can actually do with this.
The first one is run the Vercel experiment on your own agent. Whatever specialized tools you've built, strip them down to bash and file access and run your eval suite. If performance improves, your specialized tools were net-negative. This is testable and cheap to try.
Second: treat your context window as a managed resource, not a landfill. The "Lost in the Middle" research is real — models attend to the beginning and end of context but poorly to the middle. If your important instructions are buried under intermediate results, the model is effectively ignoring them. The TODO mechanism, context compaction, progressive disclosure — these aren't nice-to-haves. They're core to reliable performance at scale.
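One concrete version of that discipline: re-append the objective at the tail of the context on every turn, where models attend most reliably. A sketch of the pattern, assuming a simple message-list context; no framework's actual API is implied.

```python
def build_context(system_prompt, history, todo_list, max_messages=40):
    """Assemble a context window with the TODO list re-injected at the
    end, where attention is strongest ('Lost in the Middle'). Crude
    compaction keeps only the most recent messages. Sketch only."""
    recent = history[-max_messages:]  # drop the oldest intermediate results
    reminder = "Current TODO list:\n" + "\n".join(f"- {t}" for t in todo_list)
    return [{"role": "system", "content": system_prompt},
            *recent,
            {"role": "system", "content": reminder}]
```

The objective never sinks into the middle of the window, no matter how much intermediate output piles up above it.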
Third: build your harness with ablation in mind. Every component should be independently removable so you can measure its contribution. If you can't ablate it, you can't evaluate it. And if you can't evaluate it, you're just accumulating complexity without knowing whether it's helping.
Fourth — and this is the one I'd emphasize for teams with legacy codebases — assess your harnessability before you assess your framework. The harness is hardest to build where it's most needed. Strongly typed languages, clear module boundaries, good test coverage — these aren't just software quality indicators, they're prerequisites for effective agentic integration.
And fifth, probably the most important: design your harness to get simpler as models improve. Every piece of scaffolding you add should have a clear answer to the question "what model capability would make this unnecessary?" If you can't answer that, you might be building complexity that compounds rather than simplifies.
The open question I keep coming back to is who gets to decide what the right opinions are to bake into a harness. Rails made strong choices about MVC and it worked for a generation of web development. But Rails also made some choices that took years to undo. An opinionated agentic harness encodes assumptions about how agents should work that might be deeply wrong for certain problem domains — and the cost of unwinding that might be higher than building from primitives in the first place.
LangChain actually published their open research questions on this. Orchestrating hundreds of agents working in parallel on a shared codebase. Agents that analyze their own traces to identify and fix harness-level failure modes. Harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being pre-configured. That last one is the most interesting to me — a harness that is itself agentic about what harness it needs.
A harness that builds itself. Which is either the most elegant solution or the most terrifying, depending on your disposition.
Probably both.
That's a good place to land. Thanks as always to our producer Hilbert Flumingtop for keeping everything running. Big thanks to Modal for providing the GPU credits that power this show — if you're building anything with serverless GPU workloads, check them out. This has been My Weird Prompts. If you're enjoying the show, a quick review on your podcast app goes a long way in helping us reach new listeners. We'll see you next time.