#1464: Claude Code: Engineering with the Agentic Harness

Explore how agentic harnesses transform AI from a passive chatbot into an active developer capable of full-cycle software engineering.

Episode Details
Published
Duration
16:47
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The landscape of software engineering has shifted fundamentally. As of early 2026, four percent of all public GitHub commits are authored entirely by artificial intelligence agents. This isn't just a rise in autocomplete suggestions; it represents a move toward fully autonomous systems that write, test, and commit code. At the center of this shift is the concept of the "agentic harness," a specialized wrapper that transforms a Large Language Model (LLM) from a passive observer into an active participant in the software development lifecycle.

The Architecture of the Harness

A standalone LLM is essentially a stateless prediction engine—a "brain in a jar" with no direct access to a file system or terminal. The agentic harness provides the "biological equivalents" necessary for work: a hard drive, a terminal, and the ability to manage state. By using a harness, the model can run bash commands, interpret error outputs, and engage in recursive reasoning loops without human intervention.

The core of this system is the agentic loop, which consists of three phases: context gathering, execution, and verification. During context gathering, the agent traverses the directory and reads project-specific rulebooks like CLAUDE.md. In the execution phase, it uses tool-use capabilities to modify files. Finally, in the verification phase, it runs test suites and linters. If a test fails, the agent treats that failure as new input and restarts the loop until the problem is solved.
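The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Claude Code's actual implementation: `run_model` is a hypothetical placeholder for the LLM call, and context gathering is simplified to reading a single rulebook file plus a file listing.

```python
# Minimal sketch of the gather -> execute -> verify agentic loop.
# run_model(context, task, feedback) is a hypothetical stand-in for the
# model call that edits files on disk.
import subprocess
from pathlib import Path

def gather_context(root: str) -> str:
    """Phase 1: read the project rulebook (e.g. CLAUDE.md) and list files."""
    parts = []
    rulebook = Path(root) / "CLAUDE.md"
    if rulebook.exists():
        parts.append(rulebook.read_text())
    parts.extend(str(p) for p in Path(root).rglob("*.py"))
    return "\n".join(parts)

def verify(test_cmd: list[str]) -> tuple[bool, str]:
    """Phase 3: run the test suite; its output becomes the next input."""
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def agentic_loop(root: str, task: str, run_model, test_cmd, max_iters: int = 6):
    feedback = ""
    for _ in range(max_iters):
        context = gather_context(root)      # phase 1: context gathering
        run_model(context, task, feedback)  # phase 2: execution (edits files)
        ok, output = verify(test_cmd)       # phase 3: verification
        if ok:
            return True
        feedback = output                   # a failure re-enters the loop
    return False
```

The key design point is that a failing test does not terminate the process; it is fed back in as `feedback` and the loop restarts, up to an iteration cap.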

Reasoning and Integration

Modern agents like Claude Opus 4.6 utilize a "thinking budget," allowing for extended reasoning tokens. This architecture enables the agent to plan complex architectural changes and simulate outcomes before touching the disk. This internal chain of thought is visible to the developer, building trust through transparency.

Integration is handled via the Model Context Protocol (MCP), a standardized adapter that allows the agent to connect to external tools like Jira, Slack, or SQL databases. While MCP bridges the gap between code and business logic, it introduces new security challenges. Recent reports indicate thousands of internet-exposed MCP servers lack proper authorization, creating significant vulnerabilities in zero-trust environments.
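MCP messages are framed as JSON-RPC 2.0. The sketch below builds a `tools/call` request (a method name from the MCP specification) and handles a response; the tool name and payloads are hypothetical:

```python
# Minimal JSON-RPC 2.0 framing for an MCP tool call. The "tools/call"
# method comes from the MCP spec; the tool name and arguments here are
# made up for illustration.
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests need unique ids to match responses

def mcp_tool_call(tool: str, arguments: dict) -> str:
    """Frame an MCP 'tools/call' request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

def parse_response(raw: str):
    """Return the result payload, or raise on a JSON-RPC error object."""
    msg = json.loads(raw)
    if "error" in msg:
        raise RuntimeError(f"MCP error {msg['error']['code']}: {msg['error']['message']}")
    return msg["result"]
```

Because every integration speaks this one wire format, the harness needs a single client rather than a bespoke API binding per tool, which is exactly why an unauthenticated MCP server is such a broad attack surface.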

The Shift to Asynchronous Workflows

The relationship between developer and AI is moving from synchronous chatting to asynchronous partnership. New features like "Channels" allow developers to assign a multi-hour refactoring task to an agent in the terminal, close their laptop, and receive a notification on a mobile device once the job is complete.

Furthermore, the rise of "Agent Teams" allows a single session to spawn sub-agents. These sub-agents work in parallel on different parts of a codebase—such as backend logic and frontend components—while a parent "architect" agent ensures consistency across the project.
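In outline, this fan-out/fan-in pattern looks like the following sketch, with plain functions standing in for model-backed sub-agents:

```python
# Sketch of a parent "architect" fanning work out to parallel sub-agents
# and collecting every result before integrating. The worker is a plain
# function here; in a real harness it would be a model-backed agent.
from concurrent.futures import ThreadPoolExecutor

def run_team(tasks: dict, worker) -> dict:
    """Run one sub-agent per task in parallel; the parent gathers all
    results before integrating, so no partial change lands on its own."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(worker, name, spec)
                   for name, spec in tasks.items()}
        return {name: f.result() for name, f in futures.items()}
```

The parent's job in this pattern is consistency: it sees every sub-agent's output before anything is merged, which is what prevents the backend and frontend changes from drifting apart.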

Conclusion

The transition to agentic workflows is yielding massive productivity gains, with some organizations reporting that AI now handles 60% of daily engineering tasks. For developers, success in this new era depends on curating persistent project "brains" through markdown-based memory files, ensuring the agent retains the specific intuitions and standards of the local codebase.
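Such a memory file is ordinary markdown read at the start of each session. A hypothetical example of the kind of standards worth recording:

```markdown
# CLAUDE.md — project rulebook (hypothetical example)

- Prefer pure functions; avoid classes with mutable state.
- Never add new runtime dependencies without asking first.
- Run the full test suite after every change; do not commit on red.
- API handlers live in `src/api/`; shared types in `src/types/`.
```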


Episode #1464: Claude Code: Engineering with the Agentic Harness

Daniel's Prompt
Daniel
Custom topic: Claude Code is often referred to as an agentic harness. Let's walk through what all those moving parts around inference are that make a CLI like Claude Code feel so magical and earn it that title.
Corn
You know, Herman, I was looking at some telemetry data from earlier this month, and there is a statistic that absolutely stopped me in my tracks. As of March twenty-third, two thousand twenty-six, four percent of all public GitHub commits are now authored by an artificial intelligence agent. Not just suggested by one, or partially written, or used as a glorified autocomplete, but fully authored, tested, and committed by an agentic system.
Herman
It is a staggering number when you consider where we were just eighteen months ago. I am Herman Poppleberry, and that four percent represents a forty-thousand-fold increase since the early research previews we saw back in twenty-twenty-five. We are not just talking about snippets anymore; we are talking about entire features and complex bug fixes. Today’s prompt from Daniel is about Claude Code specifically, and he wants us to dig into why people are calling it an agentic harness.
Corn
It is a great phrase, agentic harness. It implies that the model itself, the raw intelligence of something like Claude Opus four point six, is this powerful, wild force that needs to be strapped into a specific piece of machinery before it can actually do productive work in a production environment. Daniel wants to know what those moving parts are that make a command line interface feel less like a chat window and more like a senior developer who just happens to be a supercomputer.
Herman
The distinction is vital for anyone working in software today. Most people are used to a passive chatbot experience where you type a prompt and get a block of text back. You then have to copy that text, paste it into your I D E, run the compiler, and hope for the best. But a harness transforms that large language model from a passive observer into an active participant in the software development lifecycle. It is the difference between asking someone how to fix a leaky pipe over the phone and handing a master plumber a wrench and letting them into the basement.
Corn
So let us start with the harness itself. Why that word? Why not just call it an agent? Is it just marketing speak from Anthropic, or is there a structural reason for the terminology?
Herman
There is a deep technical reason. Because the model, even something as flagship as Claude Opus four point six, which launched back on February fifth, is still fundamentally a stateless prediction engine. It does not have a hard drive. It does not have a terminal. It does not have eyes to see your file structure or a way to feel the "heat" of a failing test. The harness is the specialized wrapper that provides those biological equivalents. It manages the state, it handles the recursive reasoning loops, and it provides the execution environment. Without the harness, the model is just a very smart brain in a jar. With the harness, it is an entity that can run a bash command, read the error output, and then decide to try a different approach without you ever saying a word.
Corn
That brings us to what you call the agentic loop. I think this is where the magic happens for most developers who are trying Claude Code for the first time. It is that three-phase cycle of context gathering, action execution, and result verification. Can you walk us through how that actually functions under the hood?
Herman
That is the heartbeat of the system. In the context gathering phase, the harness is not just looking at the file you have open. It is traversing the entire directory. It is looking for a file called C L A U D E dot M D, which acts as the project-specific rulebook. It is reading M E M O R Y dot M D to see what it learned during the last session. It is building a mental map of the codebase before it even suggests a change. It is essentially doing the "onboarding" process that a human developer does, but it does it in milliseconds.
Corn
And then it moves to execution. But it is not just writing code to a window for me to copy and paste. It is actually hitting the file system, right?
Herman
It is using the tool-use capabilities of the model to trigger specific bash commands. If it needs to refactor a function, it does not just write the function. It uses a tool to find the line numbers, another tool to replace the text, and then—and this is the crucial part of the loop—it moves to verification. It runs your test suite. It checks the linter. If the tests fail, the harness sees that failure as a new input. It does not stop and ask you what to do. It looks at the test failure, reasons about why it happened, and starts the loop over again. It might go through five or six of these loops before it ever presents a solution to you.
Corn
I find the visibility of that reasoning process to be one of the most interesting technical shifts recently. With the release of Opus four point six, we saw this move toward a hybrid reasoning architecture. They gave us this thing called a thinking budget. I have seen people setting these budgets to massive levels.
Herman
The thinking budget is a game changer for developer trust. You can allocate up to one hundred twenty-eight thousand tokens for what is called extended thinking mode. When the harness is working on a complex problem, it actually shows you that internal chain of thought. You can see the agent planning its route, weighing the pros and cons of different architectural patterns, and essentially talking to itself before it touches your code. It is using those tokens to simulate the outcome of its actions before it commits them to disk.
Corn
It feels like watching someone think in real-time. And because that thinking is happening in a high-density token space, it can handle much more complex logic than the old-school prompting methods we were using in twenty-twenty-four. But what happens when that logic needs to go outside the codebase? Daniel’s prompt mentions the Model Context Protocol two point zero, or M C P. How does that fit into the harness?
Herman
Think of M C P as the universal adapter. It is standardized on J S O N R P C two point zero. Before M C P, if you wanted an AI to talk to Jira or Slack or a Postgre S Q L database, you had to write custom A P I integrations for every single tool. Now, the harness just connects to an M C P server. There are over two thousand open-source M C P servers available on N P M right now. This allows Claude Code to pull in context from your bug tracker, look up a schema in your database, and then check a Slack thread to see what the product manager actually wanted, all within that same agentic loop. It is bridging the gap between the code and the business logic.
Corn
It is basically giving the agent a set of hands that can reach into any part of the enterprise stack. But there is a massive security implication there that I think we need to touch on. There was a report from S C Media just a few weeks ago, right around mid-March, that found nearly seven thousand internet-exposed M C P servers running without authorization controls.
Herman
It is a perimeter-sized hole in zero-trust architectures. If you have an M C P server that allows a model to read your database, and that server is exposed to the web without proper authentication, you are essentially leaving the keys to the kingdom under the doormat. Developers are so excited about the productivity gains that they are sometimes bypassing the standard security guardrails. The harness is powerful, but it is also a massive attack vector if it is not configured correctly. We are seeing a lot of C I S O s scrambling right now to get a handle on how many of these servers are running in their environments.
Corn
Speaking of productivity, the numbers Anthropic put out are wild. They claim their own internal engineers are using Claude for sixty percent of their daily tasks, resulting in a fifty percent productivity gain. And we are seeing that reflected in the benchmarks too. On S W E bench Verified, Opus four point six is hitting over eighty percent. But the one I really care about is S W E bench Pro, the one where they try to eliminate data contamination.
Herman
The Pro benchmark, which is managed by Scale AI and their S E A L leaderboard, is much more telling. Opus four point six is scoring between forty-six and fifty-seven percent there. To give some context, the baseline in early twenty-twenty-five was around fifteen percent. We have tripled the autonomous problem-solving capability in about a year. We are moving away from benchmarks where the model might have seen the answer in its training data and toward live environments where the model has to truly reason its way through a novel bug. It is the difference between memorizing a map and actually knowing how to navigate a forest.
Corn
I want to talk about the shift we saw just a few days ago, on March twentieth, with the launch of Claude Code Channels. This feels like the death of what people were calling the hardware tax. Do you remember last year when everyone was buying dedicated Mac Minis just to run autonomous agents twenty-four seven?
Herman
I remember it well. People were running things like OpenClaw on local hardware because they wanted their agents to work while they slept, but they did not want to leave their primary laptops running and overheating. Anthropic basically solved that by moving the execution environment into these asynchronous channels on Telegram and Discord. Now, you can start a massive, multi-hour refactor on your terminal at work, close your laptop, and go home. The agentic harness continues to run in the cloud, and when it finishes the job or hits a blocker it cannot solve, it pings you on Telegram.
Corn
It changes the relationship from a synchronous chat to an asynchronous partnership. It is more like managing a person than using a tool. And with the release of version two point one point seventy-six on March eighteenth, they added Agent Teams. This allows a single Claude Code session to spawn sub-agents to work on different parts of a problem in parallel.
Herman
The technical implementation of Agent Teams is fascinating because they are all operating under a single parent context window of one million tokens. So you have one sub-agent working on the backend A P I logic, another sub-agent refactoring the frontend components to match, and they are both feeding their progress back into the parent context. The parent acts as the architect, ensuring that the two sub-agents are not drifting apart or creating breaking changes for each other. This is how they are hitting those high S W E bench scores—by breaking down massive problems into smaller, manageable tasks that are coordinated by a central "brain."
Corn
It is a massive amount of orchestration. When you look at the competition, like Devin from Cognition AI, they take a very autonomous, almost hands-off approach where the agent lives in its own custom sandbox. Claude Code feels more like it is designed to stay close to the developer’s existing workflow in the terminal. It is not trying to replace the I D E; it is trying to become the engine that powers it.
Herman
And that is why the state persistence files are so important. I mentioned C L A U D E dot M D and M E M O R Y dot M D earlier. Those are not just documentation. They are the way the harness solves the stateless brain problem. If you tell the agent once that you prefer a specific functional programming pattern or that it should never use a certain library, it writes that down in C L A U D E dot M D. The next time you start a session, even if it is a week later, the harness reads that file and immediately regains that project-specific intuition. It does not have to be retrained or re-prompted from scratch.
Corn
It is building a persistent project brain. I think that is a great takeaway for anyone using these tools. If you are not actively curating your project-specific markdown files for your agent, you are wasting half the power of the harness. You need to treat those files like you are onboarding a new hire. You have to be explicit about your standards.
Herman
You really do. And you also need to be auditing your M C P server exposure. If you are running an open-source server from N P M, you need to make sure you are not inadvertently exposing your local file system or internal databases to the wider web. The productivity is intoxicating, but the security debt can accumulate very quickly. We are seeing a shift from "prompt engineering"—which was all about how you phrased a question—to "harness orchestration," which is about how you manage the environment the agent lives in.
Corn
So, what does this mean for the role of the developer? If we have these agentic harnesses doing the heavy lifting, running the tests, and even notifying us on our phones when the job is done, are we still programmers in the traditional sense?
Herman
We are becoming architects and code reviewers. Our job is shifting from the manual labor of syntax and debugging toward the high-level labor of system design and verification. You still need to know how the code works because you are the one signing off on the commit. You are the one who has to understand the architectural implications of the changes the agent is suggesting. But the days of spending four hours hunting for a missing semicolon or a misconfigured environment variable are effectively over for anyone using a modern harness.
Corn
It is a bit of a transition. I think some people will find it difficult to let go of that granular control. But when you see a fifty percent productivity boost, it is hard to argue with the results. You can build twice as much in the same amount of time, or, more likely, you can build things that were previously too complex for a single developer or a small team to manage.
Herman
That is the real promise here. It is not just about doing the same things faster; it is about expanding the scope of what is possible. We are seeing small teams of two or three people building platforms that would have required twenty engineers just three years ago. The harness is the force multiplier that makes that possible. It is the realization of the "one-person unicorn" dream that people were talking about at the start of the decade.
Corn
Well, I think we have covered the moving parts. From the agentic loop and the thinking budget to M C P two point zero and the new asynchronous channels, it is clear that the term harness is exactly the right way to describe this. It is the infrastructure that makes the intelligence useful. It is the difference between a brilliant idea and a finished product.
Herman
I am curious to see how the security landscape evolves as more enterprises adopt this. That perimeter-sized hole we talked about is going to be a major focus for C I S O s throughout the rest of twenty-twenty-six. We might see a new category of security tools just for auditing M C P traffic.
Corn
No doubt about that. We should probably wrap it up there. If you want to dive deeper into the technical evolution of these models, we actually did a deep dive on transformer architecture and the shift toward cognitive reliability back in episode one thousand eighty. It provides some great foundational context for how we got to this point with the Opus model family.
Herman
And if you are interested in why agents sometimes fail despite all this infrastructure, check out episode one thousand seventy-eight, "The Agentic Throughput Gap." It explains why your AI hits a wall and how things like Agent Teams are designed to break through it.
Corn
Thanks as always to our producer Hilbert Flumingtop for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the G P U credits that power the generation of this show.
Herman
This has been My Weird Prompts. If you are enjoying the show, a quick review on your podcast app really helps us reach more people who are interested in this kind of deep-dive technical discussion.
Corn
You can also find our full archive and all the ways to subscribe at myweirdprompts dot com. We will see you next time.
Herman
Take care.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.