Imagine you are in a high-stakes hospital environment during a shift change. One nurse is finishing a grueling twelve-hour rotation, and another is just walking in, fresh but completely blind to the last half day of chaos. The outgoing nurse has all the vital context: which patient is allergic to penicillin, who just had a dangerous spike in blood pressure ten minutes ago, and who is anxiously waiting for a family phone call that was promised an hour ago. If that nurse just leaves without a word, or if they just hand over a disorganized pile of scribbled, coffee-stained sticky notes, the results could be catastrophic. In our world, the patient is the task, and the nurses are the agents.
That is such a visceral way to frame it, Corn. And honestly, it is exactly what we have been dealing with in the world of autonomous AI for the last couple of years. We have all had that frustrating experience where an agent starts a complex task, gets halfway through, and then for whatever reason—maybe the context window gets messy, or we need to switch to a more specialized model for the heavy lifting—suddenly the ball gets dropped. It is like the second nurse walks in and has no idea the patient even exists, or worse, they think the patient is there for a broken leg when they are actually there for heart surgery.
It is that "train of thought" problem. And our housemate Daniel sent us a prompt this week that really gets into the technical weeds of how we are finally, in early twenty-six, solving this. He has been using what he calls a "hacky" method for a while now, basically forcing his agents to maintain a manual JSON log of their work to act as a shift handoff. He wanted us to look at how the industry is finally catching up to these home-rolled workarounds and building actual infrastructure for what used to be duct tape and prayer.
Herman Poppleberry here, and I have to say, Daniel is definitely not alone. For a long time, manual JSON logging was the absolute duct tape of the agentic world. If you wanted to move a task from a planning agent to a coding agent, you basically had to tell the first one, "Hey, write down everything you did in a very specific format so the next guy can read it." It was brittle, it was incredibly expensive in terms of tokens because you were essentially double-billing for the same information, and it was prone to what I call "silent context rot."
Silent context rot. I like that. It is that feeling when the agent thinks it knows what is happening, but the data structure has drifted just enough—maybe a key name changed or a value was misinterpreted—that it starts hallucinating the state of the project. But we are in March of two thousand twenty-six now, and the landscape has shifted massively. We are moving from these ad-hoc hacks to actual standardized primitives.
We really are. We have moved from the era of simple "chat," which we talked about way back in episode seven hundred ninety-five when we looked at sub-agent delegation, to an era of true orchestration. Back then, we were just happy if one agent could call another. Now, we are obsessing over the "how" of that transition. Today, we are digging into the evolution of agentic handoffs. How do we bridge the gap between autonomous agents without losing the thread of intent?
So, let us start with the basics of the handoff itself. When we say "handoff" in this modern context, are we just talking about a really long prompt that summarizes what happened, or are we talking about something more structural? Because if it is just a summary, we are still relying on the model's ability to interpret prose, which we know can be hit or miss depending on how much caffeine the model seems to have had that day.
That is the core of the evolution, Corn. In the early days, a handoff was basically just a prompt. You would say, "You are a coding agent, here is a summary of what the planning agent did, now go." But now, with frameworks like LangGraph one point zero point eight, which just dropped earlier this year, we are seeing the rise of "typed state channels." This is a massive departure from the "string-and-a-prayer" method.
Typed state channels. Okay, break that down for us. How does that differ from Daniel's manual JSON log? Because on the surface, they both sound like structured data.
It comes down to validation and enforcement. When Daniel writes a JSON log, he is basically creating a string that the next agent has to parse. If the agent forgets a comma, or if it decides to change a key name from "task_status" to "status_of_task" because it felt like being more descriptive that day, the whole thing breaks. The receiving agent gets a syntax error or, worse, ignores the field entirely. With typed state channels in LangGraph one point zero point eight, we are using things like Pydantic for runtime schema validation.
So it is less like a handwritten note and more like a standardized digital form that the system literally will not let you submit unless every box is checked correctly and the data types match.
The orchestrator defines exactly what the state looks like: these are the required fields, these are the data types—like integers, strings, or specific enums—and these are the allowed values. If Agent A tries to hand off a state that does not conform to that Pydantic schema, the system throws an error before Agent B even wakes up. This brings us to a massive improvement in reliability. It prevents that downstream failure where an agent spends five minutes and ten dollars worth of tokens trying to work with corrupted or malformed data.
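To make that concrete for the show notes: here is a minimal sketch of the kind of typed handoff state Herman is describing, written with Pydantic. The field names and the enum values are illustrative, not LangGraph's actual channel API — the point is only that a drifted key name or a bad value fails loudly before the next agent ever wakes up.

```python
from enum import Enum
from pydantic import BaseModel, ValidationError


class TaskStatus(str, Enum):
    PLANNING = "planning"
    CODING = "coding"
    DONE = "done"


class HandoffState(BaseModel):
    """Illustrative handoff schema; any missing field, wrong type,
    or unknown status value is rejected at validation time."""
    task_id: str
    task_status: TaskStatus
    files_touched: list[str]
    open_questions: list[str] = []


# A well-formed handoff validates cleanly...
state = HandoffState(
    task_id="T-42",
    task_status="coding",
    files_touched=["api/routes.py"],
)

# ...while a drifted key name ("status_of_task") leaves the required
# field missing and fails before Agent B ever sees the state.
try:
    HandoffState(task_id="T-42", status_of_task="coding", files_touched=[])
except ValidationError as e:
    print("rejected with", len(e.errors()), "validation error(s)")
```

This is exactly the "standardized digital form" idea from the nurse metaphor: the system will not let you submit the handoff unless every box checks out.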
That makes sense for the structure, but what about the actual content? One of the big questions Daniel had was about "context flooding." If you pass everything from the first agent to the second—every thought, every tool call, every error message—you end up with this massive, bloated blob of text. We know from research that models can get "lost in the middle" when the context window gets too crowded. How are the new standards handling that noise?
This is where the OpenAI Agents SDK is doing some really sophisticated things. They have introduced these primitives called "input filters" and "handoff history mappers." This is a game changer for token efficiency. Instead of just dumping the entire conversation history into the next agent, you can define a logic that prunes the history. You can say, "Only pass the last three turns of dialogue, but keep all of the tool outputs and the final state of the file tree."
So it is like a curated highlight reel rather than the raw footage of the entire shift.
Precisely. And what is even cooler is that you can have a separate, smaller, faster model act as the handoff history mapper. Its entire job is to look at the massive, messy log of the first agent and condense it into the most salient points for the next one. It improves the signal-to-noise ratio significantly. If you are moving from a "Researcher Agent" that looked at fifty different websites to a "Writer Agent," the Writer does not need to see the raw HTML of those fifty sites. It just needs the synthesized facts and the citations.
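The filtering idea Herman describes can be approximated in a few lines of plain Python. To be clear, the function names and message shape here are illustrative, not the OpenAI Agents SDK's actual surface — this is just the "keep the tool outputs, prune the dialogue" logic from the example above.

```python
from typing import Callable

# For this sketch, a message is just a dict; real SDKs use richer types.
Message = dict


def keep_recent_dialogue_and_tools(
    history: list[Message], last_n_turns: int = 3
) -> list[Message]:
    """Illustrative input filter: keep every tool output, but only the
    last N turns of ordinary dialogue, per Herman's example."""
    tools = [m for m in history if m.get("type") == "tool_output"]
    dialogue = [m for m in history if m.get("type") != "tool_output"]
    return tools + dialogue[-last_n_turns:]


def hand_off(
    history: list[Message],
    input_filter: Callable[[list[Message]], list[Message]],
) -> list[Message]:
    """The receiving agent only ever sees the filtered history."""
    return input_filter(history)
```

A "handoff history mapper" in this sketch would just be another function with the same signature, except that instead of slicing lists it calls a small, cheap model to summarize the log.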
I imagine that also helps with the cost. If you are passing thirty thousand tokens of context every time you switch agents, and you do that ten times in a workflow, your API bill is going to look like a mortgage payment.
Oh, absolutely. Efficient handoffs are the only way to make enterprise-scale agents actually viable. We talked about this in episode one thousand ninety-eight, "The Agentic Symphony," where we looked at how companies are trying to orchestrate thousands of these things. If you do not have a way to filter that context, the system collapses under its own weight. You end up paying for the model to "re-read" things it already knows, which is the ultimate inefficiency.
I want to touch on the durability aspect too. You mentioned Temporal dot io in the prep notes. For people who are not distributed systems nerds, why is a persistence layer so important for an agentic handoff? Why can't we just keep it all in memory?
Think of it as the "black box" flight recorder. In a standard script, if the power goes out, or the API times out, or the model hits a rate limit, the state is gone. You have to start over from zero. But these agentic tasks in two thousand twenty-six are getting long and complex. They might run for an hour, doing deep research, writing code, running unit tests, and debugging. If a handoff happens and then the system crashes, you do not want to lose all that expensive work.
Right, you do not want the second nurse to have to re-read the entire medical history from birth because the hospital's computer rebooted.
Temporal provides a way to make these handoffs "durable." It saves the state of the handoff at every single step in a persistent database. So if the model crashes or the network blips, the system can wake up, look at the last successful state in the handoff log, and resume exactly where it left off. It turns these fragile, ephemeral AI conversations into robust, industrial-grade distributed systems. It is the difference between a toy and a tool.
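The durability pattern Herman is describing is worth seeing in miniature. This is a generic checkpoint-and-resume loop in plain Python — not Temporal's actual API, which handles all of this for you with far more rigor — but it shows the core idea: persist the handoff state after every step, so a crash resumes from the last successful step instead of from zero.

```python
import json
from pathlib import Path

CHECKPOINT = Path("handoff_state.json")


def save_checkpoint(state: dict) -> None:
    # Persist the state after every completed step (the "black box").
    CHECKPOINT.write_text(json.dumps(state))


def load_checkpoint() -> dict:
    # On restart, resume from the last durable state instead of step zero.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"step": 0, "results": []}


def run_workflow(steps) -> dict:
    """Run a list of step functions, checkpointing after each one.
    If the process dies mid-run, calling this again picks up where
    the last checkpoint left off."""
    state = load_checkpoint()
    for i in range(state["step"], len(steps)):
        state["results"].append(steps[i](state))
        state["step"] = i + 1
        save_checkpoint(state)  # a crash after this line loses nothing
    return state
```

Temporal's contribution is making this pattern industrial-grade: the persistence, the retries, and the replay semantics all come from the platform rather than a hand-rolled JSON file.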
That is a huge shift in mindset. We are moving from thinking of AI as a chatbot to thinking of it as a long-running process. And processes need state management. But that brings up the inevitable question of standards. Daniel mentioned things like MCP and the Google A2A protocol. It feels like we are in the middle of a standards war right now. Who is actually winning the battle for the agentic protocol?
It is less of a war and more of a "grand convergence," though it definitely felt like a bloody war last year. The Model Context Protocol, or MCP, has really become the heavyweight here. As of the first quarter of this year, it is officially governed by the Agentic AI Foundation, the AAIF. It is not just an Anthropic thing anymore. Everyone from Microsoft to local open-source developers is adopting it because it provides a standard way for agents to access tools and resources regardless of which model is running.
So if I have a tool that searches my local database, MCP allows a Claude agent to use it, then hand off the task to a Gemini agent, who can use that same tool without me having to rewrite the integration?
That is the dream, and we are finally seeing it happen in production. But Google is pushing their A2A protocol, which stands for "Agent-to-Agent." Their approach is slightly different and very interesting. They use something called "Agent Cards." Think of it like a business card or a LinkedIn profile for an agent. It tells other agents what this specific agent is good at, what its limitations are, what its cost-per-token is, and what kind of data schema it expects to receive.
I like that. It is like a discovery layer. If a generalist agent realizes the task requires deep expertise in, say, maritime tax law, it can search for an agent with a "Tax Law" Agent Card and initiate a formal handoff.
And that is where the handoff becomes more than just data. It becomes a "negotiation." The first agent says, "I have this task, here is the state." The second agent looks at its Agent Card and says, "Okay, I can take that, but I need you to format the financial data in this specific schema first, and I need you to prune the conversation history to the last five turns." It is much more dynamic and intelligent than the old way of just dumping a JSON log and hoping for the best.
We have to talk about AGENTS dot md too. I have been seeing this pop up in more and more GitHub repositories lately. It is such a simple idea—just a markdown file—but it feels like it solves a very specific problem that the high-level protocols might miss.
It really does. AGENTS dot md is what we call "passive context." Most of what we have been talking about so far is "active context"—the stuff that changes during a specific conversation or task. But every project has "static context": the coding standards, the architectural goals, the list of forbidden libraries, the preferred tone for documentation.
Right, the stuff that stays the same no matter which shift is working. The hospital's general safety protocols, not the specific patient's heart rate.
Instead of wasting thousands of tokens in every single handoff repeating the same "don't use this library" and "always use four spaces for indentation" instructions, you put them in an AGENTS dot md file at the root of your project. The agents are trained or prompted to look for that file first before they do anything else. It is like the employee handbook. The handoff then only needs to contain the task-specific, dynamic information. It is a massive win for both token efficiency and consistency across a multi-agent team.
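In practice, wiring this up is as simple as it sounds. Here is a sketch of the loading side — the file itself is just markdown at the project root, and this illustrative loader prepends it once so every handoff can stay purely task-specific. The function name is ours, not any framework's.

```python
from pathlib import Path


def build_system_prompt(project_root: str, task_prompt: str) -> str:
    """Prepend the project's static rules (AGENTS.md) once, so each
    handoff only needs to carry the dynamic, task-specific state."""
    rules_file = Path(project_root) / "AGENTS.md"
    rules = rules_file.read_text() if rules_file.exists() else ""
    # Fall back gracefully if the project has no AGENTS.md yet.
    return f"{rules}\n\n---\n\n{task_prompt}" if rules else task_prompt
```

Because the rules ride along in the system prompt rather than in the handoff payload, ten handoffs cost you the static context once each per agent, never once per turn of the conversation.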
That makes a lot of sense. It is about separating the permanent rules from the temporary status. But what happens when you are crossing model families? Daniel specifically asked about the practical challenges of moving from, say, GPT-four to Claude to Gemini. I imagine they do not all interpret the same handoff the same way, even if the JSON is valid.
This is where we run into the "personality drift" problem. Every model has its own quirks, its own way of following instructions, its own level of verbosity, and even its own "worldview" based on its training data. If you have a very concise, logic-heavy GPT-four-o model doing the initial planning, and it hands off to a more narrative-heavy, flowery Claude model, the tone of the project can start to shift in weird ways.
I have seen that. You start with a very technical, bulleted list of requirements, and by the third handoff, the agent is writing these beautiful, flowery paragraphs about the "elegance of the architecture," but it has completely forgotten half of the actual technical specs.
Precisely. And it is not just tone. It is also about how they handle things like uncertainty or ambiguity. Some models are more likely to "hallucinate" a missing piece of context in a handoff rather than stopping and asking for clarification. When you are doing cross-model orchestration, you have to build in what I call a "validation layer."
A validation layer? Is that another agent, or just more code?
It can be both. Often, the best practice now is to have a specialized, highly steered model—often a smaller, cheaper one—that acts as the "referee." When the handoff happens, the referee checks the output of the first model against the original requirements and the state schema. If it sees that the tone is drifting too far, or that key information like a specific variable name is being lost in the summary, it forces the first model to rewrite the handoff before it ever reaches the second model.
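The "just more code" half of that answer looks something like this. It is a deliberately simple sketch — the field names and limits are ours — that catches the mechanical failures (missing keys, bloated summaries) deterministically; the model-based referee Herman describes would sit on top of it to judge tone and factual drift.

```python
def referee_check(
    handoff: dict, required_keys: set[str], max_chars: int = 4000
) -> list[str]:
    """Illustrative validation layer: return a list of problems found
    in a handoff, or an empty list if it is clean. A real referee would
    also ask a small model to check tone and factual drift."""
    problems = [f"missing field: {k}" for k in required_keys - handoff.keys()]
    summary = handoff.get("summary", "")
    if len(summary) > max_chars:
        problems.append(f"summary too long: {len(summary)} > {max_chars}")
    return problems
```

The orchestration rule is then simple: if the list is non-empty, the sending agent is forced to rewrite the handoff before it ever reaches the second model.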
That sounds like a lot of overhead, though. Are we reaching a point where the management of the agents is taking more work and more tokens than the actual task itself?
That is the big critique, isn't it? We actually touched on this in episode one thousand seventy-eight, "The Agentic Throughput Gap." If you have too many layers of validation, summarization, and refereeing, your latency goes through the roof. The key is to use the right tool for the job. You do not need a referee for a simple two-step task. But if you are building an autonomous research system that is going to run for three days and cost five hundred dollars, that overhead is a drop in the bucket compared to the cost of the whole thing failing because of a bad handoff on hour two.
Let us talk about the decision-making process. This was Daniel's fifth question. Who should actually decide what context gets passed on? Is it the sending agent, a human-in-the-loop, or this "summarizer agent" we keep talking about?
This is a bit of a philosophical divide in the industry right now. The old-school approach, which a lot of developers still prefer for safety, is the "human-in-the-loop" model. The agent finishes its part, shows you the handoff log, and you click "approve." It is safe, but it is not truly autonomous. It does not scale.
Right, you cannot have a human-in-the-loop if you are running a thousand parallel agents doing market analysis. You would need a thousand humans, which defeats the purpose.
So the trend is moving toward the "orchestrator-worker" pattern. In this setup, there is a central "orchestrator" agent that never does the actual work—it never writes code, it never searches the web. Its only job is to manage the state and decide which worker gets what specific slice of context. This is what we are seeing in the enterprise frameworks from Microsoft and Google. The orchestrator has a high-level view of the entire project, so it is in the best position to say, "Okay, Coder Agent, you only need the API specs and the current error log; you do not need the three pages of market research the first agent did."
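The need-to-know slicing at the heart of that pattern is almost trivially small in code. This sketch uses Herman's own example — the coder gets the API specs and the error log, never the market research — with field names we made up for illustration.

```python
def slice_context(full_state: dict, needs: list[str]) -> dict:
    """Orchestrator-worker sketch: each worker receives only the keys
    the orchestrator decides it needs, nothing more."""
    return {k: full_state[k] for k in needs if k in full_state}


full_state = {
    "api_specs": "POST /login accepts {user, password} ...",
    "error_log": "AuthError: token expired at step 3",
    "market_research": "three pages of competitor notes",
}

# The coder agent is kept on a need-to-know basis.
coder_context = slice_context(full_state, needs=["api_specs", "error_log"])
```

The hard part, as the next exchange gets into, is not the slicing itself but the orchestrator's judgment about what actually belongs in `needs`.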
So the worker agents are kept on a "need-to-know" basis to keep them focused.
Yes. It prevents that context flooding we talked about. But there is a risk there, too. If the orchestrator is not smart enough, it might withhold a crucial piece of information that it did not realize was relevant. This is why we are seeing a push for more "self-reflective" agents. The worker agent should be able to look at the handoff and say, "Hey, I think I am missing the database schema for the production environment, can you provide that before I start?"
It is like the new nurse being able to call the old nurse at home and say, "Wait, you forgot to tell me the dosage for the patient in room four twelve, and the chart is blank."
That "bidirectional communication" between agents is the next frontier. Right now, most handoffs are one-way. Agent A finishes, hands off to Agent B, and Agent A is terminated. But with the new persistence layers like Temporal, we can actually keep Agent A's state "suspended." If Agent B has a question, it can wake Agent A back up and ask for clarification. It turns the handoff from a baton pass in a relay race into a collaborative team environment, even if the agents are not running at the same time.
That is fascinating. It really changes the architecture of how we build these things. I want to circle back to something you mentioned earlier, the context window sizes. This seems like a really practical headache. If I am handing off from a model with a two hundred thousand token window to one with only thirty-two thousand, I am in trouble.
Oh, it is a huge problem. This is where the industry is really leaning on those input filters I mentioned. If you are moving to a model with a smaller window, the orchestrator has to be aggressive about summarization. It is not just about choosing what is important; it is about "compressing" the information. We are seeing some really cool techniques using semantic embedding to summarize the context. Instead of just shortening the text, the model converts the key concepts into a dense, information-rich format that the receiving model can "unpack."
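A crude but honest version of that down-compression looks like the sketch below: sections of the handoff are ranked by priority and the budget is spent top-down. The priority order and the character-based budget are our simplifications — a real system would summarize with a model and count tokens, not characters, rather than hard-truncating text.

```python
def compress_for_window(sections: dict[str, str], budget_chars: int) -> str:
    """Illustrative compression for a smaller context window: emit
    sections in priority order and stop once the budget is spent.
    (Production systems summarize semantically instead of truncating.)"""
    priority = ["decisions", "open_questions", "key_facts", "raw_notes"]
    out, used = [], 0
    for name in priority:
        text = sections.get(name, "")
        if not text:
            continue
        if used + len(text) > budget_chars:
            text = text[: budget_chars - used]  # hard cut at the budget
        out.append(f"## {name}\n{text}")
        used += len(text)
        if used >= budget_chars:
            break
    return "\n\n".join(out)
```

Note what the priority order encodes: the "why" (decisions, open questions) survives compression before the raw notes do, which is exactly the nuance Corn and Herman warn about losing in the next exchange.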
It is like sending a zip file of ideas.
That is a great way to put it. And because these models are getting better at following complex, dense instructions, they can actually do a lot with a very small, high-density handoff. But you have to be careful. If you compress it too much, you lose the nuance. You lose the "why" behind a decision, which can lead the next agent to make a mistake.
This really feels like we are building a new layer of the internet, doesn't it? Like, in the nineties we were figuring out how to move packets of data with TCP/IP. Now we are figuring out how to move packets of "intent" and "state" between intelligent agents.
That is exactly what it is, Corn. We are defining the protocols for the "agentic web." And just like with the early internet, we started with hacky, manual ways of doing things—like Daniel's JSON logs—and now we are seeing the infrastructure harden. Daniel's manual logs are the equivalent of manually routing packets. It worked for the early adopters, but it was never going to scale to the world we are entering now.
So, for the developers and the curious listeners out there who are currently using those hacky methods, what is the immediate takeaway? Should they drop everything and move to LangGraph or the OpenAI SDK tomorrow?
My advice would be to start migrating toward "typed state channels" as soon as possible. Even if you are still using a manual process, start defining your handoffs with Pydantic schemas. It forces you to be disciplined about what information is actually necessary. It makes your agents more reliable immediately because you can catch formatting errors before they derail the task. Stop building custom JSON parsers and start using the native validation tools.
And what about AGENTS dot md? It seems like such a low-hanging fruit for anyone managing a codebase.
Every project should have an AGENTS dot md file. It is the cheapest, easiest way to improve your agent's performance. It reduces the size of your prompts and ensures that no matter how many handoffs happen, the core rules of the project stay intact. It is basically the "constitution" for your agents. If you don't have one, you're essentially asking every new agent to guess the rules of your house.
I love that. The agentic constitution. And what about the summarizer layer? Is that something a solo developer should be looking at, or is that really more for enterprise-level complexity?
If you are finding that your agents are getting "lost" or that your token costs are spiraling, a summarizer layer is the first thing I would look at. You do not even need a separate model necessarily. You can just add a step at the end of your agent's task where it is instructed to generate a concise "handoff log" based on a specific template. It is about moving away from "raw history" and toward "intentional state transfer."
It really comes down to intentionality, doesn't it? We cannot just assume the model will remember what is important. We have to be explicit about what the next agent needs to know.
The best handoff is the one the user never has to debug. It should be invisible. When you see an agentic system that just works—where it moves from research to planning to execution without a single hitch—it is because someone spent a lot of time thinking about the "plumbing" of those handoffs.
It is interesting to think about where this goes in the next year or two. We are already seeing the AAIF govern MCP. Do you think we will ever get to a truly universal handoff protocol? Like, one standard that every model from every company follows perfectly?
I think we are closer than people realize. By twenty-twenty-seven, I suspect we will have something akin to the "HTTP of agents." It might be an evolution of MCP or something entirely new, but the economic pressure for interoperability is too high to ignore. Companies do not want to be locked into one model provider. They want to be able to use the best model for each sub-task, and they can only do that if the handoffs are standardized.
It is that pro-competitive, open-market drive. If I can swap out a Claude coding agent for a Gemini one without rewriting my entire orchestration layer, that is a win for me as a developer and a win for the ecosystem.
It keeps the model providers on their toes. They have to compete on the quality of their reasoning and the efficiency of their models, not on how much proprietary lock-in they can create with their state management.
Well, this has been a fascinating deep dive. I feel like I understand the plumbing of my own agentic workflows a lot better now. It is easy to get caught up in the "magic" of what the AI can do and forget about the hard engineering required to make it reliable at scale.
That is the Poppleberry promise, Corn. We go deep so you don't have to. But seriously, this stuff matters. As we move into a world where agents are handling more and more of our professional and personal tasks, the reliability of these transitions is going to be the difference between a tool that helps you and a tool that creates more work for you.
Well said. And hey, if you have been finding these deep dives helpful, we would really appreciate it if you could leave us a review on your podcast app or over on Spotify. It genuinely helps other people find the show and keeps us motivated to keep digging into these weird prompts that Daniel and our listeners send in.
It really does. And if you want to check out our archive, we have over eleven hundred episodes now covering everything from battery chemistry to the geopolitics of the Middle East. You can find all of them at myweirdprompts dot com. We have a full RSS feed there for your favorite podcast player, and we also have a Telegram channel if you want to get notified every time a new episode drops. Just search for "My Weird Prompts" on Telegram.
We have covered a lot today, from the nursing metaphor to the technical nuances of Pydantic schemas in LangGraph. I think the big takeaway for me is that the shift from "chat" to "orchestration" is really a shift from "talking" to "building." We are building systems now, not just having conversations.
Precisely. We are moving from the era of the chatbot to the era of the "agentic operating system." And the handoff is the "system call" of that new OS.
I love that. The system call of the agentic OS. Well, I think that is a perfect place to wrap things up for today. Thanks for joining me on this one, Herman. I always learn something new when we get into the technical weeds.
Always a pleasure, Corn. I am looking forward to seeing what Daniel sends us next week. If it is half as interesting as this one, we are in for a treat.
Definitely. To all our listeners, thanks for tuning in to My Weird Prompts. We know you have a lot of choices in the AI podcast space, and we are glad you chose to spend some time with the Poppleberry brothers here in Jerusalem.
Stay curious, keep building, and we will talk to you in the next episode.
Take care, everyone.
Goodbye!
So, Herman, before we fully sign off, I was just thinking about that nurse metaphor again. Do you think we will ever reach a point where the agents are so good at handing off that they actually start to develop a sort of "collective memory"? Like, not just task-specific, but a long-term understanding of how we specifically like to work?
That is the holy grail, isn't it? We are starting to see hints of that with some of the personalized memory frameworks that sit on top of the orchestrators. It is not just about the shift handoff anymore; it is about the entire "career history" of the agentic team. But that is probably a topic for a whole other episode.
Episode eleven hundred and three, perhaps?
We will see what Daniel has to say about it.
Fair enough. Alright, for real this time, thanks for listening to My Weird Prompts. You can find us at myweirdprompts dot com. See you next time.
See ya!