If you are building agentic systems in twenty twenty-six, the difference between a model that genuinely supports tool calling and one that merely claims to can be the difference between a working product and a debugging nightmare that keeps you up until four in the morning. Today’s prompt from Daniel is about exactly that—the engineering reality of tool calling, the Model Context Protocol, and how to actually evaluate whether a model is "agentic-ready" or just wearing a fancy marketing suit.
It is a great time to dig into this, Corn. I am Herman Poppleberry, and I have been obsessed with the shift we have seen just in the last few months. With the release of Grok’s latest agentic-optimized model this past March and the sheer proliferation of non-native models trying to play in the agentic sandbox, developers are facing a bit of a crisis of choice. By the way, today’s episode is powered by Google Gemini three Flash, which is writing our script and, ironically, is one of those models that has had to prove its own tool-calling mettle in a very crowded market.
It is funny how "tool calling" has become this catch-all term. It used to just mean "can this thing write a bit of Python?" but now it is the backbone of everything. Daniel’s asking a really pointed question here: what does "support" actually entail from a development standpoint? Because I see models being released with "agentic" variants and others that don’t advertise it at all. Can you just bully an instructional model into making Model Context Protocol calls, or are you just asking for a headache?
That is the heart of the engineering challenge. To understand it, we have to define what tool calling actually is in the context of agentic AI. It is not just function invocation. It is a three-legged stool: structured output parsing, state management, and retry logic. When a model "supports" tool calling natively, it means it has been fine-tuned on datasets where natural language is paired with specific structured blocks—usually JSON. The model is literally trained to recognize a "trigger" in the conversation where it needs external data and then to output a very specific "stop token."
Right, so it is not just the model being smart; it is a handshake between the model and the API. If I’m using Anthropic, their server is looking for that specific sequence, stopping the generation, running the tool, and then feeding the result back in. If a model doesn’t have that native handshake, you’re basically trying to read the model’s mind while it’s still talking.
Precisely. Well, I should say, that is the core of the reliability gap. In a native model, the JSON schema integrity is part of the objective function during training. In a non-native or "instructional" model, you are relying on the model to follow a system prompt that says, "Hey, please only speak in JSON." But as anyone who has worked with base models knows, they love to ramble. They’ll give you the JSON, but then they’ll add a helpful sentence at the end like, "I hope this helps with your database query!" And that extra sentence is what breaks the parser and crashes your agent.
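[Producer's note: that "extra helpful sentence" failure is easy to demonstrate. A minimal sketch, assuming a hypothetical model response with trailing prose, of why a strict parser crashes and what a tolerant shim has to do:]

```python
import json
import re

def strict_parse(model_output: str) -> dict:
    """Naive parser: assumes the model emitted pure JSON and nothing else."""
    return json.loads(model_output)

def tolerant_parse(model_output: str) -> dict:
    """Pull the first balanced-looking JSON object out of rambling output."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# A typical "helpful assistant" response from a non-native model:
output = ('{"tool": "query_db", "args": {"table": "users"}}\n'
          'I hope this helps with your database query!')

try:
    call = strict_parse(output)    # crashes: the trailing prose is "extra data"
except json.JSONDecodeError:
    call = tolerant_parse(output)  # salvages the structured block
```

The tolerant path works for this example, but it is exactly the kind of brittle salvage code that native tool calling makes unnecessary.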
The classic "helpful assistant" syndrome. It is like asking a robot to hand you a screwdriver and it hands it to you, but then insists on giving you a high-five that knocks the screwdriver out of your hand.
And it’s not just the high-five; it’s the format of the screwdriver itself. A native model understands that if the schema requires an integer for a user_id, it must be an integer. An instructional model might get creative and output "user_id": "unknown" because it’s trying to be conversational. That’s a silent killer in production because your backend code expects a number and suddenly gets a string.
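[Producer's note: one cheap defense against that silent type-mismatch killer is validating arguments before they hit the backend. A minimal sketch, using a hand-rolled type check rather than a full JSON Schema validator:]

```python
def validate_args(args: dict, schema: dict) -> list[str]:
    """Return a list of type mismatches instead of failing silently downstream."""
    errors = []
    type_map = {"integer": int, "string": str, "boolean": bool}
    for name, spec in schema["properties"].items():
        if name in args and not isinstance(args[name], type_map[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}, "
                          f"got {type(args[name]).__name__}")
    return errors

schema = {"properties": {"user_id": {"type": "integer"}}}

# The creative instructional model sent a string where a number belongs:
errors = validate_args({"user_id": "unknown"}, schema)
```

Catching the mismatch at the agent boundary turns a confusing production crash into an immediate, retryable error.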
So, when we talk about being "MCP-ready," what are we actually looking for under the hood? Is it just about the JSON, or is there a deeper architectural requirement?
To be truly "agentic-capable" or "MCP-ready" from an engineering standpoint, you have to look at the layers. Layer one is that native API support we talked about—the parameters like "tools" in the request body. Layer two is the structured output enforcement. But layer three is where the real magic happens: agentic state management. This is the ability to handle multi-turn tool calls. If an agent needs to search for a user, get their ID, use that ID to find an invoice, and then use the invoice date to check a shipping status, that is four distinct turns where the model has to keep the state of the "thought process" clear.
And that is where the Model Context Protocol, or MCP, comes in. Daniel mentioned that Anthropic developed it, and it has really become the de facto standard. I saw the MCP one point two spec update back in January that added streaming support. Why has that particular protocol won out? Is it just because Claude is good at it, or is the architecture itself superior?
It is a bit of both, but the architecture is the real winner. MCP is designed as a stateful protocol. Think of it like a USB-C port for the application layer. Before MCP, every time you wanted to give an LLM a tool, you had to write a custom shim. You had to translate your database schema into a format OpenAI liked, then do it again for Anthropic, and again for Google. MCP standardizes the infrastructure around the model.
So the model doesn't actually "know" it's using MCP?
No, and that is a huge point of confusion. The LLM doesn't need to understand the MCP spec. The MCP Client—which might be your Python app or the Claude Desktop—connects to an MCP Server. The server says, "Here are the tools I have." The client then takes those definitions and translates them into whatever the LLM needs to see. If you're using a native model, the client uses the API's tool parameter. If you're using a non-native model, the client has to literally inject those tool definitions into the system prompt as text.
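[Producer's note: here is a sketch of that client-side translation step. The tool-definition shape below (`name`, `description`, `input_schema`) is modeled loosely on common conventions, not any one vendor's exact API:]

```python
import json

def render_for_native(tools: list[dict]) -> dict:
    """Native path: pass tool schemas straight through a `tools` parameter."""
    return {"tools": tools}

def render_for_prompt(tools: list[dict]) -> str:
    """Non-native path: burn context tokens describing each tool as text."""
    lines = ['You may call the following tools. Reply ONLY with JSON '
             'like {"tool": <name>, "args": {...}}.']
    for t in tools:
        lines.append(f"- {t['name']}: {t['description']} "
                     f"(schema: {json.dumps(t['input_schema'])})")
    return "\n".join(lines)

weather_tool = {
    "name": "get_weather",
    "description": "Current weather for a city",
    "input_schema": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]},
}
```

Same tool, two very different delivery mechanisms: a structured API parameter versus a block of system-prompt text the model has to obey on faith.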
Wait, so if I’m using a non-native model, I’m essentially eating up my context window just to explain the tools?
You hit the nail on the head. In a native model, the tool definitions are often handled by a specialized sub-network or a highly compressed prefix. In a non-native model, you might be spending two thousand tokens just describing your API endpoints in the system prompt before the user even says "hello." And as that prompt gets longer, the model’s "attention" starts to drift. It might forget the third parameter of your fifth tool because it’s too busy focusing on the user’s actual request.
So, to Daniel's question: yes, you could use an instructional model that doesn't advertise tool calling and get it to make MCP calls. But you’re basically building a translator in your own code to turn the model's text into something your tools can understand. It’s like using a universal remote that you have to manually program for every single button versus one that just pairs automatically.
And the reliability hit is massive. We saw this in a case study recently comparing Claude three point five Sonnet, which has native MCP support, against Llama three point two seventy-B, which is a fantastic model but wasn't built with the same "native" tool-calling focus. In a multi-step research agent task, Sonnet hit a ninety-eight percent success rate on schema adherence. Llama, when pushed through a prompted ReAct pattern—that’s the Reason plus Act pattern—was hovering around seventy-five percent. That twenty-five percent gap is where the "run-on" responses and argument hallucinations live.
Seventy-five percent sounds okay until you realize that in a four-step agentic chain, your total probability of success is zero point seventy-five to the power of four. That’s... what, about thirty-two percent? You’ve gone from a reliable tool to a coin flip that usually ends in an error message.
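[Producer's note: the compounding math Corn is doing, using the per-step figures from the case study and assuming each step's success is independent:]

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end reliability of a chain of independent steps."""
    return per_step ** steps

prompted = chain_success(0.75, 4)  # ~0.32: nearly a coin flip
native = chain_success(0.98, 4)    # ~0.92: still usable in production
```

A twenty-three-point gap per step becomes a sixty-point gap over four steps, which is why per-call benchmarks understate the real difference.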
Math doesn't lie when it comes to compounding errors in agentic chains. And that is why the "agentic-optimized" models, like the one Grok released in March, are making such a big deal out of native support. They aren't just saying "we can do it"; they are saying "we have fine-tuned the model to emit the stop tokens that halt generation the millisecond the tool call is complete."
I want to go back to the "why" of models lacking native support. Is it just a training data issue? Or is there something about the architecture of, say, a smaller model that makes it harder to be "agentic"?
It's a combination. Training for tool calling requires a very specific kind of synthetic data. You need millions of examples of "User asks X -> Model thinks Y -> Model outputs JSON Z -> Model receives Result A -> Model concludes B." If you don't have that in your fine-tuning mix, the model doesn't "learn" the rhythm of the tool-calling loop. There’s also the context window issue. Agents often have massive system prompts because you’re stuffing ten or twenty tool definitions in there. If a model doesn't have "long-context reliability"—meaning it can remember the tool's parameters even if they are buried sixty thousand tokens deep—the agent will fail.
I’ve noticed that with some of the smaller "distilled" models. They’re fast, they’re cheap, but the moment you give them more than two tools, they start mixing up the arguments. It’s like they have the working memory of a goldfish.
That is actually a documented phenomenon. In the February twenty twenty-six developer survey, sixty-eight percent of AI engineers said that "native MCP support" was a top-three criterion for model selection, specifically because of that memory and reliability factor. When a model is "agentic-native," it’s often optimized for what we call "parallel tool calling."
Parallel tool calling? Is that like the model saying, "I need to check the weather in London, Paris, and Tokyo all at once" instead of doing three separate turns?
Yes. And from a latency perspective, that is a game changer. If a model can output a single block containing three tool calls, your orchestrator can run those in parallel across three different APIs and feed the results back in one go. If the model isn't optimized for that, it will try to do them one by one, and your user is sitting there for thirty seconds waiting for a simple task to finish.
How does the model actually "know" it can do them in parallel? Is that a prompt thing or a training thing?
It’s a training thing. The model has to be exposed to "multi-tool trajectories" during fine-tuning. It learns that if a query has independent sub-tasks, it can emit a list of tool calls rather than just one. If you try to force a non-native model to do this with a prompt like "You can call multiple tools at once," it often gets confused and tries to nest the JSON inside itself, which leads to a parsing error. It’s like trying to teach someone to juggle three balls at once by just shouting "JUGGLE!" at them. They might try, but the balls are going to end up on the floor.
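[Producer's note: on the orchestrator side, parallel tool calling looks roughly like this. The tool implementations here are hypothetical stand-ins for real API calls:]

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool implementation standing in for a real weather API.
def get_weather(city: str) -> str:
    return f"{city}: 18C, cloudy"

TOOLS = {"get_weather": get_weather}

def dispatch_parallel(tool_calls: list[dict]) -> list[str]:
    """Run independent tool calls concurrently instead of one per model turn."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[c["tool"]], **c["args"])
                   for c in tool_calls]
        return [f.result() for f in futures]

# A single model turn emitting three independent calls at once:
calls = [{"tool": "get_weather", "args": {"city": c}}
         for c in ("London", "Paris", "Tokyo")]
results = dispatch_parallel(calls)
```

The latency win is that all three calls run in one wall-clock round trip rather than three sequential model turns.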
So if I’m looking at Grok’s "agentic-optimized" model versus, say, GPT-four-o, which has native function calling but isn't strictly an "MCP model" by design, what am I actually weighing?
You're weighing the "agentic harness" versus the "native moat." Grok and Claude have leaned heavily into the idea that the model should be a "reasoning engine" first. They have optimized for internal chain-of-thought—that "thinking" block you see before the tool call. Native function calling in older models often felt a bit like a reflex. The model would see a keyword and jump to a tool. The newer agentic models are trained to "deliberate." They write out their reasoning, which actually helps the model stay on track and reduces hallucinations.
It’s like the difference between a junior dev who just starts typing the first solution they think of versus a senior dev who sits there, stares at the whiteboard for five minutes, and then writes ten lines of perfect code.
That is a great way to put it. And there are second-order effects to this too, especially around cost and latency. When you use a non-native model and you’re forcing it into a tool-calling role via prompt engineering, you are paying for all those extra tokens of the model "explaining" itself in ways you didn't ask for. You’re also adding latency because the model is slower to reach the "stop" point. With native support, the "stop sequence" is handled at the inference level. The moment the JSON is done, the GPU stops spinning.
Let’s talk about the "hack" for a second. Daniel asked if he could use an instructional model. If I’m a developer and I really want to use a specific base model—maybe for privacy reasons or because it’s cheaper—what does that "tool calling harness" actually look like?
It usually involves something like LangChain’s tool parsers or a custom regex-based shim. You write a system prompt that says: "You are a tool-calling agent. To use a tool, you must output a block like this: TOOL: name, ARGS: json." Then, in your application code, you have to constantly stream the model’s output, look for that "TOOL:" string, and manually kill the generation. It is brittle. If the model decides to use a lowercase "tool:" instead of uppercase, your agent breaks. If the model puts a space in the wrong place in the JSON, it breaks.
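[Producer's note: here is what that regex shim looks like in practice, and exactly how it breaks. The `TOOL:`/`ARGS:` convention is the hypothetical one from the system prompt Herman just described:]

```python
import json
import re

# The prompt told the model: 'To use a tool, output: TOOL: <name> ARGS: <json>'
PATTERN = re.compile(r"TOOL:\s*(\w+)\s*ARGS:\s*(\{.*\})", re.DOTALL)

def parse_harness_output(text: str):
    """Return (tool_name, args) or None if the model drifted off-format."""
    m = PATTERN.search(text)
    if m is None:
        return None  # lowercase "tool:", a reworded label, etc. all land here
    try:
        return m.group(1), json.loads(m.group(2))
    except json.JSONDecodeError:
        return None  # malformed JSON also kills the call

ok = parse_harness_output('TOOL: search ARGS: {"query": "invoices"}')
broken = parse_harness_output('tool: search ARGS: {"query": "invoices"}')
```

One lowercase letter and the agent silently does nothing, which is the brittleness Herman is describing.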
It sounds like you’re building a sandcastle while the tide is coming in. You can make it look like a castle for a minute, but eventually, the physics of the model are going to wash it away.
We saw this with people trying to use Ollama’s Llama three point two for MCP calls earlier this year. It worked for simple things—like "what time is it?"—but the moment you asked it to do something complex involving nested JSON objects, it just fell apart. The model would start the JSON, get distracted by a detail, and then finish the thought in plain English.
I’ve actually seen that happen where the model says, "I will now call the search tool," and then it just... describes what the search tool would have found, instead of actually calling it.
That’s the "hallucinated execution" bug! It’s one of the most frustrating things to debug. The model is so confident it knows what the tool will return that it skips the actual call. A native model is trained with a hard "stop" after the tool call syntax. It literally cannot continue until the external system provides the observation. Non-native models don't have that guardrail, so they just keep talking, making up fake data as they go.
So, if we’re building a decision tree for model selection, what are the branches?
Branch one: Do you need multi-turn reliability? If yes, prioritize native MCP support. Branch two: Are you doing parallel tasks? If yes, you need a model optimized for parallel tool calling. Branch three: Is cost the absolute only factor? If yes, you can try a non-native model with a heavy "harness," but you have to factor in the "developer tax" of maintaining that brittle code.
And that "developer tax" is real. I think people underestimate how much time is spent debugging "why did the agent stop halfway through this task?" only to find out it was a trailing comma in a JSON block.
There was a great "MCP Stress Test" published in an Anthropic blog post back in January twenty twenty-six. It basically gave models a series of increasingly complex tool-calling scenarios—nested calls, ambiguous tool descriptions, and "trap" questions where a tool shouldn't be used. The native models—Claude, the new Grok, the latest Gemini—all sailed through. The instructional models, even the very large ones, failed on the "ambiguity" test. They felt "compelled" to use a tool even when it wasn't appropriate, simply because the system prompt was so focused on tool use.
That is an interesting failure mode. It’s the "when all you have is a hammer, everything looks like a nail" problem, but for AI. If you tell an instructional model "you are a tool-using agent," it feels like it has to use a tool to be a "good" agent, even if the user just said "hello."
Precisely. Native models have better "calibration." They know when to just talk and when to reach for the toolbox. That calibration is a result of that specific fine-tuning we talked about earlier. It’s a nuance that doesn’t show up on a standard "M-M-L-U" benchmark, but it shows up the second you try to build a production-grade agent.
Is there a way to measure that calibration before you commit to a model? Like, is there a specific metric developers should look for in the technical reports?
Look for the "false positive rate" in function calling. A good model should have a near-zero rate of calling a tool when the answer is already in its context window. If the user asks "What is my name?" and the model calls a database tool instead of just looking at the previous message where the user introduced themselves, that’s a poorly calibrated agentic model. It’s wasting your money and adding unnecessary latency.
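[Producer's note: measuring that false positive rate yourself is straightforward if you label your eval transcripts. A toy sketch, with hypothetical labels:]

```python
def false_positive_rate(transcripts: list[dict]) -> float:
    """Of the prompts where the answer was already in context,
    how often did the model reach for a tool anyway?"""
    redundant = [t for t in transcripts if t["answer_in_context"]]
    if not redundant:
        return 0.0
    return sum(t["called_tool"] for t in redundant) / len(redundant)

runs = [
    {"answer_in_context": True,  "called_tool": False},  # well calibrated
    {"answer_in_context": True,  "called_tool": True},   # wasted call
    {"answer_in_context": False, "called_tool": True},   # legitimate call
]
fpr = false_positive_rate(runs)  # 0.5: one wasted call out of two chances
```

A well-calibrated agentic model should push that number toward zero on your own traffic, not just on the vendor's benchmark.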
So, what about the "agentic variants"? We see models being released as "Model X" and "Model X Agentic." From an engineering standpoint, is that just a different fine-tuning recipe on the same weights?
Usually, yes. The "agentic" variant has typically gone through an extra round of Reinforcement Learning from Human Feedback, or RLHF, specifically focused on tool-calling trajectories. They also often have a different "system prompt" baked into the model's internal behavior. It’s not just a marketing label; it usually indicates that the model has been tested against things like the Berkeley Function Calling Leaderboard or similar agentic benchmarks.
It’s like buying a truck that has the "towing package" pre-installed. Sure, you could probably bolt a hitch onto the base model yourself, but the one with the package has the upgraded cooling system and the transmission that won't explode when you hit a hill.
That is a rare analogy for us, Corn, but it is spot on. The "cooling system" in this case is the model's ability to handle the "noise" of a complex tool environment without overheating—or in AI terms, without losing the thread of the conversation.
I think we should talk about the "context engineering" shift that Daniel mentioned in his notes. He said we’re moving from "prompt engineering"—how we phrase things—to "context engineering"—how we architect the data and tools the model can see. What does that mean for someone building an MCP-based system?
It means your job as a developer is less about finding the "magic words" to make the model behave and more about being a librarian and an architect. In an MCP world, you have to decide: which tools does this agent actually need right now? You can't just dump a thousand tools into the context window; even the best models will get confused. Context engineering is about dynamically pulling in the right tool definitions based on the user's intent.
So it’s like a "Just-In-Time" delivery system for tools. Instead of the agent carrying a massive, heavy toolbox, it has a small belt, and you, the developer, are handing it the right wrench exactly when it reaches for one.
And the Model Context Protocol makes that so much easier because the "handover" is standardized. You can have an MCP server that manages your entire company’s internal APIs, and your agentic "router" just fetches the specific tool it needs for the current task.
But wait, how does the router know which tool to fetch if it hasn't seen the user's request yet? Isn't that a chicken-and-egg problem?
That’s where the "two-stage" agentic architecture comes in. You use a very small, very fast model—like a Gemini Flash or a small Llama—to do a preliminary "intent classification." It looks at the user’s prompt and says, "This sounds like a database query and a calendar check." Then, your middleware fetches the MCP definitions for only those two tools and passes them to the larger, more capable "reasoning" model. This keeps the context window clean and the reliability high.
That feels like a much more scalable way to build. But it also puts a lot of pressure on the model’s ability to parse those definitions quickly.
It does. And that brings us back to the "JSON schema adherence" benchmark. If you’re building a dynamic system where the tool definitions are changing, the model has to be incredibly robust. It can’t rely on having "memorized" a specific tool during training; it has to be able to read a brand-new JSON schema it has never seen before and immediately understand how to format the arguments.
Which is why the "it can output JSON, so it can do tool calling" myth is so dangerous. Outputting a simple JSON object is one thing; following a complex, nested schema from a third-party MCP server is a whole different level of difficulty.
It really is. I’ve seen developers get burned by this over and over. They test their agent with a simple "get_weather" tool, it works, and they think they’re golden. Then they try to connect it to their enterprise resource planning system with a forty-parameter tool definition, and the model just starts crying.
Metaphorically speaking.
Metaphorically, though sometimes I think I can hear the fans on the server whining in sympathy. Another thing to consider is "tool-use density." Some models are great if you call one tool every ten messages. But if you have a workflow that requires calling tools in every single turn for fifty turns, the "drift" becomes a real problem. The model starts to lose the original user intent because the context is now eighty percent tool outputs and only twenty percent original conversation.
Is there a fix for that? Or is that just a limit of current LLM architecture?
Part of the fix is "context pruning," where you summarize old tool outputs once they are no longer needed. But the real fix is native support. Native models are better at distinguishing between "system-level observations"—the tool results—and "user-level instructions." They treat them differently in their attention mechanism, so the tool data doesn't "pollute" the understanding of the user's goal.
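[Producer's note: context pruning can be sketched as a pass over the message history. The message shape here is a generic `role`/`content` convention, not any specific API's:]

```python
def prune_tool_outputs(history: list[dict], keep_last: int = 2) -> list[dict]:
    """Summarize all but the most recent tool observations so stale tool
    data stops crowding out the user's original goal."""
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    pruned = []
    for i, msg in enumerate(history):
        if i in stale:
            # Real systems would summarize with a small model; we truncate.
            pruned.append({"role": "tool",
                           "content": f"[summarized: {msg['content'][:40]}...]"})
        else:
            pruned.append(msg)
    return pruned

history = [
    {"role": "user", "content": "Check my last three invoices"},
    {"role": "tool", "content": "A" * 100},
    {"role": "assistant", "content": "Found the first, checking the rest."},
    {"role": "tool", "content": "B" * 100},
    {"role": "tool", "content": "C" * 100},
]
pruned = prune_tool_outputs(history)
```

The oldest observation shrinks to a stub while the two most recent stay intact for the next reasoning step.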
So, for the developers listening who are looking at the landscape in twenty twenty-six, what is the practical takeaway here? If you’re staring at a list of models on a provider like Modal or Anthropic, how do you make that choice?
I think you have to start with an "MCP readiness checklist." Don't just look for the "tool calling" tag. Ask: One, does it support structured JSON output natively? Two, can it handle multi-turn calls without prompt injection or "hallucinating" its way out of the loop? Three, is there a documented "stop sequence" for tool calls? If you're using a non-native model, you have to accept that you're going to see twenty to thirty percent higher latency because of the "parsing tax" and the extra tokens.
And you should probably run that "MCP Stress Test" yourself. Don't take the vendor's word for it. Give the model a tool it shouldn't use and see if it’s smart enough to say "I don't need that."
That is the best filter. A truly "agentic" model is defined as much by the tools it doesn't call as the ones it does.
I love that. It’s the wisdom of the agent. It’s not just about having the tools; it’s about knowing when to keep them in the belt.
We are also seeing a shift in how these models are being benchmarked. The old benchmarks were all about knowledge—"Who was the sixteenth president?" The new benchmarks, like the ones from February twenty twenty-six, are all about "trajectory." Did the agent take the most efficient path to solve the problem? Did it use the right tools in the right order? That is a much better reflection of real-world value.
Does this mean we’re going to see fewer "general purpose" models and more "specialized agent" models? Like a model that is specifically only good at SQL and Python tool calling?
I think so. We’re already seeing "coding-specific" models that are essentially just agentic models with a very deep knowledge of library documentation. They aren't great at writing poetry, but they can chain together ten different API calls to refactor a legacy codebase without breaking a sweat. The "one size fits all" era of LLMs might be coming to an end in favor of these high-performance "workers."
It’s a shift from "what do you know?" to "what can you do?" Which is really what the agentic age is all about.
It is. And as we move forward, I expect "tool calling support" to stop being a "feature" and just become a baseline requirement. In two years, we won't be talking about "agentic-optimized" models because every model will have to be agentic just to survive in the market.
It’ll be like having a phone that can’t connect to the internet. Why would you even buy it?
But for now, while we’re in this transition period, understanding the "plumbing"—the MCP specs, the stop tokens, the fine-tuning recipes—is what gives you the edge as an engineer.
What about the security aspect of this? If we’re giving these models native access to tools via MCP, aren't we just opening the door for prompt injection to do a lot more damage?
That is the elephant in the room. When a model has native tool access, a "jailbreak" isn't just about making the model say something offensive; it’s about making the model run rm -rf on your server. This is why the MCP spec includes "human-in-the-loop" hooks. For sensitive tools, the protocol actually requires a client-side confirmation. The model says "I want to delete this file," the MCP client pauses, asks the human "Is this okay?", and only then proceeds. Native models are actually better for security because they have these structured checkpoints built into the handshake.
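[Producer's note: the human-in-the-loop gate can be sketched as a client-side wrapper. This is a hedged illustration of the pattern, not the actual MCP confirmation API; the tool names are hypothetical:]

```python
# Tools the client refuses to run without explicit human approval.
SENSITIVE = {"delete_file", "transfer_funds"}

def execute_with_gate(tool: str, args: dict, confirm) -> str:
    """Pause sensitive calls for a human yes/no before running anything.
    `confirm` is a callback that asks the human and returns a bool."""
    if tool in SENSITIVE and not confirm(tool, args):
        return "REJECTED: human declined the call"
    return f"RAN {tool} with {args}"

# Simulated human who declines everything:
decline_all = lambda tool, args: False
result = execute_with_gate("delete_file", {"path": "/tmp/x"}, decline_all)
benign = execute_with_gate("get_weather", {"city": "London"}, decline_all)
```

The point of the structured handshake is that the pause happens in the client, where the model cannot talk its way past it.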
So the "handshake" isn't just for efficiency; it’s for safety. It’s a formal request rather than just a model blabbing out a command and hoping nothing goes wrong.
Precisely. It turns the model from an autonomous actor into a supervised assistant. And that distinction is vital for enterprise adoption. No CEO is going to approve an agent that has unfettered access to the company's financial data without those MCP-level guardrails.
Well, I feel like I’ve had a masterclass in AI plumbing today. I’m still a sloth, so I’ll probably just let the agents do the work for me, but at least now I know why they’re failing when they do.
Just make sure your agent has a native "get_hibernation_status" tool and you should be fine, Corn.
I'll put it on the roadmap for my personal MCP server. "Step one: check if Corn is awake. Step two: if not, do not disturb."
A very reliable agentic workflow.
Truly the only one I care about. But seriously, this has been a great dive. It’s rare that we get to look at the "how" behind the "what" in such a concrete way.
I agree. It’s easy to get caught up in the hype of "agents are coming," but the reality is that agents are already here—they just require a lot of very careful engineering to keep them from falling over.
And a lot of respect for the Model Context Protocol. It really does feel like the "USB" moment for AI.
It really does. Standardizing the interface is the first step toward a truly interconnected AI ecosystem.
Well, I think that’s a good place to wrap this one. We’ve covered the native versus prompted gap, the reality of MCP, and why your non-native model might be hallucinating its way into a corner.
It has been a pleasure as always.
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power this show—including the inference for the agents we were just talking about.
This has been My Weird Prompts.
If you are finding these deep dives helpful, we’d love it if you could leave us a quick review on your favorite podcast app. It really does help other curious minds find the show.
Take care, everyone.
See you in the next prompt.