Daniel sent us a two-parter this week — and it's one of those questions that only makes sense if you've been living inside the guts of agentic AI for the last six months. He's been building heavily with Claude, noticed his context was getting eaten alive by too many plugins, built his own workaround, and now he's asking a bigger question. Where should builders actually place their bets right now — MCP, or agent skills? And then, once you pick a lane, how do you handle federated access when you're working with a team? Because sharing root credentials with a junior dev is just as dumb in the agent world as it is in AWS.
This lands at exactly the right moment, because a model just dropped claiming a twelve million token context window. That's an order of magnitude beyond where we've been stuck. The state of the art has been hovering around one million tokens for a while now — it's been a plateau — and suddenly someone's claiming they've cracked twelve. That changes the math on everything Daniel's asking about.
Quick housekeeping note — DeepSeek V four Pro is writing our script today, so if anything sounds unusually coherent, that's why.
I was going to say, you sound sharper than usual.
I was talking about you.
Well, let's dig into the twelve million token claim first, because it sets up the whole conversation. The model Daniel's referring to — from what I've been able to track — appears to be from a lab outside the usual big five. Not Anthropic, not OpenAI, not Google, not Meta, not xAI. The details are still thin, but the claim is twelve million tokens with what they're calling "near-lossless" recall across the full span.
Near-lossless is doing a lot of work there.
It always does. But even if it's eighty percent effective recall at the tail end, that's still transformative. For context, one million tokens is roughly the complete works of Shakespeare. Twelve million — that's an entire small library. That's every email you've ever sent, every Slack message from a five-year-old startup, and the full documentation for a dozen APIs, all in active memory at once.
Which means the whole category of work Daniel's been doing — context management, pruning, deciding what to load and what to leave out — starts to look like a temporary job. Like being a lamplighter in 1890.
That's exactly the tension he's feeling. He built this elegant plugin catalog system — one plugin that searches a catalog and exposes skills on demand, instead of a hundred plugins each bleeding context. It's clever, and he's right to give himself credit. But he also knows it might be obsolete in six months.
Let's be precise about why his fix matters right now. He was running into what you could call the eager loading problem. Claude loads all plugin descriptions into context upfront — every skill you've got installed in the marketplace, its description, its parameters, its examples. That's not your conversation with Claude. That's overhead. And he noticed his reasoning quality was degrading, which is the classic symptom of context pollution.
This is something the Anthropic docs actually confirm, and Daniel dug into them directly. When you have a lot of skills installed, each one's metadata sits in context. It's not huge per skill — maybe a few hundred tokens — but if you go on what he called a "skill-creating binge" and end up with fifty or sixty plugins, a few hundred tokens each adds up to ten or twenty thousand tokens burned before you've even typed a message. And those tokens are competing with the actual task for the model's attention.
The attention mechanism in a transformer doesn't treat "system instructions" and "user query" as separate buckets. It's all one big soup. Every token attends to every other token, and the more tokens you have, the more diluted the attention gets for any given piece of information.
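To put a number on that dilution, here's a toy sketch: a softmax over random similarity scores. It's nothing like a real model, but the averaging pressure it shows is the same one at work in attention as the context grows:

```python
import numpy as np

# Toy illustration of attention dilution: with similarity scores drawn from a
# fixed distribution, the softmax weight available to any single token shrinks
# as the context grows. The mean weight is exactly 1/n.
rng = np.random.default_rng(0)

for n in (1_000, 100_000, 1_000_000):
    scores = rng.normal(0.0, 1.0, size=n)   # one query's similarity to n keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax
    print(f"n={n:>9,}  mean weight={weights.mean():.2e}  max weight={weights.max():.2e}")
```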
Daniel's workaround — a single plugin that acts as a searchable catalog, loading skill definitions only when they're actually needed — essentially turns eager loading into lazy loading. It's the same principle as dynamic library linking in operating systems: don't load the code into memory until the program actually calls for it.
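In miniature, the catalog shape might look like this. Everything here is hypothetical: the skill names, the naive keyword search, and the file layout are all stand-ins for whatever Daniel actually built.

```python
from dataclasses import dataclass

@dataclass
class SkillEntry:
    name: str
    summary: str   # a one-line summary is all that stays resident in context
    path: str      # the full definition stays on disk until it's needed

# Hypothetical catalog; in Daniel's setup this would be his real skill library.
CATALOG = [
    SkillEntry("gmail_search", "search email by query and label", "skills/gmail_search.md"),
    SkillEntry("jira_triage", "label and assign incoming Jira tickets", "skills/jira_triage.md"),
    SkillEntry("csv_summarize", "summarize a CSV file's columns", "skills/csv_summarize.md"),
]

def search_catalog(query: str) -> list[SkillEntry]:
    """Cheap keyword match; only the matching summaries are surfaced to the model."""
    terms = query.lower().split()
    return [s for s in CATALOG if any(t in s.summary or t in s.name for t in terms)]

def load_skill(entry: SkillEntry) -> str:
    """Lazy load: the full definition enters context only on demand."""
    with open(entry.path) as f:
        return f.read()

print([s.name for s in search_catalog("search my email")])  # ['gmail_search']
```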
Which is exactly the kind of thing a good engineer does when the system isn't handling it automatically. And his instinct — "maybe Anthropic will fix this in a month" — is probably right. These are solvable problems. The question is whether they get solved at the infrastructure level or made irrelevant by massive context windows.
Let's talk about the bifurcation Daniel's really asking about. He sees two surfaces for agentic AI right now. One is MCP — the Model Context Protocol — which is essentially a standardized way to wrap APIs so that models can interact with them through a streamlined interface. The Gmail API might have two hundred endpoints with bewildering parameter combinations. MCP says: we'll expose five clean tools — search email, send email, get thread, list labels, delete message. The model only needs to know those five things.
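Here's what that streamlining might look like as code. This is a minimal sketch assuming the FastMCP helper from the official Python MCP SDK; the Gmail backend is a stub, and the tool set is deliberately tiny:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gmail")

class _StubGmail:
    """Stand-in for a real Gmail backend; everything here is hypothetical."""
    def search(self, query, max_results):
        return [{"id": "123", "snippet": f"stub result for {query!r}"}][:max_results]
    def send(self, to, subject, body):
        return "stub-message-id"

gmail = _StubGmail()

@mcp.tool()
def search_email(query: str, max_results: int = 10) -> list[dict]:
    """Search the mailbox. The model sees this signature, not 200 endpoints."""
    return gmail.search(query, max_results)

@mcp.tool()
def send_email(to: str, subject: str, body: str) -> str:
    """Send a message and return its id."""
    return gmail.send(to, subject, body)

if __name__ == "__main__":
    mcp.run()  # speaks the MCP protocol over stdio by default
```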
The other surface is agent skills — where you define a skill as a block of instructions, possibly with scripts attached, and the model invokes the skill, which can then do whatever it needs to do, including calling raw APIs directly with curl or running arbitrary code.
Daniel's argument is provocative. He says you can do everything with agent skills that you can do with MCP, and therefore MCP might be redundant in a world of large context windows. His reasoning is: why wrap the Gmail API in an MCP server when you can just write a skill that knows how to use curl against the Gmail API endpoints directly?
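For contrast, the skills-only version of the same capability can be just a function against the raw endpoint, with Python's requests standing in for curl. A simplified sketch; the real Gmail API also wants OAuth handling and pagination:

```python
import requests

def search_email_raw(query: str, token: str) -> list[dict]:
    """No MCP server, just a skill that knows the raw Gmail endpoint."""
    r = requests.get(
        "https://gmail.googleapis.com/gmail/v1/users/me/messages",
        headers={"Authorization": f"Bearer {token}"},  # needs full API credentials
        params={"q": query, "maxResults": 10},
        timeout=10,
    )
    r.raise_for_status()
    return r.json().get("messages", [])
```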
I want to push back on that, because I think it misses something important about why MCP exists in the first place.
MCP isn't just about reducing the number of endpoints. It's about creating a stable abstraction layer. When Google changes the Gmail API — which they do, regularly — the MCP server gets updated once, and every agent that uses it keeps working. If you've got a hundred skills that all use raw curl against Gmail endpoints, and Google deprecates an endpoint or changes the authentication flow, you're updating a hundred skills.
That's the DRY principle in a different hat. Don't Repeat Yourself. But Daniel's counterargument would be: in a world where the model itself can update those skills, where you can say "hey, the Gmail API changed, go fix all the skills that depend on it" — does the abstraction layer still need to be a separate protocol?
That's the crux of it. If the model is smart enough and has enough context to manage its own tools, then MCP becomes a convenience rather than a necessity. But we're not quite there yet, and I'd argue we're further away than the twelve million token headline suggests.
Let's get concrete about what a twelve million token context window actually means for these two approaches. The current bottleneck with agent skills is exactly what Daniel experienced — every skill definition costs tokens, so you can't have too many. At one million tokens, you can maybe have fifty to a hundred well-defined skills before you start seeing degradation. At twelve million, you can have thousands.
Which changes the economics entirely. If you can have thousands of skills loaded simultaneously without context pollution, then the "just write a skill for everything" approach becomes viable in a way it isn't today.
Here's where I think Daniel's analysis undersells MCP. MCP isn't just about token efficiency. It's about security boundaries. When a skill uses raw curl to hit an API, it needs the full credentials for that API. The skill can do anything those credentials allow. An MCP server can enforce granular permissions — this tool can search email but not delete it, this tool can read a specific label but not everything.
This connects directly to Daniel's second question about federated access. If you're a solo developer, you trust your own skills. You wrote them, you know what they do. But the moment you're on a team, and someone else is writing skills that will run with your credentials, you need guardrails.
This is where the namespace problem comes in. Daniel mentions that namespacing MCP tools is possible but becomes a headache manually. He's right. If you've got five different MCP servers, and two of them expose a "search" tool, you need to distinguish between them. You end up with "gmail_search" and "slack_search" and "jira_search" — and now your model needs to understand the naming convention.
Which is the kind of thing that sounds trivial until you're debugging why the agent keeps trying to search Jira for your email.
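The manual workaround is simple enough to sketch. This assumes no particular framework, just client-side prefixing with a routing table:

```python
def namespace_tools(servers: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Prefix each tool with its server name so two 'search' tools can coexist.

    Returns a mapping from the namespaced name the model sees to the
    (server, original_tool) pair the client actually dispatches to.
    """
    routed = {}
    for server, tools in servers.items():
        for tool in tools:
            routed[f"{server}_{tool}"] = (server, tool)
    return routed

# Hypothetical example: two servers both expose a "search" tool.
print(namespace_tools({
    "gmail": ["search", "send"],
    "jira":  ["search", "create_issue"],
}))
# {'gmail_search': ('gmail', 'search'), 'gmail_send': ('gmail', 'send'),
#  'jira_search': ('jira', 'search'), 'jira_create_issue': ('jira', 'create_issue')}
```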
There's an emerging set of practices around this. The MCP specification itself has been evolving — it now supports tool namespacing natively, so servers can declare a prefix and the client handles the disambiguation. But the bigger issue Daniel's raising is about access control at scale.
Let's take the AWS analogy he used. You don't give the junior developer root credentials. You give them an IAM role with exactly the permissions they need. The equivalent in the agent world would be: this team member's agent can use the Gmail search tool but not the Gmail delete tool. Or it can access the staging database but not production.
That's currently a mess. Most agent frameworks — whether you're using MCP or skills — don't have a robust identity and permissions layer. The model either has access to a tool or it doesn't. There's no concept of "this model, running on behalf of this user, in this context, can use this tool with these constraints."
This is where I think Daniel's intuition about skills being more fundamental might actually create a bigger problem. With MCP, the server is a natural choke point for permissions. You can put your access control logic in the server, and every tool call goes through it. With raw skills that call APIs directly, the permissions logic has to be embedded in each skill, or you need some kind of proxy layer.
Which you could build. You could have a skill that doesn't hold credentials itself but calls out to a secrets manager, which enforces policies. But now you're building infrastructure that MCP servers give you for free. Or at least, they're supposed to.
Let's talk about what's actually happening in the ecosystem right now. Daniel asked about emerging tooling for federated access, and there are a few threads worth pulling on.
The most interesting one, I think, is what's happening around OAuth for AI agents. There's a draft specification — it's being called "Agent OAuth" or sometimes "AI Agent Authorization" — that extends standard OAuth flows to handle the case where the entity making the API call isn't a human clicking a button but an agent acting on their behalf.
The core problem is that OAuth was designed for a human-in-the-loop consent moment. You get redirected, you click "allow," you get a token. An agent can't click "allow" — or rather, you don't want it to, because that defeats the purpose of consent.
So the emerging pattern is what's being called "pre-authorized scopes." When you set up the agent's credentials, you specify exactly which scopes it gets, and those scopes are baked into the token. The agent never sees the full credential. It gets a token with limited permissions, and if it tries to do something outside those permissions, the API rejects it.
That sounds obvious, but it's actually a departure from how a lot of API keys work today. Most API keys are all-or-nothing. You have the key, you can do anything. The move to scoped tokens for agents is a necessary precondition for any kind of team-based agent development.
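As a toy illustration of the pattern, and emphatically not the draft's actual wire format, the flow is roughly this:

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-secret"  # stand-in for a real signing key

def mint_agent_token(agent_id: str, scopes: list[str], ttl_s: int = 3600) -> str:
    """Pre-authorized scopes, baked in at issue time. The agent never sees
    the full credential, only this limited token."""
    claims = {"agent": agent_id, "scopes": scopes, "exp": time.time() + ttl_s}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"

def check_scope(token: str, required: str) -> bool:
    """The API side: reject anything outside the baked-in scopes."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return time.time() < claims["exp"] and required in claims["scopes"]

token = mint_agent_token("daniels-email-agent", ["gmail.search"])
print(check_scope(token, "gmail.search"))  # True
print(check_scope(token, "gmail.delete"))  # False: never granted, can't be used
```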
It's happening. Google's API console now lets you create service accounts with extremely granular permissions. Anthropic's API has been moving toward more fine-grained API keys. Stripe has had restricted API keys for years. The pieces are there, but stitching them together into a coherent agent permissions framework — that's what nobody's quite solved yet.
Daniel mentioned he's a solo developer, so he doesn't have to think about this day to day. But he's right to flag it, because the moment agentic AI moves into enterprise settings — which it's doing right now — this becomes the blocking issue. No CISO is going to sign off on an agent that has unrestricted access to the company's Gmail, Slack, GitHub, and AWS.
There's a startup — I won't name them because I'm not sure about their current status — but they're working on what they call an "agent identity plane." The idea is that every agent gets its own identity, separate from the human it's acting on behalf of, and that identity carries its own set of permissions. You can revoke an agent's access without revoking the human's, and vice versa.
That's the right abstraction. An agent isn't you. It's a piece of software acting on your behalf, and it should have its own credentials, its own audit trail, its own scope. When something goes wrong — and it will — you need to know which agent did what, under whose authority, with what permissions.
This loops back to the MCP versus skills question in an interesting way. MCP servers can log every tool call. If you're using raw skills with curl, the logging happens at the API level — which is fine if you control the API, but if you're calling a third-party service, you might not get detailed logs. MCP gives you an audit layer that skills don't inherently provide.
Unless you build it. And that's really the trade-off Daniel's dancing around. Skills are more flexible. You can do anything with them. But every piece of infrastructure you want — logging, permissions, versioning, rollback — you have to build yourself or do without. MCP gives you a lot of that out of the box, at the cost of some flexibility.
Let me complicate this further. There's a third path that's emerging, which is skills that are backed by MCP servers under the hood. You define the skill in a way that's easy for the model to understand, but when the skill executes, it calls an MCP server that handles the actual API interaction.
The skill is the interface, and MCP is the implementation layer.
And that might be where things settle. The model sees a clean, natural-language skill definition — "search my email for messages from Daniel about context windows" — and the skill's implementation uses MCP to securely and auditably call the Gmail API.
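That hybrid is easy to sketch. Everything named here is illustrative; mcp_call stands in for whatever client plumbing your framework provides:

```python
def mcp_call(server: str, tool: str, **args):
    """Stub dispatcher: in a real system this would route to an MCP client.
    It's also the natural choke point for logging and permission checks."""
    print(f"[audit] {server}.{tool}({args})")
    return {"ok": True}

def search_my_email(natural_language_query: str):
    """Skill: search my email. Thin, readable, and easy for the model to pick.
    The messy API interaction lives behind the MCP layer."""
    return mcp_call("gmail", "search_email", query=natural_language_query, max_results=10)

search_my_email("messages from Daniel about context windows")
```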
That hybrid approach actually makes a lot of sense, and it maps to how software development has always worked. You have high-level abstractions that are easy to reason about, and low-level implementations that handle the messy details. The question is just where the boundary sits.
Daniel's plugin catalog system is essentially that boundary. His one plugin is the high-level abstraction — "find and load the right skill for this task" — and the individual skills are the implementations. He's built a miniature MCP server, in effect, without calling it that.
Which is why I think his framing of "MCP versus skills" might be a false dichotomy. They're different layers of the stack. The real question is: which layer do you invest in standardizing, and which layer do you keep flexible?
If I had to bet — and I'm going to hedge this heavily — I think MCP wins at the infrastructure layer and skills win at the user-facing layer. The model sees skills. The plumbing uses MCP. And the twelve million token context window doesn't change that calculus much, because the constraint isn't just about how many tools you can load into memory.
The constraint is also about reliability, security, and maintainability. Those don't go away just because you have more RAM.
Let's talk about what happens when context windows really do get to twelve million tokens. Because there's a knock-on effect that I don't think enough people are discussing.
The current agentic workflow involves a lot of context switching. You load tools for one task, unload them, load different tools for the next task. That's expensive in terms of latency and cognitive overhead for the model. If you can keep everything loaded all the time, the model can fluidly move between tasks without that switching cost. That changes what kinds of workflows are possible.
It makes the agent more like an operating system and less like a script. An OS keeps many processes resident and a scheduler decides which one gets attention. A script runs one thing at a time and then exits. Most AI agents today are scripts. A twelve million token context window makes them operating systems.
An operating system needs a permissions model. Which brings us right back to Daniel's second question.
Let's get specific about what best practices are actually emerging for federated access in agent systems. Daniel asked what he should be paying attention to, and I think there are three things.
I'm listening.
First, scoped API keys for everything. This is table stakes. Every service your agent touches should have a dedicated API key with the minimum permissions needed. Not your personal API key. Not the root key. A key that can do exactly what the agent needs and nothing else. If you're using Gmail, that means a service account with read-only access to specific labels. If you're using GitHub, a fine-grained personal access token that can only access specific repositories.
Rotate those keys regularly. If an agent's credential gets compromised — or if the agent itself goes rogue — you want to be able to cut it off without breaking everything else.
Second, the identity plane concept I mentioned earlier. Every agent action should be traceable to a specific agent identity, which is separate from the human identity that authorized it. This is still nascent, but the building blocks are there. Service accounts, audit logs, and the emerging Agent OAuth spec all point in this direction.
The third thing, I'd add, is tool-level access control within the agent framework itself. This is where MCP has an advantage. If you're using an MCP server, you can configure it so that tool A is available to everyone on the team, tool B is available only to senior developers, and tool C requires explicit human approval before it executes.
That last one — human-in-the-loop approval for high-stakes actions — is going to become standard practice. You don't want an agent being able to delete a production database or send an email to your entire customer list without a human saying "yes, do that."
There's a paper from earlier this year that looked at this exact problem. They called it "graduated autonomy" — the idea that agents should have different levels of autonomy depending on the risk of the action. Low-risk actions, like searching email, are fully automated. Medium-risk actions, like sending an email, might require a quick confirmation. High-risk actions, like changing DNS records or running database migrations, require explicit human approval with a timeout.
The key insight from that paper was that the risk level isn't inherent to the tool — it's contextual. Deleting one email is low risk. Deleting every email from the last year is high risk. The permissions system needs to be able to distinguish between those two cases.
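A sketch of that contextual tiering, with tool names and thresholds invented purely for illustration:

```python
from enum import Enum

class Risk(Enum):
    AUTO = 1     # run without asking
    CONFIRM = 2  # quick yes/no from a human
    APPROVE = 3  # explicit human approval, with a timeout

def assess(tool: str, args: dict) -> Risk:
    """Contextual risk in the spirit of graduated autonomy: the same tool can
    land in different tiers depending on its arguments."""
    if tool == "delete_email":
        count = args.get("count", 1)
        return Risk.AUTO if count == 1 else Risk.APPROVE
    if tool in ("send_email", "create_issue"):
        return Risk.CONFIRM
    if tool in ("run_migration", "change_dns"):
        return Risk.APPROVE
    return Risk.AUTO  # searches, reads, formatting

print(assess("delete_email", {"count": 1}))     # Risk.AUTO
print(assess("delete_email", {"count": 4000}))  # Risk.APPROVE
```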
Which is hard. Because it requires the permissions system to understand the semantics of the action, not just which tool is being called.
This is where I think we're going to see a lot of innovation in the next year or two. The tools for agent permissions today are crude. They'll get smarter. And I suspect the winning approach will be something like: define policies in natural language, have a smaller, faster model evaluate each action against those policies, and escalate anything ambiguous to a human.
That's essentially what Daniel's catalog plugin does, in a way. It's a policy layer — "only load the skills that are relevant to this task." Extend that to "only allow the actions that are consistent with this policy," and you've got a permissions framework.
Let's circle back to the twelve million token model, because there's one more angle I want to explore. Daniel said something interesting in his prompt — he noted that reasoning capability improvements haven't been matched by context window increases in recent releases. The models got smarter, but they didn't get bigger memory. This new model, if the claims hold up, breaks that pattern.
The question is whether that's actually useful. A bigger context window doesn't help if the model can't effectively reason across the entire span. We've seen this with existing models — they technically support a million tokens, but their effective recall drops off sharply after a hundred thousand or so.
The "lost in the middle" problem.
Information at the beginning and end of the context window gets attended to. Information in the middle gets blurred. It's a well-documented phenomenon, and it's not clear that just scaling up the context window fixes it. You need architectural changes to how attention works.
The twelve million token claim, if it's real, almost certainly involves some kind of architectural innovation. Ring attention, or sparse attention, or something that breaks the quadratic scaling problem. Because a naive transformer with twelve million tokens would be computationally infeasible.
The quadratic scaling is the elephant in the room. Attention scales with the square of the sequence length. Going from one million to twelve million tokens — that's twelve times the tokens, but a hundred and forty-four times the computation if you're doing full attention. Nobody can afford that.
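Written out, that back-of-envelope is just the square of the length ratio, under full attention:

$$
\text{cost}(n) \propto n^2
\qquad\Longrightarrow\qquad
\frac{\text{cost}(12\,\mathrm{M})}{\text{cost}(1\,\mathrm{M})}
= \left(\frac{12 \times 10^6}{1 \times 10^6}\right)^2
= 12^2 = 144.
$$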
Whatever they're doing, it's clever. And that cleverness might have implications for whether the model can actually use all twelve million tokens effectively, or whether it's twelve million tokens of storage but only two million tokens of effective reasoning.
Which brings us to the practical question for someone like Daniel. If you're building agent systems today, should you optimize for the world as it is — one million token context windows, eager loading, the need for careful context management — or for the world as it might be in six months?
I think the answer is: do both, but know which is which. Daniel's catalog plugin is a perfect example of optimizing for today. It solves a real problem he was experiencing right now. But he built it knowing it might be temporary, and he didn't over-invest. It's one plugin, not a whole new framework.
The principle is: make your temporary fixes easy to delete. If you're building something to work around a current limitation, don't build it into the foundations of your system. Keep it as a shim that you can remove when the limitation goes away.
On the other side, invest deeply in things that will matter regardless of context window size. Good permission models. Clear audit trails. Those don't become less important when you have more tokens. They become more important, because the agent can do more damage.
That's the through line in Daniel's two questions. He's asking about MCP versus skills, and about federated access, but the underlying concern is the same: what do I build that will still matter in two years? And the answer is: build for security and maintainability. The interface layer — MCP, skills, whatever — will evolve. The need to control what your agents can do will not.
Let me offer a concrete recommendation, since Daniel asked for one. If I were building an agent system today, here's what I'd do. I'd use MCP for any integration that involves sensitive data or destructive actions. Gmail, GitHub, databases, payment systems — those go through MCP servers with strict tool-level permissions. I'd use agent skills for everything else — quick utilities, data formatting, research tasks, things where the blast radius is small.
That's a solid heuristic. MCP for the high-stakes stuff, skills for the low-stakes stuff. And if you're a solo developer like Daniel, you can probably get away with skills for everything, because you're the only one with access and you trust yourself not to write a skill that deletes your email.
The moment you add a second person, the calculus changes. And Daniel's smart to be thinking about that now, even if he's not there yet. Because the architecture you choose at the start constrains what's easy later.
There's one more thing I want to mention about the federated access question, because Daniel specifically asked about namespacing and tooling. There's a pattern emerging that I've seen in a few open-source projects — it's basically a tool registry with built-in RBAC.
Role-based access control.
The idea is you register all your tools — whether they're MCP tools or skills — in a central registry. Each tool has a set of required permissions. Each user or agent has a set of granted permissions. When an agent tries to use a tool, the registry checks whether the agent's permissions cover the tool's requirements. If not, it's blocked before the call even reaches the API.
It's like a middleware layer for agent actions. Every tool call goes through it, and it enforces policy.
And the nice thing about this pattern is that it's agnostic to whether the tool is implemented as an MCP server or a raw skill. The registry doesn't care. It just knows that "delete_all_emails" requires the "email_destructive" permission, and this agent doesn't have it.
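A minimal sketch of the registry pattern, with invented tool names and permission strings:

```python
REGISTRY = {
    "gmail_search":      {"requires": {"email.read"}},
    "delete_all_emails": {"requires": {"email.destructive"}},
    "jira_search":       {"requires": {"jira.read"}},
}

GRANTS = {
    "daniels-agent": {"email.read", "jira.read"},
    "ci-bot":        {"jira.read"},
}

class PermissionDenied(Exception):
    pass

def dispatch(agent: str, tool: str, call):
    """Middleware: every tool call passes through here, MCP or raw skill alike."""
    required = REGISTRY[tool]["requires"]
    granted = GRANTS.get(agent, set())
    if not required <= granted:
        raise PermissionDenied(f"{agent} lacks {required - granted} for {tool}")
    print(f"[audit] {agent} -> {tool}")  # a free audit trail at the choke point
    return call()

dispatch("daniels-agent", "gmail_search", lambda: "ok")        # allowed
# dispatch("daniels-agent", "delete_all_emails", lambda: "!")  # raises PermissionDenied
```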
That's the kind of thing I expect will become standard within a year or two. It's too obvious a need, and the building blocks are too readily available, for it not to.
It's the kind of thing that makes the MCP versus skills debate less consequential. If you've got a good permissions layer, you can use whichever implementation pattern makes sense for each integration, and the security properties are consistent across both.
Which is, I think, the answer to Daniel's deeper question. He's asking which horse to bet on, and the answer is: bet on the infrastructure that makes the choice less important. Permissions, logging, identity — those are the things worth investing in.
The twelve million token model, if it's real, changes the urgency of some decisions but not the fundamentals. You still need to control what your agents can do. You still need to know who did what. You still need to be able to revoke access. Those requirements are invariant under context window scaling.
Alright, let's land this. Daniel asked two questions. MCP versus agent skills — what's the take right now? And what are the best practices for federated access?
On the first: it's not either-or. MCP gives you security, auditability, and a stable abstraction layer. Skills give you flexibility and speed. The smart move is to use MCP for high-stakes integrations and skills for everything else, and to keep your architecture flexible enough that you can change your mind.
On the second: scoped API keys for everything, separate agent identities, tool-level access control, and human-in-the-loop for destructive actions. The tooling is still nascent, but the direction is clear. If you're building now, build with the assumption that you'll need to add permissions later, even if you don't need them today.
Daniel's catalog plugin — the thing that sparked this whole question — is exactly the right kind of temporary fix. It solves a real problem, it's elegant, and it's easy to throw away when the infrastructure catches up.
The lamplighter who knows the electric lights are coming but still lights the lamps tonight, because people need to see.
That's the job.
Now: Hilbert's daily fun fact.
Hilbert: The only surviving hand plane from the 1840s Vanuatu archipelago was forged from a single piece of discarded ship ballast iron, and its tote is carved from narra hardwood with a distinctive thumb-notch found nowhere else in Pacific woodworking traditions.
I have so many questions about ship ballast iron.
I have so many questions about why Hilbert knows that.
The thing I keep coming back to is that we're in the awkward adolescence of agentic AI. The tools are powerful enough to be useful but not mature enough to be boring. And that means the people building with them right now are doing a lot of work that future developers won't have to do. Context management, permission scaffolding, tool orchestration — all of this will eventually be handled by the platform.
The people doing that work now are the ones who understand how it actually works. When the abstractions finally arrive, they'll know what's happening underneath. That's an advantage.
This has been My Weird Prompts, with thanks to our producer Hilbert Flumingtop. If you want more episodes, we're at myweirdprompts.
If you enjoyed this, leave us a review — it helps other people find the show.
Until next time.