Daniel sent us this one — he's been building something that tackles a problem I think a lot of people are starting to feel but haven't quite named yet. The basic issue: when you install plugins in Claude Code, every skill, command, and agent definition gets eagerly loaded at session start. Descriptions pile up. Your context window is getting nibbled before you've typed a single character. His architecture inverts the whole model — instead of plugins living on your machine, you run a catalogue server as a substrate, and each workstation only installs a thin bridge plugin that fetches what it needs on demand. He wants us to dig into why this pattern matters for agentic systems generally, the trade-off between eager and lazy fetching, and why the humble description field suddenly becomes the most important thing you write.
Oh, this is good. This is really good. And by the way — DeepSeek V four Pro is writing our script today, so if the transitions feel extra crisp, that's why.
Alright, so where do we start? Because this feels like one of those patterns that's obvious in retrospect but nobody was talking about six months ago.
Let's start with the actual numbers on what eager loading costs you. In a typical Claude Code setup, when your session initializes, the system prompt gets assembled — that's your custom instructions, your CLAUDE dot md file, and then every installed plugin's manifest. Each plugin declares its skills, its commands, its agents, and each of those has a description. Those descriptions might be fifty words, might be two hundred. Install five plugins, you're probably fine. Install fifteen, and you're burning maybe two thousand tokens before you've typed anything. Install thirty — which isn't crazy if you're pulling in specialized tooling for databases, deployment, testing, code review — and suddenly your context window has a noticeable dent.
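To make that dent concrete, here's a back-of-the-envelope sketch in TypeScript. Every number in it is an illustrative assumption, not a measurement; it just reproduces the arithmetic we're describing:

```typescript
// Back-of-the-envelope sketch of eager-loading overhead.
// Every number here is an illustrative assumption, not a measurement.
const skillsPerPlugin = 2;      // skills/commands/agents per plugin
const wordsPerDescription = 50; // on the short end of the 50-200 range
const tokensPerWord = 1.3;      // rough English tokens-per-word ratio

function pinnedOverhead(plugins: number): number {
  return Math.round(plugins * skillsPerPlugin * wordsPerDescription * tokensPerWord);
}

console.log(pinnedOverhead(5));  // ~650 tokens: probably fine
console.log(pinnedOverhead(15)); // ~1950 tokens: the "two thousand" burn
console.log(pinnedOverhead(30)); // ~3900 tokens: a noticeable dent
```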
The context window isn't free real estate. You pay for it in attention degradation, not just token limits.
There's been solid research on this — the "lost in the middle" problem, where model attention quality degrades for information in the middle of a long context. Anthropic's own papers have documented this. So it's not just that you're wasting tokens, it's that the tokens you're wasting are sitting right in the prime retrieval zone at the start of the context, pushing more relevant content toward the middle, where recall gets worse.
Daniel's insight is basically: stop shipping the whole toolbox to every session. Ship a catalogue instead.
And the architecture is worth walking through carefully, because the details matter here. You have a centralized substrate — a catalogue server — that holds all the skills, agents, commands, and their metadata. Postgres with pgvector for embeddings, signed records using ed25519, an embedding cache, asset storage for anything heavy. Plugin authors publish into the substrate. They don't sit on your laptop at all.
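For a sense of what actually lives in the catalogue, here's a hypothetical shape for one signed record. The field names are guesses for illustration, not Daniel's actual schema:

```typescript
// Hypothetical shape of a catalogue record on the substrate.
// Field names are illustrative; the real schema may differ.
interface SkillRecord {
  qualifiedName: string;   // e.g. "db-tools/run-migrations"
  namespace: string;       // publisher or team namespace
  description: string;     // short text shown in the pinned index
  body: string;            // full skill definition, fetched lazily
  embedding: number[];     // pgvector embedding of the description
  authorPublicKey: string; // ed25519 public key of the publisher
  signature: string;       // ed25519 signature over the record
  authorship: "user" | "third-party"; // trust tier
}
```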
On the laptop side?
You install exactly one bridge plugin. That's it. One thin plugin that does two jobs. First, a SessionStart hook fires and asks the substrate: "what skills is this identity subscribed to?" The substrate returns a compact index — qualified name, namespace, a short description per skill — and the bridge pins that as a system reminder. The model can read that index and decide what it needs. The full skill body — the actual implementation, the prompts, the tool definitions — stays on the substrate until the agent explicitly fetches it through an MCP tool call, get underscore skill.
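Roughly what that SessionStart hook might look like, sketched in TypeScript. The substrate URL and response shape are invented for the example; the JSON-on-stdout output follows Claude Code's hook convention for adding session context:

```typescript
// Minimal sketch of the bridge's SessionStart hook. The substrate URL and
// its response shape are assumptions for illustration.
interface IndexEntry {
  qualifiedName: string;
  namespace: string;
  description: string;
}

const SUBSTRATE = process.env.SUBSTRATE_URL ?? "https://substrate.example.com";

async function main(): Promise<void> {
  const identity = process.env.SUBSTRATE_IDENTITY ?? "default";
  const res = await fetch(`${SUBSTRATE}/subscriptions/${identity}/index`);
  const entries = (await res.json()) as IndexEntry[];

  // Compact, scannable index: one line per skill. The menu, not the kitchen.
  const pin = entries
    .map((e) => `- ${e.namespace}/${e.qualifiedName}: ${e.description}`)
    .join("\n");

  // Hooks talk back to Claude Code as JSON on stdout; additionalContext is
  // what gets pinned into the session.
  console.log(
    JSON.stringify({
      hookSpecificOutput: {
        hookEventName: "SessionStart",
        additionalContext: `Subscribed skills (fetch a full body with get_skill):\n${pin}`,
      },
    })
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```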
The model sees a menu, not the kitchen.
That's exactly the metaphor. It sees a menu with descriptions, and when it decides "I need the PostgreSQL schema migration skill," it calls get underscore skill, the substrate returns the full skill definition, and the agent loads it into context at the point of use. Lazy fetch with a pinned descriptor index.
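A minimal sketch of that get_skill tool using the TypeScript MCP SDK. The substrate endpoint and its response shape are assumptions:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "bridge", version: "0.1.0" });

// get_skill: resolve a qualified name against the substrate and return the
// full skill body. The endpoint and response shape are hypothetical.
server.tool(
  "get_skill",
  "Fetch the full definition of a subscribed skill by qualified name",
  { qualifiedName: z.string() },
  async ({ qualifiedName }) => {
    const res = await fetch(
      `https://substrate.example.com/skills/${encodeURIComponent(qualifiedName)}`
    );
    if (!res.ok) {
      return {
        content: [{ type: "text" as const, text: `No skill named ${qualifiedName}` }],
        isError: true,
      };
    }
    const skill = (await res.json()) as { body: string };
    // The full body enters context only here, at the point of use.
    return { content: [{ type: "text" as const, text: skill.body }] };
  }
);

await server.connect(new StdioServerTransport());
```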
Okay, let me push on the trade-off here, because I think it's more interesting than "lazy is better." Eager loading has a real advantage — when the skill is needed, it's already there. The model doesn't have to make a tool call, wait for a round-trip, parse the response. In a session where you end up using most of your installed skills, eager loading was the right call. You paid the context cost once and got free access thereafter.
That's fair. And there's a crossover point. If you have three plugins and you use all of them in every session, the bridge pattern adds overhead for no benefit. The round-trip cost of fetching skills on demand might actually slow things down compared to just eating the upfront context cost.
This pattern isn't universally better. It's better past some threshold.
Right, and I think the threshold is basically: when does plugin count times average description size start to meaningfully compete with the actual task context? If you're doing a focused coding session where the relevant context — the files you're editing, the conversation history, the instructions — is maybe eight thousand tokens, and your plugin descriptions are eating two thousand of that, you've lost twenty-five percent of your working space. That's real.
It gets worse in team settings. If you've got a shared substrate with dozens of namespaces and you're subscribed to a curated slice, the eager model would have forced you to either install everything locally and manually prune, or just live with the bloat. The bridge pattern gives you identity-scoped subscriptions — same substrate, different surface per machine.
This is where the architecture gets genuinely elegant. The subscription model means your work laptop and your personal machine can see different slices of the same catalogue. You don't reinstall anything. You don't reconfigure plugins. You just change your subscriptions, and the next SessionStart pin reflects the new set. For teams, this is huge — a team lead curates a namespace, team members subscribe to it, and when the lead adds a new skill, it surfaces automatically.
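A tiny sketch of that subscription model, with made-up identities and namespaces:

```typescript
// Sketch: identity-scoped subscriptions over one shared catalogue.
// All types and data here are illustrative.
interface CatalogueEntry {
  namespace: string;
  qualifiedName: string;
  description: string;
}

interface Subscription {
  identity: string;
  namespace: string;
}

const subscriptions: Subscription[] = [
  { identity: "work-laptop", namespace: "platform-team" },
  { identity: "work-laptop", namespace: "db-tools" },
  { identity: "personal-machine", namespace: "hobby-gamedev" },
];

// Same catalogue, different surface per machine: a pinned index is just the
// catalogue filtered by that identity's namespace subscriptions. Change the
// subscriptions, and the next SessionStart pin reflects the new set.
function indexFor(identity: string, catalogue: CatalogueEntry[]): CatalogueEntry[] {
  const namespaces = new Set(
    subscriptions
      .filter((s) => s.identity === identity)
      .map((s) => s.namespace)
  );
  return catalogue.filter((entry) => namespaces.has(entry.namespace));
}
```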
There's a parallel here to package registries versus vendoring. For years, the debate was "do you check your dependencies into your repo, or do you fetch them from a registry at build time?" Vendoring gives you reproducibility and offline access. Registries give you centralized updates and smaller repos. Neither is universally correct — it depends on your dependency count and your tolerance for build-time fetches.
Or container registries versus baking images directly onto machines. The substrate pattern shows up everywhere once you start looking — centralize the catalogue, distribute thin clients, fetch on demand.
MCP itself is starting to hit this. You've got aggregator services now that consolidate multiple MCP servers behind a single endpoint. That's the same instinct — don't make every client configure fifteen server connections, give them one connection that routes to many backends.
The trend line is only going one direction. As agentic systems get more capable, the number of tools they can potentially invoke is going to explode. We're going from "here are your five tools" to "here are your five hundred tools, pick the right one." You cannot eagerly load five hundred tool definitions. The context math simply doesn't work. You have to move to a describe-then-fetch model.
Which brings us to the description. You said earlier that the description becomes the load-bearing artefact in this architecture.
In an eager-loading world, the description is documentation. Nice to have, helps the model understand when to use the skill, but if it's a bit vague, the model can also see the full implementation and figure it out. In a lazy-fetch world, the description is retrieval bait. It's the only thing the model sees when it's deciding whether to fetch the skill. If your description is vague, the model won't know to call get underscore skill. If it's misleading, the model fetches the wrong thing, wastes a round-trip, and then has to try again. The description stops being documentation and starts being the primary retrieval surface.
Writing a good description becomes a skill in itself. You're essentially doing SEO for your agent.
I hate that that's the right analogy, but it's the right analogy. You need the description to be specific enough that the model can disambiguate between similar skills — "run database migrations" versus "generate migration files" versus "validate schema against production" — and concise enough that it doesn't bloat the pinned index. It's a tight constraint. Maybe sixty to a hundred words that capture exactly what the skill does, when to use it, and what the prerequisites are.
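Here's what that constraint looks like in practice, with made-up catalogue entries. The vague one is retrieval-dead; the specific ones let the model disambiguate three adjacent skills:

```typescript
// Illustrative skill descriptions, compressed for the example.
const vague = {
  qualifiedName: "db-tools/migrate",
  description: "Does database stuff.", // the model will never fetch this
};

const specific = [
  {
    qualifiedName: "db-tools/run-migrations",
    description:
      "Applies pending SQL migrations to the target database. Use when the " +
      "schema is behind the migration files. Requires a configured DATABASE_URL.",
  },
  {
    qualifiedName: "db-tools/generate-migration",
    description:
      "Generates a new migration file from a described schema change. Use " +
      "when adding or altering tables; does not touch the database.",
  },
  {
    qualifiedName: "db-tools/validate-schema",
    description:
      "Diffs the local schema against production and reports drift. Read-only; " +
      "use before deploys or after manual hotfixes.",
  },
];
```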
The namespace adds another layer. Qualified name plus namespace means you can have multiple skills with similar names in different contexts without collision. The description plus the namespace together give the model enough signal to route correctly.
One thing I want to flag that I think is underappreciated: the bridge plugin also exposes discovery tools beyond the pinned index. List underscore namespaces, list underscore skills, search underscore skills. The search uses vector embeddings — pgvector in the substrate — so the model can do semantic search across the catalogue when the pinned descriptions aren't enough. That's the fallback. The primary path is "read the pin, pick the skill, fetch it." The secondary path is "search the catalogue for something I didn't know I needed."
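A sketch of that search fallback, assuming a node-postgres client and a hypothetical embed() helper. The "<=>" operator is pgvector's real cosine-distance operator; the table layout is invented:

```typescript
import { Pool } from "pg";

// Sketch of the search_skills fallback. The table layout and the embed()
// helper are assumptions; "<=>" is pgvector's cosine-distance operator.
declare function embed(text: string): Promise<number[]>;

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function searchSkills(query: string, k = 5) {
  const queryEmbedding = await embed(query);
  const { rows } = await pool.query(
    `SELECT qualified_name, namespace, description
       FROM skills
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [JSON.stringify(queryEmbedding), k]
  );
  // Compact results only; the agent still calls get_skill to load one.
  return rows;
}
```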
Which is interesting because it mirrors RAG architectures. You have a compact retrieval index — the pinned descriptions — and a denser retrieval path when the index misses. Same shape as "chunk your documents, embed them, retrieve top K, feed to model."
The embedding cache matters here too. If you're searching the catalogue repeatedly, you don't want to recompute embeddings for every query. The cache sits in front of pgvector and makes repeated searches fast. It's a small detail but it's the kind of thing that separates a prototype from something you'd actually run.
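The cache itself can be as simple as memoization keyed by a content hash. A minimal sketch, with the wrapped embed() call left hypothetical:

```typescript
import { createHash } from "node:crypto";

// Minimal sketch of the embedding cache: memoize by content hash so repeated
// searches in an agentic loop skip recomputation. embed() is hypothetical.
declare function embed(text: string): Promise<number[]>;

const cache = new Map<string, number[]>();

async function cachedEmbed(text: string): Promise<number[]> {
  const key = createHash("sha256").update(text).digest("hex");
  const hit = cache.get(key);
  if (hit) return hit; // repeated query: served from the cache
  const vector = await embed(text);
  cache.set(key, vector);
  return vector;
}
```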
Let's talk about trust, because Daniel mentioned signing and authorship tiers. Once you have a shared substrate, supply-chain hygiene stops being optional.
In the native install model, you're pulling plugins from wherever — npm, GitHub, a zip file someone sent you — and you're trusting that what you downloaded is what you think it is. But the blast radius is limited to your machine. In a shared substrate model, if someone publishes a malicious skill to a namespace that a whole team subscribes to, the blast radius is the whole team.
Ed25519 signing on every record. The substrate can verify that a skill was published by the author it claims to be from.
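Verification can lean on Node's built-in ed25519 support. A sketch, with the key and signature encodings assumed, and the canonicalization of record bytes glossed over:

```typescript
import { createPublicKey, verify } from "node:crypto";

// Sketch of record verification with Node's built-in ed25519 support. Key
// and signature encodings are assumptions; canonicalizing the record bytes
// before signing is a real-world detail this example skips.
function verifyRecord(
  recordBytes: Buffer,
  signatureB64: string,
  publicKeyDerB64: string
): boolean {
  const publicKey = createPublicKey({
    key: Buffer.from(publicKeyDerB64, "base64"),
    format: "der",
    type: "spki",
  });
  // For ed25519, the digest algorithm argument must be null.
  return verify(null, recordBytes, publicKey, Buffer.from(signatureB64, "base64"));
}
```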
There's an authorship distinction — user-authored versus third-party. Skills you wrote yourself get a different trust tier than skills pulled from a community namespace. The subscription model means you can decide "I trust my team's namespace implicitly, I'll auto-subscribe to new skills there, but third-party namespaces require manual review."
This feels like the early days of package managers. npm had the left-pad incident. PyPI has had typo-squatting attacks. The substrate model inherits all those problems and adds "the package can execute arbitrary agent actions" to the threat model.
Which is terrifying, honestly. A malicious skill doesn't just run bad code on your machine — it can instruct the agent to do things in your name, with your credentials, in your session. The signing and trust tiers aren't decorative. They're load-bearing security infrastructure.
Where does this pattern show up next? You mentioned team-shared agent tooling. I'm thinking about enterprise rollouts — companies that want to give every developer access to a curated set of internal tools without managing plugin installations across hundreds of machines.
That's the obvious one. But I think there's a less obvious one: MCP server proliferation is about to get wild. Right now, if you want to give Claude Code access to your company's internal API, you write an MCP server, you document the tools, you tell everyone to install it. Fine for one server. What happens when your company has forty internal APIs and each has an MCP server? Nobody's installing forty servers. You need a substrate that aggregates them, and you need lazy fetch so the model isn't drowning in tool definitions.
The "load all tools" model breaks at a certain scale, and we're going to hit that scale fast.
I saw something relevant to this — there was a piece a few months back about how AI firms are starting to give the US government early access to models for evaluation before launch. Microsoft, Google, xAI all signed on. And the subtext there is that as these models get deployed in sensitive contexts, the tooling surface matters enormously for security review. You can't audit what you can't see. A substrate model with signed records and an authorship trail makes auditing tractable in a way that "everyone installed random plugins from GitHub" absolutely does not.
That's a good connection. The governance story gets better when you have a single catalogue to audit rather than N machines to inspect.
Let me circle back to something you said earlier about the description being retrieval bait. I think there's a knock-on effect here that's worth naming: it changes how plugin authors think about their work. In an eager-loading world, you write the implementation first and the description is an afterthought. In a lazy-fetch world, the description is the interface. If the model can't find your skill, it doesn't matter how good the implementation is. Plugin authors have to invert their priorities — the description gets the same care as the code.
Which is uncomfortable for a lot of developers. We're used to writing code that speaks for itself. Now you're writing prose that has to convince an AI to look at your code.
It has to be honest prose. If you oversell your skill in the description, the model fetches it, discovers it can't do what you claimed, and now you've burned trust and a round-trip. The description is a promise, and the implementation has to keep it.
The constraint is: concise enough to not bloat the pin, specific enough to disambiguate from similar skills, honest enough that the fetch isn't wasted, and compelling enough that the model actually chooses it. That's a hard writing problem.
It's basically a query-document matching problem where the query is the agent's internal reasoning about what it needs, and the document is your sixty-word description. And you don't get to see the query. You have to anticipate what the agent might be thinking when it needs your skill.
This is why the vector search fallback matters. The pinned descriptions handle the common case — the agent knows roughly what it wants and can scan the index. But when the agent's need is fuzzy — "I need something that helps with database stuff, not sure what" — the semantic search over embeddings catches the long tail. The description still matters because it's what gets embedded, but the retrieval mechanism is more forgiving.
The embedding cache makes that fast. Without it, every search query hits pgvector and recomputes similarity scores across potentially thousands of skills. With the cache, repeated searches — which happen a lot in agentic loops — stay cheap.
Let's step back and talk about when this pattern doesn't pay off. You already mentioned the low-plugin-count case.
If you're on a plane or in a secure environment with no network access, the bridge pattern breaks because the substrate is unreachable. Eager loading works fine offline — everything's already local. You'd need some kind of local cache or offline mode for the bridge, which adds complexity.
Latency-sensitive workflows too. If you're in a tight loop where the agent is rapidly switching between skills — "run the linter, now run the tests, now check the schema, now deploy" — each skill switch incurs a fetch round-trip. In eager mode, all those skills are already in context. The bridge adds latency to every switch.
Though I'd argue that in practice, most sessions don't involve rapid skill switching. The agent typically settles into a workflow and uses a small subset of available skills. The lazy fetch cost is paid once per skill per session, not once per invocation. Once get underscore skill returns, the skill definition is in context for the rest of the session.
So the overhead is bounded by the number of distinct skills used, not the number of skill invocations.
And that number is usually small. Even in a complex session, you might use five or six distinct skills. The bridge pattern means you pay a round-trip for each of those five or six, and zero context cost for the twenty skills you didn't use. Eager loading means you pay context cost for all twenty-five, regardless.
The math tilts toward lazy fetch as the ratio of installed skills to used skills increases. Which, in a shared substrate with rich catalogues, is basically always.
This connects to something I've been thinking about with agent architectures more broadly. We're moving from "the model has capabilities" to "the model has access to capabilities." The distinction matters. A model's native capabilities are always available, zero latency, zero context cost — they're baked into the weights. Accessed capabilities — tools, skills, plugins — have retrieval costs. The more capabilities you access rather than bake in, the more the retrieval architecture matters.
That's a good framing. And it suggests that as models get better at using tools, the tool catalogue becomes the bottleneck, not the model. You can have the smartest agent in the world, but if it can't efficiently find the right tool, it's crippled.
Which is exactly why the description becomes load-bearing. The description is the bridge between the model's reasoning and the capability catalogue. Get it wrong, and the smartest model in the world picks the wrong tool.
There's an analogy here that I think works — it's like a library catalogue. A library with a million books and no catalogue is just a warehouse. The catalogue is what makes it a library. The descriptions in the pinned index are the catalogue cards. Get the catalogue right, and the model can navigate a million skills. Get it wrong, and you've just got a warehouse.
The namespace system is the Dewey Decimal part. It groups related skills so the model can reason about categories, not just individual entries.
Alright, I want to push on one more thing before we wrap. Daniel mentioned that the substrate uses pgvector for embeddings and that vector search is a fallback, not the primary retrieval path. Why not make vector search the primary path? It's more flexible.
Because the pinned index is deterministic and zero-latency for the model. The model can read the index directly — it's just text in the system prompt. No tool call, no round-trip, no embedding computation. Vector search requires a tool call, embedding the query, running similarity search, returning results. That's a full round-trip plus computation. For the common case — "I need the database migration skill" — scanning a text index is faster and more reliable than semantic search.
The pinned index is the fast path, vector search is the discovery path.
And they serve different needs. The pinned index answers "which of my subscribed skills does this?" Vector search answers "is there anything in the broader catalogue that does this, even if I'm not subscribed?" One is retrieval from known territory, the other is exploration of unknown territory.
That's a clean separation. And it means the description has to serve both paths — it needs to be scannable in a text list and semantically match relevant queries in embedding space.
Same text, two different retrieval mechanisms, both have to work. No pressure on the plugin author.
To land this: the broader architectural lesson is that once your tool inventory crosses some threshold — maybe ten plugins, maybe twenty, depends on description length — a substrate beats per-client install. You get centralized curation, identity-scoped subscriptions, signed records for trust, and lazy fetch that keeps your context window lean. The cost is added complexity, network dependency, and the burden of writing descriptions that actually work as retrieval bait.
I think the threshold is lower than most people assume. Even five plugins with verbose descriptions can eat meaningful context. The moment you feel that friction — "why is my session slower than it used to be?" — you've probably crossed it.
The other lesson is that this pattern isn't specific to Claude Code or MCP. Any agentic system that accumulates tools over time is going to hit the same wall. The describe-then-fetch pattern is going to become standard infrastructure, like package registries did for code dependencies.
I'd go further. I think in two years, we'll look back at eager loading of tool definitions the way we now look at hardcoding configuration values in source files. It works fine at small scale, and then it doesn't, and the migration is painful if you haven't planned for it.
The people who wrote good descriptions from the start will have a much easier migration than the people who wrote "does database stuff" and called it a day.
Write your descriptions like your agent's capabilities depend on them. Because increasingly, they do.
Now: Hilbert's daily fun fact.
Hilbert: In the nineteen twenties, a British diplomat stationed in Tajikistan attempted to train a local ibex to deliver mail between mountain villages. The ibex instead formed an unexpected partnership with a stray dog — the dog carried the mail, the ibex cleared the path of snow with its horns. They worked as a team for nearly three years before retiring to a goat farm together.
...right.
I have so many questions, and I'm going to ask none of them.
This has been My Weird Prompts. Thanks to Hilbert Flumingtop for producing. If you enjoyed this episode, leave us a review wherever you listen — it helps. We're back next week.