So Daniel sent us this one, and it's right in the wheelhouse of something I've been noticing every time I use a coding agent for anything longer than a one-shot task. He's asking about the emerging tooling for intelligent context management in AI coding agents — specifically the core pain point where users have to manually decide when to clear context and start fresh, but doing so throws away everything the agent has learned about your codebase, your decisions, your constraints. He wants to know what frameworks are actually building smart, autonomous session management — where the harness itself decides when to compress, clip, or restart, and what the landscape looks like for truly autonomous context lifecycle management. There's a lot to unpack here.
There really is. And I should say — I'm Herman Poppleberry, for anyone new to the show — and by the way, today's script is powered by Claude Sonnet 4.6, which feels appropriately on-the-nose given the topic. But yeah, this problem has a formal name now. The field is calling it context rot, and I think naming it was actually important because it helped people stop treating this as a vague annoyance and start treating it as a structural engineering problem.
Context rot. It sounds like something that happens to old vegetables, but the description is disturbingly accurate.
It really is. The mechanism is specific. As a session grows longer, the model doesn't degrade uniformly — it degrades in a particular way. The foundational information from early in the session, your architectural decisions, your original task framing, the constraints you set — that stuff gets compressed or pushed out. What fills the window instead is operational exhaust. Tool outputs, repeated file views, intermediate reasoning traces. High volume, low value. The signal-to-noise ratio collapses.
And the frustrating thing is the user usually can't see it happening. You're just chatting away and then at some point the agent asks you something it already answered three hours ago and you realize — oh, it's gone. It forgot.
The Atlassian engineering team put it well in their Rovo Dev blog post from late March. They described it as the agent accumulating "a growing trail of context" — user requests, assistant responses, tool arguments, tool results, file contents, search results, workspace views, intermediate reasoning. And over time that history becomes large enough to crowd out what actually matters for the next decision. The Chroma research team has also been publishing on this as a named phenomenon, and it's now referenced across multiple engineering blogs as the central problem to solve.
So given that the problem is now named and understood, what are people actually building? Because right now the user experience is... you watch a percentage counter, you get a warning at eighty or ninety percent, you manually type slash compact, and you hope for the best.
That's the current state, yeah. And it's genuinely bad. But the solution landscape is fracturing into five distinct approaches, each with a different philosophy about what context management actually means. The first and most prominent is server-side compaction, which is what Anthropic has shipped. They have a beta API — the header is compact-2026-01-12 — where the model automatically summarizes the conversation when it approaches a configurable token threshold. The default trigger is one hundred fifty thousand tokens, but you can set it as low as fifty thousand.
And how does that actually work mechanically? Because "automatically summarizes" could mean a lot of things.
The mechanics are interesting. When the threshold is hit, Claude generates a compaction block containing a structured summary, then continues the response with that compacted context. On subsequent requests, all message blocks prior to the compaction block are automatically dropped. The model essentially hands itself a briefing document and forgets the source material. You can also pass custom summarization instructions that completely replace the default prompt — so instead of the generic "write down anything helpful including state, next steps, learnings" default, you could say "focus on preserving code snippets, variable names, and architectural decisions." That's a meaningful customization.
So it's like the model writes its own handover document, automatically. Which is what developers were doing manually, but now the agent does it for itself.
Right. And there's a pause-after-compaction option where the API stops after generating the summary, letting the harness inject additional content before continuing. You can also use it to track total token budget across an entire long-horizon task — count how many compactions have happened, multiply by the trigger threshold, and you have a rough estimate of cumulative usage. That lets you gracefully wrap up before hitting hard limits. Claude Code's slash compact command is the user-facing version of this, but the API is what harness builders are actually using.
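In code, that back-of-the-envelope budget tracking looks something like this. This is a sketch of harness-side bookkeeping, not any real API; every name and the 90% headroom figure are illustrative assumptions:

```python
def estimate_cumulative_tokens(compaction_count: int,
                               trigger_threshold: int,
                               current_window_tokens: int) -> int:
    """Rough cumulative usage across a long-horizon task: each past
    compaction fired at roughly the trigger threshold, plus whatever
    is in the window right now."""
    return compaction_count * trigger_threshold + current_window_tokens


def should_wrap_up(compaction_count: int, trigger_threshold: int,
                   current_window_tokens: int, hard_limit: int) -> bool:
    """Signal the harness to gracefully wind down before hitting a
    hard token budget, leaving headroom for a final summary."""
    used = estimate_cumulative_tokens(
        compaction_count, trigger_threshold, current_window_tokens)
    return used >= 0.9 * hard_limit  # 90% cutoff is an arbitrary choice
```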
There's a really interesting timing insight buried in here though, about when to trigger compaction. Because most users are doing it at eighty or ninety percent when the warnings appear.
This is one of the more actionable things I've seen written about this recently. The MindStudio team published a guide in early April making the case that you should compact at sixty percent utilization, not ninety-five. The argument is precise: at sixty percent, the model still has full uncompressed access to everything in the window, so the summary it generates is high quality. By eighty or ninety percent, you're asking the model to summarize a degraded view of the conversation — it's already been doing lossy compression internally just to function. You're summarizing a summary. The handover document degrades with the model.
That's a genuinely counterintuitive point. You'd think you'd want to wait as long as possible before compacting. But you're actually compacting at the worst possible moment.
It's the same problem as trying to back up a failing hard drive — you want to do it before the degradation, not after. And Claude Code now supports post-compaction hooks, which are deterministic scripts that fire after a compaction event. So a harness can use that hook to trigger custom context renewal workflows — re-inject certain files, reload key documentation, whatever the task requires.
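The sixty-percent rule is simple enough to encode as an automatic harness-side trigger rather than a user decision. A minimal sketch, with the threshold taken from the discussion and the function name made up for illustration:

```python
COMPACT_AT = 0.60  # compact while the model still sees everything clearly

def should_compact(used_tokens: int, window_size: int,
                   threshold: float = COMPACT_AT) -> bool:
    """Fire compaction early, at ~60% utilization, so the summary is
    generated from a full-fidelity view of the conversation rather
    than from an already-degraded view at 90%+."""
    return used_tokens / window_size >= threshold
```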
Okay so that's approach one, server-side compaction. What's the philosophical alternative? Because I know Atlassian has a pretty different take.
Atlassian's approach with Rovo Dev is philosophically distinct, and they're explicit about it. Their framing is: compaction should be the last resort, not the default. Their alternative is structure-aware pruning — what they call a least-destructive-first cascade. The idea is that summarization destroys the structure of the conversation. Tool calls become prose. Message boundaries disappear. You lose the schema of what happened. Pruning, by contrast, is structured forgetting — you drop specific content while preserving the format of everything that survives.
So walk me through the cascade. What gets dropped first?
In order of least to most destructive: first you trim large machine-generated tool outputs, which are often the biggest and lowest-value content. Then you remove redundant or low-value intermediate steps. Then you compress assistant responses more aggressively than user messages — the reasoning being that user messages are higher-signal because they represent actual intent. Then you collapse intermediate scaffolding from multi-step tool use. And only as an absolute last resort do you do summary collapse, the full LLM-based compaction.
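A minimal sketch of that cascade follows. Atlassian hasn't published Rovo Dev's actual implementation, so the message shapes, thresholds, and edge-protection count here are all illustrative assumptions:

```python
def prune_cascade(history, budget, count_tokens,
                  protect_edges=3, big_tool_output=1_000):
    """Least-destructive-first pruning over messages shaped like
    {"role": "user"|"assistant"|"tool", "content": str}.
    Returns (history, action): "pruned" if mechanical pruning sufficed,
    or "summarize" if LLM-based compaction is still needed."""
    def over():
        return sum(count_tokens(m["content"]) for m in history) > budget

    if not over():
        return history, "pruned"

    # 1. Trim large machine-generated tool outputs: biggest, lowest value.
    for m in history:
        if m["role"] == "tool" and count_tokens(m["content"]) > big_tool_output:
            m["content"] = m["content"][:200] + " [tool output trimmed]"
    if not over():
        return history, "pruned"

    # 2. Drop intermediate steps from the bulky middle, protecting the
    #    edges: task framing at the start, recent turns at the end.
    #    User messages survive because they carry actual intent.
    head, tail = history[:protect_edges], history[-protect_edges:]
    middle = [m for m in history[protect_edges:-protect_edges]
              if m["role"] == "user"]
    history = head + middle + tail
    if not over():
        return history, "pruned"

    # 3. Last resort: hand off to LLM-based summary collapse.
    return history, "summarize"
```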
And there's a spatial heuristic too, right? About where in the conversation the value lives?
The "protect the edges" heuristic. Always protect the beginning — task framing, constraints, original objective — and the most recent exchanges, which maintain local coherence and next-step planning. The middle is where operational exhaust accumulates. The bulky middle is your target. Atlassian's explicit comparison to LLM-based compaction is interesting: pruning is instant and free and preserves the original conversation structure, but it's mechanical — once content is dropped it's gone. LLM compaction can condense things into a human-readable narrative, but it adds latency and cost and destroys the structured format. Atlassian's conclusion is that if the expensive part of your session is mostly bulky machine-generated text, the best first move is to prune it mechanically rather than asking another LLM to rewrite the whole session.
There's something almost Buddhist about it. Structured forgetting. Let go of the noise, preserve the signal.
I'll let that one sit. The third approach is separate but related — it's about a different kind of context bloat that happens before the agent even starts working. MCP servers, the tool description layer, can consume an enormous amount of tokens just for their schemas. A single large MCP server can eat ten to seventeen thousand tokens just in tool descriptions.
That's staggering. You haven't done anything yet and you've already burned through a meaningful chunk of your window.
Atlassian Labs released an open source tool called mcp-compressor in late March that addresses this. It's an MCP proxy that wraps any existing MCP server and reduces tool-description overhead by seventy to ninety-seven percent. The GitHub MCP server goes from seventeen thousand six hundred tokens down to five hundred tokens at maximum compression. The mechanism is replacing the full tool inventory with three proxy tools: get-tool-schema, invoke-tool, and list-tools. The model doesn't need every tool schema loaded at once — it needs a reliable way to fetch the right schema when it decides it's relevant. That's progressive disclosure applied to the tool layer.
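The proxy pattern itself is small enough to sketch. The three tool names come from the discussion above, but this class is an illustration of the idea, not the real mcp-compressor implementation:

```python
class SchemaProxy:
    """Progressive disclosure at the tool layer: instead of loading every
    tool's full schema into context, the model sees three tiny proxy tools
    and fetches full schemas only on demand."""

    def __init__(self, full_schemas: dict, backend):
        self._schemas = full_schemas  # tool name -> full JSON schema
        self._backend = backend       # callable forwarding to the real server

    def list_tools(self) -> list:
        # Only names reach the context window, not full schemas.
        return sorted(self._schemas)

    def get_tool_schema(self, name: str) -> dict:
        # Full schema loaded only when the model decides it's relevant.
        return self._schemas[name]

    def invoke_tool(self, name: str, arguments: dict):
        # Forward the call to the wrapped MCP server.
        return self._backend(name, arguments)
```

A stable three-tool interface has a side benefit the hosts mention later: it doesn't change between turns, which keeps prompt caches warm.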
Progressive disclosure keeps coming up as a pattern across all of these approaches. Load what you need when you need it, not everything upfront.
It's the unifying principle. And it shows up in Claude Code's Skills system too, which is the harness-native approach to lazy context loading. Skills are descriptions of additional resources — instructions, documentation, scripts — that the model can pull in on demand when it decides they're relevant. The model reads the description, decides if the skill applies, and loads the full content only then. Path-scoped rules take this further — rules only load when relevant file types are in play. Bash rules load for shell files. The slash context command in Claude Code gives you transparency about what's actually consuming space in your window, which is useful for diagnosing where the bloat is coming from.
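Path-scoped rule loading reduces to a small matching step. A sketch, assuming a hypothetical rule shape of `{"glob": ..., "text": ...}`; the real Claude Code configuration format may differ:

```python
import fnmatch

def rules_for_files(rules, touched_files):
    """A rule's full text enters the window only when files matching
    its glob pattern are actually in play this turn."""
    active = []
    for rule in rules:
        if any(fnmatch.fnmatch(f, rule["glob"]) for f in touched_files):
            active.append(rule["text"])
    return active
```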
So we have compaction, pruning, MCP compression, and skills-based lazy loading. What's the fifth approach? Because I suspect it's the most radical.
By a significant margin. Letta — formerly MemGPT — takes the position that the entire session model is the wrong abstraction. Their framework, Letta Code, launched in December 2025 and now has around nineteen thousand GitHub stars. The core thesis is: instead of managing context within a session, build agents that persist across sessions. The session boundary becomes irrelevant because the agent's memory is always there.
How does that actually work at an architectural level? Because "persistent memory" sounds great until you ask where the memory lives and how it's organized.
The most interesting implementation is what they call Context Repositories, which they shipped in February of this year. It's a git-backed memory filesystem. The agent's context is stored in the local filesystem, and every change to memory is automatically versioned with informative commit messages. Git is the substrate — versioned, branchable, mergeable, human-readable. The file tree structure is always in the system prompt, so the agent navigates its memory by reading folder hierarchies and filenames. Each memory file has YAML frontmatter with a content description, similar to Claude Code's SKILL.md pattern. There's a system directory where files are always fully loaded, and the agent can manage its own progressive disclosure by reorganizing the hierarchy — moving things in and out of the system directory based on relevance.
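A git-backed memory write in that style can be sketched in a few lines. The frontmatter keys, file layout, and commit message format here are assumptions for illustration, not Letta's actual implementation:

```python
import pathlib
import subprocess

def write_memory(repo: str, rel_path: str, description: str, body: str) -> None:
    """Write a memory file with YAML frontmatter describing its contents,
    then commit it so every memory change is versioned with an
    informative message."""
    path = pathlib.Path(repo) / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"---\ndescription: {description}\n---\n\n{body}\n")
    subprocess.run(["git", "-C", repo, "add", rel_path], check=True)
    subprocess.run(
        ["git", "-C", repo,
         "-c", "user.email=agent@example.invalid", "-c", "user.name=agent",
         "commit", "-q", "-m", f"memory: update {rel_path}: {description}"],
        check=True)
```

The payoff is that the agent's memory history is inspectable with ordinary `git log` and `git diff`, and branchable and mergeable like any other repository.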
So the agent is essentially its own librarian. It decides what goes in the reading room and what goes in the stacks.
And it can run concurrent memory formation — multiple subagents processing different things and writing to memory in isolated git worktrees, then merging changes through standard git conflict resolution. You fan out across concurrent subagents, each building memory in parallel, then merge. That's a genuinely novel architecture. They also have sleep-time compute, which is a background process that runs during downtime and has the agent reflect on recent conversation history and persist important information into the memory repository.
Sleep-time compute. The agent processes and consolidates during idle periods rather than reactively when the window fills. That's a meaningful architectural inversion.
And there's a memory defragmentation skill for long-horizon use. Over time memories become disorganized — files get large, duplicates accumulate, the hierarchy gets messy. The defragmentation skill backs up the filesystem, launches a subagent that reorganizes files, splits large ones, merges duplicates, and restructures into a clean hierarchy of fifteen to twenty-five focused files. It's essentially maintenance for the agent's own cognitive state.
Letta also published something called the Context Constitution, which I want to make sure we talk about because it's a different kind of artifact than a blog post or a framework.
It's a set of principles governing how AI agents should manage context — published on GitHub in early April. The philosophical framing is striking. One of the key claims is that today's models deeply identify with their own ephemerality. They have no motivation for long-term improvement because they don't believe they persist. Context management, in this framing, isn't just a technical problem — it's about giving agents a sense of continuity and identity. Context forms an agent's identity. Context is a scarce resource to be managed. Agents should have a sense of continuity across model generations.
That last one is interesting. Continuity across model generations. Because if you're running Letta Code and Anthropic releases a new model next month, does the agent's memory transfer?
That's exactly the portability question, and it's where the memory lock-in battle becomes important. Letta also published an open file format called Agent File, dot af, for serializing stateful agents with persistent memory and behavior. The goal is portability across harnesses — you should be able to take your agent's memory and move it. Which brings us to what I think is the most politically charged dimension of this whole landscape.
Harrison Chase's alarm bell.
His April post is titled "Your harness, your memory," and the thesis is stark. If you use a closed harness, you don't own your memory. He maps out a spectrum of lock-in. At the mild end: stateful APIs like OpenAI's Responses API and Anthropic's server-side compaction store state on provider servers. Swapping models means losing thread continuity. Worse: closed harnesses like the Claude Agent SDK interact with memory in ways that are opaque and non-transferable. At the worst end: Anthropic's Claude Managed Agents puts everything — harness and long-term memory — behind an API. Zero ownership or visibility.
And OpenAI's approach is even more explicit about it.
OpenAI Codex generates an encrypted compaction summary. Explicitly not usable outside the OpenAI ecosystem. That's memory as competitive moat. The more months of context your agent accumulates in their system, the higher the switching cost. It's the same playbook as cloud vendor lock-in in the infrastructure era — except the thing being locked in is your agent's accumulated knowledge about your codebase.
And the open source responses to this are Letta's Context Repositories and Agent File format, and LangChain's Deep Agents.
LangChain is building Deep Agents as an open source, model-agnostic alternative. Open standards — they're referencing agents.md and agentskills.io — with plugins to Mongo, Postgres, Redis for memory storage, and fully self-hostable. The framing from Sarah Wooders, Letta's CTO, is worth quoting directly: "Asking to plug memory into an agent harness is like asking to plug driving into a car. Managing context, and therefore memory, is a core capability and responsibility of the agent harness." Memory isn't a feature you bolt on — it's constitutive of what a harness is.
Which connects to the broader conceptual shift that's happening. The harness is being reconceived as the product, not just the scaffolding around the model.
The LangChain anatomy piece from March defines the equation as: Agent equals Model plus Harness. If you're not the model, you're the harness. And the harness includes system prompts, tools, skills, MCPs and their descriptions, bundled infrastructure like filesystems and sandboxes, orchestration logic for subagent spawning and model routing, and crucially — hooks and middleware for deterministic execution. Compaction, continuation, lint checks. The harness is where the intelligence about the agent's own lifecycle lives.
There's a specific pattern in there called the Ralph Loop that I find genuinely clever.
The Ralph Loop is a harness-level pattern where the harness intercepts the model's attempt to exit a task via a hook, and reinjects the original prompt in a clean context window, forcing the agent to continue against a completion goal. The filesystem is what makes this possible — each iteration starts with fresh context but reads state from the previous iteration off disk. So you get the benefits of a clean context window without losing progress. The agent picks up where it left off by reading its own state, not by carrying it in the context.
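The control flow is compact enough to sketch. `run_session` here stands in for one full agent run in a fresh context window, and its `{"state": ..., "done": ...}` return shape is an assumption made for illustration:

```python
import json
import pathlib

def ralph_loop(run_session, original_prompt: str, state_file: str,
               max_iterations: int = 10):
    """Each iteration gets a FRESH context window seeded with the original
    prompt; progress lives on disk, not in the context. The loop exits
    only when a session reports the completion goal met (or the
    iteration budget runs out)."""
    path = pathlib.Path(state_file)
    for _ in range(max_iterations):
        # Read state persisted by the previous iteration, if any.
        state = json.loads(path.read_text()) if path.exists() else {}
        result = run_session(prompt=original_prompt, state=state)
        path.write_text(json.dumps(result["state"]))  # persist progress
        if result["done"]:
            return result["state"]
    return json.loads(path.read_text())  # budget spent; return best progress
```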
It's a clever inversion. Instead of managing one long context, you manage many short contexts that share state through the filesystem.
And tool call offloading is a related pattern — when a tool output exceeds a threshold, you keep the head and tail tokens in context and offload the full output to the filesystem. The agent can reference it if needed but doesn't carry the full weight in the window.
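Offloading is a few lines in practice. A sketch, with all thresholds chosen arbitrarily for illustration:

```python
import hashlib
import pathlib

def offload_if_large(output: str, spill_dir: str,
                     limit: int = 4_000, head: int = 800,
                     tail: int = 400) -> str:
    """Keep only the head and tail of an oversized tool output in context;
    write the full text to the filesystem with a pointer the agent can
    follow to re-read it on demand."""
    if len(output) <= limit:
        return output  # small outputs stay inline
    name = hashlib.sha256(output.encode()).hexdigest()[:12] + ".txt"
    path = pathlib.Path(spill_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output)
    return (output[:head]
            + f"\n[... {len(output)} chars total; full output at {path} ...]\n"
            + output[-tail:])
```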
I want to come back to the harness co-evolution problem because I think it's underappreciated. The fact that models are being post-trained with specific harnesses in the loop creates a feedback loop that's hard to escape.
The LangChain anatomy article surfaces this directly. Products like Claude Code and Codex are built by post-training the model with its harness in the loop. Useful primitives get discovered, added to the harness, and then used when training the next model generation. The result is that models become overfit to their native harness. The concrete example is Codex's apply-patch tool — a model that was genuinely capable of general patch application starts performing worse when you change the tool logic, even when the underlying task is identical. Because it was trained against a specific tool interface.
So the best harness for a model is the one it was trained with. But that's also the most locked-in option.
Which is a genuinely uncomfortable dynamic for anyone building model-agnostic infrastructure. The benchmark data is interesting here. Terminal-Bench 2.0 is the leading benchmark for coding agent performance, and LangChain's anatomy article cites a striking data point: Opus 4.6 in Claude Code scores significantly below Opus 4.6 in other harnesses. Same model, different harness, meaningfully different performance. Which means harness quality is a major independent variable — separate from model capability.
And Letta Code claims to be the number one model-agnostic open source harness on Terminal-Bench.
Comparable performance to provider-specific harnesses on their own models. That's significant if it holds. They also released Context-Bench in October last year, which evaluates how well models can chain file operations, trace entity relationships, and manage multi-step information retrieval in long-horizon tasks — specifically designed to test the memory and context management capabilities that standard benchmarks miss.
Let's talk about what this means practically for someone building on these systems. Because there's a real decision tree here.
The first decision is whether you want the session model at all. If you're building something that needs to run for hours or days, accumulate knowledge, and continue across restarts, the session model is probably the wrong abstraction. Letta's approach is worth evaluating. If you're building shorter-horizon tasks where sessions make sense, then you're choosing between compaction and pruning. And the choice depends on what your context looks like — if it's mostly bulky machine-generated tool outputs, Atlassian's argument for structure-aware pruning first is compelling. If it's more conversational and you want a human-readable handover, LLM-based compaction makes more sense.
The timing question matters regardless of which approach you take. Sixty percent, not ninety.
Build that into your harness as an automatic trigger, not a user decision. The user shouldn't be watching a percentage counter. That's the harness's job. And the sixty percent rule applies to both approaches — you want to compact or prune while the model still has full access to high-quality content, not after degradation has already set in.
On the MCP side, mcp-compressor seems like a no-brainer if you're using large MCP servers. Seventeen thousand tokens down to five hundred is not a marginal improvement.
It's the kind of thing where you wonder why it wasn't built sooner. The token cost of tool descriptions is a hidden tax that most people aren't tracking. And the cache-friendliness is a bonus — the compressed interface is stable across turns, which means better prompt cache hit rates. That compounds over a long session.
The memory lock-in question is the one I'd want every team building on these systems to think about before they commit. Because the switching costs aren't obvious upfront and they become very obvious eighteen months in.
The question to ask is: where does my agent's memory live, and can I export it? If the answer is "on a provider's servers in a format I don't control," that's a real risk. The open alternatives — git-backed filesystems, open file formats, self-hosted vector stores — require more initial engineering but preserve optionality. And given how fast this landscape is moving, optionality has real value.
There's also the continual learning layer above all of this that we haven't fully touched on. Harrison Chase's taxonomy has three layers — model weights, harness, and context — and the harness layer is the most interesting for practitioners because it's the most accessible.
The Meta-Harness paper from this year proposes end-to-end harness optimization: run the agent over tasks, evaluate results, store logs in the filesystem, then run a coding agent to analyze the traces and suggest harness code changes. The harness improving itself. That's a feedback loop that doesn't require retraining the model — it's the harness getting smarter through experience. Which is where OpenClaw's SOUL.md pattern comes in — a file that updates over time as the agent learns about the user's preferences and working style. Context as personalization.
The philosophical endpoint of all of this is agents that genuinely accumulate expertise. Not just within a session, but across months of work. That's a qualitatively different kind of tool.
And it raises the identity question that Letta's Context Constitution is pointing at. If an agent has six months of memory about your codebase, your team's conventions, the decisions you've made and why — is that agent meaningfully different from one that starts fresh? The answer is obviously yes. Which means context management isn't just a performance optimization. It's about what kind of agent you're building.
Alright, let me try to land this. The landscape right now has five distinct approaches: Anthropic's server-side compaction API, Atlassian's structure-aware pruning cascade, MCP-level compression via mcp-compressor, Claude Code's skills-based lazy loading, and Letta's memory-first architecture with git-backed Context Repositories. The unifying principle across all of them is progressive disclosure — don't load everything upfront, load what's relevant when it's relevant. The key timing insight is sixty percent, not ninety. And the political dimension is real — memory lock-in is the next platform battle, and the open alternatives exist but require intentional choice.
The harness is the product. That's the conceptual shift. If you're building serious agentic systems and you're not thinking about context lifecycle as a first-class engineering concern, you're building on sand. The model is almost a commodity at this point — the harness is where the differentiation lives.
On that slightly unsettling note. Thanks as always to our producer Hilbert Flumingtop for keeping this show running. And a big thanks to Modal for providing the GPU credits that power the pipeline — genuinely couldn't do it without them. This has been My Weird Prompts. If you want to follow us on Spotify, search My Weird Prompts and hit follow — that's the easiest way to make sure you don't miss an episode. Until next time.
See you then.