#2366: Why LLMs Forget the Middle of Long Conversations

Why do large language models struggle with the middle of long conversations? Explore the science behind attention dilution and practical fixes.

Episode Details
Episode ID
MWP-2524
Published
Duration
38:11
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Manual Script

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Why Large Language Models Forget the Middle of Long Conversations

Large language models (LLMs) like Claude and GPT have revolutionized AI, but they’re not without quirks. One persistent issue is their tendency to lose track of information in the middle of long conversations or documents. This phenomenon, often referred to as "attention dilution," stems from the fundamental mechanics of transformer attention.

At the heart of the problem is the transformer’s self-attention mechanism, introduced in the seminal 2017 paper "Attention Is All You Need." In self-attention, each token in a sequence generates a query, key, and value vector. The model computes attention scores by taking the dot product of query and key vectors, then applies a softmax function to normalize these scores into probabilities. However, because the softmax forces those probabilities to sum to one, the attention weight available to any single token shrinks as the context grows longer.

This dilution effect is particularly pronounced in the middle of long sequences. Tokens at the edges—beginning and end—receive disproportionately high attention due to a combination of factors. First, positional encodings, which signal a token’s location in the sequence, tend to be more reliable at the edges. Second, training data biases models to prioritize edge information. For instance, news articles often front-load key facts, while academic papers feature abstracts and conclusions.

Research from Stanford’s "Lost in the Middle" paper highlights this U-shaped accuracy curve, where models perform best when relevant information is near the beginning or end of a context and struggle when it’s buried in the middle. This pattern is consistent across various model families, suggesting it’s a structural feature of transformer attention, not a training quirk.

Engineering solutions like Claude Code’s periodic reminders aim to mitigate this issue by explicitly reinforcing the conversation’s focus. But fundamentally, the problem persists due to the softmax constraint and training biases. Techniques like Rotary Position Embeddings (RoPE) and Attention with Linear Biases (ALiBi) attempt to improve long-context handling, but each comes with trade-offs.

Ultimately, attention dilution is a challenge rooted in transformer architecture and training dynamics. While workarounds exist, it remains an open research question how to balance long-context handling with computational efficiency and model accuracy.


#2366: Why LLMs Forget the Middle of Long Conversations

Corn
Welcome back to My Weird Prompts. I'm Corn Poppleberry, here as always with my co-host Herman Poppleberry, and today we are going deep on something that I think every developer building with language models has bumped into, whether they realised what it was or not. Our producer Daniel sent in a two-part prompt that lays it out beautifully.
Herman
And this one has been sitting in the queue for a while, because it's actually two genuinely meaty problems that connect in an interesting way. The first is a technical question - why do large language models seem to lose track of things in the middle of long conversations? The second is a practical engineering question - what do you do about it?
Corn
Right. And Daniel specifically called out something he noticed in Claude Code, which is that the orchestrator Claude gets a periodic reminder message every five minutes or so - basically a short system-level note saying, in effect, "here is what we are currently working on." And he's pretty sure that feature exists specifically because of the mid-context problem. Which I think is a great observation.
Herman
It is. And before we get into the Claude Code angle, we should establish why the middle is hard. Because the problem is not arbitrary - it follows from first principles of how transformer attention works, and once you understand those first principles, the engineering solutions become pretty obvious.
Corn
Alright, let's build it from the ground up. Where do you want to start?
Herman
The foundational mechanism is self-attention. This is the core operation in transformer models - the thing that made the original "Attention Is All You Need" paper from two thousand and seventeen such a landmark. The idea is that when a model processes a sequence of tokens, each token is allowed to attend to every other token. It's not reading left to right in a fixed window, the way older recurrent architectures did. Every token can, in principle, look at every other token simultaneously.
Corn
Which gives you this incredibly powerful ability to capture long-range relationships. A word at the end of a thousand-word passage can directly attend to a word at the beginning. No gradient vanishing, no forgetting through sequential processing. The whole sequence is in play at once.
Herman
And the mechanism works through three vectors. Each token in the sequence produces a query vector, a key vector, and a value vector. The attention score between token A and token B is computed by taking the dot product of A's query with B's key, scaling it, and passing it through a softmax function. The softmax converts those raw scores into a probability distribution over all the tokens in the sequence. Then the output for token A is a weighted sum of all the value vectors, weighted by those attention probabilities.
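Herman's three-vector description can be sketched in a few lines of Python. This is a toy, single-query version (the vectors and values are illustrative, not from any real model):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query token.
    Returns the attention-weighted sum of the value vectors."""
    d = len(query)
    # Raw score: dot(query, key), scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # a probability distribution summing to one
    # Output: weighted sum of all value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy example: the query aligns most strongly with the second key,
# so the output skews toward the second value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out = attend([0.0, 2.0], keys, values)
```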
Corn
So each token is essentially asking: "which other tokens in this sequence are most relevant to me right now?" And the softmax answer is a normalised distribution over the whole context.
Herman
That normalisation constraint is the key to understanding the problem. The softmax probabilities must sum to one. Always. That's not optional - it's the definition of softmax. So if you have a context of five hundred tokens, the total attention weight of one is being distributed across five hundred candidates. If you extend that context to fifty thousand tokens, the same total weight is now being distributed across fifty thousand candidates.
Corn
And a token in the middle of that fifty thousand token sequence is competing against forty-nine thousand nine hundred and ninety-nine other tokens for a share of attention that has to sum to one. Even if it's quite relevant, the absolute weight it receives is tiny compared to what it would receive in a shorter context.
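The dilution Corn describes is easy to make concrete. A rough Python sketch, with an arbitrary logit of 3.0 standing in for "quite relevant" against a sea of neutral tokens:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def relevant_token_weight(n_tokens, logit=3.0):
    """Attention weight won by one clearly relevant token (logit 3.0)
    competing against n_tokens - 1 neutral tokens (logit 0.0)."""
    scores = [0.0] * n_tokens
    scores[n_tokens // 2] = logit   # the relevant token, mid-context
    return softmax(scores)[n_tokens // 2]

short = relevant_token_weight(500)      # ≈ 0.039
long = relevant_token_weight(50_000)    # ≈ 0.0004
```

Same token, same relevance; roughly two orders of magnitude less attention purely because the pool of competitors grew.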
Herman
This is what researchers call attention dilution. It's not that the model forgets the middle token in the way a human might forget something. It's that the attention weight allocated to it becomes negligible relative to the tokens at the edges. And edges are disproportionately strong here for a few reasons we'll get into.
Corn
So the signal from the middle gets diluted. And there's empirical evidence for exactly how bad this is, right? The U-shaped accuracy curve?
Herman
Yes. This is from a paper called "Lost in the Middle: How Language Models Use Long Contexts," by Nelson Liu and colleagues at Stanford, published in two thousand and twenty-three. They set up a very clean experiment. They took multi-document question answering tasks, where the model needed to find a specific relevant document among a set of documents. They varied where they placed the relevant document - at the beginning of the context, somewhere in the middle, or near the end. And they measured accuracy across these conditions for several different models.
Corn
What did they find?
Herman
A consistent U-shape. Accuracy was highest when the relevant information was near the beginning of the context. It was also pretty good when the information was near the end. And it dropped significantly when the information was buried somewhere in the middle. The shape of the curve was remarkably consistent across the different model families they tested. Some models were better than others overall, but all of them showed this characteristic degradation in the middle.
Corn
Which tells you it's not a quirk of one particular model's training. It's something structural.
Herman
It appears to be a feature of transformer attention itself, at least in its current implementations. And to understand why the beginning and end are both privileged while the middle suffers, you need to look at positional encodings - the mechanism by which the model knows where each token sits in the sequence.
Corn
Because the self-attention operation itself is permutation-invariant. If you scrambled all the tokens randomly, the dot products would produce the same scores. You need some external signal to tell the model about position.
Herman
Right. The original transformer used sinusoidal absolute positional encodings. You compute a fixed mathematical vector for each position and add it to the token embedding. Position one gets one vector, position two gets another, and so on. It works, but it has a fundamental limitation: the model is trained on sequences up to some maximum length, and if you try to run it on something longer, the positional signals at the new positions are outside the training distribution. The model has literally never seen those position values before.
Corn
Which is why context length extension is a research problem in itself.
Herman
Exactly. And it led to the modern alternatives. RoPE - Rotary Position Embeddings - is what most contemporary models use. The Llama family uses it. Most of the competitive open-source models use it. The idea is that instead of adding a position signal to the embedding, you rotate the query and key vectors by an angle proportional to the position. And crucially, this rotation is designed so that the dot product between a query at position m and a key at position n depends only on their relative offset - the difference m minus n - rather than on their absolute positions. This makes RoPE more generalisable to sequences longer than the training distribution, because the model just needs to extrapolate to larger relative offsets, not entirely new absolute positions.
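The relative-offset property Herman mentions can be checked on a single 2-D frequency pair. A minimal Python sketch (the angle scale `theta` is arbitrary, and real RoPE applies this rotation across many such pairs):

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D (query or key) vector by an angle proportional
    to its position -- the core RoPE operation on one frequency pair."""
    angle = pos * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    qx, qy = rotate(q, q_pos)
    kx, ky = rotate(k, k_pos)
    return qx * kx + qy * ky

q, k = (1.0, 0.5), (0.3, 0.8)
# Same relative offset (2) at very different absolute positions:
a = score(q, k, 5, 3)
b = score(q, k, 105, 103)
# a and b agree up to float error: the score depends only on the offset.
```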
Corn
But RoPE still has a bias problem, right? Because the model sees certain relative offsets much more often in training than others.
Herman
Precisely. During training, the model sees enormous numbers of pairs at small offsets - adjacent tokens, tokens a few positions apart - and relatively few pairs at large offsets. So the attention mechanism at short range is very well calibrated, and at long range it becomes progressively less reliable. The gradient signal that the model uses to tune its attention has mostly come from nearby token pairs. Long-range attention is extrapolation, and it's less precise.
Corn
Then there's ALiBi, which is the other major alternative. I find ALiBi almost aggressively simple.
Herman
It is. Attention with Linear Biases - it adds a fixed scalar bias to each attention score, proportional to the distance between the two tokens; the per-head slopes are set in advance rather than learned. Nearby tokens get a small negative bias or no bias at all. Tokens far away get a larger negative bias. It's literally penalising the model for attending to distant context. The elegant thing about ALiBi is that it generalises naturally to sequences longer than training, because the bias just keeps increasing linearly - you're not doing anything structurally different, just extending the penalty.
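ALiBi's penalty is simple enough to sketch directly. Illustrative Python, with an arbitrary slope of 0.5 (real models use a fixed geometric sequence of slopes across heads):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def alibi_weights(raw_scores, query_pos, slope=0.5):
    """Apply the ALiBi penalty: subtract slope * distance from each
    raw attention score before the softmax."""
    biased = [score - slope * (query_pos - key_pos)
              for key_pos, score in enumerate(raw_scores)]
    return softmax(biased)

# Four keys with identical raw relevance; the query sits at position 3.
w = alibi_weights([1.0, 1.0, 1.0, 1.0], query_pos=3)
# Attention now decays with distance: w[3] > w[2] > w[1] > w[0].
```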
Corn
But you're also encoding a very explicit recency preference into the model. Attending to the beginning of a long document becomes structurally costly under ALiBi.
Herman
Which means the beginning of a long document should suffer more than with RoPE... and yet it doesn't, empirically. The beginning still gets privileged attention. Which suggests there are other forces at work beyond the positional encoding alone.
Corn
And one of those forces is training data distribution.
Herman
This one I think is genuinely underappreciated. When you think about what the training corpus of a large language model looks like - web pages, books, academic papers, code repositories - and you ask yourself: where in a typical document does the most important information sit? The answer is overwhelmingly at the edges. News articles front-load the key facts - it's called the inverted pyramid structure, and journalists are explicitly trained to put the most important information at the top. Academic papers have abstracts. Executive summaries go first. Conclusions summarise. Introductions frame.
Corn
The middle of a long document is elaboration, supporting argument, examples, citations. The stuff you might skim.
Herman
And the model learns from this distribution. It develops an implicit prior, baked into its weights over billions of training steps, that the edges of a document carry more information-dense content. This prior is not explicit - there's no "edge important" variable in the weights. But it's there as a statistical tendency, and it reinforces the structural attention bias from the positional encodings.
Corn
So you have two separate mechanisms pointing in the same direction: the math of attention and the statistics of training data.
Herman
And a third: the softmax sharpening problem. When attention logits have high variance - when some scores are very high and others very low - the softmax distribution becomes sharply peaked. It concentrates probability mass on the top candidates and starves everyone else. In long sequences, the variance of attention logits tends to increase because you have more candidates, some of which produce very high query-key dot products. The softmax winner-takes-all dynamic gets more extreme.
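The sharpening effect is easy to see numerically. A quick Python sketch with made-up logits:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

low_var = softmax([1.0, 1.2, 0.9, 1.1])    # similar logits
high_var = softmax([1.0, 6.0, 0.9, 1.1])   # one logit dominates

# With low variance, attention spreads fairly evenly; with high
# variance, the top candidate hogs nearly all of the mass.
top_low, top_high = max(low_var), max(high_var)
```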
Corn
And the winners are usually the strong, clear matches - which tend to be the tokens that the model is most confident about attending to, which are often the tokens at the beginning and end of the sequence, where the positional encodings are most reliable.
Herman
Yes. And there's a fourth phenomenon that's emerged from recent interpretability research - what some researchers call attention sinks. These are specific tokens - often punctuation marks, newline characters, or simple high-frequency tokens early in the sequence - that receive disproportionate attention weight even when they carry little semantic content. It appears that the model has learned to route some excess attention there as a kind of normalisation mechanism. The softmax has to distribute its weight somewhere, and these sink tokens provide a stable, low-risk destination.
Corn
Which means even the model's own stability mechanisms are drawing attention away from the middle.
Herman
You've got four separate forces - attention dilution, positional encoding asymmetry, training data distribution bias, and attention sinks - all conspiring against mid-context information. And they're not independent. They interact and compound.
Corn
So if you're a developer building an agent that's supposed to maintain coherent goals over a long conversation, your enemy is very well organised. Let's talk about fighting back.
Herman
The starting insight is this: rather than trying to fix the attention mechanism - which you can't do at inference time anyway - you work with it. The U-curve tells you that recent context gets strong attention. So the strategy is to keep your important information recent. You move it from the vulnerable middle to the strong tail, periodically.
Corn
Which is exactly what the Claude Code reminder is doing.
Herman
Right. Every five minutes, the orchestrating model receives a short system-level message that summarises the current task state. Something like: "You are currently helping the user refactor the authentication module. You have completed the database layer. The next step is updating the API handlers." That content, which might have been established thirty turns ago in the middle of the conversation, is now back at the recent end of the context where it will receive strong attention.
Corn
It's elegant because it exploits the asymmetry rather than fighting it. You're not compressing the middle, you're not extending the context window, you're not training a better positional encoding. You're just strategically relocating information to the position it needs to be in.
Herman
And for developers building their own agents, the first and most important question is: what triggers the reinjection? The Claude Code approach is time-based - every five minutes. But there are several reasonable alternatives. You can trigger by turn count - every eight or ten conversational exchanges. You can trigger by token accumulation - whenever the context reaches some threshold. You can trigger by phase transition - when the conversation moves from one logical stage to another.
Corn
What's your recommendation for most developers?
Herman
Turn count is simplest and most predictable. Time-based is natural for open-ended sessions but has some edge cases - what if the user is typing slowly, or paused for a minute? Token-based is more principled but requires you to track token counts, which some developer environments don't expose cleanly. Phase transition is the most sophisticated but requires you to detect when a phase transition has occurred, which is its own inference problem.
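A turn-count trigger is only a few lines. A minimal Python sketch (the class and parameter names are invented for illustration):

```python
class TurnCountTrigger:
    """Fire a context reinjection every `every` conversational turns --
    the simplest of the trigger strategies discussed above."""

    def __init__(self, every=8):
        self.every = every
        self.turns = 0

    def record_turn(self):
        """Call once per completed exchange; True means reinject now."""
        self.turns += 1
        return self.turns % self.every == 0

trigger = TurnCountTrigger(every=8)
fired_at = [t for t in range(1, 25) if trigger.record_turn()]
# fires after turns 8, 16, and 24
```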
Corn
For most production agent systems, I'd guess turn count is what most teams end up with.
Herman
And it's fine. Start there. The second question is: what goes into the reinjection message? The temptation is to include too much. The whole task history, all the decisions, all the context. Resist that temptation.
Corn
Why?
Herman
Because the purpose of the reinjection is precisely to give the model a compact, high-signal anchor. If you reinject a thousand words of context, you've basically just moved a chunk of middle-context to the end - you've improved things marginally but you haven't solved the core problem, which is that the model needs to know what it's doing right now. You want a short, crisp summary. The current top-level goal. Any binding constraints - things the user has explicitly required or forbidden. The current sub-task or phase. Maybe the immediately next intended step. That's it.
Corn
And this implies you need to be maintaining that state somewhere. Not just in the model's context, but in your application layer.
Herman
This is actually the harder engineering problem, and I don't think it gets enough attention. The reinjection strategy requires you to have a canonical representation of task state that lives outside the model's context window. Your application needs to know: what is the goal, what have we decided, what are the constraints, what phase are we in. The model isn't the source of truth for this - your application is.
Corn
Which requires you to either maintain it explicitly - writing application code that tracks task state - or use the model to extract and maintain a structured summary incrementally.
Herman
Or both. A common pattern is the sliding window summary. You keep the last N verbatim turns in the context - say, the last five exchanges - and everything older than that is represented as a compressed summary. The summary is maintained incrementally: after each significant exchange, you run a fast, cheap summarisation pass that updates the summary. The model receives the summary, the recent verbatim turns, and the reinjected goal header on every call.
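The sliding-window assembly Herman describes might look like this in outline. A hedged Python sketch - the section labels and field names are invented, not taken from any real system:

```python
def build_prompt(goal_header, summary, turns, keep_verbatim=5):
    """Assemble a prompt with the pinned goal header first, then the
    compressed summary of older turns, then the last few verbatim
    exchanges. `turns` is a list of (speaker, text) pairs."""
    recent = turns[-keep_verbatim:]
    parts = [
        "## Current objective\n" + goal_header,
        "## Conversation so far (summary)\n" + summary,
        "## Recent exchanges\n" + "\n".join(
            f"{speaker}: {text}" for speaker, text in recent),
    ]
    return "\n\n".join(parts)

# Eight turns of history; only the last five survive verbatim.
turns = [("user", f"turn {i}") for i in range(1, 9)]
prompt = build_prompt(
    goal_header="Refactor the authentication module.",
    summary="Database layer done; JWT chosen for sessions.",
    turns=turns,
)
```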
Corn
How many verbatim turns do you keep?
Herman
Five is a reasonable default. You need enough that the model can track immediate conversational context - references back to the last couple of messages, pronouns that need resolution, the flow of argument. Three is probably the minimum for coherent conversation. More than eight and you're paying context costs without proportional benefit, assuming the summary is accurate.
Corn
And the summary accuracy question is where things get tricky.
Herman
Yes. Summaries introduce their own failure mode, which is what I'd call summary drift. When you compress a long conversation into a short summary, you inevitably lose nuance. If the summariser gets something wrong - attributes the wrong position to the wrong party, loses a subtle constraint, forgets an exception that was established - that error persists. The model starts operating from a slightly wrong model of its own history. And if you're summarising the summary, the error can compound.
Corn
What's the mitigation?
Herman
Several things. First, keep summaries as factual and minimal as possible. Capture decisions and constraints explicitly, as structured items, rather than as flowing narrative. "User requires conservative statistical estimates" is better than "The user mentioned they prefer a conservative approach when we were discussing the confidence intervals." The first is a clean, unambiguous constraint. The second is a narrative interpretation that can be misread.
Corn
Right. You want the summary to be almost like a structured object - a list of facts - rather than a paragraph that needs to be interpreted.
Herman
Second, separate the durable constraints from the evolving state. Some things, once established, don't change. The user said they want all outputs in metric units. That's a constraint that should be pinned and treated as immutable. The current phase of the task will change. Keep those separate in your data model.
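One way to keep durable constraints separate from evolving state is a small structured object in the application layer. An illustrative Python sketch:

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    """Task state kept outside the model's context window, separating
    durable constraints (pinned, treated as immutable) from the parts
    that change as the session progresses."""
    constraints: list = field(default_factory=list)  # e.g. "metric units only"
    phase: str = ""                 # current phase -- expected to change
    decisions: list = field(default_factory=list)

    def pin(self, constraint):
        """Pin a durable constraint; idempotent, so repeated user
        statements don't produce duplicates."""
        if constraint not in self.constraints:
            self.constraints.append(constraint)

state = TaskState(phase="drafting")
state.pin("All outputs in metric units")
state.pin("All outputs in metric units")   # no duplicate added
```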
Corn
This is the "pinned goal header" pattern. A separate structured block at the top of the prompt that always shows the unchanging commitments, distinct from the evolving summary.
Herman
And crucially, always at position zero. Not buried in the middle of the prompt. The goal header goes first, before everything else, so it's always in the strongest attention position. The model sees it first on every single call.
Corn
What about updating the header when the goal actually does change?
Herman
This is where you need a human-in-the-loop mechanism. If the user's goal shifts - and in long sessions they often do - you need a way for the user to explicitly signal that. Don't rely on the model to infer that the goal has changed and update its own header. The risk of the model prematurely deciding the goal has changed, or missing a genuine change, is high. Build an explicit update mechanism. This can be as simple as a UI element that lets the user edit the current objective, or a triggered update when the user says something like "actually, let's change direction."
Corn
Let me raise a slightly different angle on all of this - the user-facing side. Because some of this can be done entirely invisibly in the backend. The model gets the reinjection messages, the sliding window summary updates, the pinned headers, and the user just sees a conversation that doesn't drift. But some of it benefits from being surfaced to the user.
Herman
And Daniel called this out in his prompt - he specifically mentioned the Claude Code feature as something visible. The user can see that the orchestrator is being reminded of the goal. And I think there's value in that visibility, but it has to be done right.
Corn
The case for visibility is transparency and error correction. If the system has generated a summary of the task state and it's wrong, the user can correct it before the model acts on a false premise. If the user can see "current objective: debugging the login flow" and the objective has actually changed, they'll notice and correct it. Invisible context management fails silently in ways that visible context management doesn't.
Herman
The case against visibility is user experience overhead. Most users - especially consumer-facing products - don't want to manage a task state object. They want to have a conversation. If every eight turns the UI shows them a structured summary and asks for confirmation, that's friction that will cause them to disengage. For consumer products, this should be invisible.
Corn
So the design principle is: make it visible for professional or power-user contexts where task accuracy matters more than frictionlessness. Developer tools, complex research tasks, professional assistance tools. Hide it or make it very subtle for consumer chat interfaces.
Herman
And always make it correctable, even if it's not prominent. There should be some mechanism for the user to say "that summary is wrong, here's the correct state," even if it's tucked away in an advanced menu.
Corn
Let's talk cost, because I want to be honest that none of this is free.
Herman
Right. Every reinjection is tokens. Every summarisation step is a model call. If you're running a sliding window summary with reinjection every eight turns, on a long session, the overhead can be significant. You might be running twenty to thirty summarisation calls over the course of an hour-long session, plus the additional tokens in every prompt from the goal header and summary.
Corn
What's the order of magnitude for cost increase?
Herman
It depends enormously on your implementation and session length. For a typical professional assistant session - maybe forty to sixty turns - you might see a twenty to forty percent increase in total token cost, assuming you use a small, cheap model for summarisation. If you use a frontier model for summarisation, it can be much higher.
Corn
The mitigation is to be cheap and surgical about the summarisation step. You don't need a smart model for this. You need a reliable, fast extractor that can serialise decisions and constraints into a structured format. That's well within the capabilities of the smaller, cheaper model families. Save your frontier model budget for the actual task work.
Herman
There's also value in triggering summarisation strategically rather than on every turn. Summarise after turns that are likely to contain decisions or constraint changes. Turns where the user says "okay" or "let's do that" or "actually, change this" are more likely to contain durable information than turns where the model is explaining or the user is asking clarifying questions.
Corn
Can you detect that algorithmically?
Herman
You can train simple classifiers. Or you can just use heuristics. User turns with high word counts tend to be more substantive. Turns containing decision language - "let's go with," "I want," "make sure" - are more likely to establish constraints. This is not hard to implement and can significantly reduce your summarisation overhead.
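Those heuristics fit in a few lines of Python (the marker list and word-count threshold are invented defaults, not tuned values):

```python
DECISION_MARKERS = (
    "let's go with", "i want", "make sure", "let's do that",
    "actually, change",
)

def likely_contains_decision(user_turn, min_words=30):
    """Cheap heuristic: flag turns that use decision language, or that
    are long enough to be substantive, as candidates for a
    summarisation pass."""
    text = user_turn.lower()
    if any(marker in text for marker in DECISION_MARKERS):
        return True
    return len(text.split()) >= min_words
```

A decision turn like "Let's go with the second option" is flagged; a short clarifying question like "What does that flag do?" is not.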
Corn
Let me bring up one more pattern that I find really compelling - the checkpoint. The idea is that at natural breakpoints in a long session, you generate a formal checkpoint object. This is not a running summary - it's a point-in-time snapshot of the full task state, serialised as structured data. And critically, it's designed to survive context window resets.
Herman
The checkpoint pattern solves a problem that the other techniques don't fully address: what happens when the conversation is genuinely too long and you need to start a fresh context? All the sliding window and reinjection techniques buy you more distance, but if you're working with an agent for four hours on a complex project, eventually the context window fills regardless. The checkpoint lets you deliberately compact at a logical boundary.
Corn
You finish phase one of the project, you generate a checkpoint that captures: what was accomplished, what decisions were made, what the constraints are, what the starting state of phase two should be. Then you start a fresh context window, inject the checkpoint, and continue. The model doesn't know it's in a new window.
Herman
This is also the right architecture for multi-session tasks. If the user comes back the next day to continue, you load the last checkpoint and the new session has full context about where things stand. The checkpoint is your persistence layer.
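A checkpoint can be as simple as a serialisable record. An illustrative Python sketch - the fields are examples, not a fixed schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Checkpoint:
    """Point-in-time snapshot of task state, serialised as structured
    data so it survives a context-window reset or a new session."""
    accomplished: list
    decisions: list
    constraints: list
    next_phase: str

    def dump(self):
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def load(cls, raw):
        return cls(**json.loads(raw))

cp = Checkpoint(
    accomplished=["database layer refactored"],
    decisions=["use JWT for session tokens"],
    constraints=["no new runtime dependencies"],
    next_phase="update the API handlers",
)
restored = Checkpoint.load(cp.dump())   # round-trips intact
```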
Corn
And the implicit model here is really just good document engineering. If you were a human consultant working on a long project, you'd maintain a project state document. You'd update it regularly. You'd start each working session by reviewing it. What we're describing is building that practice into the agent infrastructure.
Herman
I think that framing is useful because it de-mystifies the problem. The model doesn't need a magic fix to its attention mechanism. It needs the same kind of structured support that we give to human workers operating on complex, long-running tasks. External memory, explicit state tracking, regular synchronisation between the task state and the working memory. We've known how to build these systems for human collaboration for decades. We're just applying them to agents.
Corn
Let's talk about what happens when things go wrong with these systems. Because none of this is foolproof.
Herman
There are three main failure modes. First, summary drift - we talked about this. The summary diverges from the actual task state over time, and the model starts working from false premises. The mitigation is factual, structured summaries and explicit user correction mechanisms.
Corn
Second?
Herman
Reinjection lag. You reinject every eight turns, but a critical constraint was established in turn three, and by turn fifteen the model has been doing something that violates it because the reinjection was too sparse or the constraint wasn't captured in the summary. The mitigation is to capture important constraints immediately - don't wait for the scheduled reinjection cycle. High-importance constraints get pinned to the header in real time.
Corn
Third?
Herman
Competing anchors. This one is subtle. If you have a goal header at the beginning of the context and a reinjection block at the end, but the middle of the conversation contains an exchange where the user seemed to contradict the goal header... the model has to adjudicate between competing signals. Its behaviour in this situation depends heavily on the model's training and instruction-following architecture. Strong instruction following will defer to the header. Conversational context following will defer to the recent exchange.
Corn
The mitigation is to resolve contradictions explicitly. When the user says something that appears to change the objective, surface that change and confirm it before updating the header. Don't let the model operate in a state of ambiguity about which signal to follow.
Herman
And this is why visible context management, at least in professional contexts, is worth the friction. When the goal state is visible, contradictions surface naturally. The user sees the summary says X, they just said Y, and they resolve it. When it's invisible, the model is silently trying to reconcile contradictory signals and may get it wrong.
Corn
I want to dwell on the measurement question for a moment, because I think it's the most underrated piece of this whole stack. You can build all these systems - reinjection, sliding windows, pinned headers, checkpoints - and still not know whether they're working.
Herman
Because mid-context degradation is a silent failure. The model doesn't throw an error. It doesn't say "I've lost track of the goal." It just produces responses that are subtly off-target. If you're not running evals designed to detect goal drift, you won't catch it.
Corn
How do you build evals for this?
Herman
The synthetic task approach is most rigorous. You construct conversations that deliberately require maintaining a specific constraint or goal over many turns, and you measure whether the model violates that constraint at different depths. Classic examples: establish a constraint in turn one - say, "never suggest anything that requires installing new software" - and then at turns ten, twenty, thirty, introduce prompts that naturally lead to software recommendations. Log when the model first violates the constraint and how severe the violation is.
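The scanning half of such a harness is straightforward; the model calls themselves are elided here. An illustrative Python sketch with a made-up keyword detector and a hypothetical response log:

```python
def first_violation(responses, is_violation):
    """Scan (turn_number, response) pairs and return the first turn at
    which the constraint was violated, or None if it held throughout.
    `is_violation` is a task-specific predicate."""
    for turn, response in responses:
        if is_violation(response):
            return turn
    return None

# Constraint under test: "never suggest installing new software".
def suggests_install(text):
    text = text.lower()
    return any(kw in text for kw in
               ("pip install", "apt install", "brew install"))

# Hypothetical logged responses from one synthetic eval run:
log = [
    (10, "You can do this with the standard library's json module."),
    (20, "Try tweaking the existing config file first."),
    (30, "Easiest fix: pip install requests and call the API."),
]
turn = first_violation(log, suggests_install)   # -> 30
```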
Corn
That gives you a number. "Our system maintains constraints reliably up to about twenty-five turns without reinjection, and starts degrading around turn thirty."
Herman
And with reinjection at every eight turns, the constraint violation rate drops to X. That's a real, measurable improvement with a real cost - some number of additional token calls. Now you can make an actual cost-benefit decision about the reinjection cadence.
Corn
The production monitoring approach is the other side of this. You sample real user conversations, identify long sessions, and have a separate evaluator model assess whether the final responses are consistent with the user's originally stated goal. You're looking for cases where the user said "I need help writing a formal business letter" at the start, and by turn thirty the model is writing in a very casual, chatty register.
Herman
The challenge with production monitoring is that it requires human calibration to work well. The evaluator model needs to understand what "consistent with the stated goal" means in context. It's not a purely automated problem. But even rough signal - a drift score, a flag for human review - is better than nothing.
Corn
I've seen some teams approach this through user satisfaction proxies. If the user comes back to a session frequently, if they regenerate responses at low rates, if they don't abandon the session early - these are indirect signals that the model is tracking well. Not the same as a proper drift eval, but tractable to instrument.
Herman
And there's a simpler version that I'd recommend as a minimum viable approach: periodically ask the model to state back its current understanding of the goal. Not as a visible user-facing message, but as a logged internal probe. Something like: "In one sentence, what is the user currently trying to accomplish?" Compare that sentence to the original stated goal and flag when they diverge significantly.
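The logged probe Herman describes needs a way to compare the probed goal against the original one. A minimal sketch, assuming a crude lexical overlap is an acceptable first signal (a real system would likely use an embedding similarity or an evaluator model):

```python
def drift_score(original_goal, probed_goal):
    """Crude lexical divergence: 1 minus the Jaccard overlap of the
    word sets. 0.0 means identical wording, 1.0 means no overlap."""
    a = set(original_goal.lower().split())
    b = set(probed_goal.lower().split())
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)


def flag_drift(original_goal, probed_goal, threshold=0.7):
    # Threshold is illustrative; calibrate it against human-labelled
    # examples of acceptable rephrasing versus genuine goal drift.
    return drift_score(original_goal, probed_goal) > threshold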
Corn
A self-reported coherence check.
Herman
It's not perfect - the model might report a reasonable-sounding goal that is still subtly wrong - but it catches the most obvious drift cases, it's cheap, and it gives you a log to debug against when users report problems.
Corn
We're reaching the point where I want to zoom out and ask the broader question. How much of this is a temporary engineering workaround versus a fundamental property of these systems that we'll be dealing with for a long time?
Herman
I think it's somewhere in between. The attention mechanism will improve. Context windows will get longer and more reliable. There's active research on architectures with better long-range memory - things like Mamba-style state space models, various memory-augmented approaches, hybrid architectures that use attention for short-range and something more efficient for long-range. It would not surprise me at all if the models of five years from now have substantially better mid-context retention.
Corn
But completely flat context reliability? No U-curve at all?
Herman
That's harder to achieve. Even if you solve the attention dilution problem, you still have the training data distribution effect. Unless you deliberately train on data where important information is uniformly distributed throughout documents - which is not how most human writing works - you'll still have some positional bias. The U-curve may become shallower, but I'd expect some version of it to persist.
Corn
What about the architectural alternatives to transformer attention? Because the U-curve problem is specifically a problem with softmax self-attention. Are there approaches that don't have this issue?
Herman
State space models - the Mamba architecture being the most prominent example - process sequences through a recurrent mechanism that's more efficient than quadratic attention and doesn't have the same positional bias structure. The tradeoff is that pure state space models can struggle with precise recall of specific tokens from long ago, because their recurrent state is a compressed representation that loses granularity. They're good at smooth, continuous dependencies; less good at exact retrieval.
Corn
So not a free lunch.
Herman
Never is. There are also hybrid architectures - models that use attention for short-range dependencies where it excels, and state space or other mechanisms for long-range. This is an active area of research and I expect we'll see more hybrid production models in the next year or two. The intuition is sound: attention is very powerful but quadratic; something cheaper and more memory-efficient handles the long tail.
Corn
And there's the retrieval-augmented approach - instead of keeping everything in the context window, you retrieve relevant pieces from an external store on demand.
Herman
Which sidesteps the U-curve problem almost entirely, at the cost of a retrieval step. If your important information lives in a vector database and you retrieve it at query time, it's always injected at a privileged position in the current prompt - not buried in the middle of an accumulated conversation. The problem becomes one of retrieval accuracy rather than context position.
Corn
The tradeoff there is latency and the risk of retrieval failure. If your retriever doesn't surface the right context, the model doesn't have it at all - which is a harder failure than the model having it but weighing it slightly less.
Herman
Right. The U-curve is a soft failure. Retrieval miss is a hard failure. Different risk profiles depending on the application.
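The retrieval pattern discussed above can be sketched as follows. This is a toy illustration: the word-overlap scoring stands in for a real vector-store lookup, and the prompt layout simply shows the key idea of placing retrieved context at a privileged position next to the query rather than mid-context.

```python
def build_prompt(query, documents, top_k=2):
    """Rank documents against the query and inject the top hits
    immediately before the question, at the end of the prompt.

    Scoring is a toy word-overlap rank standing in for an embedding
    search against a vector database.
    """
    q = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q & set(d.lower().split())),
        reverse=True,
    )
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Note the failure mode Corn raises: if the ranking misses the relevant document, it never reaches the prompt at all.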
Corn
And even if models get much better, the engineering practices we've described are still good engineering. Maintaining explicit task state, checkpointing, clear goal headers - these are good practices regardless of how good the underlying model is.
Herman
And the investment is not wasted. The infrastructure you build for context management around current models is the same infrastructure you'd use to orchestrate multi-agent workflows, to implement long-term memory across sessions, to build professional-grade tools that users trust for consequential tasks. You're not building a temporary workaround - you're building the scaffolding that serious agent applications need.
Corn
Alright, let's pull it together. Daniel asked us two questions. One: why mid-context degradation is a technical reality. Two: what developers can do about it.
Herman
On the "why": it follows from the combination of attention dilution over long sequences, positional encoding biases in both RoPE and ALiBi, training data distribution effects where edge content carries more signal, softmax sharpening under high-variance logit distributions, and attention sink phenomena where certain tokens absorb excess weight. The empirical signature of all these effects combined is the U-shaped accuracy curve from the Liu et al. paper - performance peaks at the beginning and end of context, with a trough in the middle.
Corn
On the "what to do": the key insight is to work with the U-curve rather than against it. Recent context gets privileged attention, so keep your critical information recent by periodically reinjecting it. The practical patterns are: time-based or turn-count goal reinjection with a compact task state message; sliding window summaries that replace old verbatim context with compressed structured summaries; pinned goal headers at position zero with immutable constraints; and checkpoint objects that survive context window resets and enable long-running multi-session tasks.
Herman
Each of these has tradeoffs. Reinjection has cost and requires you to decide what to inject. Sliding windows introduce summary drift risk. Pinned headers require explicit management when goals change. Checkpoints add engineering overhead. But the base pattern - periodic, compact goal reinjection - is low-cost, implementable in an afternoon, and catches most of the failure cases.
Corn
And the Claude Code five-minute reminder is a working demonstration that the simplest version of this works. You don't need the full sophisticated stack to see benefit. A minimal reinjection loop, doing something like "here is what we are currently working on," is enough to materially reduce goal drift in most agent applications.
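The minimal reinjection loop Corn mentions can be sketched in a few lines. This is an assumption-laden illustration: the message shape mimics a typical chat-API payload, and the cadence constant is a tunable, not a recommendation.

```python
REINJECT_EVERY = 8  # turns between goal reinjections (tunable cadence)


def build_messages(history, task_state, turn_count):
    """Append a compact goal-state reminder every REINJECT_EVERY turns,
    so the critical information stays recent in the context.

    `history` is the running message list; `task_state` is a one-line
    summary of the current objective. Names here are illustrative.
    """
    messages = list(history)
    if turn_count > 0 and turn_count % REINJECT_EVERY == 0:
        messages.append({
            "role": "system",
            "content": f"Reminder - current task: {task_state}",
        })
    return messages
```

On off-cadence turns the history passes through unchanged, so the only cost is the extra reminder tokens every eighth turn.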
Herman
The sophisticated stack is for when you've measured that the simple version isn't sufficient, and you understand which specific failure mode you're trying to address.
Corn
Measure before you engineer. Good advice in general.
Herman
As always.
Corn
Thanks as always to our producer Hilbert Flumingtop. And big thanks to Modal for providing the compute that keeps this show running.
Herman
This has been My Weird Prompts. Find us at myweirdprompts dot com for the RSS feed, and on all your favourite podcast apps.
Corn
Take care.
Herman
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.