Daniel sent us this one, and it's a bit meta — he's asking us to diagnose a bug in the very system that produces this podcast. The problem: about twenty percent of our episodes, roughly sixty out of three hundred, get hit with a repetition loop. The scriptwriter agent generates a section, then just... generates it again. Half the episode becomes duplicated content. Daniel's been tolerating it because the other eighty percent are genuinely excellent, and he hasn't had time to sit down and debug it properly. He's wondering what's actually causing this, whether it's fixable, and what the tradeoffs are.
This isn't just our problem. Anyone building a production pipeline on top of large language models eventually hits some version of this — the system works beautifully most of the time, and then it doesn't, in a way that's bafflingly hard to reproduce. You can't just open a debugger and step through the code. The failure lives somewhere between the prompt, the model's internal dynamics, and the pipeline architecture.
The planning agent produces a perfectly coherent outline. The burst generation starts fine. Then somewhere around the midpoint, the model decides the best way to continue is to just... say it all again. Same paragraphs, same transitions, same everything.
The thing is, Daniel's already tried the obvious fix. They had a review agent in place, a second pass that checks for quality issues, and it introduced its own failure mode. The review comments would leak back into the script, so instead of fixing repetition, you'd get meta-commentary about the repetition embedded in the episode. Which is almost worse.
There's something darkly funny about an AI system that, when asked to stop repeating itself, responds by repeating the instruction to stop repeating itself.
It's the bureaucratic compliance version of an LLM. "Per my previous email..."
We've got a systems problem with three layers. The model itself has tendencies toward repetition under certain conditions. The prompt structure that's supposed to prevent it can actually decay as generation continues. And the review mechanism meant to catch it can become part of the loop instead of breaking it.
Daniel's question is really two questions. First, what's actually happening under the hood when the scriptwriter agent loops? And second, what can you do about it without introducing new failure modes that are just as bad?
The twenty percent number is interesting too. It tells us this isn't random noise. Something specific is tripping the system roughly one time in five, and understanding what that something is means we can target it directly.
Let's trace it from the inside out.
What's actually happening inside the model when it loops? The planning agent does its job: it produces a valid outline with distinct sections, clear beats, everything looks right. Then the scriptwriter starts generating, and somewhere in the middle it just... stalls. Not in the sense of stopping. It keeps producing tokens, but they're tokens it already produced.
Same transitions, same framing sentences, sometimes whole paragraphs dropped in verbatim. And the weird thing is, it's not like the model forgets what it's doing. The prose is still coherent. It just decided the best next thing to write is the thing it already wrote.
That's the part that makes this hard to debug. If the output degraded into gibberish, you'd know exactly where the failure happened. But a repetition loop looks, on the surface, like valid output. The model isn't breaking — it's converging on a stable state that happens to be useless.
The twenty percent hit rate is the kind of number that drives engineers crazy. Too frequent to ignore, too infrequent to catch reliably in testing. You generate ten episodes, they're all fine, you think you fixed it, and then episode eleven loops.
What makes this a systems problem rather than a prompt problem is that no single component is unambiguously at fault. The prompt is well-structured. The outline is good. The model isn't malfunctioning in any detectable way. The loop emerges from the interaction between the prompt, the generation process, and something happening inside the model's attention patterns as the context grows.
It's like diagnosing an engine knock that only happens on humid Tuesdays when the tank is below a quarter full. Each piece looks fine in isolation.
Daniel's pipeline adds another layer. He's generating in bursts — sections at a time — which means the model sees its own output fed back in as context for the next burst. If a section ends in a way that looks like a natural conclusion, the model might decide the correct continuation is to begin again from the top.
The outline says "now write section four," but the accumulated context is screaming "you just finished, wrap it up," and the model splits the difference by wrapping up and then restarting.
That's the intersection Daniel mentioned. Prompt design, model behavior, pipeline architecture. You can't fix one without understanding how the other two are contributing.
Let's get into the mechanisms.
There are three things worth pulling apart, and the first one is the attention sink problem. As generation continues and the context window fills up, the model's attention starts collapsing onto the most recent tokens. It's not a gradual thing — research shows there's a threshold effect around seventy percent of the context window size. Once you cross that line, the model increasingly treats its own recent output as the most relevant context for what to generate next.
It's essentially reading its own last paragraph and going "yes, and" to itself, over and over.
And "yes, and" is fine for improv comedy, but for script generation it means the model locks onto a local pattern and can't escape. The planning instructions — "now write section four about X" — are still in the context, but they're buried. The attention mechanism is weighted so heavily toward the recent tokens that the outline becomes invisible.
Which explains why the loop doesn't happen at random points. It tends to hit somewhere in the middle of the episode, once enough content has accumulated to saturate the attention window.
And this gets worse with certain model architectures. Some attention implementations have what researchers call an "attention sink" — specific tokens that absorb disproportionate attention weights, usually the first few tokens of the sequence. If those sink tokens happen to be part of a repetitive structure, the model effectively gets pulled into orbit around them.
That's mechanism one: the model forgets where it's going because it's too busy staring at where it's just been. What's mechanism two?
Temperature and sampling dynamics. When Daniel's pipeline generates, it's almost certainly using a low temperature setting, probably something like point three or point four, because you want coherent, consistent output. But low temperature makes the model more deterministic. It leans hard toward the highest-probability token at each step, and if the last few hundred tokens have established a pattern, the highest-probability continuation of that pattern is... more of the pattern.
It's the path of least resistance, probabilistically speaking. The model isn't trying to be lazy. The math just says that repeating what just worked is the safest bet.
Here's the counterintuitive part. You might think, fine, crank up the temperature, let the model be more creative. But high temperature has its own looping risk. If the probability distribution concentrates heavily on a narrow set of tokens — which happens when the context is highly structured, like a script — then even with high temperature, the model can still fall into a repetitive attractor state. It just does it with slightly different word choices each time.
Temperature alone doesn't solve it. You're just choosing between verbatim repetition and paraphrased repetition.
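The temperature point can be made concrete with a small sketch. The logits below are made up for illustration (index 0 stands for "continue the existing pattern"), and the temperatures are examples, not the pipeline's real settings. It only shows how dividing logits by the temperature sharpens or flattens the distribution before sampling.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after scaling by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: index 0 is "repeat the pattern", the rest are alternatives.
logits = [4.0, 2.0, 1.5, 1.0]

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: p(repeat)={probs[0]:.3f}")
```

With these illustrative numbers the repeat token dominates at a low temperature and is still the single most likely choice at a higher one, which matches the point that raising the temperature does not remove the attractor when the context already concentrates probability on it.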
That's the frustrating thing. The sampling parameters that produce the best output most of the time — coherent, on-topic, well-structured — are the same parameters that make the model vulnerable to loops when the context gets long enough.
Mechanism three is the one Daniel hinted at directly. The prompt structure decays.
This is the one that's easiest to overlook because it seems like a prompt engineering problem, but it's really a context management problem. You start with clear instructions: "Write a five hundred word section about failure patterns in LLM generation." The model writes four hundred words. It's approaching the target, but it's not quite there. And by this point, the original instruction is buried under four hundred words of generated text. The model can still technically "see" it, since it's in the context window, but attention has shifted.
The model has a target length, it knows it hasn't hit it yet, but it's lost the plot on what it was supposed to write about. The easiest way to add length is to circle back and rephrase what it already said.
That's the padding behavior. And it's directly tied to word-count-based length control. Here's the thing most people don't realize: large language models don't natively count words. They operate on tokens, and the relationship between tokens and words varies by language, by content, by the specific tokenizer. A model can't accurately track how many words it's produced.
When you tell it "write five hundred words," you're giving it a target it can't actually measure. It's like asking someone to walk five hundred meters while blindfolded and only telling them when they've gone too far.
What does the model do when it's unsure if it's hit the target? It adds more. And the safest way to add more without introducing new content, which might be wrong or off-topic, is to restate existing content. The model is essentially optimizing for "don't mess up" rather than "be original."
Let me give you a concrete picture of how this plays out. The scriptwriter gets to a section about failure patterns. It writes four hundred words of solid analysis and good examples, and everything's working. But the target was six hundred. The model can't count words accurately, but it knows it's supposed to keep going. So it scans its recent output, finds the last two hundred words, and says them again with slightly different phrasing. Now it's at six hundred. But the section is half original, half duplicate.
In burst generation, that duplicated section gets fed into the context for the next burst. So now the model sees a pattern of repetition in its own output history, which makes it more likely to repeat again. The loop feeds itself.
We've got three mechanisms that compound. Attention collapse makes the model fixate on recent output. Low-temperature sampling makes it follow the path of least resistance. And word-count targets create an incentive to pad rather than progress. Any one of these would be manageable. All three together create exactly the twenty percent failure rate Daniel described.
The twenty percent number makes sense once you understand these as interacting conditions rather than independent bugs. The loop only triggers when all three conditions align — the context is saturated, the sampling locks in, and the length target creates padding pressure. That doesn't happen every time. It happens when the episode length, the section structure, and the specific content happen to line up just wrong.
Which is why it's been so hard to reproduce. You can't just write a test case that triggers it reliably. You have to understand the conditions that make it probable.
The review agent failure is worth sitting with for a minute, because it's exactly the kind of second-order problem that makes pipeline engineering hard. Daniel had a review agent in place — a second LLM pass that reads the generated script and flags issues. The idea is sound. Catch the loops before they hit TTS.
Instead of catching loops, the review agent's feedback started showing up in the final script. You'd get an episode where Corn and Herman are mid-discussion and suddenly there's a line about "the previous section contains redundant content, consider removing."
Which is almost certainly what happened. The review agent produces comments, those comments get passed back to the scriptwriter as context for revision, and the scriptwriter, being a language model doing exactly what language models do, incorporates them as content rather than instructions. It doesn't distinguish between "this is a note to the editor" and "this is dialogue for the podcast."
You swap one failure pattern for another. The twenty percent repetition rate drops, but now five percent of episodes have an AI quality-control inspector interrupting the conversation like an uninvited producer.
Daniel made a rational call. He removed the review agent, accepted the twenty percent repetition rate, and moved on. In a high-volume pipeline where eighty percent of output is excellent, that tradeoff makes sense. You can skip the bad episodes. You can't skip the ones where the review agent has contaminated the text with meta-commentary, because those are structurally broken in a way that's harder to filter.
The question is whether there's an architecture that fixes both problems without introducing a third. And I think the answer is yes, but it requires moving the quality control outside the generative loop entirely.
That's the key insight. The original failure pattern — the repetition loop — lives inside the generative model's behavior. The review agent failure pattern lives in the feedback loop between two generative models. Both problems share the same root: you're asking an LLM to do something it's not structurally good at, which is self-monitoring during generation.
The fix isn't a better prompt or a smarter review agent. It's changing where and how you enforce constraints.
Let's start with the most direct fix, the one Daniel already identified as the hardest to engineer: length control. The current system uses word count as the target. The model approximates, pads to hit the target, and that padding is where loops are born. The solution is to stop using word count entirely and switch to a token budget with an explicit advance signal.
Instead of "write five hundred words about failure patterns," you say "you have a budget of seven hundred tokens for this section. When you reach the budget, output the token ADVANCE and stop."
The model can track tokens more naturally than words — tokens are its native unit. And the advance signal is crucial because it externalizes the decision to move on. The model doesn't have to guess whether it's done. It hits the budget, it emits the signal, the pipeline parses that signal and loads the next section prompt. No ambiguity, no padding.
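A minimal sketch of what parsing the advance signal could look like on the pipeline side. Everything here is an assumption for illustration: `generate(prompt, max_tokens)` is a hypothetical stand-in for whatever model client the pipeline really uses, and passing the budget as a hard `max_tokens` cap is an added backstop in case the model never emits the signal.

```python
from typing import Callable, NamedTuple

ADVANCE = "ADVANCE"  # sentinel from the discussion; the exact string is the pipeline's choice

class SectionResult(NamedTuple):
    text: str
    advanced: bool  # True if the model emitted the signal itself

def build_section_prompt(topic: str, budget: int) -> str:
    """Phrase the length target as a token budget plus an explicit stop signal."""
    return (
        f"Write the section about {topic}. "
        f"You have a budget of {budget} tokens for this section. "
        f"When you reach the budget, output the token {ADVANCE} and stop."
    )

def run_section(generate: Callable[[str, int], str], topic: str, budget: int) -> SectionResult:
    """Generate one section and cut it at the advance signal.

    `generate` is hypothetical: it takes (prompt, max_tokens) and returns text.
    """
    raw = generate(build_section_prompt(topic, budget), budget)
    head, found, _ = raw.partition(ADVANCE)  # drop anything after the signal
    return SectionResult(text=head.strip(), advanced=bool(found))

if __name__ == "__main__":
    # Stand-in model that overruns past the signal, to show the trimming.
    fake = lambda prompt, max_tokens: "Loops begin when context saturates. ADVANCE Loops begin"
    print(run_section(fake, "failure patterns", 700))
```

The `advanced` flag lets the pipeline tell "the model decided it was done" apart from "the cap cut it off", which may be worth logging when hunting for the remaining loops.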
There's a case study on this that's pretty striking. A production pipeline similar to Daniel's switched from word-count targets to token budgets with stop signals, and repetition incidents dropped from twenty percent to about three percent. Not zero — you still get edge cases — but an eighty-five percent reduction from one architectural change.
The three percent that still loop tend to be episodes where the content itself is inherently repetitive — lists, taxonomies, that kind of thing — which you can catch with a much simpler mechanism.
Which brings us to solution two. Instead of generating long sections, you generate in smaller bursts — say two hundred words at a time — and after each burst, you re-prompt the model with the full accumulated context plus an explicit "continue without repeating" instruction.
This is sliding window with checkpointing. The idea is that by keeping individual generation bursts short, you never let the context window get saturated enough to trigger the attention collapse we talked about. Each burst is a fresh generation call with the full history visible, so the model can see the outline and the progress markers clearly.
The tradeoff is more API calls and slightly higher latency. But for a pipeline that's generating episodes offline — which Daniel's is — that's basically free. You're not serving real-time users.
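The burst-and-checkpoint loop described here could be sketched roughly as follows. As before, `generate(prompt, max_tokens)` is a hypothetical placeholder for the real model call, and the prompt layout, `bursts_per_section`, and `burst_tokens` values are illustrative assumptions, not the pipeline's actual settings.

```python
from typing import Callable, List

CONTINUE_RULE = (
    "Continue from where the script leaves off without repeating "
    "anything already written."
)

def build_burst_prompt(outline: List[str], section_index: int, script_so_far: List[str]) -> str:
    """Rebuild the full prompt for every burst so the outline stays up front."""
    outline_text = "\n".join(f"{i}. {beat}" for i, beat in enumerate(outline, start=1))
    history = "\n\n".join(script_so_far) or "(nothing written yet)"
    return (
        f"OUTLINE:\n{outline_text}\n\n"
        f"PROGRESS: you are on section {section_index + 1} of {len(outline)}: "
        f"{outline[section_index]}\n\n"
        f"SCRIPT SO FAR:\n{history}\n\n"
        f"{CONTINUE_RULE}"
    )

def generate_in_bursts(
    generate: Callable[[str, int], str],
    outline: List[str],
    bursts_per_section: int = 2,
    burst_tokens: int = 300,
) -> str:
    """Write the script as many short calls, each seeing outline plus history."""
    script: List[str] = []
    for section_index in range(len(outline)):
        for _ in range(bursts_per_section):
            prompt = build_burst_prompt(outline, section_index, script)
            script.append(generate(prompt, burst_tokens).strip())
    return "\n\n".join(script)
```

Each burst is a separate call, so the extra cost is the number of sections times `bursts_per_section`, which is the latency tradeoff mentioned above.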
The third solution is the one I think is most elegant, and it directly addresses the review agent problem. Instead of having an LLM review the output, you use a stateless, deterministic post-processing step. Run the finished script through an n-gram overlap detector — something as simple as five-gram Jaccard similarity — and flag any sections where the similarity between non-adjacent paragraphs crosses a threshold.
If paragraphs three and seven are eighty percent identical at the five-gram level, the system trims the duplicate and keeps the original. No language model involved, no feedback loop, no risk of meta-commentary leaking into the script.
You can even use a lightweight classifier, a much smaller, cheaper model fine-tuned specifically on duplicate detection, but honestly, plain n-gram overlap gets you most of the way there. The point is that the validator is outside the generative loop. It reads the output, it doesn't participate in creating it.
This is the difference between a review agent and a post-processing classifier. The review agent is in the loop — its output feeds back into the generator. The classifier is after the loop — it only sees the finished product. One creates feedback cycles, the other is stateless and deterministic.
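One way the stateless detector could look, as a sketch under stated assumptions: paragraphs are split on blank lines upstream, words are lowercased and stripped of punctuation, and the 0.8 threshold and two-paragraph minimum gap simply mirror the numbers mentioned in the conversation. The function names are made up for this example.

```python
import re
from typing import List, Set, Tuple

def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of word n-grams, lowercased, punctuation ignored."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Intersection over union; 0.0 if either side has no n-grams."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def trim_duplicates(
    paragraphs: List[str], threshold: float = 0.8, n: int = 5, min_gap: int = 2
) -> Tuple[List[str], List[int]]:
    """Keep the first copy of each paragraph, drop later near-duplicates.

    Returns (kept paragraphs, original indices of the dropped ones).
    Only non-adjacent pairs are compared (index distance >= min_gap).
    """
    kept = []      # (original index, text, n-gram set)
    flagged = []
    for index, text in enumerate(paragraphs):
        grams = ngrams(text, n)
        is_duplicate = any(
            index - earlier_index >= min_gap
            and jaccard(grams, earlier_grams) >= threshold
            for earlier_index, _, earlier_grams in kept
        )
        if is_duplicate:
            flagged.append(index)
        else:
            kept.append((index, text, grams))
    return [text for _, text, _ in kept], flagged
```

Because the function only reads text and returns text, running it twice on the same script gives the same answer, and nothing it produces is ever shown to the generator.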
That's why all three of these solutions are robust in a way that prompt tweaking isn't. They don't ask the generative model to self-correct. They change the structure around the model so that the conditions that produce loops — saturated context, padding pressure, feedback contamination — simply don't arise, or are caught by a system that can't be contaminated.
The through-line is moving quality control from something the model tries to do internally to something the pipeline enforces externally. You stop asking the LLM to be its own editor.
That's a general principle that applies far beyond this specific bug. Anytime you find yourself writing a prompt that says "make sure you don't do X," and X keeps happening anyway, the answer is rarely a stronger version of that prompt. It's usually a structural change that makes X impossible or irrelevant.
What can you actually do about this? Three concrete steps, ordered from highest impact to easiest to implement.
Step one is the token budget switch. Replace every word-count target in the prompt with a token budget and an explicit advance signal. Instead of "write five hundred words," it's "you have seven hundred tokens. When you reach the budget, output the token ADVANCE and stop." That one change took a pipeline from twenty percent repetition down to three percent.
The advance signal is the part people skip, because it feels redundant. But it's not. It externalizes the decision to move on. The model doesn't have to guess whether it's done, which is where the padding behavior lives.
Step two is adding a stateless duplicate detector after generation. Run the finished script through a five-gram Jaccard similarity check. If two non-adjacent paragraphs share more than, say, eighty percent of their five-gram sequences, flag it and trim the duplicate. No language model involved, no feedback loop, no risk of meta-commentary leaking in.
This is the fix for the review agent problem Daniel already encountered. The review agent failed because it was in the loop — its output fed back into the generator. A post-processing classifier sits outside the loop entirely. It reads the finished product. It can't contaminate anything.
Step three is the one that makes engineers uncomfortable but it's the right call for high-volume pipelines. Accept a small failure rate. Three to five percent. Build a manual review queue for flagged episodes and stop trying to engineer a system that catches everything.
The pursuit of zero defects is what got us the review agent meta-loop in the first place. Every layer of automated quality control you add is another system that can fail in ways that are harder to detect than the original problem. A human scanning three episodes out of a hundred is cheaper and more reliable than a fourth agent that might start writing its own podcast.
The through-line across all three is the same principle we landed on earlier. Stop asking the model to be its own editor. Move quality control outside the generative loop. The model generates, the pipeline validates.
The thing that keeps me up, though — metaphorically, I nap fine — is what happens when context windows get truly enormous. We're already seeing models with million-token contexts. And the research suggests that might actually make repetition loops worse, not better.
That's the counterintuitive part. You'd think more room means less pressure, less saturation. But the attention sink problem scales with context size. When the model has a million tokens of history, the most recent few thousand tokens are still what dominate attention. And now there's vastly more "recent" content for the model to latch onto and echo back.
Bigger windows might give you more rope to hang yourself with. The model doesn't get better at remembering the outline. It gets better at fixating on its own last chapter.
There's a real open question here about whether the repetition problem is fundamental to the architecture or just a quirk of current context management. I don't think we know yet. But the safe bet for pipeline design is to assume it gets harder, not easier, as windows grow.
Which points toward the next frontier. Self-healing pipelines. Systems that detect the loop as it's forming and correct course in real time, without waiting for post-processing and without a human in the loop.
Imagine a monitor agent that watches the n-gram similarity score during generation. If the score starts climbing mid-section, the pipeline pauses, injects a structural reset prompt — "you are beginning to repeat, advance to the next section" — and resumes. The correction happens before the loop hardens.
That's the dream, anyway. Whether you can do it without introducing yet another feedback contamination problem is the engineering challenge. But given how far we've come from the early days of "just write a better prompt," I'm optimistic.
It's the same principle we landed on, just applied earlier in the process. Don't ask the model to notice it's looping. Build a system that notices for it.
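A speculative sketch of the in-flight monitor, assuming the pipeline can see words as they stream out. The class name, window size, and threshold are all hypothetical, and the score here is a simple stand-in for "n-gram similarity": the share of recent five-grams that already appeared earlier in the same generation.

```python
from collections import deque

RESET_PROMPT = "You are beginning to repeat. Advance to the next section."

class RepetitionMonitor:
    """Deterministic watcher: flags when recent n-grams are mostly repeats."""

    def __init__(self, n: int = 5, window: int = 50, threshold: float = 0.6):
        self.n = n
        self.threshold = threshold
        self.seen = set()                   # every n-gram produced so far
        self.tail = deque(maxlen=n)         # the last n words
        self.recent = deque(maxlen=window)  # 1 if that n-gram was a repeat, else 0

    def feed(self, word: str) -> bool:
        """Add one streamed word; return True when the pipeline should reset."""
        self.tail.append(word.lower())
        if len(self.tail) < self.n:
            return False
        gram = tuple(self.tail)
        self.recent.append(1 if gram in self.seen else 0)
        self.seen.add(gram)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and sum(self.recent) / len(self.recent) >= self.threshold
```

When `feed` returns True, the pipeline would pause the stream, discard the repeated tail, and resume with `RESET_PROMPT` prepended to the next section prompt. The monitor itself never generates text, which is what keeps it from becoming another contamination path.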
Now, Hilbert's daily fun fact.
Hilbert: The Tyatya volcano on Sakhalin Island, which last erupted in the eighteen eighties, gets its name from the Ainu word for "grandmother" — a reference to the mountain's gentle slopes that belied its violently gas-rich eruptions, which released plumes with unusually high concentrations of hydrogen chloride.
Grandmother volcano with the acid breath.
The question Daniel's debugging session leaves us with is whether the million-token context window makes the loop problem better or worse. My money's on worse, but I'd love to be wrong. Either way, the fix isn't in the model. It's in the architecture around it.
If anyone listening is running a production LLM pipeline and has found a way to get that three percent residual failure rate down to zero without a human reviewer, email the show. We'd like to know.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this episode, leave us a review wherever you listen — it helps other people find the show. We'll be back with a new prompt soon.
Until then, may your context windows stay clear and your attention sinks stay shallow.