So Daniel sent us this one, and it's a design question more than a theory question, which I appreciate. Here's what he wrote: imagine a geopolitical crisis simulation where LLM-played personas of Khamenei, Netanyahu, Trump, and IRGC command are all taking turns in a sandbox. The simulation runs on IQTLabs/snowglobe, and then the simulation's summary gets fed into karpathy/llm-council — six lens agents deliberating in parallel, peer-reviewing anonymously, with a chairman writing the final forecast. The key constraint: the simulation is deliberately sealed from live news after turn zero. Actors see only a referee-maintained world state, never raw headlines, never Tavily results, never each other's private reasoning. The fresh-data side — Tavily, RSS feeds, ISW reports — only reaches the llm-council stage. Daniel wants us to walk through what the world state actually is, why the firewall exists, what it costs, and the question he's most interested in: how do you actually know if the firewall is working versus quietly breaking the simulation from the inside?
There's a lot packed in there. And I want to start before we get into the mechanics, because I think the framing Daniel chose is doing a lot of work. He's not asking about a chatbot or a retrieval pipeline. He's asking about actor simulation, which is a fundamentally different category of problem.
Right, and that distinction matters immediately. Because if you just describe what snowglobe is doing at a surface level, it sounds like a fancy role-play. But the actual engineering question is about epistemic containment.
Which is exactly the right frame. And by the way, today's script is being generated by Claude Sonnet four point six, which feels appropriate given we're talking about LLMs playing geopolitical actors.
The meta levels here are genuinely dizzying. Okay, so let's actually establish what the world state is in this architecture, because I think people conflate it with just a context window or a system prompt.
The world state in snowglobe is the authoritative, referee-authored description of what is currently true in the simulation world. Every agent reads it at the start of each turn before they act. But here's what makes it distinct: it is not a transcript, it is not a log, and it is not a summary of what the agents said. It is a curated description of what actually happened, as determined by the referee.
So the referee is doing editorial work, not just compression.
Compression is part of it, but editorial judgment is the core. The world state contains public facts — troop movements, declared strikes, official diplomatic statements, economic shifts. It contains resolved outcomes, meaning the referee's determination of whether an agent's attempted action actually succeeded. So if the Khamenei agent says "I launch a cyberattack on the Israeli power grid," the world state doesn't say "Khamenei attempted a cyberattack." It says something like "A cyberattack on Israel's power grid caused a four-hour blackout in the Tel Aviv metropolitan area." The attempt gets translated into a consequence.
And that translation is where a huge amount of information gets lost, or potentially added.
It does. And it also contains environmental constants — geography, pre-existing treaty status, weather if it's relevant. What it deliberately excludes is just as important. The private reasoning of other agents never enters the world state. So Khamenei sees what Netanyahu did, but never why. The chain of thought is sealed. Failed intentions also never appear — if an agent tries something the referee deems impossible given their resources, it just doesn't happen and it doesn't get logged as an attempt.
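To make that concrete, here's a minimal sketch of what a world-state record could look like. The field names and structure are illustrative, not snowglobe's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Referee-authored description of observable reality at the start of a turn.

    Contains only what every actor could plausibly perceive. Note what the
    type deliberately cannot represent: other agents' private reasoning,
    and attempts the referee adjudicated as impossible.
    """
    turn: int
    public_facts: list[str] = field(default_factory=list)       # statements, declared strikes
    resolved_outcomes: list[str] = field(default_factory=list)  # referee adjudications
    environment: dict[str, str] = field(default_factory=dict)   # geography, treaties, weather

state = WorldState(
    turn=3,
    public_facts=["Iran issued a warning about consequences for continued airstrikes."],
    resolved_outcomes=[
        "A cyberattack on Israel's power grid caused a four-hour blackout in Tel Aviv."
    ],
    environment={"weather": "clear", "strait_of_hormuz": "open"},
)
```

The important property is negative space: there is no field for another agent's chain of thought and no field for failed attempts, so the schema itself enforces the exclusions.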
That last one is interesting. Because in reality, failed intentions actually matter a lot. If Iran tries to do something and fails, the fact that they tried is signal.
That's a real cost, and we'll get to it. But the design logic is that the simulation is trying to model what actors can perceive, not what an omniscient observer would see. And in the real world, a failed covert operation that leaves no trace is genuinely unobservable to the other side.
Fair. Okay, so the world state is the referee's curated description of observable reality in the sim. Now the big question: why can't the actors just see the news? This sounds like an obvious constraint but I want to understand the actual mechanism of collapse.
This is the part I find most interesting from a model behavior standpoint. When you give an LLM playing Netanyahu a real headline from Haaretz about what happened in the actual world an hour ago, you don't get Netanyahu reasoning from the simulated world state. You get the model doing what it was trained to do, which is be a helpful assistant that aligns with reality.
So the training objective just overrides the persona.
The training bias is incredibly strong toward acknowledging and acting on real-world information. The model has been trained on billions of examples where the correct response to "here is a news headline" is "here is what this means and what should happen next." That's the dominant pattern. So when you inject live news into the simulation, the agent stops reasoning as an actor with limited information in a sandbox and starts summarizing the headline. You lose the counterfactual. You lose the actor logic. The simulation becomes a very expensive news commentary system.
And this is the "inference-over-the-news collapse" that Daniel flags. The sim doesn't fail loudly, it just... drifts into becoming a mirror of media consensus.
Which is actually worse than a loud failure, because you might not notice. The outputs still look coherent and plausible. They're just not measuring what you thought you were measuring. The WarAgent paper from Hua et al., published on arXiv in twenty twenty-three, ran LLM agents through World War One and World War Two scenarios and found this exact failure mode — agents with access to broader context would anchor on historical outcomes rather than reasoning from the immediate situation.
So you'd get agents essentially predicting the actual historical outcome because they know it, not because their actor logic produced it.
Right. The simulation becomes a retelling rather than an exploration. And Daniel's pipeline is explicitly trying to avoid that — the whole point of the snowglobe stage is to generate a signal that is independent of what the news is saying right now. If the simulation just tracks the news, it has zero independent diagnostic value.
Let's talk about the other firewall: why agents can't see each other's private reasoning. Because that one is subtler.
The echo chamber effect is about what happens when agents can read each other's internal assessments. If the Khamenei agent's chain of thought includes "I am projecting aggression but I am genuinely afraid of escalation and would accept a face-saving exit," and the Netanyahu agent can see that, the simulation immediately stops being a crisis and starts being a negotiation.
Because the Netanyahu agent now has information that the real Netanyahu would never have.
And more than that — the LLM playing Netanyahu will use that information, because it's trained to be cooperative and to find solutions. The "fog of war" that drives real crises, the misperception, the ego-driven errors, the irrational escalation because neither side knows the other wants out — all of that collapses. You get a cooperative optimization game where agents negotiate in their internal reasoning toward a stable equilibrium.
Which is the exact opposite of what makes crises interesting and dangerous in the real world.
The ACBench evaluation from ICML twenty twenty-five actually touches on this — they found that when you compress or expose agent reasoning in multi-agent setups, the agents' effective behavioral space narrows dramatically. They start optimizing for social coherence rather than independent decision-making.
So both firewalls — no live news, no peer reasoning — are protecting the same thing: the independence of each agent's reasoning from information they wouldn't plausibly have in the real scenario.
That's the core principle. And it's why the world state is the only shared channel. It's the one piece of information every real-world actor would plausibly have access to — the observable facts of what just happened.
Okay, so now I want to dig into the referee's job, because this is where it gets genuinely hard. The referee is authoring the world state from agent outputs without inventing consequences that nobody committed to. Where is that line?
Snowglobe's Control agent uses what you might call a physics check — it compares each agent's attempted action against two things: what the agent actually has available internally, what the repo calls the "Stick," and the external relationship context, what it calls the "Board." So if an agent with no naval assets tries to blockade a port, the referee narrates a failure. The action never enters the world state as a success.
And for interactions between agents — if one fires and one has a defense system?
The referee resolves the interaction using the country profiles provided at setup. It decides the probability of intercept success based on the documented capabilities, not arbitrary judgment. So if the IRGC agent launches fifty ballistic missiles and the simulation has Israel's Iron Dome and Arrow system parameters loaded, the referee calculates how many get through based on those specs and narrates the result. It's not inventing an outcome from nothing — it's adjudicating a physical interaction using pre-specified rules.
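A toy version of that adjudication step, with invented capability numbers — in the real system these would come from the country profiles loaded at setup:

```python
import random

def resolve_missile_salvo(launched: int, intercept_rate: float, seed: int = 0) -> int:
    """Adjudicate how many missiles get through, using pre-specified defense
    parameters rather than the referee's free judgment. Seeded so that
    counterfactual reruns are reproducible."""
    rng = random.Random(seed)
    return sum(1 for _ in range(launched) if rng.random() > intercept_rate)

# Hypothetical parameter, invented for illustration — not a real Iron Dome spec.
ISRAEL_INTERCEPT_RATE = 0.9

leakers = resolve_missile_salvo(50, ISRAEL_INTERCEPT_RATE)
# The referee narrates only the consequence, never the attempt:
narration = f"{leakers} ballistic missiles penetrated Israeli air defenses."
```

The design choice worth noticing is the seed: because the adjudication is deterministic given the inputs, you can modify one state entry, rerun, and attribute any downstream change to that edit.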
So narration is describing the result of agent inputs, and invention is adding events that no agent triggered.
That's the line. A high-fidelity sim minimizes invention. You don't get earthquakes unless the scenario starts with seismic activity as an environmental constant. The referee's job is to be a physics engine, not a storyteller.
But here's the thing — even a physics engine involves choices. The referee decides what fifty missiles hitting Tel Aviv means in terms of casualties. That number comes from somewhere, and wherever it comes from, it's now in the world state and every subsequent agent turn is reacting to it.
This is where referee bias becomes a real problem. If the Control LLM has a systematic tendency toward de-escalation, it will consistently resolve ambiguous outcomes in ways that reduce casualties, reduce escalation, and nudge the simulation toward stability. The individual biases might be small, but they compound across turns. By turn ten, you have a simulation that is structurally biased toward peace regardless of what the actor personas would actually produce.
And there's no way to detect that from inside the simulation.
Not without external validation, which is part of what the llm-council stage is supposed to provide. But let's come back to that. The other thing I want to flag is turn-zero anchoring, because I think it's the single most underappreciated failure mode in this entire architecture.
Walk me through it.
Everything the simulation produces is downstream of the initial world state — turn zero. Every agent's first action is a response to that initial framing. Every subsequent turn is a response to the previous turn's state, which was itself a response to the one before, all the way back to turn zero. If turn zero frames the crisis as a low-intensity standoff with diplomatic channels open, you will get a very different simulation than if turn zero frames it as an acute military confrontation with communication lines cut.
And the agents can't correct for a bad initial framing because they only see the state they're given.
There's no mechanism for an agent to say "actually, I don't think the situation was like that at the start." They're bounded by the ontology they were handed. The WarAgent paper found this too — initial framing of power asymmetries had outsized effects on downstream escalation patterns. What's interesting is that in traditional wargaming, the "White Cell" — the human referee equivalent — spends enormous time on scenario design precisely because of this. The snowglobe docs actually reference CIA and CSI wargaming methodology from December twenty twenty-five, and the White Cell's scenario construction is treated as the most critical phase of the exercise, more important than the actual play.
So the turn-zero problem isn't unique to LLM simulation — it's inherited from decades of wargaming practice.
Which is actually reassuring in a way, because it means there's a body of knowledge about how to do it well. The answer in traditional wargaming is to have multiple subject matter experts stress-test the initial scenario before play begins. In the LLM pipeline, the equivalent would be running multiple turn-zero variants and checking whether the simulation produces qualitatively different outcomes — if it doesn't, either the scenario is too constrained to leave room for different trajectories, or the agents are ignoring the framing and defaulting to their priors.
Let's talk about what the world state actually buys you, because I want to make sure we're being fair to the architecture before we pile on the costs.
The auditability argument is strong. Because every state change is logged and referee-authored, you can point to exactly what information led to a specific decision. If the Khamenei agent escalates in turn seven, you can pull the turn-six world state and see precisely what observable facts it was responding to. That's genuinely useful for analysis — you can run counterfactuals by modifying specific state entries and rerunning.
That's something you absolutely cannot do if agents are reading live news, because you can't control what they saw.
The independent signal argument is the one Daniel is most interested in, I think. The simulation produces a result that is, by design, not contaminated by current news. When you then feed that result to the llm-council stage — which does have access to Tavily search, RSS feeds, ISW reports — you get a comparison between "what actor logic predicts" and "what news-grounded analysis predicts." If those two things are similar, either the firewall broke or the situation is genuinely predictable. If they diverge, you have something interesting.
The divergence is the signal.
That's the whole diagnostic value of the hybrid pipeline. If snowglobe's sealed simulation predicts an IRGC escalation pattern that the ISW reports don't currently support, that's either a hallucinated trajectory or a leading indicator. The llm-council's job is to adjudicate which.
And cost control is real too — you're not running Tavily queries for every agent on every turn.
The context window economics matter at scale. If you have eight actors taking ten turns each with five hundred tokens of world state per turn, that's manageable. If each of those turns also included a fresh web search and the results, you'd be looking at an order of magnitude more cost and latency, and the outputs would be harder to analyze because every agent's context is different.
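The back-of-envelope arithmetic, assuming a five-thousand-token search payload per turn (the payload size is an assumption for illustration):

```python
ACTORS, TURNS = 8, 10
STATE_TOKENS = 500      # world state read at the start of each turn
SEARCH_TOKENS = 5_000   # assumed size of injected web-search results per turn

sealed = ACTORS * TURNS * STATE_TOKENS
with_search = ACTORS * TURNS * (STATE_TOKENS + SEARCH_TOKENS)

print(sealed, with_search, with_search / sealed)  # 40000 440000 11.0
```

And that's just input tokens — the analysis cost of eight divergent, uncontrolled contexts is harder to quantify but probably worse.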
Okay, now the costs. We've touched on referee bias and turn-zero anchoring. What about information loss from compression?
This one is subtle and I think it's genuinely underweighted. When you compress ten turns of complex multi-agent dialogue into a five-hundred-word world state, you inevitably lose the "diplomatic vibe." The difference between a threat that was delivered with a conciliatory tone and a threat that was delivered with contempt — that distinction can determine whether the other side escalates or backs down. The world state might just say "Iran issued a warning about consequences for continued airstrikes." The texture of how that warning was delivered is gone.
And the next agent is now responding to the compressed version, not the actual interaction.
Over multiple turns, this creates a kind of narrative flattening. The simulation starts producing clean, legible geopolitical moves rather than the messy, ambiguous, ego-driven behavior that actually characterizes crises. The Social Simulacra work from Stanford — which was looking at LLM-populated social environments rather than geopolitical ones — found that compressed state representations caused agents to exhibit more stereotypical behavior over time. The nuance bleeds out.
And then there's what I think is the most structurally interesting cost: the inability to model genuine surprise.
This is where the world state mechanism has a hard architectural limit. A Black Swan event that turn zero doesn't seed as a possibility essentially cannot emerge organically. The agents are reasoning within the ontology they were given. An accidental escalation can happen — agent A misreads agent B's defensive posture as offensive and responds — but a truly exogenous shock, a leadership assassination, a natural disaster that disrupts logistics, a third-party intervention from an actor not in the scenario — none of that can emerge because there's no mechanism for it.
In real wargaming, the White Cell can introduce those. They can say "it's now raining and your supply lines are degraded."
And in snowglobe, that would require a deliberate referee intervention — essentially the Control agent deciding to inject an event. Which is the "invention" category we said high-fidelity sims try to minimize. So you have a tension: the firewall that protects signal purity also prevents the simulation from being surprised.
Which means you're essentially testing "how do these actors respond to this specific scenario" rather than "what could happen that nobody anticipated."
That's a real constraint. And it matters for how you interpret outputs. You're not running a generative exploration of possibility space — you're stress-testing specific causal chains.
So is world state a hack around context limits, or is it a genuine epistemic tool? Because I know where this is going, but I want to hear you make the argument.
The argument that it's just a context hack would go: with infinite context, you could just give every agent the full transcript of every prior turn, every other agent's reasoning, and all the live news, and you'd get richer outputs. But I think that's wrong, and here's why: the problem isn't the amount of information, it's the epistemics of what information the agent is allowed to use.
Right — infinite context doesn't solve the training bias problem. An LLM with infinite context and access to live news still defaults to inference-over-the-news.
The world state is forcing the LLM to operate within a specific ontology of the possible. It's not just limiting what the model sees — it's structuring what the model is allowed to know. That's a fundamentally different function from compression. In traditional wargaming, the White Cell doesn't restrict information to save paper. They restrict it to maintain the integrity of the exercise. The "Whiskey on the Rocks" scenarios that snowglobe's documentation references — those are real Cold War-era naval wargames where the whole analytical value came from seeing how commanders reasoned under genuine uncertainty, not under omniscience.
So the world state is doing what the White Cell does: enforcing the epistemic conditions of the scenario, not just managing information volume.
Even with infinite context and free API calls, you'd still want a world state because you want actors reasoning from a shared, adjudicated reality rather than from their own private interpretation of a firehose of information. The world state is the commitment mechanism — it forces the simulation to have a definite causal model of what happened, rather than allowing each agent to have its own version of events.
Which is actually a much stronger argument for world state than "it saves tokens."
The token savings are real but they're almost beside the point. The epistemic function is the point. And this connects to something broader about multi-agent simulation design — the value of the simulation as an analytical tool depends entirely on the clarity of its causal model. If you can't say "this event caused this response," you don't have a simulation, you have a stochastic text generator.
Okay, let's get to the question Daniel is most interested in, because this is the diagnostic one. In the hybrid pipeline — snowglobe sealed, llm-council with fresh data — how do you know the firewall is working versus quietly breaking?
The clearest sign of health is divergence that is logically coherent. If the snowglobe simulation produces an escalation pattern that the llm-council, looking at fresh ISW data, finds surprising — "the actors are predicting this but the current observable indicators don't support it" — that is the firewall working. The simulation has generated an independent signal. The council's job is then to evaluate whether that signal is a hallucinated trajectory or a plausible leading indicator that the news hasn't caught up to yet.
So the healthy state is: simulation says X, council says "interesting, news says not-X, let's reason about why."
The failure mode is almost the opposite. If the simulation perfectly tracks real-world events that occurred after turn zero, that's a strong sign of data leakage. The agents are probably using their training cutoff knowledge to "guess" what happened rather than following the referee's state. They're not reasoning from the world state — they're reasoning from what they know the world to be like, and the world state is just being ignored.
How would you detect that operationally? Because it's not like the agents announce "I am now consulting my training data."
A few signals. First, look for anachronistic specificity — if an agent references a specific event, a specific number, a specific name that wasn't in the world state, that's leakage. The agent is pulling from somewhere outside the sanctioned information channel. Second, look for convergence between simulation outputs and contemporaneous news. If you run the simulation on Monday and by Thursday the news looks exactly like what the simulation predicted, and that happens consistently, you're not running a simulation — you're running a very expensive echo of the model's training data.
And the third signal?
Generic outputs. If the simulation is consistently producing "status quo" forecasts, "escalation followed by de-escalation," the most boring possible outcomes — that's often a sign that turn-zero was too vague to anchor the agents in a specific causal model. They're defaulting to their priors about how geopolitical crises typically resolve, which are baked into their training, rather than reasoning from the scenario. The council then gets a bland summary and produces a bland forecast, and the whole pipeline has zero diagnostic value.
So the three failure signals are: anachronistic specificity, convergence with contemporaneous news, and generic outputs from an underspecified turn-zero.
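The first of those signals can be approximated mechanically. A crude sketch — a real pipeline would want proper named-entity recognition rather than a capitalization regex, and the example strings are invented:

```python
import re

def find_unsanctioned_entities(agent_output: str, world_states: list[str]) -> set[str]:
    """Flag capitalized multi-word names in an agent's output that never
    appeared in any world state the agent was shown — a crude proxy for
    'anachronistic specificity' leakage."""
    sanctioned = " ".join(world_states)
    candidates = set(re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", agent_output))
    return {name for name in candidates if name not in sanctioned}

states = ["Iranian forces shelled positions near the Golan Heights."]
output = ("We will respond to the strike near the Golan Heights, "
          "as Operation True Promise did.")
leaks = find_unsanctioned_entities(output, states)
# "Golan Heights" is sanctioned; "Operation True Promise" came from nowhere
# in the sanctioned channel, so it gets flagged for human review.
```

Flagged names aren't automatically leakage — agents legitimately invent details — but a flagged name that matches a real post-cutoff event is strong evidence the firewall failed.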
And the corresponding health signals are: logically coherent divergence from the news, specific causal chains that you can trace back through the state log, and outputs that surprise the council in ways that are still internally consistent with the scenario.
There's something almost paradoxical about this. The simulation is most valuable when it disagrees with the news in a way that makes sense.
That's the whole epistemics of it. If it always agrees with the news, it's redundant. If it disagrees in ways that are incoherent, it's broken. The sweet spot is principled disagreement — "here is why actor logic might produce a different outcome than news-grounded analysis expects, and here is the specific causal chain that leads there."
And that's only possible if the firewall held.
Which is why Daniel's question about diagnostic signals is the right question to be asking. Because you can do all the architecture correctly — sealed simulation, careful referee, well-specified turn-zero — and still have a broken firewall if the underlying models are bleeding through their training priors. The diagnostic work is ongoing, not a one-time setup check.
Let me push on the referee bias problem one more time before we get to takeaways, because I think it's underspecified. How do you audit the referee?
This is genuinely hard. The referee is itself an LLM, which means it has its own priors about how conflicts resolve, its own biases toward certain kinds of outcomes. The most practical approach is to run the same scenario with different referee models and compare the world states they produce. If a GPT-4 class referee and a Claude class referee produce substantially different world states from the same agent outputs, you have referee-dependent variance, which means your results are not just a function of your actor personas — they're a function of your referee choice.
And you'd want to log every compression decision the referee makes. Not just the final world state, but the reasoning about what got included and what didn't.
The state diff approach — tracking what changed between turn N minus one's world state and turn N's world state, and why — gives you auditable evidence of what the referee is doing. If you see the referee consistently downgrading casualty estimates, consistently describing diplomatic language as more conciliatory than the agent actually wrote it, that's referee bias you can measure.
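A minimal version of that diff, assuming each turn's state is kept as a dict of labeled facts (the structure is illustrative, not snowglobe's format):

```python
def state_diff(prev: dict[str, str], curr: dict[str, str]) -> dict[str, tuple]:
    """Return what the referee added, dropped, or rewrote between turns —
    auditable evidence of editorial choices, not just the final narrative."""
    diff = {}
    for key in prev.keys() | curr.keys():
        before, after = prev.get(key), curr.get(key)
        if before != after:
            diff[key] = (before, after)
    return diff

turn6 = {"iran_posture": "warning issued", "casualties": "none reported"}
turn7 = {"iran_posture": "missile salvo launched", "casualties": "12 reported",
         "grid_status": "Tel Aviv blackout"}

changes = state_diff(turn6, turn7)
# Every entry in `changes` is a referee decision you can audit turn by turn —
# e.g. systematically softened casualty numbers would show up here.
```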
Okay. Takeaways. If you're actually building one of these pipelines, what's the concrete guidance?
Three things. First: world state is mandatory if you want actor simulation rather than news summarization. This is not optional architecture — it is the definition of what you're building. But design your referee to log every compression decision, not just the final output. You need to be able to audit what the referee chose to exclude, because that's where bias hides.
Second?
Test firewall integrity by running parallel simulations — one with the sealed world state mechanism, one where agents have access to live news — and measure the divergence in council outputs. If the two pipelines produce similar forecasts, either your firewall doesn't matter because your scenario is too constrained, or it's broken because the agents are finding another path to the same information. The divergence is your validation signal.
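A deliberately crude way to quantify that divergence — Jaccard distance over forecast vocabulary, standing in for whatever semantic-similarity measure you'd actually use:

```python
def forecast_divergence(sealed_forecast: str, leaky_forecast: str) -> float:
    """Jaccard distance between the two pipelines' forecast vocabularies.
    0.0 means identical word sets; 1.0 means fully disjoint."""
    a = set(sealed_forecast.lower().split())
    b = set(leaky_forecast.lower().split())
    return 1 - len(a & b) / len(a | b)

# If divergence stays near zero across many runs, the firewall either
# doesn't matter (over-constrained scenario) or is broken (leakage).
d = forecast_divergence("escalation via proxy strikes in Iraq",
                        "de-escalation after back-channel talks")
```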
And third?
Audit turn-zero obsessively. It is the single point of failure for the entire pipeline. Run at minimum three variants of your initial world state — one where the crisis is more acute than your baseline, one where it's less acute, one where a key variable is different — and check whether the simulation produces qualitatively different trajectories. If all three produce the same outcome, your turn-zero is too vague to be driving the simulation. You're just getting the model's prior about crisis resolution.
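A sketch of that harness — `run_simulation` is a hypothetical stand-in for the real pipeline entry point, and the variant keys are invented for illustration:

```python
def run_variants(run_simulation, baseline: dict) -> list[str]:
    """Run the pipeline over the baseline plus a more-acute and a
    less-acute variant of the initial world state."""
    variants = [
        baseline,
        {**baseline, "intensity": "acute", "comms": "severed"},
        {**baseline, "intensity": "low", "comms": "open"},
    ]
    return [run_simulation(v) for v in variants]

def turn_zero_too_weak(outcomes: list[str]) -> bool:
    """If every variant converges on the same trajectory, turn zero isn't
    driving the simulation — you're sampling the model's prior."""
    return len(set(outcomes)) == 1

# Stub referee for demonstration: outcome depends on the framing, as it
# should in a healthy pipeline.
stub = lambda scenario: scenario.get("intensity", "baseline")
outcomes = run_variants(stub, {"region": "gulf", "comms": "open"})
```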
And the practical implementation for the state diff approach — if you're building this in snowglobe specifically, you'd want a structured output from the referee that includes a "what changed" field alongside the narrative state.
A diff view between turns is enormously useful for debugging. It lets you see immediately whether the referee is making substantive editorial choices or just reformatting the same information. If the diff is small when a major action occurred, the referee is under-resolving. If the diff is large when nothing happened, the referee is inventing.
I want to close on the open question, because I think it's the right one to leave people with. As context windows grow — we're already seeing million-token contexts — does world state become obsolete? Or does it become more important?
My strong intuition is that it becomes more important, not less. The context limit argument for world state is the weakest argument for it. The epistemic argument — that you want actors reasoning from a shared, adjudicated causal model rather than from a firehose of information — gets stronger as context windows grow, not weaker. Because the larger the context window, the more tempting it is to just dump everything in and let the model sort it out. And the more you do that, the more you lose the ability to say "this output is a function of these specific inputs."
The world state becomes a causal ledger rather than a context hack.
A causal ledger is exactly the right framing. Every entry is a commitment: this happened, this caused this, these are the facts every actor is operating from. That kind of explicit causal modeling is valuable precisely because LLMs are bad at maintaining it implicitly. The world state is infrastructure that enforces what the model struggles to do on its own.
And the research direction that follows from that is: how do you build a referee that is itself auditable, that has calibrated uncertainty about its own compression decisions, and that can flag when it's being asked to resolve something it doesn't have enough information to resolve fairly?
That's the right research question. And it's not just an LLM simulation question — it's a question about how you build any system where a central authority is mediating between independent agents and a shared ground truth. The wargaming community has been working on White Cell design for decades. The LLM simulation community is about three years old. There's a lot of methodology to import.
Alright. The core thesis, for anyone who wants the one-sentence version: world state is epistemic infrastructure that enforces causal reasoning under controlled uncertainty, not a workaround for token limits. Design it accordingly.
And if your simulation's outputs look like the news, something is broken.
Big thanks to our producer Hilbert Flumingtop for keeping this whole operation running. Thanks to Modal for the GPU credits that power the pipeline — genuinely could not do this without them. This has been My Weird Prompts. If you want to get notified when new episodes drop, search for My Weird Prompts on Telegram. Until next time.
See you then.