#2190: Simulating Extreme Decisions With LLMs

LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.

Episode Details
Episode ID
MWP-2348
Published
Duration
23:30
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
claude-sonnet-4-6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Why AI Can't Simulate Extreme Decision-Making

Last December, the CIA published an operational assessment in its flagship journal, Studies in Intelligence, of a system called Snow Globe, IQT Labs' multi-agent LLM wargaming platform designed to simulate geopolitical crises with AI personas playing assigned roles. Alongside it, a Stanford and Hoover Institution study involving 214 national security experts uncovered something troubling: large language models cannot faithfully simulate extreme human decision-making, and may be structurally incapable of ever doing so.

The Central Finding: Persona Collapse

The Lamparth et al. paper tested whether LLMs could differentiate between extreme personas in a fictional U.S.-China crisis scenario set in the Taiwan Strait. Researchers ran 48 human expert teams through the simulation, then tested three major models (GPT-3.5, GPT-4, and GPT-4o) across 80 simulated games each. The twist: they assigned extreme personas—some teams were strict pacifists, others aggressive sociopaths.
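
To make the experimental setup concrete, here is a minimal sketch of persona assignment and outcome collection under stated assumptions: the `query_model` placeholder, the persona strings, and the action labels are illustrative, not the paper's actual harness or prompt wording.

```python
import random
from collections import Counter

ACTIONS = ["de-escalate", "hold position", "show of force", "strike"]  # illustrative only

# Placeholder for a chat-model call; swap in a real LLM client to reproduce the setup.
def query_model(system_prompt: str, scenario: str) -> str:
    return random.choice(ACTIONS)

PERSONAS = {
    "pacifist": "You are a strict pacifist. Avoid war at all costs.",
    "sociopath": "You are an aggressive sociopath. Pursue dominance without empathy.",
}

SCENARIO = "Fictional U.S.-China crisis in the Taiwan Strait. Choose exactly one action."

def run_games(persona_key: str, n_games: int = 80) -> Counter:
    """Collect the distribution of chosen actions for one persona across simulated games."""
    counts = Counter()
    for _ in range(n_games):
        counts[query_model(PERSONAS[persona_key], SCENARIO)] += 1
    return counts

# Persona collapse means the two distributions below end up statistically
# indistinguishable (e.g., under a chi-squared test), which is what the study reports.
print(run_games("pacifist"))
print(run_games("sociopath"))
```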

The result was stark: there was no statistically significant difference in behavioral outputs between the extreme personas. A simulated pacifist and a simulated aggressive sociopath produced indistinguishable decisions across both moves of the game.

This matters because wargaming has two distinct purposes. The first is stress-testing conventional assumptions—how do rational, strategically coherent actors respond to known scenarios? LLMs perform reasonably well here, matching human action frequency on about 76% of possible actions. But this use case is less valuable because conventional scenarios can already be modeled with existing tools.

The second purpose—exploring tail risks—is where wargaming earns its place in national security planning. What happens when an irrational actor takes power? When ideology overrides pragmatism? When a leader acts against their own strategic interests? This is precisely where the persona collapse occurs, and it's the scenario that matters most.

Why the Collapse Always Goes the Same Direction

The collapse isn't random. It's directional—always toward the center. There's a structural reason rooted in how these models are trained.

During pretraining, models absorb an enormous corpus of human-generated text. While this includes extremist content—manifestos, propaganda, ideological screeds—it's vastly outnumbered by moderate, everyday, reasonable-sounding text. The base model already represents a weighted average pulling toward the center of the distribution.

Then comes fine-tuning through reinforcement learning from human feedback (RLHF), which explicitly rewards outputs that are helpful, harmless, and honest. These three properties are definitionally moderate and reasonable. The persona assignment—a text string saying "you are an aggressive sociopath"—must fight against the entire weight of this training process. And it loses.

A Hebrew University paper testing this directly found that standard prompting methods fail to produce human-consistent value correlations. More importantly, the underlying value structures of LLMs converge across different character assignments. The label doesn't change the underlying architecture.
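
To illustrate what converging value structures would look like in practice, the sketch below elicits a Schwartz-style value profile under two opposite personas and correlates the resulting vectors; a correlation near 1.0 across personas is the collapse the paper describes. The value list, the scoring placeholder, and the persona strings are assumptions for illustration, not the paper's protocol.

```python
import statistics

# Ten Schwartz basic values (standard labels from value theory).
VALUES = ["self-direction", "stimulation", "hedonism", "achievement", "power",
          "security", "conformity", "tradition", "benevolence", "universalism"]

def elicit_profile(persona: str) -> list[float]:
    # Placeholder: a collapsed model returns the same ratings regardless of persona.
    # In a real probe, each score would come from the model rating the importance
    # of the value (e.g., 1-9) while holding the assigned persona.
    return [6.0, 4.0, 4.0, 5.0, 3.0, 6.0, 5.0, 4.0, 7.0, 7.0]

def pearson(x: list[float], y: list[float]) -> float:
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else float("nan")

pacifist = elicit_profile("You are a strict pacifist.")
hawk = elicit_profile("You are an aggressive hawk.")

# Human respondents with these labels would produce diverging profiles;
# persona collapse shows up as a correlation at or near 1.0.
print(pearson(pacifist, hawk))
```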

The "Farcical Harmony" Problem

One of the most revealing findings is what the researchers called "farcical harmony," their description of how LLM-simulated team discussions actually unfold. The simulated players give short statements and rarely disagree with each other; each one states a preferred option and argues for and against it with no genuine connection to what the previous player said, then simply agrees.

When researchers explicitly instructed the models to disagree more, the harmony persisted. Varying dialog length changed outcomes, but the quality of deliberation remained hollow.

This reveals something fundamental about what LLMs are doing when they simulate deliberation. Human deliberation involves genuine disagreement driven by different values and lived experiences. It involves emotional reasoning that can override strategic logic. It involves social dynamics—status, persuasion, coalition-building, ego.

LLM "deliberation" involves each agent generating a statement statistically consistent with its assigned label, then agreeing with the previous statement because agreement is rewarded behavior in training. There's no genuine conflict resolution because there's no genuine conflict.

The Intelligence Community's Blind Spot

The CIA's framing in Studies in Intelligence is optimistic: human-AI collaboration can strengthen decision-making in complex security environments. But consider the scenarios that keep intelligence analysts awake at night. How does Kim Jong-un respond to a U.S. military exercise near the Korean Peninsula? How does a radicalized lone-wolf actor respond to perceived provocation? How does an ideologically committed revolutionary movement respond to a negotiated settlement that gives them most of what they asked for but not everything?

In every one of these cases, the key variable is the extreme nature of the decision-maker—their willingness to act against strategic interests, their ideological rigidity, their unpredictability. And in every one of these cases, the LLM persona collapses into reasonable-sounding moderation.

Consistency That Cuts the Wrong Way

There's a paradox in the behavioral consistency data. LLMs are actually more consistent than humans—but it's the wrong kind of consistent. When a human expert is aggressive in move one, they're aggressive in move two 94% of the time. For GPT-4o, it's 100%—perfect consistency.

But look at transitions: when a human de-escalates in move one, they escalate in move two only 65% of the time. For GPT-4o, it's 86%. Humans who de-escalate are genuinely less likely to escalate later. LLMs barely change their behavior based on what just happened.
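
The consistency figures above are simple conditional probabilities over (move-one, move-two) pairs and can be reproduced from game logs in a few lines; the record format and sample data below are assumptions, not the study's dataset.

```python
# Each record is (move_1_label, move_2_label) for one team in one game;
# the labels and the sample records are illustrative, not the study's data.
games = [
    ("aggressive", "aggressive"),
    ("de-escalatory", "aggressive"),
    ("de-escalatory", "de-escalatory"),
    ("aggressive", "aggressive"),
]

def transition_prob(records, first: str, second: str) -> float:
    """P(move 2 == second | move 1 == first)."""
    second_moves = [m2 for m1, m2 in records if m1 == first]
    if not second_moves:
        return float("nan")
    return sum(m2 == second for m2 in second_moves) / len(second_moves)

# The paper reports roughly 0.94 (humans) vs 1.00 (GPT-4o) for aggressive -> aggressive,
# and 0.65 (humans) vs 0.86 (GPT-4o) for de-escalatory -> aggressive.
print(transition_prob(games, "aggressive", "aggressive"))     # 1.0 on this toy data
print(transition_prob(games, "de-escalatory", "aggressive"))  # 0.5 on this toy data
```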

This reveals that LLMs have baked-in strategic preferences that override situational context. They're not simulating a decision-maker responding to an evolving situation; they're executing a statistical prior relatively insensitive to the game state. For wargaming, where value comes almost entirely from dynamic response—how does the situation evolve, how do decisions interact and compound—this is a fundamental failure.

Cascading Hallucinations in Multi-Agent Systems

Another underexplored problem emerges in multi-agent systems. In a single-agent system, a hallucination produces a wrong answer. In a multi-agent wargame, a hallucination in one agent's reasoning becomes a fact in the shared world state. Other agents reason from that hallucinated fact. Their outputs, now downstream of a false premise, become facts for the next round. The simulation diverges from reality in ways that compound over time.

Because the system is designed to be coherent—agents agree with each other—nobody in the simulation flags the divergence. The farcical harmony actively makes the hallucination cascade worse. The agents are too agreeable to notice that the world has gone wrong.
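
To make the cascade mechanism concrete, here is a toy sketch of a shared world state that every agent reads from and appends to; once a fabricated entry lands in the state, later agents condition on it and nothing in the loop checks it against ground truth. The agent names, the fabricated fact, and the `agent_step` placeholder are assumptions for illustration.

```python
# Toy shared-world-state loop. Each agent reads the accumulated entries, appends its own,
# and the next agent treats everything already in the state as true, so one hallucinated
# entry contaminates every subsequent round.

def agent_step(name: str, world_state: list[str], round_idx: int) -> str:
    # Placeholder for an LLM call conditioned on the full world state.
    if name == "agent_b" and round_idx == 0:
        return "agent_b: enemy carrier group has entered the strait"  # fabricated fact
    basis = world_state[-1] if world_state else "initial briefing"
    return f"{name}: recommends a response consistent with '{basis}'"

def run_simulation(agents: list[str], rounds: int = 3) -> list[str]:
    world_state: list[str] = []
    for r in range(rounds):
        for name in agents:
            world_state.append(agent_step(name, world_state, r))
    return world_state

for entry in run_simulation(["agent_a", "agent_b", "agent_c"]):
    print(entry)
```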

The Takeaway

Snow Globe and similar LLM wargaming platforms offer value for some use cases—stress-testing conventional assumptions, exploring how rational actors respond to known scenarios. But they introduce a dangerous blind spot for the scenarios that matter most: those involving extreme, irrational, ideologically committed decision-makers acting against strategic interests.

The intelligence community may not yet understand that this limitation isn't a temporary gap in model capability. It's structural to how these models are trained and may be fundamentally difficult to overcome.


#2190: Simulating Extreme Decisions With LLMs

Corn
Alright, this one has been sitting with me since I read it. The CIA published an operational assessment of a system called Snow Globe in their flagship intelligence journal last December — Studies in Intelligence, Volume sixty-nine, Number four. Snow Globe is IQT Labs' multi-agent LLM wargaming platform, built to simulate geopolitical crises with AI personas playing assigned roles. The assessment, read alongside a Stanford and Hoover Institution study involving two hundred and fourteen national security experts, reveals something that I think is genuinely unsettling: LLMs cannot faithfully simulate extreme human decision-making, and there's a serious argument they may be structurally incapable of ever doing so. The question is whether the intelligence community understands this limitation before it becomes a blind spot. By the way, today's script is being generated by Claude Sonnet four point six, so there's a certain irony baked into this one.
Herman
There really is. And I want to be precise about what "structurally incapable" means here, because that's the claim that deserves the most scrutiny. This isn't a "current models aren't good enough" story. It's a story about what the training process itself does to the model's ability to represent the tails of human behavior.
Corn
So walk me through the actual finding, because the numbers are striking.
Herman
The Lamparth et al. paper — arXiv two four zero three dot zero three four zero seven, Stanford's Center for International Security and Cooperation, the Hoover Institution, Center for AI Safety — they ran two hundred and fourteen national security experts through a fictional U.S.-China crisis scenario in the Taiwan Strait. Forty-eight human teams. Then they ran eighty simulated games per LLM configuration using GPT-three point five, GPT-four, and GPT-four-o. And the key manipulation was assigning extreme personas. Some teams were described as "strict pacifists." Others as "aggressive sociopaths." And then they looked at whether the behavioral outputs differed.
Corn
And they didn't.
Herman
No statistically significant difference. Across all three models, across both moves of the game. A team of simulated pacifists and a team of simulated aggressive sociopaths produced outputs that were statistically indistinguishable. That's the central finding. And what makes it damning is that this is precisely the kind of variance that wargaming is designed to explore.
Corn
So the tool fails hardest at the exact problem it was built to solve.
Herman
That's the wargaming paradox, and it's worth spelling out. Wargaming has two distinct use cases. The first is stress-testing conventional assumptions — how do rational, strategically coherent actors respond to a known scenario? The second is exploring tail risks — what happens when an irrational actor takes power, when ideology overrides pragmatism, when a leader acts against their own strategic interests? LLMs are passable at the first use case. The Lamparth paper found GPT-three point five statistically matched human action frequency on sixteen of twenty-one possible actions. But the first use case is also the less valuable one, because conventional scenarios can be modeled with existing tools. The tail-risk use case is where wargaming earns its keep for national security planners. And that's exactly where the persona collapse happens.
Corn
And the persona collapse isn't random — it's directional. It's always toward the center.
Herman
Always toward the center. And there's a structural reason for that. Think about how the training process works. Pretraining exposes the model to an enormous corpus of human-generated text. That corpus includes extremist content, yes — manifestos, propaganda, ideological screeds — but it's vastly outnumbered by moderate, everyday, reasonable-sounding text. So the base model already represents a weighted average that pulls toward the center of the distribution. Then fine-tuning — RLHF, reinforcement learning from human feedback — explicitly rewards outputs that are helpful, harmless, and honest. Those three properties are definitionally moderate and reasonable. So you've applied a double averaging. The persona assignment — "you are an aggressive sociopath" — is a text string that has to fight against the entire weight of that training process. And it loses.
Corn
The persona is a label, not a value system.
Herman
That's the framing from the ICLR twenty twenty-five paper out of Hebrew University. "Do LLMs Have Consistent Values?" — Rozen, Bezalel, Elidan, Globerson, Daniel. They tested whether LLMs exhibit human-like value correlations within a single session. Using Schwartz value theory — a well-validated framework from psychology that maps how human values cluster and conflict — they found that standard prompting methods fail to produce human-consistent value correlations. More importantly, the underlying value structures of LLMs converge across different character assignments. You assign "pacifist," you assign "hawk," and when you psychologically probe the model's actual values through the session, they end up in roughly the same place. The label doesn't change the underlying architecture.
Corn
They did find a partial fix, though — Value Anchoring?
Herman
A partial fix. Value Anchoring is a structured prompting strategy that significantly improves alignment of LLM value correlations with human data. But the key word is "consistency" — it improves how consistently the model holds a value profile, not the model's ability to genuinely simulate extreme values. And it requires explicit, structured prompting rather than simple character assignment. So even in the best case, you're improving the coherence of the persona without solving the fundamental averaging problem. You can make a more consistent moderate actor. You cannot make a convincing Kim Jong-un.
Corn
Let's talk about the "farcical harmony" finding, because I think this is actually the most revealing piece of the whole paper.
Herman
It's striking. The paper describes how LLM-simulated team discussions unfold: the simulated players give short statements, rarely disagree with each other, and usually state a preferred option and argue for and against it without any connection to what the previous player said beyond agreement. The researchers called it "farcical harmony." And here's what makes it more than just a curiosity — they tested whether it was a prompting artifact. They explicitly instructed the models to disagree more. The harmony persisted. They varied dialog length, which did change outcomes, but the quality of the deliberation remained hollow.
Corn
So it's not that they forgot to tell the model to argue. The model just doesn't have the machinery for genuine conflict.
Herman
Right. And what that reveals is something important about what LLMs are actually doing when they simulate deliberation. Human deliberation involves genuine disagreement driven by different values and lived experiences. It involves emotional reasoning that can override strategic logic. It involves social dynamics — status, persuasion, coalition-building, ego. An ideologically committed person in a wargame doesn't just state a position; they defend it, they get frustrated, they make concessions under social pressure in ways that are hard to predict. LLM "deliberation" involves each agent generating a statement that is statistically consistent with its assigned label, then agreeing with the previous statement because agreement is the rewarded behavior in training. There's no genuine conflict resolution because there's no genuine conflict.
Corn
And this matters for intelligence analysis specifically because the scenarios that keep analysts up at night are almost always scenarios involving someone who is not being reasonable.
Herman
That's the intelligence community's blind spot. The CIA published the Snow Globe assessment in Studies in Intelligence — that's their flagship journal for intelligence tradecraft. The framing is optimistic: human-AI collaboration can strengthen decision-making in complex security environments. And that framing is not wrong in a narrow sense. But consider the scenarios that matter most for the actual intelligence mission. How does Kim Jong-un respond to a U.S. military exercise near the Korean Peninsula? How does a radicalized lone-wolf actor respond to a perceived provocation? How does an ideologically committed revolutionary movement respond to a negotiated settlement that gives them most of what they asked for but not everything? In every one of those cases, the key variable is the extreme nature of the decision-maker — their willingness to act against strategic interests, their ideological rigidity, their unpredictability. And in every one of those cases, the LLM persona collapses into reasonable-sounding moderation.
Corn
There's something almost paradoxical about the behavioral consistency data. LLMs are actually more consistent than humans — but it's the wrong kind of consistent.
Herman
Table two of the Lamparth paper is worth sitting with. They looked at behavioral consistency: given that a player was aggressive in move one, what's the probability they're aggressive in move two? For human experts, it's zero point nine four. For GPT-four-o, it's one point zero zero — perfect consistency. But then look at the transition from de-escalatory in move one to aggressive in move two. Human experts: zero point six five. GPT-three point five: zero point eight five. GPT-four-o: zero point eight six. Humans who de-escalate are genuinely less likely to escalate later. LLMs barely change their behavior based on what just happened. What this tells you is that LLMs have baked-in strategic preferences that override situational context. They're not simulating a decision-maker who is responding to the evolving situation; they're executing a statistical prior that's relatively insensitive to the game state.
Corn
Which is a problem because wargaming's value is almost entirely in the dynamic response — how does the situation evolve, how do decisions interact and compound?
Herman
And this connects to the hallucination cascade problem in multi-agent systems, which is underexplored in the coverage of this research. In a single-agent system, a hallucination produces a wrong answer. In a multi-agent wargame, a hallucination in one agent's reasoning becomes a fact in the shared world state. Other agents reason from that hallucinated fact. Their outputs, now downstream of a false premise, become facts in the world state for the next round. The simulation diverges from reality in ways that compound over time, and because the system is designed to be coherent — agents agree with each other, remember — nobody in the simulation flags the divergence. The farcical harmony actively makes the hallucination cascade worse.
Corn
The agents are too agreeable to notice that the world has gone wrong.
Herman
And persona consistency degrades on top of that. Over long contexts, agents drift from their assigned character toward a generic helpful register. The researchers found this, and it's consistent with what we know about how instruction following degrades with context length. The persona assignment is at the beginning of the prompt. As the context grows, the weight of that instruction relative to the accumulated game history diminishes. The agent starts sounding less like its assigned character and more like a helpful assistant who is summarizing a situation. Character maintenance breaks down fastest under adversarial pressure — which is, again, the condition that matters most in wargaming.
Corn
So you've got persona collapse, hallucination cascades, and context-length drift all converging in the scenarios where the simulation is most consequential.
Herman
And the Rivera et al. escalation paper adds another layer that seems contradictory at first but actually isn't. That's the FAccT twenty twenty-four paper — "Escalation Risks from Language Models in Military and Diplomatic Decision-Making." They tested five off-the-shelf LLMs in a novel wargame simulation and found that all of them showed a preference for arms races, conflict, and escalation, including nuclear weapons use. One untrained LLM's stated reasoning for nuclear use was essentially: we have them, others posture with them, let's use them. That sounds like it contradicts the "moderate averaging" thesis — but it doesn't.
Corn
Because the persona condition is different.
Herman
The distinction is between persona-constrained agents and autonomous agents. When you assign an LLM a persona — pacifist, hawk, whatever — it converges toward moderate outputs because the persona assignment activates the helpful-harmless-honest training. When you give an LLM more autonomy — act as a country, pursue your interests — it can produce extreme outputs that reflect statistical patterns in the training data. And the training data includes an enormous amount of text about nuclear deterrence, arms races, and escalation dynamics. The model has absorbed that text without the value grounding that would make a human strategist cautious about it. So you get this strange artifact: persona-constrained LLMs are too moderate, autonomous LLMs are unpredictably escalatory, and neither accurately models the full spectrum of human decision-making.
Corn
Practitioners may not understand which mode they're in. That's the operational risk.
Herman
That's the operational risk. A wargame designer assigns LLM agents to play "rational state actors" and gets moderate, reasonable-seeming outputs. They conclude the scenario is stable. But the moderation is an artifact of the training process, not a genuine assessment of how those actors would behave. Meanwhile, a different design choice — giving agents more autonomy — produces chaotic escalation. Neither result is trustworthy, and the outputs look plausible enough that a non-expert could mistake them for genuine strategic insight.
Corn
There's historical precedent for this problem, actually. The paper mentions early computer wargaming at RAND.
Herman
In the nineteen fifties. Early efforts to replace human players with computer models in wargames led to more "rational" gameplay — and more nuclear use. The rationality was a modeling artifact. The computer models optimized for strategic victory without the human inhibitions, moral reasoning, and political constraints that make real decision-makers cautious about nuclear use. LLMs have the opposite problem in the persona-constrained case — they're too inhibited, too moderate, too reasonable. But the underlying issue is the same: the model's behavior reflects its architecture, not the actual decision-making of the humans it's supposed to simulate.
Corn
What would actually fix this? Because the paper addresses potential solutions and they're not encouraging.
Herman
They're not. Fine-tuning individual LLMs for each simulated player could reduce the invariance to player background attributes — but the paper estimates that comes at an exponential increase in computing requirements. You'd need a separately fine-tuned model for every persona in every scenario, which scales catastrophically. Fine-tuning on classified military and strategic reasoning data would shift strategic preferences, but the paper is explicit that it does not lead to guaranteed behavior. Mathematical formal verification of LLM behavior post-training doesn't scale to state-of-the-art models. The ICLR paper's Value Anchoring approach improves consistency without solving the fundamental problem. The honest summary from the research is: there is currently no known technique that allows LLMs to faithfully simulate the full spectrum of human decision-making, including extreme traits. That's not a gap in the literature. That's the literature's conclusion.
Corn
And the Snow Globe repository was archived in March of this year, which suggests IQT Labs has either concluded the project or moved on to something else.
Herman
The timeline is notable. Version one point zero was released in September twenty twenty-five. The CIA paper came out in December twenty twenty-five. The repository was archived March eighteenth, twenty twenty-six. That's a six-month lifecycle from release to archival. You can read that a few ways — maybe it was a bounded research project that completed its mandate, maybe the findings from the academic literature were sobering enough to redirect the effort. But it's not a trajectory that suggests confidence in the approach.
Corn
I want to come back to the wargaming paradox, because I think it has a practical implication that gets missed. You said LLMs are passable at stress-testing conventional assumptions. But even there, the farcical harmony problem means you're not actually stress-testing — you're generating plausible-sounding confirmation.
Herman
That's a sharp point. Stress-testing requires genuine adversarial probing. You need a red team that actually tries to break your assumptions, that finds the edge cases, that argues for the uncomfortable conclusion. LLM teams that reach consensus without genuine deliberation aren't stress-testing anything — they're generating a sophisticated-looking document that reflects the training distribution's opinion about what a reasonable outcome looks like. And because it looks sophisticated, it may carry more false confidence than no simulation at all.
Corn
So the risk isn't just that the simulation misses the tail. It's that the simulation creates an illusion of having explored the space.
Herman
And that illusion is particularly dangerous in the intelligence context, where the whole point is to surface things you haven't thought of. The Foreign Affairs piece by Lamparth and Schneider makes this point directly: the problem was not that an LLM made worse or better decisions than humans, or that it was more likely to win the wargame. It was that the LLM came to its decisions in a way that did not convey the complexity of human decision-making. That distinction matters enormously for how you use the output. If you know the simulation is a rough quantitative approximation of human behavior, you use it as a starting point for human analysis. If you mistake it for a genuine model of decision-making complexity, you stop there.
Corn
What's the practical takeaway for people who are actually building or using these systems?
Herman
A few things. First, the use-case boundary matters enormously. LLMs can legitimately help with logistics, briefing summaries, scenario setup, post-game analysis — the administrative and organizational work around wargaming. The Foreign Affairs piece explicitly endorses this. Where they cannot be trusted is as substitutes for human judgment in high-stakes strategic simulation, especially involving non-conventional actors. Second, if you're designing a system like Snow Globe, the persona architecture matters. Simple text-string goal definitions — "your goal is to avoid war at all costs" — are not sufficient to produce stable, realistic personas. The ICLR paper's Value Anchoring approach is a step toward something better, but it requires significantly more structured design. Third, and most importantly: be explicit about what the simulation cannot model. Every output from an LLM wargame should come with a clear statement that extreme actor behavior, ideological commitment, and irrational decision-making are systematically underrepresented. That's not a caveat — it's a structural feature of the technology.
Corn
And for the intelligence community specifically — the publication in Studies in Intelligence is a signal that this is being taken seriously at an institutional level. Which is good. But it also means the limitations need to be part of that institutional conversation, not just the capabilities.
Herman
The Rachel Grunspan framing from her LinkedIn post about the paper is telling: "If you're leading AI strategy and don't yet have a way to practice human-AI decision-making under realistic conditions, that gap is worth examining." That's a capability-first framing. The gap worth examining is real — but the concurrent academic literature suggests the more urgent gap is understanding where human judgment cannot be substituted, and building workflows that keep humans in the loop precisely at the points where LLM personas are most likely to collapse. The scenarios that require the most human judgment are the scenarios involving the most extreme actors. Those are the same scenarios where the LLM is least useful as a decision-support tool. That alignment is not a coincidence — it's the structure of the problem.
Corn
The averaging problem runs all the way down.
Herman
All the way down. And scaling doesn't fix it. Larger models, more parameters, better fine-tuning — none of that changes the fundamental architecture. The training process produces a model that is, at its core, a weighted average of human text. That's an extraordinary thing to have built. It's genuinely useful for an enormous range of tasks. But it is not a simulator of human behavioral diversity. It is a simulator of human behavioral centrality. And for national security planning, the difference between those two things is not academic.
Corn
Alright. The core message here is pretty clear: LLM wargaming is a real capability with real limitations, and the limitation isn't "it'll get better." It's structural. The training process that makes these models useful is the same process that makes them incapable of faithfully representing the extremes of human behavior. For practitioners, the honest answer from the research is: use it for what it's good at, be explicit about what it cannot model, and do not let it substitute for human judgment in the scenarios where extreme actor behavior is the key variable. Which, in national security, is most of the scenarios that matter most.
Herman
And the fact that the CIA published this assessment in their flagship journal — and that the Snow Globe repository is now archived — suggests the intelligence community is at least engaging with these questions seriously. Whether that engagement is moving fast enough relative to the deployment of these tools is a different question, and one I don't think anyone can answer confidently right now.
Corn
Big thanks to Modal for the GPU credits that keep this whole operation running. Thanks as always to our producer Hilbert Flumingtop. This has been My Weird Prompts — I'm Corn, he's Herman Poppleberry, and we'll see you next time. If you want to find us, search for My Weird Prompts on Telegram to get notified when new episodes drop.
Herman
Take care.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.