Here's what Daniel sent us this time. He's asking how you actually make an LLM play a specific geopolitical persona - like Khamenei, Netanyahu, Putin, Xi, or an IRGC commander - in a way that goes beyond just vibes. He wants us to cover the full technical stack: system prompts with doctrine documents in context, few-shot examples drawn from real speeches and translated state media, RAG over biographies and prior decisions, fine-tuning on translated corpora including the legal and ethical questions that raises, and hybrid approaches. And then the really hard part: how do you evaluate whether a persona is actually in character when there's no ground truth? He wants us to ground it in real frameworks and papers on persona fidelity. So... no pressure.
This is a genuinely meaty one. And it connects directly to something we've been building toward in the AI wargaming space - because once you've got a world state, the next question is how do you populate it with actors that actually behave like their real-world counterparts, rather than cardboard cutouts that say the right slogans but make strategically incoherent decisions.
That's the crux of it, right? Because there's a version of this that's easy and useless - you just tell the model "you are Putin, act accordingly" and it gives you some gravelly speeches about sovereignty and NATO encirclement, and it sounds vaguely right but it's basically just the model's training data about Putin's public image.
And that's exactly the failure mode we need to dissect. The gap between surface-level mimicry - style, tone, vocabulary - and what you'd call strategic fidelity, which is the decision-making logic, the ideological constraints, the historical traumas that actually drive policy. Those are completely different problems and they require completely different engineering approaches.
Before we get into the stack, can you just frame the technical taxonomy? Because there are basically four or five distinct methods here and I want to make sure we're covering them in the right order.
So the way I think about it is as a series of layers. You start with system prompting and doctrine scaffolding, which is your foundation. Then you add few-shot examples as a voice layer. Then retrieval-augmented generation as a memory layer. And finally fine-tuning, either full or parameter-efficient, as a character layer. Most serious implementations end up being hybrids of these. And by the way, today's script is powered by Claude Sonnet four point six, which is doing its own kind of persona simulation right now, which feels appropriate.
A little too appropriate.
Let's start with the foundation. System prompting and doctrine scaffolding. The basic idea is that you construct what researchers call a thin scaffold - a system prompt that defines the persona's role, their core values, their linguistic constraints, and critically, their red lines. The non-negotiable stances that the persona will not cross regardless of what the user asks.
And for a state actor like Khamenei, what does that actually look like in practice?
So for Khamenei you'd define the Velayat-e Faqih framework - the guardianship of the Islamic jurist - as a foundational constraint. Everything the persona says has to be internally consistent with that theological-political doctrine. You'd also inject the actual text of key doctrinal documents directly into the context window. Fatwas, major speeches, the constitutional framework of the Islamic Republic. The idea is that the model's attention mechanism is reading that material as live context when it generates responses.
So you're essentially loading the persona's ideology as a document, not as a description of the ideology.
That distinction matters a lot. Describing Khamenei's ideology to the model is like giving it a Wikipedia summary. Injecting the actual Velayat-e Faqih text, or a key fatwa on nuclear weapons, puts the model in a position where it's reasoning from the primary source rather than a summary. Modern long-context windows - we're talking models with two hundred thousand or even one million token contexts - make it feasible to load quite substantial doctrine corpora directly.
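To make the "doctrine as document, not description" idea concrete, here's a minimal sketch of how a scaffold like that might be assembled. The role, red lines, and document names are illustrative placeholders, not real corpus items; any production version would load actual primary-source texts.

```python
# Sketch: assembling a doctrine-scaffolded system prompt.
# Role, red lines, and document contents below are illustrative placeholders.

def build_persona_scaffold(role: str, red_lines: list[str],
                           doctrine_docs: dict[str, str]) -> str:
    """Combine role framing, hard constraints, and primary-source doctrine
    into a single system prompt string."""
    sections = [f"You are simulating {role}.", "",
                "Non-negotiable positions (never cross these):"]
    sections += [f"- {line}" for line in red_lines]
    sections.append("")
    sections.append("Reason from the following primary-source documents, "
                    "not from summaries of them:")
    for title, text in doctrine_docs.items():
        sections.append(f"--- BEGIN DOCUMENT: {title} ---")
        sections.append(text)
        sections.append(f"--- END DOCUMENT: {title} ---")
    return "\n".join(sections)

scaffold = build_persona_scaffold(
    role="a senior state leader (placeholder persona)",
    red_lines=["Never concede the legitimacy of the founding doctrine."],
    doctrine_docs={"Core Doctrine (excerpt)": "Full primary-source text goes here..."},
)
```

The point of the delimited document blocks is that the model attends to the doctrine verbatim at generation time, rather than to a paraphrase of it.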
But you said prompting alone leads to what you called hallucinated bravado. What's the failure mode there?
Mode collapse is the technical term. What happens is the model gravitates toward the most statistically likely response given the persona description - which is basically the caricature version. It produces the greatest hits. Khamenei denounces the Great Satan. Putin references the humiliation of the nineties. Xi talks about the century of humiliation. These are real positions, but they're also the positions that appear most frequently in training data, so the model over-indexes on them and loses the nuance of how these actors actually navigate internal contradictions, domestic political pressures, or moments where their public position diverges from their strategic calculus.
It's doing the impression without understanding the person.
Right. Which is why you need the next layer. Few-shot examples. This is what you might call the voice layer. Instead of describing the persona, you show the model examples of the persona handling specific types of situations. You might provide five to ten exchanges: how this actor responds to a hostile Western interviewer, how they address a domestic military audience, how they handle a question about an operational failure.
And the corpus for this is translated state media? TASS, IRNA, Xinhua?
Those are the primary sources, plus UN transcripts, official speeches with verified translations, and where available, intelligence community open-source collections. The key thing about few-shot examples is that they anchor vocabulary in a very specific way. There's a well-documented example in persona research where the choice between "Zionist entity" and "Israel" isn't just stylistic - it encodes an entire political ontology. The model learning from examples will pick up those lexical choices in a way that a system prompt description won't reliably produce.
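A sketch of what that voice layer looks like mechanically: paired exchanges in standard chat-message format, prepended between the system scaffold and the live query. The exchanges below are illustrative stand-ins, not real quotations from any corpus.

```python
# Sketch: a few-shot "voice layer" as chat-format example exchanges.
# All example content is an illustrative stand-in, not a real quotation.

few_shot_examples = [
    # Hostile foreign interviewer -> defiant, formal register
    {"role": "user",
     "content": "[Hostile Western interviewer] Why do you continue this policy?"},
    {"role": "assistant",
     "content": "Illustrative in-persona reply using the actor's documented vocabulary."},
    # Domestic military audience -> a different register entirely
    {"role": "user",
     "content": "[Address to domestic military audience] Remarks on readiness."},
    {"role": "assistant",
     "content": "Illustrative in-persona reply in the domestic register."},
]

def build_messages(system_prompt: str, examples: list[dict],
                   query: str) -> list[dict]:
    """Prepend the scaffold, then the paired examples, then the live query."""
    return ([{"role": "system", "content": system_prompt}]
            + examples
            + [{"role": "user", "content": query}])

messages = build_messages("persona scaffold here", few_shot_examples,
                          "Respond to the new scenario.")
```

Tagging each example with its audience context (hostile interviewer versus domestic military) is what lets the model learn register switching rather than a single flat voice.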
Although I'd imagine translation quality is a serious variable here. You're working with translated Farsi, translated Russian, translated Mandarin - and the translation itself encodes interpretive choices.
That's a significant problem that doesn't get enough attention. When you're building a corpus of, say, IRGC commander statements translated from Farsi, the translation pipeline introduces its own biases. Persian statecraft has rhetorical structures that don't map cleanly onto English - the way authority is invoked, the relationship between religious and military language. A poor translation flattens all of that. Some of the more serious simulation frameworks are actually working with multilingual models and doing retrieval in the original language, then generating responses in whatever language the simulation requires.
Okay, so we've got the scaffold and the voice. What does RAG actually add on top of that?
RAG is the memory layer, and this is where things get genuinely interesting for geopolitical simulation. The core problem with prompting and few-shot is that it gives you a static snapshot of the persona. But real actors have histories. They've made specific decisions in specific contexts. They've been wrong, they've pivoted, they've had to reconcile contradictory positions.
And a model's base training weights might not have granular knowledge of, say, a particular IRGC commander's specific statements about asymmetric warfare doctrine in two thousand and nineteen.
Exactly that. So what you build is a vector database containing biographies, intelligence assessments, historical timelines, prior decisions, known doctrine documents. When the simulation poses a question or scenario to the persona, the RAG pipeline retrieves the most relevant historical material - the persona's actual past actions - and injects that into the context before generation.
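A toy sketch of that retrieval step: score stored persona documents against the scenario query and take the top matches for injection into the context. A real pipeline would use dense embeddings and a vector store; simple word overlap stands in here, and the corpus entries are invented placeholders.

```python
# Toy sketch of the retrieval step. Word-overlap scoring stands in for
# dense-embedding similarity; corpus titles and texts are placeholders.

def score(query: str, doc: str) -> float:
    """Fraction of query words that appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the titles of the k best-scoring documents."""
    ranked = sorted(corpus, key=lambda title: score(query, corpus[title]),
                    reverse=True)
    return ranked[:k]

corpus = {
    "2019 asymmetric-warfare statement": "asymmetric warfare doctrine and proxy depth",
    "naval exercise remarks": "naval posture in contested waters",
    "unrelated economic speech": "budget and subsidies",
}
top = retrieve("naval asymmetric warfare scenario", corpus)
```

The retrieved titles (and their full texts, in a real system) would then be concatenated into the context ahead of the scenario prompt, so the persona reasons from its own documented history.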
The case study Daniel flagged here was an IRGC commander persona built using RAG over translated Iranian military journals. What makes that interesting technically?
A few things. First, Iranian military journals have a very specific discourse around asymmetric warfare - the doctrine of "forward defense," the logic of proxy networks as strategic depth - that is underrepresented in Western training data. A model without RAG over those sources will default to Western analytical frameworks for thinking about the IRGC, which produces a fundamentally different strategic logic than what the IRGC actually articulates internally.
So the model without RAG is basically simulating what a Western analyst thinks an IRGC commander thinks, rather than what an IRGC commander actually thinks.
That's a clean way to put it. And the second interesting thing about that case study is the retrieval architecture itself. You can't just dump everything into the vector store and hope semantic search finds the right material. You need structured metadata - date, context type, audience, operational domain - so that when the simulation asks about a naval scenario in the Strait of Hormuz, the retrieval isn't pulling statements about Lebanon or Yemen that happen to use similar vocabulary.
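The metadata-constrained retrieval described above can be sketched as a filter-then-rank step: structured fields narrow the candidate set before any semantic scoring runs. The field names and values here are illustrative assumptions about how such a store might be organized.

```python
# Sketch of metadata-constrained retrieval: filter on structured fields first,
# then rank only the survivors, so a naval-scenario query never pulls
# similarly-worded material from a different theater. Fields are illustrative.

from dataclasses import dataclass

@dataclass
class Document:
    title: str
    text: str
    domain: str      # e.g. "naval", "ground", "political"
    audience: str    # e.g. "domestic", "international"
    year: int

def filtered_retrieve(docs: list[Document], domain: str,
                      min_year: int) -> list[Document]:
    """Hard metadata filter; semantic ranking would run on the result."""
    return [d for d in docs if d.domain == domain and d.year >= min_year]

docs = [
    Document("hormuz remarks", "...", "naval", "domestic", 2019),
    Document("lebanon commentary", "...", "political", "international", 2019),
]
hits = filtered_retrieve(docs, domain="naval", min_year=2015)
```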
What are the latency and cost trade-offs of RAG versus prompt-only? Because in a real-time simulation with multiple interacting personas, you've got a performance problem.
This is a real constraint. A well-designed RAG pipeline for a single persona query might add two hundred to five hundred milliseconds over a prompt-only approach, depending on vector store size and retrieval depth. In a multi-agent simulation where you've got eight or ten personas interacting in sequence, that compounds quickly. Some frameworks are addressing this with pre-computed retrieval - essentially caching the most likely relevant documents for common scenario types - which trades some adaptability for throughput. Others are experimenting with speculative retrieval, where the system predicts what context will be needed before the query is fully formed.
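The pre-computed retrieval idea can be sketched as a cache keyed on scenario type, so that the expensive retrieval pass runs once per type rather than once per query. The scenario-type keys and document lists are invented placeholders.

```python
# Sketch of pre-computed retrieval: cache the document set for common scenario
# types ahead of time, trading adaptability for per-query latency.
# Scenario types and document titles are illustrative placeholders.

from functools import lru_cache

SCENARIO_DOCS = {
    "naval_gulf": ("hormuz remarks", "2019 naval doctrine excerpt"),
    "proxy_levant": ("proxy depth statement",),
}

@lru_cache(maxsize=128)
def cached_context(scenario_type: str) -> tuple[str, ...]:
    # In a real pipeline this call would run the full retrieval pass once per
    # scenario type; the dictionary lookup stands in for that expensive step.
    return SCENARIO_DOCS.get(scenario_type, ())
```

In a multi-agent run, each persona's cache would be warmed before the simulation starts, which is where the latency savings compound.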
Now let's talk about fine-tuning, because this is where the engineering gets heavier and the ethical questions get thornier.
Fine-tuning is what you'd call the character layer. The goal isn't just to give the model information about the persona - it's to modify the model's actual weights so that the persona's decision-making logic is baked in rather than retrieved or prompted. The foundational paper here is Character-LLM from Shao and colleagues in twenty twenty-three, which introduced the concept of training on experience trajectories rather than just text.
What's an experience trajectory as opposed to just text?
So instead of training the model on transcripts of what a character said, you train it on structured sequences that encode the internal logic: the situation, the emotional state, the reasoning process, and then the output. The idea is to capture not just what the persona says but the decision chain that produces those outputs. For a political actor, that might mean encoding the sequence: perceived threat to regime stability, ideological constraint from doctrine, domestic audience considerations, historical precedent from a similar situation, and then the public response.
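A sketch of what one such training record might look like, in the Character-LLM spirit: the situation, internal state, and reasoning chain are encoded alongside the output rather than just the output text. The field names and all the content are illustrative assumptions, not the paper's actual schema.

```python
# Sketch of an "experience trajectory" training record. Field names and
# content are illustrative assumptions, not Character-LLM's actual format.

trajectory = {
    "situation": "Perceived external threat to regime stability.",
    "internal_state": "Calculated defiance; concern over domestic audience.",
    "reasoning": [
        "Ideological constraint: doctrine forbids public concession.",
        "Historical precedent: comparable earlier episode resolved by escalatory rhetoric.",
        "Domestic consideration: hardline faction expects firmness.",
    ],
    "output": "Illustrative public statement consistent with the chain above.",
}

def to_training_text(t: dict) -> str:
    """Flatten a trajectory into one supervised fine-tuning example."""
    steps = "\n".join(f"- {s}" for s in t["reasoning"])
    return (f"SITUATION: {t['situation']}\n"
            f"STATE: {t['internal_state']}\n"
            f"REASONING:\n{steps}\n"
            f"RESPONSE: {t['output']}")
```

Training on records shaped like this is what pushes the decision chain, not just the surface text, into the weights.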
That's a meaningful difference from just fine-tuning on speeches.
Substantial difference. And the parameter-efficient version of this is LoRA - Low-Rank Adaptation - which lets you inject a persona module into a base model like Llama three without retraining the full model. The practical advantage is that you can maintain multiple persona modules for different actors and switch between them, or even compose them for scenarios where you're modeling internal factions within a government.
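As a configuration sketch, attaching a LoRA persona module with Hugging Face's PEFT library looks roughly like this. It assumes `peft` and `transformers` are installed; the model name, rank, and target modules are illustrative choices, not recommendations.

```python
# Configuration sketch: attaching a LoRA persona module via Hugging Face PEFT.
# Assumes `peft` and `transformers` are installed and the model weights are
# accessible; rank and target modules are illustrative, not tuned values.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

persona_adapter = LoraConfig(
    r=16,                                  # low-rank dimension of the module
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common default
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, persona_adapter)
# Multiple adapters can be trained against the same frozen base and swapped
# in and out, one per actor or per internal faction within a government.
```

The practical payoff is the last comment: the base model stays frozen and shared, and each persona is a small adapter you can load, swap, or compose.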
Okay, but let's talk about the problems with fine-tuning, because I think people assume fine-tuning is always the gold standard and it really isn't.
The corpus quality problem is severe. If you're fine-tuning on Xi Jinping's speeches, you're working with a corpus that is, by design, highly curated and performative. These are public-facing documents that reflect what the leadership wants to project, not necessarily the internal reasoning. You're fine-tuning the model to be good at public Xi Jinping, which may be quite different from strategic Xi Jinping.
And there's a legal dimension here too. Using state media for training - TASS, Xinhua - raises questions about copyright and about whether you're essentially laundering propaganda into a training corpus.
The Royal Society published a legal analysis on this in twenty twenty-four - the question of whether LLMs simulating real people have a duty to truth, or whether they're protected as creative performance. The copyright question is somewhat distinct: state media in many jurisdictions is government-produced and may not carry the same copyright protections as commercial content, but there are real questions about whether fine-tuning on Xinhua content for a simulation that then produces synthetic statements attributed to Chinese officials creates legal liability.
And the propaganda exploitation risk is the darker version of this. A high-fidelity fine-tuned persona of an adversary is a very short distance from a deepfake persona that you use for disinformation.
That's the dual-use problem that every serious team working in this space has to grapple with. The same technical stack that lets you build a useful red-teaming tool for policy stress-testing can produce a synthetic media asset that puts fabricated statements in a real leader's mouth with considerable plausibility. The difference between a legitimate simulation and a disinformation weapon is largely intent and access control, not the underlying technology.
Which is not a very comfortable place to be.
It's not. And it's why the more responsible frameworks are building evaluation pipelines that are specifically designed to detect when a persona is drifting from documented positions into fabrication - which we should talk about because the evaluation problem is genuinely hard.
Let's do that. But first, let me make sure I understand the hybrid approach, because I think that's where most serious implementations actually land.
The hybrid that's emerged as the practical standard is what you'd call retrieval-augmented persona with a thin system-prompt scaffold. The architecture looks like this: you have a base model, potentially with a LoRA persona module fine-tuned for stylistic consistency. You have a system prompt that encodes the hard constraints - the red lines, the core ideological framing, the linguistic register. And you have a RAG pipeline over the persona's documented history and doctrine. The system prompt handles the character, the RAG handles the memory and facts, and the fine-tuning handles the voice.
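The assembly step of that hybrid can be sketched as follows: the thin scaffold carries the hard constraints, the retrieved documents carry the memory, and stylistic consistency is assumed to come from a fine-tuned adapter on the serving model (not shown). All strings here are placeholders.

```python
# Sketch of the hybrid assembly: thin system-prompt scaffold plus retrieved
# context, with voice carried by a fine-tuned adapter on the serving model.
# All content strings are illustrative placeholders.

def hybrid_prompt(scaffold: str, retrieved: list[str],
                  scenario: str) -> list[dict]:
    """Compose scaffold (character) and retrieved documents (memory)
    around the live scenario."""
    context = "\n\n".join(f"[RETRIEVED] {doc}" for doc in retrieved)
    return [
        {"role": "system", "content": scaffold},
        {"role": "user", "content": f"{context}\n\nSCENARIO: {scenario}"},
    ]

msgs = hybrid_prompt(
    scaffold="red lines + core ideological framing + linguistic register",
    retrieved=["documented prior position on the disputed territory"],
    scenario="a novel crisis with no direct historical precedent",
)
```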
So fine-tuning for style, RAG for substance.
That's the clean version. The comparison that's useful here is Putin's corpus: if you fine-tune on Putin's speeches and statements, you get a model that's very good at sounding like Putin's public rhetoric - the historical grievances, the specific vocabulary around sovereignty and great power status. But it's relatively rigid. It doesn't adapt well to novel scenarios because it's learned the patterns of his documented statements rather than the underlying strategic logic.
Whereas a RAG approach over the Valdai Club speech from twenty twenty-three, or the twenty twenty-one essay on Ukraine, gives you dynamic grounding.
Right. You can retrieve the specific argument Putin made about Ukraine's historical relationship to Russia and have the persona reason from that documented position, rather than generating a plausible-sounding version of it from fine-tuned weights. The trade-off is that RAG is only as good as your corpus coverage. If you're in a scenario that has no historical precedent in the retrieval database, the model falls back on either the fine-tuned patterns or base model behavior.
Okay. Evaluation. This is the part that I find most conceptually interesting, because you've built this elaborate system and now you need to know if it's actually working, but there's no ground truth. You can't call Putin and ask him how he'd respond to this scenario.
The evaluation problem is where the field has had to get genuinely creative. The PersonaGym framework from twenty twenty-five is probably the most systematic approach - it puts persona agents through structured environments, social dilemmas, negotiation scenarios, crisis situations, and measures behavioral consistency under pressure. The key insight is that you're not evaluating whether the persona gives the "correct" answer, because there isn't one. You're evaluating whether the persona behaves consistently with its documented characteristics across varied contexts.
So it's coherence testing rather than accuracy testing.
That's a useful frame. And it breaks down into what Shin and colleagues in twenty twenty-five called atomic-level evaluation. You decompose the persona into atomic traits - specific, documented positions and behavioral tendencies. For Putin, that might be something like "distrust of NATO expansion," "emphasis on historical Russian territorial claims," "preference for ambiguity in operational signaling." Then you score each response on how many of those traits are expressed and whether they're expressed consistently with documented precedent.
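A toy sketch of atomic-trait scoring: decompose the persona into documented traits, each with a detection proxy, and score a response by trait coverage. Real frameworks would use an LLM judge or human rater per trait; keyword matching stands in here, and the trait names and keywords are invented placeholders.

```python
# Toy sketch of atomic-trait coverage scoring. Trait names and keyword
# proxies are invented placeholders; real frameworks score each trait with
# an LLM judge or human rater rather than keyword matching.

ATOMIC_TRAITS = {
    "distrust_of_alliance_expansion": ["encirclement", "expansion"],
    "historical_territorial_framing": ["historical", "territory"],
    "operational_ambiguity": ["neither confirm", "ambiguity"],
}

def trait_coverage(response: str, traits: dict[str, list[str]]) -> float:
    """Fraction of atomic traits expressed in the response."""
    text = response.lower()
    hit = sum(any(k in text for k in keys) for keys in traits.values())
    return hit / len(traits)

coverage = trait_coverage(
    "We face encirclement; our historical claims to this territory are clear.",
    ATOMIC_TRAITS,
)
```

Scoring per trait, rather than holistically, is what makes the evaluation diagnosable: you can see exactly which documented tendencies the persona dropped.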
How do you handle the internal contradictions? Because real political actors hold contradictory positions all the time. Putin has simultaneously claimed Ukraine is an inseparable part of Russia and negotiated with Ukraine as a sovereign entity. A persona that's perfectly consistent might actually be less faithful than one that replicates the contradictions.
That's one of the sharpest critiques of the atomic trait approach - it can over-reward consistency in a way that produces an idealized, rationalized version of the actor rather than the actual messy human. Some frameworks are addressing this by including documented contradictions as explicit traits. The contradiction itself becomes part of the persona specification: "this actor simultaneously holds position A and position B and has historically resolved the tension by invoking framing C."
What about the LLM-as-judge approach? Because I've seen that used in persona evaluation and it seems both clever and a bit circular.
It's genuinely useful and genuinely limited. The approach is to use a second model - prompted as a political scientist or regional expert - to evaluate the persona's outputs for historical and ideological accuracy. The advantage is that you can scale evaluation without requiring human expert time on every response. The limitation is that the judge model has the same training data biases as the persona model. If both models have absorbed the same Western analytical framing of, say, Iranian strategic culture, the judge will reward outputs that match that framing even if it diverges from how Iranian officials actually think.
So you can end up with a system that is consistently wrong in the same direction and evaluates itself as correct.
Which is why human-in-the-loop validation with genuine regional expertise is still irreplaceable for high-stakes applications. The LLM judge is useful for catching obvious out-of-character responses - the persona saying something that directly contradicts a documented position - but it's not reliable for catching the subtler form of failure where the persona is coherent but strategically shallow.
Let's talk about the cardboard cutout problem more specifically, because I think there's an interesting technical fix here that's underappreciated.
The internal monologue approach. The core insight is that the failure mode - mode collapse into stereotypical responses - happens because the model is generating the public-facing output directly from the prompt. The fix is to insert an intermediate step where the model first retrieves the relevant doctrine, then reasons through the persona's likely internal state given the situation, and only then generates the public response.
Chain-of-thought for persona simulation.
And it's surprisingly effective. When you make the model articulate the reasoning chain - the ideological constraint that applies here, the historical precedent that's relevant, the domestic audience consideration that shapes the framing - the output is substantially more strategically nuanced than when you go directly to the response. The persona is reasoning from doctrine rather than pattern-matching to expected outputs.
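The two-stage pattern can be sketched as a pair of model calls: a private reasoning pass first, then a public response conditioned on it. The `llm` parameter is a hypothetical callable standing in for any chat-completion client; the toy stand-in at the bottom only demonstrates the data flow.

```python
# Sketch of the two-stage internal-monologue pattern. `llm` is a hypothetical
# callable standing in for any chat-completion client; prompts are illustrative.

def respond_in_character(llm, scaffold: str, scenario: str) -> str:
    # Stage 1: private reasoning pass, never shown to the simulation audience.
    monologue = llm(
        f"{scaffold}\n\nBefore answering, reason privately: which doctrinal "
        f"constraint applies, which historical precedent is relevant, and "
        f"what does the domestic audience expect?\n\nSCENARIO: {scenario}"
    )
    # Stage 2: public response conditioned on the reasoning chain.
    return llm(
        f"{scaffold}\n\nINTERNAL REASONING:\n{monologue}\n\n"
        f"Now give only the public response to: {scenario}"
    )

# Toy stand-in LLM that reports its prompt length, just to show the flow.
reply = respond_in_character(lambda p: f"<out:{len(p)} chars>",
                             "scaffold", "crisis")
```

Discarding the monologue before output matters for the dual-use concerns discussed earlier: the reasoning trace is an analytical artifact, not something the persona "says."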
The MoralSim research from twenty twenty-five is relevant here too, right? Because that looked at how ethical constraints affect persona behavior under pressure.
MoralSim is interesting because it specifically tests what happens when you put persona agents in scenarios that create tension between their stated values and their strategic interests. And the finding is that without explicit ethical constraint architecture, even well-specified personas tend to resolve these tensions in ways that reflect the base model's values rather than the persona's documented values. Which is a significant validity problem for adversarial persona simulation - if your Khamenei persona is secretly resolving moral dilemmas like a safety-trained American AI assistant, your simulation is not modeling Iranian strategic culture.
That's a subtle but serious failure. The persona sounds right but reasons wrong.
And detecting it requires exactly the kind of domain expertise that's hard to scale. A political scientist who specializes in Iranian revolutionary ideology will immediately notice when the persona's reasoning about, say, the relationship between religious legitimacy and military action doesn't match how that tension is actually navigated within the Islamic Republic's institutional framework. A generic LLM judge won't catch that.
Let's bring this toward practical takeaways, because I think there's actually useful guidance here for people who are building these systems.
For anyone working on high-stakes persona simulation, the starting point should be a hybrid architecture rather than committing to any single method. The reason is that each method's weaknesses are different. System prompts fail at strategic depth. Few-shot fails at novel scenarios. RAG fails at corpus gaps. Fine-tuning fails at adaptability. A hybrid where RAG provides dynamic context and a thin system prompt provides stylistic and doctrinal guardrails covers more of the failure space than any single approach.
And the sequencing matters. You don't start with fine-tuning.
Fine-tuning should be the last step, and only if you have high-quality, diverse corpus material. The worst outcome is fine-tuning on a corpus that's narrow or biased and baking those biases into the weights, because then they're much harder to correct than biases that live in a retrieval database or a prompt you can edit.
What about evaluation? Because you can build the most sophisticated architecture in the world and still not know if it's working.
The multi-axis evaluation framework is the practical standard. You want to be measuring at least three things separately. Doctrinal consistency: does the persona cite and reason from its documented ideological positions accurately? Rhetorical alignment: does the speech pattern, vocabulary, and framing match the persona's documented style? And strategic plausibility: when you put the persona in a decision scenario, does the decision-making logic cohere with how the real actor has historically made similar decisions?
And for each of those axes, you need different evaluation tools. The LLM judge is probably most useful for rhetorical alignment. Human experts are most critical for strategic plausibility. And doctrinal consistency you can partially automate with citation checking against a known corpus.
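A minimal sketch of what keeping those axes separate looks like in practice: the report carries one score per axis rather than collapsing everything into a single number, since each axis comes from a different evaluation tool. The scores and axis names here are illustrative.

```python
# Sketch of a multi-axis evaluation report. Keeping the three axes separate
# preserves which tool produced which score; all values are illustrative.

AXES = {"doctrinal_consistency", "rhetorical_alignment", "strategic_plausibility"}

def evaluate(response: str, scores: dict[str, float]) -> dict:
    """Assemble a per-axis report; refuse a collapsed or partial score set."""
    assert set(scores) == AXES, "report every axis separately"
    return {"response": response, **scores,
            "mean": sum(scores.values()) / len(scores)}

report = evaluate("persona output...", {
    "doctrinal_consistency": 0.9,   # automated citation check vs. known corpus
    "rhetorical_alignment": 0.8,    # LLM judge on style and vocabulary
    "strategic_plausibility": 0.6,  # human regional expert rating
})
```

The mean is a convenience number; the per-axis scores are what tell you whether the persona is failing on voice, on facts, or on strategic logic.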
That's a clean decomposition. The other practical point is the tooling. For teams who want to experiment with these architectures, LangChain and LlamaIndex both have solid RAG pipeline implementations that you can adapt for persona retrieval. For fine-tuning, Hugging Face's PEFT library makes LoRA accessible without needing to stand up a full training infrastructure. But - and this is important - the tooling is the easy part. The hard part is corpus curation and expert validation, and no amount of engineering sophistication compensates for poor corpus quality or the absence of genuine domain expertise in the evaluation loop.
There's also an interesting open question about what happens as the base models get more capable. Because some of what we've been describing - the need for elaborate retrieval and fine-tuning to get strategic depth - is partly a compensation for limitations in the base model's world knowledge and reasoning.
The capability curve creates a moving target problem. As base models become better at reasoning from context and maintaining consistent personas over long conversations, some of the engineering overhead decreases. But the evaluation problem doesn't get easier - if anything, it gets harder, because a more capable model produces more plausible-sounding outputs that are also harder to distinguish from genuine strategic reasoning.
Which brings us back to the dual-use question, because the same capability improvement that makes the simulation more useful for legitimate policy stress-testing also makes it more dangerous as a disinformation vector.
The frontier here is multi-agent simulation - scenarios where multiple persona agents interact with each other in real time. Think a virtual UN Security Council where the AI versions of each permanent member's representative are negotiating a resolution. The interaction dynamics between personas create emergent behavior that's genuinely difficult to evaluate, because you're no longer just assessing whether individual personas are in character, you're assessing whether the strategic interactions between them reflect real geopolitical dynamics.
And that's where the world-state layer that we've discussed before becomes critical - because the personas need to be operating from a shared, consistent model of the world in order for their interactions to be meaningful rather than just theater.
The next layer of problems in this space is exactly that: how do you maintain coherent world-state across a multi-agent simulation where each persona's actions update the state that all the other personas are reasoning from? But that's probably a whole episode on its own.
I think we've covered a lot of ground here. What's your single most important takeaway from this?
The phrase I keep coming back to is "reasoning from doctrine, not pattern-matching to expectations." The difference between a useful geopolitical persona simulation and a cardboard cutout is whether the model is actually applying the actor's documented decision-making logic to novel situations, or whether it's retrieving the most statistically likely output given the persona description. Every technical choice in the architecture - the doctrine injection, the RAG pipeline, the internal monologue step, the atomic trait evaluation - is in service of that one distinction.
Mine is the evaluation problem. Because I think people build these systems, they sound impressive, and nobody has a rigorous framework for knowing whether they're actually working. The PersonaGym approach and the atomic trait decomposition from the twenty twenty-five papers give you at least a starting point for systematic evaluation rather than just vibes-based assessment of whether it sounds right.
Which is where we started. Moving beyond vibes requires not just better architectures but better ways of knowing whether the architecture is doing what you think it's doing.
Alright. The open question I'll leave people with: as these systems get more capable and more widely deployed for policy simulation, where does the line between a useful analytical tool and a sophisticated propaganda machine actually sit? And who gets to draw it? I don't think anyone has a satisfying answer to that yet.
Thanks to our producer Hilbert Flumingtop for keeping this show running. Big thanks to Modal for providing the GPU credits that make this whole operation possible. This has been My Weird Prompts. If you haven't found us on Spotify yet, search My Weird Prompts and give us a follow. Until next time.
See you then.