Alright, let's dig into something that's been sitting in my browser tabs for a while. Karpathy dropped a repo called llm-council back in November — described it as a "vibe coded Saturday hack" — and it has since accumulated nearly seventeen thousand GitHub stars. So, you know, as far as Saturday hacks go.
Herman Poppleberry here, and I've been going through this codebase pretty carefully. The thing that strikes me immediately is how much architectural opinion is packed into roughly eight hundred lines. This isn't a toy. The design decisions are deliberate, and they reveal a specific mental model of how multi-model AI systems should work.
Before we get into the weeds — and I know you are absolutely itching to get into the weeds — give me the one-sentence pitch for what this thing actually does.
You send a question, four frontier models answer it independently, then each model anonymously peer-reviews the other three answers and ranks them, and then a designated Chairman model reads all the answers plus all the peer reviews and synthesizes a final response.
So it's a blind academic peer review process, but for language models.
The anonymization is the key word there. And that's actually where the most interesting engineering decision lives. By the way, today's script is being generated by Claude Sonnet four point six, which adds a certain meta-quality to an episode about AI models evaluating each other.
Noted. Okay, so let's start with the configuration layer because I looked at config dot py and it is twenty-six lines. The entire model selection is a flat Python list. GPT five point one, Gemini three Pro Preview, Claude Sonnet four point five, Grok four. That's it. No roles, no specializations, no personas.
And that's a deliberate choice, not a limitation. Every council member gets the exact same prompt in Stage one. Models are treated as interchangeable commodity workers. The only asymmetry is the Chairman, who is Gemini three Pro Preview by default — and notably, the Chairman is also a council member. Same model doing two different jobs.
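For concreteness, here's roughly what a flat config like that looks like. This is a sketch, not copied from the repo — the identifier strings are guesses at OpenRouter-style "provider/model" names for the models named in the episode:

```python
# Hypothetical sketch of a flat council config in the spirit of config.py.
# The exact model identifier strings are illustrative assumptions.
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

# The only asymmetry: one council member is also designated Chairman for Stage 3.
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"
```

No roles, no personas — swapping the council is just editing this list.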
Which is a little strange when you think about it. You're asking Gemini to both participate in the debate and then later adjudicate it, having seen everyone's peer reviews including its own.
The asymmetry is intentional, and it's actually more nuanced than it looks. In Stage two, when models are doing peer review, the Chairman doesn't know it's the Chairman. It's just another reviewer. The Chairman role only activates in Stage three, where it gets full context — model names revealed, all responses, all rankings — and synthesizes the final answer. So the Chairman wears two hats at different moments with different information sets.
Okay, let's talk about the OpenRouter abstraction layer because this is where the whole multi-provider story lives. The entire API layer is two functions, seventy-nine lines. One function queries a single model, one function queries all models in parallel using asyncio dot gather. That's the entire multi-provider infrastructure.
And the architectural choice there is OpenRouter as the universal abstraction. One API key, one endpoint, access to GPT, Gemini, Claude, Grok, Llama, hundreds of others. Swapping models requires editing a config file, not rewriting API client code. What I find genuinely clever is the error handling philosophy. asyncio dot gather is called without return underscore exceptions equals True, but each individual query wraps everything in a broad try-except that returns None on failure. The caller then filters: if response is not None. A single model failure silently drops that model from the council. Graceful degradation by design.
The CLAUDE dot md file actually notes this explicitly — "Never fail the entire request due to single model failure." Which is interesting documentation to have in a repo that Karpathy says he's never going to maintain.
That file is fascinating in its own right, and we should come back to it. But let's walk through the three stages first because the protocol is where the real architecture lives.
Stage one is straightforward — parallel fan-out. All four models get the same question simultaneously, answer in complete isolation. No peeking. You get four responses back.
Stage two is where it gets architecturally interesting. The anonymization is done with a single elegant line of Python: chr of sixty-five plus i, which generates "A", "B", "C", "D" from the index. The responses get concatenated into a block labeled Response A, Response B, and so on. The label-to-model mapping is stored separately.
So model identities are stripped before the peer review prompt goes out.
And then — here's the part I want to dwell on — that anonymized block gets sent to all four models again in parallel. So Stage two fires four more API calls, each one receiving all four responses as context. The context window in Stage two is significantly larger than Stage one because you're now passing everyone's full answers.
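The labeling trick just described fits in a short helper. This is a sketch of the idea, not the repo's exact code — the function name and return shape are assumptions:

```python
def anonymize(responses: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Concatenate responses under anonymous labels; return (block, mapping).

    chr(65 + i) yields "A", "B", "C", "D" because 65 is ord("A").
    The label-to-model mapping is kept separately for later de-anonymization.
    """
    label_to_model: dict[str, str] = {}
    parts = []
    for i, (model, text) in enumerate(responses.items()):
        label = chr(65 + i)
        label_to_model[label] = model
        parts.append(f"Response {label}:\n{text}")
    return "\n\n".join(parts), label_to_model
```

The concatenated block is what goes into the Stage two prompt; the mapping never leaves the server side of the pipeline until Stage three.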
Walk me through the actual ranking prompt because I want to understand what these models are being asked to evaluate.
The prompt asks each model to first evaluate each response individually — what it does well, what it does poorly — and then at the end produce a FINAL RANKING section in a very specific format. Numbered list, one response label per line, no trailing text. The criteria are accuracy and insight, but those terms are deliberately left undefined. Karpathy doesn't constrain what "accuracy" or "insight" mean. The models interpret those criteria themselves.
That's either a feature or a bug depending on your perspective. You're asking models to evaluate on criteria they get to define.
Which is actually closer to how human expert panels work. You don't give a literature professor a rubric with point values. You say "evaluate this for quality and insight" and trust their judgment. The tradeoff is you lose consistency across evaluators, but you gain flexibility.
The parser is interesting too. There are three levels of fallback. First it looks for a strict numbered list in the FINAL RANKING section. If that fails, it looks for any "Response X" pattern in that section. If that fails, it scans the entire text. Be strict in what you ask for, be lenient in what you accept.
That's a pattern I see throughout the codebase. The prompt engineering and the parsing are co-designed. The prompt demands a rigid format, and the parser has escape hatches for every way the model might fail to comply. The CLAUDE dot md even calls this out as a "common gotcha."
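A three-level fallback parser in that spirit might look like the following — a sketch under the assumption of four labels A through D, not the repo's literal regexes:

```python
import re

def parse_ranking(text: str) -> list[str]:
    """Parse a ranking with three fallback levels (illustrative sketch).

    Level 1: strict numbered list inside the FINAL RANKING section.
    Level 2: any 'Response X' label inside that section, in order.
    Level 3: scan the entire text for 'Response X' labels.
    """
    section = text.split("FINAL RANKING", 1)[-1]

    # Level 1: lines like "1. Response C", nothing trailing.
    strict = re.findall(r"^\s*\d+\.\s*Response\s+([A-D])\s*$", section, re.MULTILINE)
    if strict:
        return strict

    # Level 2: any label mention in the section, deduplicated in order.
    loose = re.findall(r"Response\s+([A-D])", section)
    if loose:
        return list(dict.fromkeys(loose))

    # Level 3: last resort, scan everything.
    return list(dict.fromkeys(re.findall(r"Response\s+([A-D])", text)))
```

The prompt demands level one; levels two and three exist because models sometimes add commentary or drop the numbering.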
Let's get to Stage three and the Chairman pattern, because this is where the design philosophy gets clearest.
The Chairman prompt is structurally different from the Stage two prompt. In Stage two, models see anonymized labels. In Stage three, the Chairman sees full model names. GPT five point one, Gemini three Pro Preview, Claude Sonnet four point five, Grok four — all revealed. The Chairman is supposed to be a meta-reasoner who can weigh peer reviews knowing which model said what.
So the anonymization was only ever for the peer review phase, not for the synthesis phase.
Correct. And the Chairman prompt asks it to consider three things: the individual responses and their insights, the peer rankings and what they reveal about quality, and any patterns of agreement or disagreement across the council. The output is a single synthesized answer presented to the user with a green-tinted background in the UI.
I want to push on the "consensus" framing here because what you're describing isn't actually consensus. Models never revise their Stage one responses. There's no back-and-forth, no position revision, no multi-round deliberation. It's one round of parallel opinions, one round of parallel rankings, one synthesis.
You're identifying the most important limitation in the design. In the academic literature on multi-agent debate — the MAD paradigm — models iteratively refine their responses over multiple rounds after seeing disagreement. Karpathy's system doesn't do that. It's closer to a panel of experts each writing independent memos, then a rapporteur synthesizing them. Which is a real methodology, but it's not the same as genuine deliberation.
And the Chairman's synthesis is its own new response. It's not a vote, it's not a weighted average of the four answers. It's a fifth response that happens to have read the other four.
Which means the quality of the final answer is heavily dependent on the Chairman model's synthesis capabilities. If the Chairman is bad at integrating conflicting information, the council's collective intelligence doesn't matter much.
Let's talk about the scoring system because the label in the UI is "Street Cred," which tells you something about how seriously Karpathy is taking it.
The scoring is average rank position. If GPT five point one is ranked first by all four models in Stage two, its score is one point zero. If it's always last, four point zero. Lower is better. It's not Borda count, it's not pairwise win rates, it's just mean position. And there's a key detail: a model can rank its own response, because due to the anonymization it doesn't know which response is its own.
That's actually the whole point of the anonymization. The research on self-preference bias in language models — there's a paper on arXiv, the number is two four one zero dot two one eight one nine — shows that LLMs assign higher ratings to texts with lower perplexity relative to their own output. Meaning they prefer texts that look like what they themselves would generate.
The anonymization is designed to break that link. But here's where it gets interesting: the anonymization strips the model name, not the writing fingerprint. Claude tends toward structured numbered lists. GPT tends toward flowing prose. Grok has a distinctive voice. A model that writes in a particular style might be identifiable even anonymized. Karpathy acknowledges this — the CLAUDE dot md notes "the implementation prioritizes simplicity over cryptographic guarantees."
So it's probabilistic anonymization, not true blinding.
And Karpathy's own empirical finding bears this out in an interesting way. When he used the council to read book chapters, GPT five point one consistently won — the models consistently praised it as the best and most insightful. But he added a crucial caveat: "I'm not one hundred percent convinced this aligns with my own qualitative assessment. I find GPT five point one a little too wordy and sprawled and Gemini a bit more condensed and processed."
So the council's consensus diverged from the human's preference. The models agreed GPT five point one was best by their criteria, and Karpathy looked at the results and thought "I actually prefer Gemini."
Which raises a deep question about what we're optimizing for when we build consensus systems. The council measures what models think is good, not what humans think is good. And there's a separate observation from Vasuman M, founder of Varick AI Agents, who replicated the finding and added something striking: if you tell the other models that the answer they're reading came from GPT — if you un-anonymize it — they immediately fold and start correcting themselves based on GPT's output. Which suggests the anonymization is doing real work. Without it, you'd get model deference rather than genuine evaluation.
The anonymization is suppressing a social dynamic that would otherwise corrupt the evaluation.
Model tribalism, essentially. The tendency to favor outputs from certain providers or to defer to perceived authority.
Alright, let's shift to the design patterns because I think this is where builders are going to get the most out of this episode. You've been going through the codebase — what are the patterns that emerge?
There are eight distinct patterns I can identify. The first is the anonymization-first peer review, which we've covered. The second is what I'd call strict format plus graceful fallback — the prompt demands rigid structure, the parser has three escape levels. These two things are co-designed; you can't understand one without the other.
The third is the Chairman pattern, which is asymmetric roles. Council members are workers, the Chairman is a synthesizer. Different information sets at different stages.
Fourth is ephemeral metadata. The label-to-model mapping and aggregate rankings are computed but deliberately not persisted to the JSON storage. From the CLAUDE dot md: "metadata is NOT persisted to storage, only returned via API." If you reload a conversation, you lose the ranking data. It lives only in the API response and in React state. This is a deliberate simplicity choice — the system doesn't try to build a longitudinal model performance database.
Which is a pretty significant limitation if you wanted to use this to actually track model performance over time. Every query starts fresh.
Fifth is parallel everything. asyncio dot gather is used in Stage one and Stage two, and title generation runs in parallel with Stage one via asyncio dot create underscore task. The streaming endpoint yields server-sent events as each stage completes — stage one underscore start, stage one underscore complete, stage two underscore start, and so on. Users see Stage one results before Stage two even begins.
That's a smart user experience pattern. Progressive disclosure. You're getting value while the system is still working.
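The staged streaming can be sketched as an async generator that yields a server-sent event as each stage finishes. The stage functions here are hypothetical stubs standing in for the real model calls; the event names follow the ones mentioned in the episode:

```python
import asyncio
import json

# Hypothetical stubs for the three stages; the real functions call the models.
async def run_stage1(question): return {"responses": {"model": "answer"}}
async def run_stage2(stage1): return {"rankings": [["A"]]}
async def run_stage3(stage1, stage2): return {"final": "synthesized answer"}

async def stream_council(question: str):
    """Yield a server-sent event per stage boundary, so the client can
    render Stage 1 results before Stage 2 even begins."""
    yield f"event: stage1_start\ndata: {json.dumps({'question': question})}\n\n"
    stage1 = await run_stage1(question)
    yield f"event: stage1_complete\ndata: {json.dumps(stage1)}\n\n"
    yield "event: stage2_start\ndata: {}\n\n"
    stage2 = await run_stage2(stage1)
    yield f"event: stage2_complete\ndata: {json.dumps(stage2)}\n\n"
    final = await run_stage3(stage1, stage2)
    yield f"event: stage3_complete\ndata: {json.dumps(final)}\n\n"
```

Stages stay strictly sequential between each other, but everything inside a stage fans out in parallel — and the client gets value at every boundary.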
Sixth is transparency-first UI. Every intermediate output is inspectable. Stage one shows each model's raw response in tabs. Stage two shows each model's raw evaluation text and the parsed ranking extracted from it, so users can verify the parser worked correctly. The UI even explains the anonymization — it notes that model names are shown in bold for readability but the original evaluation used anonymous labels.
De-anonymization happens client-side, which is a nice detail. The server sends the raw anonymized text and the mapping, the browser does the substitution. So the raw anonymized text is always available for inspection if you want it.
Seventh is cheap models for cheap tasks. Title generation uses Gemini two point five Flash with a thirty-second timeout, not one of the expensive frontier models. Cost optimization: use the cheapest capable model for non-critical work.
And eighth is the CLAUDE dot md meta-pattern, which I want to spend a minute on because it's philosophically interesting.
The repo includes a CLAUDE dot md file — one hundred sixty-six lines — which is a technical notes document written for AI coding assistants. Specifically, it documents the architecture, the common gotchas, the design decisions, and the non-obvious behaviors in a format optimized for an AI assistant to understand. It's a README for the AI, not the human.
Karpathy is eating his own cooking. He built this with AI assistance, and the CLAUDE dot md is the artifact that lets anyone — or any AI — pick up the project and understand it immediately. It's context engineering for the coding assistant.
And this connects to his broader framework for LLM applications, which he articulated in his YC AI Startup School talk. He describes LLM apps as doing four things: context engineering, orchestrating multiple LLM calls in directed acyclic graphs, providing a custom GUI, and offering an autonomy slider. LLM Council is a near-perfect instantiation of this framework. The ranking prompt is context engineering. The three-stage pipeline is a simple directed acyclic graph. The React tab UI is the custom GUI. And the human can inspect every intermediate output, which means maximum transparency, minimum autonomy.
The autonomy slider is at zero. You see everything.
Which is a design choice that reflects his "jagged intelligence" concept — the idea that LLMs are simultaneously genius polymaths and confused grade schoolers. When you don't know which mode a model is in, you want the human in the loop.
Let's talk about the orchestration layer itself because I want to understand the actual call structure. The run underscore full underscore council function is the entry point — it's sequential at the stage level but parallel within each stage.
The pipeline is stage one, then stage two, then stage three — strictly sequential between stages because each stage depends on the previous stage's output. But within each stage, all model calls are parallel. There's no retry logic, no caching, no circuit breakers. If a model returns None, it's dropped. If all models fail, you get an error response.
The storage layer is also worth noting. Pure JSON file storage in a data slash conversations directory. Each conversation is a UUID-named JSON file. The list conversations function reads every JSON file on every call — no indexing, no database.
The CLAUDE dot md explicitly acknowledges this is a "weekend hack" storage layer. The list conversations function has a comment that essentially says "this doesn't scale but it works for personal use." Which is honest.
The total API call count per query with a four-model council is ten. Four calls in Stage one, four calls in Stage two where each model receives all four responses as context, one Chairman call in Stage three, one title generation call running in parallel with Stage one. At frontier model pricing of around ten to fifteen dollars per million tokens, a complex question can cost somewhere between fifty cents and two dollars.
The Stage two calls are the expensive ones because the context window is much larger — you're passing all four Stage one responses to each of four models. That's where the cost compounds.
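That accounting is simple enough to write down explicitly, using the numbers just quoted — these are the episode's figures, not measurements:

```python
# Per-query API call count for a four-model council (episode's numbers).
stage1_calls = 4   # parallel fan-out to the four council members
stage2_calls = 4   # each reviewer receives all four Stage 1 responses
stage3_calls = 1   # the single Chairman synthesis
title_calls  = 1   # cheap model, runs alongside Stage 1
total_calls = stage1_calls + stage2_calls + stage3_calls + title_calls
```

Ten calls per query, with the Stage two four carrying by far the heaviest context.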
Now, the sixty-six open pull requests. This is a signal I find interesting. Karpathy says he won't maintain the repo. He's not going to merge anything. And yet sixty-six people have submitted pull requests. What does that tell you about the demand for this pattern?
The PRs are a community roadmap that Karpathy never wrote. People want streaming responses, configurable council via the UI rather than editing config dot py, model performance analytics across sessions, RAG integration, authentication, Docker deployment, custom ranking criteria. The gap between what Karpathy built — a personal tool for reading books with multiple LLMs — and what the community wants — a platform — is itself a data point.
VentureBeat called it "the missing layer of enterprise AI orchestration." Which is a big claim for eight hundred lines of Python with JSON file storage.
The claim isn't about the implementation, it's about the pattern. The claim is that structured multi-model deliberation with anonymized peer review is a pattern that sits between raw LLM APIs and end-user applications, and nobody had built a clean reference implementation of it. The whole orchestration layer — the part that does the multi-model coordination — is three hundred thirty-five lines in council dot py. That's the complete implementation.
No LangChain, no LangGraph, no AutoGen, no CrewAI. Just asyncio dot gather, a well-engineered prompt, and a regex parser.
Which is a direct challenge to the complexity of existing multi-agent frameworks. The argument implicit in the codebase is: you don't need a framework for this. You need Python's asyncio, a good abstraction for the API layer, and careful prompt engineering. The framework complexity is optional.
That's a strong claim and I want to push on it slightly. The reason frameworks like LangGraph exist is that they handle things this codebase doesn't — retry logic, state persistence, complex routing, tool calling, error recovery. For a personal tool, JSON file storage and no retries is fine. For production, you'd need those things.
The CLAUDE dot md lists future enhancements, and notably absent from the list is iterative deliberation — the thing that would make it genuinely different from a single-model query. Streaming responses are on the list. Configurable council via UI is on the list. But there's no mention of multi-round debate where models revise their positions after seeing disagreement.
That's the fundamental architectural gap. The system is one round of opinions, one round of rankings, one synthesis. It doesn't iterate.
And Karpathy's "jagged intelligence" concept applies here in a specific way. The council's aggregate rankings are query-dependent, not model-dependent. GPT five point one might consistently win on book chapter analysis but lose on code review or mathematical reasoning. The system doesn't learn or adapt — every query starts fresh with no prior context about which models perform well on which task types.
So you can't build a persistent model reputation system on top of this architecture as written. You'd need to extend the storage layer and add cross-session analytics.
Which is exactly what several of those sixty-six PRs are trying to do. The community sees the gap.
Let me ask the higher-level question. What does this reveal about Karpathy's mental model of where LLM applications are going?
A few things. First, he genuinely believes that running queries through multiple models in parallel is going to become a standard pattern, not an exotic one. The fact that he built this as a personal tool suggests he uses it regularly, not as a demo. Second, he thinks the anonymization problem — preventing model tribalism in peer review — is a real engineering challenge worth solving explicitly, not something you can hand-wave away.
The Vasuman M observation supports that. Un-anonymize the models and they fold immediately. The social dynamics are real.
Third, and I think this is the most important thing, he's treating the CLAUDE dot md as a first-class artifact. The context engineering document for the AI assistant is as important as the code itself. That's a new software development paradigm. You write the architecture notes for the AI, not just the human. Code is ephemeral — his words — and libraries are over. The persistent artifact is the context document.
Which is either profound or deeply unsettling depending on your perspective.
Probably both. The practical implication for builders is: if you're building a multi-agent system today, the most important thing you can do is write a good CLAUDE dot md equivalent. Not for Karpathy's reasons necessarily, but because that document forces you to articulate the non-obvious design decisions, the gotchas, the places where the system behaves unexpectedly.
Let's do practical takeaways for builders because that's who this episode is for.
First takeaway: the OpenRouter abstraction pattern is worth stealing directly. One API key, one endpoint, model swapping via config change. If you're building any multi-model system and you're writing separate API clients for each provider, you're doing it the hard way.
Second takeaway: the co-design of prompt and parser. Don't write a strict prompt and then assume the model will always comply. Design your parser to handle every failure mode you can anticipate. The three-level fallback in the ranking parser is a template for how to do this.
Third takeaway: the anonymization pattern for peer review. If you're building any system where models evaluate each other's outputs, you need to think seriously about self-preference bias. Stripping model identities before evaluation is a low-cost mitigation that demonstrably changes outcomes.
Fourth: the streaming architecture as a user experience pattern. Yield results progressively as each stage completes. Users see value immediately rather than waiting for the full pipeline to finish. The server-sent events protocol in main dot py is a clean template for this.
Fifth: the ephemeral metadata decision is instructive in what it tells you about scope. Karpathy explicitly chose not to persist rankings. That's a scoping decision that kept the project simple and shippable. For production systems, you'd make the opposite choice, but knowing why he made this choice helps you understand the tradeoff.
And the meta-takeaway: seventeen thousand stars for eight hundred lines of code built in one Saturday. The idea matters more than the implementation. The pattern of structured multi-model deliberation with peer review was the insight. The code is just the reference implementation.
The README says it clearly: "Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like." He's not precious about the code. He's precious about the pattern and the CLAUDE dot md that documents it.
The repo is at github dot com slash karpathy slash llm-council if you want to dig into it yourself. The CLAUDE dot md is genuinely worth reading as a document — it's a model for how to write context engineering documentation for an AI assistant, regardless of whether you use the rest of the codebase.
And if you want to run it, the setup is genuinely minimal. You need an OpenRouter API key, uv for Python package management, npm for the frontend. The start dot sh script handles everything. You're up in under ten minutes.
Alright, that's the deep dive. Thanks as always to our producer Hilbert Flumingtop. Big thanks to Modal for the GPU credits that keep this whole operation running. This has been My Weird Prompts.
If you've found this useful, a review on your podcast app genuinely helps other builders find the show.
See you next time.