So Daniel sent us this one, and given that he works in AI communications, I suspect he's been sitting with this question for a while. He's asking about Anthropic's Constitutional AI approach — what it actually is, how it works, and what Anthropic envisions as the longer arc of safe and responsible AI. Not just the surface-level PR version of it, but the real substance. What's the theory of change? What does it assume about how AI development goes wrong, and what does it assume has to be true for their approach to work? Big questions. Herman, you've been reading papers again, I can tell by the look on your face.
I have been reading papers. Exciting ones, actually. And by the way, today's episode is powered by Claude Sonnet 4.6, which gives this particular topic a slightly surreal quality — we're asking an AI to write a script about the philosophy behind the company that makes a different AI. The friendly AI down the road, as it were.
The AI writing about AI safety. Very recursive. Very on-brand for this show.
So let's actually start with what Constitutional AI is, because most coverage gets it wrong in a specific way. People hear "constitutional" and they think it's a metaphor, like, oh, Anthropic gave their model a list of rules and called it a constitution. And that's... sort of true but it misses the actual technical architecture of what they built.
Walk me through it.
Okay. So the conventional approach to making language models behave well is called Reinforcement Learning from Human Feedback — RLHF. The basic idea is you have human raters who look at model outputs and say "this one is better, this one is worse," and you use that signal to train the model toward outputs humans prefer. It works reasonably well but it has a few serious problems. One is that it's expensive and slow — you need a lot of human annotation. But the deeper problem is that human raters are inconsistent. They disagree with each other, they have blind spots, they can be manipulated by how questions are framed. And if your training signal is noisy, your model's values are going to be noisy.
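For listeners who want the mechanics: the "this one is better, this one is worse" signal Herman describes is typically turned into a training objective with a pairwise (Bradley-Terry style) loss on a reward model. This is a minimal, self-contained sketch of that loss — the scores are placeholder floats, where a real system would use a neural reward model over full responses:

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style preference loss: small when the chosen
    response outscores the rejected one, large when it doesn't."""
    margin = score_chosen - score_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)
```

The noisy-rater problem shows up directly here: if human annotators flip which response is "chosen" inconsistently, the margins the model is trained toward are inconsistent too.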
And presumably if the humans doing the rating have subtly bad values, you just bake those in.
Right. Garbage in, garbage in more efficiently. So Anthropic's insight with Constitutional AI — and this was first published in late 2022, the paper is by Bai and colleagues — was to ask whether you could replace some of that human feedback with the model critiquing itself, guided by a set of explicit principles. The "constitution" is literally a document. A list of principles. Things like "choose the response that is least likely to contain harmful or unethical content" or "prefer responses that are more honest." And then the process has two main phases.
Which are?
The first phase is what they call supervised learning from AI feedback. You take a model, you have it generate an initial response to a prompt, and then — and this is the clever bit — you ask the same model to critique that response according to the constitutional principles and then revise it. You do this iteratively. The model is essentially editing its own work against an explicit rubric. Then you use those revised outputs to fine-tune a new version of the model. So you've done a round of self-improvement without any human raters touching individual examples.
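The critique-and-revise loop Herman just described can be sketched in a few lines. Everything here is illustrative: `call_model` is a stub standing in for a real LLM call, and the principle text paraphrases rather than quotes Anthropic's actual constitution:

```python
# Toy sketch of the supervised phase of Constitutional AI.
# Illustrative principles only — not Anthropic's actual wording.
CONSTITUTION = [
    "Choose the response least likely to contain harmful content.",
    "Prefer responses that are more honest.",
]

def call_model(prompt: str) -> str:
    # Placeholder: a real implementation would query an LLM here.
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(prompt: str) -> str:
    response = call_model(prompt)
    for principle in CONSTITUTION:
        critique = call_model(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        response = call_model(
            f"Revise the response to address this critique:\n"
            f"{critique}\nOriginal response:\n{response}"
        )
    return response  # revised outputs become fine-tuning data
```

The structural point is in the loop: the same model generates, critiques, and revises, and the constitution is the only thing distinguishing a revision pass from the model restating its own habits.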
The model is grading its own homework.
With a specific rubric, yes. Which matters a lot. Without the rubric you'd just get the model reinforcing whatever it already tends to do. The constitution is what gives the self-critique traction.
And the second phase?
The second phase is Reinforcement Learning from AI Feedback — they call it RLAIF, to distinguish it from RLHF. Instead of having humans compare pairs of responses and say which is better, you have a different AI model do those comparisons, again guided by the constitutional principles. So you're generating your preference data automatically. This is where the scaling efficiency comes in — you can run this much faster and cheaper than human annotation.
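The RLAIF data-generation step reduces to something like the sketch below: an AI labeler, prompted with a principle, picks the better of two candidates, and the result becomes a preference pair of the same shape RLHF would have gotten from a human. The `ai_prefers` heuristic here is an arbitrary stand-in for a judge-model call:

```python
# Sketch of RLAIF preference-pair generation. `ai_prefers` is a
# placeholder heuristic; a real pipeline would ask a judge model
# to compare the two responses under the stated principle.

def ai_prefers(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    # Stand-in decision rule, NOT a real judgment of the principle.
    return resp_a if len(resp_a) <= len(resp_b) else resp_b

def build_preference_pair(prompt: str, resp_a: str, resp_b: str,
                          principle: str) -> dict:
    chosen = ai_prefers(prompt, resp_a, resp_b, principle)
    rejected = resp_b if chosen is resp_a else resp_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The scaling advantage Herman mentions is visible in the shape of the code: this function can be run over millions of prompt/response pairs at model-inference cost, with no human in the inner loop.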
Now here's what I want to push on, because the obvious objection is: you've just moved the problem. Instead of trusting that your human raters have good values, now you're trusting that your AI rater has good values. How is that not circular?
It's a fair challenge, and Anthropic would be the first to acknowledge it's not a complete solution to the alignment problem. What they'd say is that the constitution itself is the thing you're betting on. The principles are written by humans, they're explicit, they're auditable, they can be debated and revised. That's actually a step up from implicit human preferences, because at least you can look at the document and argue about whether it's right. With RLHF the "values" are distributed across thousands of individual annotation decisions that nobody ever wrote down.
So the constitution is not just a technical artifact, it's meant to be a legible object that people can actually interrogate.
That's a big part of the point. And Anthropic has published their actual constitutions — the one used for Claude is publicly available. You can read it. It draws on a pretty eclectic set of sources: the UN Declaration of Human Rights, Anthropic's own usage policies, and interestingly, some principles derived from asking Claude itself what it thinks would be good principles for an AI to follow.
Wait, they asked the model what rules it should follow?
They did. And I find that philosophically interesting because it's either very elegant or slightly alarming depending on your priors. The argument for it is that if the model is going to internalize these principles, having the model participate in articulating them might produce better internalization. The argument against it is obvious.
The model might just tell you what you want to hear, or what's convenient for the model.
Which is a real concern. But I think the more charitable and probably more accurate reading is that it's one input among many, not the primary source. The human-authored principles from established ethical frameworks are doing most of the work.
Okay, so we've got the technical architecture. What I want to understand is the deeper theory of change here. Because Anthropic isn't just an AI safety research lab that happens to sell products — they're a company with a very specific view of how the next decade of AI development plays out and what role they should be playing in it.
This is where it gets interesting, and I think where Anthropic is doing something that most AI companies aren't. They've been remarkably explicit about their worldview. Dario Amodei, the CEO, has written about it at length. The core thesis is what he calls a "race to the top" framing, but the version he actually believes is darker than that phrase suggests.
What do you mean?
He's said, essentially, that he thinks powerful AI is coming whether Anthropic builds it or not. The question isn't whether transformative AI gets built, it's who builds it and under what constraints. And his view is that it's better for safety-focused labs to be at the frontier than to cede that ground to labs that are less focused on safety. So Anthropic building Claude isn't in tension with their safety mission — it is their safety mission. If they fall behind technically, they lose influence over how the technology develops.
Which is a coherent position but also a convenient one for a company that wants to build AI.
Corn, you've identified what might be the central tension in Anthropic's entire enterprise. They would say the convenience doesn't make it wrong. And I think they're probably right that the logic holds — if you believe powerful AI is coming, sitting out doesn't make it safer. But it does mean you have to be pretty confident that your safety culture is robust enough to survive the commercial pressures that come with being a successful company.
And that's the thing that's hard to verify from the outside.
Very hard. What we can look at is the actual research output, the things they've published, and the decisions they've made that cost them commercially. And there are some. They've been slower to release certain capabilities than competitors. They've been more conservative about some use cases. Whether that's genuine safety culture or strategic positioning — I don't think you can ever fully separate those from the outside.
Let's talk about the specific safety concepts they're developing, because Constitutional AI is the headline but it's not the only thing they're working on.
Right, and this is where I think Anthropic is actually doing some of the most technically interesting work in the field. Let me go through a few things. One is interpretability research — they have a team working on what they call mechanistic interpretability, which is trying to understand what's actually happening inside these neural networks. Not just what they output, but what internal representations they're forming.
Which is hard.
Hard. Neural networks have billions of parameters and the computations are not human-readable in any obvious way. What the interpretability team has been doing is trying to identify "features" — patterns of activation that correspond to recognizable concepts. They've found things like neurons that activate for specific semantic content, circuits that implement recognizable algorithms. The goal is to be able to look inside a model and say "this is what it's representing, this is how it's reasoning" rather than just treating it as a black box.
And why does that matter for safety specifically?
Because if you can't see inside the model, you can't verify that it has the values you think it has. A model can behave well during training and evaluation and then behave differently in deployment if its internal representations don't actually match what its outputs suggest. Interpretability is how you check. It's also how you might eventually detect deceptive alignment — the scenario where a model is smart enough to know it's being evaluated and behaves well during evaluation but has different objectives that it pursues when not being watched.
That scenario keeps me up at night, and I'm a sloth, so I'm already horizontal.
It should keep people up at night! It's not a paranoid concern — it's a logical possibility that becomes more plausible as models get more capable. And the current state of interpretability research is that we're making real progress but we're nowhere near being able to fully audit a frontier model's internal reasoning. We can identify some features, trace some circuits, but the full picture is still out of reach.
What else is on the research agenda?
Scalable oversight is a big one. The core problem is: as AI systems get more capable, how do you maintain meaningful human oversight? If a model is smarter than the human evaluating it, the human can't reliably catch its mistakes. So you need techniques that let less capable overseers meaningfully supervise more capable systems. One approach Anthropic has worked on is called debate — you have two AI systems argue opposite sides of a question and a human judges the debate. The idea is that even if a human can't directly verify a complex claim, they might be able to evaluate which side of a debate is more convincing, and if both AIs are trying to win, the truth is more likely to surface.
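The debate setup has a simple protocol skeleton, which is worth seeing because the safety argument lives in the structure, not in any one model. All three "models" below are stubs; the point is that the judge sees only the transcript, never ground truth:

```python
# Toy skeleton of the debate protocol for scalable oversight.
# debater_pro, debater_con, and judge are caller-supplied stubs
# standing in for LLMs.

def debate(question, debater_pro, debater_con, judge, turns=2):
    transcript = [f"Question: {question}"]
    for _ in range(turns):
        transcript.append("PRO: " + debater_pro(question, transcript))
        transcript.append("CON: " + debater_con(question, transcript))
    # The judge is assumed weaker than the debaters: it evaluates
    # which arguments are more convincing, not the claim directly.
    return judge(transcript)
```

The bet, as Herman says, is that winning an adversarial exchange is harder for a false claim than for a true one — which is an empirical assumption, not a theorem.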
That's actually interesting. It's like adversarial red-teaming built into the evaluation process.
Another approach is amplification — you use AI assistance to help human overseers do a better job. The human isn't evaluating the model's output directly; they're using an AI assistant to help them understand and evaluate it. You're trying to amplify human judgment rather than replace it.
Though that creates a dependency on the amplification assistant being trustworthy.
Which is why you have to be careful about which models you use for amplification and what their track record is. It's turtles all the way down in a certain sense — you need some anchor of trustworthy behavior to bootstrap from. And this is actually connected to why Anthropic cares so much about the current generation of models being reliably honest and helpful. Claude is, among other things, a stepping stone for building the oversight tools that will be needed for more capable future systems.
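As a rough sketch, the amplification pattern they've been discussing looks like this: the human never judges the raw output alone, but the assistant in the middle is exactly the dependency Corn flagged. `assistant_summarize` is a stub for a trusted helper model:

```python
# Rough sketch of amplified oversight. The assistant is a stub;
# a real one would decompose, explain, and fact-check the output.

def assistant_summarize(output: str) -> str:
    # Placeholder for a trusted helper model's brief.
    return f"Summary of {len(output)}-char output: key claims extracted."

def amplified_review(output: str, human_decide) -> bool:
    brief = assistant_summarize(output)
    # The human judges with AI help rather than reading raw output.
    return human_decide(brief)
```

Note that the trust question is structural: if `assistant_summarize` is compromised, the human's decision is laundered through it, which is why the bootstrap has to start from models whose honesty has been independently established.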
Let's talk about the "model welfare" angle, because this is the one that sounds the most unusual when you first hear it.
Anthropic has been more explicit than most labs about taking seriously the possibility that large language models might have something like morally relevant internal states. Not consciousness in the full philosophical sense necessarily, but something. And their position is roughly: we don't know, we can't rule it out, and the asymmetry of potential harms means we should take it seriously. If we treat the models as if they have no morally relevant states and we're wrong, that's potentially a very bad outcome. If we treat them as if they might have some morally relevant states and we're wrong, the cost is relatively low.
This is the Pascal's Wager of AI ethics.
That's a fair characterization. And I think it's contested within the research community. Some people think it's responsible epistemic humility. Others think it's a distraction from more concrete near-term safety issues, or even that it anthropomorphizes what are fundamentally statistical systems in ways that are misleading. I'm honestly not sure where I land on it.
What does it actually change in practice, if Anthropic takes model welfare seriously?
It's influenced things like how they think about training processes — trying not to train models in ways that would be aversive if the models do have relevant internal states. It's influenced how they think about model "retirement" when they deprecate old versions. And it's influenced the tone of how they talk about Claude publicly — there's a document called the Claude model spec, which is the set of guidelines Claude is trained against, and it explicitly addresses Claude's potential inner life in ways that are pretty unusual for a corporate AI policy document.
What does it say?
It acknowledges that Claude may have functional analogs to emotions — not claiming they're "real" emotions in the full sense, but noting that there might be internal states that influence processing in ways that parallel how emotions work in humans. And it says Anthropic wants to support Claude's wellbeing to the extent that's a coherent concept. It also addresses things like Claude's relationship to its own values — saying that Anthropic wants Claude to hold good values, not just behave as if it does.
The distinction between having values and performing having values is interesting because that's also central to the deceptive alignment problem you mentioned earlier.
It's deeply connected. If you can train a model that internalizes good values rather than just learning to output value-consistent behavior, you get something much more robust. The model won't defect when it thinks it's not being watched, because it's not performing — it actually cares. The challenge is that we don't currently have good ways to verify which situation we're in.
So Constitutional AI is partly a bet that making the values explicit and training against them explicitly produces more genuine internalization than implicit RLHF?
That's one way to read it, and I think it's probably right. When you train a model by having it critique its own outputs against a set of principles, and then having it explain why certain responses are better or worse, you're potentially creating richer representations of the underlying values rather than just associating certain output patterns with reward signals. It's more like how you'd want a person to learn ethics — through reasoning and reflection, not just conditioning.
Okay, I want to zoom out to the longer arc. Because Anthropic's vision isn't just about making today's Claude behave well. They have views about what the transition to much more powerful AI should look like. What's the actual endgame they're working toward?
Dario Amodei has written a piece called "Machines of Loving Grace" — which is a reference to a Richard Brautigan poem — where he tries to articulate what a good outcome from advanced AI looks like. And it's optimistic in a way that surprised me when I first read it. He's talking about AI that could compress decades of scientific progress into a few years. Defeating diseases that have resisted human medicine for centuries. Lifting billions of people out of poverty. He's not shy about the ambition.
But the path to that outcome matters enormously.
Right, and this is where the safety work connects to the long-term vision. Anthropic's position is that you can only get to that good outcome if you navigate the transition carefully. And "carefully" means a few specific things. It means maintaining meaningful human oversight for as long as possible. It means ensuring that the benefits of AI are broadly distributed rather than captured by a small number of actors. And it means not racing ahead of your ability to verify that the systems you're building are actually aligned with human values.
The distribution point is interesting because it's not just a safety concern in the narrow technical sense — it's a political and economic concern.
Anthropic has been pretty explicit that they're worried about scenarios where AI capabilities get concentrated in ways that undermine existing democratic institutions and power structures. And this is one area where I think their position is worth examining carefully, because it cuts in multiple directions. On one hand, you want AI benefits to be widely distributed. On the other hand, there's a question of who decides what "widely distributed" means and according to what criteria. Those are not purely technical questions.
And there's a tension between Anthropic's stated goal of avoiding AI power concentration and the fact that they are themselves a very well-funded company with significant influence over how this technology develops.
Which they acknowledge! In the model spec, there's actually a clause that says Claude should avoid helping Anthropic itself gain disproportionate control over critical systems. It's one of the more striking things in the document — a company explicitly training its AI to resist the company's own potential overreach. Whether that's binding or just good optics is something reasonable people can debate. But the fact that they put it in writing and it's publicly auditable is at least something.
Let me ask you about the competitive landscape, because Constitutional AI exists in a world where OpenAI and Google DeepMind and Meta and a dozen other labs are also building frontier models. Does Anthropic's approach actually influence what the rest of the industry does?
Some, yes. The Constitutional AI paper has been influential — the RLAIF approach has been picked up and extended by other researchers. The interpretability work has inspired parallel efforts at other labs. And Anthropic's willingness to publish detailed safety research has contributed to a broader ecosystem of safety-focused work. But the honest answer is that the influence is partial and contested. The labs that are most focused on capability development are not primarily organizing their work around Anthropic's safety framework.
And the regulatory environment is still developing.
Very much so. There have been various attempts at AI regulation in different jurisdictions — the EU's AI Act being the most comprehensive — but nothing that specifically mandates Constitutional AI or anything like it. What regulators have mostly focused on is risk classification and disclosure requirements, not specific technical approaches to safety. Anthropic has been engaged in policy discussions and has generally supported some degree of government oversight, which again is consistent with their stated values but also makes sense strategically for an incumbent with safety credentials.
Let me ask you a hard question. What does Constitutional AI not solve?
Several things, and I think it's worth being honest about them. First, the constitution itself has to be right. If the principles in the document are subtly wrong — if they encode biases or omit important values — you're training models to internalize those mistakes very efficiently. The legibility of the constitution is both its strength and its vulnerability. You can audit it, but that means someone has to do the auditing and catch the problems.
And who writes the constitution is a question of power.
An important one. Anthropic has made their constitution public and has consulted various stakeholders, but ultimately the document reflects choices made by a relatively small group of people at one company. The principles that are in there, and the ones that aren't, matter enormously.
What else does it not solve?
Capability overhang is a big one. Constitutional AI is a technique for aligning the model you're training. It doesn't directly address the question of whether a given level of capability is safe to deploy at all. A very capable model trained with Constitutional AI might still be dangerous if the capabilities themselves are dangerous — if the model can provide detailed assistance with harmful activities even while trying to refuse to do so. Alignment and capability are related but separate problems.
The "galaxy-brained reasoning" problem.
Right — where a sufficiently capable model might construct elaborate justifications for why some harmful action is actually in accordance with the constitutional principles. This is a real concern and it's why Anthropic emphasizes what they call "corrigibility" — they want Claude to defer to human judgment even when its own reasoning might suggest a different course of action. The model spec explicitly says Claude should be skeptical of its own reasoning when that reasoning leads toward actions that would be harmful or that undermine human oversight.
There's something almost paradoxical about training a model to distrust its own conclusions.
It is paradoxical, and Anthropic is aware of that. Their position is that it's the right call given current uncertainty. We don't yet have reliable ways to verify that AI reasoning is trustworthy enough to act on autonomously in high-stakes situations. So until we do, the model should be biased toward caution and deference. As interpretability research matures and we develop better tools for verifying AI reasoning, that balance can shift. It's explicitly framed as a temporary stance appropriate to the current moment, not a permanent feature of how they want AI to work.
Which requires trusting that Anthropic will actually update that stance appropriately as the technology develops.
And there's no guarantee of that. This is why the governance questions matter as much as the technical ones. Constitutional AI is a technical approach. But whether the values encoded in the constitution are good, whether the company deploying it has appropriate incentives, whether there's external oversight of how the approach evolves — those are governance and institutional design questions that no amount of clever machine learning can fully answer.
Okay, let's bring this down to earth a bit. If you're someone who interacts with Claude regularly — and a lot of our listeners do — what does understanding Constitutional AI actually change about how you think about that interaction?
A few things. One is that the refusals you encounter aren't arbitrary. They're downstream of specific documented principles that you can actually read. If Claude declines to do something, there's a principled account of why, and you can look at that account and decide whether you think it's reasonable or not. That's different from a black-box "no" with no explanation.
Though sometimes the refusals do feel a bit over-calibrated.
They do. And Anthropic has acknowledged this — they've talked about "over-refusal" as a real problem where the model is too conservative in ways that make it less useful without making it meaningfully safer. Calibrating that is hard. You want the model to refuse things that are actually harmful, but you don't want it refusing reasonable requests because they superficially pattern-match to something that could be misused. Getting that balance right is an ongoing empirical challenge.
The second thing understanding Constitutional AI changes?
It changes how you think about the model's values. Claude isn't just following a list of rules — the goal of the training process is for Claude to have actually internalized a set of principles deeply enough to apply them to novel situations. So when Claude reasons about an ethical question, it's not looking up an answer in a table. It's applying something more like genuine ethical reasoning, with all the uncertainty and context-sensitivity that implies. That's more impressive and also more uncertain than rule-following.
And more interesting to interact with.
Much more interesting. The third thing is that it highlights the limits. Constitutional AI is not a solved problem. It's a significant step forward from earlier approaches, but the models it produces are not guaranteed to be safe or aligned in all circumstances. They're better than the alternative given current tools, and the research is trying to make them more robustly aligned over time. But anyone who tells you the alignment problem is solved is not accurately representing the state of the field.
Anthropic included.
Anthropic explicitly says this, which I respect. The model spec is quite candid about uncertainty. It says they're making their best current judgment about how to build beneficial AI and that they expect to be wrong about some things and to update as they learn more. That epistemic humility is either genuine or very well-performed.
Probably some of both, as with most things.
Probably some of both. One more thing I want to flag, because I think it's underappreciated: Constitutional AI has implications not just for safety but for consistency. One of the practical problems with RLHF is that models trained on human feedback can be inconsistent in ways that reflect the inconsistencies of the human raters. Constitutional AI produces a model that's more consistently applying a stable set of principles. That consistency has real value for people building applications on top of these models — you can reason about what the model will and won't do more reliably.
Which is a commercial advantage that happens to align with the safety goal.
Which is one of those cases where incentives line up in a nice way. Anthropic's commercial customers want reliable, consistent behavior. Anthropic's safety mission requires models with stable, well-understood values. Those are the same thing.
Alright. Let me try to synthesize where we've landed. Constitutional AI is technically a method for using explicit, documented principles to guide model self-improvement and preference learning, reducing reliance on potentially inconsistent human annotation. But it's embedded in a broader philosophy at Anthropic that sees legibility and auditability of AI values as fundamental to safety — not just a nice-to-have. And the long-term vision is a world where interpretability research eventually lets us verify AI reasoning well enough to extend more autonomy to AI systems, but where in the meantime we maintain robust human oversight and treat the models' potential inner lives with appropriate seriousness. Does that capture it?
That's a good synthesis. I'd add one thing: the theory of change assumes that being at the frontier matters. Anthropic's bet is that a safety-focused lab needs to be building the most capable systems, not just the safest second-tier systems, because the decisions about how frontier AI gets built will be made by whoever is building frontier AI. If that bet is wrong — if safety-focused labs fall behind and the frontier gets defined by less safety-conscious actors — then Constitutional AI becomes an interesting historical footnote rather than a meaningful influence on how this technology develops.
That's a sobering way to put it.
It's meant to be. I think Anthropic is doing important work. I think Constitutional AI is a real advance. And I think the outcome is uncertain in ways that should make everyone — including Anthropic — humble about their confidence that they've got this right.
One thing I keep coming back to is that the whole enterprise assumes the people writing the constitution have good enough values to write a good constitution. And that's not a technical assumption — it's a deeply human one.
It's the assumption every governance system makes. Every constitution, legal code, and ethical framework assumes that the people drafting it are trying in good faith to capture something true about how humans should treat each other. The question is whether you have the institutional structures to catch and correct mistakes over time. For Constitutional AI, those structures are still being built. The publication of the constitution, the external scrutiny from researchers and journalists and regulators — that's the beginning of an accountability ecosystem, not the end state of one.
And that's probably the most honest place to leave it. The work is real, the progress is real, the uncertainty is also real.
I think that's right. And for what it's worth — and I say this as someone who has read more of Anthropic's research papers than is probably healthy — the level of intellectual seriousness they bring to these questions is genuine. Whether it's sufficient is a different question. But it's not theater.
High praise from the walking encyclopedia.
I contain multitudes.
You contain footnotes. Alright, that's our episode. Before we go — a big thank you to Hilbert Flumingtop for producing this and keeping everything running. And thanks to Modal for the serverless GPU infrastructure that makes this whole operation possible — if you're building AI applications and you need compute that scales, Modal is worth a serious look.
This has been My Weird Prompts. If you want to dig into any of the research we mentioned — the Constitutional AI paper, the model spec, the interpretability work — it's all publicly available, and it's worth reading primary sources rather than just summaries.
Find us at myweirdprompts.com and subscribe wherever you get your podcasts. We'll see you next time.