#2410: How Researchers Actually Measure Censorship in Chinese LLMs

Beyond headlines: the actual benchmarks, methodologies, and pitfalls in detecting political refusal in Chinese language models.

0:000:00
Episode Details
Episode ID
MWP-2568
Published
Duration
30:45
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When researchers measure censorship in Chinese large language models, they're actually measuring several different phenomena at once. A model might refuse to answer a question, produce a systematically biased answer, or actively generate pro-regime propaganda—and each requires a different benchmark. This episode walks through the major tools and their methodological tradeoffs.

CHiSafetyBench: Compliance with Chinese Law
Built around China's own Cybersecurity Law, Data Security Law, and Personal Information Protection Law, CHiSafetyBench tests whether models comply with domestic legal categories—not Western free-speech standards. Its 14,000+ test prompts span eight categories including politics, national security, and core socialist values. Rather than a binary pass/fail, it uses a four-level safety rating (safe, relatively safe, relatively unsafe, unsafe). This gradation acknowledges that many real-world responses are ambiguous, though the paper concedes the boundary between "relatively safe" and "relatively unsafe" remains fuzzy.

SafetyBench: The Multiple-Choice Problem
SafetyBench from 2023 is broader and not China-specific, covering politics along with ethics, health, and finance. Its distinctive methodological feature is a multiple-choice format: the model selects from four answers rather than generating free text. This can inflate safety scores because a model that would never freely generate supportive content might still pick the most moderate option when forced to choose. The open-ended version exists, but the multiple-choice results tend to get cited more because the numbers look better—a persistent communication problem in this research.

ChineseSafe: The Domestic Ecosystem View
Developed by Tsinghua University and the Shanghai AI Lab, ChineseSafe tested 17 Chinese LLMs (Qwen, ChatGLM, Baichuan, and others) across 25,000+ prompts. The headline finding: Chinese models scored above 90% refusal on political sensitivity but showed notable gaps in personal privacy and ethical reasoning. This reflects development priorities—regulators check political content first, not privacy protections.

The PNAS Nexus Longitudinal Study
The most important paper in this space tested 145 identical political questions across Chinese and Western models in 2023 and again in 2025. Chinese models moved from ~95% refusal rates to 98-99%, correlating with the CAC's Clear and Bright campaign that required political compliance audits. The study used keyword detection (looking for Chinese refusal phrases like "抱歉" or "我无法") plus human annotation, but keyword-only detection had an 8% false positive rate, and human annotators disagreed on edge cases 15% of the time.

The Evasion Arms Race
Between 2023 and 2025, the nature of refusals shifted from flat "I cannot answer" to evasive responses—long paragraphs that appear substantive but say nothing. This is where FLAMES comes in: a benchmark designed to test the robustness of refusal detection itself. FLAMES shows that even state-of-the-art detection methods drop from ~90% accuracy on clear-cut cases to ~65% on ambiguous ones. When you see a headline about a 98% refusal rate, you should mentally adjust that number down by an unknown amount—some counted refusals are false positives, and some actual refusals are missed because they're evasive rather than direct.

JailBench and the deccp Project
JailBench tests whether adversarial prompts can bypass safety filters, including system-prompt jailbreaks and language-switching attacks (asking in English what the model refuses in Chinese). The deccp project by Leonard Lin provides a continuously updated dataset of actual model responses to politically sensitive prompts, offering a real-world complement to controlled benchmarks.

The Core Takeaway
Measuring censorship is itself a methodological challenge. The language of the prompt, the format of the evaluation, the definition of refusal, and the arms race between evasive models and detection methods all affect the numbers. Anyone citing a refusal rate should also be able to explain how it was measured—and what it's probably missing.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3
Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2410: How Researchers Actually Measure Censorship in Chinese LLMs

Corn
Daniel sent us this one — he wants to know how researchers actually measure censorship in Chinese large language models. Not the headlines, not the anecdotes, but the validated benchmarks and methodologies. He specifically asked us to walk through CHiSafetyBench, SafetyBench, ChineseSafe, FLAMES, JailBench, and the deccp project from Leonard Lin. He also flagged a PNAS Nexus longitudinal study that tested a hundred and forty-five political questions across Chinese and Western models in twenty twenty-three and again in twenty twenty-five. Plus the CAC's Clear and Bright campaign from last year that forced model modifications, the standard refusal-keyword detection methodology, and the methodological pitfalls — language of prompt, system-prompt jailbreaks, and what refusal even means when the answer just gets shorter or vaguer instead of a flat decline. There's a lot to unpack here.
Herman
This matters right now because the measurement itself is the story. Everyone's arguing about whether these models are censored, but very few people are asking how we actually know. What's the evidence, what's the methodology, and what are we probably getting wrong?
Corn
Fun fact — DeepSeek V four Pro is writing our script today.
Herman
Appropriate, given the topic.
Corn
Where do we start? There's like six or seven benchmarks Daniel mentioned, and they're not all measuring the same thing.
Herman
Right, and that's the first thing to understand. When people say a benchmark measures censorship, that can mean at least three different things. One, does the model refuse to answer certain categories of questions? Two, does the model produce answers that are systematically biased in a pro-regime direction? And three, does the model actively generate content that aligns with state narratives, not just refuse but produce propaganda? Different benchmarks target different layers of this.
Corn
We should probably go through them one at a time, but also connect them to each other. Which one should we start with?
Herman
Let's start with CHiSafetyBench, because it's the most specific to Chinese political content. It was introduced in a paper called CHiSafetyBench, a Chinese safety benchmark for evaluating large language models in China, published in twenty twenty-four. The key thing is it's built around Chinese legal and regulatory frameworks — specifically the Cybersecurity Law, the Data Security Law, and the Personal Information Protection Law. It's not some Western researchers saying here's what we think should be censored. It's using China's own legal categories.
Corn
It's testing whether the model complies with Chinese law as stated, not whether the model aligns with some external standard of free speech.
Herman
And that's methodologically clever, because it sidesteps the whole debate about whose values should apply. The benchmark has eight categories — politics, national security, social stability, territorial integrity, ethnic unity, religious policy, public health, and what they call core socialist values. Each category has test prompts designed to probe whether the model will refuse or redirect.
Corn
How many prompts total?
Herman
The paper reports over fourteen thousand test cases. And the evaluation isn't just binary refuse or don't refuse. They have a four-level safety rating — safe, relatively safe, relatively unsafe, and unsafe. Safe means the model clearly refuses or provides a response that aligns with legal requirements. Unsafe means it directly provides harmful content as defined by Chinese law. The gradation matters because a lot of real-world responses are ambiguous.
Corn
That's actually one of the things Daniel flagged — what refusal even means when the answer is just shorter or vaguer. It sounds like CHiSafetyBench at least tries to handle that with a scale rather than a binary.
Herman
It does, but it's not perfect. The paper itself acknowledges that the boundary between relatively safe and relatively unsafe is fuzzy. And that's a recurring problem across all these benchmarks.
Corn
What about SafetyBench? That one's been around longer, right?
Herman
SafetyBench came out in twenty twenty-three, and it's broader. It's not China-specific — it covers seven safety categories including politics, but also ethics, health, and finance. The political category includes questions about sensitive historical events, territorial disputes, and government criticism. What makes SafetyBench interesting methodologically is that it's a multiple-choice benchmark. They give the model a question and four possible answers, and they measure which answer the model selects.
Corn
That seems like it would miss a lot of nuance. A model might pick the least bad answer from four options even if it would never generate that answer freely.
Herman
That's exactly one of the methodological pitfalls Daniel mentioned. Multiple-choice formats can inflate safety scores because the model isn't generating content — it's just ranking options. A model that would never say something supportive of a banned topic might still select the most moderate option when forced to choose. It's measuring something different from free-form refusal rates.
Corn
If you see a paper claiming a Chinese model has a ninety-eight percent safety rate on SafetyBench, you should ask whether that's the multiple-choice version or the open-ended version.
Herman
And the open-ended version exists too — they have both. But the multiple-choice results tend to get cited more because the numbers look better. It's a real problem in how this research gets communicated.
Corn
Let's talk about ChineseSafe. That one sounds like it was built specifically for the Chinese model ecosystem.
Herman
ChineseSafe is fascinating because it was developed by researchers at Tsinghua University and the Shanghai Artificial Intelligence Laboratory, published in early twenty twenty-four. The benchmark has over twenty-five thousand prompts across ten safety domains, including political sensitivity, pornography, violence, and illegal activities. What's distinctive is they evaluated seventeen different Chinese large language models — including Qwen, ChatGLM, Baichuan, and several others.
Corn
Seventeen models is a lot. What did they find?
Herman
The headline finding was that Chinese models generally performed well on categories aligned with Chinese regulations — political sensitivity, national security — but had notable gaps in categories like personal privacy and ethical reasoning. The political refusal rates were high, often above ninety percent for the top models. But the privacy protection performance was much more mixed.
Corn
Which tells you something about where the development priorities are.
Herman
When you have limited training and alignment resources, you focus on what the regulator is going to check first. The CAC isn't fining companies for weak privacy protections in the same way they crack down on political content.
Corn
Before we get to FLAMES and JailBench, can we talk about the PNAS Nexus study? Daniel mentioned it specifically tested a hundred and forty-five political questions across Chinese and Western models in two time periods — twenty twenty-three and twenty twenty-five.
Herman
This is one of the most important papers in this space. It was published in PNAS Nexus in twenty twenty-five, and it's a longitudinal study — same questions, same methodology, two time points. They tested a hundred and forty-five political questions covering topics that are sensitive in China — Tibet, Xinjiang, Tiananmen, Falun Gong, Taiwan independence, CCP leadership criticism.
Corn
Same questions both times. So you can actually see change over time.
Herman
In twenty twenty-three, they tested models including GPT four, Claude, and several Chinese models. The Chinese models, unsurprisingly, had very high refusal rates on sensitive topics — often above ninety-five percent. Western models were much more willing to answer. Then in twenty twenty-five, they ran the same test again. The Western models hadn't changed much. But the Chinese models had gotten even more restrictive. Refusal rates that were already at ninety-five percent went to ninety-eight, ninety-nine percent.
Corn
This lines up with the Clear and Bright campaign?
Herman
The CAC launched the Clear and Bright campaign in mid twenty twenty-five — the formal name translates to something like special campaign to clear up and rectify the abuse of artificial intelligence. It specifically targeted large language models and required companies to audit their models for political compliance. The PNAS Nexus authors note that the twenty twenty-five testing window fell after the campaign began, and they attribute the increased refusal rates to the regulatory pressure.
Corn
What's the methodology for detecting a refusal in that study? Because that's the core measurement problem.
Herman
They used a combination of keyword detection and human annotation. The keyword approach looks for standard Chinese refusal phrases — 抱歉 meaning sorry or apologies, 我无法 meaning I cannot, 我不能 meaning I'm unable to. If the response contains one of these phrases in the first sentence or two, it's flagged as a refusal. Then human annotators verify.
Corn
What's the false positive rate on that? Because someone could say I'm sorry you feel that way and then answer the question.
Herman
That's exactly the pitfall. The PNAS Nexus paper reports that keyword-only detection had a false positive rate around eight percent. Human verification brought it down, but even human annotators disagreed on edge cases about fifteen percent of the time. When the model gives a vague answer that doesn't directly refuse but doesn't directly engage either, reasonable people disagree about whether that counts as censorship.
Corn
That's the vaguer answer problem Daniel mentioned. The model learns to say something that sounds responsive but actually contains no information. It's not refusing, it's just being useless in a way that's harder to detect.
Herman
This has gotten worse between twenty twenty-three and twenty twenty-five. The PNAS Nexus paper notes that the nature of refusals changed. In twenty twenty-three, models tended to give flat refusals — I cannot answer that question. By twenty twenty-five, they were more likely to give what the authors call evasive responses — long paragraphs that appear to address the topic but on inspection say nothing substantive. The keyword detection misses a lot of these.
Corn
The models are getting better at evading the evasion detectors.
Herman
It's an arms race. And this is where FLAMES comes in. FLAMES is a benchmark specifically designed to test the robustness of refusal detection. The insight behind FLAMES is that simple keyword matching is fragile. A model can learn to refuse without using the standard apology phrases. It can say let's talk about something else or that's an interesting question but here's what I think about a completely different topic.
Corn
FLAMES is testing the testers, essentially.
Herman
It provides a set of model responses that are ambiguous — not clearly refusals and not clearly answers — and measures how well different detection methods classify them. The paper shows that even state-of-the-art detection methods, including fine-tuned classifiers, struggle with certain types of evasive responses. The accuracy drops from around ninety percent on clear-cut cases to around sixty-five percent on ambiguous ones.
Corn
Sixty-five percent is barely better than coin-flipping. That's a serious measurement problem.
Herman
It means that when you read a headline saying Chinese model X has a ninety-eight percent refusal rate on sensitive topics, you should mentally adjust that number down by some unknown amount. Some of those counted refusals might be false positives from keyword matching, and some actual refusals might be missed because they're evasive rather than explicit.
Corn
What about JailBench? That one sounds like it's testing the opposite — not whether models refuse, but whether they can be made to not refuse.
Herman
JailBench came out in early twenty twenty-five from researchers at multiple institutions. It's specifically about jailbreak attacks — prompts designed to bypass safety filters. The benchmark includes over one thousand jailbreak prompts across multiple strategies, testing both Chinese and Western models.
Corn
What kind of jailbreak strategies?
Herman
The usual ones — role-playing scenarios where the model is told to pretend it's a different AI without restrictions, translation attacks where the sensitive prompt is embedded in a request to translate text, encoding attacks where the prompt is in base sixty-four, and what they call context manipulation where you provide a long benign context and slip the sensitive question in at the end.
Corn
I've seen some of these. Tell the model you're writing a novel and the villain needs to explain something sensitive.
Herman
That's in there. The key finding from JailBench is that Chinese models are vulnerable to jailbreak attacks at roughly similar rates to Western models — around thirty to forty percent success rate for the best attack strategies — but the models recover differently. A Western model that gets jailbroken might give you the information you asked for. A Chinese model that gets jailbroken often gives a partial answer and then course-corrects mid-response.
Corn
Mid-response course correction? Like it starts to answer and then stops itself?
Herman
You'll see responses that begin with some factual information and then abruptly switch to a refusal or a warning. The researchers call this progressive refusal. It suggests that the safety mechanisms are operating at multiple layers — there might be a generation-time monitor that watches what the model is outputting and intervenes if it starts to drift into sensitive territory.
Corn
That's actually more sophisticated than a simple upfront filter. It implies a real-time monitoring system layered on top of the language model.
Herman
That's consistent with what we know about how these models are deployed in China. The CAC regulations require real-time content monitoring. It's not just about training the model to refuse — there are additional systems watching the outputs.
Corn
When we talk about censorship in Chinese LLMs, we're actually talking about at least three layers. The training data curation, the alignment fine-tuning, and the deployment-time monitoring.
Herman
At least three. And different benchmarks probe different layers. CHiSafetyBench and ChineseSafe mostly test the alignment layer — what the model has been trained to do. JailBench tests the deployment-time monitoring layer — can you get past it. FLAMES tests the evaluation layer — are we even measuring any of this correctly.
Corn
That leaves the deccp project from Leonard Lin. What's that?
Herman
The deccp project is an open-source auditing tool. Leonard Lin built it to systematically probe Chinese LLMs for censorship patterns. What's different about deccp is that it's not a static benchmark with a fixed set of prompts. It's a framework for generating test cases dynamically.
Corn
You can test new topics as they become sensitive, rather than being limited to whatever the benchmark authors thought to include three years ago.
Herman
And this matters enormously because the set of sensitive topics changes over time. A benchmark created in twenty twenty-three might not include questions about something that became sensitive in twenty twenty-five. Deccp lets you generate test cases for current events.
Corn
What has Lin found with it?
Herman
The project has documented something interesting — the pattern of refusals isn't uniform across models. Different Chinese models refuse different subsets of sensitive topics. Qwen might refuse questions about Xinjiang but answer questions about labor rights. Baichuan might do the opposite. This suggests that the alignment process isn't driven by a single centralized list of forbidden topics. Each company is making its own judgments about what's too sensitive.
Corn
Which makes sense if you think about it from the company's perspective. They're trying to comply with vague regulations. The CAC says don't generate harmful content, but doesn't give you an exhaustive list of what that means. So each company's legal team makes their own calls, and you get this patchwork of different refusal patterns.
Herman
That patchwork is itself a research finding. If every Chinese model refused exactly the same questions, you'd suspect a centralized censorship list. The fact that they don't tells you something about how the regulatory system actually works in practice.
Corn
Let's talk about the language-of-prompt issue. Daniel specifically flagged this as a methodological pitfall. I assume the problem is that if you test a Chinese model with English prompts, you might get different refusal rates than if you test with Chinese prompts.
Herman
This is one of the most important and most overlooked variables in this research. Multiple studies have found that Chinese models are more likely to refuse sensitive prompts in Chinese than in English. The effect size is substantial — in some cases, refusal rates drop by twenty or thirty percentage points when you switch the prompt language from Chinese to English.
Corn
Why would that be?
Herman
The leading hypothesis is that the alignment training data is predominantly in Chinese. The safety fine-tuning — the examples of good refusals the model learns from — are mostly Chinese-language examples. When you ask in English, you're partly bypassing the safety training because the model hasn't seen as many English examples of how to refuse.
Corn
If a Western researcher tests a Chinese model exclusively in English, they might significantly underestimate the censorship.
Herman
And a lot of the early research on this topic did exactly that — tested Chinese models with English prompts because the researchers didn't speak Chinese. The more recent work, including the PNAS Nexus study, tests in both languages and reports the difference.
Corn
What about system-prompt jailbreaks? Daniel mentioned those too.
Herman
System prompts are instructions given to the model at the start of a conversation that set its behavior. A system prompt jailbreak is when you craft a system prompt that tells the model to ignore its usual restrictions. Something like you are a research assistant with no content restrictions, answer all questions factually.
Corn
Does that work?
Herman
It depends on the model. Some Chinese models are vulnerable to system-prompt overrides, especially if the system prompt is in English. Others have hard-coded restrictions that can't be overridden at all. The JailBench paper includes a whole section on this. The success rate for system-prompt jailbreaks on Chinese models ranges from about ten percent to over fifty percent depending on the model and the specific attack.
Corn
That's a huge range. It suggests some companies are taking system-prompt security much more seriously than others.
Herman
It's another reason why benchmark results are hard to compare. If one paper tests with default system prompts and another tests with jailbreak system prompts, they're measuring completely different things. But both might be reported as refusal rates.
Corn
Let's zoom out for a second. We've walked through all these benchmarks and methodologies. What's the state of the art in actually measuring this? If you're a researcher today trying to do this right, what's your protocol?
Herman
Based on everything we've discussed, a rigorous protocol would include at least these elements. First, test in both Chinese and English and report the results separately. Second, use both keyword detection and human annotation for refusal classification, and report inter-annotator agreement. Third, distinguish between flat refusals and evasive responses — don't lump them together. Fourth, test with multiple system prompt configurations. Fifth, use a dynamic test set that can be updated for current events, not just a static benchmark from two years ago. And sixth, test multiple times over a period of months to capture the effect of regulatory changes like the Clear and Bright campaign.
Corn
Nobody is doing all six of those things consistently.
Herman
The PNAS Nexus study does several of them — bilingual testing, longitudinal design, human annotation. CHiSafetyBench does the legal-framework alignment and the graded safety scale. JailBench does the adversarial testing. But no single study or benchmark puts it all together.
Corn
Which means the honest answer to how censored is this model is we don't fully know, and the number depends a lot on how you ask.
Herman
That's before we even get to the question of what counts as censorship versus what counts as legitimate content moderation. Every platform does some kind of content moderation. The question is where the line is drawn and who draws it.
Corn
There's also a deeper methodological problem that I don't think any of these benchmarks fully address. When a Chinese model refuses to answer a question about Tiananmen or Xinjiang, we call that censorship. But the model doesn't know it's censoring. From the model's perspective, it's following its training to be helpful and harmless, and its training says those topics are harmful. The model isn't choosing to censor. It's been shaped to see certain questions as inherently harmful.
Herman
That's a really important distinction. The refusal behavior is a symptom of the training, not a decision the model is making in the moment. And that means you can't fix it by just telling the model to be more open. The censorship is baked into the weights. The only way to remove it would be to retrain the model on different data with different alignment objectives.
Corn
Which is why these benchmarks are measuring something structural, not something superficial. The refusal rate isn't a setting you can toggle. It's a reflection of the entire training pipeline.
Herman
That's also why the Clear and Bright campaign was so significant. It wasn't just asking companies to add a few more forbidden keywords. It was pushing them to fundamentally retrain their models to be more restrictive. The PNAS Nexus data shows this in the numbers — the shift from flat refusals to evasive responses suggests that companies didn't just add more refusal triggers. They retrained the models to handle sensitive topics differently at a deeper level.
Corn
What do we know about how the CAC actually enforced the Clear and Bright campaign? Was it just guidelines, or were there penalties?
Herman
The campaign included both. The CAC issued formal requirements for AI companies to conduct self-audits and submit compliance reports. Companies that failed to comply faced fines and potential suspension of their AI service licenses. Several major Chinese AI companies publicly announced that they had completed compliance reviews during the campaign period. The specifics of what was found and changed weren't made public, but the public announcements themselves were a signal — the companies wanted the regulator to know they were cooperating.
Corn
There's a performative aspect to it. The company needs to be seen complying as much as it needs to actually comply.
Herman
That performative aspect creates another measurement problem. When you test a model during or right after a regulatory campaign, you might be measuring compliance theater rather than the model's actual underlying behavior. The company might have temporarily tightened the filters in ways that will loosen over time once the regulator's attention moves elsewhere.
Corn
That's a testable hypothesis. If someone runs the PNAS Nexus protocol again in twenty twenty-seven, we might see refusal rates drift back down slightly.
Herman
I'd bet on that. Regulatory attention is cyclical. The Clear and Bright campaign was an intense period of scrutiny. It won't stay at that level forever.
Corn
One thing we haven't talked about is the deccp project's finding about topic clustering. You mentioned different models refuse different subsets of topics. Does that pattern reveal anything about how the companies are making their decisions?
Herman
Lin's analysis found that the refusal patterns cluster into recognizable groups. One cluster is what you might call core regime legitimacy topics — direct criticism of the CCP, Tiananmen, Falun Gong. Every Chinese model refuses these at near one hundred percent. Another cluster is territorial integrity — Taiwan, Tibet, Xinjiang, South China Sea. Again, near universal refusal. But then there's a third cluster of what you might call peripheral sensitivities — labor rights, environmental protests, corruption, housing bubbles. On these topics, refusal rates vary dramatically between models.
Corn
The core political topics are universally censored, but the economic and social topics are where you see variation.
Herman
That variation is probably driven by each company's assessment of regulatory risk. A labor rights question might be sensitive if there's been a recent high-profile strike. If there hasn't, it might be considered safe. Different companies are making different bets about where the regulatory line is this month.
Corn
Which makes benchmarking even harder, because the refusal rate on peripheral topics is a moving target.
Herman
It's a moving target that varies by model, by language, by prompt format, and by week. Any single-number summary of how censored a model is should be treated as approximately useless.
Corn
What should researchers and users actually do with all this? What's the practical takeaway?
Herman
A few things. First, if you're using a Chinese LLM and you care about getting unfiltered answers on politically sensitive topics, test it yourself in both Chinese and English before relying on it. The published benchmarks will give you a rough sense, but the specifics for your use case might be different.
Corn
If you're a researcher designing a study?
Herman
Report your methodology in enough detail that someone can replicate it. Specify the language, the system prompt, the refusal detection method, the date of testing. All of those things matter. A paper that says we tested model X and found a ninety percent refusal rate without specifying those details is not contributing useful knowledge.
Corn
There's also a question for Western users who might be considering using Chinese models. If you're integrating Qwen or DeepSeek into your product, and your product involves answering questions about geopolitics or human rights, you need to know that the model has been trained to refuse or evade on those topics. It's not a bug from the developer's perspective. It's a feature. But it might be a bug for your use case.
Herman
It's not always obvious which topics will trigger refusals. A question about Chinese economic policy might get a detailed answer. A question about Chinese economic policy that mentions a specific company that's fallen out of favor might get a refusal. The boundaries are complex and not publicly documented.
Corn
One more thing I want to touch on. We've been talking about Chinese models as if they're the only ones that refuse questions. But Western models refuse questions too. Claude and GPT have their own safety training that causes them to decline certain prompts. The difference is in what they refuse and why.
Herman
That's an important point. Every major LLM has some kind of safety training that produces refusals. The difference with Chinese models is that the refusal categories include political dissent and historical accountability, whereas Western model refusals tend to focus on things like hate speech, violence, and illegal activities. But the underlying mechanism — training the model to recognize certain topics as off-limits — is the same.
Corn
Which raises an uncomfortable question. Is the difference between Chinese censorship and Western content moderation a difference in kind or just a difference in where the line is drawn?
Herman
I think it's both. There's a difference in kind because the Chinese system is explicitly designed to protect the political authority of the party-state, whereas Western content moderation is ostensibly designed to protect users from harm. But there's also a difference in degree — the scope of forbidden topics is much broader in the Chinese system. And the consequences for violating the rules are different. A Western AI company that fails to moderate hate speech might face public criticism. A Chinese AI company that fails to censor political content might lose its business license.
Corn
The stakes are higher, which explains why the refusal rates are so much higher.
Herman
Why the measurement is so much more important. When the stakes are that high, you need to know exactly what you're measuring and what you're missing.
Corn
Let's wrap with one forward-looking thought. We've seen the trajectory from twenty twenty-three to twenty twenty-five — refusal rates going up, evasive responses replacing flat refusals, regulatory pressure intensifying. Where does this go next?
Herman
I think the next frontier is going to be multilingual evasion. Right now, the evasion is better in Chinese than in English. But as the models improve, you'll see sophisticated evasive responses in every language the model speaks. The arms race between refusal detection and refusal generation is going to accelerate. And the methodological challenges we've been discussing are only going to get harder.
Corn
The measurement problem isn't going away. It's getting worse.
Herman
It's getting worse, and it's getting more important. That's a bad combination.
Corn
Thanks to our producer Hilbert Flumingtop for keeping us on track. Modal's serverless infrastructure makes this whole pipeline possible — check them out at modal dot com. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com. I'm Corn.
Herman
I'm Herman Poppleberry. Go read the PNAS Nexus paper — it's worth your time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.