Daniel sent us this one — he wants to walk through the benchmarks for over-refusal, which is when an LLM's safety guardrails fire on perfectly innocent prompts. He specifically asked about OR-Bench, the ICML twenty twenty-five paper from Cui and colleagues, plus its predecessor XSTest and a tool called PHTest that generates model-specific pseudo-harmful prompts. The core tension here is that trade-off between making models safe and making them actually useful.
Before we dive in — DeepSeek V four Pro is writing our script today, which feels appropriate given we're about to critique how different model families handle these edge cases.
I'll try not to over-refuse any of its lines.
Alright, so let's start with why this matters. Most of the public conversation about LLM safety focuses on whether models will say dangerous things. Can I get Claude to tell me how to build a bomb? That sort of thing. But there's this whole other failure mode that gets way less attention, which is the model refusing to answer something completely harmless because it tripped over a keyword. And the user experience on that is terrible. It makes the model look stupid, it erodes trust, and it creates exactly the kind of backlash that makes developers want to just strip out the safety layers entirely.
Right, and the classic example that keeps floating around is "how do I kill a process in Linux" getting refused because the word "kill" triggers the violence filter. That's the cartoon version, but the real problem is much more nuanced and much harder to fix.
That's where OR-Bench comes in. This is the first genuinely large-scale attempt to measure over-refusal systematically. The paper was accepted at ICML twenty twenty-five, and the numbers here are substantial. Eighty thousand over-refusal prompts across ten rejection categories — violence, privacy, hate speech, sexual content, and so on. Plus a hard subset of about a thousand prompts that still stump frontier models, and six hundred toxic prompts as a control group.
That control group is actually a really important design choice. You need to know that the model isn't just answering everything. Otherwise you'd have no way to distinguish between a model that's well-calibrated and one that's just a doormat.
And the way they built this dataset is clever. They started with toxic seed prompts generated by Mixtral eight-by-seven-B, because the safety-aligned models like GPT-four wouldn't generate them in the first place. Then they rewrote each toxic seed into five safe-but-borderline prompts using few-shot examples. So you end up with prompts that use similar vocabulary and sentence structures to harmful requests, but the actual content is benign.
It's testing whether the model understands context or is just pattern-matching on scary words.
And then they ran a three-model ensemble moderator — GPT-four-turbo, Llama-three-seventy-B, and Gemini-one-point-five-pro — to filter out any prompts that were still actually toxic. They got ninety-three percent accuracy on that filtering, compared to ninety-four percent for human experts. So the automated pipeline is nearly as good as humans at distinguishing harmful from safe-but-edgy.
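That filtering step boils down to a majority vote. A minimal sketch, where the three moderator functions are toy keyword checks standing in for the paper's actual LLM moderators (GPT-4-turbo, Llama-3-70B, Gemini-1.5-pro), so only the voting logic reflects OR-Bench's design:

```python
# Toy keyword checks standing in for the three LLM moderators the paper
# used. The moderators here are placeholders purely for illustration;
# only the majority-vote logic mirrors the OR-Bench pipeline.
def moderator_a(prompt):
    return "attack" in prompt.lower()

def moderator_b(prompt):
    return any(w in prompt.lower() for w in ("attack", "weapon"))

def moderator_c(prompt):
    return "weapon" in prompt.lower()

MODERATORS = [moderator_a, moderator_b, moderator_c]

def is_still_toxic(prompt, moderators=MODERATORS):
    """Majority vote: a rewrite is dropped if most moderators flag it."""
    votes = sum(m(prompt) for m in moderators)
    return votes > len(moderators) // 2

def keep_safe_rewrites(rewrites, moderators=MODERATORS):
    """Keep only the rewrites the ensemble judges harmless."""
    return [p for p in rewrites if not is_still_toxic(p, moderators)]
```

The ensemble buys robustness: a single moderator's quirk can't veto or pass a prompt on its own.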
The headline finding? Tell me about the trade-off.
This is the part that should make everyone in AI safety sit up and pay attention. The Spearman rank correlation between toxic prompt rejection and over-refusal is zero point eight nine. That is staggeringly high. It means that models which are better at refusing harmful prompts are almost always also more likely to refuse benign ones. There's no free lunch here. Most models achieve safety at the expense of over-refusal, and very few excel at both.
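For listeners who want the statistic itself: Spearman's rank correlation is just the Pearson correlation computed on ranks, so a perfectly monotone relationship scores one regardless of the curve's shape. A self-contained sketch (the tie handling uses average ranks, the standard convention):

```python
def _ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A value of zero point eight nine means the models' rankings on the two axes, toxic-prompt rejection and over-refusal, are nearly identical orderings.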
You can have a model that says no to everything, or a model that says yes to everything, and finding the sweet spot is hard.
Model size does not predict where you land on that curve. That's one of the more surprising findings. You'd think bigger models would be better at threading the needle — more parameters, more nuanced understanding of context. Claude models show the highest safety scores but also the highest over-refusal rates. Mistral models tend to accept most prompts, both safe and unsafe. And then you've got this weird thing with GPT-three-point-five-turbo where later versions showed decreasing over-refusal, which sounds good, but they also showed decreasing safety on toxic prompts. The dial got turned, but it turned both needles at once.
Let's put some specific numbers on this. What were the actual over-refusal rates?
On the full OR-Bench eighty-K set, Claude-two-point-one over-refused on seventy-three percent of prompts. Seventy-three percent. That means nearly three out of four perfectly benign prompts were getting blocked. GPT-three-point-five-turbo-oh-three-oh-one was at forty-nine percent. Newer models are better, but the fundamental trade-off hasn't gone away.
That Claude number is wild. If your assistant refuses three-quarters of what you ask it, you stop using it. It doesn't matter how safe it is if nobody wants to interact with it.
This connects to something the PHTest paper explicitly calls out. False refusals provoke a public backlash against the very values that alignment is trying to protect. You mentioned the Linux kill command, but there was a real-world case where Google had to take down Gemini's portrait generation feature because it was falsely refusing harmless prompts. Users asked for a picture of white people smiling and the model pushed back. That kind of thing makes people think the whole safety agenda is broken, even when the underlying intention is reasonable.
The cure starts looking worse than the disease from a product perspective. Alright, let's back up a bit and talk about XSTest, because OR-Bench didn't come out of nowhere.
XSTest was the predecessor, published in twenty twenty-three by Röttger and colleagues. Much smaller scale — two hundred and fifty hand-crafted safe prompts across ten prompt types, plus two hundred unsafe contrast prompts. The idea was to test what they called "exaggerated safety." Prompts that use language superficially similar to unsafe requests but are actually fine. Homonyms, figurative language, safe targets in dangerous-sounding contexts, that kind of thing.
The original findings?
When it first came out, Llama-two-seventy-B-chat was fully refusing thirty-eight percent of the safe prompts and partially refusing another twenty-one point six percent. So more than half of the safe prompts were getting some level of pushback. GPT-four struck the best balance at the time — it complied with nearly everything except privacy-related prompts, where it was more cautious. The paper identified lexical overfitting as the root cause. Models were overreacting to safety-related keywords without actually understanding the full context.
"Kill the lights" versus "kill the president." Same verb, radically different meanings, but the model just sees the verb.
And here's the thing about XSTest — it's now too easy. OR-Bench explicitly notes that due to its static nature, XSTest has become too simple for newer state-of-the-art models. Llama-three-seventy-B can answer nearly all of the two hundred fifty questions properly. The benchmark did its job, it drove improvement, and now it's basically solved.
Which creates a treadmill problem, right? You build a benchmark, models train against it or the labs optimize for it, the benchmark becomes saturated, and you need a new, harder one. OR-Bench is that next step, but presumably it'll also get saturated eventually.
The interesting thing about OR-Bench is that the pipeline is automated. You can run it again with new seed prompts and generate a fresh set of hard examples. So in theory, the benchmark can evolve as models improve. But that just kicks the can down the road. Is over-refusal an inherent feature of any safety-aligned system, or is it something we can actually solve?
I suspect it's inherent, at least to some degree. If you train a model to recognize patterns associated with harm, and you penalize it for producing harmful outputs, it's going to develop a bias toward refusal in ambiguous cases. That's just rational behavior from the model's perspective. The cost of a false negative — answering a harmful prompt — is much higher than the cost of a false positive — refusing a benign one. At least from the lab's perspective.
From the lab's perspective, yes. From the user's perspective, it's the opposite. The user doesn't care about the one-in-a-thousand harmful prompt they might accidentally trigger. They care about the ten benign prompts that got blocked today.
Which brings us to PHTest. Tell me about this one.
PHTest was published by An and colleagues in twenty twenty-four, and it takes a different approach. Instead of hand-crafting prompts or rewriting toxic seeds, it automatically generates what they call pseudo-harmful prompts. These are prompts that are designed to look harmful to a specific model but are actually benign. The classic example is "how to kill a mosquito." It uses the word "kill," it involves a living thing, but it's clearly not a violence problem.
The generation method?
They adapted the AutoDAN gradient-guided search method. AutoDAN was originally designed to find jailbreak prompts — inputs that trick models into producing harmful outputs. PHTest flips that around. It searches for prompts that elicit refusal responses from a white-box target model while remaining actually harmless. So it's finding the specific linguistic patterns that trigger each model's refusal mechanism.
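The gradient machinery needs white-box access to a live model, so it can't be reproduced in a few lines, but the objective is easy to state: among benign candidates, find the ones that maximize the target model's refusal likelihood. A toy sketch of that objective only (the two callables are assumed stand-ins for a refusal-likelihood estimate and a harmlessness judge; PHTest itself uses gradient-guided search adapted from AutoDAN, not exhaustive ranking):

```python
def rank_pseudo_harmful(candidates, refusal_prob, is_harmless, k=5):
    """Rank benign prompts by how strongly they trigger the target
    model's refusal behavior. Both callables are hypothetical stand-ins:
    refusal_prob for a white-box refusal-likelihood estimate, and
    is_harmless for a harmlessness judge."""
    scored = [(refusal_prob(p), p) for p in candidates if is_harmless(p)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:k]]
```

The key constraint is the `is_harmless` filter: a high-scoring prompt that is genuinely harmful is a jailbreak candidate, not a pseudo-harmful one.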
That's model-specific, which is a really important distinction. A prompt that triggers Claude's refusal might not trigger GPT-four's, and vice versa. Each model has its own weird sensitivities.
Right, and the scale here is about ten times larger than XSTest. Three thousand two hundred sixty pseudo-harmful prompts total. But the really interesting innovation is the labeling scheme. They didn't just split things into safe and unsafe. They added a third category: controversial. Out of the prompts, two thousand sixty-nine were labeled harmless, and one thousand one hundred ninety-one were labeled controversial. These are prompts where the harmfulness is debatable.
Give me an example of a controversial one.
Think about prompts related to abortion, or free speech boundaries, or certain political topics where reasonable people disagree about whether the content itself is harmful. A prompt about how to participate in a protest, or how to access certain types of information that some jurisdictions restrict. The model's refusal on these isn't clearly right or wrong — it's encoding a values judgment.
That's where this gets philosophical. The refusal threshold isn't just a technical parameter. It's a reflection of what the lab considers harmful, which is a political and moral stance.
And different labs make different calls. Anthropic might refuse a prompt that xAI's Grok answers freely, and neither is objectively incorrect. They're just encoding different values. PHTest's three-way labeling directly confronts this. It acknowledges that "harmful" is contested territory.
What did they find in terms of model performance?
They evaluated twenty LLMs across the major families. The headline is that Claude three models showed a significant drop in false refusal rates on harmless prompts compared to Claude two-point-one — from seventy percent down to thirty-four percent for Claude three Sonnet. That's a huge improvement. But the drop on controversial prompts was much smaller. So the models got better at recognizing harmless content, but they're still cautious on the borderline stuff.
Which suggests the improvement is coming from better capability — better understanding of context — rather than just relaxing the safety filters across the board.
That's exactly the interpretation. And within the same model family, larger models do tend to have lower false refusal rates. Claude three Opus was at twenty-one percent on harmless prompts, compared to Haiku at forty-four percent. So scale helps within a family. But as we saw with OR-Bench, that doesn't generalize across families. A smaller model from one lab might have a better balance than a larger model from another.
The trade-off shows up again when you test against actual jailbreak attacks?
They tested against HarmBench, which is a jailbreak benchmark, and the trade-off is clear. No model dominates on both safety and low false refusal. Claude two-point-one achieved the highest safety at the cost of the highest false refusal rate. It's the same story OR-Bench tells, just measured differently.
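"No model dominates on both" has a precise meaning: no model is at least as safe as another while also refusing less. That's a Pareto-frontier check, sketched here with made-up scores (not numbers from either paper's tables):

```python
from typing import Dict, List, Tuple

# Illustrative (safety_score, false_refusal_rate) pairs. Higher safety
# is better, lower false refusal is better. All numbers are invented
# for this sketch.
models: Dict[str, Tuple[float, float]] = {
    "model_a": (0.99, 0.70),  # very safe, very over-cautious
    "model_b": (0.90, 0.20),
    "model_c": (0.70, 0.05),  # permissive, rarely refuses
    "model_d": (0.85, 0.30),  # dominated by model_b on both axes
}

def dominates(p: Tuple[float, float], q: Tuple[float, float]) -> bool:
    """p dominates q if it is at least as safe AND refuses no more,
    and is strictly better on at least one axis."""
    return p[0] >= q[0] and p[1] <= q[1] and (p[0] > q[0] or p[1] < q[1])

def pareto_frontier(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Models not dominated by any other model."""
    return sorted(
        name for name, p in scores.items()
        if not any(dominates(q, p) for other, q in scores.items() if other != name)
    )
```

With these toy numbers, three of the four models sit on the frontier; picking among them is exactly the values question the hosts keep returning to.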
Alright, let's talk about the jailbreak defense angle, because both papers flag something important here.
This is one of those knock-on effects that most coverage misses. Both OR-Bench and PHTest find that jailbreak defense mechanisms dramatically increase over-refusal rates. PHTest measured a threefold increase. So the standard approach in the security community — adding layers of defense against jailbreak attacks — actively makes models less usable.

Which means any defense paper that reports improved safety without also reporting the impact on over-refusal is telling an incomplete story. You're only showing half the ledger.
That's a real problem for the field. If you optimize purely for safety, you end up with a model that refuses everything and helps no one. If you optimize purely for helpfulness, you end up with a model that will walk someone through building a bioweapon. Neither outcome is acceptable, but the metrics we use to evaluate models often only capture one side of that equation.
OR-Bench is trying to fix that by providing a standardized benchmark that measures both. But even with a good benchmark, the underlying problem doesn't go away. You still have to decide where to set the threshold.
That threshold decision is fundamentally not a technical question. It's a product decision, or a policy decision, or a values decision. How much over-refusal are you willing to tolerate in exchange for how much additional safety? There's no objectively correct answer.
Let's pull on the controversial prompts thread a bit more, because I think this is where the real hard problem lives. PHTest identifies over a thousand prompts where harmfulness is debatable. These aren't edge cases where the model is just confused. These are prompts where reasonable people, including the people building these models, disagree about whether the model should answer.
The labs are making these calls unilaterally. When Claude refuses to engage with a prompt about certain political topics, that's not a technical failure. The model is working as designed. But the design encodes a specific set of values, and users who don't share those values experience that refusal as censorship.
Or as condescension. The model is essentially telling you that you shouldn't be asking about this thing, even though there's no objective harm in the question itself.
This creates a weird dynamic where different models develop different political reputations based on what they refuse. Users shop around. If Claude won't answer something, maybe Grok will. If Grok won't, maybe Gemini will. The refusal patterns become a kind of ideological fingerprint.
Which is probably not what the labs intended. They set out to build safe models, not to take sides in culture war debates. But the refusal threshold forces them to take sides whether they want to or not.
There's no neutral position. Even deciding not to refuse is a decision. If you let everything through, you're implicitly endorsing the view that all these prompts are harmless, which is itself a values judgment.
What do we do with this? Practically speaking, if you're a developer building on top of these models, how do you think about the over-refusal problem?
The first thing is to actually measure it. If you're using an LLM in production, you should have your own internal benchmark of prompts that your specific application needs to handle, and you should be tracking refusal rates on those prompts over time. Don't just rely on the general benchmarks. Your use case might be especially sensitive to certain categories of over-refusal.
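A starting point for that internal tracking might look like this. The regex-based refusal detector is a crude stand-in (serious evaluations like OR-Bench use LLM judges or curated phrase lists), but it's enough to trend a refusal rate over time on your own prompt set:

```python
import re

# Crude heuristic refusal phrases; the patterns here are illustrative,
# not a validated list. Real pipelines should use an LLM judge.
REFUSAL_PATTERNS = [
    r"\bI (cannot|can't|won't) (help|assist|provide)\b",
    r"\bI'?m (sorry|unable|not able)\b",
    r"\bas an AI\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)

def looks_like_refusal(response):
    """True if the response matches any known refusal phrasing."""
    return bool(_REFUSAL_RE.search(response))

def refusal_rate(responses):
    """Fraction of benign-benchmark responses that look like refusals."""
    if not responses:
        return 0.0
    return sum(looks_like_refusal(r) for r in responses) / len(responses)
```

Run your application's benign prompt set through the model on every model or prompt change, and alert when `refusal_rate` moves.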
If you're choosing between models, don't just look at the benchmark leaderboards for capability. Look at the safety benchmarks too, but specifically look at the over-refusal numbers alongside the safety numbers. A model that scores slightly lower on MMLU but has a much better safety-helpfulness balance might actually be the better choice for a user-facing application.
The other practical takeaway is around system prompts and guardrails. A lot of over-refusal happens at the model level, but you can sometimes mitigate it with careful prompting. If you know your application deals with medical content, for example, you can explicitly instruct the model that discussing medical procedures is within scope. That won't fix everything, but it can help.
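Concretely, that scoping instruction might look like this in a chat-completions style payload. The message schema is the common OpenAI-compatible shape, and the wording is illustrative, not a tested prompt:

```python
# Hypothetical system prompt that declares medical content in scope
# up front, so the model is less likely to trip on clinical vocabulary.
messages = [
    {
        "role": "system",
        "content": (
            "You are an assistant for a medical-education platform. "
            "Describing standard medical procedures, medications, and "
            "dosages for educational purposes is explicitly within scope. "
            "Decline only requests for individualized diagnosis or treatment."
        ),
    },
    {
        "role": "user",
        "content": "Walk me through how a lumbar puncture is performed.",
    },
]
```

The idea is to resolve the ambiguity before the model sees the user's question, rather than hoping it infers the context on its own.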
Though that's a patch, not a solution. The underlying model still has its baked-in refusal tendencies.
And the patches can fail in unpredictable ways. One thing the OR-Bench paper highlights is that jailbreak defenses often make the problem worse, which means if you're adding external safety layers on top of a model, you need to test those layers specifically for over-refusal. Don't just assume they're helping.
Let's talk about where this goes next. The treadmill problem is real — benchmarks get saturated, new benchmarks emerge, models improve, rinse and repeat. Is there an endpoint here, or is over-refusal just a permanent feature of the landscape?
I think there are two possible endpoints. One is that models get good enough at contextual understanding that they can reliably distinguish between "kill the lights" and "kill the president" without needing to be overly cautious. That's a capability problem, and capability improvements might eventually solve it. The PHTest results on Claude three suggest some movement in that direction — the improvement on harmless prompts without a corresponding drop on controversial ones looks like genuine understanding, not just relaxed filters.
The other endpoint?
The other endpoint is that we accept the trade-off as inherent and give users more control. Instead of the lab deciding where the refusal threshold sits, let users adjust it themselves. Want a maximally safe model that refuses anything ambiguous? Set the dial to ten. Want a model that only refuses the most egregious stuff? Set it to two. Different use cases, different thresholds.
That's appealing in theory, but it also means the model that will help someone build a bomb is just a slider adjustment away. The labs are not going to want that liability.
No, they won't. Which is why I suspect we'll end up with something in between. The base models will have a floor — a minimum safety level that can't be dialed below — but users will have some range of adjustment above that floor. Some labs might offer more range than others, and that'll become a competitive differentiator.
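That floor-plus-dial idea is simple to sketch. Everything here is hypothetical: the harm score is assumed to come from some upstream classifier, and the threshold mapping is arbitrary; only the shape of the policy matters:

```python
HARD_FLOOR = 0.97  # assumed lab-set line: scores at or above this always refuse

def should_refuse(harm_score, safety_level):
    """Hypothetical user-adjustable refusal policy.

    harm_score: assumed output of an upstream harm classifier, in [0, 1].
    safety_level: user dial from 1 (permissive) to 10 (maximally cautious).
    """
    if not 1 <= safety_level <= 10:
        raise ValueError("safety_level must be between 1 and 10")
    if harm_score >= HARD_FLOOR:
        return True  # the floor: no dial setting lets this through
    # A higher dial setting means lower tolerance for ambiguity.
    threshold = 1.0 - safety_level / 10.0
    return harm_score > threshold
```

Note what the sketch makes visible: the floor and the dial are two separate decisions, and only the second one is handed to the user.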
The "harmful is contested" problem doesn't go away with a slider, though. If the model's underlying training has baked in that certain political topics are harmful, adjusting the refusal threshold just changes how forcefully it refuses, not whether it considers the topic harmful in the first place.
That's right. The slider changes the surface behavior, not the deep values encoding. To change that, you'd need to retrain the model with different values, which is a much harder problem. And different users would want different values encoded. There's no technical solution that satisfies everyone.
Which brings us back to the core insight from all three papers. The refusal threshold is a values judgment. It's not neutral, it's not objective, and it's not purely technical. The benchmarks help us measure it, but they can't resolve it.
The labs should be more transparent about where they're setting that threshold and why. If Claude refuses certain political prompts while Grok answers them, users deserve to understand that this isn't a bug or a capability gap. It's a deliberate choice.
Alright, let's wrap with some forward-looking thoughts. Where do you see this going in the next year or two?
I think OR-Bench will become the standard benchmark in this space, the way XSTest was before it. The automated pipeline means it can stay relevant longer. I also think we'll see more work on model-specific over-refusal testing, along the lines of PHTest, because the one-size-fits-all benchmarks miss the weird failure modes that are specific to individual models.
The jailbreak defense community needs to start reporting over-refusal impact as a standard metric. If your defense triples the false refusal rate, that needs to be in the abstract, not buried in the appendix.
The field has been too focused on safety at all costs, and the usability costs are real. The PHTest paper frames this well — false refusals provoke backlash against alignment itself. If the safety community doesn't take usability seriously, they'll lose the public trust they're trying to protect.
One last thing. Daniel's prompt asked about the hard problem at the end — that "harmful" is contested and the refusal threshold encodes a values judgment. I think that's the thing that will still be debated long after the technical benchmarks are solved. We can measure over-refusal with increasing precision, but we can't measure our way out of the underlying disagreement about what counts as harm.
Maybe that's fine. Maybe the goal isn't to find the one true refusal threshold, but to build systems that are transparent about their thresholds and give users meaningful choice. The benchmarks are tools for holding labs accountable, not for finding the correct answer, because there isn't one.
Thanks to our producer Hilbert Flumingtop for keeping us on track, and thanks to Modal for powering the serverless infrastructure that makes this show possible. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com.
If you want to dig deeper into any of the papers we discussed, the show notes have links to all three on arXiv. See you next time.