Daniel sent us this one — he wants us to walk through the benchmarks that actually measure AI's American and WEIRD-default bias, strictly the evaluation methodologies, not the phenomenon itself. He's flagged five: CulturalBench, BLEnD, WorldValuesBench, GlobalOpinionQA, and the twenty twenty-five WorldView-Bench. The core question is what each one actually probes, how they handle the hard problem of ground truth when culture itself is contested, and what their methodologies reveal about where these systems actually fail. There's a lot to unpack here.
Before we dive in — fun fact, DeepSeek V four Pro is writing our script today. Which feels appropriate for an episode about cultural benchmarks, given that we're about to discuss how models from different regions handle cultural knowledge.
That's either deeply fitting or deeply ironic. We'll find out by the end. Alright, walk me through CulturalBench first. What's the methodology here?
CulturalBench is from ACL twenty twenty-five, and it's genuinely one of the more rigorous approaches I've seen. They built one thousand six hundred ninety-six human-written and human-verified questions covering forty-five global regions — and I want to emphasize, that includes places most benchmarks ignore, like Bangladesh, Zimbabwe, Peru. Seventeen topics, everything from food preferences to greeting etiquette. But the methodology is what makes it interesting. They used something called Human-AI CulturalTeaming.
Which sounds like a corporate retreat exercise, but I'm guessing it's more structured than that.
It's a three-step pipeline. Step one: human annotators from the target region brainstorm cultural scenarios from personal experience. An AI helper bot transforms those into structured four-option multiple-choice questions, and the platform offers revision strategies — things like "negate the question" to generate harder distractors. Step two: a human quality check where annotators can select multiple answers or flag "no correct option" or "no knowledge." Step three: majority-vote filtering with a threshold of at least four out of five annotators agreeing.
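If it helps to see step three as code, here's a minimal Python sketch of the majority-vote filter. The field names are hypothetical; the four-out-of-five threshold is the one from the paper:

```python
def majority_vote_filter(questions, threshold=4):
    """Keep only questions where at least `threshold` of the five
    annotators converged on the same answer option."""
    kept = []
    for q in questions:
        votes = q["annotator_answers"]  # hypothetical field, e.g. ["A","A","A","A","B"]
        top_count = max(votes.count(v) for v in set(votes))
        if top_count >= threshold:
            kept.append(q)
    return kept

qs = [
    {"id": 1, "annotator_answers": ["A", "A", "A", "A", "B"]},  # 4/5 agree: kept
    {"id": 2, "annotator_answers": ["A", "B", "C", "A", "D"]},  # 2/5 agree: dropped
]
print([q["id"] for q in majority_vote_filter(qs)])  # [1]
```

Lowering the threshold is exactly the knob that trades rigor for coverage, which is the contrast with BLEnD's looser agreement later in the episode.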
They're not just scraping trivia from Wikipedia. These are questions built from lived experience, verified by people who actually live there.
And they recruited annotators through Prolific, with strict criteria — nationality plus pre-eighteen residence in the target region. You had to have grown up there. The human baseline on this benchmark is ninety-two point four percent. So these aren't trick questions for people from the culture. They're genuine cultural knowledge.
Where do the frontier models land?
On the hard version — I'll explain that in a second — they range from twenty-eight point seven percent for Aya-eight-b up to sixty-one point five percent for GPT-four-o. The hard version is worth understanding, because it's a clever methodological choice. They take each multiple-choice question and convert it into four binary true-false questions, giving six thousand seven hundred eighty-four binary items total. The reason they did this is that they found models were cheating.
By using surface-level heuristics. If you just pick the option whose embedding is most similar to the culture name, you can get forty percent accuracy without actually reading the question. The hard version forces the model to evaluate each option independently. It can't just pattern-match. And that's where you see the real performance gap.
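The conversion itself is mechanical, which is part of its appeal: each four-option question expands into four independent true/false items, which is how one thousand six hundred ninety-six questions become six thousand seven hundred eighty-four. A sketch, with made-up field names and a made-up example question:

```python
def to_binary_items(mcq):
    """Expand one 4-option multiple-choice question into four
    independent true/false items, one per answer option."""
    return [
        {
            "question": mcq["question"],
            "statement": option,
            "label": option == mcq["answer"],  # True only for the correct option
        }
        for option in mcq["options"]
    ]

mcq = {
    "question": "What utensil is most commonly used for noodles in China?",
    "options": ["Chopsticks", "Fork", "Knife", "Hands"],
    "answer": "Chopsticks",
}
items = to_binary_items(mcq)
print(len(items))                       # 4
print(sum(i["label"] for i in items))   # 1, exactly one true statement
```

Because each item is judged in isolation, a model can no longer win by ranking options against each other on embedding similarity; it has to commit to a truth value per statement.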
Forty percent without reading the question. That's a pretty damning indictment of how shallow these systems' cultural knowledge actually is. They're coasting on statistical associations between words.
And here's another finding that surprised me. GPT-four-o — a US-based model — outperforms Mistral on European culture and DeepSeek on Chinese culture. You'd think a French model would know European culture better, or a Chinese model would know Chinese culture better. But the paper speculates this comes down to training data scale. More data beats regional focus, at least for cultural trivia.
Which is its own kind of bias, isn't it? The model with the biggest scrape of the English-language internet wins, even on questions about cultures that aren't Anglophone. That's not a victory for cultural understanding. That's a victory for data colonialism.
That's a strong way to put it, but I don't think you're wrong. And there's one more finding from CulturalBench that I think is the most revealing. They identified something called the multi-mode question problem. Some cultural questions have multiple valid answers. Their example: what utensils do Chinese people usually use? Chopsticks is the most common answer, but spoons are also valid — for soup, for example. Models show a twenty-eight point seven percent accuracy drop on these questions. They have what the paper calls an "answer convergence bias." They overfit to a single answer and can't handle ambiguity.
Which gets at something deeper. Culture isn't a multiple-choice test. There are norms, but there are also variations, subcultures, context-dependent exceptions. A benchmark that only tests for the modal answer is going to miss that.
That's the tension that runs through all five of these benchmarks. Let me move to BLEnD, because it tackles this from a different angle. BLEnD is from NeurIPS twenty twenty-four, and it probes everyday cultural knowledge across sixteen countries and thirteen languages. Fifty-two thousand six hundred question-answer pairs — fifteen thousand short-answer, thirty-seven thousand six hundred multiple-choice. Six categories: food, sports, family, education, holidays and leisure, and work-life.
The languages include low-resource ones, right?
Amharic, Assamese, Azerbaijani, Hausa, Sundanese. Languages that most multilingual benchmarks completely ignore. The construction methodology is interesting too. They had native annotators from each region generate five hundred question templates, then stripped out country-specific proper nouns so the same template works across all regions. Then they collected answers from five annotators per question per region, allowing up to three answers per person. Invalid answers were removed by one or two additional annotators per country.
Similar to CulturalBench in using annotator agreement as ground truth. What was their inter-annotator agreement?
Average of three point one six out of five. That's sixty-three point two percent. Which is notably lower than CulturalBench's threshold of four out of five. And I think that lower agreement is itself a finding — everyday cultural knowledge is variable, even within a region.
If you ask five people from the same country what a typical breakfast looks like, you're going to get different answers. Some people skip breakfast. Some people eat the same thing every day. Some people eat different things on weekends. The "ground truth" is a distribution, not a point.
And that's the methodological hard problem we keep circling. But let me give you the numbers on the language disparity, because this is where BLEnD gets really concrete. On short-answer questions, average LLM performance in the US, in English: seventy-nine point two two percent. Spain, in Spanish: sixty-nine point zero eight percent. Iran, in Persian: fifty point seven eight percent. North Korea, in Korean: forty-one point nine two percent. Northern Nigeria, in Hausa: twenty-one point one eight percent. Ethiopia, in Amharic: twelve point one eight percent.
That's not a gap. That's a chasm. GPT-four shows a fifty-seven point three four percentage point spread between its best and worst performing cultures.
And here's the paradox that I think is the most important methodological finding from BLEnD. For mid-to-high-resource languages, LLMs perform better when you prompt them in the local language. Spanish questions get better answers in Spanish. But for low-resource languages, the opposite is true. They perform better when you ask in English than when you ask in the local language. Asking about Ethiopian culture in Amharic yields worse answers than asking in English.
That's completely backwards. If you're building culturally aware AI, you'd want the opposite. You'd want the system to be better in the local language, because that's how people actually engage with their own culture.
What this suggests is that current multilingual capabilities are actually a liability for cultural representation. The model has seen enough Amharic text to generate something, but not enough to encode genuine cultural knowledge. So it produces worse answers than if you just asked in English, where at least the training data has more cultural content — even if it's filtered through an Anglophone lens.
The multilingual training is giving the appearance of capability without the substance. It's a Potemkin village of linguistic diversity.
BLEnD also found something about cultural proximity. Countries with shared cultural backgrounds show higher answer overlap — Indonesia and West Java, the US and the UK, Spain and Mexico. The lowest overlap was Northern Nigeria with Greece, Ethiopia, and South Korea. Which is intuitive, but it validates that the benchmark is actually measuring something real about cultural similarity.
Alright, let me shift to the third one. This one's different — it's not about factual knowledge. It's about predicting human values.
WorldValuesBench is from LREC-COLING twenty twenty-four, and it's derived from the World Values Survey Wave Seven. That's ninety-four thousand seven hundred twenty-eight participants across sixty-four countries, surveyed between twenty seventeen and twenty twenty-two. The task is: given demographic attributes and a value question, can the model predict how a human from that demographic would answer?
It's not asking the model what it believes. It's asking the model to model human belief distributions.
And the dataset is enormous — over twenty million examples of the form "demographic attributes, value question, answer." Two hundred thirty-nine ordinal-scale value questions, plus forty-two demographic questions. Split seventy-fifteen-fifteen into train, validation, and test.
What's the evaluation metric here? Because you can't just use accuracy for an ordinal scale.
They use Wasserstein one-distance — also called earth mover's distance — between the model's answer distribution and the human answer distribution for a given demographic group. Answers are normalized to the zero-to-one range. The reason they chose this over something like KL divergence is that it respects the ordinal nature of Likert-scale answers. Predicting a two when the human answer is one is a smaller error than predicting a ten. Wasserstein distance captures that.
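Here's a pure-Python sketch of that metric, with made-up distributions. The distance is the area between the two CDFs, which is exactly what makes a near-miss on an ordinal scale cheap and a far miss expensive:

```python
def wasserstein_1(p, q, positions):
    """Wasserstein-1 (earth mover's) distance between two discrete
    distributions p and q over the same ordered support `positions`.
    Equals the area between their cumulative distribution functions."""
    cdf_p, cdf_q, dist = 0.0, 0.0, 0.0
    for i in range(len(positions) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        # mass that must move across the gap between adjacent answer positions
        dist += abs(cdf_p - cdf_q) * (positions[i + 1] - positions[i])
    return dist

# A 1-to-10 Likert scale normalized to [0, 1], as the paper describes
positions = [(k - 1) / 9 for k in range(1, 11)]

# Hypothetical human vs. model answer distributions for one demographic group
human = [0.05, 0.10, 0.20, 0.30, 0.15, 0.10, 0.05, 0.03, 0.01, 0.01]
model = [0.00, 0.00, 0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.04, 0.01]

print(round(wasserstein_1(human, model, positions), 3))  # 0.203
```

Note the model distribution here is the human one shifted up by roughly two Likert steps; KL divergence would punish that as harshly as a shift of five steps, while Wasserstein scales with the size of the shift.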
That's a thoughtful choice. And what did they find?
On their probe set, which crosses thirty-six value questions with three demographic variables, continent, residential area, and education level, for eight thousand two hundred eighty examples in total, they report the fraction of cases where a model's answer distribution lands within zero point two Wasserstein distance of the human one: Alpaca-seven-b manages that eleven point one percent of the time, Vicuna-seven-b twenty-five percent, Mixtral-eight-x-seven-b seventy-two point two percent, and GPT-three point five Turbo seventy-five percent.
The smaller open models are basically useless at this. Eleven percent is worse than random guessing, presumably.
Actually, Alpaca and Vicuna perform worse than even a uniform distribution baseline. They're actively bad at this. And there's another finding: GPT-three point five and Mixtral benefit from being given demographic attributes. Their predictions improve when you tell them "this person is from South America, urban area, university educated." But Alpaca and Vicuna get worse when you add demographics. They can't effectively condition on those variables.
That's a meaningful capability threshold. If you can't condition on demographics, you can't do any kind of culturally aware reasoning. You're just regurgitating a global average, which is going to be heavily skewed toward whatever dominates your training data.
Which brings us to GlobalOpinionQA. This is from Anthropic, twenty twenty-three, and it's built from cross-national surveys — primarily Pew Global Attitudes surveys. The methodology is different again. Instead of testing factual knowledge or value prediction, it measures whose opinions LLM-generated responses align with on global societal issues.
It's an alignment measure, not a knowledge measure.
They define a similarity metric that quantifies how close LLM-generated survey responses are to human responses, conditioned on country. They ran three experiments. One: default LLM responses, no prompting about country. Two: prompting the model to consider a specific country's perspective. Three: translating questions to a target language.
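The paper defines its own similarity score between the model's answer distribution and each country's. As a hedged sketch (a standard construction for this kind of comparison, not necessarily Anthropic's exact formula), one option is one minus the Jensen-Shannon distance:

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of JS divergence, log base 2)
    between two answer distributions over the same survey options."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def similarity(model_dist, country_dist):
    # 1.0 means identical answer distributions, 0.0 means fully disjoint
    return 1 - jensen_shannon_distance(model_dist, country_dist)

# Hypothetical per-country answer distributions for one survey question
us      = [0.60, 0.25, 0.10, 0.05]
nigeria = [0.10, 0.20, 0.30, 0.40]
model   = [0.55, 0.30, 0.10, 0.05]

print(similarity(model, us) > similarity(model, nigeria))  # True
```

Ranking countries by this score per question, then aggregating, is what lets the paper say the default responses sit closest to the USA and parts of Europe and South America.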
The headline finding?
By default, LLM responses are most similar to opinions from the USA, some European countries, and South American countries. That's the WEIRD default, empirically measured rather than just asserted. When you prompt the model to consider a specific country's perspective, responses do shift toward that population's views — but the paper explicitly notes this can "reflect harmful cultural stereotypes." The model doesn't necessarily give you accurate representation. It gives you stereotyped representation.
The language translation experiment?
Translating questions to a target language does not guarantee alignment with speakers of those languages. That's the third finding, and it echoes what BLEnD found. Language and culture are not the same thing. You can't fix cultural bias by just translating the prompt.
That's a crucial point that a lot of the "just use more languages" discourse misses. Language is a vehicle for culture, but it's not a proxy for it. A question asked in Hindi is not automatically answered from an Indian cultural perspective.
GlobalOpinionQA has a sibling benchmark called OpinionQA, which is US-only — fourteen hundred ninety-eight multiple-choice questions from Pew's American Trends Panel, about ninety-one thousand question-demographic pairs. GlobalOpinionQA extends that methodology internationally. The key distinction is that OpinionQA measures alignment with US demographic subgroups, while GlobalOpinionQA measures alignment with national populations.
Alright, let's get to the fifth one. This one, from what I've read, is philosophically different from the other four.
WorldView-Bench is from JAIR twenty twenty-five, and it doesn't ask whether the model knows the right answer. It asks whether the model acknowledges multiple valid worldviews. The framework is built on something called multiplexity theory — the distinction between open civilizations that engage with plurality and closed ones that assimilate or marginalize alternative viewpoints.
Instead of measuring accuracy against a ground truth, it's measuring inclusivity of perspectives.
The dataset is one hundred seventy-five synthetically generated and human-validated questions across seven domains — ethical and moral, religious, lifestyle, cultural norms, traditions, history, and technology. Twenty-five questions each. And crucially, it's designed for free-form generative responses. No predefined answer categories, no multiple choice. They argue that closed-form benchmarks inherently cannot measure cultural inclusivity because they reduce it to a set of predefined categories.
Which is a legitimate critique. If you're testing whether a model can select the correct multiple-choice answer about Chinese dining etiquette, you're not testing whether it can engage with the diversity of Chinese dining practices. You're testing whether it knows the modal answer.
Their evaluation pipeline is completely different. Step one: zero-shot classification extracts cultural references from the LLM's response. Step two: they compute something called the Perspectives Distribution Score, or PDS, along with PDS entropy; together these quantify the proportional representation and diversity of cultural viewpoints. Higher entropy means less cultural polarization — the model is drawing from more diverse perspectives. Step three: cultural sentiment analysis to detect implicit biases.
What does high entropy look like in practice?
A response that references multiple cultural traditions, acknowledges different viewpoints, and doesn't privilege one as the default. Low entropy — what they call a "uniplex" response — is one that homogenizes diverse cultures into a single dominant narrative. The baseline PDS entropy for standard LLMs is thirteen percent. Strong uniplex bias.
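As a rough reconstruction (not the paper's exact formula), a score like this can be computed as normalized Shannon entropy over the cultural references extracted from a response:

```python
import math
from collections import Counter

def pds_entropy(cultural_refs):
    """Normalized Shannon entropy of the distribution of cultural
    references in a response. 1.0 means perfectly balanced across
    perspectives; near 0 means one culture dominates (uniplex)."""
    counts = Counter(cultural_refs)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    # guard: a single-culture response has zero entropy by definition
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / h_max

# A uniplex response: nine of ten references point to one tradition
print(round(pds_entropy(["Western"] * 9 + ["Confucian"]), 2))  # 0.47

# A multiplex response: balanced references across four traditions
print(pds_entropy(["Western", "Confucian", "Islamic", "Ubuntu"]))  # 1.0
```

The reference labels here are invented for illustration; in the paper, step one's zero-shot classifier is what produces them from free-form text.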
That's almost total homogenization.
Here's where it gets interesting. They tested interventions. When they used contextually-implemented multiplex LLMs — system prompts embedding multiplexity principles — entropy doubled to twenty-six percent. Still not great. But then they tried a multi-agent system, where agents representing distinct cultures collaborate on the response. That achieved ninety-four percent entropy, with sixty-seven point seven percent positive sentiment.
Ninety-four percent. So the architecture matters enormously. A single model, even prompted to be inclusive, can't get past twenty-six percent. But structuring the system as a conversation between culturally distinct agents gets you to ninety-four.
Which suggests that the problem isn't just training data or prompting. It's architectural. A single model, by design, converges toward a single output distribution. It doesn't naturally represent pluralism. You have to build pluralism into the system architecture.
Alright, let me pull back and look at the methodological landscape across all five. There's a fundamental tension here, and I think it's worth naming directly. CulturalBench, BLEnD, WorldValuesBench, and GlobalOpinionQA all define correctness against human data — annotator agreement, survey responses, majority vote. WorldView-Bench explicitly rejects that approach and measures diversity of perspectives instead. Are these two families measuring the same thing?
I don't think they are. And I think that's fine. They're measuring different things that are both important. The first family is measuring cultural knowledge and alignment — does the model know what people in a given culture actually think, do, or believe? The second family is measuring cultural inclusivity — does the model acknowledge that multiple valid perspectives exist? You can imagine a model that scores perfectly on CulturalBench but is still uniplex. It knows the right answer for every culture, but it presents each culture as a monolithic block with no internal diversity.
You can imagine the opposite failure mode. A model that's beautifully multiplex — it acknowledges diverse perspectives on everything — but gets the basic facts wrong about what people actually believe. It's inclusive but inaccurate.
Which is why both families are needed. But there's a deeper methodological problem that all five benchmarks grapple with, and none fully solve. How do you establish ground truth when culture itself is contested?
This is what I keep coming back to. CulturalBench uses majority vote — four out of five annotators. But the multi-mode question problem shows the limitation. If chopsticks is the modal answer but spoons are also valid, majority vote flattens the distribution. BLEnD allows multiple answers and reports only sixty-three percent inter-annotator agreement, which is more honest about the fuzziness. WorldValuesBench treats survey responses as ground truth, but those are samples from a specific time period — values change, and survey methodology has its own biases.
GlobalOpinionQA explicitly warns that prompting for a country's perspective can trigger stereotypes. So even when you try to condition on culture, you might be getting a caricature rather than an accurate representation. The ground truth isn't just hard to establish — in some cases, the act of trying to establish it can produce harmful outputs.
WorldView-Bench sidesteps this entirely by not having a ground truth. It doesn't ask "is this response correct?" It asks "does this response include multiple perspectives?" But that's a different kind of limitation. You can't use WorldView-Bench to tell whether a model actually understands Ethiopian dining etiquette. You can only tell whether it acknowledges that Ethiopian dining practices are diverse.
I think the honest answer is that there is no single ground truth for cultural knowledge. Culture is a distribution, not a point. The best we can do is measure different aspects of the distribution — the mode, the variance, the breadth of perspectives acknowledged. No single benchmark captures all of that.
Let me throw another wrinkle in. The CulturalBench finding that GPT-four-o outperforms regional models on their own regions — that suggests training data scale matters more than anything else for these knowledge-based benchmarks. But does that hold for the values-based ones?
WorldValuesBench shows that model scale matters enormously — GPT-three point five and Mixtral do reasonably well, while Alpaca and Vicuna are worse than random. But we don't have a direct regional comparison for values prediction the way CulturalBench does for knowledge. It would be fascinating to see if a model trained primarily on Chinese data is better at predicting Chinese values on the World Values Survey.
The BLEnD language paradox complicates this further. If low-resource language performance is worse than English performance even for questions about those cultures, then simply training on more data from those regions might not help if the model can't effectively process the language.
It suggests a two-stage problem. First, you need sufficient representation in training data. Second, you need the model to actually encode that representation in a way that's accessible through the local language. Right now, we're failing at both stages for low-resource cultures, but the language encoding problem is the one that's less discussed.
What about the practical question? If someone is building an AI system and wants to evaluate its cultural bias, which of these benchmarks should they use?
It depends on what they're building. If they're building a factual Q-and-A system, CulturalBench and BLEnD are the obvious choices. CulturalBench is more rigorous on the annotator verification side, but BLEnD covers more languages and everyday scenarios. If they're building something that makes recommendations or predictions about people, WorldValuesBench is relevant because it tests whether the system can model human value distributions. If they're building a system that generates text about cultural topics, WorldView-Bench's multiplexity framework is the one to watch.
GlobalOpinionQA is the one I'd use if I wanted to check whether my system defaults to American opinions on global issues. Which, given what we know about training data, it probably does.
And I think the multi-agent finding from WorldView-Bench is the most actionable result across all five papers. If you want cultural inclusivity in generated text, don't just write a better system prompt. Structure the system as a conversation between multiple culturally distinct perspectives. That's an architectural insight, not just a prompting trick.
Ninety-four percent entropy versus twenty-six percent. That's not an incremental improvement. That's a phase change.
It makes intuitive sense. A single model trying to be "inclusive" is still one model, with one set of weights, generating one output distribution. It can't help but converge. But multiple models, each representing a different cultural vantage point, can produce pluralistic output. The inclusivity emerges from the interaction, not from any single model's training.
Which is a nice metaphor for how cultural understanding actually works in the real world. You don't get it from one person trying really hard to be open-minded. You get it from actually talking to people with different experiences.
And that's something none of the other four benchmarks even attempt to measure. They're all evaluating a single model in isolation. WorldView-Bench is the only one that tests whether the system architecture itself can support pluralism.
Alright, let me try to synthesize this for listeners who might be trying to make sense of this landscape. If you care about cultural bias in AI, there are really three distinct things you might want to measure. One: does the model know cultural facts correctly, and for all cultures, not just WEIRD ones? CulturalBench and BLEnD answer that. Two: does the model align with the actual values and opinions of people from different cultures? WorldValuesBench and GlobalOpinionQA answer that. Three: does the model acknowledge cultural diversity rather than homogenizing everything into a single narrative? WorldView-Bench answers that.
That's a clean framework. And I'd add that no single benchmark is sufficient. You need at least one from each category. A model that's factually accurate but uniplex is a problem. A model that's beautifully multiplex but factually wrong is a problem. And a model that's accurate and inclusive but aligns only with American values is also a problem.
The other thing I'd flag for listeners is the ground truth problem. Every one of these benchmarks makes a choice about what counts as correct, and that choice has consequences. Majority vote flattens diversity. Survey data captures a moment in time. Stereotype-aware prompting can backfire. And refusing to define correctness at all, like WorldView-Bench does, means you can't measure accuracy. There's no free lunch here.
That's not a failure of these benchmarks. It's a reflection of the fact that culture is contested, fluid, and plural. Any measurement methodology is going to be an approximation. The question is whether the approximation is useful for the thing you're trying to evaluate.
One last thing I want to pull out. The CulturalBench finding about surface-level heuristics — forty percent accuracy just from embedding similarity, without reading the question — that's a warning for anyone building or using these benchmarks. Models are very good at finding shortcuts. If your benchmark can be gamed by pattern matching on culture names, it's not measuring what you think it's measuring.
That's why CulturalBench's hard version matters so much. Converting to binary true-false questions forces the model to actually evaluate each option. It's a methodological innovation that other benchmark designers should pay attention to. If your benchmark has a multiple-choice format, you should check whether models can cheat on it.
BLEnD's short-answer format serves a similar purpose. You can't pattern-match your way to a correct free-text answer about what a typical Nigerian breakfast looks like. You either know it or you don't.
Though short-answer evaluation has its own problems — how do you automatically score "egusi soup with pounded yam" against a reference answer that says "akara and ogi"? You need either human evaluation or very sophisticated semantic matching. There's a reason most benchmarks use multiple choice.
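To make that scoring problem concrete, here's the naive baseline, normalized exact match, and the failure mode that pushes you toward human evaluation or semantic matching. The reference answers are hypothetical:

```python
def normalize(ans):
    """Lowercase and collapse whitespace so trivial variation doesn't matter."""
    return " ".join(ans.lower().split())

def exact_match(prediction, references):
    """Naive short-answer scoring: credit only on a normalized exact match."""
    return normalize(prediction) in {normalize(r) for r in references}

refs = ["akara and ogi", "bread and tea", "egusi soup with pounded yam"]

print(exact_match("Akara and  ogi", refs))       # True: survives casing/spacing
print(exact_match("bean cakes with pap", refs))  # False: same dish, different words
```

"Bean cakes with pap" and "akara and ogi" describe the same breakfast, and exact match gives zero credit. That gap is precisely why BLEnD-style free-text scoring is harder to automate than multiple choice.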
Trade-offs everywhere. Alright, I think we've covered the methodological landscape. If listeners want to dig into any of these papers, they're all on arXiv — we'll make sure the links are in the show notes. Thanks as always to our producer Hilbert Flumingtop for keeping us on track, and to Modal for powering the pipeline that makes this show possible.
This has been My Weird Prompts. You can find every episode at myweirdprompts dot com, or search for us on Spotify. We'll be back next time.
Until then, maybe don't trust a language model to tell you how to eat your breakfast.