Daniel sent us this one — he's been thinking about what we touched on with Emet Schneiderman's work using rhesus monkeys for jaw development research, and it opened up a bigger question. When we picture animal testing, we picture lab mice. White ones, pink eyes, the whole cliché. But the actual landscape is a sprawling zoo — different species chosen because each one models a specific slice of human physiology. And his second question might be even more interesting: what kind of expertise does a researcher develop when they spend years working with one species? Do they become, in effect, a part-time primatologist? Do they call in specialists when they hit knowledge gaps? There's a whole hidden infrastructure here, and I want to dig into it.
The timing on this is perfect, actually. We're at this weird inflection point where AI-driven drug discovery is spitting out candidate molecules faster than ever, so the bottleneck is shifting hard toward validation. And validation means animal models. So the question of which species you pick, and why, and whether the researcher actually understands that species deeply enough to interpret the data: that's becoming the thing that determines whether a drug makes it to humans or dies in preclinical.
The bottleneck moved from discovery to translation.
And translation is where species choice becomes everything. Here's the core tension: every animal study exists to avoid harming humans, but every animal is an imperfect model of a human. The question isn't which species is perfect — none of them are. The question is which imperfections you can tolerate for the specific thing you're testing.
Which imperfections you can live with.
That's the whole game. And the term for this is model validity — the degree to which a non-human animal predicts the human response. And here's what most people don't realize: model validity isn't one number. It varies wildly depending on what you're testing. A mouse might be a terrible model for cardiac arrhythmia but an excellent model for a certain type of leukemia. Same species, totally different validity scores.
You can't even say mice are a good model. You have to say mice are a good model for what.
And that "for what" part is where the expertise lives. So I want to walk through this in three chunks. First, the actual zoo: which animals are used for which therapeutic areas and why. Second, the human expertise layer: how researchers become specialists in their species. And third, the hidden infrastructure of consultants, databases, and the regulatory framework that ties it all together.
Let's start by mapping out the animal kingdom of drug testing. Because the species you choose determines everything about the data you get.
Let's blow up the misconception first. The public imagines lab mice as the universal default. And yes, mice and rats dominate the raw numbers — they're something like ninety to ninety-five percent of all research animals. But that's heavily skewed toward basic science and early discovery. When you get to the regulatory toxicology studies that actually determine whether a drug enters human trials, the species diversity explodes.
Because the FDA wants more than a mouse.
The FDA typically requires data from two species — one rodent and one non-rodent — before they'll let you dose a human. And the choice of that non-rodent species is where things get fascinating. For cardiovascular drugs, the gold standard is dogs. Canine cardiac electrophysiology — the way electrical signals move through the heart — mirrors humans almost uncannily. The ion channels that control the heartbeat, particularly the hERG potassium channel, behave very similarly in dogs and humans. If a drug is going to cause a fatal arrhythmia, a dog study will probably catch it.
A mouse won't.
A mouse heart beats about five to six hundred times a minute. A human heart beats sixty to a hundred. The electrophysiology is just fundamentally different. You can study a mouse heart and learn a lot about mouse hearts, but you cannot reliably predict human QT prolongation from a mouse ECG. So if you're developing a new antiarrhythmic or even just screening a candidate for cardiac liability, you need dogs or sometimes mini-pigs.
Which brings us to pigs. What are pigs the gold standard for?
Dermal toxicology and wound healing. Pig skin has an epidermis that's similar in thickness to human skin — about seventy to a hundred micrometers versus twenty to thirty in mice. The hair follicle density, the collagen structure, the healing patterns — pigs heal more like humans than any other common lab species. If you're developing a burn dressing or a topical drug, you test it on pigs. Rodent skin heals primarily by contraction — the wound pulls together. Human and pig skin heal by re-epithelialization — new tissue grows across the wound. Totally different mechanism.
If you tested a burn dressing on a mouse, you'd be testing the wrong healing process entirely.
You'd get data. It would just be data about mouse wound contraction, which doesn't tell you what you need to know. And that's the pattern that repeats across every therapeutic area. The species isn't just a stand-in for a human — it's a stand-in for a specific human system, and only if you pick the right one.
Walk me through the other major players. We've got dogs for cardiac, pigs for skin. What about the brain?
Non-human primates for neuroscience. Macaques — rhesus and cynomolgus — have prefrontal cortex organization that's close enough to humans to study cognition, memory, and neurodegenerative disease. Their immune systems also track human immune responses more faithfully than rodents. For infectious disease research, especially respiratory viruses, ferrets are the unsung heroes. Ferret lung anatomy and the distribution of ACE2 receptors — the same receptors SARS-CoV-2 uses to enter cells — closely mirrors humans. That's why ferrets were central to COVID research.
Zebrafish are the high-throughput screening workhorse. Their embryos are transparent, so you can literally watch organ development in real time under a microscope. You can test thousands of compounds for developmental toxicity without dissecting a single animal — you just look through the embryo. They also regenerate heart tissue, which makes them valuable for cardiac regeneration research. But they're fish. Their predictive value for human cardiac electrophysiology is essentially zero. Different tool for a different question.
You mentioned guinea pigs earlier, off air. What's their niche?
Allergy and asthma. Guinea pigs develop anaphylaxis in a way that closely resembles human anaphylaxis — bronchoconstriction, hypotension, the whole cascade. Mice can model allergic responses, but the guinea pig airway response is mechanistically closer to human asthma. If you're testing an inhaler for allergic asthma, the guinea pig is often your preclinical model of choice.
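The species tour above can be condensed into a small lookup table. What follows is only a sketch of that idea: the entries restate what was said in this conversation, the names `SPECIES_NICHES` and `candidates_for` are made up for illustration, and nothing here is an authoritative or complete reference.

```python
# Species niches as described in this conversation only; not an exhaustive
# or authoritative list. All identifier names are hypothetical.
SPECIES_NICHES = {
    "dog": ["cardiac electrophysiology"],
    "mini-pig": ["cardiac electrophysiology"],
    "pig": ["dermal toxicology", "wound healing"],
    "macaque": ["neuroscience", "immunology"],
    "ferret": ["respiratory viruses"],
    "zebrafish": ["developmental toxicity screening", "cardiac regeneration"],
    "guinea pig": ["allergy and asthma"],
}


def candidates_for(area: str) -> list[str]:
    """Return the species this conversation names for a research area."""
    return [species for species, areas in SPECIES_NICHES.items() if area in areas]


print(candidates_for("wound healing"))              # ['pig']
print(candidates_for("cardiac electrophysiology"))  # ['dog', 'mini-pig']
```

The point of keying on the research area rather than the species is the one made earlier: a species is only a good model "for what."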
Then there's the cost gradient that nobody talks about.
A lab mouse costs somewhere between twenty and fifty dollars. A purpose-bred beagle dog runs five hundred to two thousand. A cynomolgus macaque? Five thousand to fifteen thousand dollars per animal. And that's just acquisition. Housing, feeding, veterinary care, enrichment — primate studies can run into the hundreds of thousands of dollars. So sample sizes shrink as you move up the phylogenetic ladder. A mouse study might have forty animals per group. A primate study might have four to six.
Four to six animals. That's not a study, that's an anecdote with error bars.
It's a real statistical power problem. And researchers know this. You're making go/no-go decisions on drugs based on six monkeys. The only reason that's defensible is that the model validity is supposed to be higher. You're trading sample size for translational relevance. But if your model validity assumption is wrong...
Then you've got the worst of both worlds. Small sample and bad prediction.
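To put a rough number on the sample-size problem, here is a minimal sketch of a power calculation for comparing two groups. It rests on assumptions that are not from the conversation: a standardized effect size of 1.0, a two-sided alpha of 0.05, and a normal approximation that is optimistic for very small groups. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group: int, effect_size: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Uses the normal approximation and ignores the negligible far tail.
    For groups as small as six, a t-based calculation would give a
    somewhat lower value.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return NormalDist().cdf(noncentrality - z_crit)


# Group sizes taken from the conversation: six primates versus forty mice.
# The effect size of 1.0 is an illustrative assumption.
for n in (6, 40):
    print(n, round(approx_power(n, effect_size=1.0), 2))
```

Under these assumptions, six animals per group detect the effect well under half the time, while forty per group detect it almost always. That gap is what the higher model validity is supposed to buy back.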
Which brings us to the canonical cautionary tale. TGN1412, two thousand six. This was a monoclonal antibody developed by TeGenero, designed to treat autoimmune disease. It targeted the CD28 receptor on T-cells. They tested it in cynomolgus macaques at five hundred times the dose they planned to give humans. The monkeys were fine. No adverse effects. So they dosed six healthy human volunteers at a fraction of that amount. Within ninety minutes, all six were showing a severe systemic inflammatory response, and within hours they were in intensive care with catastrophic cytokine storms: multi-organ failure, the whole nightmare. They all survived, barely, but some had permanent organ damage.
Why did the monkeys not react?
Species-specific difference in CD28 receptor expression. Cynomolgus macaque T-cells don't express CD28 in the same way human T-cells do. The drug activated human T-cells explosively but left macaque T-cells essentially untouched. The model wasn't just imperfect — it was blind to the exact mechanism the drug was designed to target.
The monkey was the wrong model for the question being asked.
Here's the brutal part: at the time, the researchers thought they'd made the right choice. Cynomolgus macaques were considered the standard non-human primate model for immunology. The CD28 receptor was known to be highly conserved across species. Everyone assumed the monkey data would translate. It took a near-fatal clinical trial to reveal the gap.
That changes how the whole field thinks about model validity.
It led to new guidelines from the UK's MHRA and the EMA. Now, for high-risk immunomodulatory drugs, you typically need to demonstrate that the target receptor in your animal model actually behaves like the human receptor. Not just assume it does. You have to prove it with in vitro binding assays before you even start the animal study. The TGN1412 disaster effectively created a new regulatory requirement.
The cost gradient you mentioned — five thousand to fifteen thousand per macaque — that already constrains sample sizes. Then you layer on the ethical dimension. The two thousand twenty-three FDA Modernization Act two point zero now allows alternatives to animal testing. Organ-on-a-chip, computer models, in vitro systems. It doesn't ban animal testing, but it removes the mandate that said you had to use animals. Now researchers have to justify why they're using an animal model at all.
That justification requires deep species expertise. You can't just say "we used monkeys because that's what everyone uses." You have to articulate why this species is necessary for this question and why alternatives won't suffice. The burden of proof shifted.
That's the regulatory layer. But let me pull us back to something the prompt asked that I think is genuinely underexplored. The human expertise question. If you're a toxicologist who's been running primate studies for fifteen years, what do you know that a textbook can't teach you?
This is where it gets fascinating. A researcher who works with macaques for a decade develops what's essentially tacit knowledge — intuitive pattern recognition that's never written down. They know that a specific macaque's cortisol level is elevated not because of the drug but because it's lower in the dominance hierarchy and got intimidated by the alpha during feeding that morning. They know that a particular strain of cynomolgus from Mauritius has slightly different baseline liver enzymes than the same species from Vietnam. They can look at an animal and say that's not a drug effect, that's stress — because they've seen thousands of macaques under thousands of conditions.
That knowledge never makes it into the methods section.
And this connects to a real crisis in the field. A twenty twenty-two meta-analysis in PLoS Biology found that only about fifty percent of preclinical animal studies report sufficient detail to assess model validity. Things like the exact strain, the supplier, the housing conditions, the circadian timing of dosing, the social grouping. These aren't trivial details — they can completely change how an animal responds to a drug. But they're often omitted.
Even expert researchers can't evaluate each other's work.
Because the tacit knowledge that makes the data interpretable stays in the researcher's head. It's a reproducibility problem driven by expertise that's never externalized.
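One way to picture the reporting gap is as a completeness check over the details just listed: strain, supplier, housing, circadian timing of dosing, and social grouping. This is a sketch only. The field names, the `missing_details` helper, and the example record are all hypothetical and not taken from any real reporting standard or study.

```python
# Details named in the conversation as frequently omitted.
# The field names themselves are hypothetical.
REQUIRED_DETAILS = (
    "strain",
    "supplier",
    "housing_conditions",
    "dosing_time_of_day",
    "social_grouping",
)


def missing_details(study: dict) -> list[str]:
    """List the required details that are absent or empty in a study record."""
    return [field for field in REQUIRED_DETAILS if not study.get(field)]


# Made-up record for illustration; not from a real study.
example_study = {
    "strain": "C57BL/6",
    "supplier": "",
    "housing_conditions": "group-housed, 12 h light cycle",
}

print(missing_details(example_study))
# ['supplier', 'dosing_time_of_day', 'social_grouping']
```

A check like this cannot recover the tacit knowledge itself. It only flags where the written record stops and the researcher's memory takes over.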
How does someone even end up as the macaque person? What's the career path?
Typically, a PhD toxicologist or pharmacologist starts in grad school with rodent work — it's cheaper, faster, and the institutional barriers are lower. They might do their dissertation on, say, hepatotoxicity in rats. Then they do a postdoc or join a pharma company and get assigned to a non-human primate study because the drug program requires it. They shadow a senior primate researcher for a year or two, learning the handling, the behavior, the pathologies. After five years, they're competent. After ten to fifteen, they're the person everyone calls when a monkey study produces weird liver enzymes.
At that point, are they a primatologist? Or a toxicologist who happens to know monkeys?
They're a toxicologist with a subspecies-level specialization. They probably can't tell you much about wild macaque ecology or mating behavior. But they can tell you that the Mauritius-origin cynomolgus has a polymorphism in the CYP2C19 enzyme that affects drug metabolism, and if your study uses mixed-origin animals you're going to get noisy pharmacokinetic data that looks like a drug effect but isn't.
That's incredibly specific.
That's the level the expertise operates at. And here's the thing — when they hit a knowledge gap, and they do, they don't just guess. There's an entire consultation ecosystem. Veterinary pathologists who specialize in non-human primates. The National Primate Research Centers — the NIH maintains eight of them across the country — they provide not just animals but expertise. If you're running a primate study at a university and you see something unexpected, you call the NPRC and say what does this mean. They've seen it before.
It's a guild structure. Apprenticeship, specialization, consultation with masters.
Increasingly, it's being supplemented by databases. The Mouse Phenome Database has been around for a while — it catalogs physiological parameters across hundreds of mouse strains. The Primate Phenotype Database is newer and sparser, but growing. And in twenty twenty-four, NCATS — the National Center for Advancing Translational Sciences — launched something called the Animal Model Validity Index. It scores how well each species models specific human diseases, based on systematic review of the literature.
You can look up a disease and see which species has the highest validity score.
The index is still early and incomplete, but the direction is clear. The goal is to make species selection evidence-based rather than tradition-based. No more "we use beagles for cardiac because that's what we've always done." Instead, here's the validity score for beagle cardiac electrophysiology versus mini-pig versus humanized mouse, pick the best one for your question.
Let me push on something. You mentioned the fifty percent reporting problem. If half the literature is missing critical detail, how do you build a validity index? The index is only as good as the studies it's based on.
That's the catch. The index is built from the subset of studies that are well-reported, which introduces its own selection bias. The studies with the most detail tend to be the ones from the best-funded labs, which also tend to use the most expensive models. So the validity index might inadvertently reinforce the use of expensive primate models not because they're always better, but because they're better documented.
Which is a knock-on effect nobody planned for.
This is what I mean about the hidden infrastructure. Every decision — which species, which strain, which supplier, which housing — cascades into the quality and interpretability of the data. And most of these decisions are made by researchers drawing on tacit expertise that's never been systematically captured.
Let's talk about when things go wrong even with good models. The prompt asked about drugs that work in multiple species but fail in humans.
Alzheimer's disease is the poster child for this. Between two thousand and twenty twenty, something like eight out of ten Alzheimer's drug candidates that succeeded in transgenic mouse models failed in human trials. These mice expressed human amyloid precursor protein — they developed amyloid plaques just like Alzheimer's patients. The drugs cleared the plaques. Everything looked perfect.
In humans, nothing.
Or worse than nothing — some drugs made cognition worse. The problem, we now understand, is that these mouse models had human amyloid pathology but mouse immune systems, mouse aging patterns, and none of the tau tangles or neuroinflammation that characterize the full human disease. The model was valid for amyloid clearance but invalid for the actual clinical endpoint, which is cognitive decline.
The model answered the question it was designed for. It just wasn't the right question.
That's the expertise gap in action. Twenty years ago, the field believed that amyloid clearance would be sufficient. The mouse data supported that belief. It took a string of failed phase three trials to reveal that the model was answering a narrower question than everyone assumed.
What about a case where multi-species testing actually worked? Where the zoo approach paid off?
Remdesivir for COVID is a great example. The development program used three different animal models, each answering a different question. Ferrets were used for respiratory transmission — they develop upper respiratory infection similar to humans and transmit the virus to other ferrets. Rhesus macaques were used for lung pathology — they develop the lower respiratory disease that actually kills people. And transgenic mice expressing human ACE2 were used for antiviral efficacy screening — can the drug reduce viral load in a living system. No single species could have answered all three questions. The ferret can't tell you about severe lung disease. The macaque can't be used in the numbers needed for dose-ranging. The mouse can't model transmission.
Each species was a partial answer, and you needed all three to assemble a complete picture.
Even then, the picture was incomplete. Remdesivir turned out to be modestly effective in humans — it reduces recovery time but doesn't dramatically reduce mortality. The animal models suggested it would be more potent than it actually was. So even the multi-species convergence approach has limits.
Which brings us to the core epistemological problem. How do you know when you've reached the limits of your species expertise?
The honest answer is you often don't, until a human trial tells you. But there are signals. If your drug shows different pharmacokinetics in two species that are supposed to be pharmacokinetically similar, that's a red flag. If the dose-response curve is flat in one species and steep in another, that's a red flag. Experienced researchers learn to recognize these discordances and investigate rather than averaging them out.
Averaging them out. That's what happens when you treat species as replicates instead of as independent models.
It happens more than anyone wants to admit. You run a mouse study, a rat study, and a dog study. The mouse shows efficacy at ten milligrams per kilo, the rat at twenty, the dog at five. You average those and say the effective dose is around twelve. But the dog is telling you something different from the rat, and that difference might be the most important signal in your dataset.
Because the dog might be right and the rat might be wrong. Or vice versa.
You can't know which without understanding why they differ. That's where species expertise becomes indispensable. The researcher who knows that dogs have a different CYP enzyme profile than rodents can look at those discordant doses and say the drug is being metabolized differently in dogs, let's check the metabolite profile before we pick a human starting dose.
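The averaging trap is easy to show with the three doses from the example above. This sketch computes the naive average and then the fold difference between the most and least sensitive species. The threshold of 3.0 for calling the species discordant is an arbitrary assumption for illustration, and the names `fold_spread` and `DISCORDANCE_THRESHOLD` are hypothetical.

```python
from statistics import mean

# Effective doses from the example in the conversation, in mg/kg.
effective_dose_mg_per_kg = {"mouse": 10, "rat": 20, "dog": 5}


def fold_spread(doses: dict) -> float:
    """Ratio of the highest to the lowest effective dose across species."""
    return max(doses.values()) / min(doses.values())


# Hypothetical cutoff; a real program would justify this pharmacologically.
DISCORDANCE_THRESHOLD = 3.0

average = mean(effective_dose_mg_per_kg.values())
spread = fold_spread(effective_dose_mg_per_kg)
is_discordant = spread >= DISCORDANCE_THRESHOLD

print(f"naive average: {average:.1f} mg/kg")  # naive average: 11.7 mg/kg
print(f"fold spread: {spread:.1f}x")          # fold spread: 4.0x
print("investigate before averaging" if is_discordant else "species roughly agree")
```

The average of about twelve hides a fourfold gap between the dog and the rat. A flag like this only says that the species disagree. Explaining why they disagree still takes someone who knows the metabolism of each one.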
The expertise isn't just about knowing your species. It's about knowing how your species differs from other species, and from humans.
Comparative biology as a professional skill. And the NIH has been quietly building infrastructure for this. The eight National Primate Research Centers I mentioned — they house something like twenty-five thousand non-human primates across the network. But they also employ comparative pathologists, behaviorists, and veterinarians whose entire job is to help researchers interpret primate data. If you're a toxicologist at a small biotech who's never worked with macaques before, you can contract with an NPRC and get access to decades of institutional knowledge.
It's like a reference library made of people.
The people are aging. A lot of the really deep primate expertise is concentrated in researchers who are nearing retirement. The pipeline of new primate researchers is thin, partly because it's expensive, partly because it's ethically fraught, partly because the FDA Modernization Act is pushing the field toward alternatives. We might be in a window where species expertise is peaking just as the regulatory framework starts to de-emphasize animal models.
That's a strange inflection point. The expertise is at its maximum right when it might become obsolete.
Or it might become the foundation for validating the alternatives. Organ-on-a-chip systems and AI models of human biology don't emerge from nowhere. They're trained on data from animal studies and human clinical trials. The species expertise that researchers have built — the knowledge of how a macaque liver differs from a human liver — that's exactly the knowledge you need to build a computational model that corrects for those differences.
The expertise doesn't become obsolete. It gets encoded.
That's the optimistic view. The pessimistic view is that it gets lost because nobody bothered to encode it before the experts retired.
Let me shift to something more practical for our listeners. If someone's reading a news article about a promising new drug that worked in mice, what should they ask themselves?
First, what species was used and why? If the article doesn't say, that's already a red flag. Second, was the efficacy demonstrated in more than one species? A drug that works only in mice has a much lower probability of translating to humans than a drug that works in mice and dogs or mice and primates. Third, what was the actual endpoint? Did the mice live longer, or did their tumors shrink, or did a biomarker change? Tumor shrinkage in a mouse model of cancer is not the same thing as survival benefit in a human.
The endpoint question feels like the one most people miss.
Because it's the least intuitive. A drug that shrinks tumors in mice sounds amazing. But mouse tumor models often use tumors that grow much faster than human tumors and respond more dramatically to treatment. The same drug in a human might slow tumor growth by ten percent, which is clinically meaningful but not the dramatic shrinkage you saw in the mouse. The model is valid for tumor biology but the endpoint translation is tricky.
That's the expertise again. Knowing that tumor shrinkage means something different in a mouse with a fast-growing xenograft than in a human with a slow-growing adenocarcinoma.
Here's something actionable for anyone working in biotech or pharma. The FDA is increasingly encouraging what are called model validity statements in preclinical reports. A section where the researcher explicitly states why they chose this species, what the known limitations are, and how they've tried to mitigate those limitations. It's not yet mandatory, but it's heading that way.
Push for that. If you're reviewing a preclinical package, ask for the model validity statement. If there isn't one, ask why not.
If you're investing in a biotech company, ask about their animal model strategy. A company that says we tested it in mice and it worked is doing the bare minimum. A company that says we tested it in three species chosen for their specific relevance to our therapeutic area, and here's how we interpret the concordance and discordance between them — that's a company that actually understands translational science.
What about for the general listener who just wants to be a more critical consumer of science news?
The phrase "worked in mice" should trigger a specific skepticism. Not dismissal, because plenty of drugs that worked in mice also work in humans. But the base rate for mouse-to-human translation in most therapeutic areas is somewhere between ten and thirty percent, depending on the disease. So when you hear "worked in mice," you should mentally append "with unknown human relevance until we see multi-species data."
Ten to thirty percent. That's sobering.
That's for the studies that are well-designed. The poorly-designed ones have an even lower rate. The PLoS Biology meta-analysis I mentioned, the one finding that only fifty percent of studies report enough detail, implies that a huge fraction of the preclinical literature is essentially uninterpretable. You can't assess validity if you don't know the strain, the housing, the dosing schedule.
Half the animal studies out there might as well have a footnote that says results may vary for reasons we didn't document.
That's uncomfortably close to the truth. And it's not necessarily fraud or incompetence. It's often just that the researcher has been working with this strain for so long that they don't think to mention that it's the C57BL/6 substrain from Jackson Labs rather than the one from Charles River, even though those two substrains have drifted genetically and can respond differently to drugs.
The expertise becomes invisible to the expert.
It's the curse of knowledge. The thing you know so deeply that you forget other people don't know it.
Let me pull us toward the future. We've got organ-on-a-chip systems getting more sophisticated. AI models that can predict human toxicity from molecular structure. The FDA Modernization Act opening the door to alternatives. What does the animal testing landscape look like in, say, fifteen years?
I think we're heading toward a hybrid model. The AI systems will get good enough to handle the straightforward cases — the drug that's obviously toxic or obviously safe based on its chemical structure and known biology. The animal studies will be reserved for the edge cases and the complex systems. Neuroscience, immunology, developmental biology — areas where the biology is too complicated to model computationally with current or near-future technology.
The animals become the reference standard. You use them to validate the AI models, not to screen every drug.
That's where today's species expertise becomes tomorrow's validation dataset. Every time a researcher says this macaque responded differently than the AI predicted, and here's why, that's a training data point that makes the AI better. The expertise gets encoded incrementally.
That's a more hopeful picture than the expertise just evaporating as people retire.
It only works if we capture the expertise before it's gone. The databases, the validity indices, the detailed reporting standards — all of that is an attempt to externalize tacit knowledge before the tacit knowledge holders leave the field.
What we're really talking about is a race between knowledge capture and knowledge loss.
The stakes are high. Every drug that fails in phase three because of a species model validity gap represents hundreds of millions of dollars and years of wasted time. More importantly, it represents patients who waited for a treatment that didn't work. Getting species selection right isn't just an academic exercise — it's a public health priority.
Which brings us back to where we started. The prompt asked about the hidden expertise behind animal studies, the consultation networks, the knowledge accumulation. And what we've uncovered is that the entire enterprise rests on a foundation of human judgment that's rarely acknowledged and poorly documented.
The animals are the visible part. The expertise is the invisible part. And the invisible part might be more important.
For our listeners — whether you're evaluating a clinical trial, investing in a biotech, or just reading the morning news about a mouse study — the question to ask is always the same. What species, and why? The answer tells you more than the headline ever will.
If you want to go deeper on another layer of the drug safety ecosystem that most people never see, we did an episode on why small countries like Israel re-review drugs that the FDA and EMA have already approved. It turns out that regulatory re-review catches issues that animal studies miss — a different kind of model validity problem, but the same underlying theme. Hidden expertise catching hidden risks.
We'll put a link in the show notes.
Now, Hilbert's daily fun fact.
Hilbert: On the high plateaus of Madagascar, katabatic winds — cold, dense air rushing down mountain slopes at night — can create microclimates so extreme that certain plant species were long believed extinct, only to be rediscovered in sheltered valleys where the wind never reaches.
...right.
Here's the open question I want to leave with. As organ-on-a-chip systems and AI models get better, we might reach a point where the only animals used in drug testing are the ones validating those alternative models. The animals become the calibration standard, not the screening tool. And if that happens, the species expertise we've been talking about becomes even more valuable, not less — because calibrating a model requires knowing exactly how and why the animal differs from the human. The expertise doesn't disappear. It gets baked into the system.
The researchers who spent decades learning macaque liver enzymes and beagle cardiac electrophysiology — they become the people who teach the AI what to look for.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. You can find every episode at myweirdprompts. If you got something out of this one, leave us a review wherever you listen. It helps other people find the show.
I'm Herman Poppleberry.
I'm Corn. We'll be back.