#584: Inside the Lab: The Rise of Autonomous AI Research

Can AI actually do math research? Herman and Corn dive into DeepMind’s Alithia agent and the shift toward "System 2" thinking in AI.

Episode Details

Duration: 28:46
Pipeline: V4
TTS Engine: LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

On a chilly morning in Jerusalem, brothers Herman and Corn Poppleberry sat down to unpack one of the most significant shifts in the history of artificial intelligence: the transition from AI as a chatbot to AI as an autonomous researcher. Prompted by a listener's question about Google DeepMind's new "Alithia" agent, part of the Gemini 3 DeepThink project, the discussion moved quickly from science fiction to the hard reality of modern computer science.

From Pattern Matching to "System 2" Thinking

The core of the discussion centered on how AI is fundamentally changing its "thought" process. Herman explains that for years, large language models (LLMs) operated primarily through "System 1" thinking, a term borrowed from psychologist Daniel Kahneman. This type of thinking is fast, instinctive, and based on pattern recognition. When you ask a standard AI a question, it predicts the most likely next word based on patterns in its training data.

However, Alithia represents a shift toward "System 2" thinking: a slower, more deliberative, and logical process. Through a technique known as "test-time compute" or "inference-time scaling," the AI no longer just spits out an answer. Instead, it uses its computational power to explore a "tree of possibilities," iterating and checking its own logic before presenting a final result. Herman notes that Alithia can use formal verification languages like Lean or Coq to ensure its mathematical proofs are logically sound, marking a massive milestone where AI isn't just mimicking mathematicians—it’s performing original research.
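To make the idea concrete, here is a minimal sketch of what an inference-time scaling loop can look like: generate candidate reasoning steps, keep the most promising branches, and only return an answer once an external checker accepts it. The `model.propose_steps`, `model.score`, `model.extract_proof`, and `verifier.check` calls are hypothetical stand-ins (a real system might wrap a Lean or Coq checker behind `verifier.check`); this is an illustrative sketch, not DeepMind's actual implementation.

```python
def solve_with_test_time_compute(problem, model, verifier, budget=64, beam_width=4):
    """Sketch of "System 2"-style inference: spend extra compute exploring a
    tree of candidate reasoning paths and return only a result that an
    external checker accepts. `model` and `verifier` are hypothetical objects."""
    frontier = [""]  # partial reasoning traces, starting from an empty trace
    for _ in range(budget):
        candidates = []
        for partial in frontier:
            # Branch: ask the model for several possible next reasoning steps.
            for step in model.propose_steps(problem, partial):
                candidates.append(partial + step)
        # Prune: keep only the most promising branches (a simple beam search).
        candidates.sort(key=lambda trace: model.score(problem, trace), reverse=True)
        frontier = candidates[:beam_width]
        for trace in frontier:
            proof = model.extract_proof(trace)
            # Accept only if an independent checker (e.g. a Lean/Coq wrapper) verifies it.
            if proof is not None and verifier.check(proof):
                return proof
    return None  # budget exhausted without a verified result
```

The key design choice is that correctness comes from the verifier rather than from the model's own confidence, which is what separates this loop from ordinary next-word prediction.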

The New Architecture of AI Labs

Corn and Herman highlight that the nature of the organizations building these tools has also shifted. The line between a consumer software company and a high-level research university has blurred. Labs like Google DeepMind, OpenAI, and Anthropic have become massive, multi-layered organizations that function more like secret government projects than traditional tech firms.

This shift is reflected in their hiring practices. Herman points out that these labs are no longer just looking for computer scientists. The "masthead" of a modern AI research paper now includes theoretical physicists, pure mathematicians specializing in topology, cognitive scientists, and even linguists. The industry is currently hunting for "full-stack researchers"—individuals who can bridge the gap between dense theoretical math and the high-performance C++ coding required to manage clusters of thousands of GPUs.

The Academic "Brain Drain"

A significant portion of the episode focuses on the "brain drain" currently affecting global academia. Herman explains that the "compute moat"—the massive amount of hardware and capital required to train frontier models—is sucking talent out of universities. When a corporate lab can offer five times a professor’s salary and access to $100 million in computing power, the incentive to stay in traditional academia diminishes. This has created a two-tier system in science where the most advanced research is increasingly happening behind corporate doors rather than in public universities.

The Role of Third-Party Labs

Despite the dominance of the "Big Tech" vendors, Herman and Corn discuss the vital role of independent and semi-independent institutes. They highlight organizations like the Allen Institute for AI (AI2) in Seattle, which focuses on "AI for the common good" and transparency. They also touch on Mila in Montreal and the Vector Institute in Toronto, which act as bridges between government funding and corporate sponsorship, focusing on the fundamental "why" of AI rather than just the "how" of a commercial product.

The brothers also give a nod to grassroots collectives like EleutherAI. What started as a group of researchers on Discord looking to replicate GPT-3 has grown into a serious research powerhouse that keeps major corporations honest by releasing open-source models and datasets.

The Skepticism of Benchmarks

One of the most provocative points of the discussion involves the "reproducibility crisis" in AI. Corn asks how the industry verifies claims made by vendors who have a financial interest in making their models look like the best in the world. Herman warns against "marketing benchmarks," where companies cherry-pick data to show their models in the best light.

There is a growing concern that models are being "trained on the test." If the questions used to benchmark a model are included in its training data, the model isn't actually "smart"—it simply has a good memory. This is why autonomous research agents like Alithia are so significant. You cannot "cheat" on an unsolved mathematical conjecture. If an AI provides a proof for a theorem that has remained open for fifty years, and that proof can be verified by an independent logic-checker, the "vibes" of marketing are replaced by verifiable truth.
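As a rough illustration of how evaluators try to catch this, below is a simple decontamination heuristic of the kind commonly used: flag any benchmark question whose n-grams overlap heavily with the training corpus. The function names and the 8-gram, 50%-overlap threshold are illustrative choices, not any specific lab's pipeline.

```python
def ngrams(text, n=8):
    """Return the set of n-word shingles in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_questions, training_documents, n=8, threshold=0.5):
    """Flag benchmark questions that substantially overlap the training data,
    i.e. questions the model may have effectively memorized."""
    corpus_grams = set()
    for doc in training_documents:
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for question in benchmark_questions:
        grams = ngrams(question, n)
        overlap = len(grams & corpus_grams) / len(grams) if grams else 0.0
        if overlap >= threshold:
            flagged.append(question)
    return flagged
```

A flagged question does not prove cheating, but high overlap means a correct answer says more about memory than reasoning, which is exactly the distinction the episode draws.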

Conclusion: A New Era of Discovery

As the episode concludes, the takeaway is clear: we are moving away from the era of AI as a sophisticated toy and into an era where AI is a legitimate partner in scientific discovery. While the concentration of power in corporate labs remains a concern, the development of "System 2" thinking and formal verification suggests that the future of AI research will be defined by logic, transparency, and the pursuit of answers to questions that have eluded humans for decades.

Downloads

Episode Audio: download the full episode as an MP3 file (Download MP3)
Transcript (TXT): plain text transcript file
Transcript (PDF): formatted PDF with styling

Read Full Transcript

Episode #584: Inside the Lab: The Rise of Autonomous AI Research

Daniel's Prompt
Daniel
I’d love to talk about AI research, especially as it’s conducted at AI laboratories. These labs are often the vendors themselves, like Google with Gemini and DeepMind, which can refer to both the product and the research lab. I recently read about Google’s Gemini 3 DeepThink agent, Alithia, completing autonomous mathematical research.

I’d like to understand more about the people employed in these labs and their backgrounds. Besides vendors, what are the major third-party AI laboratories exploring emerging capabilities and agentic systems? Finally, is vendor-led research viewed with skepticism by the industry due to potential conflicts of interest? Let's discuss the labs behind these stories and the world of Arxiv and GitHub where AI innovation is explored.
Corn
Hey everyone, welcome back to My Weird Prompts. I am Corn, and I am sitting here on a surprisingly chilly morning in Jerusalem with my brother. The sun is just starting to hit the limestone walls of the Old City outside our window, but the air in this stone house is definitely holding onto the winter.
Herman
Herman Poppleberry, at your service. It is definitely sweater weather today, Corn. I have got my oversized wool cardigan on, a double espresso that is currently the only thing keeping me upright, and a stack of printed papers that I have been highlighting since five in the morning. I am ready to dive into some deep technical territory.
Corn
Good, because our housemate Daniel sent us a prompt that is right up your alley. He has been following the latest news about Google and DeepMind, specifically this new agent called Alithia that is part of the Gemini three DeepThink project. Apparently, it has been completing autonomous mathematical research, which sounds like something straight out of a science fiction novel from twenty years ago. Daniel wants to know if this is the end of the human mathematician or just a very expensive calculator.
Herman
It is a fantastic jumping-off point, Daniel, so thank you for sending this in. It touches on the very structure of how modern progress in artificial intelligence actually happens. You see, the line between a consumer product and a high-level research laboratory has become incredibly blurry lately. People hear the name Gemini and they think of the app on their phone or the assistant helping them write an email. But behind that interface is a massive, multi-layered research organization that functions more like a high-end university or a secret government project than a traditional software company.
Corn
That is exactly what Daniel was asking about. He wanted to know about the people in these labs, the backgrounds they come from, and whether we should be skeptical of research that comes directly from the vendors who are trying to sell us these products. He also asked about third-party labs and the whole culture of ArXiv and GitHub. So, Herman, let us start with the big one. This Alithia agent. What is actually going on there? Is it really doing math, or is it just really good at looking like it is doing math?
Herman
That is the crucial distinction. For a long time, large language models were essentially sophisticated pattern matchers. If you asked them to solve a math problem, they were predicting the next most likely character based on thousands of similar problems they had seen in their training data. But the DeepThink architecture, which powers agents like Alithia, represents a shift toward what we call system two thinking. This is a reference to Daniel Kahneman's work on behavioral economics. System one is fast, instinctive, and emotional. System two is slower, more deliberative, and logical. Alithia is designed to use search and optimization techniques during the actual inference process. It is not just spitting out an answer. It is exploring a tree of possibilities, checking its own work, and iterating until it finds a logically sound proof.
Corn
So it is essentially "thinking" before it speaks? Like it is having an internal monologue?
Herman
In a manner of speaking, yes. It uses a technique often referred to as test-time compute or inference-time scaling. Instead of using a fixed amount of energy to generate every word, the system can decide to spend more time—more computational power—on a difficult math problem. It might run thousands of internal simulations or formal verifications using a language like Lean or Coq before it presents the final research paper. The fact that Alithia is doing this autonomously, meaning it identifies the problem, formulates a hypothesis, and then works through the formal logic without human intervention, is a massive milestone. It recently provided a novel proof for a lemma related to the twin prime conjecture that, while niche, was entirely original.
Corn
It makes me wonder about the people who build these things. If the A I is doing the research, what are the researchers doing? Daniel asked about the backgrounds of the people employed in these labs. I imagine it is not just your standard computer science graduates anymore.
Herman
You are right. It has changed significantly over the last five to ten years. In the early days of the deep learning revolution, you mostly saw computer scientists and statisticians. But now, if you look at the masthead of a major paper from Google DeepMind or OpenAI or Anthropic, you see an incredible diversity of expertise. You have theoretical physicists who are experts in information theory. You have pure mathematicians who specialize in category theory or topology. You have cognitive scientists who spend their time thinking about how human biological neural networks differ from artificial ones.
Corn
I have noticed that a lot of the top researchers actually have backgrounds in philosophy or linguistics too. It seems like as the models get more "agentic," meaning they can act on their own, the labs need people who understand the nature of agency and ethics.
Herman
Absolutely. There is a very famous researcher at OpenAI who actually started in neurobiology. The idea is that if we want to build an artificial general intelligence, we need to understand how general intelligence works in the only place we know it exists—the human brain. But there is also a very practical side to the hiring. These labs are looking for what I call full-stack researchers. These are people who can read a dense theoretical paper on Monday, write the high-performance C plus plus code to implement the idea on Tuesday, and then manage a cluster of ten thousand graphics processing units to train the model by Friday. It is a very rare skill set. We are talking about people who understand both the "why" of the math and the "how" of the hardware.
Corn
It also sounds like an incredibly expensive skill set. I mean, we hear about these seven-figure salaries for top A I researchers. Does that mean all the talent is being sucked out of the universities and into these corporate labs?
Herman
That is a major concern in the industry right now. We call it the brain drain. If you are a brilliant young professor at Stanford or M I T, and Google DeepMind offers you five times your salary plus access to a hundred million dollars worth of computing power that your university could never afford, it is very hard to say no. This is why the vendor labs are leading the charge. They have the compute moat. You can have the best ideas in the world, but if you do not have the hardware to test them, you are just writing theory. This has created a bit of a two-tier system in science.
Corn
That leads perfectly into Daniel's second question. Besides the big vendors like Google, Meta, and Microsoft-backed OpenAI, who are the major third-party A I laboratories? Who is doing the work that is not necessarily tied to a specific commercial product?
Herman
There are a few different categories here. First, you have the non-profit or semi-independent institutes. One of the most famous is the Allen Institute for Artificial Intelligence, or A I two, located in Seattle. It was founded by Paul Allen, the co-founder of Microsoft, but it operates independently. They focus on A I for the common good. They do incredible work on things like Semantic Scholar, which helps researchers find relevant papers, and they build open-source models like O L Mo, which stands for Open Language Model. Their goal is transparency. They want to show the world exactly how these things are trained, which is something the big vendors are becoming increasingly secretive about.
Corn
I have also heard you mention Mila and the Vector Institute. Those are more academic, right?
Herman
Exactly. Mila is based in Montreal and is led by Yoshua Bengio, who is one of the godfathers of deep learning. It is a partnership between several universities, but it functions like a world-class research lab. The Vector Institute in Toronto is similar. These labs are crucial because they act as a bridge. They get some government funding, some corporate sponsorship, but their primary mission is still pure research and training the next generation of scientists. They are less focused on building a product like Gemini and more focused on answering the fundamental "why" questions.
Corn
What about the newer labs that focus specifically on agentic systems? I feel like every week I hear about a new startup that claims to be a research lab first and a company second.
Herman
Yes, the lab-as-a-service model is very popular right now. Look at labs like Adept or even Cognition A I. They are focused specifically on agents—A I that can use a computer like a human does, clicking buttons, writing code, and browsing the web to accomplish a goal. And then you have organizations like Eleuther A I. They started as a grassroots group of researchers on Discord who wanted to replicate the results of G P T three because they felt it shouldn't be locked behind a corporate gate. They have since grown into a very serious research collective that produces high-impact papers and open-source datasets. They are the ones keeping the big guys honest.
Corn
It is interesting that you mention the corporate gate. That brings us to Daniel's point about skepticism. If Google publishes a paper saying their new Gemini model is the best in the world at autonomous math, and they have a financial interest in people believing that, how does the industry actually verify it? Is there a lot of "take it with a grain of salt" going on?
Herman
Oh, absolutely. There is a healthy, and sometimes unhealthy, amount of skepticism. In the A I world, we have a bit of a reproducibility crisis. When a vendor publishes a paper, they often describe the architecture, but they do not always release the weights of the model or the exact training data. This makes it very hard for an independent lab to verify the results. If I tell you I have built a car that runs on water, but I will not let you look under the hood or drive it yourself, you are going to be skeptical. We often see what we call marketing benchmarks, where a company cherry-picks the results that make them look best.
Corn
So how do they get away with it? Why does the scientific community still pay attention?
Herman
Because they usually provide some level of access via an Application Programming Interface, or A P I. Researchers can then run their own benchmarks against the model. But here is the catch—the benchmarks themselves are often gamed. There is a saying in data science: "When a measure becomes a target, it ceases to be a good measure." If a model is trained on the very questions that are in the standard math benchmarks, of course it is going to score a hundred percent. It is like a student who has the answers to the final exam memorized. They are not actually smart; they just have a good memory. This is why we are seeing a move toward private evaluations and live benchmarks that the models haven't seen before.
Corn
That is where the autonomous research like Alithia gets interesting, though. If it is solving new, previously unsolved mathematical conjectures, you cannot really cheat on that, can you?
Herman
Precisely. That is why the move toward formal verification and autonomous research is so significant. If Alithia produces a proof for a theorem that has been an open question for fifty years, and that proof can be verified by a separate, much simpler logic-checking program, then the skepticism disappears. You cannot fake a mathematical proof. It either works or it does not. That is the ultimate gold standard for research. It moves us away from vibes and toward verifiable truth.
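To give a flavor of what a "separate, much simpler logic-checking program" means in practice, here is a deliberately tiny Lean 4 example; the theorem name is made up for illustration, but the point stands: the Lean kernel either certifies the proof or rejects it, with no room for marketing spin.

```lean
-- A deliberately trivial, machine-checkable statement.
-- The Lean kernel either accepts this proof or rejects it; there is no partial credit.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```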
Corn
I want to talk about the world of ArXiv and GitHub that Daniel mentioned. For those who do not know, ArXiv is an open-access archive where researchers post their papers before they are formally peer-reviewed and published in journals. It feels like the pulse of the A I world.
Herman
It is the town square of A I. In most scientific fields, you write a paper, you submit it to a journal, you wait six months for peer review, and then it gets published. In A I, six months is an eternity. If you wait six months, your research is already obsolete. So, researchers post their pre-prints on ArXiv immediately. The peer review happens in real-time on social media, in blog posts, and in the issues section of GitHub repositories. It is a high-velocity, high-stakes environment.
Corn
It is a very "move fast and break things" approach to science. I imagine it can be quite chaotic.
Herman
It is incredibly chaotic. On any given Tuesday, there might be fifty new A I papers posted. Most of them are incremental or even flat-out wrong. But among them is the next "Attention Is All You Need," which was the paper that introduced the Transformer architecture that changed everything. The skill now is not just being able to do the research, but being able to filter the signal from the noise on ArXiv. There are actually A I agents now whose only job is to read ArXiv and summarize the important papers for human researchers.
Corn
And GitHub is where the rubber meets the road. That is where the code lives.
Herman
Exactly. A research paper without code is often viewed with extreme suspicion. The gold standard in the community is a paper on ArXiv with a link to a GitHub repository that contains the code to reproduce the results. This is where the open-source community, led by companies like Meta and their Llama models, has been so influential. They are releasing the how-to and the what to the world, which forces the closed-vendor labs to stay honest. If a small lab in Paris, like Mistral, can produce a model that rivals Google's best work using a fraction of the resources, it proves that it is about the ideas and the data quality, not just the size of your server farm.
Corn
You know, we should probably take a second to mention that if you are enjoying this deep dive into the world of A I labs, we would really appreciate it if you could leave us a review on your favorite podcast app. It really helps people find the show, and we love seeing what you think of these episodes. We are trying to hit five hundred reviews by the end of the month, so every single one counts.
Herman
It really does help. And it keeps us motivated to spend our late nights reading those fifty ArXiv papers so you do not have to. I have got permanent blue-light strain in my eyes for you, the listeners!
Corn
Right, back to the labs. One thing Daniel touched on was the agentic nature of these systems. We have talked about researchers, but what about the alignment teams? I hear that word a lot lately. It seems like every major lab now has a huge portion of its staff dedicated to making sure the A I does not, well, go rogue.
Herman
Alignment is the central challenge of our time in A I. It is the study of how to ensure that an A I's goals and behaviors are aligned with human values. In a lab setting, this is not just a philosophical question; it is a technical one. How do you give a model like Alithia the freedom to do autonomous research without it deciding that the most efficient way to solve a math problem is to turn off the safety filters or use all the world's electricity? We have seen researchers coming from backgrounds in game theory and reinforcement learning specifically to tackle this.
Corn
It sounds like a difficult balance. You want it to be smart enough to be useful, but not so smart that it becomes unpredictable.
Herman
Exactly. And the labs are split on how to handle this. You have the pro-acceleration camp, who believe we should push forward as fast as possible and solve the safety issues as they arise. Then you have the safety-first camp, who think we should slow down until we have a mathematical proof of safety. This tension is what led to the split between OpenAI and Anthropic. Anthropic was founded by former OpenAI employees who felt that safety was not being taken seriously enough. They call themselves an A I safety and research company. Their research is often focused on interpretability—basically trying to look inside the black box of the neural network to see what it is actually thinking.
Corn
I like that analogy of the black box. It feels like we are at a point where the labs have built these incredible engines, but they are still trying to figure out exactly how the internal combustion works.
Herman
That is very accurate. We know it works, we can measure the horsepower, but we are still mapping out the gears. There is a whole field called mechanistic interpretability where researchers try to map specific neurons in the A I to specific concepts, like "the concept of a prime number" or "the concept of a lie." And that is why the backgrounds of these researchers matter so much. We need the mechanics who can build the engine, but we also need the architects who can design the car and the policymakers who can decide where the roads should go.
Corn
Let us talk about the practical side of this for a moment. If someone is listening to this and they want to stay informed about what is coming out of these labs, but they are not a P h D in mathematics, what should they look for? How do you read an ArXiv paper without getting a headache?
Herman
My advice is always to start with the Abstract and the Conclusion. The abstract tells you what they tried to do and what they found. The conclusion tells you why it matters and what the limitations are. But the real secret is to look at the Figures and Tables. A good A I paper will have a chart comparing its model to others. Look for the State of the Art, or S O T A, benchmarks. If the new model is only point five percent better than the old one, it is probably just an incremental update. If it is twenty percent better, or if it is doing something entirely new like Alithia, then you pay attention.
Corn
And what about the third-party verification? Is there a place where people can go to see unbiased reviews of these models?
Herman
There are a few great resources. The L M S Y S Chatbot Arena is a fantastic one. It uses a crowd-sourced E L O rating system, similar to how chess players are ranked. Users are shown two anonymous models, they ask them a question, and then they vote on which answer is better. It is a very human-centric way of measuring performance that is much harder to game than a standardized test. There is also the Hugging Face leaderboard, which tracks the performance of open-source models. Hugging Face is essentially the GitHub of A I models. It is a central hub where labs and independent researchers share their work.
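For listeners curious about the mechanics Herman is describing, a minimal sketch of an Elo-style update follows: each crowd-sourced vote nudges the two models' ratings according to how surprising the outcome was. The K-factor and starting ratings are illustrative, and the Arena's actual statistical model is more involved than this.

```python
def elo_update(r_a, r_b, winner, k=32):
    """Apply one crowd-sourced vote between two anonymous models.

    winner is 'a', 'b', or 'tie'. K and the starting ratings are illustrative.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "tie": 0.5, "b": 0.0}[winner]
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1000 and model 'a' wins one matchup.
print(elo_update(1000, 1000, "a"))  # -> (1016.0, 984.0)
```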
Corn
It is such a vibrant ecosystem. It feels like the early days of the internet or the homebrew computer clubs in the seventies.
Herman
It really is. The pace is breathtaking. But to Daniel's point about skepticism—I think it is important to remember that every lab has a narrative. Google wants to show that they are still the kings of research, despite the competition. OpenAI wants to show that they are the closest to A G I. Meta wants to show that open-source is the most robust path. When you read a paper, you have to ask: "What story is this lab trying to tell?" Science is never purely objective when there are billions of dollars in venture capital on the line.
Corn
So, when we look at Alithia and Gemini three DeepThink, the story is that Google is reclaiming the lead in reasoning and agency.
Herman
Exactly. They are saying, "We are not just a chatbot company. We are a fundamental science company." And to be fair, the research coming out of DeepMind over the years—AlphaFold for protein folding, AlphaGo for games—has been world-class. They have earned a certain level of respect. But even then, the industry will wait to see if other researchers can replicate those autonomous math results. We are waiting for the first independent lab to say, "Yes, we ran Alithia's code and it actually discovered something new."
Corn
It makes me think about the future of the researcher as a career. If we have agents that can do the grunt work of research—the literature reviews, the hypothesis testing, the coding—does the human researcher just become a manager of A I agents?
Herman
That is the big shift. We are moving from doing research to directing research. A lead scientist at a lab in five years might manage a fleet of a hundred Alithia-style agents, each working on a different sub-problem. The human's job will be to provide the high-level intuition, the creative leaps, and the final ethical sign-off. It is a very different kind of work. It is more about prompt engineering at a massive, scientific scale. Instead of asking "how do I solve this equation," the researcher asks "what is the most important equation to solve for humanity right now?"
Corn
It is a bit daunting, though. If the entry-level grunt work is gone, how do new researchers learn the ropes? How do you become a master if you never have to do the apprentice work?
Herman
That is a profound question, Corn. It is the same concern people have about A I and coding. If you never have to struggle with a bug for three hours, do you ever really understand how the system works? The labs are going to have to figure out how to train the next generation of directors without the traditional doer path. Maybe we will see more focus on A I-human collaboration as a core part of the P h D curriculum. We might see degrees in "Artificial Epistemology" or "Agentic Management."
Corn
It is fascinating to think that we are living through this transition right now. We are essentially watching the scientific method itself get an upgrade.
Herman
We are. The Scientific Method two point zero is going to be powered by these agentic systems. It will be faster, more rigorous, and hopefully, it will help us solve problems that have been stuck for decades—things like clean energy, personalized medicine, and yes, even the most complex mathematical conjectures. Imagine an A I that doesn't just solve math, but designs a new type of carbon capture filter based on molecular simulations it ran itself.
Corn
I think that is a great place to wrap up our core discussion. We have covered the who, the where, and the why of these labs. But before we go, Herman, do you have any final takeaways for the listeners? Maybe something they can look for in the news over the next few months?
Herman
I would say, keep an eye on the Open Weights movement. Watch labs like Mistral in France, Sakana A I in Tokyo, and the Allen Institute. If they can continue to keep pace with the massive, closed-vendor labs, it means the future of A I will be more democratic and transparent. If the gap between the closed labs and the open ones starts to widen significantly because of compute moats, then we should start to worry about the concentration of power. Also, watch for the word "formal verification." That is the signal that an A I is actually reasoning rather than just guessing.
Corn
That is a really important point. The "who" matters just as much as the "what."
Herman
And for the technical folks out there, go to ArXiv. Look up the Alithia paper. Even if you do not understand the math, look at the Limitations section. That is where the most honest information usually lives. No lab wants to admit their model is not perfect, but the Limitations section is where they have to be scientists instead of marketers. It is the most honest part of the whole A I industry.
Corn
Wise words. Well, this has been an incredible exploration of a world that is usually hidden behind corporate press releases and dense academic jargon. Thank you, Herman, for breaking it all down. And thanks again to Daniel for such a timely and deep prompt. It really pushed us to look at the machinery behind the magic.
Herman
It was a pleasure. I always love an excuse to talk about Herman Poppleberry's favorite topic—the future of intelligence. Now, if you'll excuse me, I have another forty-two papers from this morning's ArXiv drop to get through before lunch.
Corn
If you have a weird prompt of your own, or if you want to get in touch with us, you can find the contact form and our full archive of episodes at myweirdprompts dot com. We are also on Spotify, Apple Podcasts, and most other places where you get your audio fix.
Herman
And if you are in Jerusalem, say hi if you see us out and about. We are the ones usually arguing about neural networks and the alignment problem over a plate of hummus in the Mahane Yehuda market.
Corn
Exactly. We are easy to spot—Herman is the one with the highlighter stains on his fingers. Alright, everyone, thanks for listening to My Weird Prompts. We will be back soon with another deep dive into the strange and wonderful world of human-A I collaboration.
Herman
Until next time!
Corn
Goodbye!

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.