Alright, we have a monster of a topic today. Daniel sent us a fascinating write-up about the "Data Wall" and the evolution of scaling laws. He basically asks, why isn't GPT-5 just GPT-4 with a bigger engine and more chrome? If the old mantra was "bigger is better," why are we seeing this massive shift in how the biggest labs in the world actually build these models? He wants us to trace the path from the early optimism of the 2020 Kaplan laws to the cold, hard reality of the Chinchilla paper and what it means for the future of AI engineering.
It is the defining question of the current era, Corn. Honestly, if you understand the shift from 2020 to now, you understand why the AI industry looks the way it does. You understand why OpenAI is buying the archives of the Financial Times and why Meta is training tiny models on trillions of tokens. It’s all buried in the math of these scaling laws.
Well, before we dig into the math and the "Data Hunger Games," I should mention that today’s episode is actually being powered by Google Gemini 1.5 Flash. It’s writing our script today, which is fitting, considering we’re talking about the models that come out of these massive labs. I’m Corn, by the way.
And I’m Herman Poppleberry. And Corn, you hit on it right at the start. For a long time, the vibe in AI was very "more is more." If you wanted a smarter model, you just added more parameters. It was like building a bigger brain and assuming intelligence would just follow. But we’ve learned that a giant brain is useless if it hasn't actually read anything.
Right, it’s like the kid from "Big": a kid's mind in a giant body. Plenty of potential, but no life experience. So, let’s go back to 2020. That’s where the "bigger is better" gospel really started, right? With the Kaplan paper from OpenAI?
Not exactly, but the Kaplan paper provided the empirical backbone for that belief. Jared Kaplan and the team at OpenAI published "Scaling Laws for Neural Language Models," and it was a bombshell. It basically said that if you want to lower the "loss"—which is just a fancy way of saying you want the model's predictions to be more accurate—you can predict exactly how much better it will get from three variables: the number of parameters, the amount of training data, and the total compute you spend on training.
And the takeaway for most people back then was: just juice the parameters.
That was the intuition. The Kaplan paper suggested that as you got more compute, you should put the lion's share of that budget—like sixty or seventy percent of it—into making the model larger. The data didn't need to scale nearly as fast. So, if you had ten times the money to spend on electricity and chips, the math told you to make a model that was maybe five times bigger but only give it twice as much data.
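Herman's "five times bigger, twice the data" figure falls out of the approximate allocation exponents reported in the 2020 Kaplan paper, roughly 0.73 for parameters and 0.27 for data. A tiny sketch of that arithmetic (the exponents are rounded; the paper states them with more precision):

```python
def kaplan_allocation(compute_multiplier, a_n=0.73, a_d=0.27):
    """Split a compute increase between model size and data, Kaplan-style.

    a_n and a_d are the approximate allocation exponents from the 2020
    scaling-laws paper: parameters grow much faster than data.
    """
    return compute_multiplier ** a_n, compute_multiplier ** a_d

# With 10x the compute budget:
n_growth, d_growth = kaplan_allocation(10)
print(round(n_growth, 1), round(d_growth, 1))  # 5.4 1.9
```

So ten times the compute buys you a model about five times bigger but not even twice the data, which is exactly the lopsided recipe Herman describes.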
It sounds like a recipe for a very specialized, very hollow giant. But how did they actually prove this at the time? Were they just looking at small-scale tests and assuming they’d hold up at the scale of a supercomputer?
That’s exactly what happened. They were looking at "power laws." If you plot performance on a graph, it follows a very predictable curve. But the curves they were looking at were based on models that hadn't been trained to "saturation." They were essentially looking at how fast a model learns in the first few weeks of school and assuming that rate of progress would stay the same forever.
Which led us straight to GPT-3.
It did. GPT-3 was a behemoth for its time—one hundred and seventy-five billion parameters. But here’s the kicker: it was only trained on about three hundred billion tokens of data. In hindsight, that model was like a massive university library that only had about three shelves of books in it. It had all this capacity, all these "synapses," but it hadn't actually seen enough language to fill them up.
So why did we think that was the way to go? Was the math just wrong, or were we just distracted by the shiny new toy of "massive parameter counts"?
It wasn't that the math was "wrong" in a vacuum; it’s that the experiments were set up in a way that favored larger models, because larger models converge faster early in training. A big part of it, we now know, was that those early runs used a learning-rate schedule that wasn't tuned to the length of training, which undersold how much a smaller model improves if you just keep feeding it data. The OpenAI team at the time didn't realize that if you kept training a smaller model for much longer, it would eventually overtake the big, "lazy" model. They were looking at a snapshot, not the full marathon.
It’s like saying a tall kid is better at basketball because he can reach the hoop easier when they’re both five years old. You’re ignoring the fact that the shorter kid might spend ten thousand hours practicing his jump shot and eventually become the better player.
That’s a great way to put it. And because GPT-3 was so much better than GPT-2, everyone just assumed the Kaplan laws were the final word. We saw this arms race of "parameter counting." People were talking about trillion-parameter models, ten-trillion-parameter models. It was a spec-sheet war.
But wait, if everyone was following Kaplan, why didn't we see the wheels fall off sooner? Did GPT-3 actually feel "empty" to the users back then?
Not really, because compared to what came before, it was a miracle. But researchers started noticing "diminishing returns." You’d double the size of the model, spend ten times the money, and it would only be, say, five percent better at logic. That’s when people started suspecting that the "parameter-first" approach was hitting a wall of inefficiency.
Until 2022. Enter the "Chinchilla" paper. I love the name, by the way. It sounds so much less intimidating than "massive neural network architecture."
It’s a classic DeepMind move. They named the model "Chinchilla" to follow their trend of animal names, but the paper itself—"Training Compute-Optimal Large Language Models"—changed the entire trajectory of the field. Jordan Hoffmann and the team basically went back and re-did the Kaplan experiments, but they did them much more rigorously. They trained over four hundred models of different sizes on different amounts of data to see where the actual sweet spot was.
And what did they find? I’m guessing it wasn't "make the model bigger."
It was a total pivot. They found that for every doubling of your compute budget, you should scale the model size and the data size equally. One-to-one. Not the five-to-one ratio Kaplan suggested.
Wow. So if I have double the chips, I give the model double the "brain cells" and double the "books"?
Precisely. And they derived a rule of thumb that is now legendary in AI engineering: the "twenty-token rule." To be "compute-optimal"—meaning you are getting the absolute most intelligence out of every dollar you spend on electricity—you need about twenty tokens of training data for every single parameter in the model.
Wait, let’s do the math on GPT-3 then. One hundred and seventy-five billion parameters. If we follow the Chinchilla rule, how much data should it have had?
It should have been trained on at least three point five trillion tokens. But remember, it was only trained on three hundred billion. GPT-3 was under-trained by a factor of more than ten. It was a massive, empty vessel.
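The twenty-token rule makes that under-training check a one-liner. Using the episode's figures for GPT-3:

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Chinchilla rule of thumb: roughly 20 training tokens per parameter."""
    return params * tokens_per_param

gpt3_params = 175e9          # 175 billion parameters
needed = chinchilla_optimal_tokens(gpt3_params)
actual = 300e9               # tokens GPT-3 was actually trained on

print(needed / 1e12)         # 3.5 -> trillion tokens it "should" have seen
print(needed / actual)       # ~11.7 -> under-trained by more than 10x
```

Same arithmetic as in the conversation: 175 billion times twenty is 3.5 trillion, and 3.5 trillion over 300 billion is a shortfall of more than ten.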
That’s wild. So when DeepMind built the Chinchilla model itself, they went the other direction, right?
They did. They built Chinchilla with only seventy billion parameters—less than half the size of GPT-3—but they fed it one point four trillion tokens. And guess what? This smaller, "well-read" model absolutely crushed GPT-3 and even DeepMind's own larger model, Gopher, on almost every benchmark. It was smarter, faster, and cheaper to run, all because it was "compute-optimal."
This feels like a massive "I told you so" moment for anyone who likes efficiency. But help me understand the "why" here, Herman. Why does a smaller model with more data beat a bigger model with less? If I have more parameters, shouldn't the model have a higher "ceiling" for what it can learn?
In theory, yes, a bigger model has a higher ceiling. But think of it this way: parameters are like the "storage capacity" for patterns. Data is the source of those patterns. If you have a huge storage warehouse but only ten boxes to put in it, most of the warehouse is just empty space. The model never actually learns the subtle, complex relationships between words because it hasn't seen enough examples to distinguish signal from noise. It’s "sparse." When you train a smaller model on way more data, you are forcing it to pack as much information as possible into every single parameter. You’re making it "dense." It becomes a much more efficient representation of the world.
It’s like the difference between a guy who owns a thousand books but has only read the dust jackets, and a guy who owns fifty books but has memorized every line. In a trivia contest, I’m betting on the guy with fifty books.
Every time. And that realization—that we were over-parameterized and under-trained—sent shockwaves through the industry. It’s why we haven't seen GPT-5 yet. Because if you want to make GPT-5 significantly better than GPT-4, and you want to increase the parameter count, the Chinchilla laws tell you that you have to find a staggering amount of data to justify it.
Well, let’s look at those numbers. If GPT-4 is rumored to be around one point eight trillion parameters—which, again, is just an estimate, OpenAI hasn't confirmed that—but if we use that as a baseline, what does a "compute-optimal" GPT-5 look like?
If you wanted to make a model that was ten times larger than GPT-4—say, eighteen trillion parameters—the Chinchilla rule says you would need three hundred and sixty trillion tokens of high-quality data.
Three hundred and sixty trillion. Just for context, how much "high-quality" text is actually out there on the internet?
This is where we hit the "Data Wall." Estimates vary, but most researchers think the total amount of high-quality, human-generated text publicly available—books, scientific papers, Wikipedia, high-quality websites—is somewhere between ten trillion and thirty trillion tokens.
So we are off by an order of magnitude. Even if we scraped every single word ever written in human history, we still wouldn't have enough data to train a "compute-optimal" GPT-5 on the scale that people are expecting.
That’s the "Data Wall." It’s not just a technical challenge; it’s a physical limit of our digital civilization. We’ve collectively written and digitized a finite amount of language, and we’re already getting close to the bottom of the barrel.
But what about all the private data? I mean, the internet is just the tip of the iceberg, right? There are trillions of emails, Slack messages, DMs, and private corporate databases. Does that solve the wall?
It’s a double-edged sword. Yes, there is more data behind closed doors, but much of it is "low-entropy" or redundant. Think about your own email inbox. How much of that is actually "high-quality information" that would help an AI understand the world? It’s mostly "Meeting at 2 PM" or "Please see attached." To a model, that’s just noise. The "Wall" isn't about the total number of characters; it’s about the total amount of novel information. We are running out of unique ways to describe the world.
Well, that explains why every single AI lab is currently in a "data-hunger" arms race. I mean, we saw OpenAI sign those deals with Reddit, and News Corp, and the Financial Times. They are literally buying the "raw ore" they need to feed the Chinchilla beast.
It’s exactly what’s happening. They aren't just buying the content; they are buying the legal right to train on it. Because if you can't find more data, you can't scale your models effectively. And if you can't scale, you can't win.
But wait, Herman. If we’re running out of human-generated text, why don't we just let the AI write its own data? I mean, we’ve talked about "synthetic data" before. Why not just have GPT-4 write a trillion pages of high-quality "synthetic" books and then train GPT-5 on that?
This is where it gets really tricky. There’s a risk called "Model Collapse," which is a fancy way of saying a model starts eating its own "exhaust" and gets stupider over time. If you train a new model on the output of an old one, the errors and biases of the old model get amplified in a feedback loop. It’s like a copy of a copy—eventually, the image gets blurry and distorted until it’s unrecognizable.
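That copy-of-a-copy loop can be caricatured with a deliberately tiny toy: a bigram "model" with greedy decoding collapses a varied corpus onto its single most likely sentence, so anything retrained on its output has lost the tails entirely. Real model collapse is statistically subtler, but the mechanism rhymes. Everything here (corpus, names) is invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """A toy 'model': count which token follows which."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for a, b in zip(tokens, tokens[1:]):
            model[a][b] += 1
    return model

def generate(model, max_len=10):
    """Greedy decoding: always emit the single most frequent next token."""
    out, tok = [], "<s>"
    for _ in range(max_len):
        if not model[tok]:        # no known continuation: stop
            break
        tok = model[tok].most_common(1)[0][0]
        out.append(tok)
    return " ".join(out)

human = ["the cat sat", "the dog ran", "the cat ran", "a dog sat"]
synthetic = [generate(train_bigram(human)) for _ in range(4)]
print(len(set(human)), "->", len(set(synthetic)))  # 4 -> 1
```

Four distinct human sentences go in; one sentence comes out, four times over. A second-generation model trained on `synthetic` would know nothing but that one mode.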
It’s the "Hapsburg" effect for AI models. You’re inbreeding them and getting weird, chin-heavy results.
It’s exactly that. But the counter-argument is that you can use a model to "curate" better data or generate "reasoning traces" that are actually more useful than raw human text. That’s what we’re seeing with OpenAI’s o1—or "Strawberry"—series. They are using "test-time compute," which is a fancy way of saying they are letting the model "think longer" before it answers. That creates a new kind of data—a "reasoning trace"—that might be much more valuable than a random Reddit comment.
So we’re moving from "more data" to "better data." Or even "harder data."
It’s a shift from "data quantity" to "data chemistry." We’re learning that not all tokens are created equal. A token from a physics textbook is worth way more to a model than a token from a YouTube comment section. So, instead of just "hoovering" the entire internet, labs are now focused on "data curation"—finding the highest-quality, most information-dense data possible and training on that.
Does this mean we might see models get smaller in the future? If we find the perfect "curated" dataset, could a 10-billion parameter model beat GPT-4?
We are already seeing hints of that. Look at models like Phi-3 from Microsoft. It’s tiny, but it punches way above its weight because it was trained on "textbook-quality" data. The goal is to reach the same level of intelligence with a smaller "footprint." If you have the "perfect" curriculum, you don't need a trillion parameters to store it.
But back to the "compute-optimal" thing. Does this mean the Kaplan laws were just... dead? Like, should we all just stop making big models and focus on tiny, perfect ones?
Not exactly. The Kaplan laws aren't "wrong"—they still describe the relationship between size, data, and compute. It’s just that Chinchilla refined the "optimal" path. But here’s where it gets even more interesting: there’s a difference between "compute-optimal" and "inference-optimal."
"Inference-optimal." That sounds like the "usage phase," right? When we actually ask the model a question?
"Compute-optimal" (Chinchilla) tells you the most efficient way to train a model. But once you’ve trained it, you have to run it. And a massive model like GPT-4 is incredibly expensive to run. Every time you ask it a question, you are firing up trillions of parameters.
So if I’m Meta, and I want to give Llama to everyone in the world for free, I care way more about how much it costs to run than how much it cost to train.
That’s the "Inference-Optimal" pivot. Meta’s Llama 3 is the perfect example. They released an eight-billion-parameter model and a seventy-billion-parameter model. According to Chinchilla, they only "needed" to train the seventy-billion model on about one point four trillion tokens. But Meta trained it on fifteen trillion tokens.
Fifteen trillion! That’s more than ten times the "optimal" amount! Why would they burn all that extra electricity if they didn't have to?
Because they weren't trying to be "efficient" during training; they were trying to be efficient during use. By over-training that seventy-billion parameter model, they made it as smart as a much larger model. Now, every single time someone uses Llama 3, Meta saves money because they are running a smaller model that has been "super-charged" by all that extra data.
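The gap between the Chinchilla-optimal budget and what Meta reportedly did is easy to quantify, using the figures from the conversation:

```python
# Chinchilla-optimal vs. Llama-3-style over-training (reported figures).
params = 70e9
chinchilla_tokens = 20 * params   # 1.4 trillion: the "compute-optimal" budget
llama3_tokens = 15e12             # what Meta reportedly trained on

print(chinchilla_tokens / 1e12)            # 1.4
print(llama3_tokens / chinchilla_tokens)   # ~10.7x past the optimal point
print(llama3_tokens / params)              # ~214 tokens per parameter, not 20
```

Over two hundred tokens per parameter is wildly "wasteful" by the training-cost yardstick, and a bargain by the inference-cost one.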
It’s like a chef who spends twenty-four hours slow-cooking a brisket so that it melts in your mouth in two seconds. It’s a huge investment of time and energy upfront, but the final experience is much better.
It’s a "pre-computation" of intelligence. You are burning as much electricity as possible during the training phase to "bake" as much knowledge as possible into a small, portable model. That’s the future of AI engineering—not just "bigger is better," but "denser is better."
So, let’s circle back to Daniel’s question. Why isn't GPT-5 just GPT-4 with more parameters?
Because if they just added parameters without also finding a massive, unprecedented source of high-quality data, they would be building a bigger, more expensive model that wasn't actually smarter. They would be moving away from the Chinchilla-optimal line. It would be a step backward in engineering efficiency. GPT-5 has to be a leap in how they use data, not just how much data they use.
This makes me think about the "System 2" stuff you mentioned—the OpenAI o1 models. If we can't find more "raw" data on the internet, maybe the path to "GPT-5 level" performance is actually through "reasoning" rather than "memorization."
That’s exactly what the "test-time compute" shift is all about. Instead of just scaling the "brain" (parameters) or the "library" (data), we are scaling the "thinking time" (inference compute). We’re saying, "Okay, you’ve read all the books, now take five minutes to actually solve this math problem instead of just guessing the next word."
It’s the "show your work" phase of AI.
It is. And that "work"—those reasoning traces—become a new kind of high-quality data that can be fed back into the next generation of models. It’s a way to bypass the "Data Wall" by letting the AI generate its own "thinking" data.
But does that actually solve the problem of new knowledge? I mean, if the AI is just "reasoning" based on what it already knows, can it ever discover something truly new, like a new law of physics? Or is it just rearranging the furniture?
That is the multi-billion dollar question. Some argue that reasoning is a form of discovery. When a mathematician proves a new theorem, they aren't necessarily looking at "new data" from the outside world; they are finding new connections within the logic they already possess. If AI can do that, it can effectively "generate" its own progress.
So, for the AI engineers out there, what’s the takeaway? If I’m building a model today, should I be obsessing over parameter counts, or should I be out there literally hunting for data like a treasure hunter?
You should be a "data alchemist." The era of brute-force scaling is over. The "secret sauce" now is in the curation, the filtering, and the "data chemistry" of your training set. If you can find a way to make your data five percent more information-dense, that’s far more valuable than adding five percent more parameters. And if you’re deploying a model, you have to think about that "inference-optimal" trade-off. Is it worth over-training a smaller model to save money on your cloud bill later? In 2026, the answer is almost always yes.
It’s funny how we went from thinking AI was this infinite, digital frontier to realizing it’s actually constrained by the same things we are—limited resources and the need for high-quality instruction.
It’s a grounding realization. It reminds us that intelligence isn't just a byproduct of "bigness." It’s a byproduct of learning. And learning requires good teachers and good books. Even for a trillion-parameter neural network.
Speaking of good books, I read an interesting fact the other day—did you know that the entire text of Wikipedia, which feels like the sum of human knowledge, is only about four billion tokens?
That’s exactly right. And four billion tokens is nothing to a modern model. Llama 3 ate Wikipedia for breakfast in the first few minutes of its training run. It shows you how "small" our collective written knowledge actually is when compared to the appetite of these machines. We are trying to feed a blue whale with a teaspoon.
Well, before we wrap up, I should mention that if you’re enjoying this deep dive into AI engineering, you should check out Episode 1839, "AI’s Data Kitchen: From Hoovering to Fine-Tuning." It goes into the "messy" side of how these labs actually clean and prepare the data we’ve been talking about today. It’s a great companion to this episode.
It really is. It shows you the "blue-collar" side of the AI revolution—the people and processes that turn raw internet "garbage" into the "gold" that feeds the scaling laws.
So, what’s the "final boss" here, Herman? If we hit the "Data Wall" and we can't find more data, and we can't generate enough high-quality synthetic data... does AI just stop getting smarter?
I don't think it stops; I think it "evolves" in a different direction. We might see models that are more "active learners"—models that can interact with the real world, perform experiments, and generate their own "experience" data. This is where robotics comes in. If a robot moves through a room, it’s generating "video tokens" and "sensor tokens" that have never been seen before. That’s a whole new ocean of data.
So the "Data Wall" for text might just be the "Starting Line" for embodied AI?
Or we might see a total shift in architecture—something that doesn't follow the Chinchilla laws at all. But for now, the "Data Hunger Games" are in full swing. Whoever secures the most "raw ore" wins the next round.
It’s a wild time to be watching this stuff. And honestly, it makes me feel a little better about my own "slow" learning style. If a trillion-dollar AI lab has to struggle with "data quality," I can't feel too bad about needing a few extra hours to understand a physics paper.
Quality over quantity, Corn. It’s the law of the universe, and now it’s the law of the silicon, too.
Well, that’s a wrap on Episode 1994. Big thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thank you to Modal for providing the GPU credits that power this show—they make it possible for us to explore these massive topics without hitting our own "compute wall."
This has been My Weird Prompts. If you enjoyed the show, please leave us a review on your favorite podcast app—it really does help us reach new listeners who are curious about the weird world of AI.
You can find us at myweirdprompts dot com for our full archive and RSS feed. Catch you in the next one.
Bye, everyone.