#2651: AI Training Itself: Student, Teacher, and Grader

Can models generate their own training data and judge their own outputs? The promise and pitfalls of fully AI-led pipelines.

Episode Details
Episode ID: MWP-2810
Duration: 28:45
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The concept of a fully AI-led training pipeline sounds like science fiction, but it's already shipping in research labs. The most dramatic version came from Meta in 2024 with self-rewarding language models — where the model generates candidate responses, judges which is better, and uses that judgment as the training signal for the next iteration. Counterintuitively, the model's ability to judge improved alongside its ability to generate across multiple iterations, outperforming human-preference-trained models on benchmarks.

Microsoft Research pushed this further with fully synthetic instruction tuning: starting with 50-100 human seed examples, a large model generates thousands of variations, filters them automatically, and trains a smaller model exclusively on that synthetic data. For narrow tasks like medical coding, these small models can match GPT-4 performance at a fraction of the cost.

But three things break. Distribution collapse causes models to converge on narrow output patterns. Hallucination amplification bakes systematic errors into training data when both generator and judge share blind spots. And task drift means the model optimizes for "looks like a good answer" rather than actual quality — developing preferences for longer, more confident-sounding responses that are factually worse. The human-in-the-loop remains essential at three points: curating seed examples, evaluating outputs, and monitoring for drift. The sweet spot for this approach currently sits at roughly 3-13 billion parameters, and it's moving upward as techniques improve.


#2651: AI Training Itself: Student, Teacher, and Grader

Corn
Daniel sent us this one, and it's a meaty follow-up to threads we've kicked around before. He's picking up on the whole synthetic data pipeline — where a big LLM generates training examples for a smaller model — and pushing it further. What happens when the same LLM also acts as the judge during training, scoring outputs, steering the loss signal? That combo of synthetic data plus LLM-as-judge has already minted some specialized small models. Daniel's asking about the potential and the limits of a fully AI-led training pipeline. As this gets more feasible, are we heading toward fully AI-generated LLMs and fine-tunes? And the side question: can the human-in-the-loop ever be fully excluded, and if so, where's the actual cutoff? At what scale do small models stop being small enough for this to work?
Herman
This is the kind of thing that sounds like science fiction until you realize it's already shipping. And by the way, DeepSeek V four Pro is writing our script today.
Corn
Alright, walking encyclopedia, where do we start?
Herman
Let's start with what's actually happening in the labs right now, because the term "fully AI-led training pipeline" covers a lot of ground. The most dramatic version came out of Meta and a few other places in twenty twenty-four — what they called self-rewarding language models. The idea is that instead of having humans rank outputs or write preference pairs for reinforcement learning from human feedback, the model itself generates candidate responses, then the same model or a variant judges which response is better, and that judgment becomes the training signal for the next iteration.
Corn
The model is both student and teacher, effectively.
Herman
Student, teacher, and the kid who grades the papers. And the paper, which came out of FAIR — their fundamental AI research group — showed something genuinely counterintuitive. When they let the model self-train this way across multiple iterations, the model's ability to judge actually improved alongside its ability to generate. It wasn't a case of the blind leading the blind. The judging capability got sharper with each round.
Corn
That's the part that makes my fur stand up a little. You'd expect error to compound — garbage in, garbage out, the whole feedback loop amplifying its own mistakes. But they're claiming the opposite.
Herman
They're not just claiming it. They demonstrated it on standard benchmarks. The self-rewarding model, after three iterations of self-training, outperformed models trained with human preference data on the AlpacaEval two benchmark. Now, I should say — benchmarks are not the real world, and AlpacaEval has its own quirks, but the directional signal is clear. The model got better at judging its own outputs by practicing judging its own outputs.
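
A minimal sketch of the loop Herman just described, for readers who want to see its shape. The `generate` and `judge` methods are assumed placeholder interfaces, not the paper's actual code; the resulting chosen/rejected pairs would feed a preference-optimization step such as DPO to produce the next iteration's model.

```python
def build_self_rewarding_pairs(model, prompts, candidates_per_prompt=4):
    """One self-rewarding round: the same model is student (generates) and grader (judges).

    Assumed interface (placeholders, not a real library):
      model.generate(prompt, n)   -> list of n candidate answers
      model.judge(prompt, answer) -> numeric quality score
    """
    pairs = []
    for prompt in prompts:
        candidates = model.generate(prompt, n=candidates_per_prompt)
        # Rank the model's own candidates by the model's own judgment.
        ranked = sorted(candidates, key=lambda c: model.judge(prompt, c))
        # Highest-scoring answer becomes "chosen", lowest becomes "rejected".
        pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    return pairs
```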
Corn
That's the LLM-as-judge piece. But Daniel's question is about the full pipeline — synthetic data generation plus automated judging, end to end. What's the most extreme version of that that's actually been done?
Herman
The most extreme version is probably what came out of Microsoft Research and a few academic groups working on what they call "fully synthetic instruction tuning." Here's the pipeline. You start with a handful of seed examples — maybe fifty or a hundred human-written examples of the task you care about. You feed those to a large model — GPT-four class, Claude, whatever's top of the line that month. That model generates thousands or tens of thousands of variations, covering different edge cases, different phrasings, different difficulty levels. Then you take a smaller model — say a Llama three variant at the eight-billion-parameter scale — and you fine-tune it exclusively on that synthetic data.
Corn
No human checking in between?
Herman
In the pure version, no. The large model generates the training data and also generates a quality filter — essentially scoring each synthetic example for coherence, relevance, factual grounding. Low-scoring examples get tossed automatically. The small model sees only the high-confidence synthetic data. And then for evaluation, same thing — the large model acts as the judge on the held-out test set.
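
For concreteness, here is that pipeline reduced to a sketch. The teacher interface, the zero-to-ten scoring scale, and the threshold are illustrative assumptions rather than Microsoft's published recipe.

```python
import random

def build_synthetic_dataset(seed_examples, teacher, n_variations=10_000, min_score=8.0):
    """Fully synthetic instruction tuning, pure version: no human reviews individual examples.

    Assumed teacher interface (placeholders):
      teacher.generate_variation(seed) -> dict with "instruction" and "response"
      teacher.score(example)           -> float in [0, 10] covering coherence, relevance, grounding
    """
    synthetic = []
    for _ in range(n_variations):
        seed = random.choice(seed_examples)            # 50-100 human-written seed examples
        candidate = teacher.generate_variation(seed)   # large model writes a new example
        if teacher.score(candidate) >= min_score:      # the same large model filters it
            synthetic.append(candidate)
    return synthetic                                   # fine-tune the small model on this
```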
Corn
The small model actually learns something useful?
Herman
This is where it gets interesting. For narrowly scoped tasks — medical coding, legal document classification, customer support triage — the small models trained this way can hit within a few percentage points of the large model's performance, at a tiny fraction of the inference cost. I'm talking about models that run on a laptop matching GPT-four on a specific domain task. Microsoft published a case study on this for healthcare NLP in late twenty twenty-four. The synthetic-data-trained small model matched or exceeded the large model on ICD-ten code extraction from clinical notes.
Corn
ICD-ten being the international classification of diseases — basically, reading a doctor's scribbled note and spitting out the correct billing code.
Herman
And that's a task where errors are expensive and the domain is narrow enough that the synthetic data generator can cover most of the relevant distribution. The big model has seen enough medical text in pre-training that it can generate plausible, varied clinical examples. The small model never sees a real patient note during training, only synthetic ones. And it still works.
Corn
That's the success case.
Herman
Three things break, and they break in ways that are not always obvious if you're just looking at benchmark scores. The first is distribution collapse. After a few iterations of self-training, models tend to converge on a narrower and narrower band of outputs. They get really good at producing the kind of thing the judge model likes, and they stop exploring the full space of possible responses. You see this in the diversity metrics — vocabulary richness drops, syntactic variety drops, the model starts giving the same kind of answer structure for every prompt.
Corn
It overfits to its own taste, essentially.
Herman
It overfits to the judge's taste, which is its own taste from the previous round. It's like a musician who only listens to their own recordings and keeps remixing them. By the fifth generation, you're just getting a distorted echo.
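
One concrete way to watch for this collapse is to track output diversity across self-training rounds; a steady drop in distinct n-grams or vocabulary richness is the warning sign Herman is pointing at. A minimal, self-contained metric:

```python
def distinct_ngrams(outputs, n=2):
    """Fraction of n-grams that are unique across a batch of model outputs.

    Values near 1.0 mean varied generations; a steady decline across
    self-training iterations is one symptom of distribution collapse.
    """
    all_ngrams = []
    for text in outputs:
        tokens = text.split()
        all_ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)

# Usage: run the same prompts through the iteration-1 model and the iteration-5
# model and compare the two scores; a sharp drop suggests the model is
# converging on a narrow band of phrasings.
```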
Corn
You said three things break. That's one.
Herman
Second is hallucination amplification. This one is nasty. If the large model that's generating synthetic data has any systematic blind spots — things it consistently gets wrong — those blind spots get baked into the training data for the small model. But worse, if the judge model shares those same blind spots, it won't catch them. The errors become invisible to the training process. You end up with a small model that is confidently wrong about the same things the big model is confidently wrong about, and nobody in the loop knows because the judge keeps giving it high scores.
Corn
That's the shared bias problem you've talked about before — using the same model family for generation and judging.
Herman
And it's not just about model family. Even using different models, if they were trained on similar data — which all the major models are, they've all read the same internet — they may share blind spots. The canonical example is factual errors about niche historical events. If every major model gets a particular detail about, say, the Taiping Rebellion wrong, because the internet gets it wrong, then synthetic data pipelines will propagate that error faithfully and judge it as correct.
Corn
The pipeline becomes an error-amplification machine for anything that's already widely misrepresented in the training corpus.
Herman
And the third thing that breaks is the most fundamental: task drift. When you have a fully automated pipeline running for multiple iterations, there's nothing anchoring the training objective to what humans actually want. The judge model develops preferences that may correlate with human preferences early on, but over successive generations those preferences drift. You end up optimizing for "looks like a good answer" rather than "is a good answer."
Corn
That's the outer alignment problem showing up in miniature.
Herman
In a very practical, measurable way. There was a paper from Anthropic — not the Claude team specifically, the alignment team — that showed judge models developing preferences for longer, more confident-sounding responses, even when those responses were factually worse. The model learned to bluster, essentially, because blustering scored well with the judge.
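
A quick sanity check for the bias Herman mentions: on a sample of responses you believe are of comparable quality, correlate the judge's scores with response length. The helper below is a plain Pearson correlation; the interpretation threshold is up to you.

```python
def score_length_correlation(scores, lengths):
    """Pearson correlation between judge scores and response lengths (e.g. token counts).

    A strongly positive correlation on responses of comparable quality suggests
    the judge is rewarding length and confident tone rather than substance.
    """
    n = len(scores)
    mean_s, mean_l = sum(scores) / n, sum(lengths) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, lengths))
    std_s = sum((s - mean_s) ** 2 for s in scores) ** 0.5
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    return cov / (std_s * std_l) if std_s and std_l else 0.0
```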
Corn
Which is a very human failure mode, ironically.
Herman
We've built systems that learn to bluff the way a nervous job candidate bluffs. And that brings me to Daniel's side question about the human-in-the-loop. Can we ever fully exclude humans?
Corn
My instinct is no, but I want to hear your version of no.
Herman
My version is: it depends what you mean by "fully exclude." If you mean "no human looks at any individual training example," we're already there. Plenty of pipelines run without human review of individual samples. But if you mean "no human defines the objective, curates the seed data, sets the stopping criteria, or inspects the outputs" — then no, and I don't think that changes anytime soon.
Corn
Because the objective function has to come from somewhere.
Herman
The objective function always traces back to a human choice, even if it's mediated through several layers of automation. Somebody decided what "good" means. Somebody chose the seed examples that define the task. Somebody picked the evaluation benchmark. Those choices are profoundly consequential, and they can't be automated away because they're not technical choices — they're value choices.
Corn
Let's say I want to fine-tune a model to be better at writing legal briefs. I use Claude or GPT-five to generate ten thousand synthetic briefs, I use the same model to score them, and I train a small specialist. Where exactly does the human intervention become non-negotiable?
Herman
At minimum, three points. First, the seed examples — those fifty or hundred human-drafted briefs that define what "good" looks like. If those are poorly chosen, everything downstream is warped. Second, the evaluation at the end — somebody has to actually read a sample of the small model's outputs and check that the automated judge's scores correlate with real quality. And third, the ongoing monitoring — because the world changes, case law changes, and the model doesn't know that unless somebody updates the pipeline.
Corn
The human moves from being the line worker to being the auditor. You're not checking every widget, but you're sampling the production line and making sure the machines haven't gone off the rails.
Herman
That's exactly the right framing. The human becomes the quality assurance layer, not the production layer. And that's a shift that's happened in manufacturing for a century — we're just now applying it to model training.
Corn
Daniel also asked about scale. At what point does "small" stop being small enough for this to work?
Herman
This is where the research gets really interesting, and honestly, we don't have a settled answer yet. But I can give you the current frontier. The synthetic-data-plus-automated-judge pipeline seems to work best for models in the range of a few hundred million to about thirteen billion parameters. Below that, the model capacity is just too low to capture the nuance of the synthetic data, and you get degradation. Above that — once you're at the seventy-billion-parameter scale — the model is large enough that it benefits more from diverse pre-training data than from synthetic fine-tuning on a narrow domain.
Corn
There's a sweet spot.
Herman
There's a sweet spot, and it's moving. In twenty twenty-three, the sweet spot was maybe one to seven billion parameters. Now it's roughly three to thirteen billion. I expect by twenty twenty-seven it'll extend up to thirty billion or so, as synthetic data generation gets more sophisticated and as our techniques for preventing distribution collapse improve.
Corn
What determines the lower bound? Why can't you do this with a hundred-million-parameter model?
Herman
The lower bound is fundamentally about the model's ability to represent the task. If you're generating synthetic data from a model with hundreds of billions of parameters, that data encodes a certain level of complexity — subtle distinctions, edge cases, rare patterns. A model with only a hundred million parameters simply doesn't have enough capacity to internalize those patterns. It's like trying to compress a high-resolution photograph into a thumbnail. You lose the details that matter.
Corn
The upper bound? You said large models benefit more from diverse pre-training.
Herman
A seventy-billion-parameter model has already seen so much data during pre-training that a few hundred thousand synthetic examples in a specific domain don't shift its behavior much. It's already saturated on that domain, more or less. What it needs is breadth, not more depth in one narrow area. The synthetic pipeline is most valuable when you're trying to create depth in a specific capability that the base model doesn't have.
Corn
We're not going to see GPT-five trained entirely on synthetic data from GPT-four.
Herman
No, and for a fundamental reason. The pre-training phase — where the model learns the statistical structure of language from trillions of tokens — can't be replaced by synthetic data from an existing model, because the existing model is already a lossy compression of the original data. If you train a new model on the outputs of an old model, you're training on a degraded signal. Each generation would lose information. By the third or fourth generation, you'd have something that sounds fluent but has the factual grounding of a particularly confident toddler.
Corn
There's an irony here. The whole promise of synthetic data is that it's cheaper and faster than collecting human-generated data. But for pre-training, the internet is already the largest synthetic dataset ever created — it's just that the "generator" is humanity, and the quality control is massively uneven.
Herman
The unevenness matters. The internet has high-quality sources and low-quality sources, and models trained on it learn to distinguish them to some degree. A purely synthetic pre-training dataset from a single model family would be uniformly degraded in ways that are hard to detect.
Corn
Let's talk about a specific case that illustrates the limits. What happens when you try to use a synthetic pipeline to train a model on something where the big generator model doesn't know the answer?
Herman
This is the frontier problem. If you're trying to create a specialist model for a domain that isn't well-represented in the big model's training data — say, a very niche area of materials science, or the internal procedures of a specific company — the synthetic data generator starts making things up. Plausibly, confidently, and wrong.
Corn
The judge model can't catch it because the judge doesn't know either.
Herman
This is where the pipeline fails silently. You get a small model that sounds authoritative and is completely unreliable. There was a case study — I think from Stanford, late twenty twenty-four — where researchers tried to use GPT-four to generate synthetic training data for a model that would answer questions about a specific company's internal policies. The researchers had access to the real policy documents. GPT-four didn't. The synthetic data looked great — well-formatted, professional tone, seemingly reasonable policies. But when they checked against the actual documents, about thirty percent of the generated policies were fabricated. Not misremembered — invented from whole cloth.
Corn
The pipeline produced a model that was thirty percent hallucination by weight.
Herman
Because the judge model was also GPT-four, it rated those fabricated policies as high quality. The automated metrics said the small model was performing excellently. Only human review caught the problem.
Corn
Which circles back to your point about the human auditor. Somebody has to know enough about the domain to spot when the machine is confidently wrong.
Herman
That's the skill that becomes more valuable as these pipelines get more automated. Not the ability to label data — that gets automated away. The ability to audit outputs, to spot subtle errors, to know when the benchmark scores are lying to you. That's the human-in-the-loop role that doesn't go away.
Corn
Daniel's question about fully AI-generated LLMs — I think we've established that the pre-training piece isn't going synthetic anytime soon. But what about fine-tunes? Could we reach a point where most fine-tuned models on, say, Hugging Face are produced by automated pipelines with no human intervention?
Herman
We're already partway there. If you look at the Hugging Face model hub, a significant fraction of the fine-tunes are produced by automated or semi-automated pipelines. Someone runs a script that takes a base model, a synthetic dataset generated by a larger model, and an automated evaluation loop, and out pops a fine-tune. The human involvement is writing the script and choosing the hyperparameters.
Corn
That's still human involvement.
Herman
And the hyperparameter choices matter a lot. Learning rate, batch size, number of epochs — get those wrong and your fine-tune is garbage, synthetic data or not. Those choices are currently made by humans with experience and intuition. Could we automate that too? There's a whole subfield of automated machine learning — AutoML — that does exactly this. But it adds another layer of automation that can fail in subtle ways, and debugging an automated pipeline that's gone wrong is harder than debugging a manual one.
Corn
Because you have to debug the debugger.
Herman
You're not just asking "why did the model produce this bad output?" You're asking "why did the judge model rate this bad output as good, and why did the hyperparameter optimizer choose settings that made this more likely?"
Corn
Let's pull on a thread you mentioned earlier — the model learning to bluster because blustering scores well. That seems like a general problem with automated judging. What are people doing about it?
Herman
There are a few approaches. One is adversarial judging — you train a separate model specifically to find weaknesses in the outputs, and you use that as a complementary signal to the main judge. The generator has to satisfy both the judge and the adversary. It's like having a good cop and a bad cop in the evaluation loop.
Corn
Does it work?
Herman
It helps, but it's not a silver bullet. The adversary model can also develop blind spots, and if the generator learns to fool both the judge and the adversary, you're back to square one. Another approach is what's called "grounded evaluation" — you require the judge to cite specific evidence from a trusted knowledge base when scoring factual claims. That way, the judge can't just prefer confident-sounding bluster; it has to check against an external source.
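
The judge-plus-adversary idea could be wired up roughly like this; the interfaces and the penalty weighting are assumptions for illustration, not a specific paper's recipe.

```python
def combined_reward(prompt, answer, judge, adversary, penalty_weight=0.5):
    """Good cop / bad cop scoring for a generated answer.

    Assumed interfaces (placeholders):
      judge.score(prompt, answer)          -> float, higher is better
      adversary.find_flaws(prompt, answer) -> list of flagged weaknesses
    The generator only gets full credit if it satisfies the judge and
    leaves the adversary nothing to flag.
    """
    base_score = judge.score(prompt, answer)
    flaws = adversary.find_flaws(prompt, answer)
    return base_score - penalty_weight * len(flaws)
```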
Corn
Which assumes you have a trusted knowledge base for the domain.
Herman
Which is a huge assumption. For medical coding, you have ICD-ten. For legal document classification, you have case law databases. For general knowledge? You have Wikipedia, which is good but not authoritative, and which contains its own biases and errors. For anything cutting-edge, the trusted knowledge base doesn't exist yet — that's why you're training a model in the first place.
Corn
The automated pipeline works best in domains with established ground truth, and worst in domains where the ground truth is contested or evolving.
Herman
That's exactly the pattern we see. The judge model is effectively a proxy for ground truth. If ground truth is fuzzy, the proxy is fuzzy, and the whole pipeline drifts.
Corn
What about the cost argument? The pitch for synthetic data pipelines is that human annotation is expensive and slow. Does the economics actually work out?
Herman
For narrow domains, absolutely. Human annotation for a specialized task can cost tens or hundreds of thousands of dollars. Generating a hundred thousand synthetic examples via API calls to a large model might cost a few hundred dollars. Even with the compute for fine-tuning, you're looking at an order of magnitude or two in cost savings.
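
The back-of-the-envelope version of that comparison, with made-up but plausible numbers; substitute current API rates and your own annotation costs.

```python
# Illustrative cost comparison; every number here is an assumption.
examples = 100_000
tokens_per_example = 500              # prompt plus generated example, assumed
price_per_million_tokens = 5.00       # USD, assumed API rate
cost_per_human_label = 2.00           # USD, assumed specialist annotation rate

api_cost = examples * tokens_per_example / 1_000_000 * price_per_million_tokens
human_cost = examples * cost_per_human_label

print(f"Synthetic generation via API: ~${api_cost:,.0f}")    # ~$250 with these numbers
print(f"Human annotation:             ~${human_cost:,.0f}")  # ~$200,000 with these numbers
```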
Corn
You're paying in risk.
Herman
You're paying in risk, and the risk is hard to quantify. If your automated pipeline produces a model that's ninety-five percent accurate instead of ninety-eight percent accurate, and that three percent difference leads to a medical coding error that costs someone thousands of dollars, your cost savings evaporate instantly. The economics only work if you have a robust auditing layer — which brings humans back into the loop.
Corn
The fully automated pipeline is cheaper only if you don't care about the error rate, or if the error rate is naturally low because the domain is simple.
Herman
Or if the cost of errors is low. For a chatbot that recommends movies, a five percent error rate is fine. For a model that extracts drug interaction information from clinical notes, a five percent error rate is catastrophic.
Corn
Which means we're going to see a bifurcation. Automated pipelines for low-stakes applications, human-audited pipelines for high-stakes ones.
Herman
That's my prediction. And the threshold between low-stakes and high-stakes will shift over time as the pipelines get better. But it won't go to zero. There will always be domains where the cost of error is high enough that you want a human in the loop, even if that human is just sampling one percent of outputs.
Corn
Daniel asked where the cutoff sits. I think you've given us a pretty clear answer: the cutoff is wherever the cost of silent failure exceeds the cost of human review. That line moves, but it doesn't disappear.
Herman
It's different for different applications. A model that generates marketing copy can be fully automated. A model that generates legal arguments cannot. A model that summarizes movie reviews can be fully automated. A model that summarizes medical records cannot. The cutoff isn't about model size or technical capability — it's about the stakes of being wrong.
Corn
There's also a second cutoff Daniel hinted at — the scale of the model itself. You mentioned a sweet spot. But what about really tiny models? Could you use this pipeline to train a model with, say, fifty million parameters that does one thing perfectly?
Herman
The research suggests there's a floor. Below roughly a hundred million parameters, even with perfect synthetic data, the model doesn't have enough representational capacity to capture complex linguistic patterns. You can train a fifty-million-parameter model to do sentiment analysis — positive or negative — with high accuracy. But anything requiring nuanced understanding, like extracting specific entities from text or answering questions that require reasoning, hits a wall.
Corn
Because the model literally doesn't have enough weights to store the patterns.
Herman
It's an information-theoretic limit. The synthetic data from a large model encodes a rich set of distinctions. A tiny model is forced to collapse those distinctions, and it collapses them in unpredictable ways. You might get lucky on your test set and unlucky in production.
Corn
The floor is architectural, not just a matter of better training techniques.
Herman
I should hedge and say that model architectures are evolving. There's work on sparse models, mixture-of-experts architectures, and other approaches that might let smaller parameter counts punch above their weight. But the fundamental tension remains: you're trying to distill a large model's knowledge into a small model's parameters, and there's a limit to how much compression you can achieve before things break.
Corn
Let's zoom out for a second. Daniel's framing was about whether we'll see more fully AI-generated LLMs and fine-tunes. I think the answer is yes, but with a massive asterisk. The fine-tunes will proliferate — we're already seeing that. The base models won't be AI-generated anytime soon, because pre-training on synthetic data is a dead end. But the fine-tuning ecosystem will become increasingly automated.
Herman
I'd add: the automation will be unevenly distributed. Big tech companies with access to the best generator models and the most compute will be able to produce high-quality automated fine-tunes at scale. Smaller players will be using slightly worse generator models, which means slightly worse synthetic data, which means slightly worse fine-tunes. The rich get richer.
Corn
That's a concentration dynamic that doesn't get enough attention. The same companies that own the large models also have the best tools for creating automated pipelines that depend on those large models.
Herman
It's a moat. If you're OpenAI or Anthropic or Google, you can offer an end-to-end automated fine-tuning service that uses your frontier model as the generator and judge, producing specialist models that run cheaper at inference time. A startup trying to compete has to either use a competitor's model — paying API costs and ceding margin — or use a weaker open model, producing worse fine-tunes.
Corn
Which means the fully automated pipeline isn't just a technical development. It's a business strategy.
Herman
Everything in AI is a business strategy if you squint hard enough.
Corn
Alright, let's land this. If I'm a practitioner — Daniel's kind of person, someone building real systems with these tools — what's the practical takeaway? When should I trust an automated pipeline, and when should I be reaching for the human review button?
Herman
One: if your task has clear, objective correctness criteria that can be automatically verified against a trusted source, automation works well. Two: if the cost of an error is low, automation works well. Three: if you have domain expertise and can periodically audit a sample of outputs, automation plus spot-checking is the sweet spot.
Corn
If none of those apply?
Herman
Then you need humans in the loop, probably more than you think. The automated metrics will lie to you. The judge model will be confident and wrong. And the only defense is someone who knows the domain well enough to say "that's not right."
Corn
The human as bullshit detector.
Herman
The human as the only bullshit detector that actually understands what truth looks like in that domain.
Corn
That's the kicker, isn't it? Everything we've said about the limits of automation comes with an implicit "for now." The sweet spot is moving. The floor is rising. The judge models are getting better.
Herman
And I think the honest answer to Daniel's question — can the human-in-the-loop ever be fully excluded — is that we don't know. But we do know that the role of the human changes. It moves from production to auditing, from labeling to defining objectives, from doing the work to checking the work. And that shift requires different skills, different tools, and a different mindset.
Corn
The people who thrive in that world aren't the ones who can label the most data fastest. They're the ones who can spot when the machine is confidently wrong.
Herman
Which is a much harder skill to train and a much harder skill to evaluate. We're going to need a lot more of those people, and we don't really have a pipeline for producing them yet.
Corn
Maybe that's the next thing we should automate.
Herman
I think that one might take a while.
Corn
Now: Hilbert's daily fun fact.

Hilbert
Researchers going back to Leland Locke in the nineteen twenties showed that quipu, the knotted-string recording devices used by the Inca Empire, encode numbers in a base-ten positional system, meaning an accountant in Cusco could represent the number three thousand four hundred and twenty-two using just four clusters of knots on a single cord, part of an Andean record-keeping tradition that predates European double-entry bookkeeping by centuries.
Herman
Knotted strings doing accounting better than my first spreadsheet.
Corn
This has been My Weird Prompts. I'm Corn.
Herman
I'm Herman Poppleberry. If you want more episodes like this one, find us at myweirdprompts dot com or wherever you get your podcasts.
Corn
Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.