#1672: Can We Build an AI That Never Forgets?

Why AIs can't remember last week, and the costly, risky quest to build models that learn continuously without forgetting everything.

Episode Details
Published
Duration
16:51
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Every user of a modern large language model has hit the same wall: the polite but firm refusal to discuss events after its knowledge cutoff date. It’s the AI equivalent of a librarian handing you a six-month-old newspaper when you asked for today’s headlines. This "knowledge cutoff" problem is a fundamental friction point in AI, creating a gap between a model's static world and our dynamic reality. But what if we could build a model that learns continuously, a "living" AI that ingests the news daily and updates its understanding of the world? The technical pieces for this exist, but the engineering and economic realities are brutal.

The naive approach—fully retraining a massive model every day—is economically insane, costing millions of dollars per cycle. The realistic path uses parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation). Instead of rewriting the model's entire brain, LoRA freezes the base model and trains tiny adapter matrices that act as "daily briefings." These adapters are stacked on top of the core model, providing fresh knowledge without the cost of a full retrain. Meta's Llama 3.1 release even built native support for stacking these adapters, making the plumbing a reality.
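The LoRA idea in the paragraph above reduces to a small piece of matrix arithmetic: the frozen base weight W is never modified, and a low-rank product B @ A, scaled by alpha / r, is added on top at inference time. A toy sketch in plain Python, with tiny matrices chosen purely for illustration:

```python
# Toy illustration of the LoRA idea: the frozen base weight matrix W is
# never modified; a small rank-r update B @ A is added at inference time.

def matmul(X, Y):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def apply_lora(W, A, B, alpha, r):
    """Return the effective weight W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# A 4x4 frozen base weight and a rank-1 adapter (B: 4x1, A: 1x4).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [0.0], [0.0]]
A = [[0.0, 0.5, 0.0, 0.0]]

W_eff = apply_lora(W, A, B, alpha=1.0, r=1)
# Only entry (0, 1) changes: 0.0 + 1.0 * (1.0 * 0.5) = 0.5
print(W_eff[0][1])  # 0.5
print(W[0][1])      # 0.0 — the base "brain" is untouched
```

The "daily briefing" framing falls out of this directly: the briefing is just A and B, which are tiny compared to W and can be swapped or retrained without touching the base.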

However, the mechanics are the easy part. The first major hurdle is the ingestion pipeline. To feed the model daily news, you need a sophisticated system to filter, deduplicate, and fact-check a chaotic firehose of information. This "editorial desk" would likely need its own AI to vet the data for the main AI, creating a complex, recursive system. The data must be clean, because the next problem is catastrophic forgetting. When a neural network learns new information, it can violently overwrite old knowledge, repurposing weights that stored historical facts to store today's stock market data. Mitigation techniques exist, but they add overhead and complexity, creating a constant balancing act between protecting the past and integrating the present.

This leads to subtle but critical issues of model drift and trust. If a model is updated daily, its internal representations can shift over time. Its reasoning style or "personality" might become inconsistent, leading to different answers for the same question on different days. This destroys user trust. Worse is the bias accumulation problem. A model ingesting a stream of contradictory news narratives might average them into a confusing worldview or, worse, amplify the most sensationalist, click-driven framing. An "always-current" model could easily become an "always-anxious" model, its emotional valence tied to the news cycle.

The practical takeaway is that the general-purpose "always-current" model is a nightmare. The viable path lies in domain-specific applications. A legal bot that ingests new case law or a medical assistant that updates with the latest journal articles has a clean, trusted data source and a clear value proposition. In these bounded domains, the risks of drift and bias are lower, and the need for freshness is high. This sharpens the field's core trade-off between retrieval and retraining. The dominant paradigm today is Retrieval-Augmented Generation (RAG), where a frozen model consults a live database of recent documents. RAG is cheaper and safer, but it’s a lookup tool, not a learning system. It can reference new information but cannot deeply reason with it or integrate it into its core world model.

Incremental retraining, while risky and expensive, offers the promise of true synthesis—the ability to connect new events to deep, existing knowledge. We are in a transitional phase, with engineers building the safe bridges of RAG while others attempt the high-risk, high-reward frontier of continuous learning. The question is not just "can we build it," but whether the cost and the potential for a model that changes its "self" every day is worth the reward.


Episode #1672: Can We Build an AI That Never Forgets?

Corn
You know that feeling when you ask an AI about something that happened last week and it gives you that polite "my knowledge cutoff is January 2026" shrug? It's like asking a librarian for today's newspaper and they hand you a six-month-old copy. You're standing there, holding yesterday's fish-wrap, and the librarian is smiling serenely, utterly confident they've solved your problem.
Herman
Oh, it's worse than that. It's not just a shrug. It's a confident, well-written shrug that might even hallucinate plausible-sounding details about events it knows nothing about. The gap between what these models know and what's happening in the world is a real friction point. As of March 2026, major models like GPT-4 still have a cutoff of November 2023, and Claude 3's is April 2024. It's not just ignorance; it's a confident, articulate ignorance that can actively mislead you.
Corn
So Daniel's prompt this time is right on that friction point. He's asking if we could have an LLM that ingests the news once a day or once a week, does some tiny incremental retraining, so its knowledge cutoff is never more than a few days old. A living, breathing model, basically. One that doesn't just report the news, but understands it in the context of everything else it knows.
Herman
Right. And this isn't a sci-fi question anymore. The pieces are technically there. We have the training methods, we have the data pipelines. The question is what it actually costs, what breaks, and whether it's worth it. I've been deep in the papers on continual learning this week, and the picture is... complicated. It's like looking at the blueprint for a perpetual motion machine; the elegance is captivating, but the engineering realities are brutal.
Corn
Complicated in a "we can almost do it" way or complicated in a "this creates ten new problems for every one it solves" way?
Herman
Both. Simultaneously. They're not mutually exclusive. Let's start with the "how," because the mechanism is fascinating. The naive approach—just taking your base model and fully retraining it on new data every day—is economically insane. We're talking millions of dollars per cycle for a large model. The compute bill alone would be staggering. It's like rebuilding your entire house from the foundation up every time you want to add a new bookshelf.
Corn
So that's off the table. What's the realistic path?
Herman
Parameter-efficient fine-tuning. The big buzzword here is LoRA, which stands for Low-Rank Adaptation. Instead of updating all the billions of weights in the model, you freeze the base model and train these tiny, separate adapter matrices that get plugged in. Think of it as giving the model a daily briefing sheet rather than rewriting its entire brain. The core reasoning engine stays pristine, but you're handing it a new, focused set of notes on current events.
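Herman's "briefing sheet" framing has concrete arithmetic behind it: for a single d x d projection matrix, full fine-tuning updates d² weights, while a rank-r adapter trains only the two factor matrices, 2 × d × r weights. A quick sketch; the hidden size and rank below are common illustrative values, not taken from any specific model:

```python
# Back-of-the-envelope: trainable parameters for one d x d projection,
# full fine-tuning vs. a rank-r LoRA adapter (a d x r and an r x d matrix).

def full_params(d):
    return d * d

def lora_params(d, r):
    return 2 * d * r

d, r = 4096, 8            # illustrative hidden size and adapter rank
full = full_params(d)     # weights updated by full fine-tuning
lora = lora_params(d, r)  # weights updated by the adapter

print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# full: 16,777,216  lora: 65,536  ratio: 256x
```

A 256x reduction per layer is why the daily-update idea is even on the table: the adapter is small enough to train, store, and swap cheaply.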
Corn
A briefing sheet. I like that. So the core model stays static, but you're stacking these little knowledge patches on top. It's like applying a software patch instead of reinstalling the entire operating system.
Herman
The analogy holds well. Meta's Llama 3.1 release earlier this year, in January, made this a first-class feature. They built native support for stacking multiple LoRA adapters. So you could have your base model, then an adapter for medical knowledge, another for legal terminology, and, crucially, a daily news adapter. You load them all at inference time. The system dynamically composes the expertise you need for a given query.
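The adapter-stacking setup Herman describes can be sketched as summing per-adapter deltas onto a frozen base at load time. This is a toy illustration: the adapter names are hypothetical, and real frameworks such as Hugging Face's PEFT keep each delta factored as B @ A rather than materialized as a full matrix:

```python
# Sketch of adapter stacking: each loaded adapter contributes its own
# low-rank delta, and the effective weight is the frozen base plus the sum.

def stack_adapters(W, deltas):
    """Effective weight = W + sum of all active adapter deltas."""
    out = [row[:] for row in W]  # copy; the base is never mutated
    for delta in deltas:
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += delta[i][j]
    return out

W = [[1.0, 0.0], [0.0, 1.0]]
medical    = [[0.1, 0.0], [0.0, 0.0]]  # hypothetical domain adapter
daily_news = [[0.0, 0.2], [0.0, 0.0]]  # hypothetical daily adapter

W_eff = stack_adapters(W, [medical, daily_news])
print(W_eff[0])  # both "briefings" applied to the first row
print(W[0])      # base row unchanged
```

Because the base is never mutated, a bad daily adapter can simply be unloaded, which is part of the safety appeal of this design.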
Corn
Okay, so the plumbing exists. But what goes into that daily adapter? You can't just scrape every headline from every news site and feed it in. That's a recipe for garbage. You'd get the model equivalent of a person who only reads tabloid headlines and thinks they understand geopolitics.
Herman
That's the first major hurdle: the ingestion pipeline. You need a sophisticated system to filter, deduplicate, and fact-check the news stream before it becomes training data. You're dealing with conflicting reports, sensationalized headlines, opinion pieces disguised as news, and outright misinformation. The curation layer is arguably harder than the training itself. It's the editorial desk of this entire operation.
Corn
So you'd need a whole separate AI system just to vet the data for your main AI system. It's AIs all the way down. A committee of artificial editors.
Herman
It really is. And that curation system needs to be incredibly robust because of the next problem: catastrophic forgetting. This is the big one. When a neural network learns new information, it can violently overwrite old information. The weights that stored knowledge about, say, the French Revolution get repurposed to store today's stock market movements. The network has a finite capacity, and new information can crowd out the old.
Corn
That sounds disastrous. So you update the model with this week's news, and it forgets basic historical facts or how to do math? It becomes a current affairs savant who can't remember who Napoleon was?
Herman
The research shows it's not always that dramatic, but the drift is real. Models can lose proficiency on older tasks. There are mitigation strategies, like Elastic Weight Consolidation, which essentially identifies which weights are important for old knowledge and penalizes changes to them during new training. But it adds computational overhead and complexity. It's a constant balancing act—protecting the past while integrating the present.
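The Elastic Weight Consolidation penalty Herman mentions has a compact form: the new-task loss plus a quadratic penalty on each weight's movement, weighted by that weight's estimated importance (Fisher information) for old tasks. A toy sketch with made-up numbers:

```python
# Sketch of the Elastic Weight Consolidation penalty: moving weights that
# mattered for old tasks (high Fisher importance F_i) is penalized, so
# new-task training prefers to move the "unimportant" weights.

def ewc_loss(task_loss, theta, theta_old, fisher, lam):
    """task_loss + (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2"""
    penalty = sum(f * (t - t0) ** 2
                  for f, t, t0 in zip(fisher, theta, theta_old))
    return task_loss + 0.5 * lam * penalty

theta_old = [1.0, -2.0, 0.5]  # weights after learning the old task
fisher    = [10.0, 0.1, 0.1]  # weight 0 is critical for old knowledge

# Two candidate updates with the same new-task loss of 0.3:
safe  = ewc_loss(0.3, [1.0, -1.0, 0.5], theta_old, fisher, lam=1.0)
risky = ewc_loss(0.3, [2.0, -2.0, 0.5], theta_old, fisher, lam=1.0)

print(safe)   # small total: moved an unimportant weight
print(risky)  # large total: moved the important weight, heavily penalized
```

The balancing act Herman describes is visible in `lam`: crank it up and the past is protected but the model barely learns; turn it down and new data can trample old knowledge.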
Corn
It's like trying to write new notes in the margins of a textbook without smudging any of the existing text. You need a very steady hand and a special pen. And you have to do it every single day.
Herman
That's a decent analogy. And the problem compounds over time. If you do this daily for six months, the model's internal representations can drift in subtle ways. Its "personality," its reasoning style, even its factual recall can become inconsistent. You might ask it the same question on Monday and Wednesday and get subtly different answers because the Wednesday model has ingested two more days of potentially conflicting narratives. The model's "self" becomes a moving target.
Corn
That would destroy user trust. Consistency is a huge part of why people trust tools. If my calculator gave me a slightly different answer every day, I'd stop using it. You need to know that "X" means the same thing on Tuesday as it did on Monday.
Herman
And that leads to the second-order effects. Let's say a major news event happens—a political scandal, a market crash. Different outlets report it with different spins, different emphasized facts. If your model ingests all of those narratives in the same training cycle, what does it internalize? Does it average them? Does it latch onto the most frequent framing? The model has to construct a coherent reality from a stream of contradictory signals.
Corn
It could end up with a weird, blended perspective that doesn't match any single coherent reality. Or worse, it could amplify the most engagement-driven, sensationalist version of events because that's what dominates the data stream. The model's worldview could become a function of what gets the most clicks.
Herman
And that's the bias accumulation problem. The model doesn't just absorb facts; it absorbs the framing, the tone, the implicit assumptions of its training data. If that data is a firehose of daily news, the model's worldview could shift rapidly based on the news cycle. Imagine a model that becomes noticeably more pessimistic during a week of bad economic news, or more alarmist during a geopolitical crisis. Its emotional valence becomes tied to the news ticker.
Corn
So the "always-current" model might end up being an "always-anxious" model. That's a great selling point. "Come for the fresh facts, stay for the existential dread."
Herman
Ha. But this connects to something real. Bloomberg, for example, has been a pioneer in this space with BloombergGPT. They don't do daily updates, but they have a continuous pipeline for financial data. The key is they have an incredibly narrow, well-defined domain and a trusted, curated data source. They're not trying to ingest the entire chaotic breadth of human news. They're updating a specialist, not a generalist.
Corn
So the viable version of this might be domain-specific. A medical research assistant that updates daily with new journal publications. A legal bot that ingests new case law. Not a general-purpose know-it-all. You're not creating a new person every day; you're giving a specialist doctor a daily dose of the latest New England Journal of Medicine.
Herman
That's the practical takeaway. The general-purpose "always-current" model is a nightmare of curation, drift, and trust issues. But the specialized, domain-specific incremental model? That's where the economics and the technology start to make sense. The data is cleaner, the scope is bounded, and the value of freshness is extremely high. A lawyer paying for a tool that knows yesterday's precedent is a clear value proposition.
Corn
But even in a specialized domain, you mentioned the cost. LoRA adapters are cheap compared to full retraining, but "cheap" is relative. What are we actually talking about? Give me the back-of-the-napkin math.
Herman
A full retraining run for a 70-billion-parameter model can cost anywhere from two to five million dollars in compute, depending on the provider and the dataset size. A LoRA adapter training run for the same model might cost one to two percent of that. So twenty to a hundred thousand dollars per cycle.
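Herman's back-of-the-napkin math, written out with the midpoints of the ranges he quotes (illustrative figures only):

```python
# Midpoints of the quoted ranges: a full retrain at ~$3.5M, a LoRA cycle
# at ~1.5% of that, run once per day for a month. Illustrative only.

full_retrain = 3_500_000   # midpoint of the $2M-$5M range
lora_fraction = 0.015      # midpoint of the 1%-2% range
lora_cycle = full_retrain * lora_fraction

monthly = lora_cycle * 30  # one update per day
print(f"per cycle: ${lora_cycle:,.0f}")  # per cycle: $52,500
print(f"per month: ${monthly:,.0f}")     # per month: $1,575,000
```
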
Corn
Per cycle. So if you're doing that daily... that math gets scary fast.
Herman
You're looking at potentially millions of dollars a month, even with the efficient method. And that's just compute. That doesn't include the engineering team to maintain the curation pipeline, monitor for drift, and manage the adapter zoo. It's a massive operational undertaking. You're not just paying for electricity; you're paying for a 24/7 crew to keep the learning machine from going off the rails.
Corn
Which brings up the elephant in the room. Why not just use RAG? Retrieval-Augmented Generation. You keep your model frozen, but when you ask it a question, it first searches a live database of recent news and uses that context to formulate an answer. No retraining needed. It's the "don't change the brain, just give it a better library card" approach.
Herman
That is the dominant paradigm right now, and for good reason. It's cheaper, it's safer, and it doesn't risk catastrophic forgetting. But RAG has its own severe limitations. The model's core knowledge doesn't change. It's like giving someone a textbook and letting them glance at a newspaper clipping right before an exam. They can reference the clipping, but they haven't internalized the knowledge. They can't reason deeply about it or connect it to other concepts in their base knowledge. The information is adjacent to their mind, not within it.
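The RAG pattern Herman contrasts here is simple to sketch: the model stays frozen, and at query time recent documents are retrieved and prepended as context. The retrieval below is naive keyword overlap over a two-document toy corpus; production systems use embedding similarity, but the shape of the pipeline is the same:

```python
# Minimal sketch of the RAG pattern: retrieve recent documents for a
# query and build a context-stuffed prompt for a frozen model.

def score(query, doc):
    """Count how many query words appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, corpus, k=1):
    """Return the top-k documents by keyword overlap."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query, corpus):
    context = "\n".join(retrieve(query, corpus, k=1))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The Fed held rates steady at its meeting yesterday.",
    "A new semiconductor material was announced by researchers.",
]

prompt = build_prompt("What did the Fed say yesterday?", corpus)
print(prompt)  # the Fed document is selected as context
```

Note what never happens here: no weights change. The clipping sits next to the model's knowledge, never inside it, which is exactly the limitation Herman describes.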
Corn
So RAG is good for fact lookup—"What did the Fed say yesterday?"—but bad for synthesis and reasoning that requires a deep understanding of new information. It's a lookup table, not a learning system.
Herman
Precisely. If the news is that a new type of semiconductor material was discovered, a RAG system can tell you the announcement details. But an incrementally retrained model might be able to reason about how that material changes the competitive landscape for chipmakers, because it has updated its internal world model. It can run simulations in its head that incorporate the new fact. That's the promise: not just knowing, but understanding.
Corn
But it's a promise weighed against all those risks we just outlined. Drift, cost, consistency. It feels like we're in a transitional phase. RAG is the pragmatic, safe choice now. Incremental retraining is the high-risk, high-reward frontier. We're all using the safe bridge while some engineers are testing the experimental one.
Herman
And the frontier is moving. The tools are getting better. We're seeing research on more stable continual learning algorithms, like "PackNet" or "Progressive Neural Networks," which try to dedicate different parts of the network to different tasks. We're also seeing better automated curation systems, sometimes using smaller, specialized models to filter and summarize news for the larger model. The cost curve is dropping. In a year or two, the calculus might change significantly.
Corn
So for our listeners who are developers or just curious tinkerers, what can they actually do with this information today? How do they get their hands dirty?
Herman
If you're building an application where freshness is critical, start with RAG. It's the right tool for most jobs. But if you're hitting the limits of RAG—if you need the model to truly understand and reason about new information, not just retrieve it—then start experimenting with the open-source adapter frameworks. Hugging Face's PEFT library, or their TRL library for reinforcement learning, make it possible to set up a small-scale incremental update pipeline. You won't be updating a frontier model daily, but you can learn the mechanics on a smaller model—say, a 7-billion-parameter one—and a specific, clean dataset. Think of it as a lab experiment.
Corn
So the homework is: go play with LoRA adapters on a seven-billion-parameter model. See if you can teach it something new without making it forget how to write a sonnet. A very specific, nerdy homework assignment.
Herman
That's the idea. You'll quickly encounter the tradeoffs firsthand. You'll see how much data you need, how the training affects other capabilities, and whether the freshness gain is worth the complexity. You'll feel the pain of catastrophic forgetting when your model suddenly can't do basic arithmetic after you fed it a week of sports statistics.
Corn
It's a fascinating space. We're essentially asking: how do we build a mind that can learn continuously without losing itself? It's a problem human brains solve pretty well, even if we do forget where we put our keys. Our learning is seamless, integrated, and we don't usually overwrite our knowledge of history with today's lunch menu.
Herman
Our brains have had a few million years of evolution optimizing for that. We're trying to crack it in a decade. But the pace of progress is wild. The fact that we're even having this conversation about practical, incremental retraining shows how far the field has come. Five years ago, this was pure theory.
Corn
Alright, let's bring this in for a landing. The dream of a perfectly current LLM is technically possible but practically fraught. The path is through efficient adapters, not full retraining. The biggest hurdles are catastrophic forgetting, data curation, and maintaining model consistency over time. The most viable near-term applications are specialized, domain-specific models where the value of deep, fresh knowledge outweighs the costs and risks.
Herman
And the broader question Daniel's prompt raises is whether we want our AI systems to be static encyclopedias or living, evolving entities. Both have their place. The static model is reliable and safe. The living model is dynamic but unpredictable. The future probably involves a mix—stable base models with specialized, frequently updated adapters for domains that demand it. A foundation of granite, with rooms that can be remodeled.
Corn
A nice, balanced take. Thanks as always to our producer, Hilbert Flumingtop, for keeping us on track. And big thanks to Modal for providing the GPU credits that power these deep dives.
Herman
If you're enjoying the show, a quick review on your podcast app really does help us reach new listeners who might like this kind of technical breakdown. It helps others find the signal in the noise.
Corn
This has been My Weird Prompts. I'm Corn.
Herman
And I'm Herman Poppleberry. Stay curious.
Corn
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.