#2936: Why AI Still Can't Really Teach You to Code

Code generators ship code. Real tutors build understanding. Why the gap is bigger than you think.

Featuring
Listen
0:00
0:00
Episode Details
Episode ID
MWP-3106
Published
Duration
27:03
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The tech industry has poured billions into AI that writes code for you, but almost nothing into AI that teaches you to write it yourself. That's the central tension explored in this episode, prompted by a listener question about why no one has built a truly persistent, pedagogical coding tutor.

The fundamental problem is that code generation and code tutoring optimize for completely different things. A tool like Cursor maximizes output velocity — how fast can you ship working code. A pedagogical agent maximizes comprehension and retention — how well you understand what you just did, and whether you'll still understand it next month. Those are not just different features; they're different architectural targets.

Every major attempt so far has fallen short in distinct ways. Khan Academy's Khanmigo has the right tutor persona but no persistent memory — every session starts fresh, so it can't remember that you struggled with nested loops two weeks ago. Replit's Ghostwriter and Codecademy's AI Assistant are reactive explainers bolted onto code generation engines, not proactive teachers. Anthropic's Claude for Education has the right Socratic philosophy but lacks coding-specific scaffolding.

The technical challenge is architectural. A six-month curriculum of twice-weekly sessions generates roughly half a million tokens of dialogue, far exceeding any current context window. The solution likely involves retrieval-augmented generation paired with a student knowledge graph — tracking every concept encountered, associated error patterns, and confidence scores. When a student works on decorators in week eight, the system should know they struggled with higher-order functions in week five, a prerequisite concept.

Beyond memory, the system needs to implement educational principles like Vygotsky's zone of proximal development (keeping tasks in the sweet spot between too easy and impossible) and productive failure (strategically letting students struggle before intervening). This requires multiple layers: a student model, a curriculum scheduler, a pedagogical decision engine, and the LLM itself. That's a lot of infrastructure for what looks like a simple chat window — which is exactly why no one has shipped it yet.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3
Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2936: Why AI Still Can't Really Teach You to Code

Corn
Daniel sent us this one — he's drawing a line between two very different AI experiences. On one side, you've got tools like Cursor and Claude Code that write code for you. On the other, what if an AI sat down with you and said, "Let's learn Python," and then actually taught you — not by generating answers, but by guiding your hand as you typed, remembering what you struggled with three weeks ago, and deliberately bringing it back when you least expected it? He's asking whether any companies have seriously tried to build this, and what delivery mechanisms are emerging. The short answer is: a few have tried, none have nailed it, and the gap is weirder than you'd think.
Herman
The distinction he's drawing is the one between a code generation agent and a pedagogical agent. And they optimize for completely different things. A code gen tool optimizes for output velocity — how fast can we get you to a working solution. A pedagogical agent optimizes for comprehension and retention — how well do you understand what you just did, and will you still understand it next month. Those are not just different features. They're different optimization targets, different architectures, different success metrics entirely.
Corn
One gets you to ship. The other gets you to know.
Herman
And the industry has poured billions into shipping. The tutoring side has gotten scraps and afterthoughts. Which is strange, because the models are demonstrably capable of Socratic dialogue. Claude can absolutely refuse to give you an answer and instead ask you guiding questions. GPT-4o can do the same. The capability is there. What's missing is everything around it.
Corn
Let's walk through who's actually tried. Because the prompt is right that most of what exists is what I'd call the "education mode toggle" — you flip a switch and suddenly the same tool that was writing your unit tests is now explaining what a decorator is. That's not a tutor. That's a code generator wearing a tweed jacket.
Herman
Let's start with the most visible attempt. Khan Academy launched Khanmigo in 2023, and in March 2024 they released a coding tutor beta built on GPT-4. The idea was that Khanmigo would act as a patient tutor, asking questions, giving hints, refusing to just hand over the answer. And the persona work was genuinely thoughtful — it had a warm, encouraging tone that felt more like a human tutor than a chatbot.
Herman
As of mid-2026, it still doesn't have persistent student memory across sessions. Each session starts fresh. The tutor doesn't remember that you struggled with nested loops two weeks ago. It doesn't know which concepts you've mastered versus which ones you've only superficially encountered. The context window resets, and with it, the entire pedagogical relationship.
Corn
Which makes it less a tutor and more a very friendly reference librarian you keep having first dates with.
Herman
And that's not a knock on Khan Academy — they've been transparent about the limitations and they're working on it. But it illustrates the core problem. You can have the best tutor persona prompt in the world, and without persistent memory, you're still just a chatbot.
Corn
What about Replit? They were early in this space.
Herman
Replit's Ghostwriter launched in 2023, and the original pitch had strong educational undertones. The idea was that it would help beginners learn while building. But the product incentives pushed hard toward code completion and deployment speed. By 2024, Ghostwriter had largely pivoted to being a productivity tool. The "Explain Code" feature is reactive — you highlight code, it explains it — but it's not proactive. It doesn't sequence concepts. It doesn't track your progress. It's a useful feature, but it's an afterthought bolted onto a code generation engine.
Corn
Reactive versus proactive is a useful lens. A real tutor sees you about to make a mistake and intervenes before you commit to it. An explain-code button waits until you're already confused and then responds. One is teaching, the other is tech support.
Herman
Codecademy's AI Assistant, which they integrated into their IDE in 2024, falls into the same category. It can answer questions about the current exercise, but it has no long-term curriculum tracking. It knows the lesson you're on because you're literally on that page, not because it has a model of your learning trajectory.
Corn
The pattern is: everyone built a helpful Q-and-A bot, and called it a tutor.
Herman
Anthropic's Claude for Education, announced in January of this year, is interesting because it approaches the problem from a different angle. They built a dedicated education tier with what they call "Socratic mode" — the model is explicitly prompted to refuse to give direct answers and instead guide students through reasoning. And the refusal is structural, not just a polite suggestion. Claude will push back if you try to get it to do your homework.
Corn
Which is the right instinct. The best teachers I've had were the ones who wouldn't let me off the hook.
Herman
Claude for Education is designed for general subjects — history, literature, science. It's not specifically optimized for coding pedagogy. Teaching someone to write a Python function has different scaffolding requirements than teaching someone to analyze a poem. In coding, the tutor needs to see your code, understand your specific syntax error, and know whether this error connects to a misconception you had three sessions ago. General-purpose Socratic mode doesn't do that.
Corn
It's a philosophy applied broadly, not a coding tutor built from the ground up.
Herman
And then there are the startups. Coddy launched in 2024, Maven in 2025. There's an unnamed Y Combinator S25 batch company building what they describe as "a personal Python tutor that remembers everything you've struggled with." But as far as I can tell, none of them have shipped a production system with persistent memory across months. They're all working on the problem, but the demos I've seen are single-session.
Corn
The field is basically: Khan Academy has the right idea but no memory, Replit and Codecademy have reactive explainers, Anthropic has the right pedagogical philosophy but no coding specificity, and a handful of startups are promising the moon but haven't landed. That's the landscape.
Herman
The question is why. Why has this been so hard to build?
Corn
Let me ask you something. You were a pediatrician. When you were training residents, did you just answer their questions, or did you sequence what they learned?
Herman
You don't let a resident intubate before they've mastered bag-mask ventilation. And you bring things back — "Remember that case of croup from three weeks ago? Here's a similar presentation, but the stridor sounds different. What's your differential?" That interleaving, that deliberate return to prior material, is what builds durable knowledge. And that's exactly what current AI tutors can't do.
Corn
That's the technical challenge. Let's get into it. What's actually stopping someone from building this?
Herman
The core problem is memory. And I don't mean memory in the colloquial sense — I mean architectural memory. Let me give you some numbers. A typical 30-minute tutoring session generates roughly 8,000 to 12,000 tokens of dialogue — that's the student's code, the tutor's guidance, the back-and-forth. GPT-4o has a context window of 128,000 tokens. 5 Sonnet has 200,000. At 10,000 tokens per session, you can fit maybe 12 to 18 sessions in a context window before you hit the ceiling.
Corn
A proper coding curriculum is what, six months of regular sessions?
Herman
If you're meeting twice a week, that's 50 sessions. You're at half a million tokens just in raw transcript, and that's before you account for the code artifacts, the exercises, the error logs. A flat context window can't hold a multi-month learning history. And even if it could, jamming everything into a single prompt is the wrong architecture. The tutor needs structured access to specific things — not "here's everything that ever happened," but "here are the three misconceptions about variable scope that this student exhibited in weeks two, four, and seven.
Corn
It's the difference between a filing cabinet and a pile.
Herman
This is where things get interesting architecturally. The approach that people are starting to explore is retrieval-augmented generation paired with what's essentially a student knowledge graph. Every concept the student has encountered becomes a node. Attached to each node are the associated error patterns, the exercises they've completed, and a confidence score — how well does the model believe this student understands closures, or list comprehensions, or recursion.
Corn
When the student is working on, say, decorators in week eight, the system can pull up the relevant nodes — "this student struggled with higher-order functions in week five, which is a prerequisite concept for understanding decorators.
Herman
And that's not just a keyword search. It's structured retrieval based on a curriculum map. The system knows that decorators depend on understanding first-class functions, which depend on understanding functions as objects. If the confidence score on "functions as objects" is low, the tutor should probably revisit that before introducing decorator syntax.
Corn
There's a name for this in the education literature, right? The zone of proximal development?
Herman
Vygotsky's zone of proximal development. The idea is that there's a sweet spot between what a student can do unaided and what they can't do even with help. Good teaching happens in that zone — the task is just hard enough to stretch the student, but achievable with guidance. And the tutor needs to constantly recalibrate where that zone is. That's a dynamic inference problem. The system has to look at what the student just did, compare it to their history, and decide: is this the right level of challenge, or am I about to lose them?
Corn
Which is hard enough for a human tutor who can read facial expressions and hesitation. An AI only has the text.
Herman
There's another layer. Education research has something called "productive failure" — formalized by Manu Kapur in a 2008 paper in the Journal of the Learning Sciences. The finding is that students who struggle with a problem before receiving instruction often develop deeper understanding than students who receive direct instruction first. The struggle itself is productive, provided it's structured and eventually resolved.
Corn
The tutor shouldn't just withhold answers — it should strategically let the student fail.
Herman
For a specific window of time, on specific types of problems. And then it needs to know exactly when to step in with a hint, and exactly what kind of hint. Too early, and you rob the student of the productive struggle. Too late, and they get frustrated and disengage. Can current LLMs do this without explicit pedagogical rules? My sense is no — they need structured guidance. A prompt that says "be Socratic" is too vague. You need something more like "the student has been struggling with this type of error for four minutes. Based on their history, they typically need a nudge about indentation, not about logic. Give a hint that points at the syntax without naming the fix.
Corn
We're talking about a system with multiple layers. A student model that tracks knowledge and misconceptions, a curriculum scheduler that sequences concepts and schedules review, a pedagogical decision engine that decides when to hint versus when to let struggle, and then the LLM itself that generates the actual tutor dialogue. That's a lot of infrastructure for what looks to the user like a chat window.
Herman
That's exactly why it hasn't been built yet. Code generation is a simpler problem — you give the model context about the codebase and the task, and it produces output. The success metric is "does it work." Tutoring requires all these additional systems, and the success metric is "does the student understand" — which is much harder to measure and takes months to evaluate.
Corn
There are some infrastructure pieces emerging. You mentioned Mem0 earlier.
Herman
Mem0 is an open-source memory layer for LLMs that reached 10,000 GitHub stars in February of this year. It provides persistent memory across sessions — it can store facts, preferences, and interaction history. LangChain released LangMem in early 2026, which is their memory framework. These are useful building blocks, but they're generic. They're not optimized for pedagogical sequencing. They don't know about the forgetting curve or interleaved practice.
Corn
The forgetting curve — this is Ebbinghaus, right? From the 1880s?
Herman
Hermann Ebbinghaus, 1885. The finding is that memory decays exponentially after learning. You lose about 50 percent of new information within an hour if you don't actively reinforce it. The way you fight this is spaced repetition — you review material at increasing intervals. An AI tutor should know exactly what a student learned three weeks ago and deliberately bring it back, not because the student asked, but because the forgetting curve says it's time.
Corn
Which means the tutor needs a scheduler. Not just a chat interface — an actual curriculum engine that tracks what was learned when, and proactively surfaces old material.
Herman
This is where I get excited, because this is an architectural problem that's actually solvable with current technology. We have the models. We have the memory infrastructure, even if it's rough. We understand the pedagogical principles. What we don't have is a company that's put it all together into a coherent product. The pieces are on the table. No one's assembled them.
Corn
If you were building this — and I know you've thought about this — what does the memory architecture actually look like?
Herman
I'd structure it around two memory systems, inspired by hippocampal indexing theory. You have episodic memory — what the student did. Session transcripts, code they wrote, errors they made, exercises they completed. And you have semantic memory — what the student knows. A structured knowledge graph where each node is a concept with a mastery score, associated error patterns, and links to prerequisite and dependent concepts.
Corn
The retrieval works differently for each.
Herman
When the student is working on a new problem, the system does a semantic retrieval first — "what concepts are relevant to this problem, and what's the student's mastery level on each?" Then it does an episodic retrieval — "when this student encountered similar problems in the past, what specific mistakes did they make, and what interventions worked?" The two retrievals inform each other. The semantic layer tells you what to worry about; the episodic layer tells you how it manifested.
Corn
This all has to happen in milliseconds, mid-conversation.
Herman
Which is a real engineering constraint. You can't have the tutor pause for three seconds while it queries a vector database. The retrieval has to be fast enough to feel conversational. That's achievable with current tooling — we're not talking about anything science-fictional here — but it requires careful engineering.
Corn
Let me push on something. You said the models are capable. But are they? Can an LLM actually stay in the zone of proximal development for an hour, let alone six months?
Herman
I'm not entirely sure about this part, but here's what I think the evidence suggests. In single sessions, with a well-crafted system prompt, yes — Claude and GPT-4o can maintain a Socratic stance quite effectively. The problem is consistency across sessions. Without the student model, the tutor doesn't know where to pick up. And there's also the issue of the model's own tendencies. LLMs are trained to be helpful, which often means giving the answer. Pushing against that training requires strong prompting, and even then, models will sometimes default to being overly helpful.
Corn
The model wants to be liked.
Herman
The model wants to be useful, which in its training often means providing the solution. A good tutor knows that withholding the solution is sometimes the most useful thing you can do. That's a tension that hasn't been fully resolved in the current generation of models.
Corn
Let's talk about what exists today that a learner can actually use. If someone listening wants an AI coding tutor that teaches rather than generates, what do they do right now?
Herman
The most practical approach is a hybrid. You use Claude or GPT-4 with a custom system prompt that says something like "Never write code for me. Only ask questions and give hints. When I'm stuck, help me discover the solution rather than providing it." Then you manually track your own progress in a separate document — what you've learned, what you struggled with, what you want to revisit.
Corn
Which is basically you being your own knowledge graph.
Herman
It's crude, but it works. And it forces you to reflect on your own learning, which is itself pedagogically valuable. The act of writing down "I keep messing up list comprehensions when there's a conditional" is metacognition. You're learning about your learning.
Corn
Like being your own doctor but you only have a stethoscope and a notepad.
Herman
You're not a doctor. But it's what we've got until someone builds the real thing. The other approach is to use Claude Code or Cursor but in what I'd call "review mode" — you write the code yourself, and then you ask the tool to review it, explain what you could have done differently, and quiz you on the underlying concepts. You're using the code gen tool as a tutor by constraining how you interact with it.
Corn
You're forcing it into the pedagogical role through sheer willpower and prompt engineering.
Herman
That's kind of the story of AI in 2026, isn't it? We have these incredibly capable general-purpose tools, and we're bending them into specialized roles through prompts and external scaffolding. It works, but it's fragile. One prompt drift and suddenly your Socratic tutor is writing your functions for you.
Corn
Let's pull back to the bigger question. Why has the market produced a dozen code generation tools and basically zero dedicated coding tutors?
Herman
Code generation has an obvious ROI — you can measure it in features shipped, time saved, developer velocity. A CTO can look at Cursor and say "this will make my team 30 percent faster." An AI coding tutor's ROI is measured in learning outcomes, which take months to materialize and are harder to attribute. The buyer for code gen is the engineering organization. The buyer for a coding tutor is the individual learner, or maybe a bootcamp or university. The budgets are different, the sales cycles are different, the success metrics are different.
Corn
The venture capital follows the enterprise money.
Herman
Code gen tools can charge per-seat enterprise pricing. Tutoring tools are competing with free YouTube tutorials and 15-dollar Udemy courses. The unit economics are brutal unless you can demonstrate dramatically better outcomes — which you can't do without long-term studies, which you can't run without a product, which you can't build without funding.
Corn
The chicken-and-egg problem with a pedagogical twist.
Herman
There's also a cultural factor. The tech industry valorizes shipping. "Move fast and break things." Learning is inherently slow. You cannot "move fast" through understanding recursion. Your brain needs time to myelinate those neural pathways. An AI tutor that respects the pace of learning is making a promise that's fundamentally at odds with the industry's tempo.
Corn
The first successful AI coding tutor might not come from Silicon Valley. It might come from an education company, or from a country with a different relationship to learning speed.
Herman
Khan Academy is the obvious candidate. They have the pedagogical DNA, the existing user base, and they've been working on Khanmigo for three years. Duolingo could extend into coding — they already have the gamification and spaced repetition infrastructure. But their coding product has been fairly basic so far. The dark horse would be a company like Replit that decides to build a dedicated learning track separate from their pro tools, with its own architecture and success metrics.
Corn
Or a startup we haven't heard of yet. The Y Combinator batch company you mentioned — "a personal Python tutor that remembers everything you've struggled with" — that's basically the spec.
Herman
It's the spec, but saying it and building it are different things. The hard part isn't the chat interface. The hard part is the knowledge graph, the curriculum scheduler, the pedagogical decision engine. Those are backend systems that take years to get right. And you need a lot of student data to train and refine them.
Corn
Which brings us to the credentialing question. If an AI tutor with persistent memory actually works — if it can track what you've learned and retained over six months — does that change how we evaluate programmers?
Herman
I think it fundamentally could. Right now, we credential programmers based on what they've built — their portfolio, their GitHub, their shipped products. Or we credential them based on artificial assessments — coding interviews, algorithm challenges, certifications. An AI tutor that's been with you for six months has a much richer signal. It knows not just what you can build, but what you actually understand. It's seen you struggle with closures and eventually master them. It's seen which concepts you internalized quickly and which ones took three attempts.
Corn
The transcript becomes the credential.
Herman
And that's both exciting and unsettling. Exciting because it could surface talent that doesn't look good on traditional metrics — someone who struggled early but showed consistent growth, for instance. Unsettling because it puts an enormous amount of trust in the AI's assessment. What if the tutor's knowledge graph is wrong? What if it misjudges mastery?
Corn
What if the student learns to game the tutor the way students game standardized tests?
Herman
That's a real concern. Any assessment system creates incentives, and those incentives shape behavior. If the tutor's confidence scores become high-stakes, students will optimize for the scores rather than for understanding. We'd need to design the system so that gaming it is harder than actually learning the material — which is a hard design problem.
Corn
Where does this leave us? We've got capable models, emerging memory infrastructure, well-understood pedagogical principles, and a market gap wide enough to drive a truck through. The prompt itself reads like a product requirements document.
Herman
It really does. The user basically laid out the problem statement, the competitive landscape, the technical challenges, and the opportunity. If I were an engineer looking for something to build, I'd take this prompt seriously.
Corn
The biggest gap isn't model capability — it's curriculum design. No current AI tutor has a built-in pedagogical theory of when to challenge versus when to support. They're all winging it.
Herman
The second gap is the persistent student model. The prompt is ephemeral. The knowledge graph is the moat. If you're building this, don't start with the chat interface. Start with the data model. What does a student know? How do you represent a misconception? How do you track the forgetting curve? Get that right, and the tutor behavior emerges from it.
Corn
Start with what the system remembers, not what it says.
Herman
For learners today, the best approach is that hybrid I mentioned. Use a frontier model with a strict "don't write code for me" system prompt. Track your own progress manually. Treat the AI as a Socratic partner, not a code generator. It's not elegant, but it works, and it builds the metacognitive habits that will serve you long after the AI tutor product finally ships.
Corn
For builders listening — this space is wide open. The user's prompt is essentially a challenge. The tools exist. The models are capable. What's missing isn't technology. It's the will to build for learning, not just for shipping.
Herman
The question I keep coming back to is who gets there first. Does it come from an education company that understands pedagogy but struggles with the AI infrastructure? Or from a dev tools company that has the infrastructure but is optimized for a completely different success metric?
Corn
My money's on neither. I think it comes from someone who's been a teacher and a programmer, who's felt the frustration of both sides, and who builds it because they can't not build it.
Herman
That's the romantic answer.
Corn
It's the sloth answer. We're patient.
Herman
I'll take it. The open question for me is what happens to programming education if this works. If an AI tutor can take someone from zero to proficient in Python over six months, with persistent memory and adaptive curriculum, what does that do to bootcamps? To university CS programs? To the entire edifice of how we teach coding?
Corn
It doesn't replace them. But it changes what they're for. If the AI handles the mechanics of learning syntax and debugging and basic patterns, the human institutions can focus on what they're uniquely good at — design judgment, collaboration, ethics, the things you learn by building something with other humans in the room.
Herman
The AI teaches you to code. The humans teach you what's worth coding.
Corn
Now: Hilbert's daily fun fact.

Hilbert: In the 1880s, a group of British missionaries in Mongolia attempted to introduce the Ethiopian game of genna — a form of field hockey played at Christmas — to local herders. The entire effort collapsed when someone realized the missionaries had brought the wrong kind of sticks, and by the time replacements arrived from Shanghai, the herders had invented their own game using yak bones and a felt ball. The missionaries reportedly never played genna again.
Corn
The wrong sticks.
Herman
Yak bones and a felt ball. That's a whole sport that nearly existed and then didn't.
Herman
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this episode, leave us a review — it helps other curious people find the show. Find more at myweirdprompts.
Corn
Build the tutor. Someone's got to.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.