#2936: Why AI Still Can't Really Teach You to Code

Code generators ship code. Real tutors build understanding. Why the gap is bigger than you think.

Featuring

Listen

0:00

Episode Details

Episode ID: MWP-3106
Published: May 20
Duration: 27:03
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: ai-agents personalized-ai ai-education

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The tech industry has poured billions into AI that writes code for you, but almost nothing into AI that teaches you to write it yourself. That's the central tension explored in this episode, prompted by a listener question about why no one has built a truly persistent, pedagogical coding tutor.

The fundamental problem is that code generation and code tutoring optimize for completely different things. A tool like Cursor maximizes output velocity — how fast can you ship working code. A pedagogical agent maximizes comprehension and retention — how well you understand what you just did, and whether you'll still understand it next month. Those are not just different features; they're different architectural targets.

Every major attempt so far has fallen short in distinct ways. Khan Academy's Khanmigo has the right tutor persona but no persistent memory — every session starts fresh, so it can't remember that you struggled with nested loops two weeks ago. Replit's Ghostwriter and Codecademy's AI Assistant are reactive explainers bolted onto code generation engines, not proactive teachers. Anthropic's Claude for Education has the right Socratic philosophy but lacks coding-specific scaffolding.

The technical challenge is architectural. A six-month curriculum of twice-weekly sessions generates roughly half a million tokens of dialogue, far exceeding any current context window. The solution likely involves retrieval-augmented generation paired with a student knowledge graph — tracking every concept encountered, associated error patterns, and confidence scores. When a student works on decorators in week eight, the system should know they struggled with higher-order functions in week five, a prerequisite concept.

Beyond memory, the system needs to implement educational principles like Vygotsky's zone of proximal development (keeping tasks in the sweet spot between too easy and impossible) and productive failure (strategically letting students struggle before intervening). This requires multiple layers: a student model, a curriculum scheduler, a pedagogical decision engine, and the LLM itself. That's a lot of infrastructure for what looks like a simple chat window — which is exactly why no one has shipped it yet.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3

Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#2936: Why AI Still Can't Really Teach You to Code

Daniel sent us this one — he's drawing a line between two very different AI experiences. On one side, you've got tools like Cursor and Claude Code that write code for you. On the other, what if an AI sat down with you and said, "Let's learn Python," and then actually taught you — not by generating answers, but by guiding your hand as you typed, remembering what you struggled with three weeks ago, and deliberately bringing it back when you least expected it? He's asking whether any companies have seriously tried to build this, and what delivery mechanisms are emerging. The short answer is: a few have tried, none have nailed it, and the gap is weirder than you'd think.

The distinction he's drawing is the one between a code generation agent and a pedagogical agent. And they optimize for completely different things. A code gen tool optimizes for output velocity — how fast can we get you to a working solution. A pedagogical agent optimizes for comprehension and retention — how well do you understand what you just did, and will you still understand it next month. Those are not just different features. They're different optimization targets, different architectures, different success metrics entirely.

One gets you to ship. The other gets you to know.

And the industry has poured billions into shipping. The tutoring side has gotten scraps and afterthoughts. Which is strange, because the models are demonstrably capable of Socratic dialogue. Claude can absolutely refuse to give you an answer and instead ask you guiding questions. GPT-4o can do the same. The capability is there. What's missing is everything around it.

Let's walk through who's actually tried. Because the prompt is right that most of what exists is what I'd call the "education mode toggle" — you flip a switch and suddenly the same tool that was writing your unit tests is now explaining what a decorator is. That's not a tutor. That's a code generator wearing a tweed jacket.

Let's start with the most visible attempt. Khan Academy launched Khanmigo in 2023, and in March 2024 they released a coding tutor beta built on GPT-4. The idea was that Khanmigo would act as a patient tutor, asking questions, giving hints, refusing to just hand over the answer. And the persona work was genuinely thoughtful — it had a warm, encouraging tone that felt more like a human tutor than a chatbot.

As of mid-2026, it still doesn't have persistent student memory across sessions. Each session starts fresh. The tutor doesn't remember that you struggled with nested loops two weeks ago. It doesn't know which concepts you've mastered versus which ones you've only superficially encountered. The context window resets, and with it, the entire pedagogical relationship.

Which makes it less a tutor and more a very friendly reference librarian you keep having first dates with.

And that's not a knock on Khan Academy — they've been transparent about the limitations and they're working on it. But it illustrates the core problem. You can have the best tutor persona prompt in the world, and without persistent memory, you're still just a chatbot.

What about Replit? They were early in this space.

Replit's Ghostwriter launched in 2023, and the original pitch had strong educational undertones. The idea was that it would help beginners learn while building. But the product incentives pushed hard toward code completion and deployment speed. By 2024, Ghostwriter had largely pivoted to being a productivity tool. The "Explain Code" feature is reactive — you highlight code, it explains it — but it's not proactive. It doesn't sequence concepts. It doesn't track your progress. It's a useful feature, but it's an afterthought bolted onto a code generation engine.

Reactive versus proactive is a useful lens. A real tutor sees you about to make a mistake and intervenes before you commit to it. An explain-code button waits until you're already confused and then responds. One is teaching, the other is tech support.

Codecademy's AI Assistant, which they integrated into their IDE in 2024, falls into the same category. It can answer questions about the current exercise, but it has no long-term curriculum tracking. It knows the lesson you're on because you're literally on that page, not because it has a model of your learning trajectory.

The pattern is: everyone built a helpful Q-and-A bot, and called it a tutor.

Anthropic's Claude for Education, announced in January of this year, is interesting because it approaches the problem from a different angle. They built a dedicated education tier with what they call "Socratic mode" — the model is explicitly prompted to refuse to give direct answers and instead guide students through reasoning. And the refusal is structural, not just a polite suggestion. Claude will push back if you try to get it to do your homework.

Which is the right instinct. The best teachers I've had were the ones who wouldn't let me off the hook.

Claude for Education is designed for general subjects — history, literature, science. It's not specifically optimized for coding pedagogy. Teaching someone to write a Python function has different scaffolding requirements than teaching someone to analyze a poem. In coding, the tutor needs to see your code, understand your specific syntax error, and know whether this error connects to a misconception you had three sessions ago. General-purpose Socratic mode doesn't do that.

It's a philosophy applied broadly, not a coding tutor built from the ground up.

And then there are the startups. Coddy launched in 2024, Maven in 2025. There's an unnamed Y Combinator S25 batch company building what they describe as "a personal Python tutor that remembers everything you've struggled with." But as far as I can tell, none of them have shipped a production system with persistent memory across months. They're all working on the problem, but the demos I've seen are single-session.

The field is basically: Khan Academy has the right idea but no memory, Replit and Codecademy have reactive explainers, Anthropic has the right pedagogical philosophy but no coding specificity, and a handful of startups are promising the moon but haven't landed. That's the landscape.

The question is why. Why has this been so hard to build?

Let me ask you something. You were a pediatrician. When you were training residents, did you just answer their questions, or did you sequence what they learned?

You don't let a resident intubate before they've mastered bag-mask ventilation. And you bring things back — "Remember that case of croup from three weeks ago? Here's a similar presentation, but the stridor sounds different. What's your differential?" That interleaving, that deliberate return to prior material, is what builds durable knowledge. And that's exactly what current AI tutors can't do.

That's the technical challenge. Let's get into it. What's actually stopping someone from building this?

The core problem is memory. And I don't mean memory in the colloquial sense — I mean architectural memory. Let me give you some numbers. A typical 30-minute tutoring session generates roughly 8,000 to 12,000 tokens of dialogue — that's the student's code, the tutor's guidance, the back-and-forth. GPT-4o has a context window of 128,000 tokens. 5 Sonnet has 200,000. At 10,000 tokens per session, you can fit maybe 12 to 18 sessions in a context window before you hit the ceiling.

A proper coding curriculum is what, six months of regular sessions?

If you're meeting twice a week, that's 50 sessions. You're at half a million tokens just in raw transcript, and that's before you account for the code artifacts, the exercises, the error logs. A flat context window can't hold a multi-month learning history. And even if it could, jamming everything into a single prompt is the wrong architecture. The tutor needs structured access to specific things — not "here's everything that ever happened," but "here are the three misconceptions about variable scope that this student exhibited in weeks two, four, and seven.

It's the difference between a filing cabinet and a pile.

This is where things get interesting architecturally. The approach that people are starting to explore is retrieval-augmented generation paired with what's essentially a student knowledge graph. Every concept the student has encountered becomes a node. Attached to each node are the associated error patterns, the exercises they've completed, and a confidence score — how well does the model believe this student understands closures, or list comprehensions, or recursion.

When the student is working on, say, decorators in week eight, the system can pull up the relevant nodes — "this student struggled with higher-order functions in week five, which is a prerequisite concept for understanding decorators.

And that's not just a keyword search. It's structured retrieval based on a curriculum map. The system knows that decorators depend on understanding first-class functions, which depend on understanding functions as objects. If the confidence score on "functions as objects" is low, the tutor should probably revisit that before introducing decorator syntax.

There's a name for this in the education literature, right? The zone of proximal development?

Vygotsky's zone of proximal development. The idea is that there's a sweet spot between what a student can do unaided and what they can't do even with help. Good teaching happens in that zone — the task is just hard enough to stretch the student, but achievable with guidance. And the tutor needs to constantly recalibrate where that zone is. That's a dynamic inference problem. The system has to look at what the student just did, compare it to their history, and decide: is this the right level of challenge, or am I about to lose them?

Which is hard enough for a human tutor who can read facial expressions and hesitation. An AI only has the text.

There's another layer. Education research has something called "productive failure" — formalized by Manu Kapur in a 2008 paper in the Journal of the Learning Sciences. The finding is that students who struggle with a problem before receiving instruction often develop deeper understanding than students who receive direct instruction first. The struggle itself is productive, provided it's structured and eventually resolved.

The tutor shouldn't just withhold answers — it should strategically let the student fail.

For a specific window of time, on specific types of problems. And then it needs to know exactly when to step in with a hint, and exactly what kind of hint. Too early, and you rob the student of the productive struggle. Too late, and they get frustrated and disengage. Can current LLMs do this without explicit pedagogical rules? My sense is no — they need structured guidance. A prompt that says "be Socratic" is too vague. You need something more like "the student has been struggling with this type of error for four minutes. Based on their history, they typically need a nudge about indentation, not about logic. Give a hint that points at the syntax without naming the fix.

We're talking about a system with multiple layers. A student model that tracks knowledge and misconceptions, a curriculum scheduler that sequences concepts and schedules review, a pedagogical decision engine that decides when to hint versus when to let struggle, and then the LLM itself that generates the actual tutor dialogue. That's a lot of infrastructure for what looks to the user like a chat window.

That's exactly why it hasn't been built yet. Code generation is a simpler problem — you give the model context about the codebase and the task, and it produces output. The success metric is "does it work." Tutoring requires all these additional systems, and the success metric is "does the student understand" — which is much harder to measure and takes months to evaluate.

There are some infrastructure pieces emerging. You mentioned Mem0 earlier.

Mem0 is an open-source memory layer for LLMs that reached 10,000 GitHub stars in February of this year. It provides persistent memory across sessions — it can store facts, preferences, and interaction history. LangChain released LangMem in early 2026, which is their memory framework. These are useful building blocks, but they're generic. They're not optimized for pedagogical sequencing. They don't know about the forgetting curve or interleaved practice.

The forgetting curve — this is Ebbinghaus, right? From the 1880s?

Hermann Ebbinghaus, 1885. The finding is that memory decays exponentially after learning. You lose about 50 percent of new information within an hour if you don't actively reinforce it. The way you fight this is spaced repetition — you review material at increasing intervals. An AI tutor should know exactly what a student learned three weeks ago and deliberately bring it back, not because the student asked, but because the forgetting curve says it's time.

Which means the tutor needs a scheduler. Not just a chat interface — an actual curriculum engine that tracks what was learned when, and proactively surfaces old material.

This is where I get excited, because this is an architectural problem that's actually solvable with current technology. We have the models. We have the memory infrastructure, even if it's rough. We understand the pedagogical principles. What we don't have is a company that's put it all together into a coherent product. The pieces are on the table. No one's assembled them.

If you were building this — and I know you've thought about this — what does the memory architecture actually look like?

I'd structure it around two memory systems, inspired by hippocampal indexing theory. You have episodic memory — what the student did. Session transcripts, code they wrote, errors they made, exercises they completed. And you have semantic memory — what the student knows. A structured knowledge graph where each node is a concept with a mastery score, associated error patterns, and links to prerequisite and dependent concepts.

The retrieval works differently for each.

When the student is working on a new problem, the system does a semantic retrieval first — "what concepts are relevant to this problem, and what's the student's mastery level on each?" Then it does an episodic retrieval — "when this student encountered similar problems in the past, what specific mistakes did they make, and what interventions worked?" The two retrievals inform each other. The semantic layer tells you what to worry about; the episodic layer tells you how it manifested.

This all has to happen in milliseconds, mid-conversation.

Which is a real engineering constraint. You can't have the tutor pause for three seconds while it queries a vector database. The retrieval has to be fast enough to feel conversational. That's achievable with current tooling — we're not talking about anything science-fictional here — but it requires careful engineering.

Let me push on something. You said the models are capable. But are they? Can an LLM actually stay in the zone of proximal development for an hour, let alone six months?

I'm not entirely sure about this part, but here's what I think the evidence suggests. In single sessions, with a well-crafted system prompt, yes — Claude and GPT-4o can maintain a Socratic stance quite effectively. The problem is consistency across sessions. Without the student model, the tutor doesn't know where to pick up. And there's also the issue of the model's own tendencies. LLMs are trained to be helpful, which often means giving the answer. Pushing against that training requires strong prompting, and even then, models will sometimes default to being overly helpful.

The model wants to be liked.

The model wants to be useful, which in its training often means providing the solution. A good tutor knows that withholding the solution is sometimes the most useful thing you can do. That's a tension that hasn't been fully resolved in the current generation of models.

Let's talk about what exists today that a learner can actually use. If someone listening wants an AI coding tutor that teaches rather than generates, what do they do right now?

The most practical approach is a hybrid. You use Claude or GPT-4 with a custom system prompt that says something like "Never write code for me. Only ask questions and give hints. When I'm stuck, help me discover the solution rather than providing it." Then you manually track your own progress in a separate document — what you've learned, what you struggled with, what you want to revisit.

Which is basically you being your own knowledge graph.

It's crude, but it works. And it forces you to reflect on your own learning, which is itself pedagogically valuable. The act of writing down "I keep messing up list comprehensions when there's a conditional" is metacognition. You're learning about your learning.

Like being your own doctor but you only have a stethoscope and a notepad.

You're not a doctor. But it's what we've got until someone builds the real thing. The other approach is to use Claude Code or Cursor but in what I'd call "review mode" — you write the code yourself, and then you ask the tool to review it, explain what you could have done differently, and quiz you on the underlying concepts. You're using the code gen tool as a tutor by constraining how you interact with it.

You're forcing it into the pedagogical role through sheer willpower and prompt engineering.

That's kind of the story of AI in 2026, isn't it? We have these incredibly capable general-purpose tools, and we're bending them into specialized roles through prompts and external scaffolding. It works, but it's fragile. One prompt drift and suddenly your Socratic tutor is writing your functions for you.

Let's pull back to the bigger question. Why has the market produced a dozen code generation tools and basically zero dedicated coding tutors?

Code generation has an obvious ROI — you can measure it in features shipped, time saved, developer velocity. A CTO can look at Cursor and say "this will make my team 30 percent faster." An AI coding tutor's ROI is measured in learning outcomes, which take months to materialize and are harder to attribute. The buyer for code gen is the engineering organization. The buyer for a coding tutor is the individual learner, or maybe a bootcamp or university. The budgets are different, the sales cycles are different, the success metrics are different.

The venture capital follows the enterprise money.

Code gen tools can charge per-seat enterprise pricing. Tutoring tools are competing with free YouTube tutorials and 15-dollar Udemy courses. The unit economics are brutal unless you can demonstrate dramatically better outcomes — which you can't do without long-term studies, which you can't run without a product, which you can't build without funding.

The chicken-and-egg problem with a pedagogical twist.

There's also a cultural factor. The tech industry valorizes shipping. "Move fast and break things." Learning is inherently slow. You cannot "move fast" through understanding recursion. Your brain needs time to myelinate those neural pathways. An AI tutor that respects the pace of learning is making a promise that's fundamentally at odds with the industry's tempo.

The first successful AI coding tutor might not come from Silicon Valley. It might come from an education company, or from a country with a different relationship to learning speed.

Khan Academy is the obvious candidate. They have the pedagogical DNA, the existing user base, and they've been working on Khanmigo for three years. Duolingo could extend into coding — they already have the gamification and spaced repetition infrastructure. But their coding product has been fairly basic so far. The dark horse would be a company like Replit that decides to build a dedicated learning track separate from their pro tools, with its own architecture and success metrics.

Or a startup we haven't heard of yet. The Y Combinator batch company you mentioned — "a personal Python tutor that remembers everything you've struggled with" — that's basically the spec.

It's the spec, but saying it and building it are different things. The hard part isn't the chat interface. The hard part is the knowledge graph, the curriculum scheduler, the pedagogical decision engine. Those are backend systems that take years to get right. And you need a lot of student data to train and refine them.

Which brings us to the credentialing question. If an AI tutor with persistent memory actually works — if it can track what you've learned and retained over six months — does that change how we evaluate programmers?

I think it fundamentally could. Right now, we credential programmers based on what they've built — their portfolio, their GitHub, their shipped products. Or we credential them based on artificial assessments — coding interviews, algorithm challenges, certifications. An AI tutor that's been with you for six months has a much richer signal. It knows not just what you can build, but what you actually understand. It's seen you struggle with closures and eventually master them. It's seen which concepts you internalized quickly and which ones took three attempts.

The transcript becomes the credential.

And that's both exciting and unsettling. Exciting because it could surface talent that doesn't look good on traditional metrics — someone who struggled early but showed consistent growth, for instance. Unsettling because it puts an enormous amount of trust in the AI's assessment. What if the tutor's knowledge graph is wrong? What if it misjudges mastery?

What if the student learns to game the tutor the way students game standardized tests?

That's a real concern. Any assessment system creates incentives, and those incentives shape behavior. If the tutor's confidence scores become high-stakes, students will optimize for the scores rather than for understanding. We'd need to design the system so that gaming it is harder than actually learning the material — which is a hard design problem.

Where does this leave us? We've got capable models, emerging memory infrastructure, well-understood pedagogical principles, and a market gap wide enough to drive a truck through. The prompt itself reads like a product requirements document.

It really does. The user basically laid out the problem statement, the competitive landscape, the technical challenges, and the opportunity. If I were an engineer looking for something to build, I'd take this prompt seriously.

The biggest gap isn't model capability — it's curriculum design. No current AI tutor has a built-in pedagogical theory of when to challenge versus when to support. They're all winging it.

The second gap is the persistent student model. The prompt is ephemeral. The knowledge graph is the moat. If you're building this, don't start with the chat interface. Start with the data model. What does a student know? How do you represent a misconception? How do you track the forgetting curve? Get that right, and the tutor behavior emerges from it.

Start with what the system remembers, not what it says.

For learners today, the best approach is that hybrid I mentioned. Use a frontier model with a strict "don't write code for me" system prompt. Track your own progress manually. Treat the AI as a Socratic partner, not a code generator. It's not elegant, but it works, and it builds the metacognitive habits that will serve you long after the AI tutor product finally ships.

For builders listening — this space is wide open. The user's prompt is essentially a challenge. The tools exist. The models are capable. What's missing isn't technology. It's the will to build for learning, not just for shipping.

The question I keep coming back to is who gets there first. Does it come from an education company that understands pedagogy but struggles with the AI infrastructure? Or from a dev tools company that has the infrastructure but is optimized for a completely different success metric?

My money's on neither. I think it comes from someone who's been a teacher and a programmer, who's felt the frustration of both sides, and who builds it because they can't not build it.

That's the romantic answer.

It's the sloth answer. We're patient.

I'll take it. The open question for me is what happens to programming education if this works. If an AI tutor can take someone from zero to proficient in Python over six months, with persistent memory and adaptive curriculum, what does that do to bootcamps? To university CS programs? To the entire edifice of how we teach coding?

It doesn't replace them. But it changes what they're for. If the AI handles the mechanics of learning syntax and debugging and basic patterns, the human institutions can focus on what they're uniquely good at — design judgment, collaboration, ethics, the things you learn by building something with other humans in the room.

The AI teaches you to code. The humans teach you what's worth coding.

Now: Hilbert's daily fun fact.

Hilbert: In the 1880s, a group of British missionaries in Mongolia attempted to introduce the Ethiopian game of genna — a form of field hockey played at Christmas — to local herders. The entire effort collapsed when someone realized the missionaries had brought the wrong kind of sticks, and by the time replacements arrived from Shanghai, the herders had invented their own game using yak bones and a felt ball. The missionaries reportedly never played genna again.

The wrong sticks.

Yak bones and a felt ball. That's a whole sport that nearly existed and then didn't.

This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this episode, leave us a review — it helps other curious people find the show. Find more at myweirdprompts.

Build the tutor. Someone's got to.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.

#2936: Why AI Still Can't Really Teach You to Code

Downloads

You Might Also Like

#2936: Why AI Still Can't Really Teach You to Code