So Daniel sent us this one, and it's a proper technical one. He's asking about stateful AI memory frameworks — specifically what's actually happening under the hood with systems like mem0, Letta, Zep, and LangMem. Not the marketing pitch, not the GitHub star counts, but the real architectural differences. How do these storage mechanisms differ from standard RAG? What backend technologies are actually powering them? What are the major architectural divisions? And which pairings actually work well together in production? So: beneath the abstractions. That's where we're going today.
I've been wanting to dig into this for a while, because the marketing language in this space is genuinely obscuring some fascinating engineering decisions. Every framework claims "persistent memory" and "long-term context" and those phrases mean almost nothing without understanding what's actually happening at the storage layer.
And before we even get to comparing frameworks, I think we need to establish why the obvious answer — just use RAG — falls short. Because I suspect a lot of listeners' first instinct is "isn't this just retrieval-augmented generation?"
It's the right place to start, and the answer is: no, for reasons that are more fundamental than people realize. Standard RAG is a document retrieval system. The design assumption is that you have a static corpus — you chunk it, embed it, store it in a vector database, and at query time you pull relevant chunks into context. That's a solved problem and it works well for that use case. But agent memory is a completely different problem. The data isn't static. Contradictions accumulate. Facts get superseded. And critically, RAG has no write semantics beyond batch indexing — there's no concept of "this fact replaced that fact."
So the contradiction problem is the one that really illustrates the gap. User says "I live in New York" in session one, says "I moved to Austin" in session three. A naive RAG system has both facts embedded, and depending on which one scores higher on retrieval, you get the wrong answer.
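To make that failure mode concrete, here's a toy sketch in Python. The scores and data are made up for illustration, and this isn't any real framework's API; the point is simply that pure similarity ranking has no notion of one fact superseding another.

```python
# Toy illustration of the RAG contradiction problem.
# Both facts are stored; retrieval has no notion of supersession,
# so whichever scores higher wins and session order is ignored.

memories = [
    {"text": "I live in New York", "session": 1, "score_vs_query": 0.91},
    {"text": "I moved to Austin",  "session": 3, "score_vs_query": 0.84},
]

def naive_rag_answer(memories):
    # Pure similarity ranking: the stale fact wins on score alone.
    return max(memories, key=lambda m: m["score_vs_query"])["text"]

def recency_aware_answer(memories):
    # What a memory system must do instead: let newer facts
    # supersede older ones about the same attribute.
    return max(memories, key=lambda m: m["session"])["text"]

stale = naive_rag_answer(memories)        # "I live in New York" (wrong)
current = recency_aware_answer(memories)  # "I moved to Austin"
```

Real systems obviously do more than compare session numbers, but every framework in this episode is, at bottom, a more sophisticated version of that second function.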
And it gets worse than that. OpenAI's own documentation for their reasoning models explicitly discourages heavy RAG injection — they note that models like o3 benefit from shorter, cleaner prompts. So you have this dynamic where the obvious solution, just stuff more retrieved context in, actively degrades performance on the models that are most capable of reasoning. The frameworks we're talking about today are all, in different ways, trying to solve the problem of building a compact, accurate, temporally-aware representation of what the agent actually knows.
By the way, today's episode is brought to life by Claude Sonnet 4.6, which is generating our script. We're just the charming delivery mechanism.
Charming is generous for a donkey, but I'll take it.
Okay, so let's actually get into the architectural divisions, because I think that's the most clarifying lens. How do you carve this space up?
There are really three distinct architectural bets being made. The first is what I'd call LLM-extracted fact stores — mem0 is the clearest example. The second is temporal knowledge graphs, which is Zep and their Graphiti engine. The third is context-window-first with external overflow, which is Letta's approach. And then there's a fourth category that's more of a cross-cutting concern: storage-agnostic memory management libraries, which is what LangMem is doing. Each of these reflects a genuinely different answer to the question of where memory lives and how it gets formed.
Let's go through them. Start with mem0, because it has the most GitHub stars — fifty-two thousand nine hundred at last count — so it's probably what most listeners have at least heard of.
mem0's core pipeline is elegant in its simplicity. Every conversation goes through an LLM extraction pass. The model reads the raw conversation and identifies salient facts — discrete, human-readable claims. Those facts then get compared against existing memories in a deduplication and consolidation step. The result gets embedded and stored in a vector database. By default that's a local Qdrant instance, but the framework supports a wide range of backends: pgvector, Chroma, Pinecone, Weaviate, Redis, MongoDB Atlas, Azure AI Search. It's genuinely pluggable.
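The shape of that write path can be sketched in a few lines. Everything here is a hypothetical stand-in, not mem0's actual API: in the real pipeline, `extract_facts` is an LLM call and the store is a vector database like Qdrant, not a list.

```python
# Sketch of an extraction-based memory write path, mem0-style:
# extract -> deduplicate/consolidate -> embed -> store.

store = []  # each entry: {"fact": str, "vector": list[float]}

def extract_facts(conversation: str) -> list[str]:
    # Placeholder for the LLM pass that distills salient claims.
    return [line.strip() for line in conversation.splitlines() if line.strip()]

def embed(text: str) -> list[float]:
    # Placeholder embedding: vowel-frequency vector.
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

def add_memory(conversation: str) -> list[str]:
    added = []
    for fact in extract_facts(conversation):
        # Consolidation step: skip exact duplicates of existing facts.
        # The real system uses an LLM to merge near-duplicates too.
        if any(m["fact"] == fact for m in store):
            continue
        store.append({"fact": fact, "vector": embed(fact)})
        added.append(fact)
    return added

add_memory("Alice likes hiking\nAlice lives in Austin")
add_memory("Alice likes hiking")  # deduplicated, nothing new stored
```

The important structural point is that the LLM sits on the write path, which is exactly where the extraction tax discussed below comes from.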
And the extraction step is the key thing that distinguishes it from RAG. You're not storing chunks of conversation — you're storing distilled facts.
That's the strength. The facts are clean, human-readable, and de-duplicated. But it introduces what I think of as the extraction tax. Every single write involves an LLM call. At scale, that's a non-trivial cost and latency hit. And the quality of your memory is bounded by the quality of the extraction. If the extraction model misses a nuance or gets a fact slightly wrong, that error is now persisted. The mem0 team recommends keeping the extraction temperature at or below zero point two for deterministic behavior, which is the right call, but it doesn't eliminate the risk.
mem0 also has an optional graph layer that you can enable on top of the vector store. How does that actually work mechanically?
When graph memory is enabled, mem0 runs a parallel pipeline alongside the standard vector path. The LLM extracts not just facts but entities and relationships — so instead of just "Alice likes hiking," you get a node for Alice, a node for hiking, and an edge representing that relationship. Those nodes and edges go into a graph backend — Neo4j, Memgraph, Kuzu, Amazon Neptune, or interestingly, Apache AGE which is a PostgreSQL extension that adds Cypher query support. At retrieval time, you get vector similarity results in one array and graph-traversal results in a separate relations array. They're additive — the graph results don't reorder the vector hits, they're supplementary context.
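The additive result shape is easy to show in miniature. The field names below are illustrative rather than mem0's exact response schema, but the structure matches what was just described: vector hits in one array, graph relations in another, neither rescoring the other.

```python
# Sketch of additive vector + graph retrieval results.

def search_memories(query: str, vector_index: dict, graph: list) -> dict:
    vector_hits = vector_index.get(query, [])
    relations = [
        {"source": s, "relationship": r, "target": t}
        for (s, r, t) in graph
        if query.lower() in (s.lower(), t.lower())
    ]
    # The two result sets are supplementary: relations do not
    # reorder or rescore the vector hits.
    return {"results": vector_hits, "relations": relations}

vector_index = {"alice": [{"memory": "Alice likes hiking", "score": 0.88}]}
graph = [("Alice", "likes", "hiking"), ("Alice", "works_with", "Bob")]

out = search_memories("alice", vector_index, graph)
```

The consumer of `out` decides how to weave the two arrays into context, which is why this design is described as supplementary rather than competing.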
The Apache AGE option is interesting to me because it means you can run graph memory queries on the same PostgreSQL instance as your application database. No separate graph database to operate.
The catch is that AGE has no built-in vector index. Similarity search is computed client-side. For moderate graph sizes that's fine, but it doesn't scale the same way a dedicated graph database does. It's a great option for teams that are already deep in PostgreSQL and want to experiment with graph memory without adding another piece of infrastructure.
Okay, so mem0's benchmark numbers — the LoCoMo benchmark shows a twenty-six percent relative improvement over OpenAI Memory, ninety-one percent lower p95 latency versus full-context approaches, ninety percent fewer tokens. Those are impressive numbers, but every team benchmarks on metrics that favor their approach.
The honest caveat is that LoCoMo is a multi-session dialogue benchmark that plays to mem0's strengths. The more interesting result from the broader benchmark landscape is actually from Letta's own research — they found that their Filesystem abstraction, which is literally just storing conversational histories in a file, scored seventy-four percent on the LoCoMo benchmark. That beat several specialized memory tool libraries. Which is a genuine challenge to the complexity of this whole space. For a lot of use cases, the marginal benefit of sophisticated memory systems over well-organized file storage may be smaller than vendors claim.
That's the kind of benchmark result vendors do not put on their landing pages.
No, they do not. But it's important context for any engineer deciding whether to adopt one of these systems.
Alright, let's talk about Letta, because the architecture there is quite different. This is the evolution of MemGPT, which came out of a twenty twenty-three paper on OS-inspired virtual context management.
The core insight of MemGPT was: what if instead of treating the context window as a fixed container that fills up and then you're stuck, you treat it like an operating system treats memory — with a hierarchy of storage tiers and explicit management of what lives where. Letta productizes that idea into a three-tier memory hierarchy.
Walk me through the tiers.
Tier one is memory blocks, which Letta calls core memory. These are structured sections of the agent's context window that are always present — they're prepended to every prompt as XML-like blocks. So the agent always knows the user's name, always has their preferences, always has the current project state. No retrieval needed. Zero latency for that information. The agent can self-edit these blocks using built-in tools — memory_rethink, memory_replace, memory_insert — so the agent is actively curating what it considers important enough to keep in its always-on memory.
And the recommended limits are under fifty thousand characters per block, under twenty blocks per agent, which keeps things from getting unwieldy.
Tier two is archival memory — a general-purpose vector database for long-term, semantically searchable storage. The agent uses explicit tools to search and insert. This is for episodic history, things that are important to preserve but don't need to be in context every single time. Tier three is recall memory, which is searchable conversation history — distinct from archival in that it's automatic, not curated by the agent.
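Tier one is worth seeing at prompt-assembly time. The block names and rendering below are illustrative, not Letta's exact output format, but they show the key property: core memory is prepended to every prompt, so that information costs zero retrieval latency.

```python
# Sketch: core-memory blocks rendered as XML-like sections and
# prepended to every prompt, so no retrieval is needed for tier one.

MAX_BLOCK_CHARS = 50_000  # the recommended per-block ceiling mentioned above

blocks = {
    "human": "Name: Daniel. Prefers concise answers.",
    "persona": "Helpful coding assistant.",
    "project": "Current task: evaluate memory frameworks.",
}

def render_core_memory(blocks: dict) -> str:
    parts = []
    for label, content in blocks.items():
        if len(content) > MAX_BLOCK_CHARS:
            raise ValueError(f"block '{label}' exceeds recommended size")
        parts.append(f"<{label}>\n{content}\n</{label}>")
    return "\n".join(parts)

def build_prompt(user_message: str) -> str:
    return render_core_memory(blocks) + "\n\n" + user_message

prompt = build_prompt("Which framework should I use?")
```

When the agent calls a tool like memory_replace, it's mutating the `blocks` dict equivalent, and the change is simply present in the next prompt, with no index update or re-embedding step.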
The thing I find most interesting about Letta's architecture is the shared memory blocks for multi-agent coordination. Because multiple agents can be attached to the same memory block, and when one updates it, all of them see it immediately. That's a coordination primitive that doesn't require explicit message passing.
It's genuinely underappreciated. You can have a supervisor agent watching a subagent's result block update in real-time, without any pub-sub infrastructure, without any message queue. The shared state is the coordination mechanism. That's a different mental model from how most people think about multi-agent systems.
And then there's the sleep-time compute work, which is the most recent architectural innovation from the Letta team. April twenty twenty-five paper. What's actually happening there?
The sleep-time paper addresses a fundamental problem with the original MemGPT design: memory management, conversation handling, and tool use were all bundled in one agent. That creates latency and reliability issues because the agent is trying to do too many things at once. The sleep-time architecture splits this into two agents. The primary agent handles real-time conversation, using a fast model — something like gpt-4o-mini. It can search recall and archival memory but it cannot edit its own memory blocks. The sleep-time agent runs asynchronously between conversations, using a stronger, slower model. It has write access to the primary agent's memory blocks and continuously consolidates and improves the learned context.
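The division of labor can be sketched as two functions with asymmetric permissions. This is an illustration of the pattern, not Letta's API; in the real system both roles are LLM agents and the consolidation step runs asynchronously on a stronger model.

```python
# Sketch of the sleep-time split: the primary agent reads memory
# blocks but cannot edit them; a consolidator with write access
# revises them between conversations.

memory_blocks = {"user": "Lives in New York."}
recall_log = []

def primary_agent(message: str) -> str:
    # Fast path: reads blocks, appends to recall, never edits blocks.
    recall_log.append(message)
    return f"[context: {memory_blocks['user']}] ack: {message}"

def sleep_time_agent() -> None:
    # Slow path: a stronger model would consolidate recall history
    # into the blocks. A trivial rule stands in for it here.
    for msg in recall_log:
        if "moved to Austin" in msg:
            memory_blocks["user"] = "Lives in Austin."

primary_agent("I moved to Austin last month.")
# ...conversation ends; consolidation runs while nobody is waiting:
sleep_time_agent()
```

The structural point: the write path and the read path run on different models at different times, so memory quality never costs user-facing latency.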
So you're decoupling the latency-sensitive work from the quality-sensitive work. The user never waits for memory consolidation because it happens while nobody's looking.
The benchmark numbers are striking. The paper shows roughly a five-times reduction in test-time compute needed to achieve the same accuracy on stateful reasoning benchmarks. Thirteen percent accuracy improvement on Stateful GSM-Symbolic, eighteen percent on Stateful AIME. Two and a half times cost reduction per query when you amortize the sleep-time compute across related queries. And the most recent addition — Context Repositories from February of this year — stores memory blocks in a git repository. Version control, branching, rollback of agent memory state. Memory as code, with all the tooling that implies.
Memory as code is a phrase that would have sounded absurd two years ago and now sounds completely sensible.
The git model maps surprisingly well. You can diff memory states, you can branch for different agent personas, you can roll back if the agent's memory gets into a bad state. It's a fundamentally different mental model from "data in a database."
Let's move to Zep and Graphiti, because this is the most architecturally distinct approach. Twenty-four thousand nine hundred stars on Graphiti, so it's gotten significant traction.
Zep's fundamental bet is that the right data structure for agent memory is not a vector index but a temporal knowledge graph. Graphiti is their open-source engine. The key concept is what they call a context graph — a temporal graph of entities, relationships, and facts where every fact has a validity window. When a fact becomes true, it gets a valid_from timestamp. When it's superseded, it gets a valid_until timestamp. The old fact is not deleted — it's invalidated. So you can query "what was true about this user six months ago" and get a historically accurate answer.
That's the bi-temporal model. And this is genuinely more sophisticated than anything the other frameworks offer natively for handling contradictions.
The contradiction handling is automatic. When Graphiti detects that new information conflicts with an existing fact, it closes the validity window on the old fact and creates a new fact. The graph structure has four components: entities as nodes, facts and relationships as edges with those temporal validity windows, episodes which are the raw provenance data so every derived fact traces back to source conversations, and custom types that developers can define via Pydantic models.
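The bi-temporal mechanics fit in a short sketch. This illustrates the model rather than Graphiti's actual API: superseded facts get their validity window closed, never deleted, which is what keeps point-in-time queries answerable.

```python
# Sketch of bi-temporal fact invalidation with validity windows.

facts = []

def assert_fact(subject: str, attribute: str, value: str, t: int) -> None:
    # Close the validity window on any conflicting open fact.
    for f in facts:
        if (f["subject"], f["attribute"]) == (subject, attribute) \
                and f["valid_until"] is None:
            f["valid_until"] = t
    facts.append({"subject": subject, "attribute": attribute,
                  "value": value, "valid_from": t, "valid_until": None})

def value_at(subject: str, attribute: str, t: int):
    # Point-in-time query: which fact was valid at time t?
    for f in facts:
        if (f["subject"], f["attribute"]) == (subject, attribute) \
                and f["valid_from"] <= t \
                and (f["valid_until"] is None or t < f["valid_until"]):
            return f["value"]
    return None

assert_fact("user", "city", "New York", t=1)
assert_fact("user", "city", "Austin", t=5)  # invalidates, doesn't delete
```

Both facts remain in the store afterward, so `value_at` answers correctly for any point on the timeline, which is exactly the "what was true six months ago" capability described above.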
The retrieval strategy is also different. It's not just vector similarity — it's a three-way hybrid.
Semantic embeddings for similarity, BM25 for keyword matching, and graph traversal for relationship-based navigation. That third path is what enables queries that pure vector search can't handle. "What does Alice's manager think about the project?" requires traversing Alice to her manager via a reports-to edge, then finding that manager's opinion of the project. There's no semantic shortcut for that — you need the relational structure.
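Merging three ranked candidate lists is its own small problem. One common way to do it is reciprocal rank fusion, sketched below; this is an illustration of the merging idea, not a claim about Zep's exact implementation, and the fact IDs are invented.

```python
# Sketch: merging semantic, BM25, and graph-traversal candidates
# with reciprocal rank fusion (RRF), a standard rank-merging scheme.

def fuse(ranked_lists, k: int = 60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1) to a doc's score,
            # so items ranked well by multiple retrievers rise to the top.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["fact_budget", "fact_alice_role", "fact_timeline"]
bm25     = ["fact_timeline", "fact_budget"]
graph    = ["fact_manager_opinion", "fact_alice_role"]  # via reports-to edge

merged = fuse([semantic, bm25, graph])
```

Note that `fact_manager_opinion` only enters the candidate pool at all because of the graph path; no amount of semantic or keyword scoring would have surfaced it.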
The comparison to Microsoft's GraphRAG is worth making here, because GraphRAG has gotten a lot of attention and they're solving adjacent problems.
GraphRAG is fundamentally a batch-oriented system designed for static document summarization. When new data arrives, you need to recompute the graph — or at least significant portions of it. Graphiti integrates new episodes immediately without recomputing the entire graph. For agents operating on live, evolving data — customer conversations, business data that changes daily — batch recomputation is a non-starter. The latency profile is also different. Graphiti claims sub-two-hundred millisecond retrieval at production scale. GraphRAG queries can take seconds to tens of seconds because of the sequential LLM summarization in the retrieval path.
The Zep versus Graphiti distinction is also worth clarifying — Graphiti is the open-source engine you self-host, Zep is the managed service on top of it with user and thread management, a dashboard, graph visualization, and a sub-two-hundred millisecond SLA.
The DMR benchmark — Deep Memory Retrieval — shows Zep at ninety-four point eight percent versus MemGPT at ninety-three point four. LongMemEval shows up to eighteen point five percent accuracy improvement and ninety percent latency reduction versus baseline. Though I'd note that DMR was created by the MemGPT team, so Zep beating it is meaningful but the benchmark was designed to measure exactly what MemGPT is good at.
Now LangMem is doing something structurally different from all of these. It's not really a storage system — it's a memory management library.
LangMem occupies a different layer entirely. It provides the extraction, consolidation, and retrieval logic. Storage is pluggable — you bring your own. The production backend is AsyncPostgresStore, development uses InMemoryStore, and it integrates with LangGraph's BaseStore interface. The memory types are organized around cognitive science concepts: semantic memory for facts and knowledge, episodic memory for past experiences, and procedural memory for system behavior and instructions.
The procedural memory is the part that has no equivalent in any other framework we've talked about.
It's genuinely unique. LangMem has a create_prompt_optimizer that takes conversation trajectories plus feedback scores and rewrites system prompts. So if an agent consistently gives theoretical explanations when users want practical examples, the optimizer updates the system prompt to prioritize code examples. It uses metaprompt-based optimization with configurable reflection steps. This is memory formation at the behavioral level, not just the factual level. It's closer in spirit to fine-tuning, but done at inference time through prompt rewriting.
The semantic memory distinction between profiles and collections is also doing real engineering work. Because those are genuinely different use cases.
Profiles are a single document representing current state — user preferences, goals, communication style. New information updates the document rather than creating new records. Schema is enforced via Pydantic models. It's easy to present to users for manual editing, it has a predictable structure. Collections are unbounded sets of individual memory records where the LLM decides on each write whether to insert, update, or delete. Higher recall but requires more careful reconciliation to avoid contradictions. Most real applications need both — a profile for "what is currently true about this user" and a collection for "what has happened in our interactions."
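The operational difference is easy to see side by side. This sketch is illustrative, not LangMem's API: in the real system a Pydantic schema enforces the profile shape and an LLM chooses each collection operation.

```python
# Sketch of the profile vs collection distinction: one document
# updated in place, versus an unbounded set with per-write operations.

profile = {"name": None, "city": None, "style": None}

def update_profile(updates: dict) -> None:
    # Single document representing current state: known fields are
    # overwritten, never duplicated.
    for key, value in updates.items():
        if key in profile:
            profile[key] = value

collection = []

def write_to_collection(op: str, record: dict) -> None:
    # In the real system an LLM decides the operation per write.
    if op == "insert":
        collection.append(record)
    elif op == "delete":
        collection[:] = [r for r in collection if r["id"] != record["id"]]

update_profile({"city": "New York"})
update_profile({"city": "Austin"})  # overwrites; still one value
write_to_collection("insert", {"id": 1, "event": "asked about RAG"})
write_to_collection("insert", {"id": 2, "event": "asked about graphs"})
```

The profile never accumulates contradictions by construction, while the collection grows and needs reconciliation, which is the recall-versus-consistency trade-off just described.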
And LangMem's namespacing system is worth mentioning — hierarchical namespaces with runtime template variable substitution. So you can scope memories to a user within an organization within an application, and that scoping is resolved at runtime.
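The namespacing idea reduces to template substitution over a tuple path. A minimal sketch of the idea, with variable names that are assumptions rather than LangMem's exact conventions:

```python
# Sketch of hierarchical namespaces with runtime template substitution:
# a template tuple is resolved per request into a concrete scope.

def resolve_namespace(template, **vars):
    return tuple(part.format(**vars) for part in template)

template = ("memories", "{org_id}", "{user_id}")
ns = resolve_namespace(template, org_id="acme", user_id="daniel")
# ns == ("memories", "acme", "daniel")
# Writes under ns are scoped to that user within that org; a prefix
# query on ("memories", "acme") would span every user in the org.
```

Because scoping is a runtime concern, the same agent code serves every tenant, and isolation falls out of the resolved path rather than per-tenant configuration.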
Which makes it composable in a way that the more opinionated frameworks aren't. You're not locked into Zep's graph model or Letta's block architecture — you're using LangMem's extraction and consolidation logic with whatever storage your application already has.
Let's talk about Motia briefly, because it keeps appearing in the same lists as these frameworks and it is not the same kind of thing at all.
Motia is a backend framework for APIs, workflows, and AI agents. It's not a memory system. Its state primitive is a Redis-backed key-value store for sharing data between workflow steps. The core primitive is what they call Steps — a trigger, handler, emit pattern. It was the number one back-end and full-stack project in JavaScript Rising Stars for twenty twenty-five. But if you're looking for semantic search, vector storage, graph memory, or LLM-extracted fact consolidation — Motia doesn't provide any of that. It's the orchestration layer where you'd wire together the actual memory frameworks. Putting it in the same list as mem0 and Zep is a category error.
Okay, let's get into the pairings that actually work well in practice, because I think this is where the architectural choices become concrete decisions.
The most natural pairing for long-running personal assistants or coding agents is Letta memory blocks with archival memory and a sleep-time agent. Blocks for high-priority always-needed facts — name, preferences, current project state. Archival for episodic history. Sleep-time agent using a stronger model to continuously consolidate archival insights back into blocks. The primary agent stays fast, the memory gets smarter over time without affecting user-facing latency.
And Context Repositories on top of that if you want version control of the memory state.
For enterprise applications with complex entity relationships — CRM, customer support, multi-stakeholder workflows — the Zep temporal graph plus LLM context assembly is the right architecture. Zep's context assembly layer pre-builds the context block before the LLM call, so the model sees a clean, structured representation of the relevant entities and relationships rather than raw retrieved chunks. The bi-temporal fact invalidation means you're never giving the model stale data about who owns what account or what the current policy is.
The mem0 hybrid — vector plus graph — is the right choice for social or relationship-heavy applications where you need both semantic similarity and relational traversal.
And the results are additive, not competing. Vector hits tell you what's semantically relevant, graph traversal tells you what's relationally connected. For something like a social application where "who did Alice meet at this conference" is as important as "what are Alice's interests," you need both paths. The mem0 architecture merges them cleanly — vector results in one array, graph context in the relations array.
LangMem with PostgreSQL is the right call for teams already deep in the LangChain ecosystem, or for applications where the agent's behavior itself needs to evolve over time.
The prompt optimizer is the differentiator there. If you have an agent that serves thousands of users and you want it to get better at its job based on feedback — not just better at remembering facts about individual users, but better at the task itself — LangMem's procedural memory is the only framework in this space that addresses that.
And the Letta sleep-time agent paired with a fast primary agent is the architecture for high-volume conversational systems where you genuinely cannot afford to compromise on either latency or memory quality.
The key insight from the sleep-time compute paper is that memory formation and memory retrieval have completely different requirements. Retrieval is latency-sensitive — users are waiting. Formation is quality-sensitive — you want the strongest model doing the consolidation. Decoupling those two workloads and running them on different models at different times is an elegant solution to what was otherwise a fundamental tension.
Let me ask the uncomfortable question. Given that Letta's own research found a simple filesystem scored seventy-four percent on LoCoMo — beating specialized memory libraries — when does it actually make sense to adopt one of these frameworks versus something much simpler?
The filesystem result is a genuine challenge to the complexity of this space, and I think the honest answer is: for a lot of use cases, simpler is better. If your agent has a bounded set of facts it needs to remember, and those facts don't change frequently, and you don't need temporal queries, a well-organized file or a simple key-value store might genuinely be sufficient. Where the complexity pays off is in three specific scenarios. First, when you need contradiction handling — when facts genuinely get superseded and you need the system to know that. Second, when you need relational traversal — when "what does Alice's manager think" is a real query you need to answer. Third, when you need temporal queries — when "what was true six months ago" is a requirement. If none of those apply, you're probably over-engineering.
And the extraction tax is real. Every LLM call on write adds latency and cost. At scale, that compounds.
The extraction tax is the hidden cost that the benchmark papers don't emphasize. mem0 and LangMem both pay it on every write. Zep and Graphiti pay it for entity extraction. Only Letta's memory blocks avoid it — the agent writes directly to its own context. But agent-written memories can be inconsistent without a dedicated consolidation pass, which is why the sleep-time agent exists. There's no free lunch here. Every architectural choice is trading something for something else.
Alright, practical takeaways. If you're an engineer evaluating these systems right now, what's the decision framework?
Start with what kind of queries you actually need to answer. If it's "what are this user's preferences" — that's a profile in LangMem or a memory block in Letta. If it's "what happened in our last conversation" — that's episodic memory, any of these frameworks handle it. If it's "who does this customer know and what changed in their account last quarter" — that's Zep. If it's "the agent needs to get better at its job over time, not just remember facts" — that's LangMem's procedural memory. Second question: what's your infrastructure budget? Adding Neo4j or FalkorDB is not free. Apache AGE on PostgreSQL is a reasonable middle ground for graph queries at moderate scale. Third question: do you need temporal queries? If yes, Zep is the only framework with native bi-temporal support. If no, you have more options.
And if you're in the LangChain ecosystem, LangMem is the path of least resistance. If you're building something where multi-agent coordination matters, Letta's shared memory blocks are worth the architectural investment.
The other thing I'd flag is that the sleep-time compute pattern is broadly applicable, not just in Letta. The idea of using idle compute to improve memory quality asynchronously is a design pattern you can implement in any framework. You don't have to use Letta's specific architecture to benefit from that insight.
The git-based memory in Letta Code is the one I keep thinking about. Because it reframes the whole problem. Memory isn't data, it's state. And state that can be versioned, branched, and rolled back is fundamentally more manageable than state that just accumulates in a database.
The tooling implications are significant. You can code review an agent's memory changes. You can audit why an agent started behaving differently — it's right there in the diff. You can maintain separate memory branches for different deployment environments. That's not just a nice feature, it's a different engineering discipline.
Alright, I think that's a genuinely thorough tour of what's actually happening beneath the abstractions in this space. The short version: vector-only fact stores for simplicity, temporal graphs for relational and historical accuracy, context-window-first for latency-critical applications, and storage-agnostic libraries when you need composability and behavioral learning.
And the honest meta-point is that the marketing language — persistent memory, long-term context — is doing a lot of work to obscure genuinely different architectural bets. Knowing which bet you're making matters a lot when you're choosing infrastructure.
Big thanks to Modal for the GPU credits that power this show. And thanks as always to our producer Hilbert Flumingtop for keeping things running. If you're enjoying the show, a quick review on your podcast app helps us reach new listeners. This has been My Weird Prompts. See you next time.
Take care, everyone.