Hey everyone, welcome back to My Weird Prompts. I am Corn, and I am joined as always by my brother, the man who has spent the last forty-eight hours straight reading about vector quantization, graph theory, and the topological nuances of high-dimensional latent spaces.
Herman Poppleberry here. And you are not wrong, Corn. I have been deep in the stacks this week. There is so much movement in the architecture of how these systems actually remember things that I can barely keep up. We are seeing a massive shift in the industry right now, moving away from the "brute force" approach of just throwing more tokens at a model and toward a much more elegant, structured way of managing information. It is a great time to be a nerd, Corn.
Well, today is a perfect day for that obsession because we have a great prompt from Daniel in Jerusalem. Daniel is a long-time listener and a serious tinkerer in the AI space. He is thinking about the future of retrieval augmented generation, or RAG as we all call it. He is basically looking at the way we currently handle context in this very show and realizing it feels a bit reactive. Currently, we have an agent that decides when to pull in data from our past episodes, but Daniel wants to move toward something more proactive and holistic. He is asking about the methods for the retrieval aspect of RAG that go beyond just vector databases and embeddings. He wants to know how we get to that "long-standing memory" feel without the system just waiting for a specific keyword to trigger a search.
This is such a timely question. Daniel is right on the money here. We have spent the last couple of years obsessed with the generation part of the equation, making the models smarter, faster, and more creative. But the retrieval part—the part where the system actually finds the right information to talk about—has remained relatively basic for a lot of implementations. We are still largely stuck in this paradigm of taking a user question, turning it into a mathematical vector, and finding the closest matches in a database. It works, but it is often noisy and, as Daniel pointed out, very reactive. It is like having a library where the only way to find a book is to shout a word and hope the right one falls off the shelf.
It is interesting that he mentions our own internal system. For those who do not know, we have a huge archive of over eight hundred episodes. When we talk about a topic, our system looks at what we have said before to provide that continuity. But it does rely on a specific trigger. Daniel is pushing us to think about a more intelligent ingestion process. Before we get into the heavy technical alternatives, Herman, why is the traditional vector search starting to feel like a bottleneck for power users like Daniel in two thousand twenty-six?
The main issue is semantic density versus precision. When you turn a sentence into an embedding, you are essentially flattening a complex idea into a single point in a high-dimensional space. That is great for finding general topics, but it is terrible for specific details or complex relationships. For example, if you ask a system about a specific date or a very niche technical term, a vector search might return something that sounds similar in tone but is factually irrelevant. It is also very limited by the chunk size. If your context is chopped up into small pieces to fit into a database, the system loses the broader narrative of the information. It is looking at the world through a keyhole. You lose the "why" and the "how" because you are only looking at the "what" of a single paragraph.
That makes sense. It is like trying to understand a whole book by only looking at individual sentences scattered on the floor. You might find the word "apple," but you do not know if it is about a fruit, a computer company, or a metaphor for original sin. So, if we want to move beyond that reactive, keyhole-style retrieval, where do we start? Daniel mentioned he is interested in more holistic ingestion.
One of the biggest shifts we are seeing right now is the move toward Hybrid Search as a baseline, not an extra feature. This combines the semantic power of vectors with the old-school precision of keyword search, specifically things like BM twenty-five. BM twenty-five is a ranking function used by search engines to estimate the relevance of documents to a given search query based on term frequency and document length. When you combine them using something called Reciprocal Rank Fusion, or R-R-F, you get the best of both worlds. You get the conceptual understanding of the vector and the exact match capability of the keyword. But even that is still a bit reactive. To get to the proactive level Daniel is talking about, we have to look at things like Query Transformation and Expansion.
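The Reciprocal Rank Fusion step Herman describes is only a few lines. A minimal sketch, assuming two hypothetical ranked result lists (the episode IDs are made up for illustration); the constant of sixty is the value suggested in the original RRF formulation:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine multiple ranked lists.
    Each document's score is the sum of 1/(k + rank) over every
    list it appears in; documents ranked highly by both the keyword
    pass and the vector pass float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from a BM25 pass and a vector pass:
bm25_hits = ["ep752", "ep810", "ep401"]
vector_hits = ["ep752", "ep810", "ep399"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

A document that both retrievers agree on ("ep752" here) outranks one that only a single retriever found, which is exactly the "best of both worlds" behaviour described above.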
Query transformation. I remember we touched on this briefly in episode eight hundred nine when we talked about context engineering. Is that where the system rewrites the user's question before it even starts looking for answers?
Exactly. One of the most powerful methods here is called Hypothetical Document Embeddings, or HyDE. Instead of taking Daniel’s prompt and searching the database directly, the system first asks a large language model to write a fake, hypothetical answer to that prompt. Then, it takes that fake answer and uses it to search the database. The reason this works so well is that a dense vector of a question often looks very different from a dense vector of an answer in the latent space. By generating a hypothetical answer first, you are searching for documents that look like the information you want, rather than documents that look like the question you asked. It significantly improves the hit rate because you are matching "answer to answer" rather than "question to answer."
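The HyDE flow is easy to sketch. In this toy version the LLM call is stubbed out with a lambda and the "embedding" is a bag-of-words counter standing in for a real dense encoder; both are assumptions made purely for illustration:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(question, corpus, generate_hypothetical):
    """HyDE: embed a generated hypothetical *answer*, not the question,
    so the search matches answer-shaped documents in the index."""
    fake_answer = generate_hypothetical(question)
    qvec = embed(fake_answer)
    return max(corpus, key=lambda doc: cosine(qvec, embed(doc)))

corpus = [
    "Reranking uses a cross encoder to score query document pairs",
    "Graph retrieval traverses entities and relationships",
]
# Stub for an LLM call -- a real system would prompt a model here.
fake_llm = lambda q: "a cross encoder scores the query and document together"
best = hyde_search("how does reranking work?", corpus, fake_llm)
```

The question alone shares almost no vocabulary with the right document, but the hypothetical answer does, which is the "answer to answer" matching described above.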
That is fascinating. It is almost like the system is visualizing what the right answer should look like and then going out to find the real-world evidence that matches that vision. It is a bit like a detective who forms a theory of the crime first so they know which clues to actually look for in the field. But Daniel mentioned moving toward a more holistic ingestion across large amounts of context data. Does that lead us into the world of Graph RAG?
Oh, absolutely. Graph RAG is where things get really exciting and where the "holistic" part really shines. Instead of just having a flat list of text chunks, you use a model to extract entities—people, places, concepts, technologies—and the relationships between them to build a Knowledge Graph. Imagine the context as a map of relationships rather than just a pile of documents. When Daniel asks a question, the system does not just find a piece of text; it traverses the graph. It sees that episode seven hundred fifty-two is related to the concept of answer engines, which is related to the concept of "pigeon English," which is a term we used to describe how people talk to search engines. It allows for multi-hop reasoning. You can find information that is three or four steps away from the original query but is contextually vital.
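The multi-hop traversal Herman mentions can be sketched with a plain adjacency map and a breadth-first walk. The entities and edges below are illustrative, not real show data:

```python
from collections import deque

# A tiny knowledge graph: entities extracted from episodes, with
# "related to" edges between them.
graph = {
    "episode 752": {"answer engines"},
    "answer engines": {"pigeon English", "episode 752"},
    "pigeon English": {"search behaviour"},
    "search behaviour": set(),
}

def multi_hop(start, max_hops):
    """Breadth-first traversal: collect everything within max_hops
    of the starting entity -- the 'trail of associations'."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

context = multi_hop("episode 752", max_hops=3)
```

A vector search over "episode 752" would never surface "search behaviour", but the graph walk reaches it in three hops.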
I can see how that would feel much more proactive. It is not just waiting for a keyword match; it is understanding the structure of our entire conversation history. It feels more like how a human brain retrieves memories. We do not just search for a word; we follow a trail of associations. If I think about my childhood home, I might then think about the tree in the backyard, then the tire swing, then the time I fell off it. A vector search would just find "childhood home." A graph search finds the tire swing.
Precisely. And Microsoft Research has done some incredible work on this with their GraphRAG framework. They use a technique called community detection. They take the whole graph and group related nodes into "communities" at different levels of granularity. Then they generate summaries for each of these communities. So, when a user asks a broad, holistic question like "What is the general philosophy of My Weird Prompts regarding AI safety?", the system doesn't have to search every single episode. It looks at the high-level community summaries that represent the "AI Safety" cluster of our show. It provides a much more comprehensive and synthesized answer than a standard RAG system ever could.
That sounds like it solves the "lost in the weeds" problem. But what about the structure of the documents themselves? Daniel is talking about long-standing memory. Does that involve how we actually store the text?
It does. That leads into another method called Hierarchical RAG or parent-document retrieval. This is a way to solve the chunking problem I mentioned earlier. Usually, we chop text into small bits, say three hundred words each, so the vector search is precise. But three hundred words often isn't enough context for the LLM to understand the nuance. With Hierarchical RAG, you store small chunks for the actual vector search to ensure high precision, but those small chunks are linked to much larger "parent" documents or even the full transcript. When the system finds a relevant small chunk, it does not just grab that sentence; it pulls in the entire surrounding context of the parent document. This gives the model the "big picture" that Daniel is looking for. It allows the system to be holistic without being overwhelmed by noise during the initial search phase.
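A minimal parent-document retrieval sketch, with made-up chunks and parents, and naive term overlap standing in for the vector search:

```python
# Small chunks are indexed for precision; each one points back to a
# much larger parent passage that gets returned instead.
parents = {
    "ep400": "Full ten minute transcript segment about hybrid search ...",
    "ep512": "Full ten minute transcript segment about fine tuning ...",
}
chunks = [
    {"text": "hybrid search combines bm25 and vectors", "parent": "ep400"},
    {"text": "fine tuning changes model weights", "parent": "ep512"},
]

def retrieve_with_parent(query):
    """Score chunks by term overlap (a stand-in for vector search),
    then swap the winning chunk for its whole parent document."""
    terms = set(query.lower().split())
    best = max(chunks, key=lambda c: len(terms & set(c["text"].split())))
    return parents[best["parent"]]

context = retrieve_with_parent("how does hybrid search work")
```

The search is done against the small, precise chunk, but the generation model receives the full surrounding passage.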
So, instead of just getting the one sentence where we mentioned a specific technology, it pulls in the entire ten-minute discussion about that technology. That would definitely make the AI feel more like it actually knows us and our history. It is like if I asked you about a specific person, and instead of just giving me their name, you reminded me of the whole dinner party where we met them.
It really would. And there is a newer approach called RAPTOR, which stands for Recursive Abstractive Processing for Tree-Organized Retrieval. This is specifically designed for the kind of long-standing memory Daniel is talking about. It recursively clusters and summarizes text at different levels of abstraction. So, the system has a top-level summary of the entire podcast, mid-level summaries of different themes like AI engineering or creative writing, and then the raw text of the episodes at the bottom. Depending on the question, the retriever can choose the right level of detail. If you ask a broad question about our philosophy, it looks at the top-level tree. If you ask about a specific piece of code Daniel wrote, it goes to the leaves of the tree. It is a very sophisticated way of managing what we call "global" versus "local" context.
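The level-picking idea can be sketched as a tiny tree. The summaries and the breadth heuristic below are invented placeholders; a real RAPTOR build would cluster and summarize the text recursively with an LLM:

```python
# Tree levels: 0 = raw episode text (leaves), 1 = theme summaries,
# 2 = one top-level summary of the whole show (root).
tree = {
    0: ["raw transcript of episode 400", "raw transcript of episode 810"],
    1: ["summary of the AI engineering theme",
        "summary of the creative writing theme"],
    2: ["top level summary of the whole podcast"],
}

def pick_level(query):
    """Crude routing heuristic (an assumption, not part of RAPTOR):
    broad 'philosophy' style questions go to the root, everything
    else goes to the leaves."""
    broad_markers = {"philosophy", "overall", "general"}
    return 2 if broad_markers & set(query.lower().split()) else 0

def raptor_retrieve(query):
    return tree[pick_level(query)]

docs = raptor_retrieve("what is the general philosophy of the show")
```

A broad question is answered from the root summary; a question about a specific episode drops down to the leaves.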
That sounds like a massive upgrade for our internal systems. I can imagine Daniel’s experiments with personalized AI systems would benefit from that hierarchical structure. It solves the signal-to-noise ratio problem we discussed in episode eight hundred ten. If you have millions of tokens of memory, you cannot just dump it all in. You need that organized tree. It is about having different resolutions of memory available at once.
You really do. And there is one more piece of the retrieval puzzle that people often overlook, which is Reranking. This is the second pass. The initial retrieval might pull in twenty or fifty potential documents using a fast, "bi-encoder" model. Many of those will be irrelevant because the initial search is fast but a bit sloppy—it is just looking at distance in a vector space. Then, you use a much more powerful but slower model, called a cross-encoder, to look at those fifty documents and the original query together. The cross-encoder can look at the actual interaction between the words in the query and the words in the document. It is like a second interview for the top candidates. It ranks them much more accurately, ensuring that the top three or five pieces of context provided to the generation model are actually the best ones.
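The two-stage pipeline can be sketched like this. Both scoring functions here are toy term-overlap stand-ins; in a real system the first pass would be a fast bi-encoder and the second a genuine cross-encoder model such as a BGE or Cohere reranker:

```python
def fast_retrieve(query, corpus, k=50):
    """First pass: cheap ranking standing in for a bi-encoder.
    Fast but sloppy -- it over-fetches candidates."""
    terms = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def cross_encoder_score(query, doc):
    """Second pass: a toy joint score (fraction of query terms in the
    document) standing in for a slow, accurate cross-encoder."""
    terms = set(query.lower().split())
    return len(terms & set(doc.lower().split())) / len(terms)

def rerank(query, corpus, top_n=3):
    candidates = fast_retrieve(query, corpus)
    return sorted(candidates,
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_n]

corpus = [
    "the bastille fell during the french revolution",
    "the french press is a way to brew coffee",
    "napoleon rose to power after the revolution",
]
top = rerank("french revolution bastille", corpus, top_n=1)
```

The structure is the point: over-fetch cheaply, then spend the expensive model only on the shortlist.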
It is like having a quick-sorting assistant who brings you a stack of folders, and then you personally go through those folders to find the exact page you need. It adds a layer of intelligence to the process that a simple database query just cannot provide. It is the difference between a librarian who says "The books on history are in aisle four" and a librarian who says "I found these six books on the French Revolution, and these three are the most relevant to your specific question about the Bastille."
It turns the retrieval from a simple database lookup into a multi-stage pipeline of intelligence.
Dorothy: Herman? Herman, are you there? Sweetheart, I am sorry to bother you, but I was just at the store and I could not remember if you said you wanted the spicy pickles or the regular ones for the dinner on Friday. And did you ever find that blue Tupperware I lent you? I need it for the brisket.
Mum? Mum, I am actually recording the show right now. We are right in the middle of a segment.
Dorothy: Oh, is that what this is? I thought I heard voices. Hello Corn! I hope you are eating enough. Herman, just tell me about the pickles and I will let you go back to your computer things.
Hi Dorothy! I am doing great. Herman, you better tell her about the pickles or we will never hear the end of it.
Regular pickles, Mum. Regular is fine. And I think the Tupperware is in my car. I will bring it over tonight. I have to go now, we are live.
Dorothy: Okay, bubbeleh. Don't work too hard. I will see you later. Love you!
Love you too, Mum. Goodbye. Sorry about that, Corn. She always seems to have a sixth sense for when the microphones are hot.
No worries at all. It is always good to hear from Dorothy. And honestly, it is a perfect segue into what we were talking about. Her calling you with a specific, domestic reminder is a form of proactive context injection. She knew you had a dinner coming up, she knew there was a pickle requirement, and she initiated the retrieval of that information based on her internal "calendar" of your life.
Ha! I suppose you are right. That was a very human-centric, proactive R-A-G system right there. Although a bit more disruptive than I would like for our production pipeline. She basically performed a "trigger-less" retrieval based on a temporal event.
Well, let's get back to Daniel's prompt. He is talking about moving from reactive to proactive. We have talked about the retrieval methods like Knowledge Graphs, Hierarchical R-A-G, and Reranking. But how do we actually make the system proactive? How does the AI decide to go looking for things before we even ask?
That is the shift toward Agentic R-A-G. In a traditional system, the R-A-G process is a fixed step: user asks, system retrieves, system generates. In an agentic system, the model is given tools to search its own memory and is allowed to reason about what it needs. When Daniel sends a prompt, the agent might decide it needs to look at three different sources. It might say, "I need to check the recent episode transcripts, but I also need to look at Daniel's GitHub activity from last week to understand the context of this specific code question." It becomes a loop. The agent can perform an initial search, look at the results, evaluate them—using something like Self-R-A-G—and then decide to perform a second, more targeted search based on what it just learned.
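That search, evaluate, refine loop can be sketched directly. The judge and the refinement step below are stubs where a real agent would call an LLM; the corpus and queries are invented:

```python
def agentic_retrieve(query, search, judge, refine, max_rounds=3):
    """Agentic retrieval loop: search, evaluate the results, and if
    they look insufficient, rewrite the query and try again."""
    results = []
    for _ in range(max_rounds):
        results = search(query)
        if judge(query, results):      # good enough? stop the loop
            break
        query = refine(query, results)  # otherwise rewrite and retry
    return results

# Toy stand-ins: an exact-match 'search', a non-empty 'judge', and a
# 'refine' that drops the last query term. A real agent would reason
# about all three with a model.
corpus = {"rag reranking": ["cross encoder notes"], "rag": ["overview"]}
search = lambda q: corpus.get(q, [])
judge = lambda q, r: len(r) > 0
refine = lambda q, r: " ".join(q.split()[:-1])
hits = agentic_retrieve("rag reranking typo", search, judge, refine)
```

The first search fails, the agent rewrites the query, and the second search succeeds: retrieval as a series of actions rather than one database call.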
So the "Retrieval" part of R-A-G becomes a series of actions taken by the agent, rather than just a single database call. That seems much more aligned with what Daniel is doing with his experiments in proactive context development. It is about giving the AI the autonomy to curate its own awareness. It is not just "searching," it is "researching."
Precisely. And we are seeing this move toward what some researchers call "LongRAG." This is an architecture where you do not chunk the text at all. Instead, you use these massive context windows we have in two thousand twenty-six—like the two-million-token models—to ingest entire documents at once. The "retrieval" then becomes finding the right document, and the model's internal attention mechanism handles the rest. This eliminates the "lost in the middle" problem where models would forget information buried in a long chunk. When you combine this with the proactive nature of agents, you get a system that can truly handle the holistic ingestion Daniel is talking about. The agent selects the "books" (the long documents), and the model's huge context window allows it to read them all at once.
I want to dig into that signal-to-noise ratio again. If we are moving toward holistic ingestion and massive context windows, how do we prevent the system from getting distracted by irrelevant details? If I mention a sandwich I ate in episode four hundred, and we are talking about AI automation in episode eight hundred thirty-three, how does the system know to ignore the sandwich?
That is where Contextual Compression and "Long-Context Filtering" come in. Before the retrieved information is sent to the final generation stage, a smaller, specialized model goes through the retrieved text and strips out everything that is not relevant to the current task. It is like a high-speed editor. It takes five thousand words of retrieved context and compresses it down to the five hundred words that actually matter. There are also "late interaction" models, like ColBERT, which keep more of the original information available during the search process, allowing for much finer-grained filtering. This saves on tokens, reduces latency, and significantly improves the accuracy of the final answer. It is one of the most effective ways to handle the "sea of context" Daniel described.
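A contextual-compression pass can be sketched as a filter over retrieved passages. Term overlap here is a stand-in for the small relevance model Herman describes, and the passages are made up:

```python
def compress(query, passages, min_overlap=1):
    """Keep only the retrieved passages that share enough vocabulary
    with the query; everything else is stripped before generation."""
    terms = set(query.lower().split())
    kept = []
    for passage in passages:
        if len(terms & set(passage.lower().split())) >= min_overlap:
            kept.append(passage)
    return kept

retrieved = [
    "the sandwich in episode 400 was excellent",
    "automation pipelines reduce manual review work",
    "agents can schedule automation tasks proactively",
]
context = compress("ai automation agents", retrieved)
```

The irrelevant sandwich memory is dropped before it ever reaches the generation model, which is exactly the distraction problem Corn raised.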
So, it is not just about finding the information; it is about refining it. It is a multi-stage process of discovery, selection, and then distillation. It is like gold mining. You have to move a lot of dirt to find the nuggets, but you don't want to deliver the dirt to the jeweler.
And for someone like Daniel, who is technically literate and deeply engaged with these systems, the next step is often what we call "Active Context Management." This is where the user, or a sub-agent acting on their behalf, is constantly updating a "working memory" file. Instead of searching the whole database every time, the system maintains a high-level summary of the current state of the world, the project, or the conversation. As new information comes in, that summary is updated. This is very much what we discussed in episode seven hundred ninety-five regarding sub-agent delegation. You have one agent whose entire job is just to keep the context file fresh and relevant. It is like a "scratchpad" that the AI carries with it.
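The "scratchpad" idea can be sketched as a bounded rolling memory. The keep-the-newest-facts rule below is a placeholder for the summarizing sub-agent Herman describes, and the facts are invented:

```python
from collections import deque

class Scratchpad:
    """Working memory that is actively maintained: new facts are
    appended and the oldest fall off, so the rendered summary is
    always fresh and always small."""

    def __init__(self, max_facts=5):
        self.facts = deque(maxlen=max_facts)

    def update(self, fact):
        self.facts.append(fact)

    def render(self):
        """The text a model would see prepended to every prompt."""
        return "Current state:\n" + "\n".join(f"- {f}" for f in self.facts)

pad = Scratchpad(max_facts=2)
pad.update("Daniel is experimenting with GraphRAG")
pad.update("episode 833 covers retrieval methods")
pad.update("a reranker was added to the pipeline")
state = pad.render()
```

The model never searches cold; it starts every turn with a current summary already in context.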
That feels like the ultimate version of what Daniel is looking for. It is a personalized, evolving memory that does not just sit in a database waiting to be queried, but is actively maintained and ready to be used at a moment's notice. It is less like a library and more like a living assistant who is always keeping their notes up to date. They are anticipating what you might need based on the current trajectory of the project.
I love that. And honestly, for our own podcast, that is the direction we are heading. We have so much data now—transcripts, show notes, listener feedback, Daniel's prompts—that a simple vector search just doesn't cut it anymore. We need that agentic layer that understands the narrative arc of the show over the last three years. We need a system that knows when we are contradicting something we said in episode two hundred and can proactively flag it.
It is interesting to think about the historical context here too. We have moved so far from the early days of just trying to get a model to remember the previous sentence. Now we are talking about managing thousands of hours of audio and millions of lines of text as a single, cohesive memory. It is a shift from "memory as a storage problem" to "memory as a reasoning problem."
It really is a massive shift. And for the listeners who are building these systems, the takeaway is clear: do not just rely on your vector database. If you want your AI to feel intelligent and proactive, you have to invest in the retrieval pipeline. You need to look at reranking, knowledge graphs, and contextual compression. The "R" in R-A-G is where the real engineering is happening right now. We have largely "solved" generation for most common use cases; the frontier is now how we feed that generator the right fuel.
So, if Daniel is looking for practical next steps for his experiments, what would you suggest he prioritizes? He mentioned he is experimenting with long-standing memory and personalized systems.
I would say the first priority should be implementing a Reranker. It is the lowest-hanging fruit with the highest impact on quality. You can use models like B-G-E-Reranker or Cohere's Rerank three. After that, I would look into Graph R-A-G, especially for a podcast or a personal history where the relationships between topics are just as important as the topics themselves. If he can map out how his different projects and ideas connect over time using something like Neo-four-j or even a simple network-x graph, the AI will be able to make much more insightful connections. And finally, I would look at that "Active Context Management" idea—having a dedicated process that summarizes and updates his "current state" so the model always has a high-level overview before it even starts a search.
That is a solid roadmap. It moves the system from being a passive observer of data to an active participant in its own knowledge management. It is about building a system that doesn't just store information, but understands it. It is about moving from a "reactive pull" to a "proactive push" of information.
And that is the difference between a tool and a collaborator. When the system can proactively bring up a relevant point from a year ago because it understands how it connects to what you are doing today, that is when the magic happens. It is what makes the AI feel like it is truly part of the team, not just a fancy search engine. It is about creating a shared history.
I think Daniel is already well on his way there. His prompts always push us to think about these deeper structural questions. It is not just about "how do I use this tool," but "how do we rethink the architecture of intelligence." He is looking at the plumbing of the mind, in a way.
And that is why we love getting these prompts from him. They keep us on our toes and force us to dive into the latest research. I mean, I have three papers on my desk right now about "ColBERT," which is a multi-vector retrieval model that provides even more granular matching than standard embeddings by storing a vector for every single token in a document. There is always a deeper level to explore. We haven't even talked about "Matryoshka Embeddings" yet, which allow you to truncate vectors to different sizes depending on your latency needs.
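Mechanically, Matryoshka-style truncation is just a head-slice plus renormalization. This only works for embeddings actually trained with a Matryoshka objective; the vector below is made up for illustration:

```python
import math

def truncate(vec, d):
    """Keep the first d dimensions of an embedding and renormalise
    to unit length, trading accuracy for speed and storage."""
    head = vec[:d]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.01]
short = truncate(full, 3)
```

Because a Matryoshka-trained embedding front-loads the most important information, the truncated vector remains usable for search at a fraction of the cost.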
Well, before you disappear back into those papers and start explaining the math of Matryoshka dolls, Herman, we should probably start wrapping this up. We have covered a lot of ground today—from the limitations of basic vector search to the power of Knowledge Graphs, Hierarchical R-A-G, Reranking, and the move toward Agentic, proactive systems. We have looked at how Daniel can move from a reactive "pull" to a holistic, intelligent "ingestion."
It has been a great deep dive. And I think it is a perfect follow-up to our previous discussions on context engineering. The more we do this, the more I am convinced that context is the most important ingredient in the entire AI stack. It is the difference between a generic response and a truly personalized, useful insight.
I agree. It is the fuel that makes the engine actually go somewhere useful. Before we sign off, I want to remind everyone that if you are enjoying these deep dives into the weird and wonderful world of AI prompts, please leave us a review on your favorite podcast app. Whether it is Spotify, Apple Podcasts, or wherever you listen, those reviews really help other curious minds find the show. It helps the algorithms "retrieve" us for other listeners!
They really do. We appreciate all the support from our listeners. It is what keeps us going and allows us to spend forty-eight hours reading about graph theory without feeling too guilty.
You can find all our past episodes, including the ones we referenced today like episode eight hundred ten on agentic interviews and episode seven hundred ninety-five on sub-agent delegation, at myweirdprompts.com. We have a full archive there with a search feature—which, by the way, is powered by a hybrid search system—so you can dig into any topic we have covered over the last eight hundred plus episodes.
And if you want to get in touch with us, like Daniel did, you can use the contact form on the website or email us directly at show at myweirdprompts dot com. We love hearing your ideas, questions, and weird prompts. Especially the ones that make me have to buy new textbooks.
Our show music was generated with Suno, which is another great example of how these generative systems are becoming part of our creative workflow. It is all about that human-in-the-loop collaboration.
It really is. Alright, I think that is a wrap for today. I am going to go find that Tupperware for my Mum before she calls back and interrupts us again. I think it might be in the trunk under my spare tire.
Good luck with the pickles and the Tupperware, Herman. Thanks for listening to My Weird Prompts, everyone. We will see you in the next one.
Goodbye!