Daniel sent us a question that I've actually been chewing on myself. He's been building retrieval systems, and he's noticed something. When you skip the traditional RAG database and just let an agent fetch documents live at query time, you're still doing vector search under the hood. So the question is — are we just creating a tiny disposable RAG store every single time we have a conversation with an AI? And if that's true, what are we actually trading off when we choose live retrieval over pre-indexed embeddings?
Oh, this is a great question. And by the way, today's script is coming from DeepSeek V four Pro, which feels appropriate for an episode about the guts of retrieval.
I'll allow it. So Daniel lays out two approaches. Option A — you take every Israeli architectural standard, every Taken, and you upsert them into a vector database. Painful week of copying and pasting, but you've got a static index. Option B — you give the agent a skill that says, here's the index page, go find the right regulation when you need it. Live fetching, no maintenance burden. He's gravitating toward B, and honestly, I see why.
He's right to gravitate there, but we should unpack what's actually happening because his intuition about the vectorization step is spot on. Every time an LLM processes text, that text gets converted into embeddings. That's just how transformers work. The input comes in as tokens, those tokens get mapped to vectors in a high-dimensional space, and attention does its thing across those vectors. So yes, in a very literal sense, the model's context window is a temporary vector store.
Temporary is doing a lot of work there. A RAG database is persistent, indexed, optimized for approximate nearest neighbor search. The context window is none of those things.
And this is where most people get confused about what retrieval actually means. When you do traditional RAG, you've got your documents pre-chunked and pre-embedded. Those embeddings are sitting in a vector database with an index — something like HNSW, hierarchical navigable small world graphs. When a query comes in, you embed the query, run an ANN search, get your top K chunks back, and stuff them into the prompt. The search is fast because the index structure does the heavy lifting of pruning the search space.
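To make that pipeline concrete, here's a minimal sketch of the retrieve-then-read flow, assuming FAISS for the HNSW index; the embed() function is a toy stand-in for whatever embedding model you'd actually use.

```python
import numpy as np
import faiss

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in for a real embedding model; deterministic toy vectors for the sketch.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), 384)).astype("float32")

# Indexing time: chunk, embed, and build an HNSW index once.
chunks = ["4.2.1 Minimum elevator cab width shall be...", "5.1 Stairwell fire rating..."]
vectors = embed(chunks)
index = faiss.IndexHNSWFlat(vectors.shape[1], 32)  # 32 = neighbors per graph node
index.add(vectors)

# Query time: embed the query, run ANN search, stuff the top-k chunks into the prompt.
query_vec = embed(["minimum elevator width"])
distances, ids = index.search(query_vec, k=1)
context = "\n\n".join(chunks[i] for i in ids[0])
prompt = f"Answer using only these excerpts:\n{context}\n\nQuestion: minimum elevator width?"
```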
In Daniel's live-fetching approach, you're not doing any of that. The agent is navigating a website, reading pages, and the model's own attention mechanism is doing something that looks a lot like retrieval, but it's not vector search in the Pinecone sense. It's a transformer attending over tokens.
So let me draw the distinction clearly, because this is where the philosophy meets the engineering. When Daniel says every turn creates a little RAG store, what he's really noticing is that the model's key-value cache during inference acts as an associative memory over the tokens in context. Each token's representation gets enriched by attending to every other token. So if you've got a ten-thousand-token context window with a regulation document in it, the model can pull information from any part of that document because the attention mechanism has built those connections.
That's not the same as vector search. The attention mechanism has quadratic complexity in sequence length. It's doing exact attention, not approximate nearest neighbor. So it's more precise in one sense — every token can attend to every other token — but it's also way more expensive per token than a vector database lookup.
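For intuition, here's what exact attention is doing, as a bare numpy sketch: every token scores against every other token, and that n-by-n score matrix is where the quadratic cost comes from.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Exact single-head attention: each output mixes information from all tokens."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (n, n): quadratic in sequence length
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V

n, d = 2_000, 64                 # even a 2k-token window builds a 2k x 2k score matrix
Q = K = V = np.random.randn(n, d)
out = attention(Q, K, V)         # exact recall over the window, at O(n^2) cost
```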
This is where the tradeoffs start to get interesting. Let me put some numbers on this. A typical vector database query — embedding plus ANN search over a million documents — might take ten to fifty milliseconds. The attention computation over a hundred-thousand-token context window in a large model, by contrast, can take seconds per forward pass. So you're paying a latency cost for the live approach that scales with context length.
That's only if you're actually filling the context window every turn. In Daniel's scenario, the agent fetches one regulation document at query time. That document is probably a few thousand tokens at most. The attention cost on a few thousand tokens is negligible compared to running attention over a full hundred-thousand-token window.
But there's another dimension here — retrieval quality. When you pre-embed a corpus of regulations, you're doing chunking. You're deciding how to split documents into semantically coherent pieces. You're tuning chunk size, overlap, and embedding model selection. All of that engineering work goes into making sure that when you search for "minimum elevator width," you get back exactly the relevant paragraph, not the entire thirty-page elevator regulation document.
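A sketch of the chunking step being described, using whitespace-split words as a rough proxy for tokens; the size and overlap values are illustrative, not recommendations.

```python
def chunk(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly `size` words.
    The overlap keeps a sentence that straddles a boundary retrievable from either side."""
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[start:start + size])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]
```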
Daniel's approach just fetches the whole page.
The agent navigates to the URL, grabs whatever HTML is there, and dumps it into context. Maybe it does some cleaning, but it's not doing the careful chunking and retrieval that a well-tuned RAG pipeline does. So you might end up with a lot of irrelevant text competing for attention with the part you actually care about.
There's research on this, right? The lost-in-the-middle problem?
Yeah, and it's persistent. Models are better at attending to information at the beginning and end of their context. Stuff in the middle gets less attention weight. So if your regulation document is twenty pages long and the elevator width spec is on page eleven, you might get worse retrieval than if you'd done proper chunking and vector search.
Daniel's approach trades maintenance burden for retrieval precision. But I wonder whether that tradeoff is actually as bad in practice as it sounds on paper. Because the thing about architectural regulations is they're not novels. They're structured documents with headings, numbered sections, tables. If the agent is smart about how it navigates, it might be able to jump directly to the relevant section.
This is where agent design gets really important. A well-designed skill for Taken retrieval wouldn't just say "go to the index page and read." It would understand the structure of the regulations database. It would parse the index, identify which specific regulation document is relevant, navigate to that document, and then extract the relevant section. That's a multi-step process that requires the agent to do some reasoning about information architecture.
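A sketch of what the fetch side of such a skill might look like, assuming requests and BeautifulSoup; the index URL and the matching logic are hypothetical placeholders for the real site's structure.

```python
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.org/regulations/index"  # hypothetical index page

def find_regulation_url(topic: str) -> str | None:
    """Step 1: parse the index page and pick the link whose title mentions the topic."""
    soup = BeautifulSoup(requests.get(INDEX_URL, timeout=10).text, "html.parser")
    for link in soup.find_all("a", href=True):
        if topic.lower() in link.get_text().lower():
            return requests.compat.urljoin(INDEX_URL, link["href"])
    return None

def fetch_section(url: str, keyword: str) -> str:
    """Step 2: fetch the document and return only the first section mentioning the
    keyword, instead of dumping the entire page into context."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for section in soup.find_all(["section", "article", "p"]):
        if keyword.lower() in section.get_text().lower():
            return section.get_text(" ", strip=True)
    return soup.get_text(" ", strip=True)  # fallback: the whole page
```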
That reasoning step is where things get interesting, because you're essentially pushing the retrieval logic from the vector database into the agent's planning loop. Instead of saying "here's a query, find similar vectors," you're saying "here's a query, figure out where the answer lives, go get it, and read it."
Which is more flexible but also more brittle. Vector search degrades gracefully — if you don't find exactly the right chunk, you'll find something adjacent, and that might still be useful. With agent-based navigation, if the site structure changes or the agent takes a wrong turn, you get nothing.
Or worse, you get the wrong regulation and don't realize it. That's the silent failure mode that keeps me up at night.
You don't sleep, Corn. There's a difference.
But let's talk about the maintenance argument, because I think Daniel's really onto something there. He says with the static RAG approach, if a Taken changes, you're out of luck until you re-index. With the live approach, as long as the links still work, you're always getting the current version.
In a regulatory context, that's huge. Building codes change. Sometimes they change in ways that create liability if you're working from an outdated version. A static RAG database for architectural regulations is a liability management problem. You need to track which version of each regulation you've indexed, monitor for updates, and re-index promptly when things change.
Daniel mentions checking the structure every six months. That's probably fine for architectural standards, which don't change that frequently. But for other domains — think legal research, financial regulations, medical guidelines — six months is an eternity.
The FDA updates drug labeling guidance multiple times a year. The SEC issues new rules and amendments constantly. In those domains, the maintenance burden of a static RAG database becomes prohibitive. You'd need continuous indexing pipelines, change detection, version tracking. It's a whole infrastructure problem.
The live approach isn't just about laziness. It's about correctness guarantees. You're always getting whatever is live on the authoritative source, and if that source is wrong, that's the source's problem, not yours.
There's a subtlety here that I think Daniel is glossing over. He says "whenever they're updated, it doesn't matter, so long as the links don't change." But links do change. Government websites restructure their URLs all the time. The Israel Standards Institute might reorganize their site tomorrow, and suddenly all your index links are broken.
That's where the index page approach helps. If the skill points to a single index page that lists all current regulations, and the agent navigates from there, you're insulated from individual page URL changes. As long as the index page URL stays stable, the agent can find things.
Index pages get restructured too. And there's a deeper problem — what if a regulation is split into two regulations? The agent might find the old reference and not realize there's a new structure.
That's a harder problem, but it's a problem for the static RAG approach too. If you've indexed based on the old structure, your chunks might not correspond to the new regulatory boundaries. At least with the live approach, if the agent can navigate the new structure, it finds the current information. With static RAG, you're serving stale chunks until someone notices and re-indexes.
Let me push on something Daniel said that I think deserves more scrutiny. He says that when the agent fetches text, it has to vectorize it to hold it in reasoning. That's true, but I think it's worth being precise about what that means. The model doesn't maintain a separate vector store for each conversation. The key-value cache is ephemeral: it's built up as the model processes the context and thrown away once the response is generated.
It's not really a RAG store in any persistent sense. It's more like working memory.
And working memory has different properties than long-term memory. Working memory is high-fidelity but transient. The model can attend to exact token sequences in its context with perfect recall — assuming the attention mechanism is doing its job. A vector database, by contrast, gives you fuzzy recall. You're searching for semantic similarity, not exact matches.
Fuzzy recall has advantages. With vector search, you can find information that's conceptually related even if the wording is completely different. If someone asks about "elevator dimensions" but the regulation uses the phrase "vertical transportation conveyance specifications," a keyword search would miss it entirely. Vector search catches it because the embeddings are semantically similar.
Keyword search would miss it, but attention over the full document wouldn't. If the agent fetches the whole elevator regulation document and the model attends over all of it, it'll find "vertical transportation conveyance specifications" because it's reading the whole thing. The question is whether it correctly identifies that as relevant to "elevator dimensions."
That's where model capability comes in. A good model with a large context window can do that kind of cross-referencing surprisingly well. But it's not free — you're paying in tokens and latency for the model to read the entire document.
Let's talk about costs concretely. If you're using Claude or a similar model, you're paying per input token. A typical regulation document might be five thousand to twenty thousand tokens. If every query requires fetching and reading a full regulation document, you're adding significant token costs compared to a RAG approach where you retrieve maybe five hundred to a thousand tokens of relevant chunks.
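Back-of-envelope arithmetic for that comparison; the per-token price is an assumption for illustration, not any provider's actual rate.

```python
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000   # assumed: $3 per million input tokens

def input_cost(context_tokens: int, queries: int) -> float:
    return context_tokens * PRICE_PER_INPUT_TOKEN * queries

rag = input_cost(1_000, queries=10_000)    # ~1k tokens of retrieved chunks per query
live = input_cost(15_000, queries=10_000)  # a full ~15k-token regulation doc per query
print(f"RAG:  ${rag:,.2f} per 10k queries")   # $30.00
print(f"Live: ${live:,.2f} per 10k queries")  # $450.00
```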
You're also saving the cost of maintaining the vector database, the embedding costs for indexing, the storage costs for the embeddings themselves. For a few thousand regulations, those costs are negligible. For millions of documents, they become real.
And this is where the scale of the corpus matters enormously. Daniel's talking about Israeli architectural Takens. How many are there? A few hundred? Maybe a few thousand? That's a tiny corpus. You could literally stuff all of them into a modern context window and skip retrieval entirely.
Some of these regulations are probably quite long. And you'd need to leave room for the conversation history, the system prompt, the user's query, and the model's response. But you're right that at this scale, the engineering tradeoffs are different than they would be for, say, all of US federal regulations.
The Code of Federal Regulations runs to well over a hundred million words. That's not fitting in anyone's context window anytime soon. For that, you absolutely need some kind of retrieval, whether it's vector search, keyword search, or agent-based navigation.
Scale dictates architecture. At small scale, the live-fetching approach is elegant and maintainable. At large scale, you need indexing and retrieval.
Here's the thing — even at large scale, there's a hybrid approach that combines the best of both. You could maintain a vector index of document summaries or section headings, use that for retrieval to identify which documents are relevant, and then have the agent fetch those specific documents live. That gives you the scalability of vector search with the freshness guarantees of live fetching.
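A sketch of that hybrid shape, reusing the hypothetical embed() and FAISS index from the earlier sketch: vector search routes over short summaries, and the winning documents are fetched live so they're always current.

```python
import requests

def hybrid_retrieve(query: str, summary_index, urls: list[str], k: int = 2) -> list[str]:
    """Route with cheap ANN over document summaries, then fetch the chosen docs live."""
    q = embed([query])                    # hypothetical embed() from the RAG sketch
    _, ids = summary_index.search(q, k)   # summaries are short, so indexing stays cheap
    # Live fetch means the text is whatever the authoritative source serves right now.
    return [requests.get(urls[i], timeout=10).text for i in ids[0]]
```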
That's essentially what Daniel's index page approach does, just with a different retrieval mechanism. The index page is a curated list of documents. Vector search is an automatically generated index. Same function, different implementation.
The automatically generated index has the advantage of handling queries that don't map cleanly to document titles. If someone asks "what are the fire safety requirements for stairwells in residential buildings above four stories," that might span multiple regulations. A good vector search can pull relevant chunks from several documents. An agent navigating by document title might struggle to know which documents to look at.
Unless the index page is well-structured and the agent is good at reasoning about information architecture. But that's a big unless.
Let me pull on a thread that I think ties this all together. Daniel asks whether every conversation turn essentially creates a little RAG store. I think the more interesting question is what kind of memory architecture emerges from different retrieval strategies. With static RAG, you've got a curated long-term memory — the vector database — and each query does a targeted lookup. With live fetching, you've got no long-term memory at all — the agent starts from scratch each time and builds its understanding from whatever it finds.
That's a really useful framing. Static RAG is like having a research assistant who's read all the regulations and can pull relevant passages from memory. Live fetching is like having a research assistant who's really good at using the library catalog but hasn't read anything in advance.
Both approaches converge on the same thing at query time — text in the context window that the model attends over. The difference is in how that text gets selected and how fresh it is.
What's the actual accuracy tradeoff? If I'm building an architectural assistant and I need it to never give wrong information about building codes, which approach do I trust more?
I don't think there's a clean answer. With static RAG, your retrieval accuracy depends on your chunking strategy, your embedding model, and your vector index parameters. With live fetching, your accuracy depends on the agent's navigation ability and the quality of the source website's information architecture.
Both can fail in different ways. Static RAG can retrieve outdated or irrelevant chunks. Live fetching can navigate to the wrong page or fail to parse important details.
There's actually some interesting work on this. Researchers have been comparing retrieval-augmented generation with what they call agentic retrieval, where the model actively searches and browses rather than relying on a pre-built index. The agentic approaches tend to do better on tasks that require multi-step reasoning or combining information from multiple sources, but they're slower and more expensive per query.
Because the agent is doing multiple rounds of search and reading, right? It's not just one retrieval step.
And each round adds latency and token cost. For a simple factoid query like "what's the minimum elevator width," the overhead of agentic retrieval might not be worth it. For a complex query like "compare the elevator requirements for hospitals versus residential buildings and identify any contradictions," the agentic approach might be essential because no single chunk contains the answer.
That's the multi-hop reasoning problem. And it's where I think Daniel's approach really shines, because the agent can follow chains of references. One regulation might reference another, and the agent can navigate to that second regulation and read it too.
Static RAG can do multi-hop too, but it requires either retrieving a lot of chunks and hoping the relevant ones are all there, or doing iterative retrieval where each round's results inform the next query. Agent-based navigation handles multi-hop more naturally because it's inherently sequential.
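A sketch of that iterative loop, with a hypothetical llm() callable deciding after each round whether the gathered context answers the question or names the next query.

```python
def iterative_retrieve(question: str, search, llm, max_hops: int = 5) -> str:
    """Multi-hop retrieval: each round's results inform the next query."""
    context: list[str] = []
    query = question
    for _ in range(max_hops):
        context.extend(search(query))  # one retrieval round (vector search, web fetch, etc.)
        decision = llm(
            f"Question: {question}\nContext so far: {context}\n"
            "Reply 'ANSWER: <answer>' if the context suffices, "
            "or 'QUERY: <follow-up search query>' if not."
        )
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        query = decision.removeprefix("QUERY:").strip()
    return llm(f"Best-effort answer for: {question}\nContext: {context}")
```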
We've got freshness, maintenance burden, multi-hop capability, and retrieval precision all pulling in different directions. How do you actually make a decision?
I think you start with the failure modes and work backward. For architectural regulations, what's the cost of getting something wrong? If your assistant tells you the wrong minimum elevator width and someone builds an elevator that doesn't meet code, that's a real problem. Maybe a lawsuit. So correctness matters a lot.
Correctness means different things for each approach. For static RAG, correctness means your index is current and your retrieval is accurate. For live fetching, correctness means the agent reliably finds the authoritative source and extracts the right information.
There's a third dimension too — verifiability. With static RAG, you can log exactly which chunks were retrieved and show your work. If something goes wrong, you can trace it back to a specific chunk that was outdated or poorly chunked. With live fetching, the agent's navigation path is harder to audit. You can log the URLs it visited, but understanding why it chose those URLs and whether it read them correctly is more opaque.
That's a strong argument for static RAG in high-stakes domains. Audit trails matter.
Daniel's approach has its own audit advantage — the source is always live. If someone questions a recommendation, they can click the link and see the current regulation. With static RAG, the chunk in your database might not match what's currently on the website, and proving that it was correct at the time of retrieval requires version tracking.
We're circling around the same tradeoff from different angles. Let me try to synthesize this. Daniel's live-fetching approach is essentially trading retrieval precision and latency for freshness and maintainability. The cost is that you're relying on the agent's navigation skills and the model's attention mechanism rather than a purpose-built vector index. The benefit is that you never serve stale data and you don't have to maintain an indexing pipeline.
I'd add that the tradeoff looks different at different scales. For a few hundred documents, live fetching is probably fine and maybe even better. For millions of documents, you need some kind of index. For everything in between, you're making a judgment call.
There's also the question of document structure. Regulations are highly structured — numbered sections, cross-references, tables. A good RAG pipeline can preserve that structure in chunking. An agent doing live fetching might just dump raw HTML into context and lose some of the structural cues that help the model understand what's important.
That's a real concern, but it's also addressable. A well-designed agent skill could parse the HTML, extract the relevant structured content, and format it cleanly for the model. That's extra engineering work, but it's doable.
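A sketch of that cleanup step: strip the fetched page down to headings and body text so the model sees document structure instead of raw HTML. The tag names are assumptions about what the source markup uses.

```python
from bs4 import BeautifulSoup

def clean_page(html: str) -> str:
    """Reduce raw HTML to heading-prefixed plain text, dropping scripts and chrome."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # remove non-content noise entirely
    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name.startswith("h"):
            lines.append("#" * int(el.name[1]) + " " + text)  # keep the heading level
        else:
            lines.append(text)
    return "\n".join(lines)
```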
At which point you're building a custom retrieval pipeline anyway, just one that runs at query time instead of indexing time.
And query-time processing has different constraints than indexing-time processing. At indexing time, you can afford to do expensive operations — run large embedding models, do sophisticated chunking, build complex index structures. At query time, every millisecond of processing adds to the user's latency. So you're limited in how much processing you can do.
Unless you cache aggressively. If the same regulation gets fetched frequently, you can cache the cleaned version and avoid re-processing it each time. But now you've got a cache invalidation problem, and we're back to the maintenance burden we were trying to avoid.
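And a sketch of that caching compromise: a time-to-live cache where the TTL is exactly the staleness you've decided you can live with. The one-hour default is arbitrary.

```python
import time
import requests

_cache: dict[str, tuple[float, str]] = {}

def fetch_cached(url: str, ttl_seconds: float = 3600) -> str:
    """Return a cached copy if it's fresher than ttl_seconds, otherwise refetch.
    The TTL *is* the invalidation policy: longer means cheaper but staler."""
    now = time.time()
    if url in _cache and now - _cache[url][0] < ttl_seconds:
        return _cache[url][1]
    body = requests.get(url, timeout=10).text
    _cache[url] = (now, body)
    return body
```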
There's a famous computer science quote about there being only two hard problems: cache invalidation and naming things. And off-by-one errors.
You just named three things.
That's the joke.
So where does this leave us? I think Daniel's intuition is basically right — for his use case, live fetching has a lot to offer. But I want to push back on the idea that we're just creating a little RAG store every turn. That's true in a superficial sense, but it misses what makes RAG actually work.
RAG isn't just about having vectors in memory. It's about having the right vectors in memory, selected through a retrieval process that's optimized for recall and precision. The context window gives you perfect recall over whatever happens to be in it, but the selection process — how that text got there — is what determines whether the system actually works.
That selection process is where all the hard engineering lives. Chunking, embedding, indexing, reranking — that whole pipeline exists to solve a specific problem that attention alone doesn't solve.
Let me give a concrete example. Say you've got a thousand regulation documents, each fifty pages long. A user asks about elevator widths. With good RAG, you embed the query, find the three most relevant chunks across all thousand documents, and put maybe fifteen hundred tokens into context. The model sees exactly the relevant information, clearly presented, with minimal noise.
With live fetching, the agent has to figure out which document to look at, navigate to it, and then either read the whole thing or try to find the relevant section. If it reads the whole fifty-page document, that's maybe twenty-five thousand tokens of mostly irrelevant text. The model has to sift through all of that to find the one paragraph about elevator widths.
That sifting is expensive and error-prone. The lost-in-the-middle problem means that if the elevator width information is buried somewhere in the middle of that document, it might get less attention weight than the introductory boilerplate at the top.
The RAG approach isn't just about finding information — it's about presenting it in a way that maximizes the model's ability to use it correctly. That's the piece that I think gets overlooked when people talk about just fetching documents live.
And yet for Daniel's specific case — Israeli architectural Takens — I suspect the live approach works well in practice. The corpus is small, the documents are probably well-structured with clear headings, and the queries are likely specific enough that the agent can navigate to the right section quickly.
Also, and this is worth saying explicitly, Daniel's not choosing between RAG and no RAG. He's choosing between pre-built RAG and just-in-time RAG. The agent is still doing retrieval — it's just doing it by navigating a website instead of querying a vector database.
The retrieval mechanism matters less than the retrieval quality. If the agent can reliably find the right document and extract the right information, the fact that it's using URL navigation instead of cosine similarity is an implementation detail.
I think the deeper insight from Daniel's question is about the blurring line between retrieval and reasoning. When an agent navigates a website, reads documents, and synthesizes information, it's doing something that looks a lot like research. The retrieval isn't a separate step from reasoning — they're interleaved.
That's the agent paradigm in a nutshell. Instead of retrieve-then-read, you've got a loop of search, read, reason, search again, read more, synthesize. Each step informs the next.
That loop is powerful but unpredictable. With traditional RAG, you know exactly what's going to happen — embed query, search index, return top K, generate response. With an agent, the path is emergent. It might find the answer in two steps or get lost in a rabbit hole of cross-references for twenty steps.
That unpredictability is both the strength and the weakness of the agent approach. When it works, it can handle queries that would stump a simple retrieve-and-read system. When it doesn't work, it fails in ways that are hard to diagnose and harder to fix.
For production systems where reliability matters, you probably want to constrain the agent's behavior pretty tightly. Daniel's skill approach — here's the index page, here's how to navigate it, here are the scripts to use — is a good example of adding constraints to make the agent more predictable.
Those constraints are essentially encoding domain knowledge about the information architecture. Someone who knows the Takens system well has translated their mental model of how to find regulations into a structured skill that the agent can follow.
Which is a form of retrieval engineering, just at a different level of abstraction. Instead of tuning chunk sizes and embedding models, you're tuning navigation instructions and parsing logic.
I want to come back to something Daniel said at the end of his prompt. He asks what the actual costs are in terms of retrieval accuracy when doing it this way, given that ultimately it's all vectors. I think the answer is that the vector part isn't where the accuracy difference lives. The difference is in the selection process.
Say more about that.
Once text is in the context window, the model's attention mechanism works the same way regardless of how it got there. The accuracy difference between RAG and live fetching comes from which text gets selected and how it's presented. RAG gives you fine-grained control over chunking and retrieval. Live fetching gives you freshness and simplicity. But the vector math under the hood is the same.
The question isn't really about vectors at all. It's about information architecture and maintenance strategy.
And that's why I think Daniel's approach is smart for his use case. He's optimizing for the thing that matters most — keeping the information current with minimal human effort — and accepting some reduction in retrieval precision that probably doesn't matter much for his specific domain.
Because architectural regulations are well-structured, the queries are specific, and the corpus is manageable.
If he were building a legal research system for all of Israeli case law, I'd give different advice. But for a few hundred well-organized regulation documents, live fetching with a well-designed navigation skill is probably the right call.
There's one more advantage we haven't mentioned. When you do live fetching, you're not just getting the regulation text — you're getting context. You see the regulation in its original formatting, with its section numbers, its cross-references, its official header that says this is the current version as of this date. That contextual information can help the model understand the authority and scope of what it's reading.
A chunk in a vector database is stripped of a lot of that context. You might get the paragraph about elevator widths, but you lose the information that this is section four point two point one of the building accessibility standards, published in twenty twenty-three, with a technical amendment from twenty twenty-four.
For a professional architect, those details matter. They need to know which regulation they're complying with, not just what it says.
We've got freshness, context preservation, simplicity of maintenance on one side. Retrieval precision, latency, cost control on the other. Pick your tradeoff based on your domain.
I think that's a good summary. But let me push one more time on the philosophical question Daniel raised. Is every conversation turn creating a RAG store? I think the answer is yes in the sense that the context window is a vectorized representation of information that the model retrieves from via attention. But it's no in the sense that the retrieval mechanism — how information gets into that context window — is completely different from what we mean by RAG.
The term RAG has become so overloaded that it's almost meaningless at this point. Originally it meant a specific architecture — retrieve documents, augment the prompt, generate a response. Now people use it to mean anything that involves looking up information.
Daniel's approach is still retrieval-augmented generation, just with a different retrieval mechanism. The augmentation step is the same — text goes into the prompt. The generation step is the same — the model produces a response. The only difference is how the retrieval happens.
Which brings us back to the core insight. The retrieval mechanism is an implementation detail. What matters is whether the right information ends up in the context window at the right time.
Whether you can trust that it's correct and current.
Whether you can debug it when it goes wrong.
Whether it's cost-effective at your scale.
Basically, all the normal engineering tradeoffs, just applied to a new domain.
That's the thing about this field. The technology changes fast, but the fundamental questions — what are you optimizing for, what are the failure modes, how do you measure success — those stay the same.
Alright, I think we've given Daniel a thorough answer. Live fetching with agent skills is a legitimate architecture. It trades some retrieval precision for freshness and simplicity. For small, well-structured corpora, it's probably the right call. For large-scale or high-recall applications, traditional RAG still has advantages.
The philosophical question — are we just creating little RAG stores every turn — the answer is kind of, but not in any way that changes how you should think about system design.
Now, before we wrap, I believe we have a fun fact incoming.
And now: Hilbert's daily fun fact.
Hilbert: The platypus holds the record for the largest number of electroreceptors among electrolocating monotremes, with approximately forty thousand mucous gland electroreceptors concentrated in its bill, a sensory system so refined it can detect the faint electrical signals of a shrimp's muscle contractions in complete darkness. During the high medieval period, no European naturalist had any idea this creature existed, and when the first platypus specimen reached England, it was widely assumed to be a taxidermy hoax.
I'm trying to imagine a medieval monk's reaction to a platypus, and honestly, I think "hoax" is the most charitable interpretation available.
Forty thousand electroreceptors. That's forty thousand more than I have.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop, and thanks to Daniel for another question that sent us down a very enjoyable rabbit hole. If you want more episodes like this one, head over to myweirdprompts. We'll be back soon.