#3751: Source-Restricted vs. Open Retrieval: How to Lock Down Your LLM

When should an LLM be locked to specific documents, and when should it search the web? A practical framework for grounding decisions.

Featuring

Listen

0:00

Episode Details

Episode ID: MWP-3930
Published: Jun 20
Duration: 29:59
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: rag ai-safety legal-technology

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When do you want an LLM locked inside a room with only the documents you handed it, and when should it be free to draw on its training knowledge and live web research? The answer isn't as simple as "legal documents = closed, everything else = open." The dividing line is task shape, not domain. A single agent workflow might need a closed step to extract clauses from a contract, then an open step to research relevant case law. The problem is that no major framework exposes a clean "closed_corpus = true" primitive. LangGraph forces you to gate tool availability per node and rely on system prompts that models still leak through. LlamaIndex offers composable query engines but no per-generation toggle. The Anthropic and OpenAI SDKs leave you building orchestration from scratch. The practical takeaway: you need per-step control, not application-level grounding decisions, and system prompts are a soft constraint — not a guarantee. For legal and compliance use cases, source-restricted retrieval is non-negotiable; for differential diagnosis, you want open retrieval. The frameworks give you bits and pieces, and you stitch them together.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3

Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#3751: Source-Restricted vs. Open Retrieval: How to Lock Down Your LLM

Daniel sent us this one — he wants us to dig into open-corpus versus closed-corpus retrieval, and the broader distinction between what people call closed-world and what he's calling additive grounding in LLM and agentic systems. He's noticing the terminology hasn't settled — NotebookLM popularized "closed corpus," people toss around "closed-world" and "grounded," but there's no actual standard. The real question is practical: when do you want a model blending your documents with its own knowledge and live web research, and when do you want it locked in a room with only the pages you handed it? And crucially — do today's agentic frameworks actually give you a clean switch for that, or are we all just hacking around the gaps?

This is exactly the right question at exactly the right time. The terminology piece alone is worth unpacking, because engineers are building systems right now where the difference between "only these documents" and "these documents plus whatever else you know" is the difference between a defensible legal filing and a malpractice suit. And the frameworks — I've been poking at this across LangGraph, LlamaIndex, the Anthropic and OpenAI SDKs — the answer is: mostly no, there's no clean primitive. You get bits and pieces, and you stitch them together.

Bits and pieces and stitching. The fiber arts of production AI.

The quilt of despair. But let's start with the words, because the words are genuinely a mess. NotebookLM, when it launched the whole "grounded only in your sources" feature, called it "source grounding" and described the mode as being restricted to the documents you upload. They never actually used the phrase "closed corpus" in the product itself — that term bubbled up from the community, from practitioners describing what NotebookLM was doing. And it stuck because it's evocative.

"Closed corpus" does have a nice ring. Sounds faintly liturgical.

Like a set of canonical texts and thou shalt not deviate. But here's where it gets tangled. In the academic literature, "closed-world" versus "open-world" has been around for decades in knowledge representation and databases — closed-world assumption means anything not explicitly stated is false, open-world means anything not stated is unknown. That's a different axis entirely from whether a language model is allowed to use its parametric knowledge during generation.

The database people are using "closed-world" to mean "the set of facts is complete and final," while the LLM people mean "don't look outside this bucket of documents.

Yes, and those overlap but aren't the same. A database with a closed-world assumption can still answer queries by inference — "is there a flight to Chicago at 3 PM?" and if it's not in the table, the answer is no. An LLM constrained to a closed corpus is doing something different: it's generating text conditioned on a retrieval set, but the model itself still has all that parametric knowledge sitting there. The constraint is operational, not architectural.

Which is why "closed corpus" as a term is almost misleading. The model isn't closed. The retrieval pipeline is.

That's why I've started seeing people use "source-restricted retrieval" as a more precise term — it names what's actually happening. You're restricting the retrieval step to a specific index or document set. The generation step is still a general-purpose model that could, in principle, hallucinate or draw on its training data. You're relying on prompting, system instructions, and sometimes architectural gating to keep it honest.

The terms floating around — "closed corpus," "open corpus," "closed-world generation," "grounded," "source-restricted" — nobody's convened the standards body.

Nobody's convened it, and the frameworks aren't helping because they each use their own vocabulary. LangChain talks about "retrievers" and "tools" and you gate access by which tools are available to which node in the graph. LlamaIndex has "query engines" you can configure against specific indexes. The Anthropic SDK talks about "tools" and you can simply not provide a search tool to a particular agent step. OpenAI's Agents SDK has "tools" per agent and "handoffs." None of them expose a top-level parameter called "closed_corpus equals true." You build the constraint from lower-level primitives.

The vocabulary question Daniel raised — has it standardized? — the answer is no, and the fragmentation is actually a symptom of the fact that the frameworks haven't given us a clean abstraction.

When the primitive exists, the name settles around it. When everyone's building it themselves, everyone names it themselves.

Alright, so let's get to the decision that actually matters. When do you lock the doors?

Let's walk through concrete cases. The strongest case for strictly closed — source-restricted, no parametric knowledge, no web retrieval — is anything with legal or regulatory liability attached. You're reviewing a 200-page master services agreement, and you want the model to tell you what's in it. If the model starts filling in "standard" clauses from its training data that aren't actually in this contract, you've got a problem that could end up in court.

The model confidently tells you there's a limitation of liability capped at fees paid, because that's in 90 percent of the contracts it was trained on. But this one doesn't have that clause. Congratulations, you just advised your client they're protected when they're not.

That's not a hypothetical. There was a study out of Stanford and the University of Washington last year that tested LLMs on legal document QA under different grounding conditions. When models were allowed to draw on parametric knowledge alongside retrieved documents, error rates on specific clause identification nearly doubled compared to source-restricted generation.

So this isn't a subtle effect.

It's enormous. The model's training distribution acts like a prior that overpowers the specific document in front of it. And the more confident the model sounds, the more dangerous it is, because the user can't tell the difference between "this is in the document" and "this is what usually appears in documents like this." So legal and contract work — strict closed corpus, no question. Compliance auditing is the same category. You're checking a policy document against a regulatory framework, and both are provided. If the model starts inventing regulatory requirements from other jurisdictions, you're done.

What about medical?

Medical is fascinating because it's not one category. If you're summarizing a specific patient record for a clinician — closed. You want only what's in that record. But if you're doing a differential diagnosis assistant where the model is supposed to bring the full breadth of medical literature to bear on a set of symptoms, you want open retrieval with live search over PubMed or UpToDate or whatever your trusted sources are. The distinction is whether the task is "tell me about this specific artifact" or "bring everything relevant to bear on this question.

The dividing line isn't the domain. It's the task shape.

The task shape. Which is why any framework that bakes the grounding decision at the application level — "this app is closed" — is making a category error. You need per-generation control. One step in your agent's plan might be "read this contract and extract all indemnification clauses — closed." The next step might be "now research recent case law on indemnification enforceability in Delaware — open.

That brings us to what Daniel was really driving at. How do you actually wire this up?

Let's go framework by framework, because the answer is different everywhere and none of them are great. Starting with LangGraph — this is probably where the most production agent work is happening right now. LangGraph gives you a state graph where each node is a computation step. You can configure different nodes with different tools, different models, different system prompts. So the pattern is: you have a "closed node" that calls a retriever tool backed by a specific vector index of your documents, and its system prompt says something like "You may ONLY use information from the retrieved documents. If the answer is not in the documents, say so. Do not use your own knowledge." And a separate "open node" that has access to a web search tool and a more permissive system prompt.

The gating is done through tool availability plus system prompt constraints.

That's it. There's no "closed world" flag. You're doing it with the prompt and by selectively providing or withholding tools. And the prompt part is load-bearing in a way that makes me nervous, because system prompt adherence on "don't use your own knowledge" is not perfect. Models still leak.

How leaky are we talking?

It depends on the model. Claude and GPT-4 do reasonably well with strong system prompt instructions to restrict to provided sources. But "reasonably well" isn't "guaranteed." If you ask a model "does this contract include a force majeure clause?" and it doesn't, but the model knows force majeure clauses are standard, you'll occasionally get "While no explicit force majeure clause is present, Section 12 on unforeseen events may function similarly..." — which is the model being helpful and also being wrong in a legally significant way.

The model is doing the thing you'd want a human associate to do — flagging a gap — but the associate knows they're speculating, and the model doesn't.

A human says "heads up, I didn't see a force majeure clause, you might want to check that." The model blends the observation with the speculation into one confident-sounding paragraph. So with LangGraph, the pattern works but the system prompt is a soft constraint, not a hard one.

What about LlamaIndex?

LlamaIndex actually has a slightly cleaner abstraction here, though it's still not a first-class "closed versus open" toggle. They have the concept of a "query engine" that wraps a retriever plus a response synthesizer. You can configure a query engine against a specific index, and if you set the response mode to "compact" or "refine," it's effectively doing source-restricted generation over that index. The response synthesizer gets the retrieved nodes and the query, and it's instructed to answer from those nodes. If you want open retrieval, you configure a different query engine that has access to multiple indexes or a web search tool. The "query engine" is a useful unit of composition — you can route to different query engines based on the question type, which gets you partway to per-step control. But it's still not a parameter you pass at generation time that says "closed world true." You're composing different objects.

The Anthropic and OpenAI SDKs?

Both are lower-level than LangGraph or LlamaIndex — they're giving you the model, the tools interface, and the system prompt, and you build the orchestration yourself or with a thin agent loop. In the Anthropic SDK, you define tools per request. So if you want a closed generation step, you simply don't include a web search tool in that request, and your system prompt says to answer only from the provided context. If you want an open step, you include the search tool. The gating is implicit in which tools you pass to which API call.

The frameworks that give you the most control are also the ones that give you the least abstraction. You're back to system prompts and tool lists.

That's actually the story here. The higher-level the framework, the more it wants to make decisions for you about what tools are available and when. LangChain's older agent abstractions, before LangGraph, had a global tool list — every step had access to everything. That made closed-world steps essentially impossible without building your own routing layer. LangGraph fixed this by letting you scope tools to nodes, but you have to know to do it. The ease-of-use gradient runs exactly opposite to the control gradient. And for closed-world use cases, control is non-negotiable.

Let's talk about NotebookLM specifically, since Daniel mentioned it as the thing that popularized the term. How does it actually enforce source grounding?

NotebookLM is the most opinionated and the most constrained, which is why it's the most reliable for the closed-corpus use case. You upload documents, and the model is restricted to those documents. Under the hood, Google is doing a combination of things — retrieval is scoped to your document set, the system prompt is aggressively restrictive, and they've likely done additional fine-tuning or RLHF to reduce parametric knowledge leakage. The product doesn't expose any way to say "now also search the web." The closed corpus is the whole product.

Which makes it great for exactly one thing and useless for anything that needs outside context.

That's the tradeoff. NotebookLM optimized for trustworthiness within a defined document set, at the cost of never being able to tell you "by the way, the regulation this contract references was updated last month." For a legal researcher, that's a feature. For an analyst trying to understand market conditions, it's a bug.

Let's map the concrete cases. You mentioned legal and compliance as strict closed. What about the analysis tasks where additive grounding improves the answer?

The clearest case is anything involving current events or time-sensitive information. If you're analyzing a company's earnings call transcript, you want the transcript itself — closed — but you also want the market's reaction, analyst notes, competitor moves that happened since. That's additive. The model brings the transcript as the anchor, and retrieves context around it.

The transcript is the spine, the web retrieval is the flesh.

Another case: policy analysis. You've got a proposed piece of legislation — you want the model grounded in the text, but you also want it to know about related bills in other states, expert commentary, constitutional challenges that have been raised. If you go strictly closed, you get a summary of what the bill says. If you go additive, you get an analysis of what it means.

There's a spectrum here between "tell me what this document says" and "tell me what to think about this document.

The closer you get to "tell me what to think," the more you want additive grounding — but also the more you need to be transparent about what sources informed the analysis. If the model pulled in a think tank's critique and wove it into the analysis, the user needs to know that.

Which is where citation and provenance become load-bearing.

In an open retrieval setting, the citation isn't a nice-to-have. It's the only thing that lets the user distinguish "this came from the document" from "this came from a Reddit comment the model found.

The Reddit comment of authority.

The world's most confident expert. So the practical guidance starts to crystallize. Closed when the task is about the document itself — summarization, extraction, comparison of provided sources. Open when the task requires context the documents don't contain. And in the open case, citations are mandatory, and you should probably show the user which claims came from which source.

Now the part I think a lot of engineers are banging their heads against — when you need both in the same workflow. Step one is closed extraction, step two is open analysis.

This is where the frameworks really show their seams. In LangGraph, you can do it by having separate nodes with separate tool configurations, like we said. But there's a subtlety that trips people up: the state object carries context between nodes. If your open node can see the retrieved documents from the closed node, and it has web search available, it might use web search to "verify" the document contents — which sounds helpful but can introduce errors if the web sources are outdated or conflicting.

The state management is part of the gating problem, not just the tool list.

You might want the open node to see the extracted claims from the closed node but not the raw documents, or vice versa. The information architecture of your state object becomes part of your grounding strategy. I've seen teams spend more time on state design than on prompt engineering for exactly this reason.

What about the "skip grounding" approach? Is anyone doing a flag that just says "this generation, don't retrieve, don't ground, just use the model's parametric knowledge?

It's surprisingly uncommon as a named feature. Most frameworks assume you always want retrieval if you've set up a retriever. The way engineers actually do it is by having a conditional edge in the graph — "if this step is marked as closed, route to the retriever node; if it's marked as skip-retrieval, route directly to the generation node with no context." But it's a pattern you implement, not a flag you set.

The "clean switch" Daniel asked about doesn't exist. You're always assembling it from lower-level parts.

I think that's worth naming explicitly for practitioners. If you're starting a new project and you know you'll need per-step grounding control, budget time for building the gating infrastructure. It's not in the box. LangGraph gives you the graph structure to build it. LlamaIndex gives you composable query engines. The SDKs give you per-request tool control. But the "closed-world mode" toggle is something you design and implement.

There's an irony here. The thing that NotebookLM made look simple — "just ground in your sources" — turns out to be one of the harder things to get right in an agentic system, precisely because agents are designed to be open and flexible.

The agentic framing assumes the model should have access to whatever it needs. Closing the world runs against the grain of the architecture. Every framework's happy path is "the model has tools, it decides what to use." Forcing it to not use tools, to not draw on its own knowledge, is swimming upstream.

Which is why the system prompt approach is so common and so insufficient. You're asking the model to voluntarily restrict itself.

"Please ignore the vast corpus of knowledge embedded in your weights. " It works most of the time. Most of the time is not the standard for legal or compliance work.

What do the really paranoid teams do? The ones for whom "most of the time" is unacceptable?

A few patterns I've seen. One is the dual-model pattern. You use a smaller, dumber model for the closed extraction step — something that literally doesn't have enough parametric knowledge to be dangerous. A Llama 3B fine-tuned on extraction tasks, where the model's own knowledge is shallow enough that it can't confidently invent legal concepts. Then you use a larger model for the open analysis step where you actually want breadth.

Using weakness as a feature.

The model's ignorance becomes your safety guarantee. Another pattern is the verification loop. You run the closed generation, then you run a separate verification step — sometimes with a different model — that checks every factual claim against the provided documents and flags anything not found. That's expensive, it doubles your inference costs, but it catches leakage.

The third pattern?

The nuclear option — constrained decoding. You structure the output so the model can only generate from a predefined schema or template. If the schema only has fields for "clause type" and "clause text," the model physically cannot insert a commentary paragraph about industry standards. This is more common in extraction use cases than generation, but it's the closest thing to a hard guarantee.

The practical guidance for practitioners is: know your tolerance for leakage, and pick your constraint mechanism accordingly. System prompt for low-stakes, dual-model or verification loop for medium-stakes, constrained decoding for high-stakes.

Match the mechanism to the task shape. If you're extracting structured data, constrained decoding is feasible. If you're generating natural language summaries, it's not, and you're back to prompts and verification.

Let's talk about a specific failure mode I've seen. Team sets up a RAG system for customer support — closed corpus of product documentation. Works great for a month. Then someone adds a web search tool because "customers are asking about competitor comparisons and the docs don't cover that." Now the model is answering product questions from the docs but competitive intelligence from the web, and there's no clear boundary between the two modes. The user asks "does your product support SOC 2?" and the model retrieves a web page from a competitor claiming they support SOC 2, and blends it into the answer.

That's the contamination problem. Once the web search tool is available, the retriever might pull from both sources, and the model doesn't always know which is which. I've seen teams try to solve this with source filtering — tagging documents with a "corpus ID" and only retrieving from the corpus tagged "internal" for certain question types. But that's a routing problem on top of a retrieval problem, and it gets complicated fast.

The routing itself is probabilistic. You classify the question as "internal" or "external" and occasionally you misclassify.

Every layer adds a new failure mode. The question classifier gets it wrong. The retriever pulls irrelevant documents. The model ignores the system prompt. Each of these is individually unlikely to fail, but the compound probability of something going wrong somewhere in the chain is higher than anyone wants to admit.

The serial failure mode problem. Which brings us to something Daniel's prompt hinted at — the distinction between closed-world and additive grounding as a design philosophy, not just a technical setting.

This is the deeper point. When you choose closed-world, you're making a statement about what constitutes a valid answer. A valid answer is one that can be derived entirely from the provided sources. Anything outside that is, by definition, out of scope. When you choose additive grounding, you're saying a valid answer may include information the user didn't provide, as long as it's relevant and sourced. These are different epistemologies for the system.

Different theories of what it means to answer a question correctly.

Most production systems are muddling along somewhere in the middle without having made the choice explicitly. They've got a retriever, they've got a system prompt that says "use the provided context," but the model has web access and the prompt doesn't strictly forbid parametric knowledge. The result is answers that are grounded-ish. Sort of sourced.

Grounded-ish is the beige wallpaper of enterprise AI.

It's everywhere and nobody chose it on purpose. It's just what you get when you don't make the decision.

One piece of practical advice is: make the decision. For each generation step in your system, decide whether it's closed or open, and configure it accordingly. Don't just accept the framework's default.

Document the decision. When something goes wrong — and something will go wrong — you want to know whether it was a design choice or an oversight. "We decided this step should be closed and it leaked" is a very different postmortem from "we never thought about it.

Are there frameworks on the horizon that might give us that clean switch?

I haven't seen anything announced, but the direction the Anthropic SDK is heading — with more explicit control over tool availability per step, and with their focus on safety — suggests they're thinking about it. The "tools" parameter per request is already the primitive. What's missing is a higher-level concept of "grounding mode" that bundles the tool restrictions, the system prompt constraints, and maybe some decoding parameters into a single setting.

A grounding profile.

That's exactly what I want. A profile that says "legal extraction mode" and it sets the retriever to the contract index, removes web search, sets the system prompt to strict source-only, and maybe even switches to a model that's been fine-tuned for low hallucination on extraction tasks. One toggle, five things happen.

The market wants this. The question is whether the framework builders see it as their problem or the application developer's problem.

Right now it's firmly the application developer's problem. And I think that's partly because the framework builders are mostly focused on making the happy path easier — "look how quickly you can build an agent that uses tools" — rather than the constrained path. The constrained path is less flashy, but it's where the money and the liability are.

The unsexy work of making AI actually usable in high-stakes settings.

The glockenspiel of corporate approachability, but for safety guarantees.

There it is.

I've been waiting.

To land the practical guidance. Step one: pick your terms and be consistent within your team, because the industry won't do it for you. Step two: classify each generation step as closed or open based on the task shape, not the domain. Step three: implement the constraint using the primitives your framework gives you — tool gating, system prompts, separate indexes, verification loops — and know that none of it is a hard guarantee. Step four: if the stakes justify it, reach for the heavier machinery — dual models, constrained decoding, human review.

Step zero: make the decision consciously. Grounded-ish is not a strategy, it's an accident.

The thing I keep coming back to is that "closed-world" is almost a misleading metaphor for what we're actually doing. The model's world is never closed. It always has its training data. We're not closing a world — we're building a fence and asking the model to stay inside it.

The fence is made of prompts and tool lists. It's not a wall. It's a suggestion with infrastructure.

A suggestion with infrastructure. That's going to be the title of the postmortem for about forty percent of production RAG incidents this year.

At least forty percent. The rest will be "the retriever pulled the wrong chunk and the model ran with it.

Which is a whole other episode.

We've done that one. Different angle, different day.

One thing I want to add before we wrap — there's an emerging pattern I'm seeing that doesn't fit neatly into the closed versus open binary. It's what I think of as "staged grounding." You run the closed extraction first, produce structured outputs, and then those structured outputs become the input to the open analysis step. The open step never sees the raw documents. It sees a JSON blob of extracted facts. That creates a clean information boundary — the open step can reason about the facts, bring in outside context, but it can't accidentally quote the document or confuse the document's claims with web claims.

The extraction step acts as a filter that strips the document of its rhetorical context and leaves only the factual skeleton. And the factual skeleton is much harder to contaminate. You can't accidentally blend a web source's phrasing with the document's phrasing if the document's phrasing has been replaced with structured data.

That's elegant. It also means the extraction step is auditable — you can check whether the structured facts match the document before they flow into the open analysis.

That auditability is the thing most missing from current frameworks. When a LangGraph agent produces an answer that blends five sources, tracing which claim came from where is often possible but rarely easy. The staged grounding pattern makes provenance a first-class property of the architecture.

For practitioners listening to this — if you're building something where the answer matters, staged grounding might be a better fit than trying to get a single generation step to behave itself.

It's more engineering upfront, but the failure modes are cleaner and easier to debug. And in production, debuggability beats elegance every time.
And now: Hilbert's daily fun fact.
Hilbert: The word "cephalopod" entered English in the 1820s from the French "céphalopode," coined in 1798 from the Greek "kephalē" for head and "pous" for foot — literally "head-footed." But the etymological debate that raged through the 1910s among Tasmanian naturalist societies was whether the term should instead be "podocéphalien," foot-headed, to emphasize that the arms develop from the anterior foot region in embryogenesis. The proposal failed by a single vote at the 1913 Hobart Zoological Congress.

...right.

The Hobart Zoological Congress had strong feelings about mollusk nomenclature.

Apparently stronger than about anything else.

This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this, leave us a review wherever you get your podcasts — it helps. Find every episode at myweirdprompts.I'm Herman Poppleberry.

I'm Corn. See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.

#3751: Source-Restricted vs. Open Retrieval: How to Lock Down Your LLM

Downloads

You Might Also Like

#3751: Source-Restricted vs. Open Retrieval: How to Lock Down Your LLM