Hey everyone, welcome back to My Weird Prompts. I am Corn, and I am joined as always by my brother.
Herman Poppleberry at your service. It is a beautiful day here in Jerusalem, and we have a really meaty topic to sink our teeth into today. Our housemate Daniel sent over a prompt that I think is going to resonate with anyone who has been trying to build or even just use AI tools over the last couple of years. It is one of those questions that sounds simple on the surface but reveals a massive, crumbling infrastructure underneath once you start digging.
Yeah, Daniel was asking about the actual reality of data privacy in this new world of agentic AI. You know, we have all seen those marketing banners on every website now that say we do not train on your data. It has almost become the new SSL certificate, right? It is just this baseline expectation that everyone claims to meet. But Daniel was wondering if that is actually a complete answer, especially when you are building complex agents that have memory, tool access, and long term storage. He is feeling that friction between the need for developer speed—just getting the agent to work—and the heavy burden of enterprise compliance.
It is the perfect time to talk about this because we are now in March of twenty twenty six, and the landscape has shifted. A few years ago, we were just talking about chat boxes. Now, we are talking about autonomous agents that live in our browsers, our IDEs, and our databases. The "we do not train on your data" line is a great start, but in twenty twenty six, it is essentially like a restaurant saying "we wash our hands." That is great, but I still want to know where the meat came from, how it was stored, and who has the keys to the freezer.
That is a great way to put it. We are moving from a world of stateless interactions to stateful relationships with AI. And that shift changes everything. So, today we are going to deconstruct the "Agentic Stack." We want to look past the model provider and see where the data actually flows. We will talk about why the "no training" promise might actually be true for technical reasons you might not expect, but also why your data is arguably more at risk now than it was two years ago.
We need to distinguish between the Model Provider—the big names like OpenAI, Anthropic, or Google—and the rest of the stack. If you are building an agent today, you are likely using a vector database like Pinecone or Weaviate, an orchestration layer like LangGraph or Haystack, and probably three or four different observability platforms. Each one of those is a potential leak point. The model is just the engine; the rest of the stack is the fuel system, the exhaust, and the GPS tracking.
Let us start with that first layer, the model itself. Daniel’s prompt really pushed us to look at the "no training" claim. Herman, when an enterprise user sends a prompt to a major API provider today, what is actually happening on the back end? Is the fear that our data is being sucked into the next version of the foundation model actually grounded in reality?
For the major players, the short answer is no. And I want to be very clear about why, because it is not just about legal compliance or being "good guys." There is a massive technical incentive for them NOT to train on your raw production data. We call this the "Garbage In, Garbage Out" problem, but in twenty twenty six, it has become the "Noise In, Collapse Out" problem.
Explain that. Why would my "high quality" business data be considered "noise" or "garbage" to someone like OpenAI?
Think about the scale of a foundation model. To train something like the models we are seeing today, you need trillions of tokens of extremely high quality, curated data. Raw user data is incredibly messy. It is full of typos, half finished thoughts, "hey are you there" messages, and repetitive queries. If a provider just dumped every single API call into their training set, they would be injecting massive amounts of low entropy, repetitive, and potentially contradictory information into their weights. It would actually degrade the performance of the model.
So it is a quality control issue. They would rather use synthetic data or highly curated public datasets than the random logs of a thousand different enterprise apps.
Precisely. By twenty twenty five, we hit what researchers called the "Data Wall." We basically ran out of high quality human text on the internet. The solution was not to grab messy private logs; the solution was synthetic data generation—using models to create perfect, logically sound training examples. Your email to your boss about a missed deadline is actually useless to a model trying to understand quantum physics or advanced Python architecture. In fact, it is worse than useless; it is a distraction.
That is a relief, in a way. The technical architecture of these models actually protects us because our data just isn't that interesting to the base model. But, there is a "but" coming, isn't there?
There is a huge "but." We have to talk about the exceptions. While the enterprise APIs for OpenAI, Anthropic, and Google have very strict "no training" clauses, that does not apply to everyone. If you are using the free consumer versions of these tools, or if you are using certain "Model as a Service" brokers or smaller research previews, the terms of service are often a total reversal. In those cases, your data is the product.
Right, we saw this with some of the smaller players in twenty twenty four and twenty twenty five. They offer a cheaper API, but in the fine print, it says they reserve the right to use "anonymized" data for model improvement.
And "anonymized" is a very loose term in the age of AI. If you are using a free tier, they are likely using your interactions for a process called Reinforcement Learning from Human Feedback, or R L H F. They want to see how you corrected the model, what you liked, and what you rejected. That is how they "fine tune" the behavior of the model. So, if you are a developer and you are using a free research preview to test your agent, you are effectively leaking your logic and your test data into their future tuning sets.
That is a big distinction. So, if I am paying for an API key, I am usually protected by a different set of terms. But Daniel’s prompt was really pushing us to look at the whole stack. We have moved from stateless chat, where you send a message and it is gone, to stateful agents. These are systems that remember who you are, what you did yesterday, and what your goals are. That memory has to live somewhere.
That is the core of the agentic shift. In the old world, the interaction was ephemeral. Now, we are building what we call the agentic stack. To make an AI useful as an agent, you usually give it a vector database. This acts as the long term memory. When the agent needs information, it performs a similarity search on that database to pull relevant context into its prompt. This is what we call Retrieval Augmented Generation, or R A G.
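The retrieval step Herman just described can be sketched in a few lines of Python. This is a toy in-memory store with made-up three-dimensional embeddings, purely to show the mechanic; a real system would use an embedding model and a proper vector database, not hand-written vectors:

```python
import math

def cosine(a, b):
    # Similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector database": (embedding, original text) pairs. In a real
# agent, an embedding model produces these vectors, and every text
# stored here is a persistent copy of something the agent has seen.
memory = [
    ([0.9, 0.1, 0.0], "Customer prefers email over phone contact."),
    ([0.1, 0.9, 0.0], "Quarterly report is due on the 15th."),
    ([0.0, 0.2, 0.9], "Server maintenance window is Sunday night."),
]

def retrieve(query_embedding, k=1):
    # Rank stored memories by similarity and return the top k texts,
    # which then get spliced into the model's prompt as context.
    ranked = sorted(memory, key=lambda item: cosine(item[0], query_embedding),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# A query embedding close to the reporting memory pulls that text in.
context = retrieve([0.2, 0.8, 0.1])
```

Note that the store keeps the original text alongside the vector, which is exactly why a compromised vector database is such a rich target.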
And this is where the privacy conversation gets much more complicated. Because while the model provider might not be training on your data, that vector database is a persistent store of every sensitive thing your agent has ever seen. It is a "hot" database that is being queried constantly.
Right. And unlike the model provider, who is just doing inference and then moving on, the vector database is actively storing your data in a searchable format. If you are a developer, you have to ask yourself, where is that database hosted? Is it a managed cloud service? What are their retention policies? If that database is compromised, someone has a perfectly indexed, searchable history of all your agent's interactions.
That reminds me of what we discussed back in episode twelve zero nine, when we were talking about the shift to agent first architectures. The complexity of managing that state is not just a technical challenge, it is a massive security surface area. If your agent is processing emails, for example, it is taking sensitive P I I—personally identifiable information—and turning it into vector embeddings.
And here is the thing about embeddings that people often miss. Even though the data is converted into a list of numbers, a vector, those vectors can often be inverted. There was a major research paper in late twenty twenty four that showed you can reconstruct a significant portion of the original text from the embeddings alone. You cannot just assume that because it is in a mathematical format, it is anonymous.
That is a scary thought. So, if an agent summarizes a sensitive legal document and stores that summary and its embeddings in a vector store, a less privileged user who has access to that same vector store could potentially query it and retrieve that information. Or an attacker could "invert" the vectors to get the original text back.
We are seeing a lot of issues with what I call the "Agentic Tax." Every time an agent takes an action, it is moving data between different tools. It might pull something from your calendar API, send it to the L L M for processing, store the result in a vector database, and then log the whole transaction in an observability tool. Each of those "hops" is a place where P I I can leak.
Let’s talk about that observability piece for a second. Because as a developer, you need those logs. If your agent starts acting weird or hallucinating, you need to see exactly what was in the prompt and what the model responded with. But those logs are a gold mine of sensitive data.
They really are. I think the observability layer is actually the most overlooked risk in the current AI stack. Tools like LangSmith or Arize Phoenix are incredible for debugging, but by default, they often store the full text of every interaction. If your agent is handling customer support for a bank or a healthcare provider, your debugging logs are suddenly full of social security numbers, medical histories, and private account details. And often, these logs are kept for thirty, sixty, or ninety days by default.
So, we have the model provider, the vector database, the observability tools, and then we have the third party integrations. If I build an agent that has access to my email or my Slack, I am essentially giving that agent’s entire infrastructure a window into my private communications.
And this is where we need to be very careful about who we trust. It is not just about whether the AI is smart enough to do the job. It is about whether the company building the agent has the security posture to handle that level of access. We have gone from giving apps access to specific files to giving agents access to our entire digital lives. This is the "Agentic Shift" we talked about in episode twelve zero nine—the move from tools we use to partners we trust.
It feels like we are in a bit of a wild west phase right now. Companies are so eager to deploy agents and get those productivity gains that they might be cutting corners on the privacy audit side. Herman, how should a developer or a business owner actually approach this? If they want to build an agentic workflow but they are terrified of their data leaking, what is the framework?
I think we need a four point audit framework for any agentic integration. This is what I tell every enterprise client I work with. The first point is Data Residency. You need to know exactly where the data is being stored at every step. Is it in the United States? Is it in the European Union? Is it on a server you control, or is it in a multi tenant cloud environment where your data is sitting right next to everyone else’s? In twenty twenty six, "the cloud" is not a specific enough answer. You need to know the specific region and the specific provider.
That makes sense. And the second point?
The second point is Retention and Deletion Policies. Not just for the model provider, but for every tool in the stack. How long does your vector database keep deleted records? How long are your observability logs stored? If you have a customer who requests that their data be deleted under G D P R or similar laws, can you actually purge that data from your entire agentic stack? That is much harder than it sounds when your data is scattered across five different APIs. You need to ensure that a "delete" command propagates through the entire system.
That is a great point. G D P R compliance in an agentic world sounds like a nightmare if you haven't planned for it from day one. What is the third point?
The third point is Access Control and Sub processors. You need to look at who your providers are using. If you use a managed vector database, who are they using for their cloud hosting? Do they have S O C two Type two certification? Do they have HIPAA compliance if you are in healthcare? You have to follow the chain all the way down. If your vector database uses a third party logging service that isn't secure, your whole stack is compromised.
And the fourth point?
The fourth is what I call the P I I Masking Layer. This is the most practical thing you can do today. Before any data even hits the L L M or the vector database, you should be running it through a local or highly secure middleware that identifies and masks sensitive information. Instead of sending a customer’s real name and credit card number to the model, the middleware replaces them with placeholders like CUSTOMER NAME and CARD NUMBER. The model still gets the context it needs to perform the task, but the actual sensitive data never leaves your secure environment.
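A minimal sketch of that masking layer, using only regex-detectable entities and hypothetical placeholder names. A production middleware would detect far more entity types, typically with an NER model rather than regexes, but the reversible placeholder-plus-mapping pattern is the core idea:

```python
import re

# Illustrative patterns only; real masking layers catch names,
# addresses, account numbers, and more.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask(text):
    # Replace each detected entity with a numbered placeholder and
    # remember the mapping, so the response can be un-masked locally.
    mapping = {}
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping

def unmask(text, mapping):
    # Restore the real values in the model's response, entirely
    # inside your own environment.
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

masked, mapping = mask("Refund dana@example.com on card 4111 1111 1111 1111")
# The model only ever sees the placeholders in `masked`; the mapping
# never leaves your infrastructure.
```

The model still gets enough structure to do its job, and the sensitive values stay behind your boundary exactly as Herman describes.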
That seems like a very practical solution. It adds a bit of latency, I imagine, but for enterprise applications, that trade off is almost certainly worth it.
It definitely is. And we are starting to see some really cool tools emerging that do this automatically. There are open source libraries and even specialized APIs that are designed just to scrub P I I from L L M prompts in real time. By the time the prompt reaches OpenAI or Anthropic, it is already "clean." This effectively renders the "do they train on my data" question moot, because even if they did, they would only be training on placeholders.
I want to go back to something you mentioned earlier, Herman. You said that from a technical standpoint, training on raw user data is often counterproductive for these big companies. I think that is a really important point to hammer home because it changes the nature of the fear. It is not that they are trying to steal your secrets to make their model smarter; it is that the infrastructure itself is inherently leaky if not managed properly.
The risk is not malicious training; it is accidental exposure. It is a security problem, not a theft problem. If you look at the history of data breaches, it is rarely a sophisticated actor breaking into a high security vault. It is usually an unsecured S three bucket or a logging tool that was left open to the internet. In the world of AI, the vector database is the new unsecured S three bucket.
That is a great analogy. So, when people ask, "is my data being used to train the model," they are asking the wrong question. They should be asking, "where is my data being stored, who has access to it, and how long is it staying there?"
Right. And I think we also have to address the misconception that using a local model solves everything. People think, "oh, I will just run Llama three or some other open weight model on my own hardware, and then I am totally safe." But if you are still using a cloud based vector database and cloud based logging, you have only solved one piece of the puzzle. You have secured the engine, but the fuel lines are still running through someone else's backyard.
Plus, running these models at scale locally is still a massive undertaking for most small to medium businesses. The cost and complexity of maintaining that infrastructure can be prohibitive.
It can. But I do think we are going to see a shift toward local first vector stores and local first memory layers. We might see a world where the heavy lifting of the inference happens in the cloud with a trusted provider, but the actual memory and state management live on the edge or on the user’s own network. This is the "Privacy Preserving Agent" architecture that people are starting to get excited about.
That would be a fascinating development. It would essentially decouple the intelligence from the memory. You send a masked prompt to the cloud for the smarts, and you keep the context and the history locally.
That is the dream. And as hardware gets better and we see more specialized AI chips in laptops and even phones, that becomes much more viable. You could have a very high performance local vector store that handles all your personal data, and it only sends anonymized queries to the big models in the cloud. We are already seeing the early stages of this with Apple’s Private Cloud Compute and similar initiatives.
I want to pivot a bit and talk about the geopolitical side of this, because we are in Jerusalem and we see how these things play out on a global scale. When we talk about data privacy, we are often talking about American companies and American regulations. But what about the rest of the world? If you are a company in Israel or in Europe, using an American AI provider comes with a whole other layer of complexity regarding data sovereignty.
It really does. And that is why you see countries like France or Germany pushing so hard for their own national AI champions. They want to ensure that their citizens' data stays within their borders. From a pro American perspective, I think it is vital that our providers lead the way in privacy and security. If we want the world to use American AI, we have to be the most trusted. We cannot rely on just being the smartest; we have to be the safest.
And that ties back to the conservative worldview we often discuss. Security and privacy are not just individual rights; they are matters of national and economic security. If our businesses are leaking their intellectual property through poorly secured AI agents, that is a huge net loss for the country. It is a form of industrial espionage that we are accidentally facilitating.
It really is. We have to treat AI data with the same level of seriousness that we treat defense secrets or financial records. It is the lifeblood of the modern economy. If you are a pharmaceutical company and your agent is helping you design a new drug, that data is your most valuable asset. You cannot afford to have it sitting in an unencrypted log file in a third party observability tool.
So, let’s get practical for a minute. If someone is listening to this and they are in the middle of building an agent, what are the first three things they should do today to audit their privacy?
First, go into your observability tools right now and check your logging settings. If you are using LangSmith or Weights and Biases or anything like that, see exactly what is being stored. Most of them have settings to redact or anonymize data, or at least to limit the retention period to twenty four hours. Turn those on. You do not need a permanent record of every prompt to debug your system.
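Beyond the vendor settings, you can also redact in your own code before anything reaches a log sink. A sketch using Python's standard logging module with an illustrative pattern list, not a complete redaction policy:

```python
import io
import logging
import re

# Illustrative patterns; extend with whatever your domain considers
# sensitive before anything reaches a log file or observability tool.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<REDACTED_EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<REDACTED_SSN>"),
]

class RedactingFilter(logging.Filter):
    # Rewrites each record's message before any handler sees it,
    # so raw PII never lands in the logs at all.
    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in REDACTIONS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = None  # message is already fully formatted
        return True

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addFilter(RedactingFilter())

# Capture output in memory here just to demonstrate the effect.
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))

logger.info("retry payment for jane@corp.example, SSN 123-45-6789")
logged = buffer.getvalue()
```

The same filter attaches to whatever handlers ship logs to your observability platform, so the redaction happens before the data ever leaves your process.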
Good one. What is the second?
Second, look at your vector database provider. Read their privacy policy and their security documentation. Specifically, look for their data encryption standards and their sub processor list. If they are not S O C two compliant, you might want to consider moving to a provider that is, or look into self hosting a solution like Milvus or Qdrant on your own infrastructure. Self hosting a vector database is much easier than self hosting a foundation model.
And the third?
Third, implement a basic P I I filter in your application code. Even a simple regex based filter that catches common patterns like email addresses or phone numbers can go a long way. You do not need a massive machine learning model just to do basic scrubbing. Do it before the data ever leaves your environment. There are great open source tools like Microsoft Presidio that can help with this.
That is solid advice. It is about building layers of defense. No single tool is going to be a silver bullet, but if you have redaction in your logs, security in your database, and masking in your application, you are ahead of ninety nine percent of the people out there.
It is about being intentional. Don’t just follow the default settings. The defaults are usually designed for convenience and debugging, not for maximum privacy. As a developer, your job is to move from "it works" to "it is secure."
This really connects back to what we covered in episode twelve seventeen, where we looked at how agents can be manipulated to leak their own system instructions. If an agent can be tricked into giving up its internal logic, it can certainly be tricked into giving up the sensitive data it has in its memory.
That is such a good point, Corn. Prompt injection is not just about making the AI say something funny or offensive. It is a data exfiltration vector. If an attacker knows your agent has access to your email, they can craft a prompt that tricks the agent into summarizing your most recent sensitive emails and sending that summary to an external server. This is why we need to talk about "Tool Permissions."
Right, and that points to something your audit framework needs right alongside access control: tool permissions. You should never give an agent more access than it absolutely needs to perform its task. It is the principle of least privilege, applied to AI.
Right. If your agent only needs to schedule meetings, it does not need the ability to read your entire inbox. It only needs access to your calendar and maybe a very narrow set of email metadata. The more we can silo these agents, the safer we will be. We are moving away from the idea of one giant, all knowing agent that does everything, toward a swarm of smaller, specialized agents that have very limited scopes and very strict privacy boundaries.
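That scoping can be enforced in the dispatch layer itself, so a manipulated agent physically cannot call a tool outside its grant. A sketch with made-up tool names and scopes:

```python
# Hypothetical tool registry. Each callable stands in for a real
# integration (calendar API, email API, and so on).
TOOLS = {
    "calendar.create_event": lambda **kw: f"event created: {kw['title']}",
    "email.read_inbox": lambda **kw: "inbox contents...",
}

AGENT_SCOPES = {
    # The scheduling agent gets calendar access only, never the inbox.
    "scheduler": {"calendar.create_event"},
}

def dispatch(agent, tool, **kwargs):
    # Refuse any tool call outside the agent's granted scope, no
    # matter what the model's output asked for.
    if tool not in AGENT_SCOPES.get(agent, set()):
        raise PermissionError(f"{agent} is not allowed to call {tool}")
    return TOOLS[tool](**kwargs)

result = dispatch("scheduler", "calendar.create_event", title="standup")
```

Because the check lives outside the model, even a successful prompt injection can only exercise the tools the agent was explicitly granted.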
I think this is where the industry is heading. The "All Knowing Agent" is a privacy nightmare. But a team of specialized agents, each with its own little sandbox? That is something I can get behind. It is more robust, more debuggable, and much more secure.
I hope so. Because the stakes are high. We are entrusting these systems with more and more of our cognitive load. If we don't get the privacy right now, we are going to see a massive backlash that could stall the entire field.
It is interesting to see how this is evolving. I remember back in episode ten seventy, we were talking about the "Agentic Secret Gap." We were worried about how developers manage API keys and secrets within their agentic workflows. This conversation is really just the next level of that. It is not just about the keys anymore; it is about the data itself.
It is the natural progression. First, we had to figure out how to make the models work. Then we had to figure out how to connect them to tools. Now, we have to figure out how to do all of that without compromising everything we care about. It is the growing pains of a new technology stack. And honestly, it is a sign of maturity. We are moving past the "wow" phase and into the "how do we actually live with this" phase.
And I think we should be optimistic. The fact that we are even having these conversations, and that developers are starting to demand these features, is a good sign. The market is responding. We are seeing more privacy focused AI startups every day. We are seeing major providers like Modal and others offering more secure, isolated compute environments.
We are. And I think the big players are feeling the pressure too. They know that if they want to capture the enterprise market, they have to be beyond reproach when it comes to data security. The first major L L M provider to have a massive data breach is going to lose a huge chunk of their market share overnight. That is a powerful incentive. Nothing motivates a corporation like the fear of losing billions of dollars in enterprise contracts.
That is a fair point. But what about the individual user? The person who isn't a developer, who is just using these tools to help manage their life. Are they just out of luck?
For the individual, it really comes down to being a savvy consumer. Don't put anything into a free AI tool that you wouldn't want a human reviewer to see. Because even if they aren't training on it, there is often a "human in the loop" somewhere doing quality control. And for more sensitive tasks, look for apps that are built with privacy first architectures. They are out there, but you have to look for them. It is the same old advice we have been giving since the early days of the internet, just applied to a much more powerful set of tools. If you aren't paying for the product, you are the product.
It is a cliche because it is true. But with AI, the stakes are just so much higher because the tools are so much more intimate. They know our writing style, our schedules, our preferences, and our professional secrets.
Well, I think we have given Daniel and our listeners a lot to think about. This was a deep dive, but a necessary one. The reality of data privacy in twenty twenty six is that it is no longer a single checkbox. It is a continuous process of auditing and securing every layer of your stack.
Well said, Herman. And if you are out there building, don't let this discourage you. The potential of agentic AI is incredible. We just have to build it with our eyes wide open. Don't trust the marketing banners; audit the architecture.
Stay curious, but stay skeptical.
And hey, if you have been enjoying these deep dives into the weird and wonderful world of AI, we would really appreciate it if you could leave us a review on your favorite podcast app. Whether it is Spotify or Apple Podcasts, those reviews really help other curious minds find the show. It is the best way to support what we do here.
They really do. It makes a huge difference for us. And if you want to find our full archive of over twelve hundred episodes, or if you want to get in touch and send us your own weird prompt, head over to myweirdprompts dot com. You can find our R S S feed there, and all the different ways to subscribe.
And don't forget our Telegram channel. Just search for My Weird Prompts on Telegram to get notified every time a new episode drops. We love hearing from you guys, so keep those prompts coming. Daniel, thanks again for this one. It was a great excuse to dig into the technical weeds.
Definitely. It was a blast.
Alright, that is all for today. Thanks for joining us on My Weird Prompts. We will see you in the next one.
Stay curious, and stay secure. Goodbye for now.
So, Herman, before we totally wrap up, I was thinking about one more thing. We talked a lot about the technical side, but what about the legal side? Are we seeing any new regulations that are actually keeping up with this agentic shift? I mean, we have the Great Data Privacy Act of twenty twenty five, but does it actually cover agents?
That is a great question. We are starting to see some movement, especially in the U S with some of the recent executive orders. But the problem is that technology is moving at light speed and the legal system is moving at a more traditional pace. Most of the regulations we have right now were written for a world of static databases, not dynamic agents that can take actions on your behalf. The law is still trying to figure out who is responsible when an agent makes a mistake or leaks data—is it the developer, the model provider, or the user?
Right, it is hard to regulate something when the definition of what it is changes every six months. I mean, a year ago, we weren't even really talking about agents in this way. We were still obsessed with simple chatbots.
I think we are going to see a lot of litigation over the next couple of years that will eventually set the precedents. But until then, the best defense is a good offense—meaning, build your own security layers rather than waiting for the government to protect you. It always comes back to individual and corporate responsibility, doesn't it? Especially in a field as frontier as this. You have to be your own advocate.
Well, I think that is a perfect final thought. This has been My Weird Prompts. I am Corn Poppleberry.
And I am Herman Poppleberry. Thanks for listening.
We will catch you next time.
See you then.
Actually, Herman, one more quick thing. I just remembered a conversation we had back in episode twelve twenty, about APIs for agents. We were talking about the rise of the Model Context Protocol, or M C P. How does that fit into this privacy picture?
Oh, that is a great connection! M C P is actually a huge part of the solution. It is a standardized way for agents to connect to data sources and tools. Because it is a standard, it makes it much easier to build consistent privacy and security layers across different tools. Instead of having a custom, potentially leaky integration for every single app, you can have a single, hardened M C P gateway that manages all your agent’s connections. It allows you to define your security policies in one place and have them enforced across all your agent’s tools.
So, if you are a developer, using a standard like M C P could actually make your audit process much simpler. It brings some much needed order to the chaos.
It is a big step forward for the whole ecosystem. It makes the "Agentic Tax" much lower because you aren't reinventing the security wheel for every new tool you add to your agent.
That is really encouraging. It is good to see these standards starting to emerge. Okay, now I think we are actually done.
I think so too. Unless you have another one?
No, I think that covers it. My brain is officially full of vectors and P I I.
Mine too. Let's go get some coffee.
Sounds like a plan. Thanks again for listening, everyone.
Bye for real this time!