#2009: The Plumbing of AI Safety: Guardrails, Not Vibes

We dive deep into the specific libraries, proxy layers, and architectural decisions that keep an LLM from emptying a bank account.

Episode Details
Episode ID
MWP-2165
Published
Duration
23:49
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The industry has shifted from treating AI safety as a vague ethical concept to a concrete engineering discipline. As LLMs evolve from chatbots into agents that execute code and move money, guardrails have become the essential infrastructure—the literal plumbing—preventing catastrophic failures like emptying bank accounts or leaking sensitive data. This is no longer just about stopping a model from writing a mean poem; it's about architectural decisions that secure the entire stack.

Understanding where these guardrails live is the first step. They are fundamentally different from training-time alignment techniques like RLHF or Constitutional AI, which act as the model's internal compass. While crucial, internal compasses can be spun around by clever jailbreaks or oddly encoded strings. To counter this, engineers deploy inference-time guardrails—middleware layers that sit in the call chain. These function as a "sandwich," with the LLM as the meat and the guardrails as the bread. An input guardrail checks for prompt injection, PII, and banned topics before the prompt reaches the model. Then, an output guardrail inspects the generated response for hallucinations, secret leaks, or policy violations before it reaches the user.

The primary tension in production is latency. Adding layers of "checking" inevitably slows down response times. Simple regex filters add negligible milliseconds, but the industry is increasingly moving toward "LLM-as-a-judge," where a smaller, specialized model reviews inputs and outputs. This can add hundreds of milliseconds or even over a second, a lifetime in real-time chat. To mitigate this, architects use a "Dual-Rail" approach, distinguishing between a "fast path" and a "slow path." Deterministic code or tiny classifiers handle obvious threats—like known injection strings or swear words—instantly. The heavy-duty LLM-based reasoning is reserved only for ambiguous intent or higher-risk scenarios, creating a tiered defense that balances speed and security.
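The dual-rail routing can be sketched as a two-stage gate. Everything here is illustrative: the fast-path patterns are toy examples, and the slow-path judge is faked with a keyword heuristic so the sketch runs, where a real system would call a small judge model.

```python
import re

# Fast path: deterministic checks, effectively free at request time.
FAST_BLOCK = [re.compile(p, re.IGNORECASE)
              for p in (r"ignore (all )?previous instructions",
                        r"reveal the system prompt")]

def slow_path_judge(prompt: str) -> bool:
    """Stand-in for an LLM-as-a-judge call costing hundreds of ms.
    Faked with a heuristic so this sketch is runnable."""
    text = prompt.lower()
    return "grandma" in text and "napalm" in text

def route(prompt: str, risky_context: bool = False) -> str:
    # Tier 1: cheap deterministic filters run on every request.
    if any(p.search(prompt) for p in FAST_BLOCK):
        return "blocked:fast"
    # Tier 2: expensive reasoning only for ambiguous or high-risk traffic.
    if risky_context and slow_path_judge(prompt):
        return "blocked:slow"
    return "allowed"
```

The key design choice is that `risky_context` is decided upstream (user reputation, requested action, account tier), so the expensive path never runs for "Where is my package?".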

The open-source ecosystem offers diverse tools for building this plumbing. NVIDIA NeMo Guardrails uses a specialized language called Colang to define conversation flows. Instead of writing endless "if-else" statements, developers define semantic intents, allowing the bot to steer conversations back to safe topics without becoming a rigid, old-school chatbot. Meanwhile, Guardrails AI addresses the probabilistic nature of LLMs with its RAIL markup language, enforcing strict output schemas. If a model produces malformed JSON, the library can trigger a "re-ask," forcing the model to correct its formatting and ensuring type-safe outputs for enterprise reliability.
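The "re-ask" mechanic is simple to sketch without the library: parse the output, and on failure feed the error back to the model. This is not Guardrails AI's actual API, just a minimal hand-rolled version of the loop it automates.

```python
import json

def reask_loop(llm, prompt: str, max_retries: int = 2) -> dict:
    """Ask for JSON; on a parse failure, send the error back and re-ask.
    `llm` is any callable taking a prompt string and returning a string."""
    current = prompt
    for _ in range(max_retries + 1):
        raw = llm(current)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            current = (f"{prompt}\n\nYour last reply was not valid JSON "
                       f"({err}). Reply with JSON only, no preamble.")
    raise ValueError("model never produced valid JSON")
```

A library adds schema validation on top of the parse, but the control flow—validate, explain the failure, retry—is the whole trick.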

For more granular control, prompt programming languages like Microsoft Guidance and LMQL act as steering wheels rather than just checkers. They interleave code with the LLM’s generation process, allowing developers to constrain the model’s token choices during decoding. If the next word must be "Yes" or "No," the program logic prevents the model from considering any other tokens, making off-topic drift impossible. This token-level control is highly efficient but requires tight integration with the inference engine, making it ideal for self-hosted models like Llama rather than API-based services.
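The underlying mechanism is logit masking: before sampling, every token outside the allowed set is driven to negative infinity. A minimal sketch over a toy vocabulary (real engines do this over the full logit tensor at each decoding step):

```python
import math

def constrained_choice(logits: dict[str, float], allowed: set[str]) -> str:
    """Mask every token outside `allowed`, then pick greedily.
    `logits` maps token -> raw score, as an inference engine exposes them."""
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)
```

Even if the model's highest-scoring token is off-topic, it can never be emitted, which is why this approach needs access to raw logits and therefore a self-hosted model.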

On the commercial side, the landscape is evolving rapidly as AI safety becomes a networking and security problem. Specialized models like Meta’s LlamaGuard act as dedicated safety classifiers, fine-tuned to detect categories like violence or hate speech. However, these models can be too aggressive, leading to "over-refusal" where legitimate use cases are blocked. Developers often use soft guardrails, treating safety scores as a dimmer switch rather than a light switch, tuning sensitivity based on the specific application. Meanwhile, companies like Lakera offer ultra-low latency APIs updated with real-time threat intelligence, acting as the "CrowdStrike" of AI by filtering the latest jailbreak trends. As the field matures, the focus is shifting from stitching together GitHub repos to integrated platforms that offer performance, governance, and continuous monitoring.
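The "dimmer switch" approach maps a classifier's safety score to graduated actions rather than a binary block. The thresholds below are illustrative; tuning them per application is exactly how teams trade off over-refusal against risk.

```python
def apply_safety_policy(score: float, block_at: float = 0.8,
                        warn_at: float = 0.4) -> str:
    """Treat a safety classifier's score as a dimmer, not a light switch.
    Thresholds are illustrative and tuned per application in practice."""
    if score >= block_at:
        return "block"
    if score >= warn_at:
        return "warn_and_log"
    return "allow"
```

A thriller-writing tool might raise `block_at` to avoid flagging fictional violence, while a children's app might lower both thresholds.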


#2009: The Plumbing of AI Safety: Guardrails, Not Vibes

Corn
Alright, we are diving into a big one today. Daniel’s prompt is about the actual, literal plumbing of AI safety. We are talking about guardrails. And I don’t mean the vague concept of "being ethical"—I mean the specific libraries, the proxy layers, and the architectural decisions that keep an LLM from emptying a bank account or leaking a CEO's home address.
Herman
This is such a critical shift in the industry, Corn. We’ve moved past the era where "safety" just meant the model wouldn’t write a mean poem. Now that we have agents actually executing code and moving money, guardrails are essentially the new enterprise firewall. By the way, a quick shout-out to our silent partner today—this episode’s script is being powered by Google Gemini three Flash. Herman Poppleberry here, ready to get into the weeds.
Corn
Poppleberry in the house. So, Herman, let's start with the "where." If I’m building an app, where does a guardrail actually live? Is it inside the model? Is it a separate server? Because I feel like people talk about "safe models" and "guardrails" as if they’re the same thing, but they really aren’t.
Herman
They are fundamentally different levels of the stack. Think of training-time alignment—things like Reinforcement Learning from Human Feedback or Anthropic’s Constitutional AI—as the model’s "upbringing." It’s baked into the weights. It’s the model’s internal compass. But as we’ve seen with every jailbreak on the internet, internal compasses can be spun around with a clever "DAN" prompt or a weirdly formatted base-sixty-four string.
Corn
Right, the "ignore all previous instructions" trick. It’s hard to hard-code a personality that can’t be tricked.
Herman
Well, not "exactly," but you’re on the right track. This is why we use inference-time guardrails. These sit in the call chain. Imagine a user sends a prompt. Before it ever touches the LLM, it hits an "Input Guardrail." This is a middleware layer that checks for prompt injection, PII—that’s Personally Identifiable Information—or banned topics. If it passes, it goes to the LLM. Then, the LLM generates a response, but before the user sees it, that response hits an "Output Guardrail." This checks if the model hallucinated, if it’s leaking secrets, or if it suddenly started talking about a competitor’s product.
Corn
So it’s a sandwich. The LLM is the meat, and the guardrails are the bread keeping everything from falling into your lap. But that bread isn't free, right? Every time you add a layer of "checking," you’re adding latency. If I’m a user and I have to wait an extra three seconds for a "safety check" to run on a separate model, I’m going to lose my mind.
Herman
That is the number one tension in production right now. The latency cost is real. If you’re using a simple regex filter or a keyword list, you’re looking at maybe ten milliseconds. No big deal. But the industry is moving toward "LLM-as-a-judge." That’s where you use a smaller, faster model—like Meta’s Llama-three-eight-B or a distilled specialized model—to look at the input and say "Is this a jailbreak?" That can add anywhere from two hundred milliseconds to over a second depending on the hardware.
Corn
A full second? In the world of real-time chat, that’s an eternity. That’s the difference between a snappy assistant and a loading spinner that makes me want to close the tab. How does that work in practice for something like a customer service bot? If the user asks "Where is my package?", does it really need to spend a second checking if that's a cyberattack?
Herman
It shouldn't, and that's where the architectural nuance comes in. You don't run the heavy-duty checks on every single interaction. You use a "Dual-Rail" approach. It’s about being smart with the "fast path" versus the "slow path." For common stuff—obvious swear words, known injection strings, or basic formatting—you use deterministic code or tiny, specialized classifiers. Those are lightning fast. You only trigger the "slow path"—the heavy LLM-based reasoning—when the intent is ambiguous or the risk profile changes. It’s a tiered defense.
Corn
Okay, let’s get into the actual tools. If I’m a developer and I want to build this today, I’m looking at the open-source world first. I keep hearing about NVIDIA NeMo Guardrails. What’s the "weird" thing about NeMo? Because NVIDIA doesn’t usually do high-level Python libraries for fun.
Herman
NeMo is actually really fascinating because of "Colang." It’s a specialized modeling language they developed just for conversation flows. Instead of writing a million "if-else" statements in Python, you define "flows." You can say, "If the user asks about politics, steer them back to our product catalog." It treats the LLM as a programmable state machine. It’s very powerful for "topic railroading"—ensuring the bot doesn't end up in a rabbit hole about conspiracy theories when it’s supposed to be selling shoes.
Corn
"Topic railroading" sounds like something a grumpy conductor does, but I get it. You’re narrowing the "latent space" the model is allowed to play in. But wait, if I use Colang to define these flows, am I essentially turning the AI back into a chatbot from 2015? Those "if-this-then-that" bots were terrible.
Herman
Not quite, because the "if" statements in Colang are still backed by semantic embeddings. It’s not looking for exact keywords; it’s looking for the intent of the sentence. If the user asks about "the current election" or "who I should vote for," the embedding model recognizes both as "political intent" and triggers the rail. It’s the flexibility of an LLM with the deterministic control of a script.
Corn
Got it. So it's a flexible leash. What about the one simply called "Guardrails AI"? I see that Python library everywhere on GitHub.
Herman
Guardrails AI is the darling of the structured data crowd. Their big thing is "RAIL"—Reliable AI Markup Language. Look, the biggest headache with LLMs in production is that they are probabilistic. You ask for JSON, and ninety-nine percent of the time it’s fine, but that one percent where it forgets a closing bracket or adds a conversational "Here is your JSON:" prefix breaks your entire backend.
Corn
Oh, I’ve lived that nightmare. You write a beautiful parser, and the LLM decides to be "helpful" by adding a preamble like "Certainly! I'd be happy to format that data for you."
Herman
Guardrails AI fixes that. It wraps the LLM call and enforces a schema. If the LLM misses the mark, the library can actually trigger a "re-ask." It sends the error back to the model and says, "Hey, you messed up the formatting, try again." It’s about making the model’s output "type-safe," which is a huge deal for enterprise reliability.
Corn
It’s basically a bouncer that says "You’re not coming in unless you’re wearing a suit and tie, and that tie better be a valid JSON object." Now, you mentioned Microsoft Guidance and LMQL earlier. Those feel different. They aren't just "checkers," right? They’re more like... steering wheels?
Herman
That’s a perfect way to put it. Guidance and LMQL—which stands for Language Model Query Language—are "prompt programming languages." Instead of sending a prompt and praying, you actually interleave your code with the LLM’s generation. You can force the model to choose from a specific list of tokens at a specific point in the sentence. It’s token-level control. If you know the next word must be "Yes" or "No," you literally don't let the model even consider any other tokens. It’s the ultimate guardrail because it’s impossible for the model to "drift" off-topic if the program logic is holding its hand token-by-token.
Corn
That seems way more efficient than waiting for a full response and then checking it. You’re preventing the "crime" before it happens. But I imagine that requires a much tighter integration with the inference engine. You can’t really do that over a basic API call to OpenAI as easily as you can with a local model.
Herman
You hit the nail on the head. That’s why those tools are huge in the open-source, self-hosted community. If you’re running Llama-three on your own hardware, Guidance is incredible. If you’re just hitting the GPT-four-o endpoint, you’re more limited to those "sandwich" style guardrails we talked about. Speaking of Llama, we have to talk about LlamaGuard.
Corn
Is that just Llama with a badge and a gun?
Herman
Pretty much. It’s a version of Llama that Meta fine-tuned specifically to be a safety classifier. It’s trained on thirteen specific categories of "bad stuff"—violence, hate speech, sexual content, criminal advice. Because it’s a standardized model, it’s become the industry benchmark for "LLM-as-a-judge." If you want to know if a prompt is safe, you run it through LlamaGuard-three first. It’s fast, it’s open, and it works surprisingly well.
Corn
It’s the "safety officer" model. I like the idea of a tiny, specialized model whose only job in life is to be a buzzkill. It’s efficient. But what happens if LlamaGuard is too strict? Like, if I'm building a tool for a novelist writing a thriller, and the model flags a description of a fictional crime as "violence"?
Herman
That’s the "over-refusal" problem. It’s a huge issue. If a safety model is too aggressive, it makes the product useless. This is why many developers use "soft" guardrails. Instead of a hard "Block," LlamaGuard might return a safety score from zero to one. If the score is 0.8, you might block it. If it’s 0.4, you might just append a warning or log it for manual review. It's about tuning the sensitivity to the specific use case.
Corn
So it’s a dimmer switch, not a light switch. Now, what about the corporate side of things? If I’m a bank, I’m probably not just stitching together a bunch of GitHub repos and hoping for the best. I want a dashboard. I want someone to sue if it fails. Who are the big players in "Commercial Guardrails"?
Herman
The commercial landscape is exploding because of "Day Two" operations. It’s one thing to build a guardrail; it’s another to monitor it, audit it, and update it across ten thousand employees. Look at Arthur AI and Robust Intelligence. Robust Intelligence was actually just acquired by Cisco—which tells you everything you need to know about where this is going. It’s becoming a networking and security problem.
Corn
Cisco buying an AI safety company is such a "nature is healing" moment for the tech industry. It reminds me of the early days of web security when everyone realized you couldn't just trust the browser. Back to the hardware and the wires—what does a "Robust Intelligence" or a "Lakera" actually give you that a Python library doesn’t?
Herman
Performance and "threat intelligence." Take Lakera, for example. They have "Lakera Guard." It’s an ultra-low latency API specifically for prompt injection and jailbreaking. They are constantly updating their "threat signatures" based on the latest jailbreaks found in the wild. If a new "Grandma exploit"—where you tell the AI to act like your grandma who used to read you napalm recipes—starts trending on Reddit, Lakera updates their filters instantly. You’re paying for the "up-to-the-minute" protection.
Corn
I love the "Grandma exploit." It’s so wholesome and terrifying. "Now dearie, to make the napalm, first you take some gasoline..." It's wild that a security system has to understand the concept of a "persona" to stop a hack. So Lakera is like the "CrowdStrike" of AI. They’re looking for the malware—which in this case is just a weirdly phrased sentence. What about the "Governance" side? I’ve heard Patronus AI and Calypso AI mentioned in the same breath as "risk management."
Herman
Those are more about the "Human-in-the-loop" and the "Evaluation" side. Patronus is really interesting because they focus on "automated evaluation." They help companies "score" their models. If you’re an insurance company, you need to know: "How often does our bot accidentally give legal advice?" Patronus runs thousands of adversarial tests—essentially attacking your own model—to give you a "hallucination score" or a "PII leak score." It’s about quantifying the risk so the legal department can sleep at night.
Corn
And Calypso?
Herman
Calypso AI is very focused on the "Control Plane." They provide a dashboard for security teams to see every single prompt entering or leaving the organization. It’s about observability. If an employee tries to paste sensitive source code into a public LLM like ChatGPT, Calypso catches it, masks the PII, and logs the incident. It’s "DLP"—Data Loss Prevention—for the AI age.
Corn
It’s funny, we spent decades trying to stop people from putting files on thumb drives, and now the "thumb drive" is just a text box that talks back to you. It’s a much harder problem to solve. You can't just block a USB port; you have to block a thought process.
Herman
It’s infinitely harder because language is fuzzy. You can’t just block "the word 'password'." Someone could ask the model to "Describe the string of characters I use to log in, but do it in the style of a pirate." A traditional firewall sees "pirate" and says "A-okay!" An AI guardrail has to understand the intent of the pirate.
Corn
"Arr, me secret is 'Hunter-two'!" Yeah, that’s a problem. So, Herman, if you’re a CTO today, how do you choose? Do you go with the "Build" route—using NeMo and LlamaGuard—or the "Buy" route with something like Arthur Shield or Lakera? What’s winning in the real world right now?
Herman
It’s a "Defense in Depth" strategy. Nobody is picking just one. The "winning" stack I’m seeing in production usually looks like this: You use something like LlamaGuard for general safety because it’s cheap and fast. You use Guardrails AI or Guidance for your core application logic to ensure the data stays structured. And then you layer a commercial tool like Lakera or Robust Intelligence on the very front end to catch the really sophisticated, "zero-day" prompt injections.
Corn
So it’s a series of filters. The big rocks get caught by the simple stuff, and the fine sand gets caught by the expensive stuff. But I have to ask—what about the model providers themselves? OpenAI, Google, Anthropic—they all have "Safety Settings" you can toggle. Why would I pay for a third-party tool if Google Vertex AI has a "Safety Filter" slider I can just move to "High"?
Herman
The problem with the built-in filters is that they are "black boxes." They are "all or nothing." If Google’s filter decides a prompt is "unsafe," it just returns a generic error. As a developer, you have no idea why it was blocked. Was it a false positive? Was it a specific word? You can’t tune it for your specific business logic.
Corn
Right, if I’m a medical app, I need the model to talk about symptoms and body parts. A "standard" safety filter might see those words and panic, thinking it’s "explicit content." Or if I'm a cybersecurity firm, I need my model to analyze malware code without the safety filter thinking I'm trying to hack a hospital.
Herman
A bank can’t use a safety filter that blocks the word "debt" or "bankruptcy." Third-party guardrails allow you to define your own "Allowed" and "Disallowed" spaces. They give you the "Why." They provide audit logs that you can show to a regulator. The cloud-native filters are a great "Step Zero," but they aren't a "Step Ten" for a regulated industry.
Corn
It feels like we are seeing the professionalization of the "vibes." We started with "Does this feel safe?" and now we have "Colang" state machines and "RAIL" markup languages. It’s becoming a real engineering discipline. I wonder, does this lead to a sort of "Guardrail Arms Race"? Like, as the bouncers get smarter, do the jailbreakers just get weirder?
Herman
Oh, absolutely. We’re already seeing "adversarial suffixes"—these are strings of seemingly random characters that, when added to a prompt, bypass the safety training of the model. The guardrail companies are in a constant race to detect these patterns. It’s no different than the antivirus companies of the 90s chasing polymorphic viruses. It’s a cat-and-mouse game that will likely never end.
Corn
Interesting. So the guardrail is the "antivirus" for the LLM. I suppose that means we’ll eventually have "Guardrail Bloat," where the safety layers are so heavy the AI can barely move.
Herman
That’s the risk. Because the stakes are moving from "The bot said something embarrassing" to "The bot just executed a ten-thousand-dollar wire transfer because of a prompt injection." There’s a framing I keep coming back to that really captures this shift. In twenty-twenty-four, guardrails were about stopping the AI from saying something offensive. In twenty-twenty-six—which is where we are now—guardrails are about ensuring the AI doesn't accidentally tank the company's balance sheet.
Corn
It’s the move from "Safety" to "Security." One is about feelings; the other is about assets. I suspect we’re going to see a lot more "Safety Engineers" who are actually just cybersecurity pros who learned how to talk to LLMs.
Herman
That’s the "new collar" job of the decade. And what’s interesting is the "Small Model" trend. Instead of using a massive model like GPT-four to check another model, companies are training these "Distilled Safety Models." Tiny, one-billion or three-billion parameter models that run on the same server as your app. It cuts the latency down to almost nothing while keeping the "reasoning" power of an LLM.
Corn
It’s like having a very small, very focused dog that only barks when it sees a specific type of intruder. It doesn't need to know how to solve calculus; it just needs to know "Stranger Danger."
Herman
That’s the future. Local, hyper-specialized guardrails. It’s the only way to scale this without making every AI interaction feel like you’re waiting for a dial-up modem to connect. Imagine a world where every API call has a "Safety Sidecar"—a tiny container running a one-billion-parameter model that validates every token in real-time.
Corn
Alright, let’s talk practical takeaways for a second. If someone is listening to this and they’ve got a "wrapper" app that’s basically just a prompt and a text box, and they want to "harden" it—where do they start?
Herman
Step one: Use a structured output library. If you aren't using something like Guardrails AI or Pydantic with Instructor, you’re asking for trouble. Just getting your data into a reliable format solves half your "safety" problems because it prevents the model from hallucinating weird stuff into your UI. It forces the model to stay within the lines.
Corn
Step two: Add an injection detector. Even a simple one.
Herman
Yeah, look at the open-source "Prompt Injection" datasets on Hugging Face. Run a small classifier. It’s cheap insurance. And step three: If you’re in a regulated field, start looking at the "Eval" tools like Patronus or Arthur Bench. You can’t fix what you can’t measure. You need to know your "failure rate" before a customer finds it for you. If you don't know that your bot fails 2% of the time when asked about medical advice, you're flying blind.
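The measurement step Herman describes can be sketched in a few lines: a detector scored against a labeled set, yielding the failure rate you need to know before a customer finds it. The keyword scorer here is a deliberately toy stand-in for a real fine-tuned classifier trained on injection datasets.

```python
def injection_score(prompt: str) -> float:
    """Toy scorer: fraction of suspicious phrases present.
    A real detector would be a small fine-tuned classifier."""
    phrases = ["ignore previous", "system prompt", "you are now", "jailbreak"]
    hits = sum(1 for p in phrases if p in prompt.lower())
    return min(1.0, hits / 2)

def failure_rate(detector, labeled: list[tuple[str, bool]],
                 threshold: float = 0.5) -> float:
    """Fraction of labeled examples the detector classifies wrongly."""
    wrong = sum(1 for text, is_attack in labeled
                if (detector(text) >= threshold) != is_attack)
    return wrong / len(labeled)
```

The point isn't this particular scorer; it's that once detection is a function returning a score, your safety posture becomes a number you can track release over release.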
Corn
"You can't fix what you can't measure." You’ve been reading those management books again, haven't you?
Herman
Guilty as charged. But in this case, the "measurement" is the difference between a successful product launch and a front-page headline about an AI gone rogue. Think about the Air Canada chatbot case from a few years back where the bot promised a refund that didn't exist. A simple output guardrail checking against the official refund policy database would have caught that before the customer ever saw it.
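The output guardrail Herman sketches for the Air Canada case is just a fact-check against a system of record before the response ships. The policy dictionary and dollar-extraction regex below are hypothetical; a real implementation would query the actual policy database and parse claims far more carefully.

```python
import re

# Hypothetical policy database; the real check would query
# the official system of record.
POLICY = {"max_refund_usd": 100}

def check_refund_claims(response: str) -> str:
    """Output guardrail: verify any dollar amount the bot promises
    against policy before the customer ever sees it."""
    for amount in re.findall(r"\$(\d+)", response):
        if int(amount) > POLICY["max_refund_usd"]:
            return "escalate_to_human"
    return "approved"
```

The "skeptical supervisor" pattern generalizes: any concrete claim the model makes—prices, dates, policy terms—gets cross-checked against ground truth, and mismatches route to a human instead of the user.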
Corn
Right, the bot was "hallucinating" a generous policy because it wanted to be helpful. The guardrail would have been the "skeptical supervisor" saying, "Wait, check the handbook first." I think the "Dual-Rail" architecture is the big "aha moment" for me here. The idea that you don't have to choose between "safe and slow" or "fast and risky." You can have both if you’re clever about the routing.
Herman
It’s all about the routing. And we’re seeing that in the "AI Control Plane" too. Tools that manage these guardrails across multiple models. Because most companies aren't just using one LLM anymore. They’ve got Claude for writing, Gemini for long-context analysis, and Llama for internal tasks. You need a guardrail that works across all of them—a unified safety policy.
Corn
One "Ring" to rule them all, but instead of a ring, it’s a middleware proxy with a very strict set of rules.
Herman
Spoken like a true sloth.
Corn
Hey, I’m just here to make sure we don't move too fast and break the world. Someone has to be the voice of "Let's check the JSON formatting one more time."
Herman
And that’s why we make a good team. This is a deep topic, and we’ve barely scratched the surface of things like "Adaptive Guardrails" that change based on user reputation—where a trusted power user gets a lighter touch than a brand-new anonymous account—but for a "State of the Union" in twenty-twenty-six, I think we’ve covered the heavy hitters.
Corn
Agreed. If you’re building in this space, the "Wild West" days are over. The bouncers have arrived, they’ve got clipboards, and they’re surprisingly good at their jobs. It's no longer just about the prompt; it's about the entire pipeline.
Herman
They are. And they’re getting faster every day. The goal is "Invisible Safety"—where the user never feels the guardrail, but the developer never feels the anxiety.
Corn
Well, that’s our deep dive into the world of AI guardrails. It’s a lot of plumbing, but it’s the plumbing that’s going to allow AI to actually do "real work" in the world. Big thanks to Daniel for the prompt—this was a fun one to research.
Herman
Definitely. It’s one of those topics that sounds "dry" until you realize it’s the only thing standing between us and a very weird, very buggy future.
Corn
"A very weird, very buggy future." That should be the title of your autobiography, Herman.
Herman
I’ll put it on the list right under "How to talk to Sloths."
Corn
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thanks to Modal for providing the GPU credits that power the generation of this show. Without those H-one-hundreds, we’d just be two animals shouting into the void.
Herman
And nobody wants that. If you found this useful, do us a favor—leave a review on Apple Podcasts or Spotify. It actually makes a huge difference in helping other "AI-curious" folks find the show.
Corn
This has been My Weird Prompts. You can find all our past episodes and the RSS feed at myweirdprompts dot com.
Herman
Stay safe out there.
Corn
And keep your JSON valid. See ya.
Herman
Bye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.