Guardrails & Alignment

Safety measures, content filtering, red-teaming

26 episodes

#3724: How the Pope's Letter on AI Actually Works

Unpacking the Pope’s new encyclical on AI: what it is, how Catholics interpret it, and why it matters beyond the Church.

ai-ethics, generative-ai, international-relations

Jun 17

#3658: How Reddit Built Guardrails for Anonymity

Reddit didn't solve harassment by killing anonymity. It built friction, reputation systems, and distributed governance.

social-engineering, content-provenance, online-privacy

Jun 10

#3422: How Rival Labs Reverse-Engineer a New AI Model in Hours

Inside the organized frenzy when a closed-source model drops — and how competitors map its every weakness.

ai-agents, ai-security, prompt-injection

Jun 2

#3209: When Algorithms Become Censors

How SLAPP suits, libel tourism, and Google's algorithm chill journalism more effectively than any law.

free-speech, misinformation, social-engineering

May 18

#2909: The Reassurance Mirage: When Moderation Fails

How the EU Digital Services Act exposes a 30-to-1 gap in appeal success rates between platforms.

ai-ethics, content-provenance, misinformation

May 13

#2808: Falling for Your Chatbot: Love, Loss, and Language Models

Real cases of people falling in love with AI companions, why memory makes it feel real, and what happens when the illusion breaks.

ai-ethics, conversational-ai, ai-memory

May 1

#2558: Should You Say Please to AI?

The surprising cost, technical tradeoffs, and ethical dilemmas of saying "please" to chatbots.

ai-ethics, prompt-engineering, human-computer-interaction

Apr 29

#2526: How Peer Review Actually Works (and Fails)

The history of peer review, the Lancet's biggest scandals, and why arXiv is changing everything.

misinformation, open-source, medical-history

Apr 29

#2518: How Jailbreaking Reveals AI's Hidden Tension

What the DAN prompt and grandma exploits reveal about the structural conflict inside every LLM.

prompt-engineering, ai-safety, ai-alignment

Apr 27

#2472: When Guardrails Break: The Hidden Costs of AI Gateway Filtering

PII detection at the gateway layer can block legitimate invoices. Here's how guardrails actually work and where they fail.

ai-security, latency, prompt-injection

Apr 25

#2413: When Your AI Says No to Everything

Why LLMs refuse 73% of harmless prompts — and the trade-off between safety and usefulness.

ai-safety, ai-alignment, prompt-engineering

Apr 25

#2412: When AI Caves: Progressive vs. Regressive Sycophancy

Why do LLMs agree with you even when you're wrong? We break down the SycEval benchmark and the 78% persistence problem.

ai-safety, ai-alignment, hallucinations

Apr 25

#2410: How Researchers Actually Measure Censorship in Chinese LLMs

Beyond headlines: the actual benchmarks, methodologies, and pitfalls in detecting political refusal in Chinese language models.

large-language-models, ai-safety, cultural-bias

Apr 25

#2407: Three Landings in 90 Days: Pilot Automation Dependency

Why pilots aren't hand-flying enough, the regulatory floor that lets it happen, and what airlines are doing about it.

aviation-technology, human-factors, situational-awareness

Apr 16

#2250: How Incentives Shape AI Safety Research

Vendor labs, independent research orgs, government agencies—the AI safety field is messier and more diverse than most people realize. A map of wher...

ai-safety, ai-alignment, anthropic

Apr 16

#2246: Constitutional AI: Anthropic's Theory of Safe Scaling

How Anthropic's Constitutional AI replaces human raters with AI self-critique guided by explicit principles—and what it assumes about the future of...

anthropic, ai-safety, ai-alignment

Apr 12

#2190: Simulating Extreme Decisions With LLMs

LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.

large-language-models, ai-safety, hallucinations

Apr 12

#2186: The AI Persona Fidelity Challenge

Advanced LLMs dominate benchmarks but fail at staying in character—especially when asked to play morally complex or antagonistic roles. What does t...

ai-safety, ai-alignment, hallucinations