#fault-tolerance
46 episodes
#3284: Agent Infrastructure Engineer: The New DevOps
Agentic AI is splintering into real engineering disciplines. Here's what the "DevOps of AI" actually does.
#2989: Why Trains Crash When They Can't Steer
Stopping a train takes miles. Seeing an obstacle takes seconds. That gap explains everything.
#2938: How to Prevent Linux Desktop Crashes Under Heavy Load
Stop losing work to memory exhaustion, CPU lockups, and GPU hangs on Linux workstations.
#2924: When Adding One Agent Breaks Everything
The math behind why your 100-agent pipeline fails 40% of the time — and what to do about it.
#2780: Building Self-Healing Agent Pipelines
How to build an agent that monitors and fixes other agents in production — without the hype.
#2773: Beyond Static Fallbacks: Agentic Error Handling in AI Pipelines
From try-except blocks to planning agents that route around failures intelligently.
#2556: The Weird Myths of Solid-State Storage
No moving parts, no sound waves — just electrons trapped in silicon. How solid-state drives actually work.
#2550: Idempotent Pipelines: Checkpoints, Manifests & Safe Re-Runs
How to design scripts and pipelines so re-running them is safe, even after a crash mid-execution.
#2179: Building Cost-Resilient AI Agents
Failed API calls in agent loops aren't just technical problems—they're direct budget drains. Here's how checkpointing, retry strategies, and cachin...
#2002: Brainstorming a Stable-by-Design Smart Home
We explore why Home Assistant is so fragile and brainstorm a stable-by-design future for the platform.
#1921: The Three-Second Heartbeat That Keeps Israel Safe
Why a civilian website sends an empty JSON payload every three seconds, even during peacetime, and what it reveals about mission-critical architect...
#1067: The 3,000-Person Army: How Major AI Models Actually Ship
Think AI is built by a few geniuses? Discover the army of 3,000 specialists required to ship a single major model update.
#1048: The Keepers: How the Samaritans Outlasted Empires
Discover how a community of 950 people used ancient scripts and "survival engineering" to outlast empires for over two millennia.
#1041: Before the Hum: Life in the Pre-Refrigeration Era
Explore the high-stakes world of food preservation, from 19th-century ice trades to the biological secrets of 50-year-old perpetual stews.
#1036: Is Kubernetes Too Big for Your Startup?
Is Kubernetes too complex for most teams? Explore the evolution of infrastructure from Google’s Borg to the new era of AI-driven scaling.
#1032: Ancient Backups: How History Survived the Delete Command
Discover how ancient civilizations used monks, clay jars, and geographic diversity to create the world's first distributed data networks.
#1012: When a Missile Test Is a Diplomatic Message
Explore the strategic signaling behind the GT-255 launch and why the U.S. relies on 50-year-old technology to maintain global security.
#989: From Shackleton to Supply Chains: The Industrialization of Polar Science
Beyond the ice: Explore the massive industrial operations and high-stakes geopolitics required to sustain human life at the Earth's poles.
#894: Iran After Khamenei: The IRGC’s Fight for Survival
Following the death of the Supreme Leader, we examine the IRGC’s grip on Iran’s economy, military, and its future as a "state within a state."
#893: The Art of Red Teaming: Why You Must Break Your Own Plans
Learn why the most resilient organizations pay people to prove them wrong and how red teaming techniques can prevent catastrophic failures.
#889: Why the Oldest Tech Wins in a Crisis
When 5G fails in a concrete bunker, why is a $30 plastic radio your best hope? Discover the physics of why old tech beats the new.
#880: The UX of Survival: Engineering Modular Prep Kits
Discover the PMPU strategy: a modular approach to emergency gear that prioritizes tech, connectivity, and organization when every second counts.
#873: How Parenthood Reveals the Hidden Tech of Emergency Dispatch
Discover how dispatchers bridge 1950s radio tech with modern satellites to save lives during critical "warm transfers" in real time.
#872: The Universal Lifeline: How Emergency Calls Really Work
Discover the invisible global protocols that allow your phone to call for help anywhere in the world—even without a SIM card or a plan.
#841: AI Gateways: Building Robust Infrastructure with LiteLLM
Discover how AI gateways like LiteLLM provide redundancy, caching, and unified tool access for scalable application development.
#777: The Multi-Monitor Edge: Why the Pros Shun Ultrawides
Explore why high-stakes professionals choose multi-screen arrays over trendy ultrawides for better focus, ergonomics, and reliability.
#771: Beyond Backups: The High Stakes of Critical Redundancy
How do hospitals and data centers stay online during a disaster? Explore the engineering of "five nines" and the limits of redundancy.
#764: The Bureaucracy of the Apocalypse
Explore the high-stakes engineering of military-grade shielding and how the state protects its "nervous system" from an electromagnetic pulse.
#762: The Decoupled Smart Home Trade-Off
Tired of your smart home crashing? Discover why moving your home's "brain" to the cloud might be the ultimate reliability hack for your setup.
#740: The War Against Entropy at 30,000 Feet
How long can a plane truly stay airborne? Explore the mechanical, human, and logistical limits of modern aerial power projection.
#728: The Invisible Infrastructure of Data
Ever wonder how your data actually sits on a disk? Explore the evolution of file systems from the limits of FAT32 to the magic of ZFS.
#654: The Anatomy of Failure: Turning Blips into Breakthroughs
Stop burying your mistakes. Learn how to perform a "failure autopsy" using industrial frameworks to turn setbacks into a strategic advantage.
#642: Why Your Car Is a Hostile Computer
Think building a PC is hard? Try wiring a car. Herman and Corn explain how to upgrade your ride’s tech without frying the CAN bus.
#621: From a Dead Motherboard to Five Nines
Discover how the world’s biggest platforms stay online when hardware fails. Herman and Corn break down the invisible systems of high availability.
#620: When ZFS Pools Survive Hardware Death
Your motherboard fried, but is your data safe? Discover the secrets of ZFS portability, forced imports, and professional recovery workflows.
#527: Who’s Really Flying? The Evolution of Aircraft Controls
From steel cables to digital signals: Herman and Corn explore how flight controls evolved and why some modern jets still use 1960s technology.
#502: Bile Acid Survival: Eating Without a Gallbladder
How do you stay healthy when life is a pressure cooker? Discover low-friction nutrition strategies for post-surgery recovery and high-stress life.
#493: Beyond the Magic Smoke: Predicting Hardware Failure
Learn how to spot motherboard degradation, track NVMe wear, and use hidden NVIDIA telemetry to save your data before the "magic smoke" escapes.
#458: Why Your City Won't Freeze When a Server Dies
Ever wonder how the power grid stays balanced? Herman and Corn dive into SCADA, PLCs, and the tech keeping our modern world running.
#457: Why a 90s Pager Beats Your Smartphone in an Emergency
Can your smartphone be trusted in a crisis? Explore why pagers and LoRa might be the ultimate "baby emergency" solution for parents.
#456: How a Stone Building Grounds Itself
Why does your wall outlet have three prongs? Discover the hidden physics of electrical grounding and how buildings stay safe from power surges.
#454: When 1950s Wiring Meets 2026 Life
Tired of the power tripping when you make toast? Herman and Corn explain the "16-amp ceiling" and how to modernize Israeli apartment wiring.
#438: The Hidden Engineering of Airport Approach Lighting
Discover the hidden engineering behind airport approach lights, from the "rabbit" flashers to the towers standing in suburban backyards.
#418: RAID is Not a Backup: Mastering Home Server Resilience
Why RAID isn’t enough and how snapshots act as a digital time machine for your home server’s survival.
#409: When RAID Fails: The Rebuild Time Nightmare
Learn the math behind RAID levels, the risks of drive rebuilds, and why ZFS is the modern gold standard for data integrity.
#385: The Unkillable Workstation: Building for Total Redundancy
Can you build a PC that never dies? Herman and Corn explore redundant power, memory mirroring, and high-availability clusters for home servers.