Daniel sent us this one — he's been building what's effectively a self-healing pipeline for the show's production, where an agent scans through Modal logs, spots things that are broken, and either fixes the small stuff silently or flags the complicated failures for human review. The question is, has anyone built a system specifically for this, what would you even call it, and what's the best way to actually deploy it?
I love how he frames this. "Using agents to fix other agents." It's the meta layer that nobody was talking about two years ago, and now it's basically the obvious next step once you've got agentic pipelines running in production.
It's the agent equivalent of hiring a janitor.
A very specialized janitor who also reads six hundred lines of JSON logs and knows which stack traces matter. And here's the thing — the nomenclature question he's stuck on is genuinely the right question to be stuck on, because the industry hasn't settled on a term yet. You'll see "self-healing agent workflows," which I think is actually the most descriptive. You'll see "agentic QA loops."
I believe I said that one.
You did, and it's a good term. You'll also see "autonomous remediation," "agent ops," "AI ops for AI" — which is a mouthful — and then the vendors have their own branding on top of that. But the core idea is the same across all of them. You have a production pipeline running agentic workflows, something breaks or degrades, and instead of a human staring at a dashboard at two in the morning, another agent diagnoses and fixes it.
The short answer to "has anybody built this" is yes.
Oh, absolutely yes. And not just one anybody. There's a whole emerging category. But here's where it gets interesting, and I think this connects to what he's actually asking about deployment. Most of what's out there falls into two buckets. Bucket one is the observability platforms that added agentic remediation as a feature. Think Datadog, New Relic, Grafana — they've all bolted on AI-driven incident response in the last eighteen months. Datadog's Bits AI can now auto-remediate certain alert conditions. Grafana has something called Sift, which is their AI assistant for incident investigation. But these are really designed for traditional infrastructure and application monitoring, not for agentic pipeline debugging.
They're watching whether your Kubernetes pod is alive, not whether your transcription agent hallucinated a speaker name.
They're great at "CPU spiked, roll back the deployment," terrible at "the agent keeps using the word 'delve' and we need to adjust the system prompt."
Which is a problem we have actually dealt with.
That's exactly the kind of thing Daniel's describing. It's not infrastructure failure, it's behavioral drift. The pipeline is running fine technically, but the output quality is degrading in subtle ways that only someone — or something — reading the actual logs would catch. That's bucket two, which is much newer and much more directly relevant. These are platforms built from the ground up for agent observability and remediation. LangSmith from LangChain, for example. They added something they call "automated feedback loops" where you can define rules that trigger corrective actions. If a trace shows the agent entering a loop, it can automatically inject a correction.
LangSmith is more of a development and testing platform, isn't it? Not a production autonomous fix-it thing.
That's the common perception, but they've been pushing hard into production. They have a feature now called "Guardrails" that runs in production and can actually intercept and modify agent behavior in flight. It's not quite "here are my Modal logs, go find what's broken," but it's adjacent. The closer fit is probably something like Braintrust or Arize, which are both doing eval-driven observability for AI systems. Braintrust in particular has this concept of "experiments" where you can run automated evaluations against production traces and trigger workflows based on the results.
None of these sound like they do what Daniel's skill does, which is basically: here's a raw log, here's a playbook for what to look for, go be a junior dev on call.
No, and that's the thing. What he built — a Claude skill with a specific playbook that knows his pipeline, knows what "broken" looks like for his particular use case, and can either fix or escalate — that's not really what the platforms do yet. The platforms give you frameworks and dashboards. What he wants is a janitor agent that sweeps the floor every two weeks. And the closest analogue in the current market isn't an observability platform at all. It's the emerging category of "agentic operations" or "agent ops" tools. There was a piece in The Information a few months ago about how companies like Cognition and Factory are building internal tools where one agent monitors another agent's outputs and flags anomalies. But these are mostly in-house, not commercial products.
The commercial landscape is still catching up to what people are actually building themselves.
And I think that's actually the most useful answer to the prompt. Yes, people have built this. No, there isn't a polished SaaS product you can just point at your Modal logs and say "handle it." The state of the art in mid twenty-twenty-six is: you build the skill yourself, you wire it to a cron job or a webhook, and you decide how much autonomy to grant it. The platforms give you building blocks, not turnkey solutions.
Which brings us to the deployment question. Cron job or human-in-the-loop?
Let's talk about the tiered approach, because I think that's where the best practice is emerging. And I say this as someone who's been reading every paper and postmortem I can find on agentic pipeline failures. The model that seems to work best — and this is showing up across multiple companies' engineering blogs, from Anthropic to Modal themselves to a bunch of smaller AI startups — is three tiers.
Walk me through them.
Tier one: fully autonomous fixes for things that are both low-risk and well-understood. A transcription segment has a mismatched speaker label? The concatenation step produced a file with a silent gap longer than five seconds? These are things where the failure mode is known, the fix is deterministic, and the blast radius of a bad fix is basically zero.
The "don't ping me about this" tier.
Tier two: automated diagnosis with human approval. This is where the agent says, "I've noticed that over the last three runs, the episode word count has been trending about twenty percent above target. Here's the relevant section of the system prompt that's probably causing it. I've drafted a revised version. Want me to apply it?" Human clicks yes or no. Very low cognitive load for the human, but the human still has the final say on anything that changes the pipeline's behavior.
Tier three is "something's on fire, I have no idea what to do, please look at this."
That's the escalation tier. The agent's job there isn't to fix anything, it's to triage. "Episode two six nine nine failed at the speech-to-text stage. The error is an API timeout from the transcription service. Here's the exact timestamp, the retry count, and the relevant log lines. I've opened a draft incident report." That alone saves a human twenty minutes of grepping through logs.
The deployment answer is: don't pick one model. Pick all three, and route based on the type of failure.
And the routing logic itself is part of the skill definition. When Daniel writes that playbook, the most important thing in it isn't the fix instructions. It's the triage logic. "If you see X, do Y autonomously. If you see A, draft a recommendation. If you see Z, just tell me." That triage taxonomy is the intellectual property of the pipeline. It's what turns a generic agent into a specialized operations engineer.
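To make that concrete, here's a minimal sketch of that triage logic expressed as code rather than prose. Everything in it (the category names, the table, the "route" helper) is invented for illustration, not taken from Daniel's actual skill.

```python
from enum import Enum

class Action(Enum):
    FIX = "fix autonomously"        # tier one: known failure, deterministic fix
    PROPOSE = "draft for approval"  # tier two: human clicks yes or no
    ESCALATE = "triage and notify"  # tier three: a human needs to look

# A hypothetical triage table: the "if you see X, do Y" playbook as data.
TRIAGE = {
    "speaker_label_mismatch": Action.FIX,       # low risk, well understood
    "silent_gap_over_5s":     Action.FIX,
    "word_count_drift":       Action.PROPOSE,   # changes pipeline behavior
    "stt_api_timeout":        Action.ESCALATE,  # infrastructure, not the agent's call
}

def route(category: str) -> Action:
    # An unrecognized failure mode is by definition not well understood,
    # so anything the playbook doesn't name defaults to escalation.
    return TRIAGE.get(category, Action.ESCALATE)
```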
It's also what makes the skill templatizable, which was his other point. If you've figured out the triage taxonomy for one Modal pipeline, you can probably reuse eighty percent of it for the next one.
The failure modes for agentic pipelines are surprisingly consistent across domains. Looping, hallucination, timeout, output format drift, cost overrun. Those five categories cover probably ninety percent of what goes wrong in production. The specifics differ — what "output format drift" means for a podcast pipeline versus a document processing pipeline — but the detection patterns are the same. So you can absolutely template this.
The "self-healing agent workflow" template. You drop it into a new project, customize the domain-specific failure signatures, and you've got a janitor.
Here's what I find exciting about this. We're not just talking about reducing toil, though that's real. We're talking about something that changes the economics of running agentic pipelines at all. Right now, if you're a small team or a solo developer running agentic workflows in production, the biggest risk isn't that something will break. It's that something will break and you won't notice for three weeks, and by then you've produced sixty episodes with a subtle quality degradation that you now have to either live with or redo.
The silent failure problem.
Which is way worse than the loud failure. A crash you notice immediately. A system prompt that's slowly drifting toward using the word "moreover" in every other paragraph — that's a death by a thousand paper cuts.
Like adopting a feral cat. You don't realize how much damage it's doing until the furniture is shredded.
That's a vivid image. And self-healing workflows solve the silent failure problem by making inspection cheap enough to do regularly. If it takes a human an hour to review six hundred lines of logs, they'll do it once a month if you're lucky. If an agent does it in thirty seconds and costs a fraction of a cent, you run it every time.
Which is effectively continuous validation. And that's the real shift here. Not just easier setup, but ongoing verification that what you built is still doing what you think it's doing.
There's a term for this that's been floating around in the reliability engineering world: "continuous verification." It's been a thing in infrastructure for a while — chaos engineering, continuous testing in production. But applying it to agentic pipelines is new, because until recently you couldn't automate the judgment part. You could check "did the pipeline run," but you couldn't check "did the pipeline produce good output." The agent is the first thing that can make that qualitative judgment at scale.
Let's talk about the specific platforms, because Daniel asked about nomenclature and what's out there. You mentioned LangSmith, Braintrust, Arize. What about the Modal-native approach?
Modal's actually an interesting case because they've been building out their own observability features. They have a concept called "Modal Functions" that can be chained, and they expose pretty rich logs and metrics. But they don't have a native self-healing layer. What people are doing — and I've seen this pattern in a few engineering blogs, including some from Modal users — is wiring Modal's webhook notifications into a separate agent that does exactly what Daniel's describing. The pipeline finishes, Modal fires a webhook with the run metadata, the agent pulls the full logs, runs the playbook, and either applies fixes or sends a summary.
The platform is Modal, the agent is Claude, and the glue is a webhook and a cron job. That's the stack.
That's the stack for the DIY approach, and honestly, for a solo developer or small team, I think it's still the best approach. You get full control over the triage taxonomy, you're not paying for a platform you're only using ten percent of, and you can iterate on the playbook as fast as you can edit a system prompt.
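As a rough sketch of that stack, assuming Modal's scheduled functions and the Anthropic Python SDK; the log-fetching stub, secret name, playbook path, and model choice are all stand-ins you'd replace.

```python
import modal

app = modal.App("pipeline-janitor")

def fetch_recent_logs() -> str:
    # Hypothetical stub: however you export the recent runs' logs from
    # Modal, or read them from wherever your pipeline already writes them.
    ...

@app.function(
    schedule=modal.Period(days=14),  # Daniel's every-two-weeks cadence
    secrets=[modal.Secret.from_name("anthropic-api-key")],  # assumed name
    image=modal.Image.debian_slim().pip_install("anthropic"),
)
def run_janitor():
    from anthropic import Anthropic

    playbook = open("/playbook.md").read()  # ship the playbook with the image
    logs = fetch_recent_logs()

    reply = Anthropic().messages.create(
        model="claude-sonnet-4-5",  # whichever model you prefer
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"{playbook}\n\nLogs:\n{logs}\n\n"
                       "Apply the playbook: fix, propose, or escalate.",
        }],
    )
    print(reply.content[0].text)  # or post to Slack, open an issue, etc.
```

The glue is deliberately boring; all the intelligence lives in the playbook file.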
What about the people who don't want to DIY? The turnkey options?
The closest turnkey thing I've seen is actually from a company called Fixie, which was acquired by a larger player last year. They built something they called "agentic guardrails as a service," where you could define policies and the platform would monitor agent outputs and intervene. But it was more focused on safety and compliance than on operational reliability. For the operational side, PagerDuty has been making noise about "AIOps for agentic workflows," but I've looked at their documentation and it's still very much in the "contact sales" phase of product maturity.
Which is code for "it doesn't really exist yet."
It exists for a specific set of enterprise customers with six-figure contracts. It does not exist for the person running a podcast pipeline on Modal.
The honest answer to "has anybody built this" is: yes, lots of people have built it for themselves, the platforms are racing to productize it, but if you want it today you're still largely building it yourself. The good news is it's not hard to build, because the agent does the hard part.
That's the point that I think is worth sitting with for a second. The reason this is easy to build is the same reason it's necessary. The agent is powerful enough to diagnose and fix agentic pipelines, which means the agentic pipelines are complex enough to need diagnosis and fixing. We've created a tool that's so capable it can maintain itself, and we've created systems that are so complex they need that self-maintenance. It's a perfect closed loop.
The snake that eats its own tail, but productively.
The ouroboros of operational excellence.
Don't put that on a T-shirt.
Too late, I'm already picturing it.
Let's talk about the trust calibration problem. Daniel mentioned two modes — fully autonomous for small things, human-in-the-loop for bigger things. But "small" and "big" are doing a lot of work there. How do you actually define the boundary?
This is where I see the most disagreement in the engineering blogs and postmortems I've read. Some teams draw the line at "does this change affect the output that end users see?" If yes, human approval required. If no — it's an internal optimization, a logging change, a retry on a failed API call — go ahead and fix it.
That's a clean heuristic. But output-visible changes can still be trivially safe. Changing a system prompt to stop overusing a word is output-visible, but the risk is basically zero.
The counterpoint is that even internal changes can be dangerous. A retry loop that wasn't bounded properly can burn through API credits. A logging change can accidentally expose sensitive data. So the "output-visible" heuristic isn't perfect.
Which is why I think the better boundary is reversibility. If the agent's fix is trivially reversible — you can roll back a system prompt change with a single commit revert — then the risk of autonomy is low even if the change is user-visible. If the fix is hard to reverse — it modified a database, it sent emails to users, it changed a pricing configuration — then you want a human in the loop.
Reversibility as the trust boundary. I like that. It maps cleanly to how we think about infrastructure changes. You don't manually approve every auto-scaling event, because scaling down is trivially reversible. You do manually approve database schema migrations, because rolling those back is painful.
The playbook should include a reversibility assessment. "Here's what I'm about to change, here's how you'd undo it, and here's my confidence that the undo will work." If the undo is "run git revert on commit X," green light. If the undo is "manually reconstruct the state from backups," red light.
This is where the skill template gets really valuable. You bake that reversibility assessment into the template, and now every agent that uses it has a consistent safety framework. You're not reinventing the trust boundary for each pipeline.
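A minimal sketch of what that baked-in assessment could look like; the fields and the 0.95 threshold are invented for illustration, not an established convention.

```python
from dataclasses import dataclass

@dataclass
class ReversibilityAssessment:
    change: str             # "revise the length instruction in the system prompt"
    undo: str               # "git revert abc123" vs. "restore state from backups"
    undo_confidence: float  # the agent's own estimate that the undo will work

def green_light(a: ReversibilityAssessment) -> bool:
    # Hypothetical policy: act autonomously only when the undo is a single
    # mechanical step and the agent is highly confident it will work.
    return a.undo.startswith("git revert") and a.undo_confidence >= 0.95
```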
The other thing I'd add to the template is a cost cap. If the agent is running autonomously on a cron job, you want it to have a budget. "You can spend up to five dollars in API credits on fixes per run. If you think a fix will cost more than that, escalate." Otherwise you've created a very helpful agent that can accidentally run up a four-figure bill debugging a non-problem.
That's a failure mode I've actually seen documented. Someone at a startup built an autonomous debugging agent that, when it couldn't solve a problem, just kept trying increasingly expensive approaches. Longer context windows, more tool calls, more retries. By the time the human noticed, the agent had spent something like three hundred dollars trying to fix a bug that would have taken a human five minutes to solve.
The agent equivalent of a Roomba that keeps ramming into the same chair leg for an hour.
And the fix was trivial — add a cost cap and a "give up and escalate" threshold. But it's the kind of thing you don't think about until it bites you.
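Here's one way the cap and the give-up threshold might look inside the janitor's loop; the helper functions and the five-dollar figure are placeholders.

```python
MAX_USD_PER_RUN = 5.00  # the per-run budget from the discussion above

def attempt_fixes(findings, estimate_cost, apply_fix, escalate):
    spent = 0.0
    for finding in findings:
        cost = estimate_cost(finding)  # however you price a fix attempt
        if spent + cost > MAX_USD_PER_RUN:
            # Don't keep reaching for increasingly expensive approaches:
            # hand the rest to a human with what's been learned so far.
            escalate(finding, reason="budget exhausted")
            continue
        apply_fix(finding)
        spent += cost
```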
The template has three components now. Triage taxonomy, reversibility assessment, cost cap. Plus the actual fix playbook.
That's a solid foundation. And I think if you're building this — whether as a Modal cron job, a GitHub Action, or just a skill you invoke manually every couple of weeks — those four components are what separate a useful janitor from a loose cannon.
Let's zoom out for a second. The prompt asks about "using agents to fix other agents," and we've been talking about it in the context of a single pipeline. But the template idea suggests something bigger. If this pattern becomes standard — and I think it will — every agentic pipeline ships with a little janitor agent as a sidecar. It's just part of the deployment manifest.
I think that's exactly where we're heading. The sidecar pattern is well-established in infrastructure — Envoy for service meshes, Fluentd for logging. The agentic sidecar for self-healing is the natural extension. And the nomenclature will probably settle around "agent sidecar" or "ops agent" or something similarly unexciting.
"Self-healing agent workflows" is more descriptive than most industry jargon. I'm sticking with it.
It's a good term. And I think the platforms will eventually converge on it, or something close. The interesting question is whether the sidecar agent becomes a feature of the deployment platform — Modal builds it in, you just configure the playbook — or whether it remains a separate thing you wire up yourself.
Given how fast this space moves, I'd bet on the platforms absorbing it. Modal adds a "health check agent" configuration block, you point it at a playbook file, and it runs automatically after every pipeline execution. That feels like a twenty-twenty-seven feature.
I think that timeline is about right. And in the meantime, the people who build it themselves get a head start on the operational maturity curve. When the platforms do ship it, they'll have opinions about what the playbook format should look like, what the trust boundaries should be, what the cost caps should default to. Those opinions will be shaped by what the early builders learned.
Which is a good reason to be an early builder.
It's also a good reason to share what you learn. Daniel's skill template — if he open-sourced just the skeleton, the triage taxonomy and the fix patterns, without the show-specific stuff — that would move the community forward. A lot of people are trying to figure this out right now, and most of them are doing it in isolation.
The "here's what worked for me" blog post is the atomic unit of software progress.
We need more of them in the agentic operations space specifically. There's plenty of content about building agents. There's almost nothing about keeping them running in production for six months without them quietly degrading.
That's the unglamorous part. Nobody gives a keynote about "we ran our pipeline for a year and nothing broke."
The boring reliability story is the hard one to achieve. Anyone can build a demo that works once. Keeping it working, keeping the output quality consistent, catching the drift before it becomes a problem — that's the actual engineering.
That's what the self-healing workflow is really about. Not the heroics of fixing a dramatic failure, but the quiet maintenance that prevents the dramatic failure from ever happening.
The janitor, not the firefighter.
So to pull this together for the prompt: yes, people have built this. The platforms are emerging but not yet turnkey for the specific use case of "scan my Modal logs and fix what's broken." The best deployment model is tiered — autonomous for low-risk reversible fixes with a cost cap, human-in-the-loop for everything else. The template is absolutely worth building because the failure modes generalize across pipelines. And the term "self-healing agent workflows" is as good as anything the industry has come up with.
I'd add one more thing on the deployment front, which is about the cron schedule. Daniel mentioned every two weeks, and I think that's actually an interesting choice point. Two weeks is long enough that a lot of drift can accumulate. If the pipeline runs daily, a biweekly check means you could have fourteen episodes with a subtle quality issue before the janitor catches it.
The ideal frequency depends on the pipeline's velocity and the cost of a bad output.
For a daily pipeline, I'd run the janitor daily. The cost is negligible — we're talking about an agent reading logs, not retraining a model. For a weekly pipeline, weekly. The principle is: run the check at the same cadence as the pipeline itself. The janitor should be as regular as the thing it's maintaining.
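In the Modal sketch from earlier, that's a one-line change to the schedule, e.g.:

```python
import modal

# Match the janitor's cadence to the pipeline's: daily pipeline, daily check.
schedule = modal.Period(days=1)  # was modal.Period(days=14) in the sketch above
```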
Which goes back to the continuous validation point. If inspection is cheap enough to run every time, you run it every time.
That's the real shift in mindset. We're used to thinking of QA as a phase — you test before you ship, and then you're done. With agentic pipelines, QA is continuous. The pipeline is always shipping, and the janitor is always watching.
The factory that inspects itself while the conveyor belt is running.
Occasionally tightens its own bolts.
I'd trust a factory that does that.
More than I'd trust one that waits for the quarterly maintenance shutdown.
And now: Hilbert's daily fun fact.
Hilbert: The Kebra Nagast, a fourteenth-century Ethiopian manuscript chronicling the Solomonic dynasty, describes a traditional form of field hockey called genna played at Christmas — and specifies that the game must be played with a curved stick made from a tree that has never borne fruit, because a fertile tree's wood was believed to bring bad luck to the player who wielded it.
A tree that's never borne fruit. That's a very specific disqualification.
I appreciate the theological rigor applied to sports equipment.
We'd like to thank our producer Hilbert Flumingtop. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com. If you enjoyed this one, leave us a review — it helps other people discover the show. We'll be back with a new prompt soon.