#2141: Durable Agents: The Backend Tax

Why building AI agents means managing infrastructure. We explore durable execution backends like Temporal and AWS Step Functions.

Episode Details
Episode ID
MWP-2299
Published
Duration
18:15
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Infrastructure Tax of the Agentic Era

Building an AI agent is exciting until you hit the backend reality. A sophisticated script that browses the web and calls APIs might take ten minutes to run, but standard serverless functions time out after fifteen minutes. If a server restarts, the agent loses its entire train of thought. Suddenly, you’re not building cool AI logic—you’re managing Kubernetes clusters, setting up Redis for state, and worrying about webhook reliability. This is the "infrastructure tax" of the agentic era.

The solution is a class of platforms focused on "durable execution." These systems treat workflows as code, allowing long-running processes to survive server restarts and network failures. They checkpoint your code’s state, resuming exactly where it left off with all local variables intact. Think of it as a video game save point for your source code.
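The save-point mechanic can be sketched in a few lines of plain Python. This is not any platform's real API; the `history` dict stands in for the durable event log that a system like Temporal persists across restarts:

```python
import json

def durable(history):
    """Decorator factory sketching event-sourced replay: each step's
    result is recorded in `history` (a stand-in for the platform's
    durable event log). On replay, completed steps return their
    recorded result instead of re-running."""
    def decorator(fn):
        def wrapper(*args):
            key = f"{fn.__name__}:{json.dumps(args)}"
            if key in history:        # step already ran before a crash
                return history[key]   # replay: skip the expensive call
            result = fn(*args)        # first run: do the real work
            history[key] = result     # checkpoint the side effect
            return result
        return wrapper
    return decorator

history = {}  # on a real platform this survives process restarts

@durable(history)
def call_llm(prompt):
    return f"answer to: {prompt}"  # imagine a slow, expensive API call

call_llm("summarize the repo")     # first run executes and checkpoints
# Simulated restart: re-running the same workflow code replays the
# recorded result without re-calling the model.
assert call_llm("summarize the repo") == "answer to: summarize the repo"
assert len(history) == 1           # no duplicate LLM call was recorded
```

Real platforms do this at the level of the entire call stack, not a single function, but the replay-from-history principle is the same.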

Defining the Requirements

When choosing a platform for durable agentic backends, several key requirements emerge:

  • Code-First Definition: You want to write Python or TypeScript functions, not drag-and-drop blocks.
  • Persistence: The platform must save state across interruptions, whether from crashes, scaling events, or deliberate pauses (like waiting for human approval).
  • Webhook Integration: Authenticated webhooks should trigger jobs without complex middleware setup.
  • Observability: Deep logging of LLM calls, tool executions, and decision paths is essential for debugging hallucinations or costly loops.
  • LLM Routing: The ability to branch workflows based on prompt complexity or route to different model providers.
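To make the last requirement concrete, a complexity-based router can be as simple as a branch on cheap signals. This is an illustrative heuristic only; the model tier names are made up, and a real platform would express the branch as a workflow step:

```python
def route_model(prompt: str) -> str:
    """Illustrative complexity router. The tier names are invented;
    a real platform expresses this branch as a workflow step."""
    wordy = len(prompt.split()) > 50
    needs_reasoning = any(
        k in prompt.lower() for k in ("prove", "debug", "step by step")
    )
    if wordy or needs_reasoning:
        return "heavy-reasoning-model"   # hypothetical expensive tier
    return "fast-cheap-model"            # hypothetical cheap tier

assert route_model("What time is it?") == "fast-cheap-model"
assert route_model("Debug this trace step by step") == "heavy-reasoning-model"
```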

Platform Comparison

Temporal: The Gold Standard

Temporal has long been the heavyweight in durable execution, originally built for demanding workloads like high-frequency trading and payment processing. The March 2026 release of Temporal 1.25 added native large language model task queues, making it well suited to agents. Its "event sourcing" approach records every side effect, allowing workflows to replay without re-running expensive LLM calls. For a routing agent that categorizes intents and routes to human-in-the-loop signals, Temporal handles the entire decision tree with a simple "while" loop.
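The control flow of such a routing agent reduces to an ordinary loop. The sketch below is plain Python, not the Temporal SDK; `classify`, `answer`, and `wait_for_human` are stand-ins for the activities and signals a durable platform would checkpoint:

```python
def routing_agent(message, classify, answer, wait_for_human):
    """Plain-Python sketch of the decision loop a durable platform
    would checkpoint. The three callables stand in for activities:
    a cheap classifier model, a heavy reasoning model, and a
    human-in-the-loop signal that may take days to arrive."""
    transcript = []
    while True:
        intent = classify(message)               # cheap model call
        if intent == "technical":
            transcript.append(answer(message))   # heavy model call
            return transcript
        if intent == "billing":
            # In a real durable workflow this line could block for
            # days; the platform persists the loop state meanwhile.
            message = wait_for_human(message)
            continue
        transcript.append("handing off to support")
        return transcript

# Stub activities, purely for illustration.
result = routing_agent(
    "refund please",
    classify=lambda m: "billing" if "refund" in m else "technical",
    answer=lambda m: f"resolved: {m}",
    wait_for_human=lambda m: "human approved; explain the policy",
)
assert result == ["resolved: human approved; explain the policy"]
```

The point is that the developer writes the loop; the platform makes every iteration survivable.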

AWS Step Functions: The Lego Bricks

For those locked into Amazon’s ecosystem, AWS Step Functions offers a serverless, state-machine approach. The January 2026 update added AI orchestration patterns, providing authenticated API Gateway triggers and CloudWatch logs out of the box. However, it feels more like assembling Lego bricks with Amazon States Language (a JSON file) rather than writing freeform code. The trade-off is less granular flexibility but easier infrastructure scaling.
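For a feel of the Lego-brick style, here is a minimal routing state machine in the shape of Amazon States Language, built as a Python dict before serializing to JSON. State names and Lambda ARNs are illustrative placeholders, not a validated production definition:

```python
import json

# Minimal routing state machine in Amazon States Language shape.
# State names and ARNs are illustrative placeholders only.
state_machine = {
    "StartAt": "ClassifyIntent",
    "States": {
        "ClassifyIntent": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:region:acct:function:classify",
            "Next": "Route",
        },
        "Route": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.intent",
                    "StringEquals": "technical",
                    "Next": "HeavyModel",
                }
            ],
            "Default": "CheapModel",
        },
        "HeavyModel": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:region:acct:function:heavy",
            "End": True,
        },
        "CheapModel": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:region:acct:function:cheap",
            "End": True,
        },
    },
}

definition = json.dumps(state_machine)  # what you hand to Step Functions
```

Every branch, retry, and timeout lives in this declarative document rather than in your application code, which is exactly the trade-off described above.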

Google Cloud Workflows: The Multi-Cloud Sleeper

Google Cloud Workflows is an HTTP-based orchestrator that’s serverless and surprisingly good at LLM routing. The February 2026 enhancements improved observability for long-running AI traces, letting you see latency per "thought" in the console. While it still relies on YAML or JSON definitions—which can be a friction point for code-centric developers—it excels at multi-cloud flexibility, branching workflows to different model providers without managing databases.

Azure Durable Functions: The Enterprise Choice

Azure Durable Functions, updated in December 2025 with AI extensions, is a mature implementation for .NET and JavaScript developers. It uses Azure Key Vault for environment management and Application Insights for logging. The "Virtual Entity" pattern maintains conversation state across agentic loops without manual JSON passing, automatically saving local variables to table storage. It’s a safe, corporate-friendly option with deep integration into the Azure ecosystem.
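The pattern of auto-persisted conversation state can be illustrated in plain Python, with a JSON file standing in for table storage. This is a sketch of the idea only, not the Azure Durable Entities API:

```python
import json
import os
import tempfile

class ConversationEntity:
    """Sketch: state is rehydrated on construction and persisted after
    every turn, so the agent loop never hand-carries a JSON blob.
    The JSON file stands in for durable table storage."""

    def __init__(self, path):
        self.path = path
        self.state = {"turns": []}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)  # rehydrate prior turns

    def add_turn(self, user, agent):
        self.state["turns"].append({"user": user, "agent": agent})
        with open(self.path, "w") as f:    # auto-persist after each turn
            json.dump(self.state, f)

path = os.path.join(tempfile.gettempdir(), "agent-conv-demo.json")
if os.path.exists(path):
    os.remove(path)  # start the demo from a clean slate

entity = ConversationEntity(path)
entity.add_turn("hi", "hello!")

# "Restart": a fresh process rehydrates the conversation from storage.
restarted = ConversationEntity(path)
assert restarted.state["turns"] == [{"user": "hi", "agent": "hello!"}]
```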

Fly.io: The Developer-Favorite Contender

Fly.io is emerging as a lightweight option for developers who want Temporal-like power without the enterprise overhead. It focuses on running containers close to users, offering persistent volumes and simple scaling. While not a dedicated durable execution platform itself, it works well as the compute layer in a hybrid setup: run your workers in its fast global micro-VMs and point them at a Temporal cluster, and you get a durable agentic backend with minimal DevOps hassle.

Key Takeaways

  • Durability is Non-Negotiable: For agents that run longer than a few minutes, platforms must checkpoint state to avoid losing progress.
  • Trade-Offs Abound: Choose between code flexibility (Temporal) and managed ease (AWS Step Functions), or multi-cloud support (Google) versus enterprise integration (Azure).
  • Observability is Critical: Deep traces of LLM calls are essential for debugging and cost control.
  • The Future is Workflow-as-a-Service: The industry is shifting from stateless requests to stateful, long-lived agents that can wait for external events.

Ultimately, the best platform depends on your stack, team expertise, and whether you prioritize control or convenience. As agents become more complex, durable execution backends will be the essential infrastructure that lets developers stay in the flow of code.


#2141: Durable Agents: The Backend Tax

Corn
Imagine you’ve just written this incredibly sophisticated AI agent. It can browse the web, reason through complex data, and even call external APIs to get things done. You’re ready to ship it, but then the reality of the "backend" hits you. You realize that if your script takes ten minutes to run, a standard serverless function is going to time out and die. If your server restarts mid-loop, your agent loses its entire train of thought. Suddenly, instead of building cool AI logic, you’re stuck managing Kubernetes clusters, setting up Redis for state management, and staying up at night wondering if your webhooks are actually hitting a live endpoint. It’s a massive buzzkill for any developer who just wants to stay in the flow of code.
Herman
It really is the "infrastructure tax" of the agentic era. You start with a Python script and end up as a part-time DevOps engineer just to keep a single conversation alive.
Corn
Exactly why we’re diving into this today. Daniel sent us a great prompt about this exact struggle. He wrote: "I’m looking for the top five platforms for running durable agentic backend workflows. The use case is simple: a developer wants to define an agent workflow in code—no drag-and-drop builders—but they don't want to manage the infrastructure to keep it persistently available. They need authenticated webhooks to trigger jobs, environment variable management, deep observability, and LLM routing. Basically, a code-defined agentic backend that just works. What are the best options right now and what are the honest tradeoffs?"
Herman
Herman Poppleberry here, and I have been waiting for this one. We are in April twenty twenty-six, and the landscape for what we call "durable execution" has absolutely exploded in the last few months. It’s no longer just about running a script; it’s about "Workflows as Code."
Corn
And just a quick heads-up for everyone listening—today’s episode is actually being powered by Google Gemini Three Flash, which is fitting since we’re talking about the high-tech backbone of the AI world. So, Herman, when Daniel says "durable," what are we actually talking about in a technical sense? Why can't I just wrap my agent in a FastAPI endpoint and call it a day?
Herman
Because the "A" in API usually stands for "Application," but in the agent world, it might as well stand for "Anxiety." If your agent is doing something complex, like researching a topic for twenty minutes, a standard HTTP connection is going to drop. Most serverless platforms like AWS Lambda have a hard timeout—usually around fifteen minutes. If your agent is halfway through a multi-step reasoning process and the environment kills the process, all that state—the "thought" the agent was having—is gone. Durability means the platform "checkpoints" your code. If the server crashes, the platform resumes the code on a new machine exactly where it left off, with all the local variables intact. It’s like a video game save point for your source code.
Corn
That sounds like magic, or at least like a lot of very clever engineering under the hood. And it feels like the timing is right because we’ve seen major moves lately. Temporal just dropped version one point twenty-five in March with native LLM task queues, and even the old guard like AWS Step Functions added specific AI orchestration patterns back in January.
Herman
The industry is moving away from "simple triggers" and toward "persistent state." We’re moving from a world of stateless requests to a world of stateful agents that might live for days or weeks while they wait for a human to click an "approve" button or for an API to return a result.
Corn
Well, I’m ready to see who’s winning this race. We’ve got five heavy hitters to get through, ranging from the developer-favorite startups to the massive cloud providers. Let's get into the meat of it. Where do we start?
Herman
We start by defining what a "durable agentic backend" actually looks like in 2026, because the requirements have shifted. When Daniel talks about a "code-defined" workload, he’s drawing a line in the sand. He doesn't want a "no-code" box-shoveling interface where you drag a "Send Email" block to an "LLM" block. He wants to write a Python function or a TypeScript class, hit deploy, and have that code become a living, breathing process that can survive a server restart.
Corn
Right, and it’s not just about surviving a restart. It’s about the "plumbing" around the code. If I’m building an agent that needs to, say, monitor a GitHub repo, summarize issues, and then wait for me to approve a draft response, that agent is essentially "sleeping" for hours at a time. In a traditional backend, I’d have to manage a database to save the state, a queue to handle the retries, and a cron job to wake it up. A durable backend platform says, "Don't worry about the database or the queue. Just write your code. When you call await, we’ll pause the execution, save the entire call stack, and wake it up whenever the data hits our webhook."
Herman
The "authenticated webhook" part is crucial too. You want your agent to be triggered by external events—a Stripe payment, a Slack message, a custom telemetry signal—but you don't want to spend three days setting up OAuth or middleware just to verify the payload. You want a platform where you can define an endpoint in your code and have the infrastructure handle the security and environment variables.
Corn
And don't forget the observability. If an agent "hallucinates" or gets stuck in a loop calling an expensive model, you need to see that in the logs immediately, not three days later when your credit card gets declined. You need a nested tree of every LLM call, every tool execution, and every decision the agent made.
Herman
It’s essentially "Infrastructure as a Service" but specifically for the non-deterministic, long-running nature of AI. We're moving from "Request-Response" to "Workflow-as-a-Service." That’s the target Daniel is aiming for.
Corn
So we’re looking for the sweet spot: the power of writing raw code with the convenience of a managed cloud. Let's look at who's actually delivering that, starting with the platform that’s been making a lot of noise lately by pivoting hard into this space.
Herman
If we are talking about the gold standard for this, we have to start with Temporal. They’ve been the heavy hitter in durable execution for years, but the March twenty twenty-six release of Temporal one point twenty-five really changed the game for agents. They added native large language model task queues. Before that, you had to sort of shoehorn your AI logic into their standard workflow-activity model, but now it’s built to handle the non-deterministic nature of these models.
Corn
It’s funny because Temporal used to be the thing you only touched if you were building a high-frequency trading platform or a massive payment processor like Stripe. It felt a bit like bringing a tank to a knife fight for a simple chatbot. But now that agents are actually doing multi-step reasoning, that "tank" is exactly what you need. If your agent is halfway through a research task and the worker crashes, Temporal doesn't just restart the whole thing and bill you for another ten thousand tokens. It knows exactly which line of code it was on and resumes from the last checkpoint.
Herman
That’s the "event sourcing" magic. It’s not just saving variables to a database; it’s recording every single side effect. When the code "replays," it doesn't actually re-run the expensive tool calls or the LLM prompts. It looks at the history, sees that the prompt already returned a specific JSON object, and just hands that result to the function. It makes your code look like a standard, linear script, even if it takes three days to finish.
Corn
I saw a case study recently where a developer built a routing agent on Temporal. A webhook from a chat app would trigger the workflow, and the first step was a fast, cheap model to categorize the intent. If it was a complex technical query, it would route to a heavy-duty reasoning model. If it was a billing issue, it would wait for a human-in-the-loop signal from the finance team. The developer basically wrote a "while" loop with some "if" statements, and Temporal handled the persistence of that entire decision tree.
Herman
And if you want that same reliability but you’re already locked into the Amazon ecosystem, you’re looking at AWS Step Functions. They had a massive update in January for AI orchestration patterns. It’s a different philosophy because it’s fundamentally serverless and state-machine based. You get authenticated API Gateway triggers out of the box, and everything flows into CloudWatch logs.
Corn
But Step Functions feels a bit more like "Lego bricks" compared to Temporal’s "write whatever code you want" vibe, right? You’re defining states in Amazon States Language, which is basically a giant JSON file.
Herman
It is, though the CDK support has made it feel more like code lately. The tradeoff is the overhead. Temporal gives you ultimate control and local determinism, but the learning curve is steep—you have to be careful not to use things like random number generators inside the workflow logic. Step Functions handles the infrastructure scaling better without you thinking about "workers" at all, but you lose some of that granular code-level flexibility. For an agentic backend, the "durability" is the feature. You’re trading a bit of execution speed for the guarantee that the job will eventually finish, no matter what happens to the underlying hardware.
Corn
It’s that classic trade-off between the "Lego brick" orchestration of AWS and the "pure code" approach of Temporal. But if you’re already living in the Google ecosystem, Google Cloud Workflows is the middle ground that’s been getting a lot of love lately. They had those February enhancements for observability that specifically targeted long-running AI traces.
Herman
Google Cloud Workflows is fascinating because it’s fundamentally an HTTP-based orchestrator. It’s serverless, so you aren't managing workers, but it’s remarkably good at LLM routing. You can trigger a workflow via an authenticated webhook through a Cloud Function, and then it can coordinate calls to Vertex AI or even external APIs. The February update added much better tracing for nested LLM calls, so you can actually see the latency of each "thought" in the agent's process right in the console.
Corn
Is it still that weird YAML-based syntax, or can I stay in my IDE?
Herman
It’s still mostly defined in YAML or JSON, which is the friction point for a lot of people who want "Workflows as Code." But for multi-cloud flexibility, it’s actually a sleeper hit. If you want to trigger a job from a webhook and have it branch out to three different model providers based on the initial prompt's complexity, it handles that state management without you needing to stand up a database for the "checkpoints."
Corn
Now, if we look at the enterprise side, Azure Durable Functions is the one I see mentioned when people are doing heavy-duty .NET or JavaScript work. They had that December update with the AI extensions, right?
Herman
They did. Azure Durable Functions is a very mature implementation of the "Virtual Entity" pattern. It’s great for agentic backends because it uses Azure Key Vault for environment management and Application Insights for logs by default. The December update specifically made it easier to maintain "conversation state" across multiple turns in an agentic loop without manually passing a massive JSON object back and forth. You essentially just write a standard function, and Azure "pickles" the state of your local variables into table storage automatically.
Corn
It sounds like the "safe" choice for a corporate environment, but what about the developer who wants the power of Temporal without the "enterprise" headache? I know Fly dot io has been making moves there.
Herman
That’s the hybrid approach. You use Fly dot io for the compute layer—running your workers in those fast, global micro-VMs—and then point them at a Temporal cluster. What’s cool is that Fly adjusted their pricing just this month, in April, making it way more cost-effective for small-scale agents. You can run a tiny Temporal "sidecar" and have a durable backend for basically pennies if your agent isn't constantly looping.
Corn
So we’re seeing a real shift here. We’re moving from "how do I keep this server alive?" to "how do I define the logic of the agent's life?" But isn't there a massive second-order effect here with vendor lock-in? If I build my entire agent's "brain" around Azure Durable Functions or AWS Step Functions, I’m not just moving a script; I’m moving an entire state machine architecture if I ever want to switch.
Herman
You’re absolutely right. That’s the hidden cost. You’re offloading the "ops" work, but you’re tethering your logic to the platform's specific way of handling retries and state. If you go the Step Functions route, you’re married to the AWS ecosystem. If you go with Google Cloud Workflows, you have more multi-cloud flexibility, but you’re writing in a proprietary orchestration language. It’s a classic architectural trade-off: speed of delivery versus long-term portability.
Corn
It’s the "Home Depot" problem. You go in for a lightbulb and walk out with a whole smart home system that only works with one brand of hub. But when the alternative is building your own event-sourcing engine from scratch, I think most developers are happy to take the lock-in if it means their agent actually finishes its task.
Herman
That’s the calculation everyone has to make. But if we distill all this into a decision framework, the first actionable insight is pretty clear: if you need to go from zero to a working prototype by the end of the day, stick to the ecosystem you already inhabit. For most developers, that means starting with AWS Step Functions or Google Cloud Workflows. The setup time is almost non-existent because you aren't managing workers or clusters; you're just defining the state machine and letting the managed service handle the authenticated webhooks and environment variables.
Corn
It’s the path of least resistance. If I’m already using AWS, spinning up a Step Function to orchestrate a few Lambda calls for an agent is a no-brainer. I get the logs in CloudWatch for free, and I don't have to explain a new line item on the bill to my boss. But, and this is the second big insight, if you're building something that’s actually mission-critical—like an agent that’s going to run for three weeks and handle sensitive data—you have to look at Temporal. Its durability is the gold standard, and because it's open-source, you have that "escape hatch" where you can self-host if the cloud costs ever get out of hand.
Herman
The durability of Temporal is uniquely robust because it doesn't just "retry" a function; it reconstructs the entire state of the workflow by replaying the event history. For a complex, long-running agent, that’s the difference between a minor hiccup and a total system failure.
Corn
So for the listeners sitting there with a half-finished agentic script, what’s the immediate move? I’d say evaluate based on your current stack. Don't over-engineer it on day one. Pick the platform that integrates with your existing log aggregation and identity providers. Set up a sandbox, test how it handles a webhook trigger when the LLM takes forty-five seconds to respond, and see if the observability gives you a clear tree of what the agent actually "thought" at each step. If you can't see the trace, you can't debug the agent.
Herman
That’s the perfect place to leave it. If you can't see the internal monologue of the agent, you're just staring at a black box hoping for the best. And what’s wild is how fast this is moving. As of April twenty twenty-six, we’re seeing a massive shift toward integrated AI observability. It’s not just about "did the function run," it's about "did the agent hallucinate on step four of a twelve-step chain." We’re getting to a point where these platforms might make third-party logging solutions redundant because the tracing is so deeply baked into the execution engine itself.
Corn
It makes sense. If the platform is already checkpointing the state, it might as well record the prompt and the response too. But it leaves me with a bigger question. Are these platforms always going to be "workflow-centric," where we’re basically just drawing sophisticated maps for the AI to follow? Or are they going to evolve into something truly autonomous, where the platform itself manages the agency?
Herman
That is the frontier. Right now, we’re still mostly using these as guardrails—durable tracks for the train to run on. But the more these engines understand the "intent" of the code through better LLM routing and native task queues, the closer we get to a backend that isn't just a host, but a partner in the execution.
Corn
Well, until the backends start writing the podcasts themselves, you’re stuck with us. This has been a deep dive into the plumbing of the future. Big thanks to our producer, Hilbert Flumingtop, for keeping our own workflows durable.
Herman
And a huge thank you to Modal for providing the GPU credits that power the infrastructure behind this show.
Corn
This has been My Weird Prompts. If you found this useful, search for My Weird Prompts on Telegram to get notified the second a new episode drops.
Herman
See you in the next one.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.