#1843: Why Is My AI Pipeline Stuck? (Kanban-Style Observability)

Stop digging through JSON logs. See your AI jobs moving on a board, not just server metrics.

Episode Details
Episode ID
MWP-1998
Published
Duration
25:21
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Hidden Cost of Invisible Workflows

Modern AI pipelines are complex. A single job might trigger a dozen stages—ingestion, validation, orchestration, generation, post-processing, delivery—and span multiple models, containers, and API calls. Yet, when something goes wrong, most teams are stuck staring at logs that look healthy while their actual work is frozen in place. This is the classic "needle in a haystack" problem: traditional monitoring tools are built to track server health, not the state of a business process.

The Core Problem: Infrastructure vs. Job State

Traditional observability rests on three pillars: logs, metrics, and traces. These are excellent for answering questions like "Is the server down?" or "Is latency spiking?" But they fail at answering the most critical question for an AI operator: "Where is my job right now?"

In a multi-stage agentic workflow, a single "job" might span ten minutes and fifty separate API calls. If one call fails and retries, the system metrics might look fine, but your job is stuck in an infinite loop of self-correction. You need to see that stall visually—like a stuck truck on a delivery map—not just a green light on a server dashboard.

The Kanban-Style Solution

The emerging solution is "State-First Observability," a movement that treats the "where" as more important than the "what." Instead of digging through JSON logs to find an error code, you want to see a card on a digital board that has turned red. This is essentially a visual representation of a state machine: each column is a status (e.g., "Drafting," "Review," "Publish"), and each job is a card.
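
Underneath, that board is just a state machine. A minimal Python sketch, using the column names from the example above (the transition table is illustrative, not a prescribed schema):

```python
from enum import Enum

class Stage(str, Enum):
    """Each enum value corresponds to one column on the board."""
    DRAFTING = "Drafting"
    REVIEW = "Review"
    PUBLISH = "Publish"
    FAILED = "Failed"

# Legal transitions: a card may only move along these edges.
TRANSITIONS = {
    Stage.DRAFTING: {Stage.REVIEW, Stage.FAILED},
    Stage.REVIEW: {Stage.PUBLISH, Stage.DRAFTING, Stage.FAILED},
    Stage.PUBLISH: set(),   # terminal columns have no outgoing edges
    Stage.FAILED: set(),
}

def move(current: Stage, target: Stage) -> Stage:
    """Validate a card move before writing it to the state store."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal move: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves at write time is what keeps the board trustworthy: a card can never silently jump from "Publish" back to "Drafting" without someone noticing.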

This approach aligns with how humans naturally manage work. If you were managing human writers, you'd use Trello or Asana—not a Grafana dashboard tracking their typing speeds. The same logic applies to AI agents. The pipeline is a high-speed project management system where the workers are agents, and the board is the source of truth.

The Tool Landscape: From Heavyweight to Lightweight

The market for workflow visualization is wide but fragmented. On one end, you have heavy enterprise tools like Prefect and Temporal. Prefect Cloud offers beautiful workflow visualizations and is great for teams with a budget (starting around $500/month). Temporal is the gold standard for reliability, used by companies like Uber and Netflix, but its UI is more of a detailed timeline than a Kanban board, and it requires significant infrastructure to run.

On the other end, you have specialized AI observability platforms like Langfuse and Helicone. These are fantastic for deep-diving into LLM calls—seeing prompt versions, token counts, and costs—but they're still "table-heavy," presenting rows of data rather than a visual board. They're developer tools, not operator tools.

For Python-heavy environments like Modal, the options are even trickier. KaibanJS, a JavaScript framework for visualizing multi-agent systems, offers a built-in "Kaiban Board" that syncs with agent states in real time. However, it's JS-native, requiring a bridge for Python teams. The philosophy is right, but the integration isn't seamless.

The DIY Path: Low-Code Observability

For teams without an enterprise budget, the answer often lies in building it yourself with lightweight components. Tools like Retool or Appsmith allow you to drag a Kanban component onto a canvas and map it to a simple database (e.g., Supabase). In your Modal code, you add one line at the end of each stage to update the job's status. This "Low-Code Observability" path gives you a "Mission Control" view without writing a custom React frontend from scratch.
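
As a sketch of what that "one line per stage" might look like, here is a stdlib-only status updater targeting Supabase's PostgREST-style REST API. The project URL, key, and jobs table name are assumptions about your setup, not a prescribed schema:

```python
import json
import os
import urllib.request

# Assumed configuration; point these at your own Supabase project.
SUPABASE_URL = os.environ.get("SUPABASE_URL", "https://example.supabase.co")
SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "service-role-key")

def build_status_update(job_id: str, status: str) -> urllib.request.Request:
    """Build the PATCH request that moves a job's card to a new column."""
    url = f"{SUPABASE_URL}/rest/v1/jobs?id=eq.{job_id}"  # PostgREST row filter
    body = json.dumps({"status": status}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "apikey": SUPABASE_KEY,
            "Authorization": f"Bearer {SUPABASE_KEY}",
            "Content-Type": "application/json",
        },
    )

def update_status(job_id: str, status: str) -> None:
    """The one line you call at the end of each pipeline stage."""
    urllib.request.urlopen(build_status_update(job_id, status), timeout=10)
```

With this in place, each stage of a Modal function ends with something like update_status(job_id, "Generation Complete"), and the Kanban component simply reads the same table.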

Similarly, "State-as-a-Service" tools like AITable.ai or Baserow treat your pipeline like a database, where each row is a job and each column is a stage. Switch to Kanban view, and you have an instant visual monitor.

Key Takeaways

  • Traditional monitoring tracks infrastructure health, not job state.
  • Kanban-style observability visualizes workflows as cards moving through stages.
  • Enterprise tools like Prefect and Temporal are powerful but expensive and complex.
  • Lightweight tools like Retool or Supabase can bridge the gap for small teams.
  • The goal is to see "what" is happening now, not just "why" it happened.

The market is missing a polished, affordable tool for pro-level dev teams of 2-3 people who want a cool dashboard without the enterprise overhead. Until then, the DIY path using low-code tools and simple database updates is the most practical solution.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3
Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#1843: Why Is My AI Pipeline Stuck? (Kanban-Style Observability)

Corn
Our Modal pipeline has twelve distinct stages and forty-seven possible states, and yet, when I look at our monitoring, I feel like I am trying to find a specific needle in a haystack made of other, slightly different needles. Today's prompt from Daniel is about that exact frustration. He is looking for a lightweight, kanban-style utility for visual workflow observability. Basically, a heads-up display that says Job forty-seven is stuck in stage three, rather than a deluge of logs telling us the CPU temperature in a data center we don't own.
Herman
It is a classic problem of abstraction layers. We have moved from simple scripts to these multi-stage agentic systems, but our monitoring tools are still stuck in the era of server health. By the way, today's episode is powered by Google Gemini three Flash, which is fitting since we are talking about the brains of these operations. I am Herman Poppleberry, and honestly, Corn, I have been obsessing over this gap between system telemetry and job state all week.
Corn
It is funny how we have all these high-powered tools like Datadog or Langfuse, but sometimes you just want a whiteboard with some sticky notes on it. Our own pipeline here at My Weird Prompts is a perfect example. It goes from ingestion to validation, then into the actual agent orchestration, generation, post-processing, and finally delivery. If something goes wrong in the validation stage, I do not want to go digging through five hundred lines of JSON logs to find the error code. I want to see a card on a digital board that has turned red.
Herman
That is the "State-First Observability" movement that is really gaining steam here in twenty-six. The idea is that for autonomous agents, the "where" is more important than the "what." Traditional observability is built on the three pillars: logs, metrics, and traces. But those are designed to tell you if the infrastructure is dying. Think of it like a hospital monitor. A heart rate monitor tells you the body is alive, but it doesn't tell you if the patient is actually writing a novel or just staring at the wall. When you are running a pipeline on Modal, the infrastructure is abstracted away. You do not care about the container's memory usage as much as you care about whether the "Research Agent" actually handed off its findings to the "Writer Agent."
Corn
So, when we talk about "kanban-style workflow observability," we are effectively talking about a visual representation of a state machine. Each column is a status, and each job is a card. It sounds simple, but why is it so hard to find a tool that just does that without wanting five hundred dollars a month and access to my first-born's metadata?
Herman
Because most companies building these tools are targeting enterprise DevOps teams who need to correlate a spike in latency with a specific database query. They are solving for scale and complexity, not for the "at-a-glance" clarity that a small team needs. If you look at something like Langfuse or Helicone, they are fantastic for deep-diving into LLM calls—seeing exactly which prompt version was used or how many tokens were consumed—but they are still very "table-heavy." You are looking at rows of data. What Daniel is asking for is a "Mission Control" view. He wants to look at a monitor on the wall and know the health of his business process, not the health of his clusters.
Corn
Right, and Modal's native limitations don't help much here. Don't get me wrong, I love Modal. Their execution logs are clean, and the CLI is great. But their dashboard is an execution history, not a workflow monitor. It tells you that a function ran and succeeded or failed. It doesn't naturally stitch twelve different function calls together into a single "Episode Production" job that I can track across a board. If I have a "parent" function that spawns ten "child" functions across different Modal containers, the dashboard shows me eleven separate bars. It doesn't show me one "Job" moving through stages.
Herman
And that is where the friction starts. If you want that view, you usually have to build it yourself or buy into a heavyweight orchestrator. Let's talk about that spectrum, because it is wider than people realize. On one end, you have your basic log aggregation—the ELK stack, which is Elasticsearch, Logstash, and Kibana. That is the "searching through haystacks" method. Then you move to the specialized AI observability platforms like Langfuse. Those are great for seeing your prompt tokens and cost, but they still feel like a developer tool, not a project management tool.
Corn
I think the "project management" comparison is key. An AI pipeline in twenty-six is essentially a high-speed project management system where the workers are agents. If I were managing human writers, I would use Trello or Asana. I wouldn't be looking at their heart rates and typing speeds in a Grafana dashboard. I would want to see the status of the article. Is it in "Drafting"? Is it in "Legal Review"? Is it "Ready to Publish"?
Herman
Precisely. And the reason traditional APM—Application Performance Monitoring—fails here is that it treats every request as a discrete event. In a multi-stage agentic workflow, a single "job" might span ten minutes and fifty separate API calls across three different models. If one of those calls fails and retries, the system metrics might look fine, but your job is stuck. You need to see that stall visually. It’s the difference between seeing a "green light" on a server and seeing a "stuck truck" on a delivery map.
Corn
I've had that happen. I'll be waiting for an episode draft to pop out, and nothing happens. I check the logs, and everything is "green," but then I realize the agent is stuck in an infinite loop of "self-correction" because it didn't like the tone of the third paragraph. A kanban board would show that card sitting in the "Review" column for ten minutes, which is an immediate red flag. But wait, if I’m using something like Langchain or CrewAI inside my Modal functions, don't they have built-in visualizers?
Herman
They do, but they are often ephemeral. They work while the script is running on your local machine, but as soon as you deploy that logic to a headless cloud environment like Modal, that local UI disappears. You need a persistent state store that lives outside the execution environment. There is a tool Daniel mentioned in his notes called KaibanJS that I think hits this head-on. It is a JavaScript framework specifically for building and visualizing multi-agent systems. It actually includes a built-in "Kaiban Board." It uses a Redux-style state management to sync what the agents are doing to a UI in real-time. So, as the agent works, the card moves. It is the first time I have seen a framework treat the visualization as a first-class citizen rather than an afterthought.
Corn
Is KaibanJS something you can just strap onto a Python-heavy environment like Modal, though? Most of our backend is Python. If I have to rewrite my entire orchestration layer in JavaScript just to get a pretty board, that feels like a lateral move in terms of frustration.
Herman
That is the catch. It is a JS-native framework. If you are already in that ecosystem, it is perfect. But for a Python team, you are looking at a bit of an integration bridge. You would likely need to have your Python functions on Modal emitting state updates to a small Node server or a Supabase instance that KaibanJS can then pick up. It is not "plug and play" for us, but it represents the exact philosophy Daniel is looking for. It treats the "Board" as the source of truth for the system state.
Corn
What about the SaaS side of things? If I have a bit of a budget and I just want this to go away, where am I looking? You mentioned Prefect and Temporal in the plan. How do they handle the "Kanban" aspect?
Herman
Prefect is probably the closest to a "polished" version of this. Prefect Cloud has a really beautiful workflow visualization. They have this concept of "Flows" and "Tasks." When you run a pipeline, you can see the graph of how things are moving. It has a very clean UI that tells you exactly where a "Flow" is. The downside is the cost. Prefect Cloud for a team starts at around five hundred dollars a month. For a small independent podcast or a solo dev, that is a steep hill to climb just for a visual board. Also, you have to buy into their way of writing code—using their decorators and their orchestration logic.
Corn
Five hundred dollars a month buys a lot of coffee and manual log-checking. What about Temporal? I know they are the big players in "durable execution." They basically promise that your code will finish running even if the entire data center explodes and has to be rebuilt.
Herman
Temporal is the gold standard for reliability. If you are Uber or Netflix and you cannot afford for a transaction to ever fail, you use Temporal. Their UI is very functional—it shows you the history of every "Activity" in a "Workflow." But it is not a "kanban board." It is more of a detailed timeline. It’s very "industrial." Also, Temporal is famously "heavy." Running a self-hosted Temporal cluster is a full-time job in itself. You need a database, a visibility store, a worker pool... it’s a lot. They have Temporal Cloud, but again, you are entering enterprise pricing territory very quickly. It’s overkill for "I want to see where my agent is."
Corn
It feels like there is this massive hole in the market. On one side, you have "I am a hobbyist and I will use print statements in my logs," and on the other, you have "I am a Fortune five hundred company and I have a sixty-thousand-dollar observability budget." Where is the "I'm a pro-level dev team of three people who want a cool dashboard" option?
Herman
That is the "gap" Daniel is talking about. And honestly, right now, the answer for a lot of people is "Build it yourself using lightweight components." One of the most interesting suggestions in the research notes was using a tool like Retool or Appsmith. Since we are on Modal, we can have our functions push their state to a simple database—like a Supabase table or even a Google Sheet if you are feeling truly chaotic—and then point a Retool Kanban component at that table.
Corn
I actually like the Retool idea. It takes about twenty minutes to drag a Kanban component onto a canvas and map the "columns" to a "status" field in a database. Then, in our Modal code, we just add one line at the end of each stage: update_status_in_db(job_id, "Generation Complete"). But does that scale? If I have a thousand jobs running at once, is my Retool board going to have a heart attack?
Herman
Retool handles thousands of rows just fine. The bottleneck would actually be your own eyes. If you have a thousand cards on a Kanban board, you’ve just created a new kind of haystack. But for the scale most of us are working at—where you might have fifty active "episodes" or "research tasks" in flight—it is remarkably effective. It’s the "Low-Code Observability" path. It gives you that "Mission Control" feeling without having to write a custom React frontend from scratch. Plus, Retool has a generous free tier for small teams.
Corn
The "State-as-a-Service" approach is also gaining traction. There is this tool called AITable dot ai, or even Baserow, which are basically "Airtable clones with better APIs." You treat your pipeline like a database. Each row is a job, each column is a stage. You can switch to "Kanban View" in their UI, and boom—you have your visual monitor.
Herman
And it’s important to distinguish between "Observability" and "Tracing." Tracing—like what you get with Arize Phoenix—is for figuring out why something happened. It shows you the nested calls, the latency of each LLM request, and the cosine similarity of your embeddings. That is for developers. Kanban observability is for operators. It tells you what is happening right now. Most AI teams are currently drowning in Tracing data but starving for State data.
Corn
It's funny how we're circling back to "just use a database." It's almost like the "AI agents" are just rows in a CRUD app that move very slowly. But I suppose that's exactly what they are from a management perspective. If I think about our audio generation, it's really just a series of state changes: PENDING to TRANSCRIPT_READY to VOICE_SYNTH_COMPLETE.
Herman
Another tool worth mentioning is Lunary. It is open source, and it is marketed as a cleaner, more "heads-up" alternative to Langfuse. It focuses on "Runs" rather than just "Traces." The UI is much more visual. It doesn't give you a literal Kanban board out of the box, but it gives you a "Job List" that feels much more like a modern dashboard and less like a spreadsheet. It allows you to tag runs with custom metadata, so you could tag something as Project: Episode 402 and see all the related tasks in one view.
Corn
How does the integration look for something like Lunary or Arize Phoenix? Are we talking about wrapping every function call in a decorator? Because if I have to go back and touch every single one of my twelve stages, I might just stick to the logs.
Herman
Usually, yes. Most of these tools provide a Python SDK where you can wrap your LLM calls or your main functions. Arize Phoenix is particularly cool because you can run it locally. If you are developing on your machine and you want to see the "trace" of your nested agentic workflow, you just spin up a local Phoenix server, and it provides this incredibly detailed visual flow of how the data moved from "Agent A" to "Router" to "Agent B." It is excellent for debugging, though maybe less of a "permanent mission control" for a production pipeline on Modal. But the decorator overhead is actually quite low—usually just one @trace line above your function definition.
Corn
I think the "Human-in-the-loop" aspect Daniel mentioned is where this gets really powerful. If I see a card stuck in the "Review" column on my Retool board, I want to be able to click that card and see exactly what the agent is complaining about. Maybe it's a "safety filter" that got triggered, or maybe the output was just garbage. If I can have a "Approve" button right there on the Kanban card that pushes the job to the next stage... that's the dream. It turns the observability tool into an intervention tool.
Herman
That is the "Command Center" evolution. You are not just observing; you are participating. And that is why the "Build" option is so attractive for teams on Modal. Because Modal's execution environment is so flexible, you can actually build "Pause" points. You can have a function that saves its state, sends a webhook to your Retool dashboard, and waits. You, the human, look at the Kanban board, see the "Pending Approval" card, check the text, hit "Approve," and that triggers a Modal webhook to resume the execution. It’s basically building a custom "Human-Agent Collaboration" platform on top of a Kanban board.
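
Stripped to its essentials, Herman's pause point is two tiny operations against the same jobs table. A hedged sketch, with an in-memory dict standing in for the real database (every name here is illustrative):

```python
# JOBS stands in for the jobs table; reads/writes would be SELECT/PATCH calls.
JOBS = {}

def update_status(job_id: str, status: str) -> None:
    JOBS[job_id] = status            # replace with a PATCH to your jobs table

def read_status(job_id: str) -> str:
    return JOBS.get(job_id, "")      # replace with a SELECT

def request_approval(job_id: str) -> None:
    """Park the card in the 'Pending Approval' column for a human to see."""
    update_status(job_id, "Pending Approval")

def is_approved(job_id: str) -> bool:
    """Poll this between stages; resume only once a human has approved."""
    return read_status(job_id) == "Approved"

request_approval("job-47")           # agent pauses; card appears on the board
update_status("job-47", "Approved")  # simulated click on the dashboard card
```

In production the "simulated click" would be the Approve button on the Retool card writing to the same row, and the paused Modal function would poll (or receive a webhook) before resuming.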
Corn
That sounds like a sophisticated version of what we do now, which is me yelling "Herman, why is the bot broken?" and you looking at a terminal for twenty minutes.
Herman
My terminal is very comforting, Corn, but I admit it doesn't scale well. If we look at the "build" vs "buy" tradeoff here, it really comes down to how "agentic" your system is. If you just have a linear pipeline—A goes to B goes to C—then a simple Retool board is plenty. But if you have agents that can branch, loop, and call each other dynamically—meaning the "next stage" isn't always predictable—then you probably need something like KaibanJS or a more robust orchestration layer like Temporal, despite the complexity. You need a system that can handle non-linear card movement.
Corn
Let's talk about the "DIY" approach on Modal specifically. If I'm a developer listening to this and I want to set this up tomorrow, what's the minimal viable setup? I don't want to spend three days on infrastructure for my infrastructure.
Herman
Minimal viable? Okay. Step one: Create a Supabase project. It's free and gives you a Postgres database with a REST API. Step two: Create a table called jobs with columns for id, status, payload, and updated_at. Step three: In your Modal code, use the httpx library to send a PATCH request to Supabase at the start and end of every major function. You can even write a simple Python context manager so you just do with job_status("Processing"): and it handles the API calls for you. Step four: Log into Retool, connect your Supabase DB, and drag the "Kanban" component onto the page. Map "Column" to the status field. You are done in under an hour, and you have a real-time visual workflow monitor.
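
The context manager from step three might look like the sketch below. The database writer is stubbed with a list so the shape is visible; in practice it would be the PATCH call to your jobs table (Herman mentions httpx, but any HTTP client works), and the job id and stage names are illustrative:

```python
import contextlib
import time

STATUS_LOG = []  # stand-in for the real database writer

def update_status(job_id: str, status: str) -> None:
    STATUS_LOG.append((job_id, status))  # replace with a PATCH to Supabase

@contextlib.contextmanager
def job_status(job_id: str, stage: str):
    """Mark a stage in-progress on entry, and complete/failed on exit."""
    update_status(job_id, f"{stage}: in progress")
    try:
        yield
        update_status(job_id, f"{stage}: complete")
    except Exception:
        update_status(job_id, f"{stage}: failed")
        raise

# Usage inside a Modal function:
with job_status("job-47", "Generation"):
    time.sleep(0)  # ... run the stage's real work here ...
```

The try/except matters: a crash inside the stage still moves the card, so a red "failed" column replaces the silent stall the hosts complain about.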
Corn
And that costs exactly zero dollars until you hit a pretty significant scale. That’s much more appealing than the five-hundred-dollar-a-month "Enterprise Observability" tax. But does this approach work if I have multiple people on the team? If you and I both have the Retool board open, will it stay in sync?
Herman
It will, because Supabase and Retool both support real-time updates over WebSockets. If a Modal worker updates a row in Postgres, the card slides from one column to the next on both our screens simultaneously. And it solves the specific problem of "noise." Traditional tools give you a deluge of telemetry because they don't know what your "job" is. They just see "Function A ran for two seconds." By building the state-tracking into your code, you are defining the "meaning" of the workflow. You are choosing what constitutes a "stage" move.
Corn
I think there's also a psychological benefit to the Kanban view. When you're running complex AI tasks—like generating a whole podcast script—it feels like a "black box." You're just waiting for the finished product. Seeing the cards move across the board makes the "work" visible. It feels more like a factory floor and less like a magic trick that occasionally fails. It reduces the anxiety of "is it actually doing anything?"
Herman
That is a great point. It makes the system "legible" to non-technical stakeholders too. If Hannah or Ezra wanted to see how the "Episode Production" was going, they could look at a Kanban board and understand it instantly. They couldn't do that with a Datadog dashboard or a stream of Python tracebacks. They can see "Oh, we have five episodes in the 'Audio Edit' stage and two in 'Fact Checking'." It turns a technical pipeline into a business process.
Corn
Well, Ezra might just try to eat the iPad, but the point stands. Legibility is a form of reliability. If you can see where it's broken, you can fix it faster. I also think it helps with identifying bottlenecks. If I notice that cards always spend three hours in the "Fact Checking" stage but only two minutes in "Generation," I know exactly where I need to optimize or add more agents.
Herman
There is one more tool I should mention for the "lightweight" category: Lilypad. It is a newer project that focuses on versioning the logic and the prompt together as a single unit. It treats every "run" as a reproducible artifact. It’s not a Kanban board, but it gives you this very clean "Heads-up" view of every time a specific function was called, what the inputs were, and what the output was. It is very "developer-centric" but much more focused than a general-purpose logger. It’s great for when you want to see a history of "Stage 4" specifically across all your jobs.
Corn
It feels like we're seeing the "unbundling" of observability. Instead of one giant platform that does everything poorly for everyone, we're getting these "surgical" tools. KaibanJS for agent state, Lilypad for function versioning, and then the "DIY" boards for the high-level workflow. It’s a modular approach to monitoring.
Herman
I think that is the right way to look at it. The "heavyweight" platforms are trying to be the "Nginx for your AI stack"—which we have talked about before in terms of gateways—but for observability, the "one size fits all" approach is failing because the "state" of an AI job is a business-logic concern, not an infrastructure concern. No generic tool can know that "Stage 3" of your specific pipeline is the most critical part. Only you know that.
Corn
So, let's wrap this into some actual takeaways for someone in Daniel's position. If you've got a Modal-based pipeline and you're tired of the "telemetry deluge," what's the decision tree look like?
Herman
If you have more money than time and you want a "pro" feel: Go with Prefect Cloud. It is built for this. It understands Python, it understands multi-stage workflows, and the UI is beautiful. You will pay for it, but it works. It’s the "out of the box" solution for people who want to focus on their agents, not their dashboards.
Corn
And if you have more time than money, or you want something perfectly tailored to your weird little pipeline?
Herman
Then the "DIY Mission Control" is the winner. Use Supabase or any lightweight DB as your "state store," and use Retool for the UI. It gives you the Kanban view Daniel is looking for, and it allows you to add "Human-in-the-loop" buttons later if you want to be able to intervene in the pipeline. It’s the most flexible and cost-effective route for a small, agile team.
Corn
What about the "Agent-Native" route? If someone is building a swarm of agents rather than a linear pipeline?
Herman
If you are building specifically with multi-agent frameworks and you are comfortable in the JavaScript ecosystem, KaibanJS is the clear choice. It is the only thing right now that treats the Kanban board as the primary interface for the AI's "thought process." It’s built for the "chaos" of agents talking to each other.
Corn
I'm honestly tempted to spend some time this weekend setting up a Retool board for our own stuff. I'm tired of asking you if the "Audio Post-Processing" stage is done. I want to see the card move. It would also be great to have a "Re-run" button on the card for when I don't like a specific generation.
Herman
I would appreciate that, Corn. It would save me from having to explain that "Post-Processing" takes exactly as long as it takes, and not a second less. And adding a "Re-run" button in Retool that triggers a Modal webhook is actually quite simple. We could even pass different parameters directly from the UI.
Corn
Spoken like a true donkey. But seriously, this gap in the market—this "observability for the rest of us"—is going to get filled fast. I wouldn't be surprised if in a year, Modal or someone like them just has a "Toggle Kanban View" button in their dashboard that automatically maps your functions to columns. It’s such a logical way to view these systems.
Herman
I hope so. Until then, we are in the era of "Creative Integration." We are stitching together these lightweight tools to create the visibility we need. It is a bit more work upfront, but the payoff is a pipeline you can actually trust because you can actually see it. It moves AI development from guesswork to engineering.
Corn
Couldn't agree more. Visual clarity is the antidote to AI "black box" anxiety. This has been a great dive, Herman. I think Daniel has plenty to chew on here. It's all about surfacing the state, not the logs.
Herman
It is a fascinating space. As these agentic systems get more complex, the "Mission Control" desk is going to become the most important part of the stack. You can't manage what you can't see. Thanks for the great questions, Corn. You always push us to find the practical "aha" moments in these technical weeds.
Corn
That's what I'm here for. Taking it slow, asking the right questions, and occasionally making fun of your terminal obsession. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes and ensuring our "State Machine" stays in the "Success" column.
Herman
And a big thanks to Modal for providing the GPU credits that power our agentic experiments and this very podcast. Without their infrastructure, we wouldn't have any states to monitor in the first place.
Corn
If you're finding these deep dives helpful, we'd love it if you could leave us a review on Apple Podcasts or Spotify. It actually makes a huge difference in helping other "weird prompters" find the show and join the conversation.
Herman
This has been My Weird Prompts. We will catch you in the next one, hopefully with a fully functional Kanban board to show for it.
Corn
See ya. Keep those cards moving.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.