#2332: Voice-to-Task: Building the Claude Task Planner

How does a voice note turn into a completed task? Dive into the architecture and tradeoffs of building a Claude-powered task execution system.

Episode Details
Episode ID
MWP-2490
Duration
22:57
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Claude Sonnet 4.6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Claude Task Planner is a system designed to turn spoken tasks into completed actions using voice transcription, webhooks, and Claude CLI. At its core, the system relies on four key components: a voice note app, a webhook, a webhook receiver, and Claude CLI. Each step in the chain—voice to text, text to webhook payload, webhook to receiver, receiver to CLI invocation, and CLI to execution—presents potential failure points that must be carefully managed for the system to be reliable.

Transcription accuracy is critical. While modern tools like AssemblyAI and Voicie achieve high accuracy rates, even a small error can lead to significant misinterpretations in task contexts. For example, "delete the draft" versus "deliver the draft" could have drastically different outcomes. Structured transcription tools can pre-process tasks, extracting intent and entities for easier routing, but this adds dependencies and may lose nuance. Alternatively, sending raw transcripts to Claude allows for flexible interpretation but relies on the model to parse ambiguous input.

The webhook layer acts as the bridge between transcription and execution. Tools like N8N provide a workflow automation solution that can serve as the webhook receiver, offering visual debugging and built-in reliability features. However, writing a custom receiver allows for finer control over error handling and retry logic, though it requires more maintenance.

Claude CLI handles the actual task execution, whether it’s writing files, running commands, or making API calls. Keystroke emulation, though available, is discouraged in favor of API-based interactions whenever possible. Running the system on a VPS ensures robustness, as it remains always-on and connected, but introduces server maintenance and security considerations.

Security is a key concern. Public, unauthenticated webhook endpoints can allow anyone to queue tasks for your Claude agent, making shared secrets and rate limiting essential. Failure modes must be carefully designed to avoid false positives, where tasks are executed incorrectly, as these can be more damaging than false negatives.

Claude Code Routines offers a simpler entry point for those who want to test the concept without building custom infrastructure. However, for more complex workflows, transitioning to N8N or a custom receiver provides greater flexibility and control.

Ultimately, the Claude Task Planner is an asynchronous command interface that rewards structured input. Training users to speak in clear, task-specific formats improves reliability, making the system a practical tool for automating tasks that don’t require real-time execution.


Transcript

Corn
Daniel sent us this one, and it's a builder's prompt. He's describing a system he calls the Claude Task Planner, where you speak a task into a voice note app, that note gets transcribed and fired off via a webhook, and on the other end something like Claude CLI picks it up and actually executes it. The question underneath all of this is how you wire it together properly, what the tricky bits are, and whether you can make it robust enough to actually trust. Voice in, work done. That's the pitch.
Herman
I love that framing because the gap between "I said a thing" and "the thing got done" is where basically every productivity system falls apart. The voice part is easy now. The execution part is where it gets interesting.
Corn
By the way, today's episode is powered by Claude Sonnet four point six, so the friendly AI down the road is writing our script while we talk about building pipelines that talk to Claude. There's something poetic in there somewhere.
Herman
Layers within layers. Very on-brand for this show.
Corn
Alright, so let's actually dig into what this system is. At the highest level, you've got four components. A voice note app, a webhook, a webhook receiver, and then Claude CLI running on a desktop or a server. Daniel mentions N8N as a possible middleware layer, and a VPS for robustness. What's your first read on the architecture?
Herman
My first read is that it's genuinely elegant in its simplicity, and the elegance is also its main risk. Each handoff in that chain is a potential drop. Voice to text, text to webhook payload, webhook to receiver, receiver to CLI invocation, CLI to actual execution. That's five transitions, and any one of them can fail silently if you're not careful.
Corn
Which is why the robustness question isn't an afterthought. It's kind of the whole design problem.
Herman
And the reason this is even possible at all right now is that voice transcription has gotten good. The leading apps are hitting something like ninety-eight percent accuracy on clear speech, which sounds high until you realize that in a task context, one misheard word can send the wrong instruction downstream. "Delete the draft" and "deliver the draft" are very different commands.
Corn
That is an extremely uncomfortable example and I appreciate it.
Herman
The transcription layer matters a lot. Tools like AssemblyAI are doing real-time speech to text with decent entity and intent extraction on top — not just a word blob, but structured output that's easier to route. Voicie does something similar, specifically oriented toward task extraction and webhook delivery to downstream tools like ClickUp or Slack. The question is whether you want to trust that extraction layer or whether you want Claude to do the interpretation itself.
Corn
That's an interesting fork in the road. Either the transcription tool does the heavy lifting of figuring out what the task actually is, or you send the raw transcript to Claude and let it figure it out.
Herman
Real tradeoffs either way. If you pre-process with a tool like AssemblyAI's task extraction, you get structured data that's easier to validate and route, but you've added a dependency and potentially lost nuance. If you send the raw transcript to Claude, you get more flexible interpretation, but you're relying on the model to correctly parse ambiguous voice input and you've used a token budget before you've even started the actual task.
Corn
If you're building this for personal use, which is kind of what Daniel's describing, does the extra structure actually help? Or is it overhead?
Herman
For personal use with a consistent vocabulary and consistent task types, I'd lean toward raw transcript to Claude. You train yourself to speak in a way the model handles well, and you get more flexibility. The structured extraction tools start earning their keep when you're routing to multiple downstream systems or when the speaker pool is larger and less predictable.
Corn
The webhook layer. Walk me through what's actually happening there.
Herman
A webhook is just an HTTP POST. When your voice note app finishes transcribing, it fires a POST request to a URL you've configured, with the transcript as the body payload. That URL is your webhook receiver. The receiver validates incoming requests and then decides what to do with the content — in Daniel's setup, that means triggering Claude CLI.
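The receiver Herman describes can be sketched with nothing but Python's standard library. The `transcript` field name is an assumption about the payload, not a documented schema, and a real setup would queue the task rather than just print it:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class TaskReceiver(BaseHTTPRequestHandler):
    """Accepts the transcription app's POST and records the task text."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
            transcript = payload["transcript"]  # assumed field name
        except (ValueError, KeyError):
            self.send_response(400)  # malformed payload: reject, don't guess
            self.end_headers()
            return
        # Log before doing anything else -- this is the forensic record.
        print(f"received task: {transcript!r}")
        self.send_response(202)  # accepted for asynchronous execution
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging
```

Wiring it up is one line: `HTTPServer(("0.0.0.0", 8080), TaskReceiver).serve_forever()`.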
Corn
N8N fits in as the receiver?
Herman
N8N is a workflow automation tool, open source, self-hostable, which matters a lot for this kind of setup because you're passing potentially sensitive task data through it. It can act as the webhook receiver, do a quick format or validation step, then shell out to Claude CLI with the task as an argument. The alternative is writing a dedicated receiver yourself, which is more control but more maintenance.
Corn
What's the argument for writing your own versus using N8N?
Herman
Control over the failure behavior, primarily. N8N is a general-purpose tool and its error handling is generic. If you write a dedicated receiver, you can build in exactly the retry logic and alerting you want. The counterargument is that N8N has enough built-in reliability features that for most personal automation setups, it's fine — and you get a visual workflow editor, which is useful for debugging.
Corn
There's a real developer personality split here. The people who will absolutely write their own receiver and the people who will absolutely use N8N, and they will both think the other group is making the obvious mistake.
Herman
Honestly both are right for different contexts. One developer's writeup on a custom Claude Code task system is a good example of the roll-your-own approach working well — someone built a system dispatching voice tasks from a phone to a desktop CLI agent, integrating with Todoist, and the custom receiver gave them the fine-grained control they needed.
Corn
The Claude CLI piece. This is where the actual work happens. What does that invocation actually look like?
Herman
Claude CLI, or Claude Code if you're using the newer tooling, takes a prompt as input and executes it in an agentic loop. Your receiver calls something like "claude execute" with the task text as the prompt, and Claude takes it from there — writing files, running commands, making API calls, depending on what permissions you've granted it.
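A hedged sketch of that invocation as a subprocess call. The `-p` flag (one-shot prompt mode) is how recent Claude Code builds accept a prompt non-interactively, but verify it against your installed version; the command is injectable so the plumbing can be tested without the CLI present:

```python
import subprocess

def dispatch(task_text, cmd=("claude", "-p"), timeout=600):
    """Hand a task to the CLI agent and return (exit code, output).

    Note: exact Claude CLI flags vary by version -- "-p" is an assumption
    to check against `claude --help` on your machine. `cmd` is injectable
    so tests can substitute a harmless command.
    """
    result = subprocess.run(
        list(cmd) + [task_text],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode, result.stdout.strip()
```

In the N8N setup this is the shell command the function node runs; writing it as a function makes the timeout and error handling explicit.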
Corn
That's where keystroke emulation comes in?
Herman
That's one approach for interacting with applications that don't have APIs. It's brittle, it's slow, it breaks when the UI changes, but sometimes it's the only path. The honest advice is to avoid it wherever possible. If the thing you're trying to automate has an API, use the API. Keystroke emulation is the last resort, not the first tool.
Corn
What about the VPS question? Daniel flags that as a robustness consideration.
Herman
Your laptop sleeps, loses connectivity, runs out of battery. A VPS is always on, always connected, and you can give it a stable domain for the webhook receiver. AWS EC2 is the obvious reference point — a small instance for a few dollars a month running your receiver and Claude CLI permanently. The tradeoff is you're now maintaining a server, and you have to think about securing the webhook endpoint so random people on the internet can't send it instructions.
Corn
That second point seems underappreciated. If your webhook endpoint is public and unauthenticated, you've essentially given anyone who finds the URL the ability to queue tasks for your Claude agent.
Herman
Which could range from annoying to bad depending on what permissions that agent has. At minimum you want a shared secret in the webhook header that your receiver validates before processing anything. That's table stakes.
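The table stakes Herman mentions, in code: an HMAC-SHA256 check over the raw body, compared in constant time. The header name and signing scheme are whatever you configure on the sending side; this is a generic sketch, not any particular app's scheme:

```python
import hmac
import hashlib

def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check an HMAC-SHA256 signature before touching the payload.

    The sender computes hmac_sha256(secret, raw_body) and puts the hex
    digest in an agreed-upon header; the receiver recomputes and compares.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information to an attacker
    return hmac.compare_digest(expected, signature_header)
```

Rejecting requests that fail this check, before parsing anything, is what turns a public URL back into a private endpoint.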
Corn
Rate limiting is in the same category?
Herman
Rate limiting protects against your own pipeline misfiring as much as external abuse. If your voice note app glitches and sends the same webhook fifty times in ten seconds, you want your receiver to recognize that and not spin up fifty Claude sessions. A simple deduplication check on the payload is usually enough for a personal setup.
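A minimal version of that deduplication check: hash each payload and reject repeats inside a sliding window. The timestamp parameter is injectable purely so the window logic can be tested without waiting:

```python
import hashlib
import time

class Deduplicator:
    """Rejects payloads whose hash was seen within the last `window` seconds."""

    def __init__(self, window=30.0):
        self.window = window
        self._seen = {}  # payload hash -> last-seen timestamp

    def is_duplicate(self, payload, now=None):
        now = time.time() if now is None else now
        # evict entries older than the window
        self._seen = {h: t for h, t in self._seen.items() if now - t < self.window}
        digest = hashlib.sha256(payload).hexdigest()
        duplicate = digest in self._seen
        self._seen[digest] = now
        return duplicate
```

Fifty identical webhooks in ten seconds become one Claude session and forty-nine dropped requests.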
Corn
The failure modes here are kind of interesting because they're not symmetric. A false negative — a task that doesn't get executed — is annoying but recoverable. A false positive, where a task gets executed twice or with a garbled transcript, can be a real problem depending on what the task was.
Herman
That's exactly the right frame for the failover design. You want to be conservative about execution, not about delivery. Deliver reliably, execute cautiously. Which means logging everything, having a confirmation step for high-stakes tasks, and building in a way to inspect the queue before things run.
Corn
Claude Code Routines apparently supports some no-code automation including webhook triggers. Is that a simpler entry point for someone who doesn't want to build all of this from scratch?
Herman
It is, and it's worth flagging. The Routines feature lets you define workflows that Claude can execute on a trigger, including webhooks, without writing a custom receiver. The limitation is you're working within the Routines abstraction, which is less flexible than a fully custom setup. For someone who wants to test the concept before committing to infrastructure, it's a sensible starting point.
Corn
The progression is something like: start with Routines to prove the concept, move to N8N when you need more routing flexibility, build a custom receiver when you need precise control over failure handling.
Herman
That's a pretty clean ladder. And the voice note app choice can stay consistent across all three stages, which is nice. You're changing the receiver layer, not the input layer.
Corn
The thing that strikes me about this whole system is that it's essentially an async command interface. You're not typing commands, you're speaking them, but the underlying model is the same as if you'd typed "claude do this thing."
Herman
Which means all the discipline that goes into writing good prompts also applies here. Vague voice input produces vague results. The voice interface doesn't lower the bar for prompt quality, it just makes it easier to generate prompts quickly — and potentially easier to generate bad ones, because speaking is lower friction than typing.
Corn
You're more likely to be vague when you're not staring at a text field.
Herman
The best voice-to-task workflows I've seen train the user to speak in structured formats. Not natural language, but something like "task, add to project X, priority high, deadline Friday." It feels stilted at first but the downstream reliability is much better. You're basically inventing a spoken DSL.
Corn
A domain-specific language for your own task queue. I find that delightful. The sloth in me appreciates anything that makes doing things require less effort while lying down.
Herman
The latency question is interesting here too. The round trip from speaking to execution, if you're running on a local machine with a fast connection, can be under thirty seconds for a simple task. Through a VPS and Claude's API, you're probably looking at forty-five seconds to a couple of minutes depending on task complexity. Fast enough for async work, but not real-time.
Corn
Which means it's well-suited for tasks you'd put in a to-do list, not tasks you need done in the next five seconds.
Herman
This is a task queue, not a command prompt. And that distinction matters for the feedback loop. You want some kind of notification when the task completes or fails — a push notification, a Slack message, a log file — something that closes the loop because you've moved on by the time it finishes.
Corn
Alright, let's get into the specifics of building this out. There are a few decisions that look small but have significant downstream consequences.
Herman
If you strip it down, the core is four components in a chain: voice note app, webhook, receiver, and Claude CLI. Each one has a specific job, and the chain only works if each handoff is clean.
Corn
The voice app's job is transcription and dispatch. Apps like AssemblyAI and Voicie handle both — they transcribe the audio and push the text payload to a webhook URL you configure. The output is just text over HTTP, which is what makes the whole thing composable.
Herman
Ideally with metadata attached. Timestamp, source device, maybe a confidence score on the transcription. That metadata becomes useful when you're debugging or when you want the receiver to make decisions based on context.
Corn
The webhook is just the delivery mechanism. It's not doing any logic.
Herman
The intelligence lives at both ends, in the transcription model and in Claude. The webhook is the pipe. What the receiver does with the payload is where the interesting decisions happen — does it pass the raw transcript straight to Claude, do any parsing first, log before executing?
Corn
The Otter.ai case is worth walking through because it's the most common entry point. You record a voice note in Otter, it transcribes, and through Zapier you route that transcript to a webhook endpoint. The Zapier step is doing almost nothing technically — it's just watching for a new Otter transcript and firing an HTTP POST. But for someone who doesn't want to write a receiver from scratch, that's the difference between shipping something and not.
Herman
Zapier is the training wheels version of N8N. It handles the plumbing visually, you don't touch code. The limitation is you're paying per task, and you have less control over failure behavior. N8N running on your own machine or a VPS gives you retry logic, conditional routing, error branches, and it's not metering every execution.
Corn
What does the actual payload look like when it hits the receiver?
Herman
A transcript field with the raw text, a timestamp, a source identifier, sometimes a confidence score. AssemblyAI's output includes word-level confidence scores, so you could flag low-confidence segments before they reach Claude — either reject the task or surface it for human review rather than letting Claude guess at what you meant.
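A sketch of that confidence gate. The `{"text", "confidence"}` shape is an assumption modeled loosely on word-level speech-to-text output; check your transcription tool's actual response schema before relying on it:

```python
def low_confidence_words(words, threshold=0.6):
    """Return the words scored below the confidence threshold.

    `words` is assumed to be a list of {"text": ..., "confidence": ...}
    dicts, roughly the shape word-level STT output takes.
    """
    return [w["text"] for w in words if w["confidence"] < threshold]

def should_hold_for_review(words, threshold=0.6):
    # Any flagged word is enough to route to human review instead of Claude.
    return len(low_confidence_words(words, threshold)) > 0
```

The threshold is a judgment call: too low and garbled commands slip through, too high and you review everything by hand.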
Corn
The tradeoff is basically: how much do you trust the transcription layer versus how much do you want to build defensively around it?
Herman
That answer changes based on what the tasks are. Low-stakes tasks — add a reminder, move a file, draft a message — you can tolerate occasional misinterpretation. High-stakes tasks — execute a deployment, send an email to a client, modify production data — you want more validation in the chain. Which suggests the receiver should have a task classification step, not just "here is the text, run it," but "what category is this, and does this category require confirmation?"
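That classification step can start as something as crude as a keyword gate. The keyword list below is hypothetical and would be tuned to your own task vocabulary:

```python
# Hypothetical high-stakes vocabulary -- tune to the tasks you actually speak.
HIGH_STAKES = ("deploy", "delete", "send", "email", "production", "submit")

def classify(transcript):
    """Route a task: 'confirm' requires explicit approval, 'execute' runs."""
    text = transcript.lower()
    if any(keyword in text for keyword in HIGH_STAKES):
        return "confirm"
    return "execute"
```

Crude keyword matching errs on the side of confirmation, which is the right direction for this failure asymmetry; a misclassified low-stakes task costs a tap, a misclassified high-stakes one costs much more.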
Corn
N8N handles that well because you can build branching logic visually. High-stakes branch gets a confirmation step, low-stakes branch goes straight to Claude CLI.
Herman
Audit trails feel underrated in personal automation. Every payload that hits the receiver should be written to a log before anything else happens — timestamp, raw content, what decision the receiver made. That's your forensic record when a task disappears into the void.
Corn
When the log shows a task arrived but Claude never ran it, what happens then?
Herman
That's where failover becomes non-negotiable. The most common place this breaks is the Claude CLI process itself — a crash, a timeout, a network hiccup, and the task just evaporates. No retry, no notification, nothing.
Corn
Which is fine when the task is "remind me to buy milk" and deeply not fine when it's "submit this pull request."
Herman
The VPS argument lives exactly here. If you're running Claude CLI on your laptop, you're exposed to every reason a laptop isn't running. A VPS is just always on. The task hits the receiver, triggers Claude CLI on the remote machine, and you don't have to think about whether your laptop is awake. Pair that with process supervision — something like systemd watching the Claude CLI process and restarting it if it dies — and you have a real failover setup.
Corn
What's the right retry window?
Herman
Exponential backoff is the standard answer. First retry after thirty seconds, second after two minutes, third after ten, then give up and alert. The alert is important — silent failure is the worst outcome in any automation system.
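Herman's schedule (30 seconds, 2 minutes, 10 minutes, then alert) as a small retry helper. The sleep and alert hooks are injectable so the policy is testable without waiting; in production they stay as `time.sleep` and your notifier:

```python
import time

def run_with_retry(task, delays=(30, 120, 600), sleep=None, alert=print):
    """Try `task`, retrying on the given backoff schedule, alerting on give-up.

    `sleep(seconds)` and `alert(message)` are injectable for testing.
    """
    sleep = time.sleep if sleep is None else sleep
    last_error = None
    for delay in (0,) + tuple(delays):
        if delay:
            sleep(delay)
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure retries
            last_error = exc
    alert(f"task failed after {1 + len(delays)} attempts: {last_error}")
    return None
```

The alert on the final branch is the point: a task that exhausts its retries silently is the worst outcome in the whole pipeline.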
Corn
Rate limiting is the other side of that coin. If you're sending voice notes frequently, you could flood the receiver with tasks faster than Claude can process them.
Herman
Claude's API has rate limits, and even below the hard limits you want to be deliberate about concurrency. The receiver should have a queue with a configurable concurrency ceiling — one or two tasks at a time for a personal setup.
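A sketch of that concurrency ceiling using a fixed worker pool over a queue. A real receiver would run this continuously rather than draining a fixed list, but the shape is the same:

```python
import queue
import threading

def drain_queue(tasks, concurrency=2):
    """Run callables with at most `concurrency` executing at once.

    Tasks wait their turn in the queue instead of each spawning its own
    Claude session the moment a webhook arrives.
    """
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            result = task()
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Setting `concurrency=1` gives strictly serial execution, which is the safest default for a personal setup.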
Corn
That's where this diverges sharply from traditional task management tools. Something like Todoist just stores tasks, it doesn't execute them. Here you're delegating execution to an agent, and the queue management a human does implicitly has to be made explicit in the system.
Herman
Keystroke emulation sits at the edge of that boundary. If Claude can't accomplish something through its normal agentic tools and you've set up keystroke emulation as a fallback, you're giving it the ability to operate your entire desktop. A misheard word or ambiguous instruction gets executed at the operating system level with no easy undo. It should be gated behind explicit high-privilege classification.
Corn
The practical architecture for someone building this seriously: VPS for reliability, process supervision for resilience, retry with backoff for transient failures, a concurrency queue to prevent overload, and keystroke emulation only as a last resort.
Herman
That's the production version. The hobbyist version is a Raspberry Pi under your desk running N8N with one retry configured and a Slack message when something fails. Both are valid depending on what you're automating. The Raspberry Pi is underrated — the hardware cost is negligible, it runs N8N fine, and you get the "always on" property without paying EC2 pricing for a task runner processing maybe ten voice notes a day.
Corn
If someone wanted to actually build this, where do they start?
Herman
Voice app first. Otter.ai is the easiest entry point because it has Zapier integration out of the box. If you want more control over the transcription output, AssemblyAI has a straightforward REST API and the word-level confidence scores are useful for filtering ambiguous input. Then N8N for the receiver — run it locally or on a Raspberry Pi, the webhook node gives you an endpoint in about three clicks, wire it to a function node that passes the transcript to Claude CLI via a shell command. Log the payload first, before anything else executes.
Corn
Authentication on that endpoint from day one, not as an afterthought.
Herman
A shared secret in the header, checked before the payload is processed. It takes five minutes to add and saves you from some very bad days.
Herman
Configure N8N's built-in retry on the Claude CLI node. Three attempts, exponential backoff. Add a notification step at the end of the error branch so you know when something hit the ceiling. That covers ninety percent of real-world failure scenarios without building anything custom.
Corn
Resist the keystroke emulation until you actually need it. Build the simplest version that works, add complexity only when a specific gap demands it.
Herman
Voice to webhook to Claude CLI with logging and retry is useful before you've added a single line of advanced tooling. The graveyard of over-engineered personal automation systems is enormous — full of people who spent three weekends building the perfect architecture and never recorded a single voice note into it.
Corn
Where does this go? Right now you're routing tasks from your voice to an agent on a machine. What does the next version look like?
Herman
The thing I keep thinking about is ambient task capture. Right now the model is deliberate — you pick up your phone, you record a note, you intend to create a task. The interesting frontier is continuous transcription where the system is listening for task-like utterances in normal speech and extracting them without you explicitly switching into task-capture mode. The transcription accuracy is already there. The question is whether the classification layer can reliably distinguish "remind me to call David" from "I was just telling someone I should call David" in conversational context.
Corn
That's a much harder classification problem than anything we've talked about today.
Herman
And it raises the stakes on false positives considerably. A misfire in deliberate mode means one bad task in your queue. A misfire in ambient mode means the system is acting on things you never intended to delegate.
Corn
Which brings the trust question back around. The whole architecture today depends on a human being the intentional trigger point. Remove that and you've changed the contract significantly.
Herman
That's the open question worth sitting with. Not whether the technology can do it — it probably can — but whether you want it to. Automation that waits for your signal is a tool. Automation that infers your intent from ambient context is something closer to a collaborator, and that relationship requires a different kind of trust.
Corn
Good place to leave it. Thanks to Hilbert Flumingtop for producing this one, and Modal for keeping the infrastructure running. This has been My Weird Prompts. If you've got a minute, leave us a review wherever you're listening.
Herman
Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.