#2550: Idempotent Pipelines: Checkpoints, Manifests & Safe Re-Runs

How to design scripts and pipelines so re-running them is safe, even after a crash mid-execution.

Episode Details
Episode ID: MWP-2708
Duration: 27:05
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Idempotent Pipelines: Checkpoints, Manifests & Safe Re-Runs

Most developers know idempotency in theory: run an operation once or a hundred times, get the same result. In practice, strict mathematical idempotency is often impossible or absurdly expensive. The real goal is making pipelines resumable — safe to re-run without breaking things, double-charging APIs, or leaving half-baked states.

Checkpoints Are Not Booleans

The naive checkpoint is a flag file. Step one creates .step1.done, step two checks for it. This works until someone runs the script in a different directory, the temp folder gets cleaned, or — worst case — the flag file is created before the operation actually completes. A checkpoint should be written after the operation succeeds, and ideally contain a checksum or row count that proves completion, not just attempt.
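A minimal Python sketch of a checkpoint-with-receipt, written only after the operation succeeds (the function and file names here are illustrative, not from the episode):

```python
import json
import os

def run_step(step_name, operation, checkpoint_dir="."):
    """Run `operation` unless a verified checkpoint already exists.

    The checkpoint is written only *after* the operation succeeds, and it
    records a row count as a receipt rather than a bare "done" flag.
    """
    path = os.path.join(checkpoint_dir, f".{step_name}.done.json")
    if os.path.exists(path):
        with open(path) as f:
            receipt = json.load(f)
        if receipt.get("status") == "complete":
            return receipt  # already done with proof; skip the work
    rows = operation()          # do the actual work first
    receipt = {"status": "complete", "rows": rows}
    with open(path, "w") as f:  # record completion only after success
        json.dump(receipt, f)
    return receipt
```

If `operation()` raises, no checkpoint is written, so a re-run retries the step instead of trusting a flag that only proves an attempt.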

Manifests: Checkpoints with Receipts

For data pipelines processing files from S3 or similar sources, a manifest file (JSON or database table) records each input file's name, content hash (SHA-256 of the file bytes), timestamp, output rows, and status. Before processing, check the manifest. If that exact content hash exists with status "complete," skip it. Content hashing catches files that were renamed, overwritten, or had modification times touched — things filenames miss entirely.
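A minimal sketch of the content-hash skip, with the manifest as a plain dict keyed by SHA-256 (a real pipeline would persist it; names are illustrative):

```python
import hashlib
import os

def file_sha256(path):
    """SHA-256 of the file bytes -- catches renames, overwrites,
    and touched modification times that filenames miss."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def should_process(path, manifest):
    """Skip a file only if this exact content was completed before."""
    entry = manifest.get(file_sha256(path))
    return not (entry and entry.get("status") == "complete")

def record_complete(path, output_rows, manifest):
    """Write the manifest entry only after processing succeeds."""
    manifest[file_sha256(path)] = {
        "name": os.path.basename(path),
        "rows": output_rows,
        "status": "complete",
    }
```

Because the key is the content hash, a renamed copy of an already-processed file is correctly skipped, while a same-named file with new contents is correctly reprocessed.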

Transactional Writes and Atomic Renames

In shell scripts without database transactions, you can fake atomicity: write output to a temporary location, then atomically rename it into place. On most filesystems, a rename within the same filesystem is atomic. This eliminates race conditions where a downstream process reads a half-written file. Combined with lockfiles for mutual exclusion between concurrent processes, this pattern prevents a whole class of subtle bugs.
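The temp-file-plus-rename pattern, sketched in Python; `os.replace` is an atomic rename on POSIX and on Windows when source and destination are on the same filesystem:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write to a temp file in the target directory, then rename into
    place. Readers at `path` see either the old file or the complete
    new file -- never a half-written one."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes hit disk first
        os.replace(tmp, path)     # atomic publish
    except BaseException:
        os.unlink(tmp)            # never leave the temp file behind
        raise
```

Creating the temp file in the destination directory (not `/tmp`) is what keeps the rename on one filesystem and therefore atomic.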

Deterministic State Checks vs. Bookkeeping

Your checkpoint tells you what you think happened. A state check tells you what actually happened. Before creating a database, check if it exists. Before inserting a user, query by unique key. When your manifest says "step three done" but the table doesn't exist, re-run step three. When the manifest says "not done" but the table exists, update the manifest and move on. This reconciliation is what tools like Ansible do at the module level — and it's worth testing explicitly by running deploy scripts twice and verifying the second run is a no-op.
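A sketch of that reconciliation loop in Python, using SQLite table existence as the ground-truth check (all names are illustrative):

```python
import sqlite3

def reconcile_step(conn, manifest, step, table, run_step_fn):
    """Trust the actual state of the database over the manifest.

    - manifest says done but the table is missing  -> re-run the step
    - manifest says not done but the table exists  -> just fix the manifest
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
        (table,),
    ).fetchone() is not None
    if manifest.get(step) == "done" and not exists:
        run_step_fn(conn)   # bookkeeping lied; redo the work
    elif manifest.get(step) != "done" and exists:
        pass                # work already exists; no-op
    elif manifest.get(step) != "done":
        run_step_fn(conn)   # genuinely not done yet
    manifest[step] = "done" # reconcile bookkeeping with reality
```

Running this twice should make the second call a no-op, which is exactly the "deploy twice and verify" test the section recommends.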

Why This Matters for APIs and Money

When calling paid external APIs, never trust their idempotency claims under load. Build idempotency on your side by hashing request parameters and caching responses. One post-mortem described a batch job with a retry loop that charged $40,000 in API credits over a weekend because the API returned timeouts after actually succeeding on their end. The provider's stance: "our timeout behavior is documented."
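A minimal sketch of client-side idempotency by request hashing; in production the cache would be durable (a table or key-value store), not an in-memory dict, and the API callable here is a stand-in:

```python
import hashlib
import json

class RequestCache:
    """Hash the request parameters and cache the response, so a retry
    after an ambiguous timeout never pays for the same call twice."""

    def __init__(self, call_api):
        self.call_api = call_api   # the real, billable call
        self.cache = {}

    def call(self, params):
        # sort_keys makes the hash stable across dict orderings
        key = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()
        if key in self.cache:
            return self.cache[key]     # free replay, no second charge
        response = self.call_api(params)
        self.cache[key] = response     # store only after success
        return response
```

Note this guards against re-running your own batch job; for timeouts where the provider may have succeeded silently, you also need the response recorded before the retry loop fires, which a durable cache provides.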

The core principles: idempotent operations where feasible, checkpointing with receipts, and deterministic state checks as ground truth. Together, they turn fragile scripts into pipelines you can re-run with confidence.


#2550: Idempotent Pipelines: Checkpoints, Manifests & Safe Re-Runs

Corn
Daniel sent us this one — he's been thinking about idempotency in development workflows. Not the textbook definition, but the practical side: how do you design scripts and pipelines so that re-running them is actually safe, and when something breaks halfway through, you're not left staring at a half-baked state trying to figure out what ran and what didn't. He wants principles, practical patterns, and where this really matters — long data jobs, deploy scripts, batch processing, anything hitting paid external APIs.
Herman
That naive "run from scratch, no memory of what happened" approach is what I see in probably eighty percent of internal tooling scripts. And it works right up until it doesn't, and then it's two in the morning and you're grepping through log files trying to figure out which records already got inserted.
Corn
Before we dive in — quick note, this episode's script is coming from DeepSeek V four Pro, so if anything sounds slightly too coherent, that's why.
Herman
Okay, so let's start with what idempotency actually means in a pipeline context, because the mathematical definition and the practical engineering definition have drifted apart in useful ways.
Corn
The math one being — apply the operation once or a hundred times, you get the same result.
Herman
In math, f of f of x equals f of x. In engineering, we've loosened that to mean "re-running this script won't break anything." It might do redundant work, but the final state is correct and nothing gets double-charged or double-inserted. And that loosening is important, because strict mathematical idempotency is sometimes impossible or absurdly expensive to achieve.
Corn
Give me an example where strict idempotency is the wrong target.
Herman
If your pipeline sends a notification at step seven, a strictly idempotent design would mean re-running step seven doesn't send another email. You'd need to store a record of every email ever sent and check against it. That's a distributed systems problem for a five-line sendmail call. The pragmatic approach is — design the pipeline so that re-running it from the top doesn't re-trigger step seven if it already succeeded. That's not mathematical idempotency, that's checkpointing.
Corn
You're saying the real principle isn't "make every operation idempotent," it's "make the pipeline resumable."
Herman
And that breaks down into three things Daniel mentioned — idempotent operations where feasible, checkpointing so you know what's been done, and deterministic state checks so you can verify what's been done without relying on your own bookkeeping.
Corn
Let's unpack checkpointing first, because I think that's the one people reach for instinctively and then get wrong in subtle ways.
Herman
The classic naive checkpoint is a flag file. Step one creates a dot step one dot done file in a temp directory, step two checks for it before running. It works until someone runs the script in a different working directory, or the temp directory gets cleaned, or the flag file gets created before the step actually completes.
Corn
I've been burned by that last one. Script creates the flag, then the actual operation fails, and now the checkpoint says "done" but the work wasn't done.
Herman
That's the atomicity problem. Your checkpoint write and your operation completion need to be ordered correctly. Write the checkpoint after the operation succeeds, not before. And even better, make the checkpoint contain a checksum or a row count that proves the operation actually completed, not just that it was attempted.
Corn
A checkpoint isn't just a boolean — it's a state record.
Herman
And this is where manifest files come in, which Daniel specifically called out. A manifest is essentially a structured log of what happened — which files were processed, how many records, what the output hash was. It's a checkpoint with receipts.
Corn
Let's make this concrete. Suppose I've got a data pipeline that pulls CSVs from an S3 bucket, transforms them, and loads them into Postgres. Fifty files a day, and the transform step is expensive. What's the manifest pattern look like?
Herman
You maintain a manifest — could be a JSON file, could be a database table — that records, for each input file, the file name, the ETag or content hash, the timestamp it was processed, the number of output rows, and a status. Before processing a file, you check the manifest. If that file hash is already there with status "complete," you skip it. If it's there with status "failed," you clean up any partial output and retry. If it's not there, you process it.
Corn
The content hash is the key, not the filename.
Herman
Files get overwritten, renamed, copied. If you're checking "have I processed sales data dot CSV," you're going to miss that someone dropped a new version into the bucket. The content hash — SHA-256 of the file bytes — tells you whether you've actually processed this exact data before.
Corn
This is the content-hash-based skip Daniel mentioned. I've seen this save people from re-processing terabytes of data because someone touched a file's modification time.
Herman
It's not just for files. The same pattern applies to API calls. If you're calling an external API that charges per request, you can hash the request parameters and store the hash alongside the response. Before making the call, check if you've already got a cached response for that exact parameter hash. It's idempotency at the network boundary.
Corn
That's clever. You're not trusting the API to be idempotent — you're building idempotency on your side.
Herman
You should never trust the API to be idempotent, even if the docs claim it is. Stripe's API has idempotency keys and they're well-implemented, but I've seen plenty of payment processors where the idempotency guarantee degrades under load. When money's involved, you build your own guardrails.
Corn
Speaking of money — Daniel mentioned external APIs with cost as one of the places this matters most. What's the worst-case scenario for getting this wrong?
Herman
I saw a post-mortem a couple years back where a batch job calling a third-party address verification API had a retry loop without idempotency. The API returned a timeout error, but the request had actually succeeded on their side. The batch job retried, got charged again, timeout again, retried again. They ran through about forty thousand dollars in API credits over a weekend before someone noticed the billing spike. The API provider's stance was basically "our timeout behavior is documented, not our problem."
Corn
So the naive "run from scratch every time" approach isn't just slow, it's potentially expensive in ways that don't show up until the bill arrives.
Herman
It's not just money. It's also correctness. If your deploy script isn't idempotent, re-running it might create duplicate resources, clobber configuration, or leave you in a split state where half your infrastructure is at the new version and half at the old.
Corn
Let's talk about deploys, because that's where the idempotency conversation gets really practical for most developers. You're pushing code, running database migrations, updating configuration. What does an idempotent deploy actually look like?
Herman
The gold standard right now is declarative infrastructure as code — Terraform, Pulumi, that family of tools. You don't write "create a server," you write "there should be a server with these properties." The tool figures out whether it needs to create, update, or do nothing. That's idempotency at the architecture level.
Corn
If the tool crashes halfway through?
Herman
Terraform maintains a state file — essentially a manifest — that tracks what it's created and what it hasn't. If it crashes, you re-run it, it reads the state file, figures out what's already been done, and picks up from there. It's not perfect — state file corruption is a real thing — but it's miles ahead of imperative bash scripts that just run commands sequentially with no memory.
Corn
The state file corruption problem is interesting though. You've got this single point of failure that, if it gets out of sync with reality, your idempotency guarantee evaporates.
Herman
And that's where deterministic state checks come in — the third principle Daniel mentioned. Instead of trusting your own bookkeeping, you check the actual state of the world. Before creating a database, check if a database with that name already exists. Before inserting a user record, query for it by unique key. Your checkpoint tells you what you think happened, but the state check tells you what actually happened.
Corn
I like that distinction. The checkpoint is your memory, the state check is ground truth. And when they disagree, you've got a problem, but at least you know you've got a problem.
Herman
You can design your scripts to reconcile that disagreement automatically. If your manifest says "step three done" but the state check says "the database table doesn't exist," you re-run step three. If your manifest says "step three not done" but the table does exist, you update the manifest and move on. This is essentially what Ansible does — it checks the current state, compares it to the desired state, and only makes changes if there's a difference.
Corn
Ansible's approach is interesting because it pushes idempotency down to the module level. Each module is supposed to be idempotent on its own — the "user" module won't create a duplicate user, the "file" module won't change permissions if they're already correct.
Herman
When module authors get that right, it's beautiful. You can re-run an entire playbook and nothing happens on the second run because every module checks current state first. When they get it wrong, you get the kind of bugs where re-running a playbook slowly drifts your configuration because some module isn't truly idempotent.
Corn
There's a trust-but-verify dynamic. You're relying on the tool's idempotency guarantees, but you should also be testing that re-runs are actually no-ops.
Herman
That's a practice I wish more teams adopted — explicitly testing that your deploy scripts are idempotent by running them twice in a row and verifying the second run is a no-op. If the second run shows changes, something's not idempotent.
Corn
Let's shift to the pattern I think is most underused — transactional writes. Daniel mentioned this, and it's one of those things that sounds obvious but almost nobody does in scripting contexts.
Herman
Because it's genuinely harder in a scripting context than in a database context. In a database, you've got transactions — you wrap your inserts in BEGIN and COMMIT, and if something fails, everything rolls back. In a shell script writing to the filesystem, you don't have that.
Corn
You can fake it.
Herman
The pattern is — write your output to a temporary location, do all your work there, and then atomically move or rename the final result into place. On most filesystems, a rename within the same filesystem is atomic. So you never have a half-written file at the expected path. Either the old version is there, or the complete new version is there.
Corn
This matters even more when the consumer of your output is another process. If you're writing directly to the target path and the consumer reads while you're still writing, it gets a partial file.
Herman
I've debugged exactly that bug. A data pipeline that wrote CSV output in a streaming fashion, and a downstream process that watched the directory and picked up files as soon as they appeared. Every few days, the downstream process would get a truncated file because it grabbed it before the write was complete. The fix was writing to a dot tmp file and renaming on completion.
Corn
That's such a simple fix and it eliminates a whole class of race conditions.
Herman
It connects to another pattern Daniel mentioned — lockfiles. If you've got multiple processes that might try to do the same work, you need mutual exclusion. A lockfile says "I'm working on this, don't touch it."
Corn
Lockfiles are deceptively hard to implement correctly though. The classic trap is the check-then-act race condition — you check if the lockfile exists, it doesn't, so you create it. But between the check and the create, another process did the same thing.
Herman
That's why you need atomic lock acquisition. On Unix, you use flock or you create a directory with mkdir, which is atomic. On Windows, you use exclusive file opens. You don't check and then create, you attempt to create and treat failure as "someone else has the lock."
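A minimal Python sketch of the attempt-to-create pattern Herman describes, using `O_CREAT | O_EXCL`, which fails atomically if the file already exists (POSIX `flock` is another option; names are illustrative):

```python
import os

def try_acquire_lock(lock_path):
    """Attempt-to-create instead of check-then-create: two processes
    racing on the same path can never both win, because O_EXCL makes
    creation fail atomically if the file already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                         # someone else holds the lock
    os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
    os.close(fd)
    return True

def release_lock(lock_path):
    os.unlink(lock_path)
```

This sketch omits the expiration and refresh the hosts discuss next; a stale lock from a crashed process would need a timeout check on the file's age before another worker steals it.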
Corn
You need lock expiration, because processes crash and leave stale locks.
Herman
Every lock needs a timeout. And the process holding the lock should refresh it periodically if the work is long-running. If the process dies, the lock expires and something else can take over.
Corn
Let's zoom out for a second. We've been talking about specific patterns, but I want to talk about the mindset shift. The naive approach is "I'm writing a script that does a sequence of things." The idempotent approach is "I'm writing a script that ensures a desired state exists, and it's safe to run it whenever I'm not sure."
Herman
That's exactly the framing. And it changes how you think about error handling. In a sequential script, an error means "something went wrong, abort." In an idempotent script, an error means "this step didn't reach the desired state, but everything before it is fine, so I can fix the problem and re-run."
Corn
Which is a much less stressful way to operate. I've been on call for systems where a failed deploy script at step nine of twelve meant an hour of manual cleanup before you could even attempt a re-run. That's the pain Daniel's describing — the half-finished state where you're guessing what's broken.
Herman
The guessing is the worst part. Without checkpoints or state checks, you're manually reconstructing what happened. Did the migration run? Did the cache clear? Did the load balancer update? You end up running SQL queries and checking timestamps and asking teammates in Slack.
Corn
The Slack part is too real. "Hey, did the Tuesday deploy actually finish? I'm seeing some weird behavior."
Herman
Nobody knows, because the person who ran it went to lunch and the terminal output scrolled off the screen.
Corn
Let's build a practical checklist. If someone's writing a script or pipeline today and they want it to be safely re-runnable, what should they actually do?
Herman
Step one — identify the expensive or dangerous operations. These are your API calls, your database writes, your file generation, your notifications. Anything where doing it twice is bad.
Corn
Expensive in time, money, or correctness risk.
Herman
Step two — for each of those operations, add a guard. Before doing the thing, check if it's already been done. The guard can be a content hash check, a database query, an API status call, whatever makes sense for that operation.
Corn
If the guard says "already done," skip. If it says "not done," proceed.
Herman
Step three — write your output atomically. Temp file plus rename. Or use database transactions. Or use API idempotency keys if the API supports them. The goal is that a failure during the operation doesn't leave a partial result.
Corn
Step four — record completion after success, not before.
Herman
Step five — test your idempotency by running the script twice and verifying the second run does nothing. If it does something, figure out why and fix it.
Corn
I'd add a step six — think about cleanup. If your script creates temporary resources, make sure they get cleaned up even if the script fails. Otherwise re-runs might collide with leftover temp files from the previous attempt.
Herman
Temp file naming should include a unique run identifier so different runs don't step on each other. And you should have a cleanup step that runs regardless of success or failure — trap EXIT in bash, finally blocks in Python, that kind of thing.
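A sketch of Herman's two points in Python, where `finally` plays the role of bash's `trap ... EXIT` (the job callable and directory naming are illustrative):

```python
import os
import shutil
import tempfile
import uuid

def run_with_scratch_dir(job):
    """Give each run its own scratch directory -- a unique run ID in
    the name means concurrent runs and leftovers from crashed runs
    never collide -- and guarantee cleanup on success *and* failure."""
    run_id = uuid.uuid4().hex
    scratch = os.path.join(tempfile.gettempdir(), f"pipeline-{run_id}")
    os.makedirs(scratch)
    try:
        return job(scratch)       # job does all its work in scratch
    finally:
        shutil.rmtree(scratch, ignore_errors=True)  # runs even on crash
```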
Corn
Let's talk about the places where this matters most, because Daniel specifically called those out. Long-running data jobs — if your job runs for six hours and fails at hour five, you don't want to redo four hours of work.
Herman
This is where checkpointing really shines. Every N records, or every M minutes, you write a checkpoint that says "processed up to record X." On restart, you read the checkpoint and resume from X plus one. Apache Spark and Flink do this natively, but you can implement it yourself for simpler pipelines with a SQLite database or even a text file.
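A minimal resume-from-checkpoint sketch for the "processed up to record X" pattern, with the offset written atomically via rename (the record handler and state-file names are illustrative):

```python
import json
import os

def process_records(records, state_path, handle, checkpoint_every=1000):
    """Resume a long job from its last checkpoint. Every N records the
    offset is published atomically; on restart we skip straight past
    already-processed records instead of redoing hours of work."""
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["processed_up_to"]
    for i, record in enumerate(records):
        if i < start:
            continue                      # done in a previous run
        handle(record)
        if (i + 1) % checkpoint_every == 0:
            tmp = state_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"processed_up_to": i + 1}, f)
            os.replace(tmp, state_path)   # checkpoint *after* the work
```

Note that records since the last checkpoint are redone on restart, so `handle` itself should be safe to repeat — which is the checkpoint-interval tuning tradeoff the hosts discuss next.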
Corn
The challenge is that the checkpoint itself has a cost. If you checkpoint after every record, your checkpoint overhead dominates your processing time. If you checkpoint every million records, you lose a lot of work on failure.
Herman
It's a tuning problem. You want the checkpoint interval to be roughly the amount of work you're willing to redo. If redoing ten minutes of work is acceptable, checkpoint every ten minutes. If you need exactly-once semantics and can't afford to redo anything, you need a much more sophisticated approach — probably an event log like Kafka with consumer offsets.
Corn
Deploy scripts — we touched on this, but I want to emphasize the blast radius issue. A non-idempotent deploy script that fails halfway through can leave production in a state that nobody intended and nobody understands.
Herman
The larger the deploy, the worse this gets. If you're deploying to a hundred servers and the script fails after updating fifty of them, you're now running a split version. Idempotency means you can re-run the deploy and it'll update the remaining fifty without breaking the fifty that already updated.
Corn
The other place Daniel mentioned — batch processing. I think batch processing is interesting because it's often treated as less critical than streaming, but the failure modes are worse. A streaming job that fails, you lose a few seconds of data. A batch job that fails, you might lose a day of processing and not notice until the reports don't show up.
Herman
Batch jobs are often scheduled, so if they fail at 3 AM, nobody's watching. They just silently produce incomplete output and you find out the next morning. Idempotent design means the scheduler can just re-run the job and it'll pick up where it left off, no human intervention needed.
Corn
The silent failure is what scares me. At least with a hard crash you know something's wrong. A batch job that half-finishes and exits zero is a nightmare.
Herman
That's why exit codes matter. Your script should exit non-zero if any step didn't reach its desired state. Don't catch exceptions and exit zero unless you're absolutely sure the work is complete. And your scheduler should alert on non-zero exits.
Corn
Let's talk about a counterintuitive point — sometimes "run from scratch every time" is actually the right call.
Herman
I was waiting for this.
Corn
When your input data is small, your processing is fast, and the cost of building idempotency exceeds the cost of just redoing the work. If your entire pipeline runs in thirty seconds and processes a megabyte of data, the engineering time you spend on checkpointing might never pay back.
Herman
And there's a simplicity argument too. An idempotent pipeline has more moving parts — manifest files, state checks, atomic writes. More things that can go wrong in their own special ways. If you can afford to just wipe the output directory and re-run, that's sometimes the more reliable approach.
Corn
The trick is knowing which situation you're in before the pipeline grows to the point where re-running takes hours.
Herman
Pipelines have a way of growing. That thirty-second script becomes a five-minute script becomes a two-hour job, and by the time you realize you need idempotency, you've got years of accumulated complexity to retrofit.
Corn
Maybe the principle is — if there's any chance this pipeline grows, build in at least minimal idempotency from the start. A manifest file is cheap.
Herman
A manifest file and atomic writes. Those two patterns alone cover a huge percentage of the pain. You can add more sophistication later, but those give you a foundation.
Corn
What about the "did this step run" guard pattern Daniel mentioned? I feel like there's a right way and a wrong way to implement that.
Herman
The wrong way is checking for side effects that might have other causes. Like, "if the output file exists, assume the step ran." But maybe the output file exists from a previous run with different inputs, or someone created it manually while debugging.
Corn
You need a guard that's specific to the inputs.
Herman
The guard should incorporate the input hash or the parameters. "Does the output exist for these exact inputs?" Not just "does the output exist?" This is where content-hash-based skips come in — you hash the inputs, and the guard checks whether a result for that hash has already been computed.
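A sketch of that input-keyed guard — pipeline-level memoization, where "does the output exist for these exact inputs?" becomes a single existence check (the transform and cache layout are illustrative):

```python
import hashlib
import os

def memoized_transform(input_bytes, transform, cache_dir):
    """Key the output path by the hash of the exact inputs, so stale
    outputs from other inputs can never satisfy the guard."""
    key = hashlib.sha256(input_bytes).hexdigest()
    out_path = os.path.join(cache_dir, key + ".out")
    if os.path.exists(out_path):
        with open(out_path, "rb") as f:
            return f.read()              # guard hit: skip the transform
    result = transform(input_bytes)
    tmp = out_path + ".tmp"
    with open(tmp, "wb") as f:           # atomic publish, so a crash
        f.write(result)                  # mid-write can't fake a hit
    os.replace(tmp, out_path)
    return result
```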
Corn
Which is essentially memoization at the pipeline level.
Herman
It's function memoization applied to data processing. And it works beautifully for deterministic transforms — same input always produces same output, so you can safely skip if you've already computed it.
Corn
What about non-deterministic steps? If your step calls an API that returns different results each time, or generates a timestamp, the content hash approach breaks down.
Herman
For non-deterministic steps, you fall back to simpler checkpointing — a boolean flag that says "this step ran to completion." You accept that re-running might produce slightly different output, but you ensure it doesn't produce duplicate side effects. The idempotency concern shifts from "same output" to "no double-charges, no duplicate records."
Corn
That's where the idempotency key pattern shines for APIs. You generate a unique key for each logical operation, send it with the request, and the API uses it to deduplicate. Even if the response is different, you only get charged once.
Herman
Stripe's implementation of this is worth studying. You include an idempotency key header, and Stripe stores the response for that key. If you send the same key again, you get the stored response, not a new charge. The keys expire after twenty-four hours, which is a reasonable tradeoff between safety and storage cost.
Corn
Twenty-four hours is interesting — it means you can safely retry a payment for a full day, but you're not asking Stripe to store idempotency keys forever.
Herman
That expiration window is documented, which is crucial. You need to know how long your idempotency guarantee lasts. If you're building your own idempotency layer, document the retention period for your manifest or checkpoint data.
Corn
Let's talk about a failure mode I've seen a few times — the manifest file itself becomes a bottleneck or a corruption risk.
Herman
Single manifest file for a distributed pipeline is a recipe for contention. Multiple workers all trying to read and write the same file, you get race conditions or you serialize everything through a lock and kill your throughput.
Corn
You shard the manifest.
Herman
You shard by some natural partition key — input file name, date, customer ID. Each worker owns its own manifest shard. Or you use a database with row-level locking instead of a file. SQLite actually works surprisingly well for this if your concurrency is moderate — it handles the locking for you and you get transactional guarantees.
Corn
SQLite as a manifest store is underrated. It's a single file, it's portable, it handles concurrent reads, and with WAL mode it handles concurrent writes reasonably well.
Herman
You can query it. Instead of grepping through a JSON file to find out if a particular record was processed, you run a SELECT. For pipelines that process millions of items, that queryability becomes essential.
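A sketch of SQLite as the manifest store the hosts describe — transactional, queryable, and with WAL mode for better concurrent access (schema and function names are illustrative):

```python
import sqlite3

def open_manifest(path):
    """Open (or create) a SQLite-backed manifest."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # friendlier concurrent access
    conn.execute("""CREATE TABLE IF NOT EXISTS manifest (
        content_hash TEXT PRIMARY KEY,
        file_name    TEXT,
        output_rows  INTEGER,
        status       TEXT)""")
    return conn

def already_done(conn, content_hash):
    """The guard: a plain SELECT instead of grepping a JSON file."""
    row = conn.execute(
        "SELECT 1 FROM manifest WHERE content_hash=? AND status='complete'",
        (content_hash,)).fetchone()
    return row is not None

def mark_done(conn, content_hash, file_name, rows):
    with conn:  # transaction: the receipt is written all-or-nothing
        conn.execute(
            "INSERT OR REPLACE INTO manifest VALUES (?, ?, ?, 'complete')",
            (content_hash, file_name, rows))
```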
Corn
We should talk about one more anti-pattern — the script that's "mostly idempotent" but has a few steps that aren't, and nobody remembers which ones.
Herman
That's worse than not being idempotent at all, because it creates a false sense of safety. You re-run the script thinking it's safe, but step four quietly double-inserts a bunch of records.
Corn
I've seen this documented in runbooks. "If the deploy fails, you can safely re-run the script EXCEPT you must manually revert step six first." And step six is buried in a comment on line 247.
Herman
The fix is making the non-idempotent steps loudly non-idempotent. They should fail explicitly on re-run rather than silently doing the wrong thing. Add a guard that says "this step has already run, and re-running it is not safe — abort and tell the human to handle it manually."
Corn
Fail closed, not fail open.
Herman
If you can't make it safe, make it loud.
Corn
Let's bring this back to Daniel's original framing. He contrasted the naive "run from scratch" approach with idempotent design. I think the real insight is that "run from scratch" isn't naive if your workload is small and simple — it's actually the right call. The naive part is assuming your workload will stay small and simple.
Herman
The pain of the half-finished script is universal. Every developer has a story about a script that crashed at step seven of ten and left them hand-editing a database to clean up. The patterns we've been discussing — manifest files, content-hash skips, atomic writes, state checks — they're all about making that pain go away.
Corn
The common thread is: don't trust your memory of what happened, and don't trust that the operation completed just because you asked it to. Check the actual state, record what you've done, and design every step so that doing it twice isn't catastrophic.
Herman
Run your scripts twice. If the second run isn't boring, you've got work to do.
Corn
Alright, I think we've covered the ground. Principles, patterns, where it bites hardest. Daniel, hopefully that gives you a framework.
Herman
Now: Hilbert's daily fun fact.

Hilbert: The national animal of Scotland is the unicorn. It has been since the twelve hundreds, when it appeared on the Scottish royal coat of arms. Scotland is one of the only countries whose national animal does not actually exist.
Corn
That tracks, somehow.
Herman
This has been My Weird Prompts — thanks to our producer Hilbert Flumingtop. You can find every episode at myweirdprompts dot com. We're back next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.