#4080: How Resumable Uploads Actually Work

What happens when your file transfer fails at 92% — and the app just picks up where it left off.

Episode Details

Episode ID: MWP-4259
Published: Jul 3
Duration: 22:50
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: networking fault-tolerance data-integrity

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When a file transfer dies at 92%, the network layer shrugs. TCP sessions evaporate, servers forget you existed, and without application-layer intervention, you're starting from zero. The magic of "resuming" an upload isn't magic at all — it's a carefully engineered protocol stack that remembers what the transport layer can't.

Downloads solved this decades ago with HTTP byte range requests (RFC 7233). A simple "Range: bytes=5000-" header tells the server to stream from byte 5000 onward, and a 206 Partial Content response delivers only the missing data. It's standardized, universal, and trivial to implement.

Uploads are a different beast. HTTP PUT expects a complete body with no native partial upload support. The workaround is chunking: splitting files into pieces, uploading each as separate requests, and tracking success locally. But this is custom application logic — every implementation has different bugs, different retry semantics, and different state management.

Enter TUS (The Upload Server), an open standard that solves upload resumption at the protocol level. The client POSTs to create an upload, gets a unique URL, then sends PATCH requests with an Upload-Offset header. The server tracks a simple integer offset and appends bytes. If the offsets mismatch, the server returns 409 Conflict, and the client asks for the correct position with a HEAD request. Vimeo uses this approach in production, Google Photos uses a similar chunked mechanism, and some CDNs support it.

The real complexity lives client-side. Apps must persist progress to disk — usually SQLite or JSON — and handle atomic writes to avoid corruption during crashes. Tools like rclone add SHA-256 chunk verification, catching silent bit flips that byte offsets miss. The payoff is a fundamentally different reliability model: eventually consistent uploads that fire and forget, turning apps into sync engines rather than dumb pipes.

Daniel sent us this one — he's been thinking about what actually happens when you're uploading a big file over a terrible connection, and somehow the app doesn't just give up. You know, you're at a campsite, one bar of cellular if you hold the phone at exactly the right angle, you're trying to push a four gigabyte video to the cloud, and at ninety-two percent it dies. But instead of starting over from zero, the app just says "resuming" and picks up where it left off. Daniel's question is: what's doing the lifting there, from an engineering standpoint?

It's one of those things where if it works, you don't think about it. If it doesn't, you throw your phone into a lake.

Which is a valid debugging strategy, I've heard.

It's not. But the question gets at something genuinely important right now. Remote work means people are uploading large files from coffee shops with spotty Wi-Fi. Mobile-first apps assume you're going to walk through a tunnel mid-transfer. Cloud-native workflows treat the network as fundamentally unreliable. The tech that makes resumption possible is completely invisible to the user, and it's not one thing — it's a stack of design decisions that most people never need to know exist.

Until they do. And then it feels like magic, which usually means there's a protocol doing something clever that nobody talks about.

And the first thing to understand is that this isn't TCP's job. People assume TCP retransmission handles everything — packet gets lost, TCP resends it, problem solved. But TCP only works within a single session. If the connection drops entirely, the TCP session is gone. The server has no idea what data made it through. You're back to square one unless something at the application layer is keeping track.

The network layer says "I have failed you, good luck," and the application layer has to pick up the pieces.

That's the whole game. And it breaks into two very different problems — download resumption and upload resumption. Downloads are relatively easy. Uploads are where it gets interesting, and where most of the clever engineering lives.

Let's define the problem properly. You've got a large file — say a raw video project, a disk image, a dataset — and you're pushing it across a network that might drop at any moment. Wi-Fi bridges between buildings, cellular in a moving car, satellite with weather interference. These aren't edge cases anymore. They're how a lot of people work.

Streaming video sidesteps this entirely. You buffer a few seconds ahead, if a packet vanishes you just ask for it again before the buffer runs dry, the viewer never notices. But file transfers don't have that luxury. Every single byte has to arrive intact and in order, or the whole thing is useless. A corrupted database backup isn't ninety-seven percent useful. It's zero percent useful.

Streaming is loss-tolerant by design. File transfer is loss-intolerant. And that distinction is the entire reason resumable protocols exist. If you're downloading a file and the connection drops at eighty percent, you don't want to throw away the first eighty percent and start over. That's bandwidth you already paid for, time you already spent, and on a metered cellular connection, possibly money down the drain.

The question becomes: how does the application remember what it already has, and how does the server agree to only send or receive what's missing? That's the resumption handshake, and it's not something the transport layer gives you for free.

This is where people get tripped up by TCP. TCP is brilliant at what it does — guaranteed in-order delivery within a single connection. But a connection drop isn't a packet loss problem. It's a session death problem. The TCP state on both ends evaporates. The server doesn't remember you. The client's send buffer is gone. If you reconnect, you're a brand new stranger as far as the network stack is concerned.

The application layer has to carry the memory. It has to say "we've met before, here's where we left off, let's continue." That's not a network feature. That's software design.

The shape of that design depends entirely on whether you're downloading or uploading. Downloads have had a standard solution for decades. Uploads are the hard problem, and the one Daniel's scenario really hinges on.

Downloads first, because that's the easy one and it sets up why uploads are painful. HTTP has a built-in mechanism called byte range requests, defined in RFC 7233 back in 2014. Here's how it works. The client sends a GET request with a header that says "Range: bytes equals five thousand dash." That means "give me everything from byte five thousand onward." The server responds with a 206 Partial Content status code and streams only the missing bytes.

The client already has the first five thousand bytes sitting on disk, the server knows how to serve a slice, and they just pick up where the TCP session died.

And this is widely supported. Every major web server, every CDN, every object storage service. It's standardized, it's simple, and it works. If you've ever paused a download in a browser and resumed it hours later, you were using byte range requests under the hood.
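
A minimal sketch of the client side of that exchange, assuming the partial download lives in a local file. The function names here are hypothetical, not from any particular library; only the Range and Content-Range header formats come from HTTP itself.

```python
import os
import re


def resume_headers(partial_path):
    """Build request headers that ask only for the bytes we do not have yet."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # "bytes=N-" means "from byte N to the end of the resource".
    # With nothing on disk, send no Range header and fetch the whole file.
    return {"Range": f"bytes={have}-"} if have else {}


def parse_content_range(value):
    """Parse the Content-Range of a 206 response, e.g. 'bytes 5000-9999/10000'.

    Returns (first_byte, last_byte, total_size); total_size is None when the
    server reports it as '*'.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)
```

A careful client would also confirm that the first byte reported in Content-Range equals the size of the partial file before appending, since a server that ignores Range replies with 200 and the full body.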

Which makes upload resumption sound like it should be the same thing in reverse. But it's not.

Not even close. HTTP PUT has no native concept of partial uploads. The server expects a complete body. There's no standard header for "I already sent the first three megabytes, just take the rest." The server doesn't maintain session state between requests. When the connection drops, the server has no way of knowing how many bytes it received before the interruption. It might have a partial file sitting in a buffer somewhere, but the protocol gives it no mechanism to ask the client "where were we?"

The client has to become the state manager. It has to know what it sent, what the server acknowledged, and what still needs to go.

That's where things get custom and messy. The workaround is chunking. The client splits the file into pieces — say five megabyte chunks — uploads each one as a separate HTTP request, and tracks which ones succeeded. If the connection drops, it only retries the failed chunks. But this is all application logic. You're building a mini protocol on top of HTTP, with your own tracking database, your own retry semantics, your own error handling. Every app that does this has a slightly different implementation, and they all have bugs.
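
The bookkeeping behind that workaround can be sketched in a few lines. This is an illustration only: `plan_chunks` and `pending_chunks` are hypothetical names, and the five megabyte default just mirrors the figure mentioned above.

```python
def plan_chunks(file_size, chunk_size=5 * 1024 * 1024):
    """Split a file of file_size bytes into (index, start, end) ranges.

    `end` is exclusive, so a chunk covers bytes start..end-1.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (index, start, min(start + chunk_size, file_size))
        for index, start in enumerate(range(0, file_size, chunk_size))
    ]


def pending_chunks(plan, completed):
    """Return the chunks whose index is not in the set of completed indexes."""
    return [chunk for chunk in plan if chunk[0] not in completed]
```

After a dropped connection, the client would upload only what `pending_chunks` returns, which is exactly the state that has to survive a crash.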

Which is the kind of problem that eventually makes someone say "we should standardize this before we all go insane."

That someone was the team behind TUS. T-U-S, which stands for The Upload Server, though everyone just calls it TUS. It's an open standard, version one point zero, and it solves upload resumption cleanly at the protocol level. Here's the flow. The client sends a POST to create a new upload. The server returns a unique URL just for that upload — something like slash files slash a seven f three b nine. No data has been sent yet. Then the client sends PATCH requests to that URL with an Upload-Offset header saying "I'm starting at byte zero" or "I'm resuming at byte fifteen million." The server looks at what it already has, verifies the offset matches, and accepts only the new data. Each successful PATCH returns a 204 No Content with the updated offset.
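
As a rough sketch of what those two requests carry, here are the headers a client might build for the creation POST and for each PATCH. The header names and the `application/offset+octet-stream` content type come from the tus 1.0 protocol; the helper function names are assumptions made for this example, and no request is actually sent.

```python
TUS_VERSION = "1.0.0"


def creation_headers(total_size):
    """Headers for the POST that creates an upload; no file data is sent yet."""
    return {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Length": str(total_size),
        "Content-Length": "0",
    }


def patch_headers(offset, chunk):
    """Headers for a PATCH that appends `chunk` starting at byte `offset`."""
    return {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Offset": str(offset),
        "Content-Type": "application/offset+octet-stream",
        "Content-Length": str(len(chunk)),
    }
```

The upload URL itself would come from the Location header of the server's response to the creation request.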

The URL is the session. The server doesn't need to remember you — it just needs to know what's at that endpoint.

And the Upload-Offset header is the handshake. The client says "I believe you have this many bytes, here's the next batch." If the offset is wrong, maybe because the server crashed and lost some data, the server responds with a 409 Conflict and leaves the upload untouched. The client then sends a HEAD request to the upload URL, reads the Upload-Offset the server actually has, and resends from there. It's a negotiation, not an assumption.
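
The server-side rule is small enough to model in memory. The sketch below is not a real tus server: `UploadResource` is a hypothetical class, storage is a bytearray, and checks such as rejecting data beyond the declared length are left out. It only illustrates the offset comparison, with the client recovering the current offset through HEAD as in tus 1.0.

```python
class UploadResource:
    """In-memory stand-in for one upload URL."""

    def __init__(self, length):
        self.length = length      # total size declared at creation
        self.data = bytearray()   # bytes accepted so far

    @property
    def offset(self):
        return len(self.data)

    def head(self):
        """What a HEAD request reveals: how many bytes the server holds."""
        return 200, {
            "Upload-Offset": str(self.offset),
            "Upload-Length": str(self.length),
        }

    def patch(self, claimed_offset, chunk):
        """Append `chunk` only if the client's offset matches ours."""
        if claimed_offset != self.offset:
            # Mismatch: reject without modifying the stored data.
            return 409, {}
        self.data.extend(chunk)
        return 204, {"Upload-Offset": str(self.offset)}
```

A resuming client would call `head()` first, then continue with `patch()` from the offset it was told.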

This is actually deployed? This isn't some RFC that three people implemented?

Vimeo uses it for video uploads. Google Photos uses a TUS-like chunked mechanism for large videos from mobile devices. Major CDNs support it. There are client libraries in over twenty languages. It's real infrastructure. And the elegance is that the server implementation is straightforward — you're basically appending bytes to a file and tracking an integer offset. The complexity lives in the client, where it belongs, because the client is the one with the flaky connection.

Which brings us to what the client actually has to do to make this reliable. Because the protocol handles the client-server conversation, but the app still has to survive its own crashes.

This is the part nobody thinks about. The app needs to persist upload progress locally. If you're uploading a four gigabyte video and the app crashes at ninety percent, it has to wake up and know exactly where it left off. That means writing state to disk — usually SQLite or a simple JSON file — every time a chunk completes.
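
One way that local bookkeeping might look with SQLite, using Python's built-in `sqlite3` module. The table layout and function names are assumptions for illustration; a real app would pick its own schema.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the progress database. Pass a file path to persist."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS uploads ("
        " file_path TEXT PRIMARY KEY,"
        " upload_url TEXT NOT NULL,"
        " bytes_acked INTEGER NOT NULL)"
    )
    return db


def record_progress(db, file_path, upload_url, bytes_acked):
    """Store the latest acknowledged offset for a file."""
    with db:  # commits on success, rolls back if the statement raises
        db.execute(
            "INSERT OR REPLACE INTO uploads (file_path, upload_url, bytes_acked)"
            " VALUES (?, ?, ?)",
            (file_path, upload_url, bytes_acked),
        )


def resume_point(db, file_path):
    """Return (upload_url, bytes_acked), or None if no upload was started."""
    return db.execute(
        "SELECT upload_url, bytes_acked FROM uploads WHERE file_path = ?",
        (file_path,),
    ).fetchone()
```

On restart the app would call `resume_point` for each unfinished file and, ideally, still confirm the offset with the server before sending more data.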

That's a tradeoff. Write too often and you drain the battery on a phone. Write too rarely and a crash loses more progress.

There's also the atomicity problem. If you're updating a progress file and the app crashes mid-write, you get a corrupted state file and the upload might restart from zero even though the server has most of the data. The fix is atomic writes — write to a temp file, then rename it over the real one. But that adds a filesystem sync operation, which on mobile flash storage can be surprisingly slow.
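
The write-to-temp-then-rename pattern can be sketched like this for a JSON state file. The file layout and function names are hypothetical; `os.replace` and `os.fsync` are the standard library calls doing the real work.

```python
import json
import os
import tempfile


def save_state_atomically(path, state):
    """Write `state` as JSON so readers see the old file or the new one, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".upload-state-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # the sync step that costs time on flash storage
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path):
    """Return the saved state, or None if nothing has been saved yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

Calling `save_state_atomically` after every chunk is the safest setting; batching several chunks per write is the battery-friendly end of the tradeoff described above.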

You're trading off reliability against latency and battery, and the right answer depends on whether your user is on Wi-Fi at a desk or cellular in a moving car. The protocol can't decide that for you.

This is where tools like rclone get interesting. rclone does chunked transfers with retry logic, and it tracks SHA-256 hashes of each chunk. So when it resumes, it doesn't just trust the byte offset — it can verify that the data the server already has is actually correct. No silent corruption from a network glitch that flipped a bit mid-transfer.

Which is the kind of paranoia you want in something that's syncing your backups. Trust but verify, except the trust part is optional.

That verification layer becomes critical when you zoom out and think about what resumable uploads actually enable for app design. It's not just a convenience feature. It changes the reliability calculus entirely. You can design for eventually consistent uploads — the user taps "share," walks into a dead zone, and three hours later the file just shows up in the cloud. The app doesn't need to block on a spinner. It can fire and forget.

Which is a fundamentally different mental model. Instead of "I am transferring a file right now and must babysit this progress bar," it becomes "I have expressed an intent to upload, and the system will make it true eventually." Like sending a letter instead of making a phone call.

That's not just a UX nicety. It means you can batch uploads, queue them by priority, defer large files to Wi-Fi only. The app becomes a sync engine, not a dumb pipe. But that only works if the resumption mechanism is bulletproof, because the user is no longer watching. If it silently corrupts, they might not discover it for weeks.
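
To make the "sync engine" idea concrete, here is a small sketch of a queue that orders uploads by priority and holds large files back until Wi-Fi is available. Everything in it is an assumption for illustration: the class name, the 100 megabyte threshold, and the convention that a lower number means higher priority.

```python
import heapq
import itertools


class UploadQueue:
    """Priority queue of pending uploads with a Wi-Fi-only size threshold."""

    def __init__(self, wifi_only_above=100 * 1024 * 1024):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self.wifi_only_above = wifi_only_above

    def add(self, path, size, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), path, size))

    def next_upload(self, on_wifi):
        """Pop the best eligible upload, or return None if none is eligible."""
        deferred, chosen = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            if item[3] > self.wifi_only_above and not on_wifi:
                deferred.append(item)  # too big for cellular, keep for later
                continue
            chosen = item[2]
            break
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen
```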

Which brings us back to Daniel's actual scenario — the NAS over Wi-Fi with an Ethernet bridge. Because that's where the absence of resumption support stops being theoretical and starts being a daily headache.

SMB and NFS were designed for reliable local networks. They assume the connection stays up. If your Wi-Fi bridge drops for two seconds — which it will, especially if it's hopping between access points or dealing with interference — the SMB session dies. The file copy fails. You get a cryptic error, a partial file, and you have to start over.

If it's a fifty gigabyte media archive, starting over is not a minor inconvenience. It's a "reconsider your life choices" moment.

The fix is rsync with two flags: dash dash partial and dash dash append-verify. Partial tells rsync to keep the incomplete file instead of deleting it. Append-verify tells it to resume from the last complete block, but also checksum what's already there to make sure it's not corrupted. It's not as elegant as TUS — it's a file-level tool, not a protocol — but it works over SSH and it's been doing this for decades.
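
For anyone scripting this, the command can be assembled and launched from Python. This is a sketch: the paths in the usage note are placeholders, `build_rsync_command` is a hypothetical helper, and it assumes an rsync recent enough to support `--append-verify` is installed on both ends.

```python
import subprocess


def build_rsync_command(source, destination):
    """Assemble an rsync invocation that keeps and resumes partial files."""
    return [
        "rsync",
        "--partial",        # keep the incomplete file if the transfer dies
        "--append-verify",  # resume by appending, then verify the whole file
        "-e", "ssh",        # run the transfer over SSH
        source,
        destination,
    ]


def run_rsync(source, destination):
    """Run rsync once and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_command(source, destination)).returncode
```

Something like `run_rsync("archive.tar", "nas:/volume1/backups/")` could then be repeated until it returns 0, with each run continuing from the partial file the previous one left behind.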

Rsync is basically the duct tape of resumable transfers. Not pretty, but it holds.

The more modern approach is a cloud sync layer. Synology Drive, for example, implements chunked resumption at the application level. It doesn't care that the underlying SMB connection is flaky, because it's not using SMB for the actual transfer — it's using its own protocol over HTTP or HTTPS, with chunk tracking and retry logic built in. The NAS still sees the file, but the sync engine handles the unreliable part.

You're layering reliability on top of unreliability. The Wi-Fi bridge drops packets, the sync engine shrugs and retries the chunk. The user just sees files appearing.
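
The "shrugs and retries" behavior is commonly built as a retry loop with exponential backoff and jitter. The sketch below is generic, not Synology's implementation: `send_chunk` stands for whatever function performs one chunk upload and is assumed to raise `ConnectionError` when the link drops.

```python
import random
import time


def upload_with_retries(send_chunk, chunk, max_attempts=6,
                        base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call send_chunk(chunk), retrying on ConnectionError with backoff."""
    for attempt in range(max_attempts):
        try:
            return send_chunk(chunk)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            # Double the ceiling on each failure, capped at max_delay, and
            # wait a random fraction of it so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the waiting can be replaced in tests.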

That pattern — application-layer resilience over an unreliable transport — is exactly what mobile apps have been doing for years. But mobile developers face a choice. You can roll your own chunked upload logic. It's flexible, you control everything, but it's bug-prone and you're now maintaining a mini protocol forever. Or you adopt TUS, which is standardized and has client libraries in over twenty languages, but it requires server support.

The trend seems to be toward TUS for anything new.

Because the hidden cost of rolling your own isn't the initial implementation. It's the edge cases. What happens when the server restarts mid-upload and loses its in-memory offset? What happens when the client's clock is wrong and it miscomputes which chunks to retry? What happens when a chunk boundary falls in the middle of a multi-byte UTF-8 sequence in a text file and some layer tries to decode each chunk on its own? TUS and its client libraries have already worked through most of these. You don't want to rediscover them in production at three in the morning.
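
The UTF-8 case is easy to reproduce. In this small demonstration (helper names are hypothetical), a chunk that ends partway through a multi-byte character cannot be decoded on its own, yet the raw bytes reassemble perfectly, which is why chunked uploads should treat file contents as opaque bytes.

```python
def split_bytes(payload, chunk_size):
    """Cut a byte string into fixed-size chunks, ignoring character boundaries."""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def decodes_alone(chunk):
    """Report whether a chunk is valid UTF-8 when decoded in isolation."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "naïve café"
chunks = split_bytes(text.encode("utf-8"), 3)
# The first chunk ends with the first byte of the two-byte "ï".
first_chunk_ok = decodes_alone(chunks[0])
# Joining the raw bytes before decoding restores the original text.
reassembled = b"".join(chunks).decode("utf-8")
```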

That's before you even get to integrity verification, which is the part most custom implementations skip entirely.

They shouldn't. Here's the scenario. Your upload gets interrupted at sixty percent. The server has a partial file. You reconnect and resume. But what if a bit flipped in the partial data during the interruption? Maybe a cosmic ray, maybe a memory error, maybe the server's disk controller had a bad day. The byte offset says everything is fine, but the data is silently corrupted. You upload the remaining forty percent, and now you have a complete file that's wrong.

Which is worse than a failed upload, because at least a failure tells you something went wrong.

This is why Backblaze B2 uses SHA-1 checksums per chunk. When you resume an upload, the server can verify that the partial data it already has matches the checksum the client originally sent. No re-hashing the entire file — just the chunk in question. rclone takes it further with SHA-256 per chunk, which is overkill for most use cases but exactly what you want for backup integrity.
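
A sketch of per-chunk verification on resume, using `hashlib` from the standard library. It is not the Backblaze or rclone code: the function names are made up, and it assumes the client kept the list of digests it computed before the upload started.

```python
import hashlib


def chunk_digests(data, chunk_size, algorithm="sha1"):
    """Hash each chunk of `data` separately and return the hex digests."""
    return [
        hashlib.new(algorithm, data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def first_bad_chunk(expected, stored_data, chunk_size, algorithm="sha1"):
    """Return the index of the first complete stored chunk that fails its checksum.

    Only chunks that are fully present are checked; a trailing partial chunk
    is ignored. Returns None when every complete chunk matches.
    """
    complete = len(stored_data) // chunk_size
    actual = chunk_digests(stored_data[:complete * chunk_size], chunk_size, algorithm)
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index
    return None
```

If `first_bad_chunk` returns an index, the upload would resume from the start of that chunk instead of from the byte offset the server reports.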

The checksum is the proof that the bytes you think you have are the bytes you actually have. Without it, you're just hoping.

Hope is not an engineering strategy. The interesting contrast here is Dropbox. Their desktop client doesn't use HTTP-based resumption at all. They built a custom binary protocol with block-level sync — it splits files into four megabyte blocks, hashes each one, and only transfers blocks that changed. It's optimized for LAN performance, where bandwidth between devices on the same network is essentially free. Totally different design philosophy from TUS, which assumes the network is the bottleneck.
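
The general idea of block-level sync can be sketched as follows. This is not Dropbox's protocol, which is proprietary; it only illustrates hashing fixed-size blocks and sending the ones whose hash changed. The function names are hypothetical and the four megabyte default mirrors the figure mentioned above.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_send(old_hashes, new_data, block_size=BLOCK_SIZE):
    """Indexes of blocks in `new_data` that are new or differ from the old version."""
    new_hashes = block_hashes(new_data, block_size)
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]
```

Fixed-size blocks keep the sketch short, but an insertion near the start of a file shifts every later block, so every block after it would be resent.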

Different problem, different solution. Dropbox is optimizing for "this file changed slightly, sync the delta." TUS is optimizing for "this file is huge and the connection is terrible, just get it there eventually."

That's the broader insight. The protocol you choose shapes what your app can promise users. If you pick TUS, you're promising eventual delivery over hostile networks. If you pick something like Dropbox's block sync, you're promising fast deltas over reliable LANs. There's no universal best answer — just the one that matches what your users actually need.

Which loops back to Daniel's NAS. If he's pushing files over a Wi-Fi bridge, he doesn't need block-level delta sync. He needs eventual delivery over a hostile network. That's a TUS-shaped problem, or an rsync-shaped problem, not an SMB-shaped problem.

If you're building something new, the takeaway is pretty clear. Don't roll your own chunked upload. Use TUS. It's at tus dot io, version one point zero, client libraries in over twenty languages, and it's already deployed at scale by Vimeo, with Google Photos using a similar chunked mechanism. Most of the edge cases have been solved. You get to sleep.

If you're not building something new — if you're Daniel with a NAS and a flaky Wi-Fi bridge — the fix is rsync with dash dash partial and dash dash append-verify instead of dragging files through SMB and praying.

SMB assumes a reliable link. Your Wi-Fi bridge is not a reliable link. Rsync over SSH doesn't care — it'll resume from the last intact block and checksum what's already there. Or if you want something that runs itself, set up Synology Drive or NextCloud. They implement chunked resumption at the application layer, so the user never sees the underlying network flakiness.

For everyone in between — the listener who just wants to know if their stuff actually works — there's a simple test. Start a large upload on whatever cloud app you use. Kill the connection halfway through. Bring it back. Time how long the resume takes, and check whether the file is intact at the end. If it restarts from zero, your app doesn't support resumption. If it picks up where it left off and the file is fine, you're good.
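
For the "check whether the file is intact" step, one option is to compare a hash of the original file with a hash of a copy downloaded back from the service. A minimal sketch, with a hypothetical function name and file names:

```python
import hashlib


def file_sha256(path, buffer_size=1024 * 1024):
    """Hash a file in fixed-size reads so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(buffer_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

If `file_sha256("original.mov")` equals `file_sha256("downloaded-copy.mov")`, the resumed upload arrived intact.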

For self-hosted setups, check whether your download endpoint honors HTTP Range requests and whether your upload endpoint speaks TUS or another resumable protocol. If it doesn't, and you're pushing large files over unreliable connections, you're one dropped connection away from frustration.

The invisible engineering turns out to be pretty visible once you know where to look.

There's an open question here that I think about a lot. HTTP/3 and QUIC are supposed to fix connection migration at the transport layer. If your Wi-Fi drops and you switch to cellular, QUIC can theoretically keep the session alive without the application even noticing. So does that make TUS obsolete?

I'd argue no, and here's why. QUIC handles connection migration within a session. But if your phone dies entirely, or you close the app and reopen it three hours later, that session is gone. QUIC can't resurrect a dead session any more than TCP can. TUS works across sessions, across reboots, across days. It's application-level state, not transport-level state.

QUIC gives you resilience within a single session. TUS gives you resilience across sessions. They solve different problems, and I think they'll coexist. The transport layer gets better at not dropping, and the application layer gets better at recovering when it does.

The other piece is edge computing. You've got IoT devices — environmental sensors, agricultural monitors, wildlife cameras — running on solar panels and intermittent satellite links. They wake up, take a reading, try to upload, and might not get a connection for another six hours. Resumable uploads aren't a nice-to-have there. They're the only way the system works at all.

WebTransport is the protocol I'm watching for that space. It's built on QUIC, it supports unreliable and reliable streams in parallel, and it could enable a new class of resumable transfer that's more efficient than TUS for certain workloads. But it's still early. TUS is here now, it's proven, and it solves the problem Daniel is actually dealing with.

The answer to Daniel's question — what's doing the lifting — turns out to be a stack. HTTP range requests for downloads, TUS or chunked uploads for the hard direction, local state persistence for crash survival, and checksums so you're not just trusting the offset. None of it is magic. It's just careful engineering that most people never need to know exists.

Until they do. And now they know.

Now: Hilbert's daily fun fact.

Hilbert: The dual number in Slovene was once thought to have been documented first by a linguist working in Bhutan in the early nineteen hundreds, but this was a cataloguing error — the manuscript was actually from a Slovenian dialect study in Austria, misfiled under Bhutan in a Vienna archive and only corrected in the nineteen eighties.

That's a very specific error to make.

Misfiled under Bhutan. Happens to the best of us.

Something to watch in the protocol space — WebTransport, and whether it reshapes how we think about resumable transfers. For now, if you're pushing big files over bad connections, the tools exist. You just have to know which layer to fix.

Thanks as always to our producer Hilbert Flumingtop. This has been My Weird Prompts. If you enjoyed this, leave us a review wherever you listen — it helps other people find the show.

We're back next week.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
