#3995: Can Your Phone Upscale Photos Without the Cloud?

Local upscaling on phones is here — but which approach actually works?

Episode Details

Episode ID: MWP-4174
Published: Jun 30
Duration: 21:40
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: edge-computing image-generation gpu-acceleration

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

A listener asks whether he can run a local upscaling model on his OnePlus phone, no cloud involved. The question turns out to be a perfect test case for a much bigger debate: how should AI deploy on edge devices? Two philosophies are competing. One takes a general model like Real-ESRGAN and quantizes it down to INT8 so it runs on whatever NPU a phone happens to have. The other builds a model from scratch for a specific chip, exploiting hardware features like Qualcomm's dedicated AI upscaler block inside the ISP pipeline or Apple's ANE fusion instructions.

Both approaches are shipping today. A quantized Real-ESRGAN on a OnePlus 13 through Qualcomm's SNPE SDK takes about 200 milliseconds per 1080p frame — usable for photos, pointless for video. Apple's three-megabyte Core ML upscaler on the A18 Neural Engine runs in 30 milliseconds. The specialized ISP upscaler on Snapdragon 8 Gen 4 does sub-10ms per frame at 60 fps, but only on camera pipeline input — not arbitrary gallery images.

The catch is that quantization isn't lossless for pixel-generation tasks. Upscaling models are sensitive to banding and texture smoothing because the human eye notices exactly where integer math breaks down. Meanwhile, the specialized path locks you into hardware you can't update. The industry is converging on runtimes like ONNX with QNN backend that support multiple NPUs from a single export, but chip vendors keep building proprietary features like Samsung's weight streaming that generic models can't exploit. The edge AI ecosystem hasn't decided whether it wants to be Esperanto or a collection of native languages.

Daniel sent us this one — he's been thinking about edge AI, specifically upscaling. He's got a OnePlus phone and he's wondering: can he run a local upscaling model on it, right there on the device, no cloud involved? He's not expecting desktop-grade performance, but he's asking the bigger question underneath that. We've seen two approaches to getting AI onto devices — either you take one big model and quantize it down to run on less powerful hardware, or you build something from scratch specifically for a particular chip. Which way is this whole thing going?

The timing on this is actually perfect, because the hardware just crossed a threshold that makes this a real question instead of a hypothetical. The Snapdragon 8 Gen 4 and MediaTek's Dimensity 9400 are both shipping NPUs that push past forty trillion operations per second at INT8. That's enough compute to do meaningful upscaling locally. The silicon is ready.

The software isn't.

And that gap is the whole story. You've got this absurdly capable neural processing hardware sitting in millions of phones, and the question of what actually runs on it efficiently is still wide open. Do you take Real-ESRGAN, quantize it down to INT8, and hope it runs on whatever NPU the phone happens to have? Or do you build something bespoke for Qualcomm's Hexagon, or Apple's Neural Engine, and accept that it won't work anywhere else?

The phone's got the muscle, but nobody's quite figured out the choreography.

That's it. And the user's question about upscaling a photo on a train — that's the perfect test case. It's not real-time video, it's not mission-critical, but it's the kind of thing where you'd actually notice whether it works or not. A blurry whiteboard snap that becomes readable. That's the promise. And right now, whether you can do that depends entirely on which deployment strategy wins out.

Let's make that train scenario concrete, because I think it helps. Daniel's on a train, he's got spotty cell service, he's just taken a photo of a schedule board at the station that came out unreadable. He wants to upscale it right there, no cloud, no waiting for a cell tower. What actually happens when he taps that button?

Right now, it depends on which app he's in. If he's using the stock camera app or the gallery app that shipped with the phone, and that phone is a flagship from the last twelve months, he probably gets his readable schedule board in under a third of a second. The specialized hardware path kicks in. But if he's in a third-party app — say he's using Snapseed or some indie photo editor — that same upscale might take a full second, or even two. And on a mid-range phone from two years ago, it might take five seconds or just fail outright. Same task, same user intent, completely different experience.

What Daniel's really asking, whether he framed it this way or not, is whether the edge AI world converges or splinters. We've got two philosophies that are fundamentally incompatible in how they approach the problem. One says: build one great model, then squeeze it down to fit wherever it lands. The other says: every chip is a unique snowflake, so build a model that marries that specific silicon.

Upscaling turns out to be a surprisingly good way to test which philosophy actually works. It's not a chatbot where you can hide latency behind a typing animation. You either get your sharper image in under a second or you don't.

And the stakes here go way beyond whether your vacation photos look crisper. If the industry converges on something like ONNX Runtime with Qualcomm's QNN backend, you train once and deploy everywhere — Snapdragon, MediaTek, Exynos. Developers love that. But if every chip vendor insists on their own proprietary stack, we end up with an ecosystem where your upscaling app works great on a Galaxy but crawls on a OnePlus, or vice versa.

The upscaling question is a proxy for whether the whole edge AI thing becomes a functional ecosystem or a patchwork of walled gardens.

That's it. And the thing is, both approaches are shipping right now. You can already run a quantized Real-ESRGAN on a OnePlus 13 through Qualcomm's SNPE SDK, and you can already use Apple's three-megabyte Core ML upscaler that was built from scratch for the A18 Neural Engine. They both produce a sharper image. But the paths that got them there, and what those paths mean for the next five years, are completely different stories.

We've got two shipping products, two totally different philosophies, and they both kind of work. How do we even start to pick a winner here?

To understand which of these paths actually makes sense, we need to get into the weeds of what quantization actually does to an upscaling model. Because it's not the same as quantizing a language model, and the differences are where things get interesting.

I'm ready to be in the weeds.

Take Real-ESRGAN, which is a convolutional neural network — layers of filters that progressively reconstruct detail. In its full FP32 form, every weight is a thirty-two-bit floating point number. The model's maybe sixty-five megabytes, and running it means doing floating-point math across every layer. Mobile GPUs can handle that, but they're not optimized for it. NPUs, on the other hand, are designed to crunch integer math — INT8, INT4 — extremely fast. So you quantize: you map those thirty-two-bit weights down to eight-bit integers, which shrinks the model to about a quarter of its size and lets the NPU do the heavy lifting.
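The mapping described here can be sketched in a few lines. This is a minimal illustration, assuming symmetric per-tensor quantization (one scale for the whole tensor); real toolchains often use per-channel scales and calibration data, and the weight values below are made up for the example.

```python
import struct

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

# Hypothetical weights, chosen only for illustration.
weights = [0.8, -0.31, 0.002, 0.4567, -0.8]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage cost: 4 bytes per FP32 weight versus 1 byte per INT8 weight.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, fp32_bytes, int8_bytes, max_error)
```

The INT8 copy takes a quarter of the bytes, and the small weight 0.002 rounds to zero, which is the kind of information loss the hosts go on to discuss.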

The catch is that mapping isn't lossless.

With language models, you can often get away with aggressive quantization because the output is probabilistic — a slightly fuzzier weight distribution still produces a coherent sentence. But upscaling is different. You're generating pixels. If an intermediate feature map has a high dynamic range — say, a bright edge next to a dark shadow — and you quantize the activations too aggressively, you get banding. Smooth gradients turn into visible steps.
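A small sketch of why a high dynamic range causes banding, assuming a single unsigned 8-bit scale shared by the whole activation tensor. The gradient and the bright-edge value of 50.0 are invented numbers, not measurements from any real model.

```python
def quantize_uint8(values, max_value):
    """Map values in [0, max_value] onto 256 integer levels with one shared scale."""
    scale = max_value / 255.0
    return [min(255, max(0, round(v / scale))) for v in values]

# A smooth shadow gradient from 0.0 to 1.0 in 256 steps.
gradient = [i / 255.0 for i in range(256)]

# Hypothetical bright edge elsewhere in the same tensor; it sets the range.
bright_edge = 50.0

# If the scale only has to cover the gradient, every step keeps its own level.
levels_narrow = len(set(quantize_uint8(gradient, max_value=1.0)))

# If the scale has to cover the bright edge too, the gradient shares far fewer levels.
levels_wide = len(set(quantize_uint8(gradient, max_value=bright_edge)))

print(levels_narrow, levels_wide)
```

When the range is stretched to include the bright edge, the 256 distinct shades of the gradient collapse to a handful of integer levels, which is what shows up on screen as visible steps.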

The thing your eye is most sensitive to is exactly where the quantization breaks first.

That's the problem. And this is why the MediaTek research from last year was so revealing. They took ESRGAN, quantized it to INT8, and ran it on the Dimensity 9300's NPU. They got a four-point-two-times speedup over GPU inference. That's real. But they also measured a zero-point-seven decibel drop in peak signal-to-noise ratio, and on faces and text, that translated to visible texture smoothing. Not catastrophic, but noticeable if you looked.
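For readers who want to see what a decibel figure like that means, here is the standard PSNR formula as a sketch, plus the arithmetic that converts a PSNR drop into an error increase. The pixel values are toy data; only the 0.7 dB figure comes from the episode.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in decibels between two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_psnr_drop(drop_db):
    """Factor by which mean squared error grows for a given PSNR drop."""
    return 10.0 ** (drop_db / 10.0)

# Toy example: every pixel is off by one level, so the MSE is 1.
reference = [100, 100, 100, 100]
degraded = [101, 99, 101, 99]
print(round(psnr(reference, degraded), 2))

# The 0.7 dB drop quoted above, expressed as a growth in mean squared error.
print(round(mse_ratio_for_psnr_drop(0.7), 3))
```

Because PSNR is logarithmic, a 0.7 dB drop corresponds to roughly 17 to 18 percent more mean squared error. That average can hide larger local errors on fine detail such as text and skin texture.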

You get your upscaled whiteboard photo faster, but the text on it is slightly mushier. Which kind of defeats the purpose.

And that's where the specialized approach takes a completely different route. Qualcomm's Snapdragon 8 Gen 4 doesn't just have a general NPU — it has a dedicated AI upscaler block inside the image signal processor pipeline. This isn't a programmable neural processor. It's a fixed-function hardware accelerator that runs one specific proprietary model, trained from scratch for the Hexagon architecture.

Wait — so it's not even a quantized version of something else. It's a model that was born on that silicon.

Born on it, married to it, can't leave it. Under ten milliseconds per ten-eighty-p frame. Sixty frames per second. Power draw is negligible because the data never leaves the ISP pipeline. But here's the trade: you can't update that model. You can't swap it for a different upscaler. If Qualcomm improves their algorithm next year, you're buying a new phone to get it.

It's the difference between a Swiss Army knife and a scalpel. One does many things adequately, the other does one thing perfectly and can't do anything else.

On the OnePlus 13 specifically, both of these are actually available. You can run a quantized Real-ESRGAN through Qualcomm's SNPE or QNN SDK — that's the general path. You'll get about two hundred milliseconds per ten-eighty-p frame. That's five frames per second. Totally usable for a photo. Pointless for video. Meanwhile, that specialized ISP upscaler is doing sixty frames per second, but it only works on camera input — live viewfinder, captured frames in the camera pipeline. You can't point it at an arbitrary image in your gallery.
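The frame-rate figures follow directly from the per-frame latencies. A quick sketch of that arithmetic, using the numbers quoted in the episode and treating "sub-ten milliseconds" as an upper bound of 10 ms:

```python
def frames_per_second(latency_ms):
    """Maximum sustained frame rate if each frame takes latency_ms to process."""
    return 1000.0 / latency_ms

def frame_budget_ms(target_fps):
    """Time available per frame at a given target frame rate."""
    return 1000.0 / target_fps

def keeps_up(latency_ms, target_fps):
    """True if per-frame latency fits inside the budget for the target rate."""
    return latency_ms <= frame_budget_ms(target_fps)

general_path_ms = 200.0   # quantized Real-ESRGAN figure from the episode
isp_path_ms = 10.0        # upper bound for the "sub-ten" ISP upscaler figure

print(frames_per_second(general_path_ms))   # 5.0
print(round(frame_budget_ms(60), 2))        # budget per frame at 60 fps
print(keeps_up(general_path_ms, 60), keeps_up(isp_path_ms, 60))
```

At 60 frames per second the budget is under 17 ms per frame, so the 200 ms path misses it by more than a factor of ten while the ISP path fits comfortably.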

Daniel's blurry whiteboard photo — if he snapped it himself and upscaled it through the camera app, the specialized hardware would handle it instantly. But if someone emailed him that same photo and he tried to upscale it in a third-party app, he's on the slower quantized path.

That's exactly the fragmentation we were talking about. And Apple's approach makes the contrast even sharper. Their Core ML upscaling model in iOS 18 is three megabytes. It was trained specifically for the A18 Neural Engine, and it runs in thirty milliseconds per image. Compare that to a generic ONNX upscaling model on the same phone — a hundred and fifty milliseconds. Five times slower.

Three megabytes is absurdly small. What's actually in there?

That's the thing — when you train for a specific architecture, you can exploit hardware features that a generic model can't assume exist. Apple's ANE fusion, for instance, can combine a convolution and an activation function into a single hardware instruction. If you know that's available, you design your model graph around it from the start. A quantized general model has to be conservative — it can't bake in those assumptions.
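The general idea of fusing a convolution with its activation can be shown in plain Python. This sketch only illustrates the concept on a one-dimensional signal and says nothing about how Apple's hardware implements it. The unfused version writes a full intermediate buffer and reads it back; the fused version applies the activation before each result is stored.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D convolution written as a sliding dot product."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """ReLU activation: negative values become zero."""
    return [v if v > 0 else 0.0 for v in values]

def conv1d_relu_unfused(signal, kernel):
    intermediate = conv1d(signal, kernel)   # whole buffer written out first
    return relu(intermediate)               # then read back in a second pass

def conv1d_relu_fused(signal, kernel):
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = sum(signal[i + j] * kernel[j] for j in range(k))
        out.append(acc if acc > 0 else 0.0)  # activation applied before the store
    return out

signal = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [1.0, 0.5]
print(conv1d_relu_unfused(signal, kernel))
print(conv1d_relu_fused(signal, kernel))
```

Both functions return the same values. The difference is one pass over memory instead of two, which is the saving a model designed around a fused instruction can count on.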

The specialized model isn't just faster because it's smaller. It's faster because it's speaking the chip's native language.

The quantized model is speaking a kind of hardware Esperanto — universally understood, but nobody's first language.

That's a great metaphor, but let me push on it. Esperanto was designed to be easy to learn, and in theory it works everywhere. The problem was never the language — it was that nobody grew up speaking it, so it was always slightly awkward. Is that what's happening here? The quantized model can run anywhere, but it's always slightly awkward on every chip?

That's exactly what's happening. And the awkwardness shows up in specific, measurable ways. Take memory access patterns. A generic ONNX model assumes a flat memory hierarchy — it loads weights, does math, stores results. But a chip-specific model knows that the Hexagon NPU has a particular SRAM buffer size and arranges its operations to keep data in that buffer as long as possible before spilling to main memory. That one optimization can be worth a thirty percent speedup, and the generic model simply can't do it because it doesn't know the buffer size exists.
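One way to picture "arranging operations around the buffer" is tiling: splitting a feature map into strips that each fit in on-chip memory. The sketch below assumes a 2 MiB buffer and a 64-channel INT8 feature map. Both numbers are hypothetical and not taken from any vendor documentation.

```python
def rows_per_tile(width, channels, bytes_per_value, buffer_bytes):
    """Largest number of full image rows that fit in the on-chip buffer."""
    row_bytes = width * channels * bytes_per_value
    rows = buffer_bytes // row_bytes
    if rows < 1:
        raise ValueError("a single row does not fit in the buffer")
    return rows

def tile_ranges(height, rows):
    """Split the image height into (start, end) row ranges of at most `rows` rows."""
    return [(start, min(start + rows, height)) for start in range(0, height, rows)]

# Hypothetical numbers: 1080p feature map, 64 channels, INT8, 2 MiB of SRAM.
rows = rows_per_tile(width=1920, channels=64, bytes_per_value=1,
                     buffer_bytes=2 * 1024 * 1024)
tiles = tile_ranges(height=1080, rows=rows)

print(rows, len(tiles), tiles[-1])
```

A model built for a known buffer size can pick its layer shapes so that tiles like these stay on chip across several operations. A generic export has no buffer size to plan around.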

It's not even about the math being wrong. It's about the logistics of moving data around being suboptimal.

And the more specialized the hardware gets, the more those logistics matter. It's like having a warehouse where you know exactly where every item is and the shortest path to grab it, versus walking in with a generic map that gets you to the right aisle but makes you check every shelf.

That hardware Esperanto versus native speaker gap is where the industry has to make a real choice. And if you look at where the tooling is actually going, the convergence argument is strong. As of early this year, ONNX Runtime's QNN backend supports Qualcomm, MediaTek, and Samsung Exynos NPUs in a single runtime. Google's MediaPipe does the same thing. You train once, export once, and the framework handles the translation layer.

Which sounds like the obvious answer until you look at what the chip vendors are actually doing on the side.

Right, and that's the counterweight. Samsung's Exynos 2400 has this thing called weight streaming — it loads model weights on-the-fly during inference to avoid SRAM bottlenecks. That's not a standard feature. If you're writing a generic ONNX model, you can't exploit that. You don't even know it's there. But if you're building a model specifically for that Exynos, you design your memory access patterns around it, and you can get a two to three times performance gain over the generic path.

The chip has this secret handshake, and the generic model doesn't know it.

Every chip vendor has their own version of the secret handshake. Apple's got ANE fusion, Qualcomm's got Hexagon-specific tensor instructions, Google's Tensor G4 has TPU optimizations that only their own models use. The hardware teams are building features that only first-party or very close partner software ever touches.

Which pushes us toward a split, doesn't it? Not a clean winner-take-all.

And I think what we're actually going to see is a tiered ecosystem. For first-party apps — the camera, the photos gallery, the video editor that ships with the phone — chip-specific models will dominate. Samsung's Galaxy S25 Ultra already does this: its Gallery app has a specialized upscaling model that can take a ten-eighty-p image to 8K, and it runs fast because it was built for that exact Exynos or Snapdragon in that phone. But take that same model, quantize it to INT8, and run it on a Galaxy A55 — it's three times slower.

The flagship gets the scalpel, the mid-range gets the Swiss Army knife.

For third-party developers, the math is brutal. If you're building a photo editing app and you want chip-specific performance, you're looking at maintaining separate model variants for Qualcomm, MediaTek, Samsung, Apple, and Google's Tensor. That's five code paths. Most small teams can't afford that. So they'll ship a quantized general model through ONNX or MediaPipe, accept the performance hit, and call it a day.

Which means Daniel's OnePlus is going to end up with both. The camera app will use whatever Qualcomm baked into the silicon, and Snapseed or whatever third-party editor he uses will run the generic one.

And that's fine for photo upscaling — both paths produce a usable result in under a second. But here's where the split starts to bite. Video calls, game streaming, anything that needs real-time upscaling at thirty or sixty frames per second — the quantized general model can't keep up. It's doing five to ten frames per second. The specialized model is doing sixty.

Real-time upscaling becomes a flagship feature by default. Not because anyone decided it should be, but because the software ecosystem can't bridge the hardware gap.

That is the knock-on effect that I think is going to define the next two or three years. We're going to see a capability gap open up between flagship and mid-range phones for anything AI-related that needs low latency. It won't be about whether the feature exists — both phones will technically have upscaling. It'll be about whether it's fast enough to be invisible. And invisible AI is the only kind people actually use.

Let me ask the uncomfortable question here. Is this actually a problem for users, or is this a problem for developers that users will never notice? Because I've used plenty of mid-range phones, and I've never once thought "my upscaling is running on the quantized general path and I'm sad about it."

That's fair. For photo upscaling specifically, I think you're right — the difference between two hundred milliseconds and thirty milliseconds is imperceptible to most people for a single image. Where it becomes perceptible is when you're scrolling through a gallery and every thumbnail is being upscaled on the fly, or when you're on a video call and the upscaling is supposed to be running continuously. On the flagship, it's seamless. On the mid-range, you get stutters, or the feature just silently degrades to a lower resolution. The user notices the stutter, not the missing upscaling.

They don't know what they're missing, but they feel that something is less smooth.

It's the uncanny valley of performance. Not broken enough to complain about, not smooth enough to ignore.

Given all that, what should you actually do if you're building for this or just trying to use it?

Yeah, let's make this practical. If I'm a developer and I want to ship an upscaling feature in my app, where do I even start?

Start with ONNX Runtime plus the QNN backend. That single stack targets Qualcomm, MediaTek, and Samsung Exynos NPUs without you writing a line of chip-specific code. You'll get about eighty percent of the performance of a bespoke model for maybe ten percent of the engineering cost. For most apps, that's the right trade.
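As a starting point, here is a hedged sketch of what selecting the QNN backend looks like in ONNX Runtime's Python API, with a CPU fallback. The model filename and the backend library path are assumptions for illustration. ONNX Runtime documents the QNN execution provider for Qualcomm hardware; this sketch does not verify coverage of other vendors' NPUs.

```python
def pick_providers(available, qnn_backend_path="libQnnHtp.so"):
    """Build an ordered provider list: QNN first if present, CPU as fallback.

    `available` is the list returned by onnxruntime.get_available_providers().
    `qnn_backend_path` is an assumed library location; adjust for your build.
    """
    providers = []
    if "QNNExecutionProvider" in available:
        providers.append(("QNNExecutionProvider", {"backend_path": qnn_backend_path}))
    providers.append("CPUExecutionProvider")
    return providers

def load_session(model_path):
    """Create an inference session, preferring the NPU when the runtime offers it."""
    import onnxruntime as ort  # imported lazily so pick_providers runs without it
    return ort.InferenceSession(
        model_path,
        providers=pick_providers(ort.get_available_providers()),
    )

# Example of the selection logic alone, with no runtime installed:
print(pick_providers(["QNNExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))

# With onnxruntime installed and a quantized model on disk (hypothetical filename):
# session = load_session("upscaler_int8.onnx")
```

Keeping the CPU provider at the end of the list means the same app still runs, more slowly, on devices where the NPU backend is missing.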

The point where that trade stops making sense?

When you need sub-thirty-millisecond inference. Real-time video upscaling, game streaming, augmented reality — if the frame has to be done before the user perceives latency, then the generic path isn't enough. That's when you bite the bullet and build chip-specific variants.

The decision point is basically: is this feature something the user waits for, or something that has to feel instantaneous?

That's the line. And for users, the practical difference right now depends entirely on what you're upscaling. For photos, both approaches work. Snap a picture, wait a quarter second or half a second — you won't care. The output looks good either way. But for video, the gap is a chasm. Specialized models are hitting thirty frames per second at ten-eighty-p to 4K. Quantized general models are stuck at five to ten frames per second.

If you're wondering which world your phone lives in, there's actually a way to check. The ML Benchmark app on the Play Store runs both generic and chip-specific models and lets you compare the speed and quality side by side.

You'll see the specialization gap immediately. Same task, same phone, completely different numbers.

Which is either reassuring or annoying, depending on which phone you bought.

That's the tiered ecosystem in one app.

We've mapped the two paths, the tiered ecosystem, the whole Swiss Army knife versus scalpel thing. But there's a bigger question underneath all of this that I don't think anyone's really answered yet.

Whether the hardware itself converges. Right now every chip vendor is building their own secret handshake — Qualcomm's Hexagon instructions, Apple's ANE fusion, Samsung's weight streaming. The question is whether that's a permanent state of affairs or just the messy early days before someone imposes order.

The ARM Scalable Vector Extension for AI is the closest thing to an answer. ARM's trying to define a universal NPU instruction set — the idea being that you write your model once, target SVE for AI, and it runs on any NPU that implements it. If that gets traction, the hardware Esperanto problem goes away. Every chip speaks the same language natively.

If it doesn't?

Then the fragmentation gets worse, not better. Because the incentive for chip vendors is to keep differentiating on proprietary accelerators. Qualcomm wants you to notice that the upscaler on a Snapdragon phone is faster than the one on a MediaTek phone. That's a selling point. Standardization erases that.

The business incentive pushes toward fragmentation, and the developer incentive pushes toward convergence.

I don't know which wins. But I do know that by twenty-twenty-eight, the hardware will be so far ahead of where it is now that on-device upscaling should feel as invisible as JPEG compression does today. You won't think about it. You'll just expect your photos to be sharper, your video calls to be clearer, and it'll happen.

Because that future only arrives if the software ecosystem catches up. The silicon is already capable. The tooling is getting there. The open question is whether we standardize the bridge between them, or keep building custom bridges for every chip.

Something to watch. And now: Hilbert's daily fun fact.

Hilbert: In the nineteen-eighties, scientists attributed the iridescent blue of Morpho butterfly wings to pigmentation. It was later corrected — the color comes from nanoscale ridges that cause light to interfere with itself, a structural effect now being adapted for low-power color e-ink displays.

The butterfly was doing physics, not chemistry.

This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop, and thanks to Daniel for the question that sent us down this whole rabbit hole. If you enjoyed this, do us a favor and leave a review wherever you listen — it genuinely helps more people find the show. We'll be back soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
