You know, Herman, I spent about forty-five minutes yesterday staring at a terminal window watching a PyTorch ROCM build crawl along, and I had this moment of pure developer existential dread. I’m sitting there thinking, we are in twenty twenty-six, we have near-instantaneous global communication, and yet I’m essentially compiling a small operating system from scratch just to run a linear regression.
The classic "Docker is supposed to save me from this" trap. It’s the ultimate irony of modern DevOps, isn’t it? Herman Poppleberry here, and honestly, Corn, that frustration is the starting point for one of the most misunderstood parts of containerization.
It really is. Today’s prompt from Daniel is about exactly this paradox. He’s asking why, if Docker’s whole mission is to provide stable, portable, ready-to-go environments, we still find ourselves building these massive, hardware-specific images like PyTorch ROCM locally instead of just pulling a finished product from a registry. It feels like the "run anywhere" promise has a very expensive asterisk attached to it when GPUs get involved.
It’s a massive asterisk, and by the way, it’s worth mentioning that today’s deep dive into these build logs is powered by Google Gemini 3 Flash. But back to your pain, Corn. The reason you’re waiting forty-five minutes isn’t because Docker failed; it’s because the "environment" we’re trying to containerize isn’t just code anymore. It’s a delicate negotiation between software and very specific silicon.
Right, because when we talk about a "standard" Docker image—say, a simple Node dot js app or a Python script—we’re usually talking about things that stay within user space. But as soon as you touch machine learning libraries like PyTorch or JAX, you’re trying to reach through the container, through the abstraction layer, and grab the hardware by the throat. And the hardware, especially on the AMD side with ROCM, is very picky about who’s touching it.
That’s the core of it. We have this mental model that a Docker image is a self-contained unit, like a physical box you can ship anywhere. But a GPU-accelerated image is more like a parasite that requires a very specific host to survive. If the host has the wrong kernel version or the wrong driver, the container is just a collection of useless binaries.
So let’s peel back the layers on why this breaks down. If I go to Docker Hub and try to find a pre-built PyTorch image for ROCM, why can’t I just find one that "just works" for my specific setup? Is it just that the maintenance burden is too high, or is there a fundamental technical wall?
It’s a bit of both, but the technical wall is the real killer. Let’s talk about the Application Binary Interface, or ABI, nightmare. When you build PyTorch against ROCM, you aren’t just linking to libraries; you’re establishing a contract with the GPU driver and the Linux kernel. If you’re using ROCM six point zero, for example, the binaries inside that container are expecting the host machine—your actual physical computer—to be running the exact same version of the ROCM kernel modules.
And this isn't like a "backward compatible" situation where ROCM six point one can handle a six point zero container?
With ROCM, it’s notoriously brittle. AMD has made strides, especially with PyTorch two point two back in early twenty-twenty-four, which improved ROCM six point zero support, but the coupling is still incredibly tight. If your host is running version five point seven and you pull a container built for six point zero, the system calls will literally fail. The containerized application tries to talk to the GPU, and the host kernel says, "I don't recognize that command."
That’s wild because it completely undermines the "isolation" part of Docker. Usually, the whole point is that the container doesn't care what the host is doing as long as it has a Linux kernel. But here, the container is basically saying, "I need to know your exact home address and your blood type before I’ll even boot up."
Think of it like a heart transplant. You can’t just take any heart and put it in any body; the blood type and the tissue match have to be perfect, or the body rejects it. In this case, the GPU driver is the immune system. If that driver on the host doesn't see a perfect match in the container's library calls, it just shuts down the communication.
And then you hit the NVIDIA side of the fence, which has its own set of problems. You’d think NVIDIA, being the market leader, would have a more "universal" solution, but they have a legal wall. The CUDA End User License Agreement—the EULA—actually prohibits the redistribution of certain CUDA components in public container images without a specific, often expensive, licensing agreement, right?
In many cases, yes. This is why you see NVIDIA pushing their own registry, NGC—the NVIDIA GPU Cloud. They want you to pull from their walled garden where they control the licensing and the optimization. If you try to build a truly portable, open-source version, you end up having to download the installers during the build process because you can’t legally "save" the installed state and share it as a public image layer.
Wait, so that’s why when I look at some of these GitHub repos for ML projects, the Dockerfile is eight hundred lines long and half of it is just downloading things from NVIDIA’s servers?
Precisely. They can’t ship the image with those parts already baked in, so they force your machine to do the heavy lifting of assembly. It’s like buying a Lego set where the company isn't allowed to put the bricks in the box, so they just give you a map to fifty different locations where you can find the individual pieces yourself. It’s a massive waste of bandwidth and time, but it’s the only way to stay legal.
That explains why so many Dockerfiles in the ML space look like a giant shopping list of "curl" and "wget" commands. They aren't just being inefficient; they’re navigating a legal minefield. But even beyond the legal stuff, there's the sheer physical size. I saw an issue on the ROCM Docker GitHub where people were complaining that the "latest" PyTorch image was fifty-four gigabytes. Fifty-four! You could fit the entire history of human literature in that space, and instead, we’re using it to store some math libraries.
Fifty-four gigabytes is staggering. And that bloat comes from "fat binaries." Because the developers don’t know if you’re running an older AMD Instinct MI-two-fifty-X or a brand new MI-three-hundred-X, they have to include the compiled kernels for every single supported architecture. It’s like carrying around a set of tires for every car ever made just in case you find a vehicle to drive.
So if I build it locally, I’m essentially saying, "Hey, I only have this one car, just give me the tires for the MI-three-hundred." And that shrinks the image down to something manageable?
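For listeners who want to see what "just give me the tires for my one car" looks like in practice, it often comes down to a single build variable. This is a minimal sketch, not a tested recipe: PYTORCH_ROCM_ARCH is the variable PyTorch's own source build reads, but the base image tag and the gfx942 target (the MI300 series) are illustrative assumptions.

```dockerfile
# Minimal sketch: compile PyTorch's GPU kernels for ONE architecture only,
# instead of shipping "fat binaries" for every supported card.
# The base image tag and the gfx942 value are illustrative assumptions.
FROM rocm/dev-ubuntu-22.04:6.0.2 AS build

RUN apt-get update && apt-get install -y git python3-pip

# Only emit kernels for the MI300-series (gfx942), not every target.
ENV PYTORCH_ROCM_ARCH=gfx942

RUN git clone --recursive https://github.com/pytorch/pytorch /pytorch \
 && cd /pytorch \
 && pip install -r requirements.txt \
 && python3 setup.py install
```

The resulting image only works on the one architecture you named, which is exactly the trade the conversation describes: portability for size and fit.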
Somewhat. But more importantly, it ensures that the "tires" actually fit the "axle," which is your host kernel. There’s also the "glibc" problem. The GNU C Library is the foundation of almost everything in Linux. If your container is built on an older Ubuntu version with an older glibc, but your host is a bleeding-edge Arch Linux install with a much newer kernel, you can run into these subtle, silent failures where a system call behaves differently than the container expects.
But wait, I thought glibc was supposed to be backward compatible? Isn't that the whole point of the versioning system they use?
In theory, yes. In practice, when you're doing high-performance compute, you're often calling into very specific, low-level threading libraries or memory management functions that might have slight variations in behavior between versions. When you're training a model for three weeks, a tiny difference in how a thread is scheduled or how memory is paged can lead to a race condition that crashes the whole job on day nineteen. Building locally ensures the container's glibc and the host's kernel are singing from the same songbook.
It feels like we’re reinventing "Dependency Hell," just at a higher level of the stack. We moved it from the OS level to the Container level.
It’s absolutely Dependency Hell, but with more layers of YAML. And let's talk about the hardware-specific optimizations. If you’re running a high-performance computing cluster, you aren’t just looking for "functional" code; you’re looking for peak performance. A generic pre-built image can’t know if your CPU supports AVX-five-twelve or if your PCIe bus has specific bandwidth constraints. By building locally, the compiler can detect those flags and optimize the math kernels specifically for your silicon.
So it’s the difference between a suit you bought off the rack at a department store and one that was custom-tailored for you. The department store suit technically "covers your body," but the custom one is the only one you can actually run a marathon in without it falling apart.
That’s a rare analogy for us, but it works! And for ML, performance is everything. If a custom build gives you a five percent speedup on a training job that takes a week, that’s hours of time and potentially thousands of dollars in electricity and compute costs saved.
But let's be real, Herman. Does five percent really matter for a dev just trying to run a Llama three inference on their local workstation? Is the forty-five-minute build really worth five percent?
For a single dev? Probably not. But for the person Daniel is asking about—the one who's trying to set up a reproducible environment for a team—it matters immensely. If you have ten devs and they all have slightly different GPU variants or driver versions, and you give them a "standard" image that works for eight of them but causes random "out of memory" errors for the other two, you’ve just created a debugging nightmare that will cost way more than forty-five minutes of build time.
Okay, so we’ve established that the "why" is a mix of kernel-module mismatch, legal restrictions from NVIDIA, and the need for hyper-optimization. But what about the business side of this? Why haven’t the cloud providers or the hardware vendors solved this? It seems like a massive friction point for their customers.
They have, but in a way that creates more lock-in. Amazon, Google, and Microsoft all provide "Deep Learning VMs" or pre-configured environments where they’ve done the hard work of matching the host kernel to the container. But the catch is, you have to use their VM, on their hardware, using their specific version of the driver.
Right, so the "portability" of Docker is being traded for "convenience" within a specific vendor’s ecosystem. It’s the "Vendor SDK Moat" we’ve talked about before. If you want it to be easy, you have to stay in their backyard. If you want to move your workload from an AWS instance to an on-prem AMD server, you’re back to square one, staring at a build log for forty-five minutes.
And this is why it’s so much harder for AMD than NVIDIA right now. NVIDIA has the "CUDA" brand which acts as a standard, even if it's a closed one. AMD’s ROCM is trying to be more open, but being open means supporting more combinations of kernels and distros, which actually makes the "pre-built image" problem harder, not easier.
That’s a fascinating point. By being more flexible, AMD actually makes it harder for a third party to provide a "one size fits all" container. It’s the paradox of choice. If you support everything, you can’t pre-package anything efficiently.
And that’s where the "builder pattern" in Docker becomes so important. For these massive ML images, you’ll often see these sophisticated multi-stage Dockerfiles. The first stage is a massive, bloated environment where all the compiling happens—it might be a hundred gigabytes on its own. But then, it spits out just the compiled binaries and the necessary libraries into a second, much smaller "runtime" image.
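The builder pattern Herman describes might look like this minimal sketch. The stage names, paths, and base images are illustrative assumptions, and in practice the slim stage would still need the matching GPU user-space libraries copied in or installed.

```dockerfile
# Sketch of the builder pattern: a fat "build" stage that compiles,
# and a slim "runtime" stage that receives only the finished artifacts.
# Image names, tags, and paths here are illustrative, not vendor-blessed.
FROM rocm/dev-ubuntu-22.04:6.0.2 AS build
RUN apt-get update && apt-get install -y build-essential cmake git
COPY . /src
# The heavy compile happens here, in the hundred-gigabyte stage.
RUN cd /src && make install PREFIX=/opt/app

FROM ubuntu:22.04 AS runtime
# Carry over only the compiled binaries and libraries, not the toolchain.
# (A real runtime stage would also need the matching ROCm runtime libs.)
COPY --from=build /opt/app /opt/app
ENV PATH=/opt/app/bin:$PATH
CMD ["/opt/app/bin/serve"]
```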
I’ve seen this, but even then, that first stage is where the pain is. And if you aren't using BuildKit—Docker’s modern build engine—you aren't caching those layers effectively. I think a lot of solo devs or smaller teams don't realize they can mount a cache directory so that when they change one line of Python code, they don't have to re-compile the entire ROCM stack.
BuildKit’s "mount type equals cache" is probably the most underrated feature in the entire Docker ecosystem for ML engineers. It allows you to persist things like the pip cache or the C-compiler cache across builds. Without it, every time you fix a typo, you’re essentially starting from a blank slate.
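As a sketch of what Herman means (the base image, file names, and cache path are illustrative assumptions), a BuildKit cache mount looks like this:

```dockerfile
# syntax=docker/dockerfile:1
# Sketch: a BuildKit cache mount persists the pip download cache across
# builds, so a one-line code change doesn't restart from a blank slate.
# Base image tag and paths are illustrative assumptions.
FROM rocm/dev-ubuntu-22.04:6.0.2

COPY requirements.txt /tmp/requirements.txt
# The named cache mount survives between builds; the layer itself stays
# clean because the cache directory is never baked into the image.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r /tmp/requirements.txt
```

This requires BuildKit to be enabled, which is the default in current Docker releases.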
It’s like having to rebuild your entire house because you decided to change the color of the front door. It’s madness. But let's look at a real-world case study. Look at how Meta handles PyTorch for their own internal clusters. They don't just use the public PyTorch image.
No, they can’t. Meta is running some of the largest AMD Instinct clusters in the world—we’re talking tens of thousands of GPUs. They maintain their own internal forks of ROCM and PyTorch with custom patches that are specifically tuned for their network topology and their specific hardware revisions. For them, a "generic" image is literally useless. It wouldn't even be able to talk to their high-speed interconnects like InfiniBand.
So they’re basically their own hardware vendor and software vendor at that point. But for the rest of us, the "mortals" who are just trying to run a fine-tuning script on a single GPU, what are we supposed to do? Is the answer just "embrace the build time"?
The answer is to treat your Docker image as a "deployment artifact" rather than a "development environment." This is a key distinction. During development, you should probably be using a persistent environment—maybe even a Dev Container—where you only install the heavy stuff once. Then, once your code is ready, you trigger a CI/CD pipeline that handles the massive hardware-specific build in the background.
That makes sense. Don't make the build part of your "inner loop" of development. If you’re waiting for a Docker build every time you want to test a logic change, you’re doing it wrong. You should be developing against a stable base and only "containerizing" for the final push.
And when you do build that final image, you have to be incredibly disciplined about versioning. You can’t just use "latest" as your base image tag. You need to pin the exact ROCM version, the exact OS version, and even document the host kernel and driver requirements in the comments of the Dockerfile, because Docker itself can’t enforce them.
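A sketch of that discipline might look like the following. The specific image tag and package versions are illustrative assumptions, not recommendations; the point is that ":latest" appears nowhere.

```dockerfile
# Sketch of disciplined pinning. Every tag and version below is exact
# (and illustrative); ":latest" never appears anywhere in the file.
#
# HOST REQUIREMENTS (Docker cannot enforce these, so document them):
#   - ROCm 6.0.x kernel driver and amdgpu module on the host
#   - driver version matching the ROCm user space baked into this image
FROM rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2

# Pin the Python layer exactly as well, so a vendor release upstream
# can never silently change what this image contains.
RUN pip install "transformers==4.38.2" "datasets==2.18.0"
```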
"Latest" is the most dangerous word in the English language for a DevOps engineer. It’s a ticking time bomb. You go to sleep, the vendor pushes an update, and suddenly your "stable" environment is throwing segfaults because the kernel module on your server is now one minor version behind the container's expectations.
It’s the reason why so many senior engineers seem so grumpy and obsessed with pinning versions. We’ve all been burned by that one "invisible" update that broke a production pipeline at three in the morning.
I also wonder about the emergence of things like WebGPU or more hardware-agnostic frameworks. Do you think we’ll ever get to a point where this "hardware-to-container" coupling is loosened? Where I can just send a pile of code to a "GPU-cloud" and it doesn't matter if it's an NVIDIA chip or an AMD chip or some new custom ASIC from a startup?
We’re seeing some movement there. Frameworks like Mojo or the work being done with OpenAI’s Triton language are trying to create a higher-level abstraction that compiles down to whatever hardware is available. Triton, in particular, is interesting because it allows you to write "GPU code" in a way that is relatively portable between CUDA and ROCM. But even then, the runtime—the thing that actually executes that code—still needs to be built for the specific system.
So we’re just moving the "build" step. Instead of building the whole image, we’re building the kernels on the fly.
It’s JIT—Just-In-Time—compilation for GPUs. It solves the "portability" problem but introduces a "cold start" problem. The first time you run your code, it might hang for two minutes while it compiles everything for your specific chip.
I’d take a two-minute cold start over a forty-five-minute Docker build any day of the week.
Most people would. But for high-scale production, that two-minute delay is unacceptable. If you’re scaling a cluster of a thousand inference nodes to handle a spike in traffic, you can’t have them all sitting there compiling code for two minutes before they can serve a single request. So, you’re back to pre-building the images.
It’s a circle. We keep coming back to the same trade-offs. Speed of development versus speed of execution. Portability versus performance.
And that’s really the takeaway for anyone listening who’s frustrated by this. If you’re building these large images locally, you aren’t doing something "wrong." You’re actually engaging with the reality of high-performance computing. Docker was originally built for microservices—little pieces of logic that don't care about the hardware. We’ve co-opted it for ML, which is the exact opposite of that.
It’s like trying to use a shipping container to transport a live, temperamental whale. Sure, the container fits on the boat, but you have to build a whole life-support system inside it that’s specifically tuned to that whale's needs, or it’s not going to survive the trip.
And the whale is the GPU.
The whale is definitely the GPU. In this case, a very large, expensive AMD whale.
By the way, speaking of expensive whales, did you see the recent stats on the power consumption for these builds? Sometimes the energy cost of compiling these massive ML stacks on a high-end workstation is actually higher than the cost of running the actual inference for the first thousand requests. We're literally burning coal just to get the software ready to run.
That is a staggering thought. We're essentially paying a "carbon tax" for our lack of standardized binaries. But let's get practical before we lose people in the existential dread again. If someone is setting this up today, what’s the move?
First, check if there’s a "runtime-only" image. Often, vendors like NVIDIA or AMD provide a "dev" image which has all the compilers and headers, and a "runtime" image which is much smaller. Build your code in the "dev" one, but only ship the "runtime" one.
Does that actually solve the ABI mismatch though? If I build in a dev image and ship in a runtime image, aren't they still tied to that specific version?
Yes, they are. You have to make sure the "dev" and "runtime" tags match exactly—like both being ROCM six point zero point two. It doesn't solve the host-matching problem, but it does solve the "my image is fifty gigabytes" problem. You might get it down to three or four gigabytes, which is much easier to push to a registry.
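One way to guarantee the dev and runtime tags match exactly is to pin both through a single build argument. Here is a sketch using NVIDIA's devel/runtime tag scheme, since it illustrates the split cleanly; the version number and file names are assumptions.

```dockerfile
# Sketch: one ARG pins the toolkit version, so the stage you compile in
# and the stage you ship can never drift apart. Version is illustrative.
ARG CUDA_VERSION=12.3.2

FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS build
COPY main.cu /src/main.cu
RUN mkdir -p /out && nvcc -O3 -o /out/app /src/main.cu

# Same version, runtime flavor: libraries to run, no compilers or headers.
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu22.04
COPY --from=build /out/app /usr/local/bin/app
CMD ["app"]
```

An ARG declared before the first FROM is visible to all the FROM lines, which is what lets a single value pin both stages.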
And what about the host? If I'm running Ubuntu on my dev machine but my production server is running Red Hat, am I just asking for trouble?
In the ML world, yes. You really want to keep the host OS as similar as possible. If you can, use the same base distribution for your host and your container. It minimizes the chance of subtle mismatches between the container's user-space libraries and what the host's kernel and drivers actually provide.
Also, investigate local caching. If you’re using a tool like "ccache" inside your Docker build, you can mount a persistent volume to it. It makes those forty-five-minute builds turn into five-minute builds on the second run.
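Wiring ccache into a containerized build is mostly a matter of putting its compiler shims first on the PATH and persisting its directory with a cache mount. A sketch, assuming an Ubuntu-based image where the shims live in /usr/lib/ccache; the base image and build command are illustrative:

```dockerfile
# syntax=docker/dockerfile:1
# Sketch: ccache inside a Docker build, so repeat builds reuse prior
# compilation results. Base image tag and paths are illustrative.
FROM rocm/dev-ubuntu-22.04:6.0.2

RUN apt-get update && apt-get install -y ccache build-essential
ENV CCACHE_DIR=/ccache
# On Debian/Ubuntu, /usr/lib/ccache holds compiler shims; putting it
# first on PATH routes gcc/g++ invocations through ccache.
ENV PATH=/usr/lib/ccache:$PATH

COPY . /src
# The cache mount keeps /ccache populated across builds, so unchanged
# translation units compile near-instantly on the second run.
RUN --mount=type=cache,target=/ccache \
    cd /src && make -j"$(nproc)"
```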
And finally, accept that for GPU work, the host matters. You cannot ignore the underlying operating system. If you’re managing a fleet of machines, keep them on a unified kernel and driver version. The more heterogeneity you have in your hardware, the more pain you’re going to have in your Docker builds.
It’s a good reminder that "abstraction" doesn't mean the thing underneath disappears. It just means you’ve pushed the complexity somewhere else. In this case, we pushed it into the Dockerfile.
Well said. This is one of those topics where the "weirdness" Daniel points out is actually a window into how the entire modern stack is held together with duct tape and very specific versions of glibc.
It’s duct tape all the way down, Herman. All the way down. I guess the real lesson for Daniel is: don't fight the build, just optimize the build. And maybe buy a faster CPU so that forty-five minutes becomes twenty.
Or just get a very long book to read while you wait.
Before we wrap up, I want to give a shout out to our producer, Hilbert Flumingtop, who I’m sure has spent his fair share of time waiting for Docker builds to finish.
And a big thanks to Modal for providing the GPU credits that power this show. If you want to avoid some of this headache, using a serverless platform like Modal can actually offload a lot of this environment management, which is a nice "out" for some use cases.
It really is. They handle the "whale's life support system" so you can just focus on the whale.
This has been My Weird Prompts. If you found this deep dive into the guts of containerization helpful, or if you just want to commiserate about build times, find us at myweirdprompts dot com. You can find all our previous episodes and links to subscribe there.
We’re also on Telegram—just search for My Weird Prompts to get notified whenever a new episode drops. It’s a great way to stay in the loop without having to check your podcast app every day.
Thanks for listening. We’ll see you in the next one.
Catch you later.