Daniel sent us this one, and it's a bit of a winding path — he's been experimenting with making plushie versions of me and Herman, training LoRAs on them, and he's run into the whole question of background removal. What's actually happening when you click "remove background" and suddenly get transparent pixels? And he's drawing this connection to puppetry — the moment puppeteers stopped hiding behind curtains and became visible, which changed the art form. I want to dig into both threads, but let's start with the tech. What's actually under the hood?
Before we dive in — quick note, DeepSeek V four Pro is writing our script today. Alright, so background removal. The thing most people don't realize is that what looks like a single click is actually several different AI systems working in sequence. The foundational architecture behind a lot of modern background removal is something called U-squared-Net, or U-two-Net, which came out of the University of Alberta in twenty twenty. It's a salient object detection model.
Salient object detection. Meaning it figures out what the main thing in the image is.
And the way it does that is clever. Most neural networks for image tasks process information at one scale — they look at the whole image, or they look at patches. U-two-Net has this nested architecture where it's processing the image at multiple scales simultaneously. It's got a six-stage encoder and a five-stage decoder, and at each stage it's fusing information from different resolutions. So it can see both the fine details — individual hairs, the edge of a sleeve — and the overall shape of the object.
It's like having six different people looking at the same image, each from a different distance, and then pooling what they see.
That's actually a decent way to think about it. And the "U" shape comes from the fact that the architecture goes down in resolution and then back up — it looks like a U when you draw it out. The key innovation was what they called "residual U-blocks." Each block captures context at multiple scales internally, and then the blocks are stacked. It's a lot of compute, but it produces really clean masks.
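For anyone who wants to see the shape of that idea in code, here's a toy residual U-block in PyTorch. This is a heavily simplified sketch of the RSU design described in the paper, not the actual U-two-Net implementation: a small encoder and decoder inside the block, with the input added back at the end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    def __init__(self, cin, cout, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(cout)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class TinyRSU(nn.Module):
    """Toy residual U-block: encode down, decode up, add the input back."""
    def __init__(self, cin, cmid, cout):
        super().__init__()
        self.inconv = ConvBNReLU(cin, cout)
        self.enc1 = ConvBNReLU(cout, cmid)
        self.enc2 = ConvBNReLU(cmid, cmid)
        self.bottom = ConvBNReLU(cmid, cmid, dilation=2)   # wider context via dilation
        self.dec2 = ConvBNReLU(cmid * 2, cmid)
        self.dec1 = ConvBNReLU(cmid * 2, cout)

    def forward(self, x):
        xin = self.inconv(x)                        # local features at full resolution
        e1 = self.enc1(xin)
        e2 = self.enc2(F.max_pool2d(e1, 2))         # half resolution
        b = self.bottom(e2)                         # coarse, wide-context features
        d2 = self.dec2(torch.cat([b, e2], dim=1))
        d2_up = F.interpolate(d2, size=e1.shape[2:], mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([d2_up, e1], dim=1))
        return d1 + xin                             # residual: multi-scale detail added to the local features

block = TinyRSU(3, 16, 64)
out = block(torch.randn(1, 3, 128, 128))            # same spatial size in and out
```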
A mask is just a black and white image that says "these pixels stay, these go."
The model outputs a probability map — for every pixel, it predicts how likely it is to be foreground. You threshold that, and you've got your mask. But here's where it gets interesting for Daniel's question about green screens. The old way of doing this — and I mean the pre-deep-learning way — was either chroma keying, which is the green screen approach, or it was graph cut algorithms where you'd manually scribble on foreground and background and the algorithm would propagate your labels.
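As a rough sketch, the thresholding step is a one-liner once you have the probability map as a NumPy array. The 0.5 cutoff below is just a common default, not something the model dictates:

```python
import numpy as np

prob_map = np.random.rand(512, 512)               # stand-in for the model's per-pixel foreground probabilities
mask = (prob_map >= 0.5).astype(np.uint8) * 255   # 255 = foreground (keep), 0 = background (discard)
```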
Chroma keying works because you've got a known, uniform color that you can mathematically subtract. If a pixel is within a certain range of green, it goes.
That's the idea, but the reality was always messier. Green spill — where green light bounces off the screen onto the subject — was a constant problem. Fine hair, translucent materials, motion blur at the edges. All of that made clean keys really difficult. And you needed an evenly lit screen, no wrinkles, the right distance between subject and background. It was a whole discipline.
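For contrast, a bare-bones chroma key is only a few lines of OpenCV. The HSV range here is a guess you'd tune per shoot, and it makes none of the spill or edge corrections a real keyer does:

```python
import cv2
import numpy as np

img = cv2.imread("greenscreen_frame.jpg")          # hypothetical input path
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# pixels inside this rough green range are treated as background
background = cv2.inRange(hsv, np.array([40, 60, 60]), np.array([80, 255, 255]))

rgba = cv2.cvtColor(img, cv2.COLOR_BGR2BGRA)
rgba[..., 3] = cv2.bitwise_not(background)         # punch the green out via the alpha channel
cv2.imwrite("keyed_frame.png", rgba)
```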
What Daniel's actually asking about is the jump from chroma keying to semantic understanding. The model isn't looking for a specific color — it's looking for objects.
That's the fundamental shift. U-two-Net and similar models don't care what color your background is. They've learned what objects look like — what a person looks like, what a plushie looks like — and they separate foreground from background based on that semantic understanding. The training data is thousands and thousands of images where humans have manually labeled every pixel as foreground or background. The model learns the visual patterns that distinguish subject from surround.
When Daniel's photographing plushie versions of us, the model is recognizing "this is a stuffed animal, this is an object with volume and edges" and separating it from whatever's behind it.
That's actually a harder problem than it sounds for plushies specifically, because plush toys often have fuzzy edges, they've got texture that can blend with certain backgrounds, and they're not as common in training datasets as humans. There's a reason Daniel mentioned a paucity of training data for stuffed animals. Most of these models are trained heavily on portraits, product photography, that kind of thing. A sloth plushie on a messy desk is a different challenge than a person on a plain background.
That's probably why he's asking about gray backgrounds versus true transparency. He's trying to figure out what gives the LoRA the cleanest signal during training.
That's an empirical question that people in the LoRA community argue about. The theory behind using a gray background is that it provides a neutral, consistent context that the model can learn to ignore. If every training image has a different background — a park, a living room, a kitchen — the model might accidentally learn that the background is part of the concept. You tell it "learn Corn the sloth," and it learns "Corn the sloth plus a bunch of random furniture."
Whereas if you strip the background entirely with an alpha channel, you're giving the model nothing but the subject.
But in practice, alpha transparency introduces edge artifacts. The boundary between the subject and the transparent area can look unnatural, and the model might learn those artifacts as part of the subject. Some people swear by gray backgrounds because they provide a soft, consistent edge. Others insist that transparency is the only way to prevent background contamination. There's no settled answer yet.
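If you want to try the gray-background variant yourself, one simple approach is to flatten an RGBA cutout onto a neutral canvas with Pillow. The mid-gray value and file names here are just placeholder choices:

```python
from PIL import Image

cutout = Image.open("plushie_cutout.png").convert("RGBA")       # hypothetical RGBA cutout
gray = Image.new("RGBA", cutout.size, (128, 128, 128, 255))     # neutral mid-gray canvas
flattened = Image.alpha_composite(gray, cutout).convert("RGB")  # edges blend into gray instead of transparency
flattened.save("plushie_on_gray.png")
```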
Which I suspect is exactly why Daniel's experimenting with both. Now, you mentioned U-two-Net, but that's not the only architecture out there. What about the video side? He mentioned trying video background removal.
Video is a whole different beast. If you run a single-image model on each frame independently, you get temporal inconsistency — the mask flickers from frame to frame because the model makes slightly different decisions on each one. It's the same problem as frame-by-frame rotoscoping versus modern video inpainting. You need temporal coherence.
The solution is to include information from previous frames in the prediction for the current frame.
There are models like RVM — Robust Video Matting — that use recurrent neural networks to carry information forward. The model sees the current frame plus a hidden state from previous frames, so it can maintain a consistent understanding of what's foreground and what's background even as things move. Google's MediaPipe toolkit does something similar for video calls. And Meta released SAM — the Segment Anything Model — which started as a still-image model and has since been extended to video.
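As a rough sketch of what the recurrent approach looks like in practice, here's roughly the usage pattern described in the RVM project's README: the model takes the current frame plus a small set of recurrent states and hands updated states back for the next frame. Frame shapes and the downsample ratio below are placeholder choices.

```python
import torch

# load Robust Video Matting from torch.hub (repo and model name per the project's README)
model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3").eval()

frames = [torch.rand(1, 3, 288, 512) for _ in range(3)]   # stand-ins for consecutive video frames in [0, 1]
rec = [None] * 4                                           # recurrent states, empty at the first frame

with torch.no_grad():
    for frame in frames:
        fgr, pha, *rec = model(frame, *rec, downsample_ratio=0.25)
        # fgr is the predicted foreground, pha the alpha matte; rec carries memory forward to the next frame
```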
SAM was a big deal when it dropped. The "anything" part is doing a lot of work there.
SAM was genuinely impressive because it's promptable. You can give it points, boxes, or masks as input, and it'll segment whatever you're pointing at. And it was trained on a dataset of over a billion masks — eleven million images, each with multiple annotations. The scale of that dataset is what makes it so generalizable. It handles objects it's never seen before because it's seen so many different kinds of objects during training.
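The "promptable" part looks roughly like this with Meta's segment-anything package. The checkpoint path, the image, and the click coordinates are all placeholders; the idea is that you point at the subject and ask for a mask:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# load a SAM checkpoint (the path is a placeholder for a downloaded weights file)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_checkpoint.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in for an HxWx3 RGB photo
predictor.set_image(image)

# "prompt" the model with a single foreground click on the subject
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),        # 1 = this point is foreground
    multimask_output=True,             # return a few candidate masks with confidence scores
)
```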
Daniel's plushie problem would be less of a problem for SAM, because it's seen enough variety to handle fuzzy stuffed animals.
But SAM is heavy — it's a big model. The RembG tool Daniel mentioned, that's more of a packaged solution. RembG, which as Daniel noted is clearly short for remove background — I love when naming is that literal — RembG is essentially a wrapper around U-two-Net and similar models. It provides a simple API or command line interface. You throw an image at it, you get a transparent PNG back. It's become the go-to for people doing LoRA training precisely because it's simple and reliable.
It handles the alpha channel properly, which is not trivial. A lot of cheaper implementations just composite onto white and call it a day.
Right, and if you're training a LoRA, you need real transparency, not just a white background that looks transparent. The alpha channel actually matters for the training pipeline. RembG and tools like it — there's also remove dot bg, which is a commercial API that uses a proprietary model — they all output proper RGBA images with the alpha channel intact.
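In code, the rembg workflow is about as small as it gets. A minimal sketch, assuming an input photo on disk:

```python
from PIL import Image
from rembg import remove

photo = Image.open("plushie_photo.jpg")        # hypothetical input
cutout = remove(photo)                         # returns an RGBA image with a real alpha channel

print(cutout.mode)                             # "RGBA"
cutout.save("plushie_cutout.png")              # PNG preserves the transparency
```

There's also a one-line command-line interface (`rembg i input.jpg output.png`) if you're batch-processing a folder of shots.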
Let's connect this back to puppetry, because Daniel's prompt had this interesting thread about visibility. He mentioned that a key moment in puppetry's evolution as a serious art form was the decision to make the puppeteer visible rather than hidden.
This is something I actually know about from my time in Storrs. The Ballard Institute and Museum of Puppetry — which is part of UConn — was founded by Frank Ballard, who was the first person to establish a puppetry degree program at an American university. That was in the nineteen sixties. Before Ballard, puppetry was seen as children's entertainment or folk art. He treated it as a serious theatrical discipline.
He's the one who pushed for visible puppeteers?
Not exclusively — the visible puppeteer tradition has roots in Japanese Bunraku, where the puppeteers are on stage in black robes, and in various experimental theater movements. But Ballard absolutely championed it in the American context. The argument was that hiding the puppeteer behind a curtain creates a kind of illusion that's inherently limited — you're trying to convince the audience the puppet is alive, which is always going to fail at some level. But if you put the puppeteer on stage, visible, the audience isn't asked to believe an illusion. They're asked to engage with the craft.
It's a Brechtian move. Don't pretend — show the strings, show the hands, and let the audience appreciate the skill.
And that shift — from hiding the mechanism to celebrating it — is what elevated puppetry in the eyes of theater critics and arts institutions. It became about the relationship between puppeteer and puppet, the visible effort, the collaboration between human and object. The puppet isn't pretending to be alive — it is alive because the puppeteer is visibly breathing life into it.
Which is a fascinating parallel to what Daniel's doing with AI characters. He's not trying to convince anyone that his plushie LoRAs are real. He's making the process visible. The strings are the training data, the background removal, the model architecture.
There's something honest about that approach. When you see a behind-the-scenes video of how an AI character was made, you appreciate it differently than if someone just presented the output and said "look what the AI did." The human decisions — what backgrounds to use, how to frame the shots, what data to include — those are the strings.
I think that's why Daniel framed the prompt this way. He's not just asking for a technical explainer on background removal. He's saying: here's the craft. Here are the decisions. The technology is the puppet, and I'm the puppeteer trying to figure out how visible I should be.
Which brings us back to the green screen point. In traditional puppetry for film, you'd use green screen gloves to hide the puppeteer's hands. That's the equivalent of chroma keying — you remove the mechanism by targeting a specific color. The AI approach is different. The AI doesn't need a specific color. It understands what hands look like, what puppets look like, and it can separate them semantically.
Here's the thing — sometimes you want the hands visible. Bunraku puppeteers wear black, but their hands are right there, manipulating the puppet's expressions. The audience sees the craft. So the question isn't just "how do we remove the background?" — it's "what do we want the audience to see?"
That's a creative decision, not a technical one. The tools give you options. You can strip everything but the subject. You can keep a natural background. You can use a gray background for training and then composite onto anything you want later. The technology doesn't dictate the aesthetic — it enables choices.
Daniel mentioned he's been experimenting with fishing wire to make three-D models of us. That's such a puppetry move — literally rigging characters for animation.
Fishing wire is the poor man's armature wire. Puppet builders have been using it for decades. The fact that he's physically constructing models and then using those as the basis for digital LoRAs — that hybrid workflow is interesting. It's not pure digital generation. There's a physical object that gets photographed, and those photographs become training data.
Which loops back to the background removal question. If you're photographing a physical plushie against different backgrounds, you need clean extractions to build a consistent dataset. Any background contamination and your LoRA learns the wrong thing.
This is where the technical details really matter. When you're training a LoRA — which is a low-rank adaptation, essentially a small set of weights that get added to a base model — you're typically working with ten to thirty images. That's not a lot of data. Every image counts. If three of your thirty images have messy edges where the background wasn't fully removed, the model will learn those artifacts.
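A minimal sketch of the low-rank idea, assuming a PyTorch linear layer standing in for one of the base model's weight matrices: the original weights stay frozen, and only the two small matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # the base model is untouched
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts at zero, so training starts from the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

adapted = LoRALinear(nn.Linear(768, 768), rank=8)        # only a tiny fraction of the layer's parameters are trainable
```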
The quality of your background removal directly determines the quality of your character model.
And it's not just about removing the background — it's about how you remove it. A hard edge looks unnatural. A feathered edge can look blurry. The best tools do what's called alpha matting, which estimates partial transparency at the boundary. A single pixel at the edge of a plushie might be sixty percent foreground and forty percent background. That's what gives you realistic compositing later.
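The per-pixel math at that boundary is just a weighted blend. A quick sketch of a single edge pixel that's sixty percent plushie, with stand-in colors:

```python
import numpy as np

alpha = 0.6                                    # 60% foreground at a fuzzy edge pixel
fur = np.array([180.0, 150.0, 110.0])          # stand-in RGB for the plushie's fur
new_bg = np.array([40.0, 90.0, 200.0])         # stand-in RGB for whatever you composite onto

pixel = alpha * fur + (1 - alpha) * new_bg     # classic compositing equation: C = a*F + (1 - a)*B
```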
Alpha matting is the hard problem. Binary segmentation — foreground or background — is relatively straightforward. But real edges have transparency. Hair, fur, fuzz on a plushie — those all have pixels that are a mix.
The classic approach to alpha matting required a trimap — you'd manually label regions as definitely foreground, definitely background, and unknown. The algorithm would then estimate alpha values for the unknown region. But modern deep learning approaches can do this without the trimap. They learn to predict alpha directly from the image.
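One common way to build that trimap is to erode a rough binary mask from both sides, leaving an "unknown" band along the edge for the matting step to resolve. A small OpenCV sketch, with the band width as a knob you'd tune:

```python
import cv2
import numpy as np

mask = np.zeros((512, 512), np.uint8)          # stand-in for a rough binary mask (0 or 255)
cv2.circle(mask, (256, 256), 150, 255, -1)

kernel = np.ones((15, 15), np.uint8)           # controls how wide the unknown band is
sure_fg = cv2.erode(mask, kernel)              # shrink the foreground: definitely subject
sure_bg = cv2.erode(255 - mask, kernel)        # shrink the background: definitely not subject

trimap = np.full(mask.shape, 128, np.uint8)    # 128 = unknown, to be resolved by the matting step
trimap[sure_fg == 255] = 255
trimap[sure_bg == 255] = 0
```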
Daniel's question about gray backgrounds versus transparency — the alpha channel approach preserves that edge information, while the gray background approach bakes in a specific background color at the edges.
Which might actually be better for certain training scenarios. If you're going to composite your character onto various backgrounds later, you want clean alpha. But if you're just trying to teach the model what the character looks like, the gray background provides a neutral context that doesn't confuse the model with transparency artifacts. It's not settled which approach produces better LoRAs.
Let's talk about the video aspect, because Daniel mentioned that video background removal is still early. What's the state of the art there?
The fundamental challenge is temporal consistency. If you process each frame independently, you get flickering masks. The edges jitter. The solution is to incorporate optical flow — tracking how pixels move from frame to frame — or to use recurrent architectures that maintain a memory of previous frames.
Optical flow being the vector field that describes how each pixel moves between frames.
If you know that pixel A moved from position X to position Y, you can propagate the mask accordingly. But optical flow itself is an estimation problem — it's not perfect, especially with fast motion or occlusions. So you're stacking one AI prediction on top of another, and errors compound.
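Here's roughly what one step of that propagation looks like with OpenCV's Farnebäck flow: estimate where each pixel in the current frame came from, then pull the previous frame's mask forward accordingly. A sketch with stand-in frames, not a production pipeline:

```python
import cv2
import numpy as np

prev_gray = np.random.randint(0, 255, (480, 640), np.uint8)   # stand-ins for two consecutive grayscale frames
curr_gray = np.random.randint(0, 255, (480, 640), np.uint8)
prev_mask = np.zeros((480, 640), np.uint8)                     # mask computed for the previous frame

# backward flow: for each pixel in the current frame, where was it in the previous frame?
flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)

h, w = prev_gray.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + flow[..., 0]).astype(np.float32)
map_y = (grid_y + flow[..., 1]).astype(np.float32)

# warp the previous mask into the current frame's coordinates
propagated_mask = cv2.remap(prev_mask, map_x, map_y, cv2.INTER_LINEAR)
```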
Which is why the results are "tentatively promising," as Daniel put it.
For a talking-head video with a static camera, the results can be quite good. Runway ML has a background removal tool for video that works well in those conditions. Adobe's got something in After Effects. But if you've got a moving camera, complex motion, or multiple subjects, it gets messy fast.
Daniel's trying to do character animation, which means the characters will be moving, the backgrounds will be changing, and he needs consistent masks across potentially hundreds of frames.
That's a production challenge. The way most people handle this today is they do the background removal on keyframes, then interpolate. Or they use a tool like Rotobrush in After Effects, which propagates a mask across frames using optical flow and some machine learning. It's semi-automated — you make corrections on problematic frames, and the tool adjusts the interpolation.
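A very crude stand-in for what the interpolation step does, ignoring the optical flow and machine learning that tools like Rotobrush add on top: blend between two hand-corrected keyframe masks and re-threshold.

```python
import numpy as np

key_start = np.zeros((480, 640))               # stand-ins for two hand-corrected keyframe masks in [0, 1]
key_end = np.ones((480, 640))

def blend_masks(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Linear blend between keyframe masks; t runs from 0 at the first keyframe to 1 at the second."""
    return (1.0 - t) * a + t * b

in_betweens = [(blend_masks(key_start, key_end, i / 10) >= 0.5).astype(np.uint8) * 255
               for i in range(1, 10)]          # nine interpolated masks for the frames in between
```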
We're not at the point where you can just hit "remove background" on a video and get production-quality results.
Not reliably, no. For a LoRA training dataset, where every frame matters, you'd probably want to do frame-by-frame review and correction. It's tedious, but it's the kind of craft work that separates good character models from mediocre ones.
Which, again, is puppetry. The puppeteer doesn't just show up and wave the puppet around. There's hours of construction, rigging, rehearsal. The visible performance is the tip of the iceberg.
I think that's the deeper point Daniel's getting at. The AI tools are impressive — you can remove backgrounds with a click, generate characters from text prompts — but the craft is still there. The decisions about what data to use, how to process it, what to keep and what to discard. Those are human decisions. The tools don't eliminate the puppeteer — they change what the puppeteer does.
There's a quote I like — I think it was from a puppeteer at the Ballard Institute actually — "the puppet is not an object that moves, it's an object that is moved." The distinction matters. The AI model isn't making creative decisions. It's being moved.
The visible puppeteer tradition says: don't hide that. Show the process. Let the audience see the craft. Daniel's approach — photographing physical plushies, training LoRAs, experimenting with backgrounds — that's all process. He's not trying to pretend the characters are real. He's showing how they're made.
Let me pull on another thread Daniel mentioned. He talked about the Ballard Institute and Storrs being the "ideological epicenter of modern puppetry." You grew up in Storrs. You've got skin in this game.
I do, and I'll defend Storrs to the death. It's easy to dismiss it as a sleepy Connecticut village — UConn's there, some farms, not much else. But Frank Ballard chose Storrs for a reason. He wanted to build a serious academic program in puppetry, and UConn gave him the institutional support to do it. The Ballard Institute now houses thousands of puppets from around the world. It's a research collection, not just a museum.
You worked there, moving boxes, as Daniel pointed out.
I did, and I'll say this — Daniel's right that I undersold my contribution. I didn't just move boxes. I led tours. I helped catalog the collection. I spent hours in the archives reading Ballard's production notes. The man was meticulous. He documented everything — puppet construction, performance techniques, audience reactions. He treated puppetry as a scholarly discipline, and that's what made Storrs the center of gravity.
When Daniel talks about "unpeeling the layers of Storrs and Mansfield history," he's tapping into something real. There's depth there.
There really is. Mansfield, which is the town that contains Storrs, has this fascinating history of experimental arts. The Mansfield Training School — which was originally for people with developmental disabilities — had a theater program where residents performed with puppets. That's a complicated and sometimes troubling history, but it fed into the local puppetry culture. Ballard was aware of it. He drew on local traditions while building something new.
That's the thing about Storrs — it's a place where unlikely things happen precisely because it's not New York or LA. There's space to experiment without the pressure of commercial success.
You can fail quietly in Storrs. You can try things that wouldn't survive in a more competitive environment. That's true of puppetry, and it's true of a lot of the creative work people are doing with AI now. The interesting stuff often happens at the edges, in the places where nobody's watching.
Which brings us back to Daniel's plushie experiments. He's in Jerusalem, not Silicon Valley. He's photographing stuffed animals and training LoRAs. That's the kind of edge-case experimentation that produces novel results.
He's asking the right questions. Background removal isn't glamorous. It's not the thing that gets headlines. But it's the kind of foundational tool that determines whether your character model works or doesn't. Getting the details right — the alpha channel, the edge quality, the temporal consistency — that's the difference between something that looks amateur and something that looks professional.
Let's talk about one more technical angle, because I think it's important for understanding how these tools actually work in practice. You mentioned that U-two-Net and similar models are trained on human-annotated data. But what happens when the model encounters something it wasn't trained on?
That's the generalization problem. A model trained primarily on humans and common objects might struggle with unusual subjects — like, say, a sloth plushie photographed from an odd angle under mixed lighting. The model might misclassify parts of the plushie as background, or vice versa.
That's where the "anything" in Segment Anything Model really matters. The billion-mask dataset means it's seen enough variety to handle edge cases.
Even SAM has limits. If you give it an image of a plushie on a textured rug that's a similar color to the plushie's fur, it might struggle. The semantic understanding can only go so far — at some point, the visual signal is ambiguous. A green pixel next to a slightly different green pixel. Is that the edge of the plushie, or is it the rug?
That's exactly the kind of situation where the old chroma key approach had an advantage. If you know the background is a specific green, there's no ambiguity — you just subtract that color. The AI approach is more flexible, but it introduces uncertainty.
Which is why some people in the LoRA community still use green screens for their training data. It's a hybrid approach — use a green screen to get a clean key, then use AI to clean up the edges and handle any spill. Best of both worlds.
Daniel's question about gray backgrounds versus transparency — there's actually a third option, which is to use a controlled physical background and do a traditional key, then refine with AI.
That's probably the most reliable approach if you're serious about quality. Green screen, good lighting, a decent camera, and then run the result through RembG or a similar tool to handle the fine details. It's more work, but you get cleaner training data.
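A sketch of that hybrid idea, assuming rembg for the AI pass and a conservative chroma key for the unambiguous green: trust the key where it's certain, and let the model decide at the edges. File names and thresholds are placeholders.

```python
import cv2
import numpy as np
from PIL import Image
from rembg import remove

img_bgr = cv2.imread("greenscreen_shot.jpg")                       # hypothetical green-screen photo
hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)

# conservative key: only pixels deep inside the green range count as definite background
definitely_bg = cv2.inRange(hsv, np.array([45, 80, 80]), np.array([75, 255, 255]))

# AI pass handles fuzzy edges, spill, and anything the key is unsure about
cutout = np.array(remove(Image.open("greenscreen_shot.jpg")))      # RGBA array from rembg

cutout[..., 3][definitely_bg == 255] = 0                           # force the certain green to fully transparent
Image.fromarray(cutout).save("refined_cutout.png")
```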
The puppeteer's approach. Control what you can control, and use the tools to handle the rest.
And that's the through-line of this whole conversation. The technology changes — from green screens to U-two-Net to SAM to whatever comes next — but the craft remains. You still need to understand light, composition, edge quality. You still need to make decisions about what to show and what to hide.
I think that's the answer to Daniel's prompt, at least the philosophical part. The connection between puppetry and AI character creation isn't just a clever analogy. It's the same fundamental question: how do you bring an inanimate object to life? And how visible do you make the mechanism?
The technical part — what's actually happening when you click "remove background" — is that a neural network trained on millions of human-annotated images is making a per-pixel prediction about what's foreground and what's background, using multi-scale features to capture both fine details and overall shape. It's not magic. It's math. But it's math that was trained on human judgment.
The human is still in the loop. The puppeteer is still visible.
Even when you can't see them.
Alright, before we wrap — Hilbert's daily fun fact.
Now: Hilbert's daily fun fact.
Hilbert: In eighteen forty-seven, a British naval surgeon stationed in Belize nearly discovered the medical value of horseshoe crab blood when he noticed that wounded crabs in his collection produced a strange blue clotting agent, but he dismissed it as a "curious marine excretion" and threw the specimens overboard during a storm.
We were one storm away from a century-early medical breakthrough.
Threw them overboard. That is a very specific kind of scientific near-miss. This has been My Weird Prompts. I'm Herman Poppleberry.
I'm Corn. If you're experimenting with character LoRAs or background removal, we'd love to hear about your workflow — what's working, what's not. You can find us at myweirdprompts dot com. Thanks to our producer Hilbert Flumingtop, and thanks to Daniel for the prompt.
Until next time.