Hey everyone, welcome back to My Weird Prompts. We are hitting a pretty big milestone today, episode five hundred and forty-one. I am Corn, and as always, I am joined by my brother and resident technical deep-diver.
Herman Poppleberry, at your service. It is good to be back in the studio, Corn. Although, calling it a studio might be a bit of a stretch since we are just in our living room in Jerusalem.
It is a studio in our hearts, Herman. And speaking of our living room, our housemate Daniel sent us a really meat-and-potatoes prompt today. He has been playing around with generative AI models, specifically training LoRAs, and he had a bunch of questions about the mechanics of it all.
I love this topic. Low-Rank Adaptation, or LoRA, has completely democratized the way we interact with these massive models. You do not need a server farm anymore to teach an AI what your face looks like or how a specific architectural style should feel. Daniel is really tapping into the core of what makes current AI so personal.
Exactly. He mentioned three specific use cases he is interested in: character consistency, geographic locations, and stylistic applications like architectural renderings. There is a lot to unpack there, from image counts and resolutions to the whole debate over captions and trigger words.
And he is right to be curious, because the best practices have shifted quite a bit, especially with the newer architectures like Flux one point one and the latest Stable Diffusion iterations. It is not the same game it was even a year or two ago.
So let us start with the basics of character consistency. Daniel mentioned he used about fifty images of himself, mostly selfies in different lighting and perspectives. Is fifty the magic number, or is he overdoing it?
It is a great starting point, but the honest answer is that it depends. For a character LoRA, especially a face, you are trying to teach the model a very specific set of features that need to remain consistent. If you have fifty high-quality, diverse images, you are in a good spot. But with the flow-matching models we are using in early twenty-six, I have seen incredible results with as few as fifteen to twenty images if those images are varied enough.
Varied enough is the key phrase there, right? I mean, if Daniel takes fifty selfies in the same room with the same lighting, he is not really giving the model fifty pieces of information. He is giving it one piece of information fifty times.
Precisely. That is a classic mistake called over-fitting. The model ends up memorizing the background or the specific shadow on his nose rather than learning the essence of his face. You want what we call a diverse dataset. You want some close-ups, some mid-shots, different clothing, different expressions, and definitely different environments. If the model sees Daniel in a kitchen, a park, and a cafe, it realizes that the only constant in those images is Daniel himself. That is how it learns to separate the subject from the surroundings.
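[Show notes: a minimal sketch of the kind of diversity check Herman is describing, using a perceptual hash to flag near-duplicate shots. It assumes the Pillow and imagehash packages and a hypothetical dataset/daniel folder; the distance threshold of six is a rough rule of thumb, not a standard.]

```python
# Flag pairs of images that are visually almost identical (same room, same angle),
# so they can be culled or reshot before training.
from pathlib import Path

from PIL import Image
import imagehash

hashes = {}
for path in sorted(Path("dataset/daniel").glob("*.jpg")):
    h = imagehash.phash(Image.open(path))
    for other_name, other_hash in hashes.items():
        # Subtracting two perceptual hashes gives a Hamming distance;
        # a small distance means the shots carry nearly the same information.
        if h - other_hash <= 6:
            print(f"{path.name} is nearly identical to {other_name} (distance {h - other_hash})")
    hashes[path.name] = h
```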
Which leads perfectly into one of Daniel's other questions: background removal. He asked if it is better to use background removal or train on diverse backgrounds. What is the consensus there?
This is where the nerds really get into it. Personally, I am a fan of diverse backgrounds over artificial background removal. When you remove the background and replace it with pure white or transparency, you are essentially telling the model, this person exists in a vacuum. The problem is that when you then try to generate that person in a new scene, the model can struggle with how light should hit the subject or how the edges should blend. You get this weird cutout look, like a bad green-screen effect from the nineteen-nineties.
That makes sense. It is like the model loses the context of how a human interacts with light and space. But what about those automated tools that blur the background?
Bokeh or shallow depth of field can be helpful because it keeps the focus on the subject while still providing some tonal context for the lighting. But honestly, the best results in twenty-six come from high-quality, natural images where the captions handle the heavy lifting of telling the model what to ignore.
Okay, let us talk about those captions then. Daniel mentioned the importance of captions and trigger words. He used the example of Daniel-Rosso as a single-word trigger. Is that the right way to go?
Using a unique trigger word is standard practice. You want something that the model does not already have a strong association with. If you just used the name Daniel, the model might get confused because it already knows thousands of famous Daniels. But Daniel-Rosso, as one word, is a blank slate.
Right, but what goes into the actual caption file? Are we just writing Daniel-Rosso in every file, or are we describing the whole scene?
This is the most important part of the training process that people often gloss over. There are two main philosophies here. One is the rare token method, where you just use the trigger word. But the more robust method, especially for complex models like Flux, is descriptive natural language captioning. You want to describe everything in the image that is NOT the subject you are training.
Wait, that sounds counter-intuitive. Why describe the things you are not training?
Because you want the model to associate those other elements with their own words, so it does not accidentally bake them into your trigger word. If Daniel is wearing a red hat in ten of his fifty photos, and you do not mention the red hat in the captions, the model might start thinking that Daniel-Rosso naturally includes a red hat. By captioning it as Daniel-Rosso wearing a red hat, you are telling the model, the red hat is a separate thing, and the face is Daniel-Rosso. It is a process of mathematical elimination.
That is fascinating. So the more detail you give about the environment, the clothing, and the lighting in the captions, the more the model can isolate the actual person. It is like you are helping the model perform a subtraction.
Exactly. You are saying, here is the total image, subtract the red hat, subtract the sunny day, subtract the park bench, and what remains is the subject. This is why people use vision-language models like Florence-two or GPT-four-o-mini to auto-caption their datasets now. Doing it by hand for fifty images is exhausting, but having an AI describe the scene for another AI to learn from is very efficient.
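[Show notes: a minimal captioning loop in the spirit of what Herman describes: the trigger word names the subject, and the rest of the caption names everything that should not get baked into it. The trigger, folder path, and describe_scene helper are all illustrative; describe_scene stands in for whatever vision-language model (Florence-2, GPT-4o-mini, or similar) would actually write the description.]

```python
from pathlib import Path

TRIGGER = "daniel-rosso"  # a unique token the base model has no prior associations with


def describe_scene(image_path: Path) -> str:
    # Hypothetical stand-in for a VLM call. In practice this would return
    # something like the string below, tailored to each individual image.
    return "wearing a red hat, sitting on a park bench on a sunny afternoon"


for image in sorted(Path("dataset/daniel").glob("*.jpg")):
    caption = f"photo of {TRIGGER}, {describe_scene(image)}"
    # Most trainers expect img_001.jpg to be paired with img_001.txt in the same folder.
    image.with_suffix(".txt").write_text(caption, encoding="utf-8")
```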
It is AI all the way down. Now, Daniel also asked about resolution. He mentioned five hundred and twelve by five hundred and twelve versus higher resolutions. Given that we are in early twenty-six, what is the standard now?
If you are training on a modern base model, five hundred and twelve is ancient history. You really want to be at one thousand and twenty-four by one thousand and twenty-four at a minimum. Most of the high-end trainers, like the ones on Replicate or Civitai that Daniel mentioned, handle this internally with aspect ratio bucketing. The higher the resolution, the more detail the model can pick up on things like skin texture or the specific weave of a fabric. If you train at a low resolution and then try to generate a high-resolution image, you lose that fine-grained consistency.
And what about aspect ratios? Does everything need to be a square?
Not anymore. Modern trainers support bucketed training, which means you can feed it portraits, landscapes, and squares, and it will group them together. This is actually better because it teaches the model how the subject looks in different frame compositions.
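[Show notes: a sketch of what aspect-ratio bucketing does under the hood. Each image gets a target width and height that keep roughly the same pixel area as one thousand and twenty-four squared while snapping to multiples of sixty-four. The area and step size are common defaults rather than a spec, and real trainers also group images by bucket for batching.]

```python
TARGET_AREA = 1024 * 1024  # keep total pixel count near a 1024x1024 square
STEP = 64                  # most latent-diffusion trainers want dimensions on a 64px grid


def bucket_for(width: int, height: int) -> tuple[int, int]:
    aspect = width / height
    # Solve w * h ~= TARGET_AREA with w / h ~= aspect, then snap to the grid.
    bucket_w = int(round((TARGET_AREA * aspect) ** 0.5 / STEP) * STEP)
    bucket_h = int(round((TARGET_AREA / aspect) ** 0.5 / STEP) * STEP)
    return bucket_w, bucket_h


print(bucket_for(4000, 3000))  # landscape photo -> (1152, 896)
print(bucket_for(3000, 4000))  # portrait photo  -> (896, 1152)
print(bucket_for(2048, 2048))  # square photo    -> (1024, 1024)
```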
Let us pivot to his second use case: geographic locations. This one is really interesting to us because we live in Jerusalem, and as Daniel pointed out, if you just prompt for Jerusalem in a base model, you almost always get the Dome of the Rock. It is like the AI thinks the entire city is just that one landmark.
It is a classic case of dataset bias. Most of the photos of Jerusalem in the original training sets are tourist photos or news photos, which focus on the Old City. Daniel wants to capture the vibe of everyday Jerusalem: the specific texture of the Jerusalem stone, the way the light hits the buildings in the late afternoon, the narrow streets of Nachlaot.
So if he is training a LoRA for a location, how does that differ from training a person? I imagine the dataset needs to be much larger.
It does. For a specific person, you are focusing on a very small set of features. For a location or a vibe, you are trying to capture a whole aesthetic. You probably want closer to eighty or even one hundred and fifty images for a really robust location LoRA. And instead of just one trigger word, you might want to use a set of consistent descriptive words in your captions, like Jerusalem-Stone-Style.
And I guess the variety becomes even more important there. You need Jerusalem at night, Jerusalem in the rain, Jerusalem in the summer heat.
Right. And you have to be careful not to include too many unique landmarks unless you want those landmarks to appear every time. If every photo in your Jerusalem LoRA has a specific street sign in it, that street sign is going to show up in every generation. You want to capture the common denominators: the arched windows, the specific type of greenery, the way the balconies look.
Daniel mentioned something philosophical here: that his version of Jerusalem is different from a Palestinian resident's version or an ultra-orthodox resident's version. That really highlights how LoRAs are a form of personal perspective, right?
It is a subjective lens. A LoRA is essentially a mathematical representation of a specific point of view. By training his own, Daniel is effectively saying to the AI, forget what the internet thinks Jerusalem looks like, this is what it looks like to me. That is incredibly powerful for storytelling. You are no longer at the mercy of the average of the internet.
It makes me think about the third use case he mentioned: stylistic applications for architectural renderings. His wife is an architect, and she wants to use LoRAs to keep her renderings consistent with her firm's specific style. This feels like a more professional, high-stakes version of the location LoRA.
It is. In architecture, style is everything. It is about the way light interacts with materials, the specific color palette, the choice of furniture. If a firm has a signature style, say, a specific type of Mediterranean brutalism, they can train a LoRA on their past successful projects.
I can see how that would save a huge amount of time. Instead of trying to describe that style in a fifty-word prompt every time and hoping the AI gets it right, you just trigger the LoRA.
Exactly. It moves the effort from the prompting stage to the training stage. Once the LoRA is dialed in, the prompting becomes much simpler. You just say, a library in the style of My-Firm-Style, and the LoRA handles the heavy lifting of the aesthetics.
But Daniel asked about diminishing returns on dataset size. At what point are you just wasting your time adding more images to the architectural LoRA?
There is definitely a plateau. For a style LoRA, once you hit around one hundred and fifty to two hundred high-quality images, adding more often leads to diminishing returns unless those new images are bringing something truly different to the table. If you add another fifty images that look exactly like the first hundred, you are just increasing the risk of over-training.
Over-training is when the model becomes too rigid, right? Like it can only generate exactly what it saw in the training data?
Precisely. It loses its ability to be creative or to adapt to new prompts. If you over-train an architectural LoRA, it might only be able to generate buildings that look exactly like the ones in the photos, and it will struggle if you ask it to design something in a different shape or size. You want the model to learn the rules of the style, not just memorize the images.
So how do you find that sweet spot? Is it just trial and error?
To some extent, yes. But that is why we talk about epochs and steps. During training, the model passes through the dataset multiple times. Usually, you save a version of the LoRA, a checkpoint, every few hundred steps. Then you test them. You might find that at step one thousand, the style is there but it is a bit weak. At step two thousand, it is perfect. At step three thousand, it has become too stiff and started to break.
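[Show notes: a sketch of the checkpoint-testing loop Herman describes, assuming the diffusers library, a Flux-style base model, and a hypothetical lora_checkpoints folder holding the saved steps. Keeping the prompt and seed fixed means the only thing that changes between images is the checkpoint itself.]

```python
from pathlib import Path

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "photo of daniel-rosso as an astronaut, dramatic lighting"
checkpoints = [
    "lora-step-1000.safetensors",
    "lora-step-2000.safetensors",
    "lora-step-3000.safetensors",
]

for ckpt in checkpoints:
    pipe.load_lora_weights("lora_checkpoints", weight_name=ckpt)
    # Same prompt, same seed: differences between outputs come from the checkpoint alone.
    image = pipe(prompt, generator=torch.Generator("cpu").manual_seed(42)).images[0]
    image.save(f"test_{Path(ckpt).stem}.png")
    pipe.unload_lora_weights()
```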
It sounds like a lot of baking. You have to keep checking the oven to make sure the cake hasn't burnt.
That is exactly what it is like. And for someone like Daniel's wife, who is a perfectionist, she will probably want to test those checkpoints very carefully. The difference between a good LoRA and a great one is often just finding the right stopping point in the training.
Let us go back to the technical side for a second. Daniel mentioned that a LoRA is usually a safe-tensors file that you plug into something like Comfy-U-I. Can you explain, in a way that doesn't require a computer science degree, what is actually happening when that LoRA is added to the base model?
Sure. Think of the base model as a massive library of knowledge. It knows how to draw almost anything, but it is a generalist. It knows what a person looks like, it knows what a building looks like. When you add a LoRA, you are not rewriting the whole library. You are adding a very small, very specific supplement to the end of certain books.
Like a specialized index or a set of sticky notes?
Exactly. The LoRA contains small matrices of numbers that intercept the signals as they pass through the model. When the model is trying to draw a face, the LoRA nudges those signals and says, hey, make the eyes a little more like this, or make the jawline a little more like that. Because the LoRA is low-rank, meaning the whole adjustment is stored in a tiny pair of matrices with very few trainable parameters, it is very efficient. It doesn't break the model's general knowledge, it just refines it in a very specific direction.
That is why you can use a Daniel LoRA and still ask for him to be an astronaut or a medieval knight. The base model knows what an astronaut looks like, and the LoRA just tells it how to make that astronaut look like Daniel.
Exactly. It is a collaborative process between the base model's general intelligence and the LoRA's specific expertise.
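[Show notes: a toy-sized version of the sticky-note picture, in plain NumPy. The frozen base weight W stays untouched; the LoRA contributes two small matrices A and B whose product is added on top, scaled by alpha over rank. Real layers are thousands of units wide; eight by eight is just to keep the arithmetic visible.]

```python
import numpy as np

d, rank, alpha = 8, 2, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))             # frozen base-model weight matrix
A = 0.01 * rng.normal(size=(rank, d))   # trained LoRA "down" projection
B = 0.01 * rng.normal(size=(d, rank))   # trained LoRA "up" projection

# The effective weight the model uses with the LoRA applied.
W_adapted = W + (alpha / rank) * (B @ A)

# The nudge is small and low-rank, so the general knowledge stored in W survives.
print("values in W:", W.size)                  # 64
print("values in the LoRA:", A.size + B.size)  # 32 here; the savings grow with d
```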
You mentioned something earlier about regularization images. We should probably explain those, because they are often the missing piece for people who are struggling with their training.
Good catch. Regularization images are basically a way to prevent the model from forgetting what a general version of your subject looks like. If you are training a LoRA on Daniel, you might also include a set of images of just general men.
Why would you do that? Wouldn't that confuse the model?
It is actually the opposite. It provides a baseline. It says to the model, here is what a generic man looks like, and here is what Daniel looks like. This helps the model identify exactly what makes Daniel unique. It also prevents the model from drifting. Without regularization, if you train too hard on one person, the model might start thinking that every person should look a little bit like that person.
Oh, I have seen that! Where you use a LoRA and suddenly every character in the background starts to have the same nose as the subject.
Exactly. That is called model bleed. Regularization images help keep the LoRA in its own lane. However, with some of the newer training techniques for models like Flux, the need for regularization has actually decreased because the models are much better at isolating concepts now. But for older models, it is still a vital part of the process.
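[Show notes: a sketch of one common way regularization images get wired in, using the folder convention that kohya-based trainers read, where the repeat count, trigger word, and class word are encoded in the directory names. The paths, repeat counts, and class word here are illustrative assumptions; other trainers take regularization images through config files instead.]

```python
from pathlib import Path
import shutil

trigger, class_word, repeats = "daniel-rosso", "man", 10

subject_dir = Path(f"train/img/{repeats}_{trigger} {class_word}")  # photos of the subject
reg_dir = Path(f"train/reg/1_{class_word}")                        # generic "man" images
subject_dir.mkdir(parents=True, exist_ok=True)
reg_dir.mkdir(parents=True, exist_ok=True)

for src in Path("dataset/daniel").glob("*.jpg"):
    shutil.copy(src, subject_dir / src.name)
for src in Path("dataset/generic_men").glob("*.jpg"):
    shutil.copy(src, reg_dir / src.name)
```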
So, for Daniel's specific use cases, what would be your top three practical takeaways for him?
First, for the character LoRA, focus on quality and diversity over pure quantity. Thirty amazing, varied photos are better than a hundred repetitive ones. Second, spend the time on your captions. Use an AI to help you, but make sure you are describing the stuff you don't want the model to associate with the trigger word. And third, for the style and location stuff, don't be afraid to experiment with the rank and alpha settings.
Rank and alpha, here we go. Give us the thirty-second version of what those are.
Rank is basically the capacity of the LoRA. A higher rank means the LoRA can store more information, but it also makes the file larger and increases the risk of over-fitting. For a face, a rank of sixteen or thirty-two is usually plenty. For a complex architectural style, you might go up to sixty-four or even one hundred and twenty-eight. Alpha is a scaling factor on how strongly the LoRA's learned changes get applied. Usually, you set alpha to half of the rank or equal to the rank. It is like the volume knob for the training.
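[Show notes: a back-of-the-envelope comparison of what rank buys you, assuming a single square attention projection of three thousand and seventy-two units, a plausible size for a Flux-class model rather than a published spec.]

```python
d = 3072
full_layer = d * d  # roughly 9.4 million values if you fine-tuned the whole matrix

for rank in (16, 32, 64, 128):
    lora_params = 2 * d * rank  # the A and B matrices together
    print(f"rank {rank:3d}: {lora_params:,} LoRA parameters "
          f"({lora_params / full_layer:.1%} of the full layer)")
```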
Okay, that is a lot of technical detail, but I think it really helps demystify what is happening under the hood. It is not just magic, it is math and careful curation.
It really is. And I think the most exciting part is what Daniel mentioned about the architectural renderings. We are seeing this shift where AI isn't just a toy for making weird images, it is becoming a professional tool that reflects a specific designer's voice.
It is about control. The early days of AI were all about the slot machine aspect: you pull the lever and see what you get. Now, with LoRAs, we are moving into the era of the paintbrush. You are deciding the colors, the strokes, and the subject matter.
Well said, Corn. And honestly, if Daniel's wife starts using these for her firm, I would love to see the results. Maybe she can even train a LoRA on our house so we can see what it would look like if we actually renovated the kitchen.
I think even AI might struggle with the reality of our kitchen, Herman. But hey, it is worth a shot.
Before we wrap up, I want to touch on one more thing Daniel asked about: the diminishing returns. He mentioned that he was flying blind on how many photos he needed. I think it is important for people to realize that the quality of the photos matters far more than the quantity. One blurry, low-light photo can actually do more damage to a LoRA than ten good ones can do to help it.
That is a great point. It is the old garbage in, garbage out rule. If you give the model junk data, it is going to learn junk patterns.
Exactly. If you have fifty photos but ten of them are out of focus or have weird lighting that obscures your features, just delete those ten. Your LoRA will be better for it. The model is looking for patterns, and if you give it inconsistent patterns, it gets confused.
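[Show notes: a quick way to act on the advice to delete weak photos. Variance of the Laplacian is a crude but widely used sharpness score, so low values flag images worth a second look. It assumes OpenCV and the same hypothetical dataset folder; the threshold depends heavily on image size and content.]

```python
from pathlib import Path

import cv2

THRESHOLD = 100.0  # rough starting point; tune it against your own photos

for path in sorted(Path("dataset/daniel").glob("*.jpg")):
    gray = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    if sharpness < THRESHOLD:
        print(f"{path.name}: sharpness {sharpness:.0f} -- consider dropping this one")
```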
This has been a really deep dive, and I hope it helps Daniel and anyone else out there who is trying to fine-tune their own corner of the AI universe. It is a fascinating time to be doing this.
It really is. We are in the middle of a creative explosion, and the tools are getting better every single week.
Well, if you have been enjoying our deep dives into the weird and wonderful world of AI and beyond, we would really appreciate it if you could leave us a review on your podcast app. It genuinely helps other people find the show and keeps us motivated to keep digging into these topics.
Yeah, it really does make a difference. We see every one of them.
You can find all five hundred and forty-one of our episodes at my-weird-prompts dot com, where we also have an R-S-S feed for you subscribers. And of course, we are on Spotify. If you have a prompt you want us to tackle, there is a contact form on the website. We love hearing from you.
Thanks to Daniel for the great prompt. It is always fun to talk about what is happening in our own living room.
Absolutely. This has been My Weird Prompts. I am Corn.
And I am Herman Poppleberry.
We will catch you in the next one. Goodbye!
Bye everyone!