Imagine you ask a friend the same question one hundred times. If they are a reasonably consistent person, you probably expect the same answer, or at least a very similar one, every single time. But with AI, that is not how the world works. You can ask an LLM the same question twice and get two different answers. It is like a digital coin flip that happens millions of times a second. Today, we are diving into the heart of that chaos. We are looking at the fundamental tension between AI’s probabilistic nature and our desperate engineering desire for deterministic, predictable outputs. Can we actually force these models to be consistent, or is variability just baked into the silicon?
Herman Poppleberry here, and Corn, you have hit on the trillion dollar question for anyone actually building with this stuff. In the lab, a little bit of creativity and "vibe" is great. But in a production pipeline where you need a specific JSON object to trigger a bank transfer or a medical record update, "vibes" are a liability. We want machines that act like calculators, but we have built machines that act like jazz musicians.
Well, before we get too deep into the music, I should mention that today’s episode is powered by Google Gemini three Flash. It is the model behind the curtain for this specific script. And speaking of prompts, Daniel sent us a great one to get us moving. He writes: AI models are probabilistic. But how close to being deterministic can we force them to be? For example, if we constrain temperature on a structured output workflow on an instructional model, how close can we get to being able to get a replicable result for a given prompt? Or is variability simply innate to the technology?
Daniel is touching on the holy grail of AI engineering. And the short answer is that while we can narrow the path, the "innate" variability is much deeper than most people realize. It is not just about the settings you choose in your API call; it is about the way electricity flows through a GPU and how floating-point math actually works at scale.
So, let’s start with the basics of that "narrowing the path." Most people who have tinkered with an API know about Temperature. The general wisdom is: set Temperature to zero if you want the model to stop being "creative" and just give you the facts. In theory, Temperature zero means "greedy decoding," right? The model looks at the probability of the next token and just picks the one with the highest score. If the word "The" has a sixty percent probability and "A" has a twenty percent probability, it picks "The" every single time. So, why isn't that enough to make it deterministic?
Because the probabilities themselves are not static. This is the "Temperature zero fallacy." We assume the model’s internal math produces the exact same probability scores every time we run the same prompt. But in a modern, high-scale production environment, those scores can shift. Even a tiny shift in the tenth decimal place can flip the "winner" of that token competition. And because LLMs are auto-regressive—meaning each token chosen becomes part of the prompt for the next token—that one tiny flip at the start cascades into a completely different sentence by the end.
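The mechanics Corn and Herman just described can be sketched in a few lines: softmax sharpening as temperature drops, and the greedy argmax that "temperature zero" really means. The scores here are made up for illustration.

```python
import math

def softmax(logits, temperature):
    # Scale logits by temperature before normalizing; as temperature
    # shrinks, the distribution sharpens toward the single highest logit.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores, "The" in the lead.
scores = {"The": 2.0, "A": 0.9, "One": 0.1}

sharp = softmax(list(scores.values()), temperature=0.2)
flat = softmax(list(scores.values()), temperature=2.0)
print(round(sharp[0], 3), round(flat[0], 3))  # low temperature piles mass on "The"

# "Temperature zero" skips sampling entirely: greedy decoding is just argmax,
# so identical scores always yield the identical token.
greedy = max(scores, key=scores.get)
print(greedy)  # -> The
```

The catch, as the next exchange explains, is that the scores themselves are not guaranteed to be identical from run to run.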
Wait, why would the math change? It is the same weights, the same input tokens, the same code. If I add two plus two on a calculator, it does not occasionally give me four point zero zero zero zero zero one because it is "feeling different" today. What is happening inside the GPU that makes the math wiggle?
It comes down to something called floating-point non-associativity. In basic math, we are taught that the order of addition does not matter. A plus B plus C is the same as C plus B plus A. But in high-performance computing on a GPU, where you are adding thousands of numbers simultaneously across parallel threads, the order actually does matter for the final rounding of those tiny decimals. Depending on which thread finishes first, or how the GPU kernels are scheduled, the sequence of additions changes.
So you are saying the hardware is literally racing itself, and whoever wins the race slightly changes the rounding of the result?
Essentially, yes, that is the mechanism. It is called numerical drift. When you are doing massive matrix multiplications, these tiny rounding errors accumulate. If two tokens are neck-and-neck in probability—say, one is zero point four five six seven eight and the other is zero point four five six seven nine—a tiny change in how the GPU summed the weights can flip them. Suddenly, your "deterministic" model picks a different word.
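The non-associativity Herman describes is easy to demonstrate even on a CPU, using ordinary double-precision floats:

```python
# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same numbers, different order, different bits
```

The difference is only in the last bit or two, but when two candidate tokens are nearly tied, the last bit is exactly where the winner is decided.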
That feels like a betrayal of the word "computer." We think of computers as these rigid, logical boxes, but you are describing something that is almost biological in its inconsistency. Is this just an OpenAI problem, or is it universal?
It is universal to how we currently do parallel computing. There is research from twenty twenty-five, specifically from the Thinking Machines Lab, looking at "batch invariance." In a production setting, your request is rarely processed alone. It is batched with other users' requests to save money and time. If your prompt is batched with five other requests in one run, but fifty requests in the next, the memory alignment on the GPU changes. The way the kernels execute those parallel sums changes. That shifts the floating-point results, which shifts the probabilities, which flips the tokens.
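A toy simulation of the batching effect: summing the same numbers with different chunk sizes stands in for a reduction whose grouping depends on batch layout. The values are chosen to make the rounding difference blatant rather than subtle.

```python
def chunked_sum(xs, chunk):
    # Simulate a parallel reduction: each "thread" sums its own chunk,
    # then the partial sums are combined. Changing the chunk size changes
    # the grouping of additions, the software analogue of a new batch layout.
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

# Extreme magnitudes make the rounding difference visible, not subtle.
xs = [1e16, 1.0, -1e16, 1.0]

print(chunked_sum(xs, 2))  # -> 0.0 (each 1.0 is absorbed into a huge partial)
print(chunked_sum(xs, 4))  # -> 1.0 (the trailing 1.0 survives left-to-right)
```

In real inference the discrepancies are in the last few bits rather than a whole unit, but the principle is identical: same inputs, different grouping, different answer.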
So my prompt’s output might change just because some guy in another country decided to ask the AI for a poem at the same time mine was being processed?
Precisely. Your output is technically linked to the total workload of the server at that exact millisecond. It is what researchers call "non-deterministic by-products of optimization." If providers wanted to make it truly deterministic, they would have to disable a lot of these parallel optimizations. But doing that makes the inference two to five times slower and much more expensive. For a company like OpenAI or Google, the cost of absolute determinism is a massive hit to their profit margins and user experience.
That is wild. We are basically trading accuracy for speed, but at a level so granular most people don't even know it is happening. I remember seeing OpenAI’s documentation about their "seed" parameter and the "system fingerprint." They basically say it is "best effort." It is like they are giving you a "maybe" button.
It is a "best effort" because even if they keep the seed the same, if they upgrade the hardware from an A-one-hundred to an H-one-hundred, or even just update the software kernel that manages the GPU, the math changes. The "system fingerprint" is their way of telling you, "Hey, the underlying hardware or software changed, so don't expect your old seeds to work anymore."
Okay, so we have established that the hardware is working against us. But Daniel also asked about structured output workflows. This is a huge trend right now—using JSON schemas to force the AI to return data in a specific format. If we use a very rigid schema, does that act as a leash? Does it force the model back into a deterministic corner because it simply has fewer "legal" tokens to choose from?
It definitely helps, but it is more like a guardrail than a leash. When you use structured outputs, the system uses "constrained decoding." Essentially, at each step, it masks out any tokens that would break the JSON format. If a comma is required next, it literally won't let the model pick a letter. This significantly reduces the "solution space," which naturally reduces the chances for variability. If there is only one logical way to fill a JSON field, the model will likely hit it every time.
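The token masking Herman describes can be sketched in miniature. The scores and the "legal" set are hypothetical; real implementations derive the allowed set from a grammar or JSON schema at every step.

```python
import math

def constrained_greedy(logits, allowed):
    # Constrained decoding: send every grammar-forbidden token's logit
    # to -inf, then pick greedily from whatever remains legal.
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)

# Hypothetical scores at a step where the JSON grammar requires either
# a comma or a closing brace. "hello" would win unmasked, but it is illegal.
logits = {'"': 3.1, ",": 2.4, "}": 1.9, "hello": 4.0}
allowed = {",", "}"}

print(constrained_greedy(logits, allowed))  # -> ,
```

Note that the mask only guarantees structure. Within a free-text string field, every token is legal, which is exactly where the drift sneaks back in.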
But if the field is a "description" or a "summary," we are right back in the soup, aren't we?
We are. The structure is deterministic, but the content inside the strings is still subject to all that GPU drift we talked about. You might get the exact same JSON keys every time, but the value for "summary" might start with "The" in run one and "This" in run two. And once that first word flips, the rest of the summary will likely diverge.
I’ve seen some research on this—I think it was from December twenty-four—that looked at Mixtral and GPT models on math tasks. Even at Temperature zero, the "Total Agreement Rate"—meaning getting the exact same string every time—was often zero percent for complex tasks. Zero. As in, it never happened twice.
That study is fascinating. They found that while the "Answer Agreement Rate"—like, did it eventually pick "Option B" in a multiple-choice question—remained high, the actual reasoning steps it wrote down to get there varied wildly. This creates a huge problem for "vibe coding" or automated agents. If the logic it uses to reach an answer changes every time, you can't reliably debug it. You might have a prompt that works perfectly for a week, and then suddenly, because the server load spikes or the model is batched differently, it takes a logical shortcut that breaks your downstream code.
It is like trying to build a skyscraper on top of a foundation made of Jell-O. It looks solid until the temperature in the room changes. So, if we can't trust the model to be deterministic, how are people actually building reliable software with this? If I’m a developer and I need to extract data from an invoice, and I need it to be the same every time I re-run that invoice, what is the play?
You have to move the determinism "downstream." This is the big shift in AI engineering right now. You stop trying to force the model to be a deterministic calculator and you start treating it as a probabilistic engine with a deterministic post-processor. For example, instead of just taking the LLM’s word for it, you have it output its reasoning into a structured format, and then you use traditional, "boring" code to validate and normalize that output.
So, the LLM is the messy intern who gathers the data, but you have a very strict manager—the code—who checks the work against a set of hard rules.
That is a great way to put it. Think about a fintech company processing invoices. They might use an LLM to find the "Total Amount" on a messy PDF. The LLM might return "one hundred dollars" or "one hundred point zero zero" or "$100." The variability is there. But the post-processing code takes those strings, strips the symbols, converts them to a float, and ensures it matches a specific regex. The final output into the database is deterministic, even if the LLM's path to get there was slightly different each time.
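A minimal sketch of that deterministic post-processor, assuming the invoice example above. The word-form lookup is a hypothetical shortcut; a production system would use a proper number parser.

```python
import re

def normalize_amount(raw: str) -> float:
    # Whatever string variant the model returns, the downstream
    # value is always the same float, or a hard error.
    text = raw.strip().lower()
    # Hypothetical handling for a couple of spelled-out forms.
    words = {"one hundred": 100.0, "one hundred point zero zero": 100.0}
    if text in words:
        return words[text]
    # Strip currency symbols and separators, then validate and cast.
    cleaned = re.sub(r"[^0-9.]", "", text)
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        raise ValueError(f"unparseable amount: {raw!r}")
    return float(cleaned)

for variant in ["$100", "100.00", "one hundred"]:
    print(normalize_amount(variant))  # -> 100.0 each time
```

The point is that variability gets quarantined: the model can phrase the amount three different ways, and the database still sees exactly one value.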
That makes sense for data extraction, but what about something like code generation? If I’m using a tool like Claude Code or Gemini to build an app, and the model suddenly decides to use a different library or a different naming convention because of a GPU rounding error, that could break my entire build.
That is the "vibe coding" risk Daniel mentioned in his notes. This is where "idempotency" becomes a nightmare. In traditional software, an idempotent operation is one that can be performed multiple times without changing the result beyond the initial application. With LLMs, true idempotency is almost impossible to guarantee at the prompt level. The industry is moving toward "caching" as a solution. If you send the exact same prompt and the exact same settings, the API provider might just serve you the cached result from their database rather than re-running the model. That gives the illusion of determinism.
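The caching idea reduces to keying on the exact prompt plus every sampling parameter. This is a generic sketch, not any provider's actual implementation; `flaky_model` is a stand-in that returns something different on every real invocation.

```python
import hashlib
import json

_cache = {}

def cached_generate(prompt: str, params: dict, model_call) -> str:
    # Key on the exact prompt plus all sampling parameters. One changed
    # character produces a new key and a fresh (possibly different) run.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, "params": params},
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = model_call(prompt, params)
    return _cache[key]

# Stand-in for a nondeterministic model: every real call gives a new string.
counter = {"n": 0}
def flaky_model(prompt, params):
    counter["n"] += 1
    return f"answer-{counter['n']}"

a = cached_generate("2+2?", {"temperature": 0}, flaky_model)
b = cached_generate("2+2?", {"temperature": 0}, flaky_model)
print(a == b)  # -> True: the second call is a replay, not a re-run
```

As the hosts note next, this is an illusion of determinism: identical only because the model never ran a second time.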
Ah, so it is not that the model became consistent, it is just that they are showing us a recording of the first time it worked. That is clever, but it feels like a cheat. What happens when the cache expires or I change one character in the prompt?
Then the "recording" stops, and you are back in the jazz club, hoping the drummer hits the same beat. And this leads to a second-order effect that I think is really dangerous: brittle failure modes. When you constrain a model too tightly—say, with a very complex JSON schema and Temperature zero—you sometimes trade "variability" for "hallucination."
How so?
If the model is having one of those "drift" moments where its internal math is pushing it toward a word that isn't allowed by your schema, but it can't find a "legal" word that also makes sense, it might just start making things up to satisfy the structure. It will prioritize "valid JSON" over "accurate information." You get a perfectly formatted object that is completely wrong.
It is the digital equivalent of a student who doesn't know the answer to a multiple-choice question but knows they have to bubble in something to pass the test. They’ll pick 'C' just to move on.
And that is why the "QA for probabilistic systems" is so much harder than traditional software testing. In the old days, you wrote a unit test: "If Input is X, Output must be Y." In the LLM world, your test has to be: "If Input is X, Output must be Y... ninety-eight percent of the time, within a certain semantic range, and follow these twelve structural rules."
We actually talked about this a bit in Episode nineteen thirty-two, about how you can't just use standard unit tests. You have to use "evals," which are basically other, bigger AI models grading the smaller model’s homework. It is AIs all the way down.
It really is. And for the listeners who are building things, the takeaway here is: do not trust Temperature zero. It is a useful tool, but it is not a guarantee. If your business logic depends on the model being one hundred percent consistent, you are setting yourself up for a "production ghost"—a bug that only appears when the server load is high or when the provider switches GPU clusters, and you'll never be able to replicate it in your local environment.
That sounds like a horror story for a developer. "It worked on my machine" becomes "It worked at three in the morning when the latency was low."
It really is! There was a case study I read about a company that was using LLMs to categorize customer support tickets. They had a prompt that was ninety-nine percent accurate in testing. But when they launched, the accuracy dropped to eighty-five percent. They realized that in testing, they were sending requests one by one. In production, their system was batching them. That batching changed the floating-point math just enough to flip the classification of "edge case" tickets.
So, what is the practical advice? If I’m Daniel, and I’m looking at these structured output workflows, how do I actually test for this? Do I just run the same prompt one hundred times and see what happens?
Honestly? Yes. That is called "variance testing," and it should be part of every AI deployment pipeline. You take your most important prompts and you run them through a "Monte Carlo" style test. Run it fifty times at Temperature zero. If you get fifty identical results, great. If you get forty-five of one and five of another, you now know your "confidence interval." You can build your application to handle that five percent variance.
And if you get fifty different results?
Then your prompt is too "loose" or your task is too complex for the model’s current reasoning capabilities. You need to break the prompt down into smaller, more deterministic steps. Instead of "Analyze this whole document and give me a summary," you do "Extract the names. Now extract the dates. Now extract the amounts." The smaller the task, the less room there is for the math to drift.
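The variance test the hosts describe is straightforward to wire up. Here the model is a deterministic stand-in that "drifts" on one call in ten, so the sketch is reproducible; a real test would call the live API.

```python
from collections import Counter
from itertools import cycle

def variance_test(generate, prompt, runs=50):
    # Run the identical prompt many times and tally the distinct outputs.
    # top_count / runs is an empirical consistency rate you can gate on.
    outputs = Counter(generate(prompt) for _ in range(runs))
    top, top_count = outputs.most_common(1)[0]
    return top, top_count / runs, len(outputs)

# Stand-in model: nine consistent answers, then one drifted one, repeating.
_script = cycle(["Option B"] * 9 + ["Option C"])
def drifting_model(prompt):
    return next(_script)

answer, rate, distinct = variance_test(drifting_model, "pick one", runs=50)
print(answer, rate, distinct)  # -> Option B 0.9 2
```

A deployment gate can then be a plain assertion, for example requiring a ninety-five percent consistency rate before a prompt ships.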
It is like giving the intern one task at a time instead of a whole project. They are less likely to get distracted. I want to go back to something you mentioned earlier—the "Jazz Musician" analogy. I think it was a researcher from a study on "Non-Determinism of Deterministic Settings" who said that asking an LLM to be deterministic is like asking a jazz musician to play the exact same solo twice. Even if they try, the physical environment, the timing, the "feel" of the room makes it impossible.
It’s a beautiful analogy because it captures the "physicality" of computing. We think of code as abstract logic, but it’s actually electrons moving through gates. In a GPU with eighty billion transistors, the "environment" is incredibly complex. The heat of the chip, the power draw, the other tasks running alongside it—all of these create a "timing" that is never exactly the same twice.
So, will we ever get true determinism? Or is this a permanent feature of the LLM landscape?
There are people working on "deterministic kernels." These are specially designed GPU operations that lock in a fixed order of operations, so the rounding comes out identical on every run. But as I said, the speed penalty is huge. Most people would rather have an answer in two seconds that is ninety-nine percent consistent than an answer in ten seconds that is one hundred percent consistent. We’ve collectively decided that "fast and mostly right" is better than "slow and perfectly predictable."
It’s the classic engineering trade-off. But I wonder if that changes as we move toward "Agentic" workflows. If an AI agent is browsing the web and making purchases on your behalf, a one percent variance could mean the difference between buying a plane ticket to Paris or a plane ticket to Peoria. At that point, the "speed" doesn't matter as much as the "accuracy."
I think we’ll see a tiering of models. You’ll have your "Creative Engines" for writing and brainstorming where Temperature is high and variance is a feature, not a bug. And then you’ll have "Logic Engines" that use deterministic kernels, verified hardware, and maybe even redundant checking—where two different GPUs run the same math and compare notes—to guarantee a result. But you’re going to pay a premium for that "Logic Engine" run.
It’s like the difference between buying a regular bolt at a hardware store and buying a "certified" bolt for an airplane wing. They look the same, but one has a paper trail of testing and a guarantee that it won't fail under specific conditions.
We are currently in the "hardware store" phase of AI. Everything is cheap, fast, and "good enough" for most things. But as this tech moves into critical infrastructure—medicine, law, finance—the "aerospace" grade AI is going to become a huge market.
That brings up a funny thought—if we ever do get perfectly deterministic AI, will we lose the "magic"? Part of why these models feel so human is that they are a little bit unpredictable. If it gave the exact same answer every time, would it start to feel more like a boring old database?
Probably. There is an "uncanny valley" of predictability. If a human gave you the exact same three-paragraph explanation for a concept every time you asked, word for word, you’d think they were a robot. The variability is actually what makes the interaction feel natural. But again, that is the tension: we want the interaction to be human, but we want the result to be mechanical.
We want a robot that acts like a person but thinks like a calculator. It’s a tall order. Before we wrap up the main discussion, I want to touch on one more thing from Daniel’s notes—the cost of determinism. He mentioned that for OpenAI, forcing determinism would hit their profit margins. This is something we don't talk about enough: the economics of randomness.
It is the hidden tax on reliability. To make these systems deterministic, you have to reduce their efficiency. You have to wait for threads to sync up, you have to avoid certain parallel shortcuts. In a world where every millisecond of GPU time costs money, "randomness" is actually a cost-saving measure. It’s cheaper to let the math wiggle a little than it is to keep it perfectly straight.
That is a cynical but probably very accurate take. "Accuracy is expensive, randomness is free."
Or at least, "Consistency is expensive." The model is still "accurate" in the sense that it is following its training. It’s just not "replicable." And in the world of science, if it’s not replicable, it’s not a fact. That is the crisis AI is bringing to the world of software engineering. We are trying to build "facts" on top of "probabilities."
Well, let’s look at some practical takeaways for the folks listening who are actually staring at a terminal right now. If you are building a production system and you need consistency, what are the three things you should do?
First, set Temperature to zero, but treat it as a "strong suggestion" to the model, not an absolute command. Use it alongside a "top_k" of one, or a very low "top_p", to shrink the pool of candidate tokens. Second, always use structured outputs—whether that is OpenAI’s JSON mode or a library like Instructor or Pydantic. Forcing the structure is the most effective way to keep the model from wandering off into the woods, even if the individual words vary.
And the third?
Implement a deterministic post-processing layer. Don't let the LLM have the final say. If you need a date, take the LLM’s string and run it through a proper date-parsing library. If you need a number, cast it to a float in your own code. Use the AI for the "fuzzy" work of understanding the context, but use traditional code for the "hard" work of generating the final data.
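The date-parsing case Herman mentions, as a small sketch. The format list is illustrative, not exhaustive; a production system would cover its real input formats or use a dedicated parsing library.

```python
from datetime import datetime

# Try a few known formats so "Jan 05, 2025", "2025-01-05", and
# "05/01/2025" all normalize to one canonical ISO string.
FORMATS = ["%Y-%m-%d", "%b %d, %Y", "%d/%m/%Y"]

def normalize_date(raw: str) -> str:
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

for variant in ["2025-01-05", "Jan 05, 2025", "05/01/2025"]:
    print(normalize_date(variant))  # -> 2025-01-05 each time
```

Same pattern as the amount example: the model does the fuzzy reading, plain code produces the one value the database is allowed to see.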
And I’ll add a fourth: audit your workflows for "hidden variability." Run those Monte Carlo tests. If your system breaks because the model said "Yes" instead of "Affirmative," that is a failure of your code, not the AI. You need to build your systems to be "variance-aware."
Variance-aware. I like that. It’s about building resilient systems that expect a little bit of noise in the signal.
So, back to Daniel’s original question: "Is variability simply innate to the technology?" Based on everything you’ve said, Herman, it sounds like the answer is a resounding "Yes"—at least at the current scale of parallel computing. We are essentially catching lightning in a bottle, and you can't expect the same bolt to strike the same way twice.
That is exactly it. The lightning is the intelligence, but the bottle is the GPU. And as long as we are using these massive, parallel, floating-point-heavy chips to run these models, that tiny wiggle in the math is going to be there. It is the "ghost in the machine," and we just have to learn to live with it.
Or learn to code around it. It’s a whole new paradigm of engineering. It’s less like architecture and more like gardening. You can’t control exactly where every leaf grows, but you can trim the hedge into the shape you want.
I love that. We are all just digital gardeners now, trying to keep the AI bushes from overgrowing the sidewalk.
Well, my shears are getting dull, so we should probably start to wind this down. This has been a fascinating deep dive into the "wobble" of AI. It’s one of those things that seems like a small technical detail until you realize it’s the reason your app is crashing on Tuesday but worked on Monday.
It’s the difference between a toy and a tool. And if we want these to be tools, we have to respect the physics of the hardware they run on.
This brings us to a final open question for the listeners: As models get larger and more capable—moving from billions of parameters to trillions—does this fundamental probabilistic nature become more or less of a constraint? Do bigger models have "stronger" internal logic that can override the hardware drift, or does more complexity just mean more room for things to go sideways?
That is the frontier. Some people think "Scale is all you need" to fix this. Others think we need a fundamental change in how chips are designed. I’m leaning toward the latter.
We’ll have to see where we are by twenty-seven. But for now, we are stuck with the drift. Thanks as always to our producer, Hilbert Flumingtop, for keeping our own variability in check. And a big thanks to Modal for providing the GPU credits that power this show—we promise we aren't letting the floating-point errors get too out of hand.
Mostly.
This has been My Weird Prompts. If you are enjoying the show, a quick review on your podcast app helps us reach new listeners and keeps the "probabilities" of our success high.
Check out the website at myweirdprompts dot com for the full archive and all the technical links from today’s episode.
We’re also on Telegram—just search for My Weird Prompts to get a notification the second a new episode drops.
See you next time.
Stay consistent, folks. Or try to.