Imagine you are running a deep research agent to analyze a complex legal filing or a medical breakthrough. It scans ten thousand documents, synthesizes the data, and hands you a report. But before you even read the first sentence, the system has already filtered out forty percent of its own initial claims because it flagged them as low confidence. That is the holy grail of automated research, and as of the first quarter of twenty twenty-six, it is becoming the default setting for enterprise AI. Today's prompt from Daniel is about the mechanics and the potential pitfalls of confidence scoring in these structured output workflows. He is asking if it is really as simple as asking an AI to rate its own certainty, and if so, how on earth we make that rigorous enough for high-stakes work.
Herman Poppleberry here, and I have been diving into the documentation for the new structured output APIs from OpenAI and Anthropic all week. This is such a timely question from Daniel because we are seeing a massive shift in how these models are deployed. We are moving away from just getting a block of text and hoping for the best. Now, we are demanding that the model return a JSON object where every single claim has a numerical confidence score and a direct link to a source. But the big "but" here—and it is a massive one—is whether those numbers actually mean anything. By the way, a quick shout out to Google Gemini three Flash, which is actually powering our script today. It is a meta moment, considering we are talking about model self-awareness.
It is a bit like asking a teenager how confident they are that they cleaned their room. They might say "one hundred percent," but your definition of "clean" and theirs might be light-years apart. When we talk about confidence scoring in LLMs, what are we actually measuring? Is it the model's internal state, or is it just a vibe check it is performing on its own prose?
That is the core of the problem. In traditional machine learning, like a simple image classifier identifying a cat, "confidence" has a very specific mathematical definition. It is usually based on the softmax output of the final layer—basically, how much of the probability distribution landed on the "cat" label versus the "dog" label. But LLMs are probabilistic word generators. When you ask an LLM for a confidence score, you are usually doing one of two things. Either you are looking at the token log-probabilities—the mathematical likelihood of each word chosen—or, more commonly now, you are just asking the model to perform "verbalized self-report." That is literally just the model saying, "I think I am an eight out of ten sure about this."
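A minimal sketch of the "medium rigor" option Herman mentions, in Python. The per-token logprob values here are made-up illustrations, not output from any specific API; the aggregation (geometric mean of token probabilities) is one common choice among several.

```python
import math

def sequence_confidence(token_logprobs):
    """Turn per-token log-probabilities into a single 0-1 score.

    Uses the exponential of the mean logprob (the geometric mean of
    token probabilities), so longer answers are not penalized just
    for containing more tokens.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)

# Hypothetical per-token logprobs: values near 0 mean the model found
# each token highly likely; strongly negative values mean it hedged.
confident_answer = sequence_confidence([-0.01, -0.02, -0.05])  # high score
hedged_answer = sequence_confidence([-0.9, -1.2, -0.7])        # low score
```

Note this measures how sure the model was about its word choices, not whether those words are true, a distinction the hosts return to later in the episode.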
And that is where the skepticism kicks in. We have all seen models hallucinate with extreme prejudice. They will tell you a flat-out lie with the confidence of a seasoned trial lawyer. If the model does not know it is lying, how can it give itself a low confidence score? It feels like a recursive loop where the error is baked into the evaluation itself. I mean, if the base model is convinced that the sky is neon green because of some weird training data quirk, it’s going to give that "green sky" claim a 1.0 confidence score every single time, right?
It's the "blind spot" problem. If the model has a fundamental misunderstanding of a fact, its self-assessment of that fact is also going to be flawed. You hit on the "sycophancy" and "overconfidence" biases right out of the gate. There was a fascinating benchmark released in January twenty twenty-six by Stanford’s Human-Centered AI Institute. They looked at GPT-four-o’s self-reported confidence scores on multi-hop question-answering tasks—the kind of stuff Daniel is talking about for deep research. They found a correlation of only zero point four two between the model’s reported confidence and its actual factual accuracy. In statistics, zero point four two is... well, it is a real positive relationship, but a weak one, and you would not want to bet your company’s strategy on it.
Zero point four two is basically "I'm guessing, but I'm trying to sound smart." It’s like a student who hasn’t studied but is really good at multiple-choice elimination. So, if the simple "rate your certainty" approach is that shaky, why is the industry leaning so hard into it? Is it just because it is easy to pipe into a UI? It feels like we’re putting a shiny "Verified" badge on a car that hasn't actually passed an inspection.
It is easy to implement, but the real power comes when you move beyond that "low rigor" self-reporting. Daniel mentioned piping these scores into "LLM-as-judge" steps. This is where the workflow gets interesting. Instead of trusting the model that generated the answer, you take that answer, the confidence score, and the original source documents, and you hand them to a separate, often more capable model. The "judge" model isn't just checking if the answer sounds good; it is verifying the link between the claim and the source. If the primary model says "Project X cost five billion dollars" and cites a PDF, the judge checks that PDF. If the number isn't there, or if it is actually five million, the judge nukes that confidence score.
So we are essentially building a digital bureaucracy. We have the "Researcher" model doing the legwork, and then the "Auditor" model checking the receipts. But doesn't that just move the problem one step down the line? Now we have to worry about whether the Auditor is being too lenient or if it is just agreeing with the Researcher to avoid conflict—that sycophancy problem you mentioned. If the Researcher says, "I'm 95% sure," does the Auditor feel a sort of algorithmic pressure to agree?
That’s a brilliant question, Corn. It’s actually called "confirmation bias in multi-agent systems." If the Auditor sees the Researcher’s high score first, it can absolutely be influenced. To combat this, advanced workflows use "blind auditing." You send the claim to the Auditor without the Researcher's confidence score. You just say, "Here is a claim and here is a source. Tell me if it's true." Only after the Auditor gives its independent score do you compare the two. If they disagree, you flag it for a human. It’s like having two independent witnesses in a trial who aren't allowed to talk to each other before they testify.
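The blind-audit reconciliation Herman describes can be sketched in a few lines of Python. The tolerance value and the "keep the more conservative score on agreement" rule are illustrative design choices, not a standard:

```python
def reconcile(researcher_score, auditor_score, tolerance=0.2):
    """Blind-audit reconciliation.

    The auditor scored the claim WITHOUT seeing the researcher's
    number; only now do we compare the two independent scores.
    Returns (final_score, needs_human_review).
    """
    if abs(researcher_score - auditor_score) <= tolerance:
        # Agreement: keep the more conservative of the two scores.
        return min(researcher_score, auditor_score), False
    # Disagreement beyond tolerance: no automatic score, flag a human.
    return None, True

agreed = reconcile(0.90, 0.85)      # -> (0.85, False)
disputed = reconcile(0.95, 0.40)    # -> (None, True)
```

The key property is that the auditor's score is produced before this function ever runs, so neither model's number can anchor the other.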
That makes a lot of sense. It removes the "peer pressure" from the silicon. But you also mentioned "Chain-of-Verification" or CoVe. Walk me through that. How does a model "verify" itself without just doubling down on its own hallucinations?
It is a multi-step process. First, the model generates the initial response. Then, it is prompted to generate a set of "verification questions" that would prove or disprove the facts in that response. For example, if it claimed a specific company was founded in nineteen ninety-eight, a verification question would be, "What is the official incorporation date of Company X according to its SEC filings?" Then, the model—or a separate one—answers those questions independently, without looking at the first response. Finally, it compares the answers. If the independent answers contradict the original claim, the confidence score for that claim drops to zero.
That sounds computationally expensive. If I'm trying to triage ten thousand sources, as Daniel suggested, running a full Chain-of-Verification for every single sentence is going to blow my API budget and take forever. Is there a middle ground? I'm imagining the bill for running a hundred verification questions for every paragraph of a thousand-page legal filing. That’s not a research tool; that’s a money pit.
There is, and it is what we call the "Small-to-Large" workflow. This is where you see the real-world engineering coming in. You use a small, fast, cheap model—maybe something like Llama three eight-B or a distilled version of Gemini—to do the heavy lifting and the initial research. That small model is instructed to be very aggressive with its "low confidence" flags. Anything it isn't absolutely certain about, it flags. Then, you only send those flagged "problem nodes" to the expensive, high-rigor model like GPT-four-o or Claude three point five Sonnet for the final judging. It is a funnel. You use the cheap model for breadth and the expensive model for depth.
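The funnel Herman describes is a simple routing loop. The threshold and the stub scorers below are illustrative stand-ins for the cheap and expensive model calls:

```python
def small_to_large(claims, cheap_score, expensive_score, threshold=0.8):
    """Funnel: the cheap model scores everything and flags aggressively;
    only flagged claims are re-scored by the expensive model."""
    final = {}
    escalated = 0
    for claim in claims:
        score = cheap_score(claim)
        if score < threshold:
            # Low confidence from the cheap model: escalate.
            score = expensive_score(claim)
            escalated += 1
        final[claim] = score
    return final, escalated

# Stub scorers standing in for the two model calls.
cheap = {"claim A": 0.95, "claim B": 0.40, "claim C": 0.60}.get
expensive = {"claim B": 0.88, "claim C": 0.30}.get
scores, n_escalated = small_to_large(
    ["claim A", "claim B", "claim C"], cheap, expensive
)
# Only B and C hit the expensive model; A's cheap score stands.
```

Tuning the threshold is the whole game: set it too low and hallucinations slip through on the cheap tier, too high and you pay the expensive model for everything.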
I like that. It's like having a fleet of interns doing the first pass, and then the senior partner only looks at the red-lined sections. But let's go back to the "rigor" part of Daniel's question. Even with a judge model, we are still dealing with probabilistic systems. How do we turn these "vibes" into something an engineer can actually trust? Is there a way to calibrate these scores so that "eighty percent confident" actually means "right eighty percent of the time"? Because right now, "eighty percent" feels like a model's way of saying "I'm pretty sure, probably."
That is measured by "Expected Calibration Error," or ECE. It is a metric used to evaluate how well a model's predicted probabilities match real-world outcomes. If a system is perfectly calibrated, and you take all the times it said it was eighty percent confident, it should be correct exactly eighty percent of those times. Historically, LLMs have been terribly calibrated; they are usually "over-confident." However, a study from March twenty twenty-six—published on ArXiv just a few weeks ago—showed that the newest generation of models, like the GPT-OSS-one hundred twenty-B, are starting to achieve ECE scores of around zero point one zero. That is a massive improvement.
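ECE has a standard, easy-to-compute form: bin predictions by stated confidence, then take the weighted average gap between mean confidence and observed accuracy per bin. A stdlib-only sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted
    average gap between mean confidence and accuracy in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Saying "90%" and being right 9 times out of 10 is well calibrated:
perfect = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Saying "100%" and being right half the time is badly miscalibrated:
overconfident = expected_calibration_error([1.0] * 10, [True] * 5 + [False] * 5)
```

An ECE of zero point one zero means the confidence-accuracy gap, averaged this way, is about ten percentage points.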
Zero point one zero. For the non-math nerds among us—including myself on a Tuesday afternoon—how does that translate to reliability? Does that mean the "vibes" are finally aligning with reality?
It means that if the model says it is ninety percent sure, it is actually right about eighty to ninety percent of the time. It is much closer to an "honest" assessment. But to get there, developers are using "calibration layers." They take the raw output from the LLM, run it through a smaller, supervised learning model that has been trained on thousands of examples of the LLM being right or wrong, and that "calibration model" adjusts the score. It basically learns that "When this LLM says it is ninety percent sure about a legal topic, it is actually only right about sixty percent of the time, so let's dial that number down." It’s like having a friend who always exaggerates, so you mentally subtract twenty percent from everything they say. This is the mathematical version of that.
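The simplest possible calibration layer is a histogram correction: bucket historical (raw confidence, was-correct) pairs and replace each raw score with its bucket's observed accuracy. Real systems often use Platt scaling or isotonic regression instead; this stdlib sketch just shows the idea, and the numbers are illustrative:

```python
def fit_calibration_map(history, n_bins=5):
    """Learn a per-bucket correction from (raw_confidence, was_correct)
    pairs: each bucket's calibrated score is its observed accuracy.
    Empty buckets fall back to the bucket midpoint (no evidence)."""
    buckets = [[] for _ in range(n_bins)]
    for raw, ok in history:
        buckets[min(int(raw * n_bins), n_bins - 1)].append(1 if ok else 0)
    return [
        sum(b) / len(b) if b else (i + 0.5) / n_bins
        for i, b in enumerate(buckets)
    ]

def calibrate(raw_score, calibration_map):
    """Replace the model's raw score with the learned bucket accuracy."""
    n_bins = len(calibration_map)
    return calibration_map[min(int(raw_score * n_bins), n_bins - 1)]

# History: the model claimed ~90% on legal questions but was right 60%.
history = [(0.9, True)] * 6 + [(0.9, False)] * 4
cal_map = fit_calibration_map(history)
adjusted = calibrate(0.92, cal_map)  # the observed accuracy, not the claim
```

This is Herman's "friend who always exaggerates" made literal: the map memorizes the model's exaggeration and subtracts it.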
So we are literally training a "liar detector" specifically tuned to that model's personality. That is fascinating. It turns the model's consistent biases into a predictable variable that can be accounted for. But what about the role of structured outputs, like Pydantic or the "instructor" library Daniel mentioned? How does forcing a model into a JSON schema help with confidence? Does the structure itself act as a guardrail?
It is all about the "Chain-of-Thought" sequence. One of the most effective tricks in prompt engineering right now is forcing the model to provide an "explanation" or "reasoning" field before it provides the numerical confidence score. If you ask for the score first, the model just picks a number. But if it has to write out, "I am citing the twenty twenty-four annual report, but page forty-two has a footnote that contradicts the main table," by the time it gets to the confidence field, it has "realized" its own uncertainty. The structured output forces a logical order of operations. It’s almost like the JSON schema is a checklist that forces the model to slow down and think.
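Because structured-output modes generate fields in schema order, putting the reasoning field ahead of the score enforces the "show your work first" ordering Herman describes. An illustrative JSON schema (the field names are examples, not a standard):

```python
# Property order mirrors generation order, so the model must write out
# its evidence BEFORE it commits to a number.
CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "claim": {"type": "string"},
        "source": {"type": "string"},
        "reasoning": {"type": "string"},  # generated first ...
        "confidence": {                   # ... score committed last
            "type": "number",
            "minimum": 0,
            "maximum": 1,
        },
    },
    "required": ["claim", "source", "reasoning", "confidence"],
}

# Python dicts preserve insertion order, so we can inspect it directly.
field_order = list(CLAIM_SCHEMA["properties"])
```

Reversing the order (score first, reasoning after) turns the reasoning into a post-hoc rationalization of a number the model already picked.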
It’s the "show your work" requirement from math class. If you have to show the steps, you’re much more likely to catch your own mistake before you write down the final answer. Now, let’s talk about the "triage" aspect. Daniel mentioned using these scores to decide which sources to include in deep research. If I’m a researcher, and I see a source with a confidence score of zero point six, do I throw it out? Or is a "maybe" still valuable in a research context? Sometimes the most interesting breakthroughs come from the "low confidence" outliers that everyone else ignored.
In a deep research workflow, a "maybe" is often a trigger for more work. This is the "self-correcting research loop." In the newest systems, a low confidence score doesn't just result in a "reject" flag. It triggers a secondary search query. The system says, "I found this claim about a new battery chemistry, but the confidence is low because the source is a press release, not a peer-reviewed paper. I will now search specifically for the peer-reviewed version of this study." The confidence score becomes the steering wheel for the autonomous agent. It’s not a binary "yes/no" filter; it’s a "how much harder should I look?" dial.
That is a huge shift. We aren't just using AI to find answers; we are using its own self-doubt to refine the search. It's like the AI is saying, "I think I heard something about this, let me go double-check." That feels a lot more like how a human researcher actually works. We don't just know everything; we know what we need to verify. But how does this handle conflicting sources? If I have one high-confidence source saying A and another high-confidence source saying B, does the system just explode?
It doesn't explode, but it does flag a "high-confidence conflict." This is actually a specific field in some of the newer JSON schemas Daniel might be looking at. The model will output something like "Synthesis: Inconclusive. Source 1 (Confidence 0.9) says X. Source 2 (Confidence 0.95) says Y. Conflict detected." For a human researcher, that is pure gold. You’ve just had ten thousand documents distilled down to the one specific contradiction that actually matters. This is why we are seeing the rise of "Trust Scores." Tools like Cleanlab are pioneering this. A Trust Score doesn't just look at what the model says; it looks at "token log-probs"—the mathematical probability of the words—combined with "consistency checks" across multiple generations.
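The conflict-flagging step Herman describes is mechanically simple once claims carry a topic, a value, and a confidence. A sketch with illustrative field names and an illustrative threshold:

```python
def detect_conflicts(claims, threshold=0.85):
    """Group claims by topic and flag topics where two or more
    HIGH-confidence claims assert different values."""
    by_topic = {}
    for claim in claims:
        by_topic.setdefault(claim["topic"], []).append(claim)
    conflicts = []
    for topic, group in by_topic.items():
        strong = [c for c in group if c["confidence"] >= threshold]
        if len({c["value"] for c in strong}) > 1:
            conflicts.append({
                "topic": topic,
                "synthesis": "inconclusive",  # surfaced, not resolved
                "claims": strong,
            })
    return conflicts

claims = [
    {"topic": "project_cost", "value": "5B", "confidence": 0.90},
    {"topic": "project_cost", "value": "5M", "confidence": 0.95},
    {"topic": "launch_year", "value": "2024", "confidence": 0.99},
]
found = detect_conflicts(claims)  # flags only the cost contradiction
```

Note the system deliberately does not pick a winner: the high-confidence contradiction is the deliverable for the human researcher.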
Okay, let's break down that "log-probs" part. You mentioned it earlier as "medium rigor." Why isn't a high mathematical probability of a word enough to say the model is confident? If the model's internal math says there is a ninety-nine percent chance the next word is "Paris," why isn't that a ninety-nine percent confidence score? If the math is sure, why aren't we sure?
Because the model might be ninety-nine percent sure that "Paris" is the most likely word to follow in a sentence, even if that sentence is a lie. Think about a common myth. If I start the sentence, "The Great Wall of China is the only man-made object visible from..." the model is going to be extremely "confident" that the next word is "Space." Its internal math will give "Space" a near one hundred percent probability because that is a very common linguistic pattern. But as we know, that fact is actually false. You can't see the Great Wall from space with the naked eye. So, "log-probs" measure linguistic certainty, not factual truth. It’s the difference between knowing how a sentence should sound and knowing if the sentence is true.
That is an incredible distinction. The model is confident in its grammar and its mimicry of common phrases, but it has no "grounding" in reality unless we provide it. This brings us back to Daniel's point about sources. If you don't have a source, the confidence score is just a measure of how well the model can BS its way through a paragraph. It’s like a person who is very eloquent but has no idea what they’re talking about. They sound convincing, but they’re just following the rules of language.
True rigor in twenty twenty-six requires "Source-Grounding Scores." This is a specific type of confidence scoring where the "Judge" model evaluates the strength of the link between the claim and the source text. It asks, "Does this specific sentence in the source actually support this specific claim in the output?" It looks for things like "entailment"—does Fact A logically lead to Fact B? If the answer is "yes, but it is an inference," the score is medium. If the answer is "yes, it is a direct quote," the score is high. If the answer is "no, the source mentions the topic but not this specific fact," the score is zero. This is how you catch those "hallucinations of omission" where a model cites a paper that exists, but the paper doesn't actually say what the model says it says.
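The tiering Herman lays out (direct quote, entailment, inference, topical-only) maps naturally onto a score table. The relation labels and numeric tiers below are illustrative, not a standard taxonomy; in practice the relation itself comes from a judge model's verdict:

```python
def grounding_score(relation):
    """Map the judge model's verdict on the claim-source link to a
    grounding score tier."""
    tiers = {
        "direct_quote": 1.0,   # the source states the claim verbatim
        "entailed": 0.9,       # the source logically implies the claim
        "inference": 0.5,      # supported only via an inferential leap
        "topical_only": 0.0,   # source mentions the topic, not the fact
    }
    # Unknown or missing relations score zero: unverified means untrusted.
    return tiers.get(relation, 0.0)
```

The "topical_only" tier is what catches the hallucination of omission: the cited paper is real and on-topic, but it never says what the model claims it says, so the claim scores zero.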
So, we're moving toward a multi-factor authentication for truth. You need the model's self-report, you need the log-probs, you need the cross-model judge, and you need the direct source grounding. When you stack all those up, do we actually get a number that a human can trust? Or are we just building a more complex "vibe check"? Even with four layers, it’s still all happening inside the "black box" of neural networks.
We're getting closer to a "probabilistic guarantee." It's never going to be one hundred percent "true" in the way a mathematical proof is true. But for research, we're reaching a point where the "Trust Score" can reliably flag ninety-five percent of hallucinations. That five percent "gap" is why we still need humans in the loop. I think that's the takeaway for anyone building these systems: confidence scoring isn't a replacement for human review; it's a prioritization tool for human review. It tells you where the ice is thin.
It tells the human where to look first. Instead of fact-checking a hundred-page report, the AI hands you the report and says, "I'm ninety-nine percent sure about pages one through ninety, but page ninety-one relies on a single source that I'm only sixty percent sure about. Start your review there." That is a massive productivity gain. It’s like having a GPS that tells you, "I’m pretty sure this is the way, but maybe keep your eyes on the road for the next two miles because the map data is old."
It really is. And for those who want to go deeper into how we evaluate these probabilistic systems, we actually did a whole breakdown on this back in episode one hundred forty-seven, "How Do You QA a Probabilistic System?" We talked about the shift from unit tests to "evals," which is essentially what these confidence scores are becoming—real-time, per-output evals. Back then, we were talking about it as a theoretical future, but Daniel’s prompt shows that it’s very much the present.
I remember that one. It was all about how you can't just test for "expected output" anymore because the output is always changing. You have to test for "expected behavior" and "expected quality ranges." Confidence scoring is basically the model running its own "eval" on itself in real-time. It’s a dynamic quality control system.
Which is why the "calibration" we talked about is so key. If your real-time eval is biased, your whole system is biased. I think the most exciting development I've seen recently is the integration of these scores into "RAG" pipelines—Retrieval-Augmented Generation. Instead of just showing the user the top three search results, the system uses confidence scoring to "rerank" them based on how well they actually answer the specific question. It’s not just about "relevance" anymore; it’s about "reliability."
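The reliability-aware reranking Herman describes can be sketched as a blend of retrieval relevance and answer confidence, with a hard floor that drops passages the judge doubts entirely. Weights, cutoff, and field names are illustrative:

```python
def rerank(passages, alpha=0.5, min_answer_conf=0.2):
    """Blend retrieval relevance with 'does this actually answer the
    question' confidence; drop passages below the confidence floor."""
    kept = [p for p in passages if p["answer_conf"] >= min_answer_conf]
    return sorted(
        kept,
        key=lambda p: alpha * p["relevance"] + (1 - alpha) * p["answer_conf"],
        reverse=True,
    )

passages = [
    # High keyword relevance, but the judge says it doesn't answer:
    {"id": "keyword_match", "relevance": 0.9, "answer_conf": 0.1},
    {"id": "real_answer", "relevance": 0.7, "answer_conf": 0.95},
    {"id": "partial", "relevance": 0.6, "answer_conf": 0.5},
]
ranked = rerank(passages)  # keyword_match is dropped entirely
```

This is exactly Corn's "battery keyword" complaint: the passage that merely contains the keyword is filtered out rather than shown first.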
Which solves the "too much information" problem. We've all used RAG systems that just dump five paragraphs of irrelevant text on you because they happened to contain the keyword "battery." If the system can say, "This paragraph has the keyword, but I'm only ten percent confident it actually explains the chemistry you asked about," it can just hide that result entirely. It makes the AI feel much more "intelligent" because it isn't wasting your time with low-quality garbage. It’s exercising judgment.
That is the "triage" Daniel mentioned. It is the filter that makes "Deep Research" possible. Without it, you are just drowning in a sea of "maybe-relevant" data. But we should address the "cost-latency" tradeoff one more time, because I think that is the biggest barrier for most people listening. If you add a "Judge" model and a "Chain-of-Verification" step, your research agent just got five times more expensive and three times slower. Is that worth it for every use case?
If you're a hedge fund deciding where to put a hundred million dollars, or a pharmaceutical company looking at a new drug candidate, I think the answer is a resounding "yes." The cost of a hallucination in those fields is measured in the millions or billions. A few extra cents on an API call to ensure the AI isn't making up a clinical trial result? That's the cheapest insurance policy in history. But if I’m just using AI to summarize a recipe for lasagna, I probably don’t need a three-layer verification process to make sure it didn't hallucinate the amount of garlic.
That’s a great way to frame it. The "value" of confidence scoring is directly proportional to the "cost of being wrong." If you're using an AI to write a funny poem for your niece's birthday, you don't need a confidence score. If the AI hallucinates that she's ten instead of nine, the stakes are low. But as AI moves into the "engine room" of the global economy—which is what we're seeing in twenty twenty-six—these "rigor" frameworks stop being academic and start being essential infrastructure. We’re building the "brakes" for the AI car so we can finally drive it at high speeds.
It’s the difference between a toy and a tool. A tool has a predictable margin of error. A toy just does whatever it wants. By adding these layers of scoring, judging, and grounding, we are finally turning LLMs into reliable industrial tools. But I want to push back on one thing, Herman. You mentioned that the newest models are better at "reading their own minds." Does a model actually have an "internal state" of certainty, or is that just a metaphor we're using to describe complex pattern matching? Is there a "doubt neuron" somewhere in the weights?
That is a deep philosophical question, but there is some technical evidence for it. There is a concept called "Internal States of Uncertainty." Researchers have found that if you look at the hidden layers of a transformer—not the final output, but the middle layers—there are often specific "neurons" or directions in the vector space that correlate with the model "knowing" it is about to hallucinate. It’s like a "check engine light" that flickers deep in the circuitry before the wrong word even reaches the output layer. They’ve actually experimented with "probing classifiers" that can predict if a model is going to be wrong just by looking at its activation patterns before it even finishes the sentence.
That is wild. So the model "knows" it's full of it, but by the time it gets to the output, it has smoothed over that uncertainty to sound more "helpful" and "fluent." It’s like a person who starts a sentence, realizes halfway through they don’t know the ending, but just commits to it anyway to avoid looking stupid.
Precisely! Well, not "precisely," but you nailed the concept. The "alignment" training we give these models—to make them polite and helpful—often encourages them to hide their uncertainty because a "helpful" assistant is supposed to have the answer. We are essentially training them to be overconfident. Confidence scoring is our attempt to bypass that "politeness layer" and tap back into that raw, internal "check engine light." We’re asking the model to stop being a "helpful assistant" and start being a "transparent data processor."
We're basically asking the AI to be "less helpful" in the conversational sense, and "more honest" in the data sense. I think that's a trade-off most researchers would take in a heartbeat. I don't need my research agent to be my friend; I need it to be a cold, hard judge of evidence. I want it to be brutally honest about what it doesn't know.
And that is the shift Daniel is seeing. We are moving from "Chatbots" to "Research Agents," and the primary difference between the two is the presence of a rigorous, objective confidence framework. The "structured output" part is just the delivery mechanism—the pipe that lets that rigor flow into our other applications. It’s what allows a developer to say "If confidence < 0.7, trigger human-in-the-loop," and have that actually mean something.
So, to bring it back to Daniel’s original question: "Is it as simple as asking the AI to score its certainty?" The answer is "No, but that's where it starts." To make it rigorous, you need that multi-layered approach—the "liar detector" calibration, the "Auditor" judge model, and the "receipt-checking" source grounding. It’s not one thing; it’s a stack. And it’s a stack that requires a lot of testing and fine-tuning.
It’s a stack that requires real engineering. You can't just "prompt" your way to rigor. You have to build a pipeline that includes calibration, verification, and hopefully, that "Small-to-Large" cost optimization. But the good news is that the tools to do this—like the "instructor" library or the native JSON modes in the big APIs—are making this accessible to everyone, not just the big labs. We’re seeing a democratization of "high-rigor AI."
It feels like we're finally getting the "User Manual" for these models. For the first two years, we were just pressing buttons and seeing what happened. Now, we're learning how to look under the hood, check the gauges, and actually drive the thing with some level of control. We’re moving from the "magic" phase of AI to the "engineering" phase.
And that control is going to be the deciding factor in which companies actually succeed with AI in the next few years. The companies that just "deploy a chatbot" are going to get burned by hallucinations. The companies that build "trust-scored research pipelines" are going to be the ones that actually unlock the value of all this data. It’s about building a system that knows its own limits.
So, for our listeners who are building these systems right now, what is the "Monday morning" takeaway? If they have a RAG pipeline and they want to add this rigor, where do they start? What’s the first brick in the wall?
Step one: Don't just ask for a confidence score. Use a structured output to force the model to provide a "reasoning" or "evidence" field before the score. That alone will give you a massive boost in accuracy because it forces the model to engage its internal consistency checks. Step two: Take a small sample of your data—maybe a hundred queries—and manually check the "confidence" against the "truth." If the AI says "ninety percent" but it's only right "sixty percent" of the time, you know you need that calibration layer. You need to know your model's "overconfidence offset."
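Step two of the Monday-morning checklist reduces to one number on your hand-checked sample. A sketch, with the sixty-percent scenario Herman uses as the worked example:

```python
def overconfidence_offset(checked_sample):
    """Mean reported confidence minus observed accuracy on a
    hand-checked sample of (confidence, was_correct) pairs.
    A positive offset means the model is overconfident."""
    avg_conf = sum(c for c, _ in checked_sample) / len(checked_sample)
    accuracy = sum(1 for _, ok in checked_sample if ok) / len(checked_sample)
    return avg_conf - accuracy

# The scenario from step two: the model says ~90% but is right 60%.
sample = [(0.9, True)] * 6 + [(0.9, False)] * 4
offset = overconfidence_offset(sample)  # ~0.3, a large gap
```

A rough rule of thumb under this framing: if the offset on your sample is more than a few points, you need a calibration layer before the raw scores are worth piping downstream.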
And step three: If the stakes are high, bring in a "Judge." Use a second model to check the first model's work. It's the "four eyes principle" for the digital age. It costs more, but it's the only way to get to that "expert-level" reliability. It’s the difference between a rough draft and a peer-reviewed paper.
And keep an eye on those "Source-Grounding" metrics. A confidence score without a source is just a guess. A confidence score with a verified source is a fact. That’s the goal. That’s how we turn these black boxes into transparent, reliable research engines.
I think that’s a perfect place to wrap this one. Daniel, as always, thanks for the prompt. You really hit on the "vanguard" of where LLM engineering is heading. It’s not just about what the model says anymore; it’s about how much we can trust it. We’re moving from the "what" to the "how sure are we?"
It’s the "Trust but Verify" era of AI. And honestly, I’m a lot more comfortable with the "Verify" part being automated than I was even six months ago. The tech is moving so fast. We're getting the tools to hold these models accountable in real-time.
Well, before we get too confident in our own analysis, let’s wrap this up. Huge thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power the generation of this show. Their serverless infrastructure is actually a great example of the kind of "reliable, on-demand" tech we’ve been talking about today.
This has been My Weird Prompts. If you found this dive into confidence scoring useful, or if you’re currently wrestling with these systems in your own work, we’d love to hear from you. You can reach us at show at myweirdprompts dot com. We’re especially interested in hearing about your "hallucination horror stories" and how you’re using these new tools to fix them.
And if you’re enjoying the show, a quick review on your podcast app really does help us reach more people who are trying to make sense of this AI-saturated world. We’re on Spotify, Apple Podcasts, and pretty much everywhere else. Tell a friend, tell a colleague, or just tell your local AI assistant to play the show.
You can also find our full archive and RSS feed at myweirdprompts dot com. We’ve got nearly two thousand episodes now, covering everything from the philosophy of mind to the nitty-gritty of GPU architecture. It’s a deep well of weirdness.
Just don’t ask us for a confidence score on every single one of those episodes. Some of the early ones might be a bit "low rigor." We were still figuring out how to talk to these things ourselves back then!
Speak for yourself, Corn! My donkey brain has been one hundred percent certain since episode one. I’ve never had a doubt in my life!
And that, listeners, is exactly the "overconfidence bias" we were talking about. See you in the next one.
Goodbye everyone. Stay curious, and stay skeptical!