#2494: Active Prompt Engineering: Daniel's Diff-Based Loop

A deep dive into iterative prompt refinement using inter-iteration prediction change as an uncertainty signal.

Episode Details
Episode ID
MWP-2652
Published
Duration
26:16
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When Your Prompt Converges: Active Learning Meets In-Context Prompt Engineering

A listener building a dataset for structural decomposition of voice-dictated AI prompts has developed an iterative refinement loop that raises fascinating questions about active learning, uncertainty estimation, and what it means for a prompt to "converge."

The Loop

The task is straightforward: take raw transcripts of voice-dictated prompts and split them into three fields — prompts, context, and host notes. With 5,500 raw transcripts to process, Daniel starts with a 203-row slice, generating silver labels using Claude Sonnet 4.6. His training loop iterates between hand-annotating gold rows, re-running Sonnet with those as few-shot exemplars, then diffing the outputs to find which rows changed most between iterations. He annotates those next, repeating until row churn drops below 2%.

The artifact at the end isn't a fine-tuned model — it's a converged prompt plus a few-shot configuration.
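In pseudocode, the loop looks roughly like the sketch below. This is a reconstruction from the episode's description rather than Daniel's actual pipeline: `run_model` stands in for the Claude Sonnet call with the current few-shot exemplars, and `annotate_by_hand` for the manual gold-annotation step.

```python
# Rough sketch of the refinement loop as described in the episode.
# run_model(rows, gold) -> {row_id: prediction dict}; annotate_by_hand(rows) -> {row_id: gold dict}.
# Both are hypothetical stand-ins, not real APIs.
def refinement_loop(rows, run_model, annotate_by_hand, batch_size=10, churn_threshold=0.02):
    gold = annotate_by_hand(rows[:batch_size])              # seed gold exemplars by hand
    previous = run_model(rows, gold)                        # first silver pass
    queue = [r for r in rows if r not in gold]              # annotation priority queue
    while True:
        gold.update(annotate_by_hand(queue[:batch_size]))   # annotate the highest-priority rows
        current = run_model(rows, gold)                      # re-run with the expanded exemplar set
        changed = [r for r in rows if current[r] != previous[r]]
        if len(changed) / len(rows) < churn_threshold:       # Daniel's 2% churn stopping rule
            return gold, current                             # converged config plus silver labels
        # rows that just flipped go to the front of next iteration's queue
        queue = [r for r in changed if r not in gold] + \
                [r for r in queue if r not in changed and r not in gold]
        previous = current
```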

Prior Art: Active Prompt Engineering

This approach has a name: Active Prompt Engineering (APE), published at the DaSH 2024 workshop by Qian and colleagues. APE is a human-in-the-loop tool that iteratively selects the most ambiguous examples for human feedback, then transforms those into few-shot examples within the prompt. It builds on earlier work from Diao et al. (2023) called Active Prompting with Chain-of-Thought, which used uncertainty-based active learning to select which questions to annotate with chain-of-thought reasoning.

Daniel's loop is a specific instantiation of APE, but with a genuinely novel twist. APE uses the language model's own uncertainty as the sampling strategy — running the same prompt multiple times at different temperatures and measuring self-consistency entropy. Daniel uses inter-iteration prediction change: he diffs the silver outputs between iteration N and N+1, and rows that flipped their predictions are treated as high-uncertainty. This is closer to query-by-committee, except the committee is the same model at different stages of prompt refinement.
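A minimal way to operationalize that diff signal, assuming each silver output is a dict with the three fields, might look like this sketch (the scoring rule is an illustrative choice, not something specified in the episode):

```python
# Sketch of the inter-iteration diff as an uncertainty score. Predictions are
# assumed to be dicts of the three fields (prompts, context, host notes); the
# score counts how many fields flipped between iteration N and N+1.
def diff_uncertainty(prev, curr):
    scores = {}
    for row_id, new_pred in curr.items():
        old_pred = prev.get(row_id, {})
        fields = set(new_pred) | set(old_pred)
        scores[row_id] = sum(old_pred.get(f) != new_pred.get(f) for f in fields)
    return scores

def select_for_annotation(prev, curr, gold_ids, k=10):
    # the "committee" is the same model under two successive prompt configurations
    scores = diff_uncertainty(prev, curr)
    candidates = [r for r in scores if r not in gold_ids]
    return sorted(candidates, key=lambda r: scores[r], reverse=True)[:k]
```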

The Blind Spot: Consistently Wrong Predictions

The diff-based approach has a specific blind spot. Rows whose predictions change most between iterations are genuinely high-uncertainty — the model is flipping its answer as the prompt evolves. But what about rows that are consistently wrong across iterations? The model confidently predicts the wrong answer every single time. Those rows would look perfectly stable in the diff, but they're actually the most informative ones to annotate because they'd reveal systematic errors in the prompt's understanding of the task.

The diff strategy selects for instability, not for error. These overlap but aren't the same set. A hybrid approach would be more robust — diff-driven sampling for the unstable rows, plus a small random sample each iteration to catch the consistently-wrong rows that never surface through churn.
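One hedged way to implement that hybrid is to reserve a fraction of each iteration's annotation budget for a random audit; the 30% split below is illustrative, not a figure from the episode.

```python
import random

def hybrid_sample(diff_scores, gold_ids, budget=10, random_fraction=0.3, seed=None):
    """Mostly diff-driven picks, plus a few random slots to surface
    consistently-wrong rows that never show up as churn."""
    rng = random.Random(seed)
    pool = [r for r in diff_scores if r not in gold_ids]
    n_random = max(1, int(budget * random_fraction))
    by_churn = sorted(pool, key=lambda r: diff_scores[r], reverse=True)
    picked = by_churn[:budget - n_random]                           # unstable rows first
    remainder = [r for r in pool if r not in picked]
    picked += rng.sample(remainder, min(n_random, len(remainder)))  # random audit slots
    return picked
```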

Convergence and Stopping Criteria

Daniel's 2% row churn threshold is hand-wavy, as he himself concedes. Two percent of 200 rows is 4 rows; 2% of 500 rows is 10 rows. The threshold means different things at different dataset sizes. A more principled alternative is McNemar's test — a statistical test on prediction stability between iterations. You stop when you cannot reject the null hypothesis that the two iterations' predictions come from the same distribution. This gives a dataset-size-invariant stopping rule with a clear statistical interpretation.
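As a sketch of how this could look in practice: the structured output has to be binarized somehow (the `is_flagged` predicate below is a hypothetical choice, e.g. "row has a non-empty host-notes field"), and the exact McNemar test then reduces to a two-sided binomial test on the discordant pairs.

```python
from scipy.stats import binomtest

def mcnemar_converged(prev, curr, is_flagged, alpha=0.05):
    """Stop when iteration-to-iteration flips are indistinguishable from noise."""
    b = c = 0
    for row_id in curr:
        before, after = is_flagged(prev[row_id]), is_flagged(curr[row_id])
        if before and not after:
            b += 1                      # flagged -> unflagged between iterations
        elif after and not before:
            c += 1                      # unflagged -> flagged
    if b + c == 0:
        return True                     # no discordant pairs at all
    p_value = binomtest(b, b + c, p=0.5).pvalue   # exact McNemar on discordant pairs
    return p_value > alpha              # cannot reject the null -> treat as converged
```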

What's Actually Converging?

A critical distinction emerges: is the loop converging the prompt, or is it converging the selection strategy for which gold rows to include as few-shots? At 200 rows, Daniel's already well beyond what fits in a single prompt. Every iteration, some selection process is picking which annotated examples go into the few-shot configuration. The loop might be optimizing that selection rather than the prompt wording itself.

If what's converging is the exemplar selection, then the artifact isn't really a "converged prompt" — it's a converged exemplar selection heuristic plus a prompt template. And the prompt template might be relatively stable after just a few iterations, while the exemplar selection keeps evolving.
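A cheap way to see which of the two is actually stabilizing is to fingerprint the prompt template and the exemplar selection separately at each iteration and watch when each hash stops changing. The helper names below are illustrative.

```python
import hashlib
import json

def config_fingerprint(prompt_template, exemplar_ids):
    """Separate hashes for the template and the exemplar set, logged per iteration."""
    return {
        "template": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12],
        "exemplars": hashlib.sha256(
            json.dumps(sorted(exemplar_ids)).encode("utf-8")
        ).hexdigest()[:12],
    }
```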

Few-Shot Leakage and Evaluation

Daniel's instinct to carve out a held-out eval slice is correct, and it's worth being explicit about why. Using gold rows as few-shot exemplars while also evaluating on the silver predictions for non-gold rows within the same working slice is a form of data contamination. The few-shot exemplars are selected from the same distribution you're evaluating on, and worse, they're selected specifically because they improved performance on that distribution during iterative refinement.

The standard practice in in-context learning is to select examples from the training set only, tune hyperparameters on a validation set, and report final numbers on a held-out test set. If you're iteratively refining based on performance on your working slice, you're effectively doing hyperparameter optimization on your test set.

The cleanest approach: freeze the prompt after convergence, then evaluate on a held-out set that was never used as few-shot exemplars and never influenced the exemplar selection process. This gives an honest accuracy number. The in-distribution agreement on the working slice is still a meaningful signal — it tells you the loop is stabilizing — but it's not an accuracy metric you can report.
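A sketch of that protocol, with hypothetical `run_model` and `agreement` helpers: carve the test rows out up front, never let them into the exemplar pool, and score them exactly once against the frozen configuration.

```python
import random

def split_working_and_test(row_ids, test_fraction=0.2, seed=0):
    """Carve out the held-out slice before any refinement happens."""
    rng = random.Random(seed)
    shuffled = rng.sample(row_ids, len(row_ids))
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]    # (working slice, held-out test)

def final_eval(test_ids, gold_test, frozen_exemplars, run_model, agreement):
    preds = run_model(test_ids, frozen_exemplars)  # single pass with the frozen prompt config
    return agreement(preds, gold_test)             # the honest, reportable number
```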

The Economics of Active Learning

Daniel's approach changes the optimal sampling strategy compared to classical active learning. In classical active learning, re-scoring the pool was expensive because you had to re-train the model. Here, re-scoring is nearly free. The constraint is human annotation time, not compute. Each full re-run on 200 rows costs about $1-2 with prompt caching, or $0.50 on the batch API. At that price point, you could run multiple uncertainty estimation strategies in parallel — diff-based, self-consistency entropy, label injection — and compare which one actually selects the most informative rows. The experiment costs maybe $5.
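A hedged sketch of that comparison: each strategy gets one round of annotation budget on the same snapshot, and the winner is whichever round buys the largest gain in agreement against a fixed set of gold rows that were never used as exemplars. All the callables are hypothetical stand-ins.

```python
def compare_strategies(strategies, rows, gold, val_gold, run_model,
                       annotate_by_hand, agreement, k=10):
    """strategies: {name: select(rows, gold, k) -> row ids to annotate next}."""
    baseline = agreement(run_model(rows, gold), val_gold)
    gains = {}
    for name, select in strategies.items():          # e.g. diff-based, entropy, label injection
        picked = select(rows, gold, k)
        expanded = {**gold, **annotate_by_hand(picked)}
        gains[name] = agreement(run_model(rows, expanded), val_gold) - baseline
    return gains                                      # biggest gain = most informative sampler
```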

This is the kind of empirical validation you'd never do in classical active learning where re-scoring meant re-training a model on a GPU.


#2494: Active Prompt Engineering: Daniel's Diff-Based Loop

Corn
Daniel sent us this one — he's building a dataset for structural decomposition of voice-dictated AI prompts. The task is taking raw transcripts and splitting them into three fields: prompts, context, and host notes. He's got about five thousand five hundred raw transcripts, working with a two hundred three row slice, generating silver labels with Claude Sonnet four point six. And he's developed this training loop that iterates between hand-annotating gold rows, re-running Sonnet with those as few-shot exemplars, then diffing the outputs to find which rows changed most between iterations, and annotating those next. He repeats until row churn drops below two percent. The artifact at the end isn't a fine-tuned model — it's a converged prompt plus a few-shot configuration. And he wants us to dig into whether this has a name, where it breaks, and what a cleaner formalization looks like.
Herman
Before we jump in — today's script is being generated by DeepSeek V four Pro. Just putting that out there.
Corn
Alright, so there's a lot here. Seven specific questions. I want to start with the naming question because I think it's the load-bearing one. Everything else flows from understanding what this actually is in the literature.
Herman
It absolutely has prior art. The closest thing is called Active Prompt Engineering, or APE — published at the DaSH twenty twenty-four workshop by Qian and colleagues. It's a human-in-the-loop tool that iteratively selects the most ambiguous examples for human feedback, then transforms those into few-shot examples within the prompt. The paper's on arXiv, August twenty twenty-four. APE itself builds on an earlier paper from Diao and others in twenty twenty-three called Active Prompting with Chain-of-Thought, which got into ACL twenty twenty-four. That's the foundational work — it proposed using uncertainty-based active learning to select which questions to annotate with chain-of-thought reasoning, and it hit state-of-the-art on eight complex reasoning tasks.
Corn
Daniel's loop is essentially a specific instantiation of APE. But there's a genuinely novel twist here that I think deserves its own framing. APE uses the language model's own uncertainty as the sampling strategy — they run the same prompt multiple times at different temperatures and measure self-consistency entropy. Daniel's using inter-iteration prediction change. He diffs the silver outputs between iteration N and iteration N-plus-one, and the rows that flipped their predictions are treated as high-uncertainty. That's not entropy-based uncertainty sampling. That's closer to query-by-committee, except the committee is the same model at different stages of prompt refinement.
Herman
Which is a really clever computational hack. You're not running multiple inference passes at different temperatures to estimate uncertainty. You're just comparing two inference runs you already did as part of the loop. The diff is free.
Corn
You've already paid for both runs. But the point stands — you're extracting an uncertainty signal from data you already generated, rather than running dedicated uncertainty estimation passes. The question is whether that signal is actually capturing the right kind of uncertainty.
Herman
This is where the literature has something interesting to say. There's a tension in the active learning for in-context learning space. Margatina and colleagues published a paper in twenty twenty-three that found standard uncertainty sampling actually underperforms similarity-based and diversity-based methods for selecting in-context learning demonstrations — across twenty-four tasks. That was a pretty damning result for uncertainty-based approaches.
Corn
Then Unc-T. came along in twenty twenty-four — Huang and colleagues — and revived uncertainty-based selection by using language-model-specific uncertainty metrics rather than traditional active learning uncertainty measures. They run the model under three conditions: no label, right label, wrong label. Then they classify uncertainty based on output inconsistency across those three settings. And they got average accuracy improvements of three point seven percent, one point two percent, and one point nine percent over the previous best strategy for Llama two, Mistral, and GPT three point five respectively. So uncertainty sampling does work for in-context learning, but only if you use the right kind of uncertainty signal.
Herman
Which brings us back to Daniel's diff-based approach. Is inter-iteration prediction change a good uncertainty proxy? I think it's defensible but it has a specific blind spot. Rows whose predictions change most between iterations are high-uncertainty — the model is flipping its answer as the prompt evolves. That's a real signal. But what about rows that are consistently wrong across iterations?
Corn
The model confidently predicts the wrong answer every single time. Those rows would look perfectly stable in the diff, but they're actually the most informative ones to annotate because they'd reveal systematic errors in the prompt's understanding of the task.
Herman
The diff strategy is essentially selecting for instability, not selecting for error. Those overlap but they're not the same set. A hybrid approach would be more robust — diff-driven sampling for the unstable rows, plus a small random sample each iteration to catch the consistently-wrong rows that never surface through churn.
Corn
The Unc-T. approach is a more principled alternative. Label injection to probe uncertainty catches both cases — it'll flag rows where the model is confidently wrong because the output changes when you inject the wrong label. But it costs roughly three times as much per example, since you're running three inference passes. At Daniel's scale — two hundred rows — that's still trivial. Three times fifty cents on the batch API is a dollar fifty per full re-run instead of fifty cents. That's nothing.
Herman
That connects to one of Daniel's other questions — the cost dynamics. He's right that this changes the optimal sampling strategy compared to classical active learning. In classical active learning, re-scoring the pool was expensive because you had to re-train the model. Here, re-scoring is nearly free. The constraint is human annotation time, not compute. So the optimal strategy shifts toward more frequent, smaller-batch sampling rounds. You can afford to be wasteful with inference if it saves annotator hours.
Corn
Daniel mentioned each full re-run on two hundred rows costs about a dollar or two with prompt caching, or fifty cents on the batch API. At that price point, you could run multiple uncertainty estimation strategies in parallel — diff-based, self-consistency entropy, label injection — and compare which one actually selects the most informative rows. The experiment costs maybe five dollars. That's the kind of empirical validation you'd never do in classical active learning, where re-scoring meant re-training a model on a GPU.
Herman
Let me pull on the thread of where this breaks. Daniel's gut says around five hundred rows, and I think he's in the right ballpark, but the actual transition point isn't about row count — it's about prompt capacity. A prompt can hold maybe ten to twenty few-shot exemplars before you hit context limits. Once you have more gold annotations than you can fit in a prompt, the marginal value of additional few-shots drops to zero. You're not using all your gold data anymore — you're selecting which exemplars to include.
Corn
That changes the nature of what's converging. Is the loop actually converging the prompt, or is it converging the selection strategy for which gold rows to include as few-shots? Those are different things. At two hundred rows, Daniel's already well beyond what fits in a single prompt. So every iteration, some selection process — whether it's manual curation or automated — is picking which annotated examples go into the few-shot configuration. The loop might be optimizing that selection rather than the prompt wording itself.
Herman
That's a really important distinction. If what's converging is the exemplar selection, then the artifact isn't really a "converged prompt" — it's a converged exemplar selection heuristic plus a prompt template. And the prompt template might be relatively stable after just a few iterations, while the exemplar selection keeps evolving.
Corn
Which leads to the question of few-shot leakage. Daniel's instinct to carve out a held-out eval slice is correct, and I want to be explicit about why. He's currently using gold rows as few-shot exemplars and also evaluating on the silver predictions for non-gold rows within the same working slice. That's a form of data contamination. The few-shot exemplars are selected from the same distribution you're evaluating on, and worse, they're selected specifically because they improved performance on that distribution during the iterative refinement.
Herman
The standard practice in the in-context learning literature is to select examples from the training set only, tune hyperparameters on a validation set, and report final numbers on a held-out test set. If you're iteratively refining based on performance on your working slice, you're effectively doing hyperparameter optimization on your test set. The convergence looks good because you're overfitting to the slice.
Corn
The cleanest approach: freeze the prompt after convergence, then evaluate on a held-out set that was never used as few-shot exemplars and never influenced the exemplar selection process. That gives you an honest accuracy number. The in-distribution agreement on the working slice is still a meaningful signal — it tells you the loop is stabilizing — but it's not an accuracy metric you can report.
Herman
Daniel should probably carve out that held-out slice now, before the prompt converges further, precisely because the convergence is a form of fitting to the working data. The later you hold out, the more leakage has already occurred.
Corn
Let's talk about the convergence criterion. Less than two percent row churn between iterations — Daniel himself calls this hand-wavy, and he's right. Two percent of two hundred rows is four rows. Two percent of five hundred rows is ten rows. The threshold means different things at different dataset sizes, which is a red flag.
Herman
There are principled alternatives in the active learning literature. The simplest one for Daniel's setting would be a statistical test on prediction stability between iterations. Specifically, McNemar's test — you're comparing paired categorical predictions from two iterations on the same rows. You stop when you cannot reject the null hypothesis that the two iterations' predictions come from the same distribution, at say p greater than zero point zero five.
Corn
That gives you a dataset-size-invariant stopping rule with a clear statistical interpretation. If you have two hundred rows and four of them changed predictions, McNemar's test might say that's not statistically distinguishable from noise. If you have two thousand rows and forty changed, same proportion but the test might say that is significant. The threshold adapts to your sample size automatically.
Herman
There's also the S. procedure — a conservative set of stopping heuristics for active learning, published in twenty twenty-four — and older work from Tomanek and Hahn in two thousand eight on stopping when the gradient of prediction uncertainty approaches zero. But McNemar's test is the most directly applicable and the easiest to implement. You're already computing the diff between iterations. Running a statistical test on that diff is maybe three lines of code.
Corn
I want to spend some time on the question of publishing a prompt as an artifact, because this is tricky and the field is still figuring it out. Daniel's artifact isn't a model checkpoint. It's a prompt template plus a few-shot configuration plus inference parameters. How do you version that? How do you let other researchers reproduce it without re-paying the inference bill?
Herman
The reproducibility literature from twenty twenty-four and twenty twenty-five emphasizes that model drift in proprietary APIs means you cannot truly freeze the model. Even if you specify the exact model version — say, Claude Sonnet four point six — the API provider may update the model behind the same endpoint. Your prompt that worked in April twenty twenty-six might produce different outputs in July twenty twenty-six.
Corn
The emerging practice is to treat the artifact as a procedure rather than a static object. You release the final prompt template, the few-shot exemplar IDs or the exemplars themselves, the model version and inference parameters like temperature and max tokens, the full pipeline code, and ideally a hash of the prompt plus exemplar configuration. Some venues are starting to accept prompt cards analogous to model cards.
Herman
The key insight: you're not claiming the output is perfectly reproducible. You're claiming the procedure is fully specified, and anyone with API access and the same model version can run it and get approximately the same results. The hash serves as a version identifier — if you update the prompt or swap exemplars, the hash changes.
Corn
On the cost question for reproducibility — other researchers will have to re-pay the inference bill. There's no way around that with a proprietary API, unless you release the silver predictions themselves as part of the dataset, which Daniel might want to consider. If the prompt is converged and the outputs are stable, publishing the silver predictions alongside the gold annotations lets people evaluate the approach without running inference. They only need to re-run if they want to modify the pipeline.
Herman
Which actually circles back to the earlier name discussion. If Daniel publishes this, I think he should frame it as "APE with prediction-change sampling" rather than inventing a new name. The APE framework is already in the literature. The novel contribution is the sampling strategy — using inter-iteration diffs as an uncertainty proxy rather than self-consistency entropy. That's a specific, citable methodological choice, not a whole new paradigm.
Corn
"Iterative few-shot refinement with diff-driven sampling" is descriptive but it obscures the connection to active learning literature. Calling it A. with prediction-change sampling immediately tells a researcher what family of methods this belongs to and what the specific innovation is. It also makes it searchable — someone looking for active prompt engineering will find it.
Herman
Let me push on something Daniel might be getting wrong. He describes this as a "training methodology" and says the artifact is a "trained" in-context-learning prompt. But no parameters are being updated. The model weights are frozen. What's happening is iterative prompt engineering with active example selection. Calling it training implies a kind of optimization that isn't actually occurring.
Corn
That's a fair push, though I'd defend the framing a bit. The loop does produce measurable improvement in agreement with the gold annotations across iterations. The prompt-plus-exemplar-configuration is being optimized, just not through gradient descent. It's optimization through iterative refinement with human feedback — which is a kind of training in the broadest sense, even if it's not machine learning training in the technical sense.
Herman
But the distinction matters for how you evaluate it. If you call it training, people expect train-validation-test splits, learning curves, generalization metrics. If you call it prompt engineering with active sampling, the evaluation framework is different — you're measuring prompt quality and exemplar selection quality, not model performance.
Corn
Which brings us to the held-out evaluation question again. If this is prompt engineering, the held-out set evaluates the prompt's generalization. If this is training, the held-out set evaluates the trained system's generalization. The practical steps are the same — carve out a test set, freeze everything, evaluate once — but the conceptual framing changes what you're claiming.
Herman
I want to flag one more thing about the diff-as-uncertainty approach that I haven't seen discussed much. The diff captures prediction changes between iterations, but those changes could come from two sources. One is genuine model uncertainty — the prompt refinement revealed ambiguity in how to handle that row. The other is prompt perturbation — you changed the few-shot exemplars or the prompt wording, and the model's behavior shifted even on rows where it was previously confident for good reason.
Corn
Disentangling those is hard. A row that flips predictions because you added a new exemplar that changed the model's understanding of the task boundary is informative. A row that flips because you reworded the prompt in a way that shifted behavior on edge cases that were previously handled fine might just be noise. The diff doesn't tell you which is which.
Herman
The Unc-T. approach partially addresses this because it probes uncertainty within a single prompt configuration rather than across configurations. You're measuring the model's consistency under label perturbation, not its sensitivity to prompt changes. That's a cleaner uncertainty signal.
Corn
Alright, let me try to synthesize what a cleaner formalization of this loop might look like, incorporating everything we've discussed.
Herman
Go for it.
Corn
Step one: initial annotation. Annotate maybe ten to fifteen gold rows by hand — enough to seed the prompt with diverse exemplars. Step two: carve out a held-out test set immediately, before any iterative refinement. Twenty percent of the working slice, say forty rows, set aside and never used as few-shots or for sampling decisions. Step three: the iterative loop. Each iteration, you select few-shot exemplars from the current gold set — using diversity-based selection, not just the most recent annotations — and run inference on the remaining non-gold, non-held-out rows. Step four: compute multiple uncertainty signals. The inter-iteration diff is one. You could also run self-consistency checks or label-injection probes on a subset. Step five: select rows for annotation using a hybrid strategy — high-churn rows from the diff, plus a small random sample to catch consistently-wrong cases, plus maybe the highest-uncertainty rows from your other uncertainty signals. Step six: annotate and add to the gold set. Step seven: check convergence using McNemar's test between the current and previous iteration's predictions. Stop when you can't reject the null at p greater than zero point zero five. Step eight: freeze the prompt and exemplar selection strategy, then evaluate on the held-out test set exactly once.
Herman
That's a much cleaner loop. The key improvements over Daniel's current approach: held-out set from the start, statistical convergence criterion instead of an arbitrary percentage, hybrid sampling to catch consistently-wrong rows, and explicit separation of exemplar selection from prompt refinement.
Corn
I'd add: document which exemplars were used in each iteration. That lets you analyze whether the loop is actually converging the prompt or just converging the exemplar selection. If the prompt wording stabilizes after three iterations but the exemplar set keeps changing for ten more iterations, you've learned something important about where the uncertainty actually lives.
Herman
The cost of all this is still trivial. Adding label-injection probes triples the per-row inference cost for the probed subset, but if you're only probing the high-churn candidates — say twenty rows per iteration — that's sixty inference calls instead of twenty. At batch API pricing, we're talking about pocket change.
Corn
One thing we haven't touched on is whether the three-field decomposition itself — prompts, context, host notes — is the right ontology for the task. But that's probably a separate discussion. Daniel's loop is agnostic to the specific decomposition; it would work for any structured extraction task.
Herman
The methodology generalizes, yeah. Any task where you have raw text going in and structured fields coming out, and where you can define agreement metrics between iterations. The diff-driven sampling works as long as "prediction changed" is a meaningful signal, which it is for most structured output tasks.
Corn
I think the most practically useful thing Daniel could do right now is run the comparison experiment. Take his current working slice, run one iteration with his diff-driven sampling, run one iteration with self-consistency entropy sampling, run one iteration with label-injection uncertainty, and compare which approach selects rows that, once annotated, produce the largest improvement in the next iteration's agreement with gold. That's maybe ten dollars of inference and an afternoon of annotation.
Herman
Publish that comparison. The field needs more empirical work on which uncertainty signals actually work for active prompt engineering. The Margatina paper cast doubt on uncertainty sampling; the Unc-T. paper revived it with better metrics. Daniel's diff-based approach is a new point in that design space, and a head-to-head comparison would be useful.
Corn
Alright, let's land the plane on the seven questions. One: yes, this has a name — it's a variant of Active Prompt Engineering with a novel prediction-change sampling strategy. Two: it breaks when your gold set exceeds prompt context capacity, which forces exemplar selection, which changes what's actually converging. The transition is smooth, not a hard threshold, but Daniel's five hundred row gut feel is reasonable as a rough upper bound for when fine-tuning starts looking appealing. Three: yes, carve out a held-out set now. The in-distribution agreement signal is meaningful for monitoring convergence but not for reporting accuracy. Four: the diff is defensible as an uncertainty proxy but has a blind spot for consistently-wrong rows. Hybrid sampling or label-injection probes fix that. Five: replace the two percent threshold with McNemar's test. Six: publish it as a procedure — prompt template, exemplar set, model version, inference parameters, pipeline code, and a configuration hash. Seven: cheap re-scoring means you should sample more aggressively and run more iterations with smaller annotation batches. The constraint is human time, not compute.
Herman
That's a solid synthesis. The one thing I'd add is that Daniel should look at the APE paper's self-consistency-based sampling — running the same prompt multiple times at different temperatures — as a complementary uncertainty signal. It's more expensive than the diff but cheaper than full label injection, and it's the standard approach in the APE framework. Running all three in parallel on a small subset would give him empirical data on which signal actually predicts annotation value for his specific task.
Corn
Now: Hilbert's daily fun fact.
Herman
The collective noun for a group of porcupines is a prickle.
Corn
If you're listening and working on something similar — active prompt engineering, few-shot optimization, structured extraction from transcripts — the actionable takeaways are pretty clear. Carve out your test set immediately. Use a statistical convergence criterion. Hybridize your sampling strategy. And frame your work in terms of the existing APE literature rather than inventing new terminology. The field moves faster when we build on each other's naming conventions.
Herman
Also, run the cheap comparison experiments. When re-scoring your entire pool costs fifty cents, you can afford to be empirical about which uncertainty signal actually works best for your specific task. That's a luxury classical active learning researchers never had, and it'd be a shame not to use it.
Corn
One open question I'm left with: as prompt context windows keep growing — we're seeing models with hundred-thousand-token contexts now — does the transition point where you need fine-tuning keep moving outward? If you can fit fifty few-shot exemplars in a prompt instead of fifteen, does that push the viable range of this approach to a thousand rows? There's probably a scaling law hiding in there.
Herman
The limiting factor becomes the model's ability to attend effectively across that many exemplars, not the raw context capacity. There's some evidence that in-context learning performance degrades when you pack too many exemplars into the prompt, even if they technically fit. But that's an empirical question for another episode.
Corn
Thanks to our producer Hilbert Flumingtop. This has been My Weird Prompts. Find us at myweirdprompts dot com or wherever you get your podcasts.
Herman
If you enjoyed this, leave us a review — it helps other people find the show.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.