#2693: Format Adherence in AI: Beyond the Benchmarks

Why your AI ignores formatting instructions and how to fix it with pipeline architecture, not model swaps.

Episode Details
Episode ID
MWP-2854
Duration
37:50
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

A developer running an automated situational report pipeline noticed a frustrating pattern: the AI model produced solid content but consistently ignored precise formatting instructions for bullet points and indentation. This exposes a capability gap that standard AI benchmarks don't measure—format constraint adherence.

Models struggle most with surface-level formatting rules rather than semantic instructions. Each additional formatting constraint compounds the probability that at least one gets dropped. Few-shot examples help when they demonstrate generalizable principles, but fail when formatting is highly specific because the model pattern-matches surface features that break with new content.

Three production approaches exist for solving this. Post-processing with deterministic parsers works when the model gets structure approximately right. Constrained decoding enforces formatting at the token level but requires specific provider support or self-hosting. The recommended approach for most pipelines is a multi-pass "writer-editor" pattern: a first pass generates content with loose formatting instructions, then a second pass handles only the mechanical reformatting according to the style guide.


#2693: Format Adherence in AI: Beyond the Benchmarks

Corn
Daniel sent us this one — and he almost apologizes for asking it, because on the surface it sounds like the most basic question in AI. Which is the best model? But the way he frames it, there's actually a gap here that nobody talks about much. He's got this automated daily situational report pipeline running at CitRepISR.com, twice a day, pulling in news about Israel's security challenges, synthesizing it into a classic CitRep format. The agentic architecture is solid — retrieval, internal synthesis, then three separate agents producing website post, podcast script, and email newsletter from the same base. Works nicely, costs almost nothing thanks to DeepSeek. The problem is the writing agent for the website specifically. He gave it very precise formatting instructions with few-shot examples — use bullet points with indented sub-bullets within each subheading, here's exactly how it should look — and the model just ignored it. Information's there, it's readable, but the style guide went out the window. So his question is, when you need extremely faithful adherence to a tightly prescribed writing format, where do you go? Is this a model selection problem, a prompting problem, a parameter problem? And does it actually require a frontier model like Opus or Sonnet, or is that overkill for what's essentially a thousand words in, thousand words out?
Herman
This is such a good question because it exposes something that I think gets glossed over in almost all the benchmark discussions. Everyone fixates on reasoning, on math, on coding, on multi-turn conversation. But pure instruction adherence for structured text generation — that's a different capability entirely, and it's not well captured by the standard leaderboards. By the way, fun fact — DeepSeek V four Pro is writing our script today.
Corn
DeepSeek, if you're listening, Daniel would like a word about his bullet points.
Herman
Alright, let's dig into this. The first thing I want to say is that Daniel's frustration is not a model quality problem in the way most people think about model quality. The information is there, the prose is fine, it's readable. The model is failing at a very specific thing, which is format constraint adherence. And this is actually a known weak point across almost all large language models, including the frontier ones. There was a really interesting paper from researchers at Johns Hopkins and Microsoft last year that looked at exactly this — they called it "instruction hierarchy adherence" — and they found that even the strongest models degrade significantly when you have multiple competing constraints in a prompt. You say "use bullet points, keep it concise, maintain this tone, indent sub-bullets like this example" — each additional constraint increases the probability that at least one of them gets dropped.
Corn
It's not that the model is being lazy or the model is dumb. It's that the probability of dropping any given constraint compounds.
Herman
And it's worse when the constraints are about surface-level formatting rather than semantic content. Models are much better at "write in the style of a formal military brief" than they are at "use exactly this bullet indentation pattern." The semantic instruction gets processed at a deeper level of the network. The formatting instruction is shallower, and it's more easily overwritten by the model's default generation patterns.
Corn
Which makes me think — Daniel mentioned he's using DeepSeek for this, and he wonders whether switching to Opus or Sonnet would fix it. What does the actual data say about format adherence across models?
Herman
Here's where it gets interesting. Anthropic has been fairly transparent about their internal evaluations on what they call "instruction following" as a distinct metric. In Claude's system card and their model evaluations, they break this out separately from general capability scores. Claude Opus four scores extremely high on complex instruction following — it's one of the things they specifically optimized for. But — and this is the crucial nuance — the advantage is most pronounced on instructions that require reasoning about the instruction itself. Like "summarize this, but only include points that would be relevant to someone who already read the previous report." That kind of thing. For pure formatting constraints — "use this exact bullet structure" — the gap narrows considerably.
Corn
Paying for Opus tokens might not actually solve Daniel's problem.
Herman
I think it's unlikely to be a silver bullet. It might reduce the failure rate from, say, thirty percent to fifteen percent, but it's not going to eliminate it. And for a pipeline that runs twice a day unattended, fifteen percent failure rate is still way too high. You'd be spot-checking constantly.
Corn
Let me push on something. Daniel mentioned he did few-shot examples — he hand-edited a few paragraphs to show the model exactly what he wanted, then did a backfill. And it worked on the backfill. Then on new data, it ignored the formatting. That pattern is really telling.
Herman
And it points to something about how few-shot learning actually works in these models. Few-shot examples are most effective when they demonstrate a pattern that the model can generalize as a semantic rule. "When the topic shifts, start a new section with a bolded header." The model can extract that as a principle. But when the formatting is highly specific — "use a dash then a space for first-level bullets, an indented asterisk for second-level bullets, and always put a line break before sub-bullets" — the model isn't really learning a generalizable rule. It's pattern-matching the surface features of your examples, and when the new content has different surface features — different sentence lengths, different numbers of sub-points, different topic words — the pattern match breaks.
Corn
That's a really useful distinction. The model learns the vibe of what you want, not the exact typography.
Herman
And this is where I think Daniel's instinct about the architecture being solid is correct, but the solution might not be at the model selection level. It might be at the pipeline level.
Corn
Okay, walk me through that. What does a pipeline-level solution look like for format adherence?
Herman
There are a few approaches that people are using in production, and they have different trade-offs. The first and simplest is post-processing. If your format is highly structured — bullet points with specific indentation, specific characters for different levels — you can write a deterministic parser that takes the model's output and reformats it to match your style guide exactly. This works when the model gets the structure approximately right but messes up the details. Like if it uses asterisks instead of dashes, or it forgets to indent sub-bullets. A regex or a simple state machine can fix that.
Corn
That's the "good enough content, fix the wrapping" approach.
Herman
And for Daniel's use case — a CitRep with predictable subheadings and bullet structures — this is actually very feasible. You know the sections are going to be Iran Diplomacy, China, whatever the topics are for that cycle. You can parse on those headers, then within each section, detect what looks like bullets and normalize them.
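A minimal sketch of that normalizer, assuming the house style is dash bullets at the top level and two-space-indented asterisk sub-bullets (the markers and the section names are placeholders, not Daniel's actual style guide):

```python
import re

KNOWN_HEADERS = {"Iran Diplomacy", "China"}  # placeholder section names

def normalize_bullets(text: str) -> str:
    """Rewrite whatever bullet markers the model used into the house style."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        if stripped in KNOWN_HEADERS:        # leave section headers untouched
            out.append(stripped)
            continue
        m = re.match(r"[-*•]\s+(.*)", stripped)
        if m:                                 # any bullet marker gets normalized
            content = m.group(1)
            out.append(f"  * {content}" if indent > 0 else f"- {content}")
        else:
            out.append(line)
    return "\n".join(out)
```

A fuller version would be a small state machine that tracks the previous line's type, but even this much catches the asterisk-versus-dash and indentation slips.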
Corn
That assumes the model produces something parseable. What if it just dumps everything into a paragraph despite the instructions?
Herman
That's the failure mode where post-processing can't save you. If the model ignores the bullet instruction entirely and gives you a wall of text, no amount of regex is going to restructure that. For that case, you need either a second pass or a different approach entirely. And this is where the second strategy comes in — constrained decoding.
Corn
Alright, explain constrained decoding for people who haven't used it.
Herman
Constrained decoding is when you restrict the model's token generation to only tokens that match a predefined grammar or schema. Instead of the model freely generating whatever token it wants, you give it a set of rules — "at this point in the output, you can only generate a dash followed by a space, or a newline followed by an indented asterisk." The model still chooses which content to write, but the formatting is enforced at the token level. It literally cannot produce incorrectly formatted output because those tokens aren't in the allowed set.
Corn
That sounds like it solves the problem completely. What's the catch?
Herman
The catch is that it's not uniformly available across all model providers. OpenAI has structured outputs with JSON schema enforcement, which is great if your output format is JSON. But for freeform text with specific formatting rules, it's trickier. Some inference engines like vLLM and llama.cpp support grammar-based sampling, where you define a formal grammar — often in something like GBNF or EBNF notation — and the sampler enforces it. But this requires you to be running the model yourself, either self-hosted or through a provider that exposes grammar constraints.
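For illustration, a toy grammar-constrained setup via llama-cpp-python; this assumes self-hosting, and the model path, grammar, and prompt are all stand-ins rather than a tested configuration:

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: sections of a header line followed by dash bullets,
# each optionally followed by indented asterisk sub-bullets.
BULLET_GBNF = r"""
root    ::= section+
section ::= header bullet+
header  ::= [A-Z] [^\n]* "\n"
bullet  ::= "- " [^\n]+ "\n" sub*
sub     ::= "  * " [^\n]+ "\n"
"""

llm = Llama(model_path="model.gguf")            # hypothetical local model
grammar = LlamaGrammar.from_string(BULLET_GBNF)

result = llm(
    "Write today's CitRep section on Iran Diplomacy as bullets:",
    grammar=grammar,   # sampler may only emit tokens that match the grammar
    max_tokens=512,
)
print(result["choices"][0]["text"])
```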
Corn
Daniel's using DeepSeek through an API, if I'm understanding his pipeline correctly. Does DeepSeek's API offer constrained decoding?
Herman
Not in the way that would solve this problem directly. DeepSeek's API has some structured output capabilities, but they're oriented toward JSON mode and function calling, not toward enforcing arbitrary text formatting grammars. For what Daniel needs — "every response must follow this exact bullet formatting pattern" — he'd either need to move to a provider that supports grammar-constrained sampling, or self-host a model where he can control the inference parameters directly.
Corn
Which brings us to the third approach, which I suspect is where you're heading — a multi-pass pipeline.
Herman
You know me too well. And this is actually the approach I'd recommend for Daniel's specific setup because it doesn't require switching providers or self-hosting. Here's the idea. You keep the first agent that produces the internal CitRep — that's working fine. Then instead of having one agent that writes the website post and is responsible for both content quality and formatting, you split that into two passes. The first pass generates the content with a looser formatting instruction — "organize this into sections with bullet points, don't worry about exact formatting." The second pass takes that output and its only job is to reformat it according to the style guide.
Corn
You're separating the semantic task from the typographic task.
Herman
And this works because the second agent has a much simpler job. It doesn't need to think about what information to include or how to phrase things. It just needs to take existing text and apply formatting rules. That's a much easier instruction to follow, and the failure rate drops dramatically.
Corn
Have you seen actual numbers on this? Because it sounds logical, but I'm curious if it holds up in practice.
Herman
I don't have a published paper to cite on this exact setup, but I've talked to enough developers building production pipelines to say this is emerging as a best practice. The pattern is sometimes called "writer-editor" or "drafter-polisher." You use a fast, cheap model for the drafting — which is what Daniel's doing with DeepSeek — and then a second pass, potentially with the same model or a different one, that only handles the polishing. And here's the key insight: for the polishing pass, you can use a much more prescriptive prompt because you're not asking the model to be creative or to synthesize information. You're giving it very mechanical instructions.
Corn
"Here is text. Reformat it so that every bullet point uses a dash, sub-bullets are indented with two spaces and use an asterisk, and there is exactly one blank line between sections." That's a much easier ask than "read these ten news articles and produce a situational report.
Herman
And you can even make the polishing pass idempotent — you can run it multiple times without degrading the output, because it's just applying formatting rules to already-written content.
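A minimal sketch of that writer-editor split against an OpenAI-compatible API; the base URL, model name, and style-guide wording are placeholders rather than Daniel's actual configuration:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

STYLE_GUIDE = (
    "Every bullet starts with '- ', sub-bullets are indented two spaces and "
    "start with '* ', and there is exactly one blank line between sections."
)

def write_report(citrep: str) -> str:
    # Pass 1: content only, loose formatting instruction.
    draft = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{
            "role": "user",
            "content": "Turn this CitRep into a website post organized into "
                       f"sections with bullet points:\n\n{citrep}",
        }],
    ).choices[0].message.content

    # Pass 2: mechanical reformatting only, no new content allowed.
    return client.chat.completions.create(
        model="deepseek-chat",
        messages=[{
            "role": "user",
            "content": "Reformat the text below to follow this style guide "
                       "exactly. Do not add, remove, or rephrase anything.\n\n"
                       f"Style guide: {STYLE_GUIDE}\n\nText:\n{draft}",
        }],
    ).choices[0].message.content
```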
Corn
Let me play devil's advocate for a second. Daniel's whole architecture is built around efficiency — one internal synthesis, then three specialized output agents. Adding a fourth agent per output format means more API calls, more latency, more cost. Is the juice worth the squeeze?
Herman
It's a fair question. But let's look at the actual economics. Daniel said the internal CitRep is about a thousand words, and each output is also about a thousand words. With DeepSeek's pricing, we're talking about fractions of a cent per generation. Even if you double the number of API calls for the writing step, you're adding maybe a tenth of a cent per report. Twice a day. We're talking less than a dollar a month in additional cost. The real cost is the engineering time to set up the second pass and the latency — each additional API call adds maybe two to five seconds. For a twice-daily report that's not user-facing in real time, that's trivial.
Corn
The cost argument doesn't really hold up. What about the latency? If Daniel's pipeline runs on a schedule, an extra few seconds is meaningless.
Herman
And here's another thing worth mentioning. There's a parameter that Daniel hinted at but didn't explicitly name, and it might be part of what's tripping him up.
Corn
He mentioned playing around with temperature. What's the relationship between temperature and format adherence?
Herman
It's not straightforward, which is why people get confused. Lower temperature makes the model more deterministic — it's more likely to pick the highest-probability token at each step. For factual content, lower temperature generally means fewer hallucinations and more consistency. But for formatting, the effect is more subtle. At very low temperature — like zero or point one — the model can get stuck in repetitive patterns. If it starts generating a paragraph format, it might keep generating paragraphs even when you wanted bullets, because the low temperature makes it harder to "break out" of the pattern. At higher temperatures, the model is more willing to switch formats mid-generation, but it's also more likely to produce inconsistent formatting.
Corn
There's a sweet spot.
Herman
There is, and it varies by model and by task. For structured generation like this, I've seen people have good results with temperatures around point three to point five. High enough to allow format transitions, low enough to maintain consistency once the format is established. But temperature alone won't solve the problem Daniel's describing.
Corn
Let's circle back to the model selection question, because I think that's what Daniel was really asking, and we've talked around it a bit. If he's going to try a different model for the writing agent, where should he look?
Herman
Let me give a more direct answer. For pure format adherence in structured text generation, based on what I've seen in benchmarks and from developer reports, Claude Sonnet is probably the strongest option that doesn't require self-hosting. It scores very well on instruction following, and Anthropic has put specific effort into making it reliable for structured outputs. Opus is even better but probably overkill for a thousand-word reformatting task. GPT-4o is comparable to Sonnet on this specific capability — it's very strong on following explicit formatting instructions, though I'd give a slight edge to Sonnet based on the most recent evaluations I've seen.
Corn
Where does DeepSeek fall?
Herman
DeepSeek is extremely capable for its cost, but format adherence is one of the areas where the gap between DeepSeek and the frontier models is most noticeable. DeepSeek is great at reasoning, at code generation, at multilingual tasks. But when you give it very precise formatting instructions with multiple nested constraints, it's less reliable than Claude or GPT-4o. That's not a knock on DeepSeek — it's a reflection of where the training priorities were. You optimize for what you measure, and format adherence to arbitrary style guides is not a heavily benchmarked capability.
Corn
The straightforward answer is: try Sonnet for the writing agent, see if the adherence improves. But the smarter answer might be: keep DeepSeek for the drafting because it's cheap and good enough, then add a polishing pass with whatever model you want, and that second pass will be much more reliable regardless of which model you use.
Herman
That's exactly where I land. And I'd add one more thing. Daniel mentioned he did few-shot examples by hand-editing paragraphs. That's good practice, but there's a technique that I think works better for format adherence specifically. Instead of showing the model "here's a correctly formatted example," show it the same content in both wrong and right formats, and explicitly say "this is wrong, this is right." Negative examples are incredibly powerful for format constraints because they teach the model what not to do, which is often more informative than only showing what to do.
Corn
You'd pair the original wall-of-text paragraph with the hand-edited bullet-point version and say "transform from this to this."
Herman
And you can do this programmatically. Take a few of your existing reports, deliberately mess up the formatting in specific ways — remove the indentation, change the bullet characters, merge sub-bullets into the parent bullet — and then pair each messed-up version with the correct version. That gives the model a much richer signal about what the formatting rules actually are.
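A sketch of that corruption-and-pairing step; the specific corruptions are just the failure modes mentioned above, applied with assumed bullet markers:

```python
import re

def corrupt_formatting(good: str) -> str:
    """Deliberately break the formatting in the ways the model tends to."""
    bad = re.sub(r"^- ", "* ", good, flags=re.M)     # wrong bullet character
    bad = re.sub(r"^  \* ", "* ", bad, flags=re.M)   # lost sub-bullet indentation
    bad = re.sub(r"\n\n+", "\n", bad)                # merged sections
    return bad

def build_contrastive_examples(good_reports: list[str]) -> str:
    """Pair each correctly formatted report with a corrupted copy for the prompt."""
    blocks = []
    for good in good_reports:
        blocks.append(
            f"WRONG formatting:\n{corrupt_formatting(good)}\n\n"
            f"RIGHT formatting:\n{good}"
        )
    return "\n\n---\n\n".join(blocks)
```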
Corn
It's basically contrastive learning at the prompt level.
Herman
And this technique works across models — it's not specific to any one provider. I've seen it improve format adherence substantially even on weaker models.
Corn
Let me zoom out for a second, because I think there's a bigger question lurking behind Daniel's prompt. He's built this really elegant agentic pipeline. The retrieval works, the internal synthesis works, the multi-format output architecture is clever. But he's hitting a wall on something that seems like it should be the easiest part — "just make it look right." And I think that says something important about where we are with AI in mid two thousand twenty-six. We've gotten really good at the hard stuff — reasoning, synthesis, multi-step planning. But the "easy" stuff — following a style guide, remembering to use bullet points — is still surprisingly brittle.
Herman
This is such an important point. There's a phenomenon in AI development that I've started calling the "inverse difficulty curve." The things that are hard for humans — analyzing complex geopolitical situations, writing coherent prose, translating between languages — those are increasingly well-handled by language models. But the things that are trivial for humans — "put a blank line between sections," "use consistent punctuation in your bullet points," "don't change the formatting halfway through" — those remain stubbornly unreliable.
Corn
Why is that? I have a theory, but I want to hear yours.
Herman
I think it's because language models are fundamentally semantic engines. They model meaning, not presentation. When they're trained, the loss function cares about predicting the next token correctly in terms of content. Whether that token is a dash or an asterisk for a bullet point — that's a very low-weight signal in the training objective compared to whether the token makes semantic sense. The model learns that dashes and asterisks are roughly interchangeable for bullet points, because in its training data, they are. Different sources use different formatting conventions, and the model absorbs all of them without strongly privileging any one.
Corn
The model's internal representation of "bullet point" is a fuzzy cluster of formatting options, not a crisp rule.
Herman
And when you give it a style guide, you're essentially asking it to override that fuzzy internal representation with a crisp external rule. That's a hard thing for a probabilistic system to do consistently, especially when the crisp rule conflicts with patterns that are deeply embedded in the training distribution.
Corn
This connects to something I've been thinking about. When we talk about "instruction following" as a benchmark metric, it's usually measured on things like "write a poem in the style of Shakespeare" or "summarize this article in three sentences." Those are semantic instructions. Format adherence is almost a different capability entirely — it's more like programming than writing. You're specifying exact output patterns, and the model needs to execute them precisely.
Herman
This is why I think the ultimate solution for Daniel's use case — and for a lot of production pipelines — is going to be a hybrid approach. Use language models for what they're good at, which is understanding and generating semantic content. Use deterministic systems for what they're good at, which is enforcing rules precisely. The post-processing approach I mentioned earlier, or constrained decoding, or even template-based rendering with model-generated content slotted in — these hybrid approaches are more robust than asking a single model to do everything perfectly.
Corn
Let's talk about templates for a second, because you just opened that door. Daniel's CitRep has a predictable structure — subheadings, bullet points under each subheading. Could he just use a template and have the model generate the content for each section?
Herman
And this is actually the most reliable approach if your output format is truly fixed. You define the structure in code — the headers, the bullet formatting, the section ordering — and then you prompt the model to generate only the content that goes into each slot. "For the Iran Diplomacy section, write three to five bullet points summarizing today's developments. Output only the bullet text, not the bullets themselves." Then your code wraps that text in the correct formatting.
Corn
That's almost cheating. You're reducing the model's job to pure content generation with zero formatting responsibility.
Herman
It's not cheating, it's good engineering. Use the right tool for each part of the job. The model handles semantic synthesis, your code handles presentation. And the beauty of this approach is that the formatting becomes literally impossible to get wrong — it's not generated by a probabilistic system, it's generated by deterministic code.
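A sketch of the template route, again with an OpenAI-compatible client and placeholder section names; the model only ever returns plain lines of text, and the code applies every bullet and blank line:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
SECTIONS = ["Iran Diplomacy", "China"]  # placeholder section names

def render_citrep(citrep: str) -> str:
    parts = []
    for section in SECTIONS:
        points = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{
                "role": "user",
                "content": "From the CitRep below, write three to five short "
                           f"points for the '{section}' section. One point per "
                           f"line, plain text, no bullet characters.\n\n{citrep}",
            }],
        ).choices[0].message.content
        # Formatting is applied here, deterministically, never by the model.
        bullets = "\n".join(
            f"- {p.strip()}" for p in points.splitlines() if p.strip()
        )
        parts.append(f"{section}\n{bullets}")
    return "\n\n".join(parts)
```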
Corn
The trade-off is flexibility. If Daniel ever wants to change the format — add a new section type, change the bullet style, whatever — he has to change code rather than just updating a prompt.
Herman
True, but how often does a CitRep format change? It's a standardized military intelligence format. The whole point is that it's consistent and predictable. The flexibility cost is minimal for this use case.
Corn
Alright, let me try to synthesize what we've covered, because we've gone in a few directions and I want to make sure we're giving Daniel something actionable. The problem: a writing agent in his pipeline is ignoring precise formatting instructions, even with few-shot examples. The diagnosis: format adherence is a distinct capability from semantic instruction following, and it's a weak point across most models, including DeepSeek. The solutions, in order of increasing reliability: one, try a model with better instruction following like Claude Sonnet or GPT-4o for the writing step. Two, add a second polishing pass whose only job is format enforcement. Three, use negative examples in your few-shot prompts to make the formatting rules more salient. Four, use constrained decoding if you can access it through your inference provider. And five, the nuclear option — use templates and have the model generate only content, with all formatting handled deterministically in code.
Herman
That's a great summary. And I'd add a sixth option that's somewhere between four and five in terms of invasiveness. You can use structured outputs — JSON mode — to force the model to output content in a parseable structure. Instead of asking for formatted text directly, you ask for a JSON object with fields for each section, and arrays for the bullet points. Then your code renders that JSON into the final formatted output. This gives you the reliability of deterministic formatting with more flexibility than hardcoded templates.
Corn
That's clever. And most API providers support JSON mode now, including DeepSeek.
Herman
The prompt becomes something like "output a JSON object with the following schema" rather than "format your output with these bullet conventions." The model is much better at following a JSON schema than at following typographic formatting rules, because JSON structure is a semantic constraint that's well represented in the training data.
Corn
JSON mode often comes with guaranteed valid JSON, which means you're not going to get malformed output that breaks your parser.
Herman
Though I should note that guaranteed valid JSON doesn't mean guaranteed correct schema — the model can still put the wrong fields in the wrong places. But the failure modes are much more predictable and easier to catch programmatically.
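A sketch of the JSON-mode route, combining the schema-in-the-prompt idea with a light structural check, since valid JSON is not the same as the right schema; the JSON-mode flag, the schema, and the provider details are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

PROMPT = (
    "Summarize the CitRep below as JSON with this shape:\n"
    '{"sections": [{"title": str, "bullets": [{"text": str, "subs": [str]}]}]}'
    "\n\n"
)

def generate_structured(citrep: str) -> dict:
    resp = client.chat.completions.create(
        model="deepseek-chat",
        response_format={"type": "json_object"},  # provider's JSON mode
        messages=[{"role": "user", "content": PROMPT + citrep}],
    )
    data = json.loads(resp.choices[0].message.content)
    # Valid JSON is not the same as the right schema, so check the shape too.
    if not isinstance(data.get("sections"), list):
        raise ValueError("model returned valid JSON but the wrong schema")
    return data

def render(data: dict) -> str:
    lines = []
    for sec in data["sections"]:
        lines.append(sec["title"])
        for b in sec["bullets"]:
            lines.append(f"- {b['text']}")
            lines.extend(f"  * {s}" for s in b.get("subs", []))
        lines.append("")  # exactly one blank line between sections
    return "\n".join(lines).rstrip()
```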
Corn
Let me ask you something that's been in the back of my mind. Daniel's pipeline uses DeepSeek for cost reasons — he mentioned it costs very little. If he switches to Sonnet or Opus for the writing agent, what's the actual cost difference for his volume?
Herman
Let's do the math. A thousand words in, a thousand words out. That's roughly fifteen hundred tokens in, two thousand tokens out with overhead. DeepSeek's API pricing is something like fourteen cents per million input tokens and twenty-eight cents per million output tokens. So per generation, we're talking about point zero two cents for input and point zero six cents for output — less than a tenth of a cent total. Claude Sonnet is more like three dollars per million input, fifteen dollars per million output. That same generation would be about half a cent for input and three cents for output — call it three and a half cents total.
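For anyone checking the arithmetic, a quick back-of-envelope using the hedged prices quoted above (not authoritative rate-card figures):

```python
tokens_in, tokens_out = 1_500, 2_000  # roughly 1,000 words each way, with overhead

def cost_cents(price_in_per_m: float, price_out_per_m: float) -> float:
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1e6 * 100

deepseek = cost_cents(0.14, 0.28)   # about 0.08 cents per generation
sonnet = cost_cents(3.00, 15.00)    # about 3.45 cents per generation
print(f"DeepSeek ~{deepseek:.2f} cents, Sonnet ~{sonnet:.2f} cents, "
      f"ratio ~{sonnet / deepseek:.0f}x")
```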
Corn
It's roughly fifty times more expensive.
Herman
In percentage terms, yes. In absolute terms, we're talking about the difference between running the pipeline for a dollar a month versus fifty dollars a month. For a personal project, fifty dollars a month might be noticeable. For a production system, it's noise.
Corn
If he goes with the two-pass approach — DeepSeek for drafting, Sonnet for formatting — the cost is even lower because the formatting pass has less content to process.
Herman
The formatting pass is essentially inputting a thousand words and outputting a thousand words with better formatting. At Sonnet prices, that's maybe four cents per report. Twice a day, that's about two dollars and forty cents a month.
Corn
Cost shouldn't be the deciding factor here. It's about reliability engineering.
Herman
And I think that's the broader lesson from Daniel's question. When you're building an agentic pipeline that runs unattended, reliability is the metric that matters most. A model that produces brilliant output ninety percent of the time and garbage ten percent of the time is worse than a model that produces adequate output a hundred percent of the time. For a CitRep that goes out twice a day, you need it to work every single time without human intervention.
Corn
That's the unsexy reality of production AI. Everyone wants to talk about model capabilities and benchmark scores. Nobody wants to talk about bullet point formatting. But bullet point formatting is what determines whether your pipeline actually works or whether you're spot-checking every output by hand.
Herman
This is where I think the field is going to mature a lot in the next couple of years. We're going to see better tooling for constrained generation, better evaluation frameworks for format adherence specifically, and probably model-level improvements as providers realize that reliability on "boring" tasks is a key differentiator for enterprise adoption.
Corn
One more thing before we move on. Daniel mentioned that the backfill worked — when he applied the new instructions to existing reports, the formatting was correct. But on new data, it failed. That's a really specific failure pattern. What's going on there?
Herman
I think that's a distribution shift problem. When Daniel did the backfill, the model was processing reports it had already generated. The content was in-distribution for the formatting examples he provided because he selected examples from that same set of reports. When new data came in with different topics, different sentence structures, different numbers of points per section, the model encountered content that was outside the distribution of its few-shot examples, and the formatting pattern didn't generalize.
Corn
The few-shot examples were overfit to the specific reports he used.
Herman
And this is a known limitation of few-shot prompting. The examples work best when they're representative of the full diversity of inputs the model will see. If Daniel's reports vary significantly from day to day — which they probably do, because the news cycle isn't uniform — a handful of examples might not cover the range of variation.
Corn
Which argues for either more examples covering more cases, or for moving the formatting logic out of the model entirely.
Herman
Or for the two-pass approach where the formatting pass has a much simpler job and is less sensitive to content variation.
Corn
I think we've given Daniel a pretty comprehensive answer. Let me see if I can boil it down to a recommendation. Daniel, if I were you, I'd try the JSON mode approach first, because it's the smallest change to your existing pipeline. Have your writing agent output structured JSON instead of formatted text, then render that JSON into your CitRep format with a simple script. If that doesn't give you the quality you want, try the two-pass approach with DeepSeek for drafting and Sonnet for formatting. And if you're still seeing failures, go nuclear with templates — define the structure in code and only use the model for content generation within each section.
Herman
That's a solid recommendation. And I'd add — whatever approach you choose, build a quick evaluation harness. Take ten past reports, run them through the new pipeline, and check the formatting automatically. You can write a simple script that verifies things like "every bullet starts with a dash," "sub-bullets are indented," "there's exactly one blank line between sections." If the eval passes on all ten, you're good. If it doesn't, you know exactly where to look.
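A sketch of that harness; the checks encode the same assumed style guide used in the earlier sketches, and `pipeline` stands in for whichever generation approach Daniel ends up with:

```python
import re

def check_format(report: str) -> list[str]:
    """Return a list of style-guide violations (empty list means it passed)."""
    errors = []
    for i, line in enumerate(report.splitlines(), start=1):
        if line.lstrip().startswith(("*", "•")) and not line.startswith("  * "):
            errors.append(f"line {i}: bullet does not use '- ' or '  * '")
        if line.startswith(" ") and line.strip() and not line.startswith("  * "):
            errors.append(f"line {i}: indented line is not a proper sub-bullet")
    if re.search(r"\n{3,}", report):
        errors.append("more than one blank line between sections")
    return errors

def run_eval(past_citreps: list[str], pipeline) -> None:
    """Run past reports through the candidate pipeline and report each result."""
    for n, citrep in enumerate(past_citreps, start=1):
        errors = check_format(pipeline(citrep))
        print(f"report {n}: {'OK' if not errors else '; '.join(errors)}")
```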
Corn
Automated testing for AI pipelines. What a world.
Herman
It's not glamorous, but it's what separates the pipelines that work from the ones that are constantly broken.
Corn
Now: Hilbert's daily fun fact.

Hilbert
In the eighteen eighties, a naturalist on the Isle of Lewis in the Outer Hebrides documented a species of slime mould, Physarum polycephalum, that had formed a stable cohabitation with a local species of liverwort. The slime mould would engulf the liverwort's rhizoids each night and retreat at dawn, effectively acting as a protective sheath against desiccation during the island's frequent windstorms. The liverwort survived drought conditions significantly better when the slime mould was present, making it one of the earliest documented cases of a mutualistic relationship between a protist and a bryophyte.
Corn
I have no idea what to do with that information.
Herman
Slime moulds, man. They're everywhere.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop, and to Daniel for a question that's way more interesting than "which model is best." If you enjoyed this episode, leave us a review wherever you get your podcasts — it genuinely helps. We'll be back next time with more weird prompts.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.