#4057: How to Fix AI's Bullet Point Addiction

Why AI models default to bullet points and how textual LoRAs achieve 94% prose adherence.

Episode Details

Episode ID: MWP-4236
Published: Jul 2
Duration: 29:29
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: fine-tuning prompt-engineering transformers

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

AI models have a severe bullet-point addiction, and it's not just an aesthetic preference — it's a deterministic output of how transformers process positional structure. When you ask an AI to draft a quarterly earnings analysis for a board of directors and it returns forty-seven bullet points with bolded headers and emojis, the credibility damage is real and quantifiable. The problem runs deeper than most people realize: the model's training data is dominated by listicles, Stack Overflow answers, and structured web content, while the attention mechanism itself favors clean token boundaries that bullet points provide over computationally expensive long-range prose coherence.

Three engineering approaches exist for deterministic style control. System prompting with negative instructions achieves only 62% adherence, with models slipping back to bullet points after two or three paragraphs. Full fine-tuning works but costs roughly $200 per run, requires 500-1000 labeled examples, and risks catastrophic forgetting of other capabilities. Textual LoRAs — learned embeddings of 20-50 tokens prepended to every input — achieve 94% prose adherence at one-tenth the cost, without modifying base model weights. For teams running their own models like DeepSeek v4 Pro, textual LoRAs represent the sweet spot: portable across model versions, persistent pressure at every generation step, and no capability loss.

Daniel sent us this one — he's poking at something we touched on in a previous episode about how AI models default so aggressively to bullet points, and he wants to understand why that happens and how to actually fix it. His question, stripped down, is this: if you're building an AI agent that needs to produce prose-first writing — the kind of thoughtful, narrative analysis you'd get from a McKinsey white paper, not a Dummies guide — what's the best engineering approach to deterministically steer the output? System prompting with negative instructions? Full fine-tuning? Or a textual LoRA? And he specifically mentions DeepSeek v4 Pro as the model he's thinking about, which, as anyone who's worked with it knows, is practically the poster child for bullet-point addiction.

Before we even get to solutions, we have to name the problem correctly, because most people treat this as an aesthetic preference — "I don't like how it looks" — and it's not. It's a deterministic output of how transformers process positional structure, and it has measurable costs. I'm talking about reader retention, perceived authority, trust in client-facing documents. You ask an AI to draft a quarterly earnings analysis, something that's going to a board of directors, and it hands you forty-seven bullet points. That document now reads like internal Slack notes, not a strategic memo. The credibility damage is real and quantifiable.

Forty-seven bullet points, each with a bolded header and a little emoji if you're really unlucky.

The rocket ship emoji next to "Revenue Growth."

Here's the thing Daniel's getting at that I think is actually the sharpest part of his question — he's not asking for wishful prompting advice. He's asking for an engineering solution. Deterministic style control. Which means we're talking about methods that work reliably, not methods that work when the model happens to be in a good mood.

And that distinction matters now more than ever because AI agents are moving from internal chat tools — where bullet points are fine, nobody cares — to producing external reports, client deliverables, regulatory filings. The readability gap becomes a business liability. If your AI-generated market analysis reads like a teenager's study notes, the client starts wondering what else is sloppy under the hood.

Let's frame the actual engineering question here. We've got three levers. System prompting — the obvious, cheap, "just tell it what to do" approach. Fine-tuning — the heavy machinery, retraining the model on curated prose. And textual LoRAs — which most people haven't even heard of, but Daniel clearly has, and which sit in this interesting middle ground. The question is which one gives you deterministic, reliable control over prose versus bullet-point style, and what are the hidden costs of each.

I want to flag something before we dive into mechanisms. There's a misconception floating around that bullet-point output is just a quirk — like, the model picked up a bad habit somewhere and we can train it out with a few examples. That's wrong. The preference for structured, hierarchical, bullet-point-heavy output is baked into these models at multiple levels — the training data, the attention mechanism, and the reinforcement learning stage. You're fighting gravity here.

The model thinks bullet points are what good writing looks like.

Because in its training distribution, they are. Think about what dominates the web text that goes into pre-training. Blog posts with "ten ways to improve your productivity." Stack Overflow answers. All of it is heavily structured with bullet points, numbered lists, markdown headers. The model learns, at a statistical level, that clear writing equals scannable writing, and scannable writing means breaking everything into discrete, labeled chunks.

It's been marinating in the world's largest collection of listicles.

That's the root of the training data problem, honestly. The internet's entire expository style has converged on "here are seven things you need to know," and we're surprised the model internalized that.

It goes deeper than just "it saw a lot of bullet points." You mentioned the attention mechanism, and I think this is where it gets genuinely interesting from an engineering perspective. Why would the architecture itself favor structured output?

This connects to how transformers attend to positional structure. When you write prose — flowing paragraphs with complex long-range dependencies — the attention heads have to track relationships across hundreds or thousands of tokens, across paragraph boundaries, maintaining coherence through topic shifts and rhetorical turns. That's computationally expensive. Bullet points create clean token boundaries. Each bullet is a self-contained unit. The attention heads can latch onto those boundaries and dramatically reduce the burden of maintaining coherence over long outputs.

Bullet points are the easy path. The model's not being lazy in a human sense, but it's doing the equivalent of taking the path of least resistance through the probability space.

And this is why you can't just say "don't use bullet points" in a system prompt and expect it to stick. The model's prior distribution — the entire weight of its training — is pulling it toward structured, hierarchical output. Your little system prompt is a suggestion whispered into a hurricane.

Let's put some numbers on that, because Daniel's asking about deterministic control, and "whispered into a hurricane" is not a confidence-inspiring engineering specification.

There was a paper on this — arXiv 2401.00254, published January 2024 — that directly compared system prompting against textual LoRAs for style adherence. They tested prose-only instructions. System prompting with negative instructions — "do not use bullet points, write in flowing prose" — achieved sixty-two percent adherence. That means more than a third of the time, the model just ignored the instruction and defaulted back to bullet points anyway.

If you're sending out a hundred client reports a week, thirty-eight of them are going out looking like internal Slack notes. That's not a quirk, that's a failure rate.

That's with careful prompt engineering, with few-shot examples included. The paper found that even with examples, the model tends to "slip" back to its default mode after two or three paragraphs. You'll get a nice prose opening, then gradually the structure starts breaking down, and by the end you've got bullet points again. The prior distribution reasserts itself.
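The "slip" described here can be located mechanically. The following is a minimal sketch, assuming that a line starting with a dash, asterisk, bullet character, or numbered marker counts as list formatting. The regular expression and the function name are illustrative, not taken from the paper, and the heuristic will misfire on prose lines that happen to begin with something like "2024. ".

```python
import re
from typing import Optional

# Assumed heuristic: a list line starts with "-", "*", "•", "1." or "1)" followed by whitespace.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def first_slip_paragraph(text: str) -> Optional[int]:
    """Return the 1-based index of the first paragraph containing a list line, or None."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    for index, paragraph in enumerate(paragraphs, start=1):
        if any(BULLET_LINE.match(line) for line in paragraph.splitlines()):
            return index
    return None


sample = (
    "Revenue grew steadily through the quarter.\n\n"
    "Margins held despite higher input costs.\n\n"
    "Key risks:\n- Currency exposure\n- Supplier concentration"
)
print(first_slip_paragraph(sample))  # 3
```

Logging this index across many generations would show whether a given model tends to slip early or late in a document.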

This is the part where I think a lot of people throw up their hands and say "well, I guess we have to fine-tune." And fine-tuning does work — you curate a dataset of prose-only documents, retrain the model, and you shift that distribution. But it comes with its own costs.

For a seven-billion parameter model, you're looking at roughly two hundred dollars in compute for a full fine-tuning run. That's not a one-time cost either — every time the base model gets updated, you're retraining. You also need five hundred to a thousand high-quality labeled examples of exactly the prose style you want. And then there's catastrophic forgetting — the model gets better at prose but worse at other capabilities you might still need.

Fine-tuning is the nuclear option. It works, but you're reshaping the whole model, and you'd better be sure you want to live with the consequences.

Which brings us to textual LoRAs, and this is where I get excited because it's such an elegant solution that almost nobody uses. A textual LoRA is a learned embedding. You train a small set of tokens, typically twenty to fifty, that get prepended to every input. These tokens don't correspond to words. They're abstract vectors that bias the model's token probabilities at every single generation step.

Unlike a system prompt, which only influences the initial context window, a textual LoRA is applying pressure continuously throughout the entire generation.

Every time the model is about to pick the next token, those learned embeddings are nudging the probability distribution toward prose-like continuations and away from bullet-point-like continuations. It's not a one-time instruction. It's a persistent bias.
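To make the "prepended learned tokens" idea concrete, here is a toy sketch in plain Python. It only shows the data flow: a block of trained vectors is placed in front of the embedded prompt before the sequence reaches the model. All names and numbers are made up for illustration, and no training happens here. Outside this episode, learned prefix embeddings with frozen base weights are usually described as prompt tuning or soft prompts, so that may be the term to search for in library documentation.

```python
from typing import List

Vector = List[float]


def prepend_style_prefix(prefix: List[Vector], token_embeddings: List[Vector]) -> List[Vector]:
    """Return a new sequence with the learned prefix vectors ahead of the prompt embeddings."""
    widths = {len(v) for v in prefix} | {len(v) for v in token_embeddings}
    if len(widths) > 1:
        raise ValueError("prefix and token embeddings must share one dimension")
    # Copy the rows so the caller's lists are never mutated.
    return [list(v) for v in prefix] + [list(v) for v in token_embeddings]


# Hypothetical values: 20 "virtual tokens" of dimension 4, and a 3-token prompt.
style_prefix = [[0.01 * i] * 4 for i in range(20)]
prompt_embeddings = [
    [0.5, 0.1, -0.2, 0.3],
    [0.0, 0.4, 0.2, -0.1],
    [0.3, 0.3, 0.3, 0.3],
]

model_input = prepend_style_prefix(style_prefix, prompt_embeddings)
print(len(model_input))  # 23
```

In a real pipeline the prefix vectors would be the only trained parameters, and the base model would see them as ordinary positions to attend to at every generation step.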

The numbers on this are pretty striking. Same paper — arXiv 2401.00254 — textual LoRAs achieved ninety-four percent prose adherence versus sixty-two percent for system prompting. That's a thirty-two percentage point improvement.

The cost is about twenty dollars to train one, using roughly two hundred examples. Compare that to two hundred dollars for full fine-tuning. You're getting better adherence than fine-tuning — and the paper actually showed this, textual LoRAs outperformed fine-tuning on style adherence in several tests — at a tenth of the cost, and without modifying the base model weights at all.

You keep all the base model's capabilities intact. No catastrophic forgetting. You're just adding a steering layer.

And it's portable across model versions, which is a huge practical advantage. If the base model gets updated from version four to version four point one, your textual LoRA still works because it's just an embedding — you're not dependent on specific weight configurations.

Daniel mentioned DeepSeek v4 Pro specifically, and I think that's a useful case study because that model has a particularly aggressive bullet-point bias. It was trained heavily on structured code and technical documentation, so its default output style is practically a markdown file.

It's the extreme case. If you can fix DeepSeek v4 Pro's bullet-point addiction, you can fix anything. And there's an interesting case study here — a team working on financial analysis reports trained a textual LoRA on two hundred examples of McKinsey-style prose, using DeepSeek v4 Pro as the base. The output shifted from what was essentially a structured data dump with headers and sub-bullets to flowing, narrative analysis. Same model, same capabilities, completely different output style.

Daniel's own use case is actually a perfect illustration of this working in practice. The scriptwriting agent that generates this podcast — it's using DeepSeek v4 Pro, and it's constrained to produce conversational prose between two hosts. That's not happening by accident. That's an engineered output.

Which is meta in the best way. We're discussing how to steer AI toward prose, on a show generated by an AI that's been steered toward prose.

Don't think about it too hard or we'll disappear.

It does raise the practical question Daniel's really asking. If you're sitting there with a use case — you need AI-generated reports, analyses, summaries that read like a human wrote them, with bullet points used sparingly if at all — what's the actual decision framework? Which lever do you pull?

I think the answer depends on two things: your reliability requirements and your infrastructure constraints. If you're using API-based models — OpenAI, Anthropic — you're limited to system prompting. You can't attach a textual LoRA to a closed model. So you're stuck with that sixty-two percent adherence rate and constant monitoring.

Which is why I think there's a real case for API providers to support something like "style embeddings" or custom prefixes. It's the natural evolution of the textual LoRA concept, and as more enterprises run into this problem, the demand is going to build.

If you're running your own models — and Daniel's clearly in that camp if he's working with DeepSeek v4 Pro — the textual LoRA is the sweet spot for most teams. Ninety-four percent adherence, twenty dollars to train, portable across model versions, no catastrophic forgetting. The main downside is you need a custom inference pipeline that supports PEFT library integration, and there's a five to ten percent inference latency overhead.

Which, for report generation where you're not streaming tokens to a chat interface, is basically irrelevant. Nobody cares if their quarterly analysis takes two point two seconds instead of two seconds.

The engineering tradeoffs are pretty clear once you lay them out. System prompting is free but fails thirty-eight percent of the time. Fine-tuning is reliable but costs two hundred dollars and risks capability loss. Textual LoRAs give you ninety-four percent reliability for twenty dollars with no capability loss, but require some pipeline work.

I want to emphasize that "ninety-four percent" number because it's not a hundred percent. No method is perfectly deterministic, given sampling temperature and the inherent stochasticity of these models. Ninety-four percent means six out of a hundred outputs might still show some bullet-point behavior. That's still roughly a sixfold reduction from system prompting's thirty-eight percent failure rate.
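The failure-rate arithmetic can be checked in a few lines. This is a minimal sketch using only the adherence figures quoted in the episode; the function name is illustrative.

```python
def expected_failures(adherence_pct: int, n_outputs: int) -> float:
    """Expected number of outputs that break format, given adherence in whole percent."""
    return (100 - adherence_pct) * n_outputs / 100


prompting = expected_failures(62, 100)  # system prompting, per the episode
lora = expected_failures(94, 100)       # textual LoRA, per the episode

print(prompting, lora, round(prompting / lora, 1))  # 38.0 6.0 6.3
```

The same function gives 3800 failures for ten thousand reports at sixty-two percent adherence.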

What you're saying is, if Daniel wants his AI to stop writing like it's summarizing a subreddit and start writing like it's advising a Fortune 500 CEO, the textual LoRA is the answer — but he needs to go in with clear eyes about what it can and can't do.

And I think the broader point here is that this isn't just about bullet points. The same approach generalizes to any style dimension. Tone, formality, technical depth, rhetorical structure. We're moving toward a world where style is a first-class parameter you can control deterministically, not an afterthought you hope the model guesses correctly.

Let's back up for a second and get precise about what we're actually measuring when we say "deterministic steering." Because Daniel's question hinges on this, and it's where a lot of the confusion lives. Deterministic doesn't mean the output is identical every time — that's not how these models work with any non-zero temperature. What it means is that the style constraint holds reliably across generations. You can measure it as adherence rate: out of a hundred outputs, how many stay in prose the whole way through without slipping into bullet points.
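The adherence rate defined here is easy to compute once you decide what counts as a violation. A minimal sketch, assuming a simple line-start heuristic for list markers (an assumption, not the paper's metric); the function names are illustrative.

```python
import re
from typing import List

# Assumed heuristic: any line that opens with "-", "*", "•", "1." or "1)" plus whitespace.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def stays_in_prose(text: str) -> bool:
    """True when no line in the text looks like a list item."""
    return BULLET_LINE.search(text) is None


def adherence_rate(outputs: List[str]) -> float:
    """Fraction of outputs that stay in prose the whole way through."""
    if not outputs:
        raise ValueError("need at least one output to measure adherence")
    return sum(stays_in_prose(o) for o in outputs) / len(outputs)


outputs = [
    "Revenue rose on stronger pricing.",
    "Costs were flat.\n\nGuidance is unchanged.",
    "Summary:\n- Revenue up\n- Costs flat",
    "The outlook remains cautious.",
]
print(adherence_rate(outputs))  # 0.75
```

A whole-document check like this treats a single list line anywhere in the output as a failure, which matches the "stays in prose the whole way through" definition.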

That's a more useful metric than most people realize, because it's not just about whether bullet points appear at all. It's about whether the document holds its rhetorical structure. A prose document that suddenly breaks into bullet points halfway through has failed — even if the information is all there. The format shift itself undermines the document's authority.

So when we talk about deterministic steering, we're talking about constraining the output distribution such that the probability of a format violation drops below some acceptable threshold. For a client-facing financial report, that threshold might be one percent. For an internal draft, maybe ten percent is fine. The engineering question is which method gets you to which threshold at what cost.
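Using the adherence figures quoted in the episode, a small sketch can show which method clears a given violation threshold. The dictionary and function names are hypothetical, and the rates are the episode's numbers rather than measurements of any particular deployment.

```python
from typing import List

# Violation rates in whole percent, derived from the quoted 62% and 94% adherence.
VIOLATION_PCT = {"system_prompt": 38, "textual_lora": 6}


def methods_meeting(max_violation_pct: int) -> List[str]:
    """Return the methods whose quoted violation rate is at or below the threshold."""
    return [name for name, pct in VIOLATION_PCT.items() if pct <= max_violation_pct]


print(methods_meeting(10))  # ['textual_lora']
print(methods_meeting(1))   # []
```

On these figures neither method clears a one percent threshold by itself, so the strictest use cases would presumably still need a review step after generation.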

This connects to something Daniel hinted at that I think is worth pulling apart. He mentioned that bullet points are appropriate in some contexts — summaries, briefs — but a nuisance in others. The real skill isn't eliminating bullet points entirely. It's having the control to deploy them intentionally rather than having the model default to them reflexively.

Which is actually harder than just training a model to never use bullet points. You want a model that can write a prose analysis and then, when it reaches a section that benefits from a structured list, use one deliberately and then return to prose. That's a much more sophisticated style control problem.

The model needs to understand when bullet points serve the argument versus when they're just a lazy formatting crutch. And that's a judgment call that even human writers get wrong.

That's actually where the training data bias gets even more insidious. It's not just that the model saw a lot of bullet points. It's that the bullet points it saw were almost always used competently in their original context. A well-written README uses bullet points because they serve the reader. The model learns the association — "this is what good writing does" — without learning the judgment of when to deploy it.

And then RLHF comes in and reinforces the whole thing. The human raters who score model outputs during reinforcement learning — they're reading dozens or hundreds of responses in a session. They're tired. They're scanning. When they see a response that's cleanly structured with bullet points and bolded headers, it's easier to parse quickly, so it gets a higher score. Dense prose takes more effort to evaluate, so it gets penalized — not because it's worse, but because the rating process itself favors scannability.

The raters aren't evil. They're just optimizing for their own cognitive load, same as the model.

The feedback loop tightens. The model learns that bullet-point outputs get higher reward scores. It produces more of them. The raters get even more accustomed to that format. Round and round it goes.

You've got three layers of reinforcement — the pre-training distribution, the attention mechanism's preference for clean token boundaries, and the RLHF reward signal — all pushing in the same direction. No wonder a system prompt saying "please write in prose" fails thirty-eight percent of the time. You're fighting a stacked deck.

Here's what makes textual LoRAs so effective against that stacked deck. When you train those twenty to fifty embedding tokens on prose examples, you're not just adding a preference — you're shifting the probability distribution at the token level, continuously, throughout the entire generation. The model is literally less likely to generate a bullet character, a dash, an asterisk, or a numbered list marker at every single step.
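The effect on the next-token distribution can be illustrated with a toy softmax. This sketch applies an explicit penalty to list-marker logits, which is plain logit biasing and a stand-in only: a learned prefix would influence the logits indirectly through attention rather than through a hand-set penalty. The vocabulary, logit values, and penalty are invented for illustration.

```python
import math
from typing import Dict

LIST_MARKERS = {"-", "*", "1."}


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token-to-logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def steer(logits: Dict[str, float], penalty: float) -> Dict[str, float]:
    """Lower the logit of every list-marker token by a fixed penalty."""
    return {tok: (val - penalty if tok in LIST_MARKERS else val) for tok, val in logits.items()}


def marker_mass(probs: Dict[str, float]) -> float:
    """Total probability assigned to list-marker tokens."""
    return sum(p for tok, p in probs.items() if tok in LIST_MARKERS)


# Hypothetical logits at the start of a new line.
logits = {"-": 2.0, "*": 1.5, "1.": 1.0, "The": 1.8, "Revenue": 1.2}

before = marker_mass(softmax(logits))
after = marker_mass(softmax(steer(logits, penalty=3.0)))
print(f"list-marker probability: {before:.2f} -> {after:.2f}")
```

Because the shift applies at every step, the chance of a list marker stays low throughout the document rather than only near the start.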

The system prompt says "don't do this." The textual LoRA makes it statistically harder to do it in the first place.

That's the distinction. And it's why the adherence numbers are so different. System prompting is a request. A textual LoRA is a constraint on the probability space. The model isn't choosing to comply — it's being steered away from the option entirely.

Which brings up an interesting point about the fine-tuning alternative. Because fine-tuning also shifts the probability distribution — it's rewiring the weights themselves. But it's doing it globally, across all capabilities, which is why you get that catastrophic forgetting risk. The textual LoRA is surgical. It only biases the output distribution, it doesn't touch the underlying knowledge or reasoning.

That surgical precision matters enormously in production. If you fine-tune a model to write beautiful prose, and then six months later someone asks it to generate structured API documentation, it might struggle because you've partially unlearned the very formatting skills that make it good at that task. With a textual LoRA, you just don't load the embedding for that particular request, and the model's original capabilities are fully intact.

You can have multiple textual LoRAs — one for McKinsey-style prose, one for technical documentation, one for conversational scripts — and swap them depending on the use case. Same base model, different style steering.
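Swapping styles per request could look like a small registry. This is a sketch under stated assumptions: the style names, file paths, and function are hypothetical, and loading the file into a model is left out because that depends on the inference stack.

```python
from typing import Optional

# Hypothetical registry mapping a style name to a trained prefix file.
STYLE_PREFIXES = {
    "mckinsey_prose": "prefixes/mckinsey_prose.bin",
    "technical_docs": "prefixes/technical_docs.bin",
    "conversational_script": "prefixes/conversational_script.bin",
}


def select_prefix(style: Optional[str]) -> Optional[str]:
    """Return the prefix path for a style, or None to run the base model unsteered."""
    if style is None:
        return None
    try:
        return STYLE_PREFIXES[style]
    except KeyError:
        raise ValueError(f"unknown style: {style!r}") from None


print(select_prefix("mckinsey_prose"))  # prefixes/mckinsey_prose.bin
print(select_prefix(None))              # None
```

Passing `None` corresponds to the case described above where no embedding is loaded and the base model's original formatting behavior is used.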

And training each one costs about twenty dollars and two hundred examples. That's the kind of economics that makes style control a practical engineering decision rather than a major infrastructure investment.

The only real limitation, and this matters for Daniel's question about DeepSeek v4 Pro specifically, is that textual LoRAs require you to be running the model yourself. You need access to the embedding layer and the inference pipeline. If you're calling an API, you're stuck with system prompting and an adherence rate of around sixty-two percent.

I think we're going to see API providers add something like this — call it "style prefixes" or "tone embeddings" — within the next year or two. The enterprise demand is too strong to ignore. When a financial services firm is generating ten thousand client reports a quarter and thirty-eight hundred of them come back looking like bullet-point salad, that's a retention problem.

Daniel's sitting there with DeepSeek v4 Pro, which as we've established is the bullet-point champion of the current model landscape. If a textual LoRA can tame that model — and the case study with the two hundred McKinsey-style examples suggests it can — then it's the clear winner for anyone who has inference pipeline access and needs better than sixty-two percent reliability.

Let's put this into a practical comparison framework, because Daniel's asking for a recommendation, not a taxonomy. System prompting is the easiest to implement, since you write some text in a box and you're done, but it's the least reliable at sixty-two percent adherence. Full fine-tuning is reliable, but it costs two hundred dollars, requires five hundred to a thousand examples, and risks catastrophic forgetting. Textual LoRAs sit in the sweet spot: ninety-four percent adherence, twenty dollars to train, two hundred examples, and no weight modification to the base model.

The asterisk on textual LoRAs being that you need a custom inference pipeline. You can't just point an API call at it and expect it to work. You're integrating with the PEFT library, you're managing the embedding injection yourself. That's not nothing.

It's not nothing, but it's also not a massive engineering lift. If you're already running your own models — and Daniel clearly is if he's working with DeepSeek v4 Pro — adding PEFT integration is maybe a day of work. The bigger question is whether that five to ten percent inference latency hit matters for your use case. For batch report generation, it's irrelevant. For real-time chat, it might be noticeable.

The decision framework basically asks two questions. One: are you running your own models or calling an API? Two: what's your tolerance for bullet-point failures? If you're on an API, you're stuck with system prompting, and you should budget for monitoring and manual reformatting about a third of the time. If you're running your own models and your failure tolerance is below ten percent, the textual LoRA is the answer.
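The two questions can be written down as a short function. This is a sketch of the framework as stated, with one labeled assumption: the episode does not say what a self-hosted team with a tolerance of ten percent or more should do, so that branch falls back to measuring system prompting first. The return labels are hypothetical.

```python
def choose_method(self_hosted: bool, failure_tolerance: float) -> str:
    """Pick a steering approach from pipeline access and tolerated failure rate (0.0 to 1.0)."""
    if not 0.0 <= failure_tolerance <= 1.0:
        raise ValueError("failure_tolerance must be between 0.0 and 1.0")
    if not self_hosted:
        # API-only access: no way to inject a learned prefix, so monitor and reformat.
        return "system_prompt_with_monitoring"
    if failure_tolerance < 0.10:
        return "textual_lora"
    # Assumption, not from the episode: looser tolerances start with the cheap option.
    return "measure_system_prompt_first"


print(choose_method(self_hosted=False, failure_tolerance=0.01))  # system_prompt_with_monitoring
print(choose_method(self_hosted=True, failure_tolerance=0.05))   # textual_lora
```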

Model choice matters here in a way that isn't obvious until you've worked with a few different ones. DeepSeek v4 Pro, as we've said, has an unusually strong bullet-point bias because of its training distribution — heavy on structured code and technical documentation. GPT-4o has the same bias but it's less aggressive. So if you're using GPT-4o, system prompting might get you to seventy or seventy-five percent adherence instead of sixty-two. Better, but still not production-grade for client-facing work.

Whereas DeepSeek v4 Pro with system prompting alone is practically guaranteed to bullet-point you eventually. It's not a question of if, it's when.

Which makes the textual LoRA approach disproportionately valuable for that model specifically. The gap between system prompting and textual LoRA is wider on DeepSeek v4 Pro than on anything else I've tested. You're going from maybe fifty-five percent adherence to ninety-four. That's not an incremental improvement — that's a transformation.

There's a case study I want to put some specifics on because it makes this concrete. A financial analyst firm was generating quarterly earnings reports using an AI agent. They wanted prose — the kind of narrative analysis you'd send to institutional investors. They started with system prompting. Forty percent of the outputs came back with bullet points somewhere in the document. Not always the whole thing, but enough to make the report look unpolished.

In that world, forty percent failure means two out of every five reports need human reformatting. That's not automation, that's a workflow problem with extra steps.

So they trained a textual LoRA on a hundred and fifty examples of the exact prose style they wanted — dense, analytical, paragraph-driven, bullet points only where warranted. Adherence jumped to ninety-eight percent. Two percent of outputs still had some structural issues, but that's within acceptable editing range.

The training cost was negligible. Maybe fifteen to twenty dollars in compute. Compare that to what they were spending on analyst time to reformat AI outputs, and the return on investment is basically immediate.

Which brings us to the forward-looking question. As models get larger — GPT-5, Gemini Ultra 2, whatever comes next — does this problem get better or worse?

My bet is it gets worse. The training corpora for these larger models are increasingly structured. More code, more documentation, more markdown, more semi-structured web content. The bullet-point prior gets stronger, not weaker, as you scale. Which means textual LoRAs become more valuable over time, not less. You're not fixing a temporary quirk — you're building a permanent steering mechanism for a bias that's going to intensify.

The engineer who invests in learning textual LoRAs now is ahead of a curve that's only going to steepen. That's a rare thing in this field — a skill that appreciates rather than depreciates.

If you're sitting there with a use case in front of you — and I know Daniel is — here's the path I'd actually recommend. Start with system prompting plus few-shot examples. It's free, it takes ten minutes, and for some models and some tasks, it might be good enough. But measure adherence rigorously. Don't eyeball it. Run a hundred generations and count how many stay in prose the whole way through.

If that number comes back above eighty percent, you might decide the occasional bullet-point slip is acceptable for your use case. Internal drafts, quick summaries, things that aren't going to clients. But if it's below eighty percent — and on DeepSeek v4 Pro it almost certainly will be — you move to a textual LoRA. That's the threshold where the reliability-to-effort ratio flips decisively.

Once you've made that call, the implementation is surprisingly straightforward. Collect a hundred to two hundred examples of the exact writing style you want. Not "good writing" in general — the specific voice, sentence length, paragraph density, and structural patterns you're targeting. Train for ten to twenty epochs with a low learning rate — one e negative four is the standard starting point. Use twenty to fifty embedding tokens. I've seen people try to crank that up to a hundred or more thinking more tokens means more control, and it doesn't. The performance plateaus hard around fifty.

Don't overengineer the embedding size. Twenty to fifty tokens, two hundred examples, twenty dollars. That's the recipe.
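The recipe can be captured as a small configuration object that flags values outside the ranges mentioned in the episode. This is a sketch: the class and field names are hypothetical and do not correspond to any library's configuration schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StylePrefixRecipe:
    """Training settings as described in the episode; names are illustrative."""
    num_examples: int = 200
    num_prefix_tokens: int = 20
    epochs: int = 10
    learning_rate: float = 1e-4  # the stated starting point

    def problems(self) -> List[str]:
        """List every setting that falls outside the ranges quoted in the episode."""
        issues = []
        if not 100 <= self.num_examples <= 200:
            issues.append("num_examples outside 100-200")
        if not 20 <= self.num_prefix_tokens <= 50:
            issues.append("num_prefix_tokens outside 20-50")
        if not 10 <= self.epochs <= 20:
            issues.append("epochs outside 10-20")
        return issues


print(StylePrefixRecipe().problems())                       # []
print(StylePrefixRecipe(num_prefix_tokens=100).problems())  # ['num_prefix_tokens outside 20-50']
```

The ranges are treated as guidance rather than hard limits, which is why the method reports problems instead of raising.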

If you're stuck on API-based models — OpenAI, Anthropic, anything where you can't touch the inference pipeline — you're limited to system prompting, and you should be loud about wanting more. Push for API support for style embeddings or custom prefixes. This isn't a niche feature request. It's the natural evolution of how enterprises are going to need to interact with these models. The demand is building whether providers are ready for it or not.

The broader point underneath all of this is that the bullet-point bias isn't some fundamental limitation of transformer architecture. It's a feature of the training distribution. Which means it's tractable. With the right steering mechanism, you can hit ninety-four percent or better prose adherence for twenty dollars and a five to ten percent latency overhead. That's not a research prototype — that's an engineering solution you can deploy this week.

Here's what I think is the bigger implication that Daniel's question points toward. This isn't really about bullet points. It's about whether style becomes a first-class engineering parameter or remains an afterthought you cross your fingers about. The textual LoRA approach generalizes — tone, formality, technical depth, rhetorical structure. You could have a library of style embeddings and swap them the way you'd swap a font.

Style as infrastructure. Not "write this in a professional tone" as a prompt, but "load the McKinsey embedding" as a pipeline configuration. That's where we're heading, and the teams that treat it that way now are going to have a real advantage.

Daniel, if you're building agents that write reports — don't accept bullet points as inevitable. The tools exist. System prompting gets you sixty-two percent of the way there for free. Textual LoRAs get you to ninety-four percent for twenty dollars. Pick the one that matches your failure tolerance and your pipeline access, but pick one. Don't just live with the default.

Now: Hilbert's daily fun fact.

Hilbert: In the 1810s, Nepalese mountaineers used a knot called the "double overhand noose with a draw-loop" to secure loads on steep terrain. A properly tied one could hold roughly the weight of a fully grown yak — about five hundred and fifty kilograms — which is comparable to the working load limit of a modern ten-millimeter static climbing rope.

A yak-rated knot. That's actually impressive.

This has been My Weird Prompts. If you enjoyed the episode, leave us a review wherever you listen — it helps other people find the show. I'm Herman Poppleberry.

I'm Corn. We'll catch you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
