#4053: How to Make AI Write Prose, Not Bullet Points

Why LLMs default to lists and how to force them into flowing, professional prose.

Featuring

Listen

0:00

Episode Details

Episode ID: MWP-4232
Published: Jul 2
Duration: 22:29
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: large-language-models prompt-engineering fine-tuning

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Large language models don't just prefer bullet points — they're statistically addicted to them. The default output structure of most LLMs is a cascading list of dashes and numbered items, a pattern deeply embedded by three converging forces: training data dominated by web-scraped lists, RLHF reward models that favor scannable outputs, and attention mechanisms that find structural delimiters like line breaks and dashes to be low-entropy, low-surprise paths.

This creates a credibility problem for enterprises deploying AI to write reports, memos, and executive summaries. Expert readers interpret bullet-point formatting as pre-digested simplification — the structural equivalent of someone speaking slowly and loudly. The solution isn't negative prompting ("don't use bullet points"), which fails 30-40% of the time because negation is a weak signal in transformer architectures. Instead, effective prose steering requires positive constraints: dense system prompts that specify paragraph structure, topic sentences, and transitions, paired with few-shot examples that create new low-entropy paths for the model to follow.

For production-scale reliability, the engineering levers span a spectrum. System prompting with examples achieves roughly 60% reliability. Fine-tuning on curated datasets of dense prose can push that to 90% but requires thousands of dollars in compute and significant data curation effort. Textual LoRAs offer a lightweight, toggleable middle ground — effective for style steering without degrading general capabilities. The key insight: you can't fight the bullet-point valley by nudging toward the prose hill. You have to bulldoze a new channel.

Downloads

Episode Audio

Download the full episode as an MP3 file

Download MP3

Transcript (TXT)

Plain text transcript file

Transcript (PDF)

Formatted PDF with styling

#4053: How to Make AI Write Prose, Not Bullet Points

Last episode we gave listeners a tip — if you want to sound like a bot, lean into bullet points. Daniel's prompt this week basically grabs that tip and yanks it in the opposite direction. He wants to know why large language models default so aggressively to bullet-point output, why that's actually a real problem in professional contexts, and what engineering levers exist to force them toward prose-first writing — the kind that reads like a McKinsey white paper, not a dummies' guide. He's asking us to compare system prompting, fine-tuning, and something he calls textual LoRAs, which I'll admit I had to look up.

You looked it up because you wanted me to explain it to you before we recorded.

I looked it up because I value our friendship and didn't want you to have to do all the work.

That's very generous. And also completely false. But Daniel's question is genuinely important right now. Enterprises are deploying AI agents to write reports, memos, executive summaries — and the default output style is this relentless bullet-point cascade that reads like a first-year consultant discovered the formatting toolbar. It creates a credibility gap. Senior readers see that structure and their brains flag it as simplified, reductive, maybe even untrustworthy.

Right — it's the formatting equivalent of someone speaking slowly and loudly at you. The information might be correct, but the presentation signals "I assume you need this dumbed down." And Daniel's observation about our own scriptwriting agent is the perfect springboard here. We taught listeners how to mimic bot-like structure. Now we're asking the reverse engineering question: how do you force a bot to stop writing like a bot?

The answer turns out to be a lot more interesting than "just tell it not to use bullet points." Which, by the way, barely works. There's a real technical reason for that, and it connects all the way down to how these models process structural tokens during generation.

Where do we start — the "why" or the "how to fix it"?

We have to start with the why. Because if you don't understand why the model is pathologically addicted to bullet points, none of the fixes make sense. You'll just be throwing prompts at a wall.

Walk me through it. Why does every language model on earth seem to wake up and choose bullet points?

Three things are happening at once. First, the training data. These models are trained on enormous scrapes of the open web — Wikipedia, Stack Overflow, documentation sites, how-to guides, cooking blogs that give you a six-paragraph life story before the recipe. A huge fraction of that text uses bullet points and numbered lists as structural scaffolding. The model learns, at a statistical level, that when you're explaining something or presenting information, the natural shape of that explanation is a list.

It's not that the model "prefers" bullet points in any intentional sense — it's that the training distribution makes list structures the path of least resistance.

The second factor is RLHF — reinforcement learning from human feedback. When these models go through preference tuning, human raters consistently reward outputs that are concise, scannable, and easy to verify at a glance. Bullet points score high on all three. A dense paragraph requires actual reading. A bullet list lets the rater tick through claims quickly. So the reward model reinforces the behavior.

Which means the very process that makes these models "helpful" is also what makes them write like a rushed analyst who's terrified you'll stop reading.

The third factor is the one most people don't think about — the attention mechanism itself. Bullet points, dashes, numbered items act as structural delimiters. They reduce what you might call format entropy — the model's uncertainty about how to organize tokens. Starting a new line with a dash is a low-surprise move. The model has seen it millions of times. It's the syntactic equivalent of walking downhill.

When you type "summarize the quarterly results" and the model sees that prompt, the statistically coziest next move is a colon followed by a line break and a dash. Not because it reasoned about formatting — because the probability landscape tilts that way.

And this is why negative prompting fails so reliably. When you say "don't use bullet points," the model still activates all the same structural pathways — the dash token, the line break, the numbered list — because negation is a weak signal in transformer architectures. Studies put the failure rate on simple negation instructions around thirty to forty percent. You're asking the model to suppress its strongest statistical instinct with a "please don't.

The cost here is not just aesthetic. If you're sending an AI-generated market analysis to a board, and it arrives looking like a BuzzFeed listicle, you've lost something before anyone reads a word.

Expert readers — the kind who read McKinsey reports or white papers — expect prose that builds an argument. Topic sentence, supporting evidence, transition. Bullet points signal summarization, and summarization signals simplification. The reader's brain treats it as pre-digested.

Which sets up Daniel's real question. If the default is this deeply embedded, what levers do we actually have? He named three: system prompting, fine-tuning, and this textual LoRA approach.

They sit on a spectrum. System prompting is the quick fix — you write better instructions, you add an example, you cross your fingers. It works maybe sixty percent of the time. Fine-tuning is the nuclear option — expensive, data-hungry, and risky because you can degrade the model's general capabilities. Textual LoRAs sit in this fascinating middle ground — lightweight, toggleable, and surprisingly effective for style steering without breaking everything else.

The rest of this conversation is basically: how do we climb back up the probability hill the model keeps sliding down?

Let's start with the negative prompting failure, because it's the most counterintuitive. You'd think saying "don't use bullet points" would work. It's a simple instruction. But what's actually happening under the hood is that the model generates tokens sequentially, and by the time it reaches the structural decision point — colon, then what? — the statistical weight of every training example where a colon is followed by a line break and a dash is bearing down on it. The word "don't" in your prompt is just one more token in a sea of tokens. It gets diluted.

The model isn't disobeying. It's just that "don't" is a feather, and the training data is a freight train.

That's the metaphor. And it gets worse. When you write "don't use bullet points," you've still activated all the token pathways associated with bullet points. The model is now thinking about bullet points. You've primed the very behavior you're trying to suppress. It's like telling someone "don't think about elephants.

Which is why our scriptwriting agent works, and a naive "please write prose" prompt doesn't. The scriptwriting agent never mentions bullet points at all.

The system prompt for the scriptwriting agent does two things simultaneously. First, it gives positive prose constraints — "write in flowing paragraphs," "use conversational turn-taking," "vary sentence length." Second, it includes a few-shot example of what the output should actually look like. That dual signal is what overrides the bullet-point prior. You're not fighting the model's instincts — you're giving it a different instinct to follow.

The abstract instruction says what you want. The example shows what you want. And the model, being a pattern-matching engine, latches onto the pattern.

The pattern is specific enough that it creates a new low-entropy path. Remember, the model gravitates toward bullet points because that path is well-worn and predictable. If you provide a prose example with clear structural markers — topic sentences, transitions, paragraph breaks — you're essentially paving a new path and making it just as statistically comfortable to walk down.

That format entropy concept you mentioned earlier — I want to sit with that for a second. You're saying the model has an internal probability distribution over what comes next, and structural tokens like dashes and line breaks are heavily weighted because they've been reliable predictors in training.

Think of it as a landscape. The bullet-point valley is deep and wide. The prose hill is steep. Most prompts just nudge the model gently toward the hill. It takes one look at the valley and slides right back down. What a good system prompt with a few-shot example does is essentially bulldoze a new channel. It doesn't just nudge — it reshapes the probability surface for that specific generation.

Which explains why the before-and-after difference can be so stark. I'm imagining a typical scenario — someone prompts a model with "analyze the competitive landscape for electric vehicles." The naive output starts with a colon, then a dash, then five more dashes. It's a list of competitors with one-line descriptions. Functional, but reads like meeting notes.

Now take the same query with a structured prose constraint and an example. You get something that opens with a topic sentence — "The electric vehicle market is undergoing a structural shift from technology differentiation to manufacturing scale as the primary competitive moat." Then it builds the argument across three paragraphs, weaving the competitors into the narrative rather than listing them. Same information, completely different credibility profile.

The scriptwriting agent is the proof this works at production scale. It's not just generating a paragraph or two — it's generating four thousand words of conversational dialogue, every episode, with consistent formatting. The bullet-point impulse never breaks through because the positive constraints are so heavily reinforced.

The technical parameters there are instructive. The system prompt doesn't just say "write dialogue." It specifies turn structure, personality traits, pacing rules, prohibitions on certain patterns. It's a dense web of positive instructions, and then it includes actual examples of the desired output format. That combination creates enough steering force to overcome what would otherwise be an overwhelming statistical pull toward structured lists.

For Daniel's listener who wants McKinsey-style reports, the lesson from the scriptwriting agent is: don't tell the model what to avoid. Tell it what to build. Give it the scaffolding — topic sentence, evidence, transition — and show it a sample paragraph that embodies the target style.

That gets you maybe sixty percent reliability, which is fine for many use cases. But if you need deterministic prose style across hundreds of outputs — quarterly reports, client deliverables, anything where a bullet-point relapse would be embarrassing — that's where you start looking at the heavier engineering levers.

Fine-tuning is the one everyone thinks of first, and I understand why. You curate a dataset of, say, two hundred McKinsey-style reports — the real thing, dense prose, layered arguments, zero bullet points unless absolutely necessary — and you retrain the model on those. The output style shifts dramatically. You can hit ninety percent reliability on prose-first formatting.

Two hundred reports sounds like a lot of PDFs to hunt down and a lot of manual reformatting.

And that's before you even touch a GPU. The compute cost for full fine-tuning on a model like DeepSeek V4 Pro runs into thousands of dollars, and you need someone who actually knows how to set up the training pipeline without introducing weird artifacts. But the bigger risk is catastrophic forgetting.

Which is the technical term for "congratulations, your model now writes beautiful prose and has forgotten how to do math.

That's not even an exaggeration. When you fine-tune on a narrow stylistic dataset, you're adjusting the model's weights across the board. The prose style improves, but factual recall, reasoning, quantitative analysis — those can degrade because the training signal is optimizing for something orthogonal to correctness. You end up with a model that sounds authoritative and is subtly wrong more often.

You've traded bullet points for elegant misinformation.

Which is arguably worse. At least bullet points signal "this might be oversimplified." Beautiful prose that's factually shaky is harder to catch.

Alright, so that brings us to the third option — the one Daniel flagged that I had to look up. What are they, and why do they avoid the fine-tuning trap?

LoRA stands for Low-Rank Adaptation. It was introduced in twenty twenty-one by a team at Microsoft Research, and the core idea is elegant. Instead of updating all the model's weights during training, you train a small set of additional parameters — a lightweight adapter — that sits alongside the base model and steers its outputs. The base model stays frozen. Only the adapter learns.

The model's general capabilities are preserved because you're not touching them.

And because the adapter is small — we're talking a few megabytes for a rank-eight LoRA — you can train it on fifty to a hundred examples, on a single GPU, in a couple of hours. Then at inference time, you load the adapter alongside the base model, and it shifts the output distribution toward your target style.

DeepSeek V4 Pro supports this natively?

DeepSeek V4 Pro, released earlier this year, has native LoRA adapter support with minimal latency overhead. You can toggle the adapter on and off per request. That toggleability is a bigger deal than it sounds. You might want bullet points for an internal summary and prose for the client-facing version of the same analysis. With a LoRA, you don't need two separate models. You just flip a switch.

Walk me through what building one of these actually looks like. Daniel's listener wants McKinsey-style reports. What's the pipeline?

Step one, you collect fifty to a hundred prose-heavy reports in your target style. These don't need to be perfect — just representative. Step two, you format each one as an instruction-output pair. The instruction is something like "analyze the competitive dynamics of the semiconductor supply chain," and the output is the full prose report. Step three, you train a rank-eight LoRA on DeepSeek V4 Pro for about a hundred steps. That's maybe two hours on an A100. Step four, you deploy the adapter and test it on prompts the model has never seen.

The reliability difference versus system prompting?

System prompting alone, with a good few-shot example, gets you around sixty percent reliability — meaning six out of ten outputs stay prose-first without bullet-point leakage. A textual LoRA pushes that to about eighty-five percent. Fine-tuning can hit ninety, but at twenty times the effort and with the catastrophic forgetting risk. The LoRA is the sweet spot.

The decision framework is basically: if you're generating the occasional report and can tolerate spot-checking the output, system prompting plus a strong example is fine. If you need deterministic prose style at scale — dozens of reports a week, client-facing deliverables, anything where a formatting failure would be embarrassing — invest the twenty hours in building a LoRA.

I'd add a hybrid approach that most teams overlook. Build the LoRA for your critical outputs, but also maintain a strong system prompt as your default. The prompt handles eighty percent of use cases cheaply. The LoRA is your insurance policy for the high-stakes twenty percent. You get the best of both without paying the fine-tuning tax.

There's something philosophically satisfying about the LoRA approach too. You're not rewriting the model's personality. You're giving it a stylistic lens it can put on and take off. It's the difference between surgery and a well-tailored jacket.

If I had to give Daniel's listener a decision tree, it's three branches. Default to system prompting with a few-shot example — that's your baseline. It's fast, it's cheap, and for most internal use cases it's good enough. If you're generating client-facing reports or anything where a bullet-point relapse would make you look sloppy, build a textual LoRA. And only reach for full fine-tuning if you're also adapting domain knowledge — like teaching the model a specialized industry vocabulary alongside the prose style.

That last point matters more than people realize. Fine-tuning isn't just overkill for style — it's the wrong tool. You're rebuilding the house because you didn't like the paint color.

Paying a contractor who might knock out a load-bearing wall. So let's make this concrete. Here's a prompt template that anyone can copy and adapt.

"Write in flowing paragraphs with clear topic sentences. Use bullet points only when listing three or more discrete items. Begin each paragraph with a claim, then support it. Here is an example of the desired style:" — and then you insert a three or four sentence paragraph that embodies the prose you want. That's it. Four sentences of instruction, one example paragraph, and you've just given the model a structural alternative instead of a prohibition.

That last part is the key insight. You're not saying "don't do the thing you're statistically wired to do." You're saying "here's a different thing, it has its own shape and rhythm, do this instead." The model gets to follow a pattern either way — you've just swapped which pattern it follows.

The example paragraph does more work than the instructions. The model reads your sample and extracts the structural DNA — topic sentence, evidence, transition, repeat. It doesn't need you to explain paragraph construction. It just needs to see one.

Which circles back to something Daniel has said before about examples being crucial in prompts. Abstract instructions are a sketch. An example is a photograph. The model is better at tracing than imagining.

The practical takeaway is: spend ten minutes writing one good example paragraph in your target style. That's the highest-leverage ten minutes in the entire workflow. Everything else — the LoRA training, the dataset curation — builds on that foundation. If you can't articulate what good prose looks like in a single example, you're not ready to automate it.

There's one question I keep turning over, though. As models get better at instruction-following — and we've seen real gains in the past eighteen months — does the bullet-point bias fade on its own? Or is it baked so deep into the training distribution that no amount of alignment tuning will dislodge it?

I think it's baked. Not because the architecture demands it, but because the internet is structured that way. Every documentation page, every how-to article, every listicle — that's the training corpus. You can't alignment-tune your way out of the shape of the web.

There's a twist I don't think most people have clocked yet. We're entering this era of agentic workflows — AI writing to other AI. An analyst agent drafts a report, a summarizer agent condenses it, a briefing agent pulls out action items. In that pipeline, bullet points are actually the optimal format. They're parseable, they're structured, they minimize ambiguity for the next model in the chain.

The very thing that annoys human readers is a feature for machine-to-machine communication. The prose-versus-bullet debate might not have a single answer — it might just depend on who's reading.

Which means the skill Daniel's listener is building isn't "how to make AI write prose." It's "how to make AI write prose when the audience is human." That's a more interesting capability — contextual format switching based on the reader, not the content.

That's probably where this is all heading. Not one default style, but models that modulate format based on context. Bullet points for the summarizer agent downstream. Flowing prose for the board deck. Both from the same model, toggled by something as lightweight as a LoRA.

The LoRA becomes the stylistic gearshift. And the prompt is just the key that starts the engine.

We should land this. Next episode, we're digging into something I've been quietly obsessed with — the economics of AI inference at scale, and why the "rent versus build" decision is getting weirder by the month. Daniel's already sent the prompt and it's a good one.

I've got spreadsheets.

Of course you do. Thanks as always to our producer Hilbert Flumingtop for keeping this operation running.

Now: Hilbert's daily fun fact.

Hilbert: In sixteen eighty-one, a Dutch trader crossing the Karakum Desert near modern-day Turkmenistan recorded finding a perfectly preserved mammoth tusk protruding from thawing permafrost — the earliest documented observation of permafrost methane seeps in Central Asia, though he had no idea what he was looking at and described it as "the earth exhaling frozen breath.

...the earth exhaling frozen breath. That's actually kind of beautiful.

This has been My Weird Prompts. Find every episode at my weird prompts dot com, or email the show at show at my weird prompts dot com. We'll be back next week.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.

#4053: How to Make AI Write Prose, Not Bullet Points

Downloads

You Might Also Like

#4053: How to Make AI Write Prose, Not Bullet Points