Daniel sent us this one — DeepSeek V4 dropped yesterday, two models, a Pro and a Flash, both with a million-token context window, both open-weights under MIT license. He wants the architecture deep-dive, the training pipeline, and he's also pushing a question I've been turning over myself — why does DeepSeek's prose actually feel more vivid than what we get from Claude or GPT? Not just benchmark scores. The texture of the writing.
There's actually a lot we can say on that question, though I should warn you upfront — the honest answer is that DeepSeek hasn't published a creative-writing recipe. We have four mechanisms that can each plausibly explain it, and I'll walk through all of them, but anyone who tells you they've got the single definitive explanation is guessing.
I'd rather have the honest uncertainty than a tidy story. Also, quick note — today's episode script is coming from DeepSeek V4 Pro.
I like it. Alright, let's start with what V4 actually is, because the scale here is genuinely wild. DeepSeek V4 Pro — one point six trillion total parameters, forty-nine billion activated. That makes it the largest open-weights model in existence. It's more than twice the size of V three point two's six hundred eighty-five billion.
The Flash version?
Two hundred eighty-four billion total, thirteen billion activated. But here's what I find more interesting than the raw parameter count. V4 Flash costs fourteen cents per million input tokens, twenty-eight cents per million output tokens. V4 Pro is a dollar seventy-four input, three forty-eight output. For comparison, the contemporaneous closed-model flagships — GPT five point four, Claude Opus four point six — they're charging somewhere around twenty-five to thirty dollars per million output tokens.
An order of magnitude cheaper. That's not a minor discount. That's a different economic tier entirely.
And DeepSeek is already signaling further price drops once the Huawei Ascend nine-fifty supernodes ship at scale in the second half of this year. But we should talk about the hardware story, because it's more nuanced than most coverage suggests.
The narrative is that this is DeepSeek's big Huawei moment — domestic Chinese chips, hardware sovereignty, breaking free from Nvidia.
That's partially true, but only partially. MIT Technology Review had a good piece on this yesterday. V4 is DeepSeek's first model optimized for domestic Chinese chips — the inference side is tuned for the Huawei Ascend nine-fifty PR, which does about one point five six petaflops in FP4 with a hundred twelve gigabytes of high-bandwidth memory. But the bulk of training appears to still be on Nvidia hardware. Fortune reported that a Tsinghua source described it as, quote, "only partial training adaptation for domestic chips."
The inference is Huawei, the training is still mostly Nvidia. That's a partial decoupling, openly framed as partial.
Which I actually respect more than if they'd tried to spin it as a clean break. They're being transparent about the transition. And the architecture itself is where things get interesting. V4 inherits the DeepSeek MoE philosophy — mixture of experts with fine-grained feed-forward layers — and the sparse attention lineage from V2 and V3 point two. But there are three new components that matter.
Walk me through them. Start with the attention mechanism, because I saw the terms CSA and HCA and my eyes started to glaze.
So standard transformer attention is quadratic — the cost scales with the square of sequence length. A million-token context with vanilla attention would be completely unservable. DeepSeek's been chipping away at this since V2 introduced multi-head latent attention, and V3 point two introduced DeepSeek Sparse Attention. V4 introduces what they're calling hybrid attention, which combines two mechanisms. The first is Compressed Sparse Attention, or CSA. It compresses the key-value entries by a factor of four along the sequence dimension using a softmax-gated pooling with learned positional bias. Then a component called the lightning indexer — this is FP4 precision, ReLU-scored multi-head dot product — selects the top-k compressed blocks for each query.
You're compressing first, then doing sparse selection on the compressed representation. The search space is smaller.
Four times smaller, specifically. And then the second mechanism is Heavily Compressed Attention, HCA, which compresses by a factor of one hundred twenty-eight and drops sparse selection entirely — at that compression ratio, the residual sequence is short enough that dense attention is cheap. In the V4 Pro sixty-one-layer stack, layers zero and one are HCA, layers two through sixty alternate between CSA and HCA, and the trailing multi-token prediction block runs sliding-window only.
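For the show notes, here's the compress-then-select idea behind CSA as a toy sketch. The shapes, the mean-pooling, and the scoring rule are our own illustrative choices, not DeepSeek's implementation; the paper describes softmax-gated pooling with learned positional bias, which we stand in for with a plain mean.

```python
import numpy as np

def compress_kv(keys, values, ratio=4):
    """Mean-pool keys/values along the sequence axis by `ratio` (a toy
    stand-in for the paper's softmax-gated pooling)."""
    seq, dim = keys.shape
    seq_c = seq // ratio
    k_c = keys[: seq_c * ratio].reshape(seq_c, ratio, dim).mean(axis=1)
    v_c = values[: seq_c * ratio].reshape(seq_c, ratio, dim).mean(axis=1)
    return k_c, v_c

def lightning_index(query, k_compressed, top_k=8):
    """Score compressed blocks with a ReLU'd dot product, keep top-k."""
    scores = np.maximum(query @ k_compressed.T, 0.0)  # ReLU scoring
    return np.argsort(scores)[::-1][:top_k]

def csa_attend(query, keys, values, ratio=4, top_k=8):
    # Compress first, then do sparse selection on the smaller space.
    k_c, v_c = compress_kv(keys, values, ratio)
    idx = lightning_index(query, k_c, top_k)
    logits = query @ k_c[idx].T / np.sqrt(query.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_c[idx]

rng = np.random.default_rng(0)
seq, dim = 1024, 64
keys = rng.standard_normal((seq, dim))
values = rng.standard_normal((seq, dim))
query = rng.standard_normal(dim)

out = csa_attend(query, keys, values)
print(out.shape)
```

The point of the structure: attention only ever touches top-k blocks of a four-times-smaller sequence, so both the selection search and the attention itself get cheaper.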
What does this actually buy you in practical terms? Give me the headline efficiency numbers.
At one million tokens, V4 uses about two percent of the KV-cache size of vanilla grouped-query attention with eight heads in BF16. It's ten percent of V3 point two's KV cache. And per-token FLOPs — V4 Pro uses twenty-seven percent of what V3 point two used. For V4 Flash, those numbers drop to ten percent and seven percent respectively.
You're getting a million-token context window at a fraction of the compute cost of the previous generation's presumably shorter context. That's the architecture story in one sentence.
It's a story about serving cost, not just capability. Western labs in twenty twenty-five and twenty twenty-six have been emphasizing compute scale and post-training — extended thinking, tool harnesses. DeepSeek's headline numbers are KV-cache and FLOP reductions. The architecture is designed to make long context cheap to serve, not just possible to serve.
Which connects directly to the pricing we talked about. If your per-token cost is a quarter of the previous generation, you can charge a quarter of the price and maintain margins.
The second new component is something called Manifold-Constrained Hyper-Connections, or mHC. This replaces standard residual connections. The paper frames it as a stability fix at scale rather than a representational change — it's about making training stable enough to converge at one point six trillion parameters, not about changing what the model can represent.
And the third new component is mixed-precision storage. Most KV entries are FP8, the rotary position embedding dimensions stay in BF16, the lightning indexer is FP4. The instruct checkpoints store mixture-of-experts expert weights in FP4 with FP8 elsewhere. Base models are FP8 throughout. This is all in service of the same goal — make the thing smaller in memory.
There was a component called Engram in some of the early leaks. Did that ship?
It did not. The optimizer is Muon, which they chose for faster convergence and training stability, and they retained multi-token prediction from V3. The pretraining corpus is described as, quote, "thirty-two trillion plus diverse and high-quality tokens," but they haven't formally broken out the composition.
Which is interesting, because V3 disclosed roughly fourteen point eight trillion tokens with a strong English-Chinese mix and math and code upweighting. We have some indirect evidence about V4's composition though, right?
V4 scores eighty-four point four on Chinese SimpleQA versus fifty-seven point nine on English SimpleQA Verified. And V4 Flash Base hits ninety-two point one on C-Eval. Those numbers strongly suggest continued heavy Chinese-language weighting in the pretraining corpus. And this is actually relevant to the prose vividness question we'll get to.
Let's hold that thread for a moment. I want to understand the post-training pipeline first, because I know DeepSeek does something different from the Western labs here.
This is where it gets really interesting. V4 uses a two-stage post-training paradigm. Stage one they call domain expert cultivation. They train multiple domain experts independently — each one gets supervised fine-tuning followed by reinforcement learning with GRPO. That's DeepSeek's evolution of the Group Relative Policy Optimization from R1 and V3 point two. The V4 version adds domain-specific KL weighting, importance-ratio reweighting for unbiased KL estimation, off-policy sequence masking, expert-routing preservation in the MoE layers, and sampling-mask consistency for top-p and top-k decoding.
Each domain expert is trained to be good at its specific thing, with the RL objective tailored to that domain.
And then stage two is unified model consolidation. They merge the domain experts via on-policy distillation into the single shipped checkpoint. The key word there is distillation — the final model learns from the experts, it doesn't average their weights. The paper says this is designed to preserve domain-specific behavior rather than having it washed out.
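A toy picture of on-policy distillation for the show notes. The student generates, and the loss pulls the student's next-token distribution toward the teacher's on the student's own trajectories. The logits below are made up, and this is the generic technique, not DeepSeek's specific recipe.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over a sampled sequence."""
    p_t = softmax(teacher_logits)
    log_p_t = np.log(p_t)
    log_p_s = student_logits - np.log(
        np.exp(student_logits).sum(axis=-1, keepdims=True))
    return float((p_t * (log_p_t - log_p_s)).sum(axis=-1).mean())

rng = np.random.default_rng(1)
teacher = rng.standard_normal((16, 32))   # 16 positions, 32-way vocab
student = teacher + 0.1 * rng.standard_normal((16, 32))

print(round(distill_loss(student, teacher), 4))
```

Identical logits give zero loss, so training drives the student toward matching the expert's behavior on sequences the student actually produces, which is the "learns from the experts, doesn't average their weights" distinction.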
Which is a structural safeguard against the flattening effect you get from standard RLHF. If you train a creative writing expert independently and then distill it into the final model, the writing behavior survives the merge.
That's the theory. And it connects to what we know about DeepSeek's alignment philosophy more broadly. They're not running the same kind of human-preference RLHF over open-ended writing that the Western labs do. Their published alignment recipe is GRPO against verifiable rewards — math, code, agent tasks. Verifiable-reward RL doesn't rinse out stylistic variance the way pairwise-preference RLHF does.
You've got a structural reason, an alignment-philosophy reason, and we haven't even gotten to the sampling defaults yet.
Let's talk about those, because they're striking. DeepSeek's recommended sampling defaults for V4 are temperature one point zero, top-p one point zero, across all three reasoning modes. That's in the model card. Western product surfaces typically default to temperature around zero point seven. A temperature of one point zero yields substantially higher sampling entropy than zero point seven. That alone is a direct mechanism for more varied, more "vivid" prose, regardless of any training differences.
Even if you had identical model weights, the higher sampling temperature would produce output that feels less constrained, less samey.
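You can see the temperature effect in a few lines for the show notes. The logits are invented; the entropy increase is the general phenomenon.

```python
import numpy as np

def entropy_at_temperature(logits, temp):
    """Shannon entropy (nats) of softmax(logits / temp)."""
    z = np.asarray(logits) / temp
    p = np.exp(z - z.max())
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

# A made-up next-token distribution over eight candidates.
logits = np.array([3.0, 2.1, 1.8, 0.5, 0.2, -0.5, -1.0, -2.0])
for t in (0.7, 1.0):
    print(f"T={t}: entropy = {entropy_at_temperature(logits, t):.3f} nats")
```

Higher temperature flattens the distribution, so low-probability words get sampled more often; over thousands of tokens that reads as more lexical and syntactic variety.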
And we know from the EQ-Bench Creative Writing leaderboard that this advantage is real and measurable. DeepSeek V3 point two Speciale topped the EQ-Bench Creative Writing v3 Elo leaderboard late last year. V4 inherits the V3-family corpus and post-training scaffolding wholesale. The community consensus — and this is reflected across the EQ-Bench leaderboards and the various "uncensored open-source" model lists — is that DeepSeek models are markedly less RLHF-rounded than Western peers.
I remember seeing a critic describe the difference between V3-era output and more aggressively safety-tuned models. They called the DeepSeek prose "warm, with rhythm and breath," versus the "short, choppy, robotic sentences" of the alternatives.
That quote has been circulating. And I think it captures something real, but we should be careful not to over-claim. Let me lay out the four mechanisms that can each plausibly explain the vividness gap, and then I'll tell you what we don't know.
Go for it.
Mechanism one — the pretraining corpus. Heavy Chinese-language weighting, and the Chinese web includes substantial human-written fiction and online prose. We're talking Zhihu long-form, web novels on platforms like Qidian and Jinjiang, classical literature. These sources impart rhythm and image habits that English-only models trained mostly on web text and code don't pick up. And these habits transfer cross-lingually via shared embeddings. It's not that the model learned to write English fiction from Chinese fiction directly — it's that it learned something about prose rhythm and vivid imagery that generalizes.
This is the one you flagged as credibly theorized but not documented in the V4 paper.
DeepSeek hasn't published their corpus composition. But the Chinese-SimpleQA versus English-SimpleQA gap, and the C-Eval scores, make it clear that Chinese-language data is heavily represented. Mechanism two — less aggressive RLHF. I've already covered this. GRPO against verifiable rewards preserves stylistic spikiness. Pairwise-preference RLHF — "which of these two responses do you prefer?" — systematically penalizes anything unusual. Over enough training steps, you converge to a safe, bland mean.
Mechanism three is the stage-one to stage-two pipeline we already discussed. The distillation preserves domain-specific behavior rather than averaging it out.
And mechanism four is the sampling defaults. Temperature one point zero versus zero point seven. That's sufficient on its own to explain some of the perceived vividness gap, regardless of training differences. Higher entropy at decode means more lexical variety, more syntactic variation, more willingness to reach for an unexpected image.
Those are the four mechanisms. What don't we know?
We don't know whether DeepSeek runs any form of preference RL specifically targeted at narrative quality. We don't know the specific composition of the SFT data used in stage one for any writing expert — if such an expert even exists as a distinct entity. We don't know whether long-form Chinese fiction is upsampled or filtered down in the thirty-two trillion token mix. Any specific causal claim about why the prose feels more vivid is informed speculation. The prose advantage is real and measurable — EQ-Bench confirms that. We can point to four mechanisms that could explain it. But DeepSeek hasn't published a creative-writing recipe.
I appreciate the honesty. Let's shift to the agent and reasoning side, because V4 does something that none of the previous DeepSeek releases did.
This is a big deal. V4 is the first DeepSeek release to fold reasoning, agent tool-use, and long-context into a single base and instruct family. Previous generations shipped a separate Reasoner SKU — you had deepseek-chat and deepseek-reasoner as distinct API endpoints. Those legacy endpoints retire July twenty-fourth of this year. V4 unifies everything.
There are three inference modes now.
Non-think, Think High, and Think Max. Think Max requires at least three hundred eighty-four thousand tokens of context. The agent-specific innovations are worth highlighting. They've got interleaved thinking across tool calls — reasoning traces survive user-message boundaries inside tool-using conversations. Non-tool conversations keep the V3 point two behavior where the reasoning trace flushes on each turn.
If the model is using tools, it can maintain a coherent chain of thought across multiple tool calls. That's useful for complex agent workflows.
They also introduced a new tool-call schema using a special token called DSML. It's an XML schema with a string-equals-true or string-equals-false flag that separates raw-string parameters from JSON-structured parameters. The explicit goal is to kill nested-quote parsing failures, which have been a persistent headache in tool-calling implementations.
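DeepSeek's model card defines the actual schema; to illustrate the raw-string versus JSON split for the show notes, here's a hypothetical example with invented tag and attribute names.

```python
import json
import xml.etree.ElementTree as ET

# Invented tag/attribute names to illustrate the idea; the real DSML
# schema is specified in DeepSeek's model card.
snippet = """
<tool_call name="write_file">
  <param name="path" string="true">notes/draft 1.md</param>
  <param name="options" string="false">{"overwrite": true, "mode": 420}</param>
</tool_call>
"""

def parse_call(xml_text):
    root = ET.fromstring(xml_text)
    args = {}
    for p in root.findall("param"):
        raw = p.text or ""
        # string="true" means: take the text verbatim, no JSON parsing.
        # That is what sidesteps nested-quote escaping failures.
        args[p.get("name")] = raw if p.get("string") == "true" else json.loads(raw)
    return root.get("name"), args

name, args = parse_call(snippet)
print(name, args)
```

The win is that free-text arguments never pass through a JSON parser at all, so a stray quote in a file path or a code snippet can't break the call.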
There's a sandbox platform?
DSec — DeepSeek Elastic Compute. It's a Rust sandbox platform that exposes function calls, containers, Firecracker microVMs, and full QEMU virtual machines through a single Python SDK. They claim it scales to hundreds of thousands of concurrent sandboxes. And it has preemption-safe trajectory replay, so interrupted RL steps can resume without re-running tool calls.
That's the kind of infrastructure investment that most labs don't talk about publicly. It's not glamorous, but it's what makes agent workflows actually reliable at scale.
Simon Willison had a good line in his piece yesterday. He called V4 "almost on the frontier, a fraction of the price." And I think that captures the strategic positioning. DeepSeek isn't claiming to beat GPT five point four or Claude Opus four point six on every benchmark. They're saying — here's a model that's competitive with the frontier, open-weights, MIT license, at roughly a tenth the cost.
Let's talk about what differentiates this from the Western labs more systematically. You've touched on pieces of this, but I want to lay it out clearly.
First, open weights. MIT license, full base checkpoints on Hugging Face. GPT five point four, Claude Opus four point six, Gemini three point one Pro — all closed. V4 Pro is the largest fully open model in existence, and it's not close.
Architecture designed for inference cost, not training scale. I already made this point, but it's worth underlining. CSA plus HCA, mHC, FP4 expert weights — all of these exist to make serving cheap.
Third is the hardware sovereignty story.
Partial but real. Inference optimized for Huawei Ascend, training still mostly Nvidia, openly framed as a transition. MIT Technology Review called it "DeepSeek's first model optimized for domestic Chinese chips."
Fourth is pricing, which we've covered. V4 Flash output at twenty-eight cents per million tokens is the cheapest small-model output price from any major lab right now.
Fifth is the alignment philosophy. Three explicit reasoning-effort modes, no separate safety SKU, no published RLHF safety report comparable to Anthropic's system cards or OpenAI's. The censorship behavior people observe on chat dot deepseek dot com is widely reported to live in the application layer rather than the model weights. The open-weights V4 checkpoints are markedly less RLHF-rounded than Western counterparts.
Which is consistent with the R1 lineage. DeepSeek wasn't subjected to stringent RLHF the way GPT and Claude were.
And this is a double-edged sword. Less rounding means more vividness, more stylistic character, less of that homogenized corporate voice. It also means less guardrailing. Different users will have different feelings about that trade-off.
Let's go back to the training pipeline for a moment. You mentioned the thirty-two trillion token corpus. Do we have any official cost figure for the training run?
No, there's no official figure. And this is important — the roughly five point two million dollar figure circulating in the press is an unverified extrapolation from V3's five point five seven six million dollar H800 disclosure. We should treat that as speculation. DeepSeek hasn't published a cost figure for V4.
Good to flag. What about the tokenizer?
Custom tokenizer, non-Jinja chat template. There's a dedicated encoding function, and the new DSML special token for tool calls. The model cards on Hugging Face have the full details.
You mentioned the multi-token prediction earlier. That's carried over from V3?
It is, yes, alongside the Muon optimizer we mentioned earlier. Sebastian Raschka has a good write-up on the evolution from V3 to V3 point two that covers some of the architectural lineage, and the V4 paper extends that lineage pretty directly.
If I'm trying to synthesize all of this — what's the single most important thing about V4?
I think it's the inference-cost architecture combined with the open-weights release. DeepSeek has made a million-token context model that's cheap enough to actually use, and they've given away the weights. That changes the economics for anyone building on top of these models. You're not paying thirty dollars per million output tokens to a closed API. You're paying three forty-eight, or you're self-hosting.
The million-token context isn't just a spec sheet number. The hybrid attention architecture means you can actually fill that context window without the latency becoming unusable.
That's the key. Lots of models claim long context. Most of them become painfully slow or prohibitively expensive long before you reach the theoretical limit. V4's architecture is designed so that the million-token context is practically usable, not just technically possible.
Which brings us back to Daniel's question about prose vividness. I want to push on something. You said the four mechanisms are pretraining corpus, less RLHF, the distillation pipeline, and sampling defaults. If you had to weight them — and I know this is speculative — which do you think is doing the most work?
I think the pretraining corpus and the RLHF philosophy are probably the biggest factors, but I want to be clear that this is my judgment, not something I can prove from the published literature. The reason I put the corpus first is that pretraining data shapes everything downstream. If your model was trained on thirty-two trillion tokens that include a substantial fraction of human-written fiction and long-form prose — in any language — it's going to internalize patterns of rhythm, imagery, and narrative structure that are hard to replicate with web text and code alone. And the cross-lingual transfer via shared embeddings is a real phenomenon. We've seen it in other multilingual models.
The RLHF point?
Pairwise-preference RLHF is a powerful homogenizing force. Every training step where the model is rewarded for the "safer" or "more helpful" response pushes it toward the center of the distribution. Over enough steps, you lose the tails — the unusual word choices, the unexpected metaphors, the stylistic quirks that make prose feel alive. DeepSeek's choice to use GRPO against verifiable rewards means they're not running that homogenization process over open-ended writing. The writing quality is preserved because it's never directly optimized against.
The distillation pipeline feels like the most structurally interesting explanation, but maybe not the one doing the heaviest lifting.
I'd agree with that. The two-stage pipeline is elegant — train domain experts independently, then distill into a unified model. It's designed to preserve domain-specific behavior. But whether it actually produces better creative writing than a well-tuned single-stage pipeline is an open question. The structural logic is sound. The empirical evidence is thin.
The sampling defaults — that's almost a cheat. You can get more varied output from any model by cranking up the temperature. The fact that DeepSeek defaults to one point zero just means they're willing to ship with higher entropy.
It's not a cheat, exactly. It's a design choice. Western labs default to lower temperatures because they're optimizing for safety and consistency. DeepSeek defaults to higher temperature because they're optimizing for... something closer to expressiveness. Both are legitimate. They reflect different priorities.
I want to circle back to something you said earlier about the censorship living in the application layer rather than the weights. What's the practical implication?
If you're using the open-weights V4 checkpoint directly, you're getting a model that's less constrained than what you'd experience through chat dot deepseek dot com. The application layer can refuse requests or filter outputs. The weights themselves are more permissive. This is consistent with how DeepSeek has operated since R1. It's a different philosophy from Anthropic or OpenAI, where the safety tuning is baked into the model weights themselves.
Which has implications for fine-tuning and downstream use. If the constraints are in the application layer, you can strip them out by self-hosting.
And that's part of why the open-weights release matters so much. You're not just getting a model. You're getting a model with a specific alignment profile — less rounded, more stylistically varied, more permissive — and the freedom to modify it.
Let's talk about what's next. DeepSeek is signaling further price drops once the Ascend supernodes ship. What does that timeline look like?
Second half of this year, according to their API announcement. The Ascend nine-fifty PR in FP4 does about one point five six petaflops with a hundred twelve gigabytes of HBM and roughly one point four terabytes per second of memory bandwidth, according to TrendForce. Once those are deployed at scale, the inference economics shift further.
On the training side, the transition to domestic chips is partial. Do we expect that to change?
Huawei has pledged, quote, "full support." But the gap between current Ascend hardware and Nvidia's latest is still substantial for large-scale training. I'd expect the transition to be gradual. DeepSeek seems to be framing it that way too — they're not pretending it's a clean break.
One thing we haven't touched on — the context window. A million tokens. What does that actually enable that wasn't practical before?
Whole-codebase analysis. You can drop in an entire large software project and have the model reason across it. Full book-length document processing. Multi-hour conversation histories without summarization or truncation. The agent use case is particularly interesting — with interleaved thinking across tool calls and a million-token context, you can run complex multi-step workflows where the model maintains coherent reasoning across dozens or hundreds of tool interactions.
The hybrid attention architecture means you're not paying a quadratic penalty for using that context.
The KV-cache is two percent of what vanilla attention would require at that sequence length. That's the difference between "technically possible" and "economically viable."
And now: Hilbert's daily fun fact.
The collective noun for a group of porcupines is a prickle.
What should listeners actually take away from all this? If you're building with these models, what matters?
First, if cost is a factor — and it usually is — V4 Flash at twenty-eight cents per million output tokens is the cheapest option from any major lab right now. For many use cases, the Flash model will be sufficient, and the price is hard to beat. Second, if you need long context that's actually usable, V4's architecture makes million-token context practical in a way that most models don't. Third, if you care about open weights and the ability to self-host or fine-tune, V4 Pro is the largest fully open model available.
If you care about prose quality — vividness, stylistic character, writing that doesn't sound like it was run through a corporate communications filter — DeepSeek's alignment philosophy and sampling defaults make V4 worth trying. Even if we can't fully explain why it works.
The unknown is part of the story here. We've got four credible mechanisms. The pretraining corpus, the RLHF philosophy, the distillation pipeline, the sampling defaults. They all point in the same direction. But DeepSeek hasn't published a creative-writing recipe, and anyone claiming certainty about the cause is overselling.
That feels like the right place to land. V4 is a significant release — open weights, inference-cost-first architecture, partial hardware decoupling, aggressive pricing — and it also raises questions about prose quality that the published research doesn't fully answer. Both things can be true.
We haven't even mentioned the fact that this all dropped on a Thursday with basically no warning. DeepSeek's release strategy continues to be — build the thing, ship it, let the weights speak for themselves.
Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com, and if you want to dig into the V4 paper yourself, the model cards are up on Hugging Face, Simon Willison's write-up is excellent, and the MIT Technology Review piece has good context on the hardware story.
Go read the model card. It's unusually candid about the things they haven't done.
Until next time.