#1606: DeepSeek’s Return: V4, R2, and the AI Pricing War

DeepSeek returns with a trillion-parameter model and rock-bottom pricing. Explore the tech behind V4 and the mystery of the Hunter Alpha leak.

Episode Details
Duration: 22:48
Pipeline: V5
TTS Engine: chatterbox-regular
AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

DeepSeek has re-emerged as a dominant force in the AI landscape with the release of its V4 and R2 models. After a period of relative quiet, the Hangzhou-based lab has introduced a one-trillion-parameter model that challenges the performance and pricing structures of established Western labs.

Architectural Efficiency at Scale

The V4 model utilizes a highly refined Mixture of Experts (MoE) approach. While the total parameter count reaches one trillion, the model activates only 32 billion parameters per token during inference. This allows for a vast knowledge base without the prohibitive compute costs typically associated with dense models of this magnitude.
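The dispatch pattern described above can be sketched as a top-k gating step. Note that the layer sizes, the number of experts, and the `top_k=2` choice below are illustrative assumptions for the sketch, not DeepSeek's published configuration:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token through the top-k experts of a Mixture-of-Experts layer.

    x        -- token representation, shape (d_model,)
    gate_w   -- gating matrix, shape (n_experts, d_model)
    experts  -- list of callables; only the selected few are ever run
    """
    logits = gate_w @ x                      # score every expert for this token
    top = np.argsort(logits)[-top_k:]        # indices of the best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only top_k expert networks actually execute; the rest of the parameters
    # stay idle, which is why active compute is a small fraction of total size.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x for _ in range(n_experts)]
gate_w = rng.standard_normal((n_experts, d))
y = moe_forward(rng.standard_normal(d), gate_w, experts)
```

The key property is in the last line of `moe_forward`: total parameter count scales with the number of experts, while per-token compute scales only with `top_k`.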

A key technical breakthrough supporting this scale is the implementation of Manifold-Constrained Hyper-Connections (MHC). This method applies geometric constraints to the model’s internal representations, preventing the training instability and "gradient drift" that often plague ultra-large-scale models. By keeping the latent space stable, DeepSeek has achieved a one-million-token context window with 99% retrieval accuracy and native multimodality, processing text and video in the same underlying space.
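DeepSeek has not published implementation details for MHC, so the mechanism cannot be reproduced here; but the general idea of constraining hidden states to a fixed-geometry manifold between layers can be illustrated with a simple hypersphere projection. Everything below is a hypothetical sketch of that family of techniques, not the actual method:

```python
import numpy as np

def project_to_sphere(h, radius=1.0, eps=1e-8):
    """Constrain a batch of hidden states to a hypersphere of fixed radius.

    Pinning representation norms bounds how far activations can drift
    between layers -- one simple way instability is tamed in deep stacks.
    """
    norms = np.linalg.norm(h, axis=-1, keepdims=True)
    return radius * h / (norms + eps)

# Without the constraint, repeated linear maps can blow up activation norms;
# with it, the norm is identical at every layer no matter how deep we go.
rng = np.random.default_rng(1)
h = rng.standard_normal((4, 32))
W = 1.5 * rng.standard_normal((32, 32))
for _ in range(10):
    h = project_to_sphere(h @ W)
```

The geometric constraint acts like the "guardrails" metaphor used later in the episode: the representation is free to rotate on the manifold but cannot leave it.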

The Economics of AI Disruption

Perhaps the most significant impact of the V4 launch is its pricing. At roughly 27 cents per million input tokens, DeepSeek is offering services at a fraction of the cost of many competitors. This shift is not merely a marketing subsidy but a result of deep technical optimizations. Innovations in KV cache compression and inference efficiency allow the model to run on significantly less VRAM, fundamentally changing the unit economics for startups and developers who require high-tier reasoning at scale.
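At the quoted rates the gap compounds quickly. A back-of-the-envelope comparison makes the point; the 50-million-tokens-per-day workload is a hypothetical, and the ten-dollar competitor rate is the figure cited in the episode:

```python
def monthly_cost(tokens_per_day, price_per_million, days=30):
    """Input-token cost over a billing month at a flat per-million-token rate."""
    return tokens_per_day * days * price_per_million / 1_000_000

daily_tokens = 50_000_000                    # hypothetical startup workload
cheap = monthly_cost(daily_tokens, 0.27)     # DeepSeek-style pricing
premium = monthly_cost(daily_tokens, 10.00)  # top-tier closed-model pricing
print(f"${cheap:,.0f} vs ${premium:,.0f} per month ({premium / cheap:.0f}x)")
# → $405 vs $15,000 per month (37x)
```

At that ratio, the question for a cost-sensitive team is not whether to use the cheaper model, but which residual tasks still justify the premium one.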

Hardware Sovereignty and the Reference Architecture

DeepSeek’s success is also a testament to hardware optimization. Rather than relying solely on high-end Nvidia chips, the team has optimized its models for domestic Chinese silicon, including Huawei and Cambricon hardware. This constraint has forced algorithmic efficiencies that result in world-class performance on domestic hardware, proving that software ingenuity can overcome hardware limitations.

The influence of DeepSeek now extends beyond its own products. The "Hunter Alpha" mystery—where a high-performing model briefly appeared on benchmarks before being revealed as Xiaomi’s MiMo V2 Pro—highlighted DeepSeek’s role as a "reference architecture." Other major tech firms are now using DeepSeek’s training recipes and reasoning techniques as the blueprint for their own proprietary models.

The Future of Reasoning and Agents

The R2 model introduces "Internalized Reasoning," a step forward from the explicit chain-of-thought processing seen in earlier versions. By training the model through reinforcement learning to compress reasoning steps into internal activations, DeepSeek has reduced latency while maintaining high scores on logic benchmarks like ARC-AGI.

Looking forward, the development of OpenClaw suggests a move toward standardized agent infrastructure. By creating a unified protocol for how AI interacts with software environments, the goal is to transition from simple chatbots to autonomous digital employees capable of complex, multi-step tool use.
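OpenClaw's actual protocol is not public, so as a sketch of what a unified tool-use surface might look like, consider a minimal registry where every environment exposes the same describe/invoke interface. All names and structures here are hypothetical illustrations:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """One capability an environment exposes to an agent."""
    name: str
    description: str
    handler: Callable[[dict], dict]

class ToolRegistry:
    """Uniform surface: the agent lists tools, then invokes by name with JSON args."""
    def __init__(self):
        self._tools = {}

    def register(self, tool: Tool):
        self._tools[tool.name] = tool

    def describe(self) -> str:
        # What the model sees in its context: a machine-readable tool catalog.
        return json.dumps([{"name": t.name, "description": t.description}
                           for t in self._tools.values()])

    def invoke(self, name: str, args: dict) -> dict:
        # A single entry point means any model that emits {"tool": ..., "args": ...}
        # can drive any environment implementing this interface.
        return self._tools[name].handler(args)

registry = ToolRegistry()
registry.register(Tool("read_file", "Return the contents of a file",
                       lambda a: {"content": f"<contents of {a['path']}>"}))
result = registry.invoke("read_file", {"path": "notes.txt"})
```

The value of standardizing this layer is that the model and the environment become swappable independently, which is what "unified protocol" means in practice for multi-step tool use.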


Episode #1606: DeepSeek’s Return: V4, R2, and the AI Pricing War

Daniel's Prompt
Daniel
Custom topic: DeepSeek - they had a dramatic moment in the sun with DeepSeek V3 and R1 that briefly shocked the AI world, but then seemed to fade back into relative obscurity. What are they up to now? What's their
Corn
So, Herman, I have a theory. I think the team at DeepSeek just decided to go on a very long, very quiet vacation after their big moment last year. They just ghosted the entire AI industry, left us on read for about twelve months, and waited for everyone to stop talking about them before kicking the front door down again. It was like they were the main character in a movie who disappears into the mountains to train in a montage, only to reappear when everyone thought they were a one-hit wonder.
Herman
It certainly felt that way if you were only watching the headlines, Corn. But if you were looking at the pre-print servers and the specialized hiring boards in Hangzhou, you could see the pressure cooker was starting to whistle. I am Herman Poppleberry, by the way, for anyone joining us for the first time, and I have been refreshing the DeepSeek GitHub page more than is probably healthy for a donkey of my age. My ears have been twitching for months waiting for this drop.
Corn
It is definitely not healthy, Herman, but I appreciate the sacrifice you make for the listeners. Today's prompt from Daniel is about this massive comeback. He is asking us to dig into the DeepSeek V four and R two launch, the whole Hunter Alpha mystery from a few weeks ago, and whether these guys are actually going to bankrupt the rest of the industry with their pricing. We are talking about a company that everyone claimed had faded into obscurity, only for them to drop a nuke on the pricing charts.
Herman
It is the perfect time to talk about it because we are finally seeing the full picture of their new architecture. People were calling the twenty twenty-five launch of V three a Sputnik moment, but V four, which just dropped earlier this month to coincide with the Two Sessions in Beijing, feels more like they are trying to build a permanent moon base. We are looking at a one-trillion-parameter model that somehow runs with the agility of something a fraction of that size. It is a massive statement of intent, timed perfectly with China’s major political meetings to show off their domestic technological prowess.
Corn
One trillion parameters. That sounds like one of those numbers people just throw out when they want to sound impressive, like saying you have a billion cousins or you can eat a thousand leaves in a minute. Is there actually a functional reason for that scale, or is this just architectural flexing to make the Silicon Valley labs nervous?
Herman
It is a bit of both, but mostly functional. The key is their Mixture of Experts approach, which they have refined to an incredible degree. Even though the total parameter count is one trillion, the model only activates thirty-two billion parameters per token during inference. That is the secret sauce. It allows them to have this massive, deep well of knowledge and nuance to draw from without the massive compute cost of a dense model like GPT five point four. Think of it like a library with a trillion books, but you only ever have thirty-two librarians running to fetch exactly what you need for a specific sentence. It is efficient, but the sheer size of the library means those librarians have access to everything.
Corn
Okay, but we have seen Mixture of Experts before. Mistral does it, Grok does it. What makes V four different from just a bigger version of what they were doing last year? Because if it is just more of the same, I am not sure it justifies the hype of a trillion parameters.
Herman
The real breakthrough is something called Manifold-Constrained Hyper-Connections, or m H C. Liang Wenfeng and his team published the foundational paper on this back in January twenty twenty-six, and it is honestly brilliant. In traditional deep scaling, you run into massive training instability once you cross certain thresholds. The gradients just start doing weird things, the model starts to collapse, or it just stops learning. m H C essentially uses geometric constraints to keep the model’s internal representations from drifting into areas of the latent space that cause those hallucinations or crashes. It is like putting high-tech guardrails on a Formula One car so it can go three hundred miles per hour without shaking itself to pieces. It allows them to scale to that trillion-parameter mark while keeping the training process smooth and the output stable.
Corn
I like the guardrail idea, mostly because I am a sloth and speed generally terrifies me. If I am going fast, I want to know I am not going to fly off the track. But let's talk about the context window for a second. Daniel mentioned a one-million-token window for V four. We have seen that from Gemini for a while now, so is DeepSeek just playing catch-up there, or is there a twist that makes this special?
Herman
The twist is the retrieval accuracy over that million tokens. Usually, when you cram a million tokens into a model, the middle of the document becomes a bit of a blur. It is the classic needle in a haystack problem where the model remembers the beginning and the end but forgets the middle. But with the m H C architecture, DeepSeek V four is hitting ninety-nine percent retrieval accuracy across the entire window. And they are doing it with native multimodality. This is not a text model with a vision encoder slapped on the front like a Frankenstein monster. It is processing text, images, and video in the same underlying latent space from day one. It understands the relationship between a frame of video and a line of code natively.
Corn
That brings us to the Hunter Alpha situation. For a few days there in mid-March, the internet was convinced DeepSeek had leaked their crown jewels. This mystery model shows up on OpenRouter, starts absolutely destroying benchmarks, and everyone points at Hangzhou. The speculation was wild, Herman. People were saying it was a rogue employee or a state-sponsored leak. Then, on March nineteenth, Xiaomi comes out and says, actually, that is us. That is our MiMo V two Pro.
Herman
That was such a fascinating moment for the ecosystem. It proved that DeepSeek’s real contribution isn't just their own models, but the training recipes they have pioneered. Xiaomi essentially took the DeepSeek V three point two architecture and the reasoning techniques from the R series and applied them to their own proprietary datasets. It shows that DeepSeek has become the reference architecture for the entire Chinese tech sector. They are the national champion now, whether they officially want that title or not. It is like they wrote the textbook that everyone else is now using to build their own specialized models.
Corn
It is a bit of a double-edged sword, right? Being the national champion means you get the best engineers and the state-backed hardware, but it also means everyone is looking at your server routing with a magnifying glass. I saw some reports that almost all DeepSeek API traffic is still hitting mainland China servers, which makes a lot of enterprise legal teams very nervous. If you are a big bank in London or a tech firm in San Francisco, that is a hard sell.
Herman
That is the big hurdle for them in the West. Even if the model is forty times cheaper, if your data privacy officer sees a route to a server in Hangzhou, they are going to pull the emergency brake. But from a purely technical standpoint, what they are doing with non-Nvidia hardware is incredible. They are optimizing specifically for Huawei and Cambricon chips. While the rest of the world is fighting over H one hundreds and Blackwell chips, DeepSeek is proving you can get world-class performance out of domestic Chinese silicon through sheer algorithmic efficiency. They are building for the hardware they have, not the hardware they wish they had, and that constraint is actually making their software better.
Corn
Speaking of efficiency, let's talk about the price. This is the part that usually makes my ears perk up, even if I am moving slowly. Twenty-seven cents per million input tokens. Herman, G P T five point four is still hovering around ten dollars for that same million tokens. How is that even possible? Are they just subsidizing this to kill the competition, or is the tech actually that much cheaper to run? Because that is not just a discount, that is a different universe of pricing.
Herman
It is mostly the tech, though there is certainly a strategic element. When you only activate thirty-two billion parameters per token, your compute cost per generation is drastically lower than a dense model. Plus, their K V cache compression is leagues ahead of the industry. They have found ways to store the model's memory of a conversation using a fraction of the V R A M that OpenAI or Anthropic requires. So, while there might be some strategic pricing involved to gain market share, the underlying math actually supports a much lower cost. They aren't just burning money; they are just running a much leaner engine. It is the difference between a massive V twelve engine that gulps fuel and a high-tech electric motor.
Corn
It makes me wonder why anyone would use the big, expensive models for anything other than the absolute highest-tier reasoning tasks. If I am building a startup and I can get ninety percent of the performance for one-fortieth of the cost, my C F O is going to make that decision for me before I even finish my coffee. It changes the unit economics of AI entirely.
Herman
That is exactly what we are seeing in the startup space right now. DeepSeek V three point two, which came out late last year, has become the workhorse for agentic workflows. They call it Thinking in Tool-Use. Instead of the model just talking to you, it is designed to pause, reason about which tool to use, and then execute a multi-step task autonomously. It is less of a chatbot and more of a digital employee. And that leads us directly into the new R two model, which takes that reasoning to the next level.
Corn
Right, the R two. This is the one that is supposed to go toe-to-toe with the OpenAI o series, the ones that sit there and think for thirty seconds before answering. What is the deal with Internalized Reasoning? Is it just a fancy way of saying the model talks to itself in a hidden scratchpad, or is there something more profound happening under the hood?
Herman
It goes deeper than a scratchpad. In R one, you could see the chain of thought. It was very explicit, and sometimes quite long-winded. In R two, they have moved toward what they call Internalized Reasoning. The model has been trained through reinforcement learning to compress those reasoning steps into its internal activations. It is still doing the hard work of logic and verification, but it is doing it more efficiently and with fewer tokens wasted on the output. It is scoring around seventy-two percent on the A R C A G I two benchmark. Now, Gemini three point one Pro is still the king there at seventy-seven percent, but R two is closing that gap fast, and it does not have the same latency lag that the earlier reasoning models had. It feels snappier.
Corn
I have noticed that R two feels much more like a conversation and less like I am waiting for a slow computer to boot up. But let's pivot to the agent stuff. Daniel mentioned they are hiring a ton of Agent Infrastructure Engineers for something called OpenClaw. That sounds like a name a supervillain would give their secret base, or maybe a very aggressive arcade game.
Herman
It does sound a bit menacing, doesn't it? But OpenClaw is actually their attempt to standardize how AI agents interact with software environments. Right now, every agent has its own way of clicking buttons or writing code, and it is a mess of different protocols. DeepSeek is trying to build the underlying rails for autonomous systems. They want to move beyond the chatbot. They want their models to be the brains inside of autonomous dev tools, research assistants, and even physical robotics. That is why the Xiaomi connection is so important. Imagine DeepSeek R two running the logic for a Xiaomi humanoid robot. That is the long-term play here. They are building the brain for the machines.
Corn
So they aren't just building a smarter box to talk to. They are building an operating system for things that actually do stuff in the real world. That explains the shift in their research direction. It is less about making the model a better poet and more about making it a better engineer. They want it to be able to handle a wrench, or at least a digital version of one.
Herman
Precisely. If you look at their recent job postings, they aren't just looking for L L M researchers anymore. They are looking for people who understand system architecture, low-latency networking, and hardware-software co-design. They are becoming a vertically integrated AI company. They build the training algorithms, they optimize for the specific chips like the Cambricon ones, and they build the agentic framework that sits on top of it all. It is a full-stack approach that is very different from the labs that just focus on the model weights.
Corn
It makes the fade to obscurity narrative look pretty silly in retrospect. They weren't losing steam; they were just retooling the entire factory. It is a very different vibe from the Silicon Valley approach where you have to have a new product announcement every Tuesday or people think you are dying. They were willing to be quiet for a year to get the architecture right.
Herman
It is a very disciplined approach. Liang Wenfeng has a background in high-frequency trading and quantitative systems, and you can see that D N A in DeepSeek. Everything is about optimization, latency, and cost-per-unit of intelligence. They treat AI like a commodity that needs to be produced as cheaply and reliably as possible. They aren't trying to build a god; they are trying to build a utility.
Corn
Which is great for us as users, but I do want to go back to the privacy concern for a second. If I am a developer in the U S or Europe, and I want to use V four because it is cheap and brilliant, what are my actual options for keeping my data from making a round trip to Hangzhou? Can I run this locally, or am I stuck with the API?
Herman
That is the big advantage of their open-weight policy. Unlike G P T five or Gemini, DeepSeek releases the weights for their models. Now, running a one-trillion-parameter model locally is a tall order for most people. Even the V four Lite version, which is two hundred billion parameters, requires a pretty beefy setup. But for a mid-sized company, you can host these on your own private cloud instances. You can put them on a cluster of A one hundreds or H one hundreds in your own data center, and then the data never leaves your building. That is why they are winning the developer heart-and-mind war. They give you the keys to the kingdom, whereas the other labs just let you look through the window.
Corn
That is a huge differentiator. It turns it from a service you rent into a tool you own. I think that is a distinction that gets lost in the noise sometimes. When you are using a closed API, you are at the mercy of their uptime, their pricing changes, and their censorship filters. If you have the weights, you are the boss. You can fine-tune it, you can quantize it, you can do whatever you want.
Herman
And they are leaning into that. They have even been providing specialized quantization scripts to help people run these models on consumer-grade hardware. They want DeepSeek to be everywhere. They want it to be the default engine for the next generation of software. They are betting that by being the most accessible, they will become the most essential.
Corn
Okay, let's get practical for a minute. If someone is listening to this and they are managing a dev team or they are a C T O, how should they be evaluating DeepSeek V four versus the incumbents right now? Is it a binary choice, or is there a way to play both sides without getting caught in the middle?
Herman
I think the smart move right now is a hybrid approach. You use something like Gemini three point one Pro or G P T five point four for your most sensitive, high-stakes reasoning where you need that extra five percent of accuracy and the legal peace of mind of a U S-based provider. But for everything else—your internal documentation bots, your code completion, your data extraction, your agentic workflows—you switch to DeepSeek. The cost savings alone will fund your entire R and D budget for the next year. It is about being pragmatic.
Corn
It is like having a fleet of reliable work trucks and one or two Ferraris. You don't use the Ferrari to haul gravel, and you don't use the work truck to go to a gala. But most of life is hauling gravel.
Herman
That is a rare analogy from you, Corn, but it fits perfectly. You don't need a ten-dollar-per-million-token model to summarize a meeting transcript or format a J S O N object. DeepSeek V four is the work truck that happens to have a jet engine inside. It is overpowered for the price, and that is exactly what developers want.
Corn
I will take the jet-powered truck, please. But what about the technical benchmarks? Daniel mentioned DeepSeek Coder V three and DeepSeek Math V two. Are they still leading the pack there, or have the specialized models from other labs caught up while DeepSeek was in their quiet phase?
Herman
In coding and math, they are still incredibly hard to beat. DeepSeek Coder V three is currently the top-performing open-weight model for Python and C plus plus. In some benchmarks, it is actually outperforming the flagship closed models. The reason is that they have a very specific way of data curation. They don't just scrape the web; they use synthetic data generation to create complex logical problems that the model has to solve during training. It is like they are teaching the model how to think through a problem rather than just memorizing the answer. They are building a logical foundation, not just a pattern matcher.
Corn
It sounds like the era of just throwing more data at a model is officially over, and we are now in the era of high-quality, curated, and synthetic reasoning data. Which is probably good for the internet, because we were running out of human-written text anyway. We have already eaten all the books and all the blogs.
Herman
We definitely were. The next frontier is going to be video and physical interaction data. That is why the native multimodality in V four is so important. By training on video, the model starts to understand the physics of the real world—how objects move, how cause and effect work in a three-dimensional space. That is what will take agents from being just software bots to being something that can actually operate in the physical world. It is about grounding the intelligence in reality.
Corn
That is a bit of a "wait, what?" moment for me. Are you saying V four actually understands that if I drop a glass, it breaks? Like, it has a concept of gravity and impact?
Herman
In a way, yes. By processing millions of hours of video, it builds a world model that is much more sophisticated than a text-only model. It understands temporal consistency. If you ask it to describe what happens next in a video clip, its predictions are grounded in the laws of physics that it has observed. It is not just predicting the next word; it is predicting the next state of the world. It is a subtle but massive shift in how these models represent information.
Corn
That is both incredibly cool and slightly terrifying. It brings us back to the R two reasoning. If you combine a world model with that kind of internalized logic, you are getting very close to something that looks like actual intelligence, not just a very fancy autocomplete. It is starting to feel like we are building something that can actually reason about the world it lives in.
Herman
We are definitely crossing that threshold. And the fact that DeepSeek is doing this while maintaining their commitment to open weights is what makes this moment so pivotal. They are effectively commoditizing high-level intelligence. They are making it so cheap and so accessible that the bottleneck is no longer the AI itself, but our ability to build the systems around it. The challenge now is on the developers to use this power effectively.
Corn
So the takeaway for everyone listening is: don't ignore the quiet ones. DeepSeek took a year off from the spotlight to rebuild their entire stack, and they came back with a trillion-parameter multimodal beast that costs next to nothing to run. They proved that a year of R and D can be more powerful than a year of marketing.
Herman
And keep a very close eye on OpenClaw and their agent infrastructure. That is the real battlefield for the rest of twenty twenty-six. It is not about who has the best chatbot anymore; it is about who has the most capable autonomous agents. DeepSeek is betting the farm on being the foundation for those agents, and with their current pricing and performance, it is a very strong bet.
Corn
I think I need to go lie down and process the idea of a jet-powered work truck that understands physics. That was a lot, even for me. My sloth brain is spinning.
Herman
It is a lot for anyone. But it is a very exciting time to be watching this space. The pace of innovation isn't slowing down; it is just moving into the foundations. We are seeing the infrastructure of the future being built right now.
Corn
Well, if you want to dig deeper into how we got here, you should definitely check out episode fourteen seventy-one, which we called The Cursor Incident. It covers that first big wave of Chinese AI models taking over Western developer tools. And if you are curious about that Xiaomi connection Herman mentioned, episode fifteen ninety-nine goes deep on the MiMo V two and the rise of physical AI agents.
Herman
Both are great context for what is happening right now. It is all connected, from the code editors to the robots on the factory floor.
Corn
Thanks as always to our producer, Hilbert Flumingtop, for keeping the show running smoothly while I move at my usual pace. And a big thanks to Modal for providing the GPU credits that power this show—it is great to have a partner that understands the importance of high-performance infrastructure.
Herman
We couldn't do it without them.
Corn
This has been My Weird Prompts. If you found this useful, or if you just want to make sure you don't miss the next big architectural shift, find us at myweirdprompts dot com. You can find our R S S feed there and all the different ways to subscribe.
Herman
See you next time.
Corn
Catch you later.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.