So, Herman, I was looking at some of the developer forums last night, and the fallout from the Cursor incident is still absolutely everywhere. It is the kind of drama that really peels back the curtain on how the industry is actually operating behind the scenes. It is not just a technical glitch; it feels like a vibe shift in the entire power dynamic of global A I development.
It really is. The Cursor and Kimi leak from March nineteenth has effectively shattered the illusion that Western frontier models have a permanent, unbridgeable lead. For those who missed the noise, Cursor, which is widely considered the gold standard for A I powered coding tools, was found to be using a fine tuned version of Moonshot A I's Kimi K two point five for its Composer two feature. It only came out because of a leaked model identifier in the A P I responses, and then Yulun Du, who is the head of pretraining at Moonshot, basically called them out on it on social media.
It is wild because everyone just assumes these top tier Western tools are powered by OpenAI or Anthropic. If you are paying for a premium subscription to a Silicon Valley startup, you expect Silicon Valley code under the hood. But today's prompt from Daniel is about why that assumption is becoming increasingly outdated. Daniel wants us to dive into the big four Chinese labs: DeepSeek, Moonshot, Zhipu, and Mini Max. He is asking which of these show the most promise for code and agents, how they actually stack up against the Western state of the art, and if there is any reason to use them other than just saving a few bucks.
I am Herman Poppleberry, and I have been waiting for us to really dig into this because the shift we have seen in just the last few months is staggering. We are no longer talking about cheap knockoffs or cost effective alternatives. We are talking about architectural innovations that are forcing the heavy hitters in San Francisco to look over their shoulders. As of today, March twenty third, twenty twenty six, the gap is not just closing; in some specific domains like long context retrieval and reasoning efficiency, it might have already disappeared.
Well, let us start with the one that seems to be the current king of the hill for developers, which is DeepSeek. They have been the talk of the town since the R one release, but the buzz around the imminent V four release is reaching a fever pitch. People are saying it might actually leapfrog G P T five point three in long context coding. What is actually under the hood there that makes them so different?
The genius of DeepSeek, led by Liang Wenfeng, really comes down to mathematical efficiency. While the major Western labs have been throwing more and more compute and more and more parameters at the problem—basically trying to brute force intelligence—DeepSeek has focused on architectural cleverness to bypass the need for massive hardware clusters. They use something called Multi head Latent Attention, or M L A.
Okay, break that down for the non engineers. What does M L A actually do for the end user?
In a standard transformer model, the key value cache, or K V cache, stores the context the model needs to remember. Think of it like the model's short term working memory. In standard models, that cache grows linearly with the context length. If you have a hundred thousand tokens of code, that cache becomes a massive memory hog, making the model incredibly expensive and slow to run. M L A compresses the keys and values for each token into a much smaller shared latent vector, and caches only that latent; the full keys and values are reconstructed on the fly when attention runs. That lets them handle massive context windows with a fraction of the memory overhead. It is like being able to fit a whole library into a single notebook without losing the details.
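As a rough back-of-the-envelope sketch of that memory argument, this is what the arithmetic looks like; every dimension here is an illustrative assumption, not DeepSeek's actual configuration:

```python
# Back-of-the-envelope KV-cache sizing: standard multi-head attention versus
# a compressed latent cache in the spirit of MLA. All layer counts, head
# counts, and dimensions below are made-up illustrative values.

def kv_cache_bytes_standard(tokens, layers=60, heads=64, head_dim=128, bytes_per=2):
    # Standard attention caches a key and a value vector per head, per layer,
    # per token (the factor of 2 is keys plus values).
    return tokens * layers * heads * head_dim * 2 * bytes_per

def kv_cache_bytes_latent(tokens, layers=60, latent_dim=512, bytes_per=2):
    # An MLA-style cache stores one compressed latent vector per layer, per
    # token, from which keys and values are reconstructed at attention time.
    return tokens * layers * latent_dim * bytes_per

tokens = 100_000  # the "hundred thousand tokens of code" example
std = kv_cache_bytes_standard(tokens)
lat = kv_cache_bytes_latent(tokens)
print(f"standard: {std / 1e9:.1f} GB, latent: {lat / 1e9:.2f} GB, "
      f"ratio: {std / lat:.0f}x")
```

Under these invented dimensions the latent cache comes out roughly thirty times smaller; the real ratio depends entirely on the model's actual head count and latent size.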
And that is why the pricing is so disconnected from what we see here in the States. Daniel mentioned that DeepSeek V three costs about twenty seven cents per million input tokens. To put that in perspective for everyone, Claude four point six Opus is five dollars for that same million tokens. That is nearly a twenty times difference. Is that just a loss leader strategy to gain market share, or is the architecture actually that much cheaper to run?
It is the architecture. When you combine M L A with their specific implementation of Mixture of Experts, which they call DeepSeek M o E, you get a model that performs at a frontier level but was trained for approximately six million dollars. Most estimates for training a model of that caliber in the West start at a hundred million dollars and go up from there. DeepSeek is a subsidiary of High Flyer Quant, which is a massive quantitative hedge fund, so they approached this from a perspective of extreme mathematical efficiency. They are not just trying to build a smart chatbot; they are building a reasoning engine for high stakes environments where every millisecond and every watt of power counts.
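To make that pricing gap concrete, here is a toy monthly-cost calculation using the per-million-input-token figures quoted in this episode; the call volumes are invented purely for illustration:

```python
# Rough monthly-cost comparison for an agentic workload at the input-token
# prices quoted in the discussion. Call volume and tokens-per-call are
# illustrative assumptions, not real usage data.

PRICE_PER_M_INPUT = {
    "deepseek-v3": 0.27,       # dollars per million input tokens (quoted figure)
    "claude-4.6-opus": 5.00,   # quoted figure for the Western comparison
}

def monthly_cost(model, calls_per_day, tokens_per_call, days=30):
    tokens = calls_per_day * tokens_per_call * days
    return tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

# e.g. an agent fleet making 50,000 calls a day at 8,000 input tokens each
for model in PRICE_PER_M_INPUT:
    print(model, f"${monthly_cost(model, 50_000, 8_000):,.0f}/month")
```

At that hypothetical volume the same workload runs in the low thousands of dollars on one price point and the tens of thousands on the other, which is the "nearly twenty times" gap in dollar terms.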
That reasoning aspect is key. DeepSeek R one matched OpenAI's o one series on the A I M E twenty twenty four benchmarks, which is basically the gold standard for mathematical reasoning. They both landed around seventy nine percent. It feels like the moat OpenAI thought they had with reinforcement learning and "thinking" models was crossed in record time.
And if the reports about V four are accurate, we are about to see them pull ahead in long context coding tasks. But we should not let DeepSeek soak up all the spotlight because what Zhipu A I is doing with G L M five is arguably even more impressive from a systems engineering perspective. Zhipu is a spin off from Tsinghua University, and they just went public in January. Their G L M five model is a seven hundred forty four billion parameter beast, but the most important thing about it is not the size. It is the fact that it was trained entirely on domestic Huawei Ascend chips.
That feels like a direct answer to the export controls. There was this narrative for a long time that without Nvidia H one hundreds or the newer B two hundreds, Chinese labs would be stuck in the previous generation of A I. But G L M five scored a seventy seven point eight on the S W E bench Verified benchmark. That is neck and neck with Anthropic's latest and greatest.
The hardware sovereignty angle cannot be overstated. By optimizing their models for the Huawei Ascend stack, Zhipu has proven that you can reach state of the art performance without being tethered to the Silicon Valley supply chain. Beyond the hardware, G L M five is explicitly designed as a "systems architect" rather than a conversationalist. When we talk about agentic A I, most models are just chat boxes that have been shoehorned into using tools. G L M five was built from the ground up to handle multi step software engineering workflows. It looks at a codebase not as a string of text, but as a complex system of dependencies.
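The "complex system of dependencies" framing can be illustrated with a toy dependency graph; the file names and import relationships here are hypothetical, and no actual model is involved:

```python
# Toy illustration of treating a codebase as a dependency graph rather than
# a string of text: an agent planning a multi-step refactor can order files
# so that changes land before their dependents. Purely illustrative.
from graphlib import TopologicalSorter

deps = {                       # file -> files it imports (hypothetical repo)
    "app.py": {"db.py", "auth.py"},
    "auth.py": {"db.py"},
    "db.py": set(),
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # dependencies first, dependents last
```

A chat-focused model sees three blobs of text; a systems-oriented agent needs exactly this kind of ordering to know that touching `db.py` ripples into everything else.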
I love that distinction. It is the difference between an assistant who can write a function for you and an engineer who can refactor an entire repository. If I am building an agentic workflow, I do not care if the model is polite or has a "personality." I want it to understand the architectural implications of a pull request.
G L M five actually ranks first among open weights models in the Vending Bench two business simulation, which tests high level decision making and resource management. It is designed to think in terms of execution. And speaking of execution, we have to talk about Mini Max and their new M two point seven model that just dropped on March eighteenth. They are calling it a "self evolving" model.
That sounds like marketing fluff at first glance. What does "self evolving" actually mean in this context? Is it just a buzzword, or is it actually doing something new?
It refers to their focus on autonomous research workflows. Mini Max was founded by Yan Junjie, who was a vice president at SenseTime, and he has a very specific vision for A I that can perform reinforcement learning research on its own. In their internal testing and on the M L E Bench Lite, which focuses on machine learning engineering tasks, M two point seven achieved a sixty six point six percent medal rate. That puts it right there with Google’s Gemini three point one. It is specifically tuned to handle the kind of iterative, trial and error work that human researchers do when they are trying to optimize a model or analyze a complex data set. It is essentially an A I that is good at building better A I.
So if DeepSeek is the mathematician and G L M is the systems architect, Mini Max is the data scientist. It is interesting how these labs are starting to specialize rather than everyone just trying to build a general purpose "everything" model. It makes the choice of which one to use much more tactical for a developer.
It really does. And then you have Moonshot A I, the creators of Kimi. Their founder, Yang Zhilin, is a former Google and Meta researcher who is basically a legend in the field of long context. Kimi K two point five, which launched on March eleventh, has a two hundred fifty six thousand token context window, but what makes it the leader in that space is their proprietary prefix caching.
We have talked about the "Vector D B Hangover" in episode twelve fifteen, where we looked at how expensive it is to keep retrieving data for R A G. Does Kimi's prefix caching solve that?
It changes the math entirely. Prefix caching allows the model to store the pre processed state of a large chunk of text, like a massive legal archive or a ten thousand page codebase. When you ask a question, the model does not have to re process that entire context from scratch. It just jumps straight to the new information. It makes retrieval across those two hundred fifty six thousand tokens feel almost instant and, more importantly, makes it much cheaper for the developer. This is why Cursor was using it. If you are building a tool that needs to "see" your whole project at all times, you need that kind of efficient long context retrieval. You cannot wait thirty seconds for a model to "read" your files every time you hit save.
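A toy sketch of the prefix-caching idea looks like this; it is a stand-in illustration of the pattern, not Moonshot's implementation, and the encode step is a placeholder for a real model's prefill pass:

```python
# Toy sketch of prefix caching: cache the expensive "processed state" of a
# large static prefix (a codebase, an archive) keyed by its content hash, so
# repeated queries only pay for the new suffix.
import hashlib

_prefix_cache = {}
encode_calls = 0  # counts how often we pay the expensive prefill


def expensive_encode(text):
    global encode_calls
    encode_calls += 1
    return f"state({len(text)} chars)"  # placeholder for a KV-cache snapshot


def answer(prefix, query):
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _prefix_cache:        # first query pays the full prefill...
        _prefix_cache[key] = expensive_encode(prefix)
    state = _prefix_cache[key]          # ...later queries reuse the cached state
    return f"{state} + {query!r}"


codebase = "def main(): ...\n" * 10_000  # a large, static prefix
answer(codebase, "where is main defined?")
answer(codebase, "list the public functions")
print("prefill passes:", encode_calls)   # 1, not 2
```

The second question never re-reads the codebase, which is the whole trick: the cost of "seeing" the project is paid once, not on every keystroke.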
It is funny you mention episode twelve fifteen, because back then we were talking about how the gold rush of vector databases was hitting a wall because of the cost of scaling. If you can just stuff a quarter million tokens into a model like Kimi and have it actually work without breaking the bank, the need for complex R A G architectures starts to diminish for a lot of mid sized use cases.
The economics are shifting so fast that what was a best practice six months ago is now a legacy bottleneck. But we should address the elephant in the room that Daniel brought up, which is the comparison to Western state of the art models. If we look at the S W E bench for software engineering, G L M five is at seventy seven point eight and Mini Max M two point seven is at fifty six point two. Claude four point six Opus is at eighty point eight and G P T five point three Codex is at ninety two point zero. So the Western models, especially the specialized coding versions of G P T five point three, still hold a raw performance lead in pure coding.
But at what cost? That is the part that keeps coming back to me. If I am an enterprise and I am running millions of agentic steps a day, am I going to pay twenty times more for a five or ten percent increase in benchmark performance? Especially when the "cheaper" models are open weights?
That is the second major advantage beyond cost: sovereignty. When you use Claude or G P T five point three, you are sending your data to a closed A P I. You have no control over the model weights, no ability to self host, and you are subject to the safety filters and "nanny" layers that these companies bake in. We did a whole episode on this, episode eight forty seven, about the rise of uncensored models. DeepSeek and Zhipu provide open weight versions of their frontier models. An enterprise can take DeepSeek V three or G L M five, host it on their own local hardware, and know that their proprietary data never leaves their four walls.
And for a lot of companies, especially in finance or healthcare, that is not just a "nice to have" feature. It is a requirement. If you can get near G P T five performance on your own servers for a fraction of the price, the "moat" of the San Francisco labs starts looking more like a white picket fence.
It also touches on the political worldview we often discuss. There is a real strategic necessity for sovereign A I. If you are a country or a massive corporation, you cannot afford to have your entire cognitive infrastructure dependent on the whims or the regulatory environment of a single foreign city. The fact that these Chinese models are becoming "frontier class" means there is now a viable alternative stack. Even if you are pro American and want the U S to lead, you have to acknowledge that the competition is no longer just "copying." They are innovating on the fundamental math.
It is a bit of a wake up call. We saw this with the Cursor incident. A U S based company, built by world class engineers, looked at the landscape and decided that a Chinese model was the best tool for their specific job. That is a meritocratic shift. It says that the best tool wins, regardless of the passport.
It is also a licensing headache. The Cursor incident got spicy because Moonshot's license for Kimi is a modified M I T license. It requires prominent attribution if the model is being used in a high revenue product. Cursor apparently skipped that part, which is why Yulun Du was so vocal about it. It shows that these labs are starting to protect their intellectual property just as aggressively as the Western labs. They know they have something valuable. They are not just giving it away to help Western startups get rich.
So if we are looking at the "Big Four" and trying to give Daniel some practical takeaways on which one to use for what, how would you break it down?
If you are doing pure algorithmic coding, heavy math, or if you need a "thinking" model that can reason through complex logic, DeepSeek R one and the upcoming V four are your best bets. They are the closest thing we have to a peer for OpenAI's o series. If you are building agentic systems, like a virtual software engineer or a complex business automation agent, Zhipu's G L M five is the leader. Its "systems architect" design makes it much more reliable at multi step tool use. It does not get "distracted" as easily as a chat focused model.
And for the research and data side?
That is where Mini Max M two point seven shines. If you need to automate machine learning experiments or complex office suite tasks like deep Excel modeling or multi document analysis, their focus on "self evolving" intelligence and autonomous research workflows gives them an edge. And finally, if you have a massive amount of static context—like a whole library of technical manuals or a giant codebase—and you need to query it constantly, Kimi's prefix caching is the gold standard. It will save you a fortune compared to running standard R A G against a Western A P I.
It feels like the era of the "one model to rule them all" is ending. We are moving into a world of hyper specialized, highly efficient models where the geographic origin is becoming less important than the architectural philosophy.
We are also seeing the "Agentic Shift" we predicted in episode twelve thirty one play out in real time. We said back then that twenty twenty six would be the year where raw benchmarks matter less than agentic reliability. G L M five and Mini Max M two point seven are the embodiment of that prediction. They are built for the "do" phase of A I, not just the "chat" phase. They are designed to be reliable components in a larger machine.
What I find most fascinating is the hardware angle you mentioned. If Zhipu can hit S O T A on Huawei chips, it suggests that the U S strategy of chip denials might have actually accelerated the development of a completely independent and highly optimized alternative stack. It is the classic "necessity is the mother of invention" scenario. If you cannot buy the best chips, you have to write the best math.
It is a massive hedge against export controls. If you are a developer in a region that might face future restrictions, or if you just want to avoid the "Nvidia tax," these models prove there is a path forward. It is a more resilient ecosystem when there are multiple ways to reach the frontier. We are seeing a diversification of the entire A I supply chain, from the silicon up to the weights.
I do want to poke at one thing though. We have talked about the advantages, but what about the risks? If an enterprise is self hosting DeepSeek, they still have to worry about the provenance of the training data and the potential for embedded biases or even security vulnerabilities. Is that a real concern or just geopolitical F U D?
It is a legitimate concern for any open weight model, regardless of where it comes from. When you host a model, you are responsible for the output. However, because the weights are open, you can actually audit the model more effectively than you can with a closed A P I like G P T five point three. You can run your own red teaming, you can apply your own fine tuning to align it with your corporate values, and you can wrap it in your own security layers. In many ways, self hosting an open weight model from Zhipu gives you more security control than sending your data to a black box in San Francisco where you have zero visibility into what happens to your prompts.
That is a fair point. Transparency is a form of security in itself. You might not know exactly what went into the training set, but you can see exactly how it reacts to your specific data without a middleman watching.
And that brings us back to the "Vector D B Hangover" we mentioned earlier. When you have these massive, efficient context windows and the ability to self host, the entire architecture of how we build A I applications changes. You no longer need to build these fragile, complex pipelines to feed bits of information to a distant, expensive model. You can just give the model the whole context and let it work. It is a return to a simpler, more robust way of programming.
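That "just give the model the whole context" decision can be sketched as a simple heuristic. The window size here matches the Kimi figure discussed earlier, and the four-characters-per-token ratio is a rough rule of thumb, not a real tokenizer count:

```python
# Heuristic from the discussion: if the whole corpus fits comfortably in the
# model's context window, skip the retrieval pipeline and send everything.

CONTEXT_WINDOW = 256_000  # tokens, the Kimi-class window discussed above

def estimated_tokens(text):
    return len(text) // 4  # rough heuristic; use a real tokenizer in practice

def plan(corpus_chunks):
    total = sum(estimated_tokens(c) for c in corpus_chunks)
    if total <= CONTEXT_WINDOW * 0.8:   # leave headroom for query and answer
        return "stuff-context"          # one prompt, whole corpus, no pipeline
    return "rag"                        # too big: fall back to retrieval

docs = ["x" * 400_000, "y" * 300_000]   # roughly 175k estimated tokens total
print(plan(docs))
```

The point is not that retrieval pipelines disappear, but that a whole band of mid-sized corpora now falls on the "stuff-context" side of the line.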
It is a much more elegant way to build. So, looking ahead to the rest of twenty twenty six, we have DeepSeek V four on the horizon. We have the continued evolution of the Huawei hardware stack. Do you think the Western labs have a response, or are we looking at a permanent state of parity now?
I think we are looking at a permanent state of intense competition. The "A I moat" that people were talking about a year ago—the idea that you needed ten billion dollars and a hundred thousand H one hundreds to compete—has been proven false. DeepSeek did it for six million. That means the barrier to entry for frontier level intelligence is much lower than we thought. The response from the Western labs will likely be to lean even harder into "agentic" ecosystems—integration with your O S, your email, your entire digital life—where they still have a massive home field advantage.
The "ecosystem moat" rather than the "model moat." That makes sense. If Apple or Google can bake their models into the hardware you already carry, they do not need to be twenty times cheaper. They just need to be there. But for the developers, the builders, and the enterprises who are looking at the raw utility of the intelligence, the "Big Four" are no longer optional. They are a core part of the toolkit.
They really are. If you are not testing your workflows against G L M five or DeepSeek, you are probably overpaying and underperforming. The frontier is no longer a single point on a map; it is a global, distributed network of innovation.
It is a wild time to be a developer. Daniel, thanks for the prompt—this was a deep one, but I think it is one of the most important shifts we have covered this year. It really changes the perspective on where the "frontier" actually is.
It is not a single point on a map anymore. It is a global race.
Well, I think that is a good place to wrap this one up. We have covered the efficiency of DeepSeek, the agentic focus of Zhipu, the research autonomy of Mini Max, and the long context mastery of Kimi. The "Big Four" are here, and they are not just competing on price—they are competing on the fundamental way these models are built and deployed.
It is a testament to what happens when you have a massive amount of talent and a very clear set of constraints. It forces a level of innovation that you just do not see when the budget is unlimited.
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the G P U credits that power this show and allow us to dive into these technical topics every week.
If you want to keep up with these developments as they happen, search for My Weird Prompts on Telegram to get notified the second a new episode drops. We are also on Spotify, Apple Podcasts, and pretty much everywhere else you might be listening.
This has been My Weird Prompts. We will catch you in the next one.
See you then.