Welcome to My Weird Prompts. I'm Corn, my brother Herman is here as always, and today we are doing an AI Model Spotlight. The model under the microscope is Cogito v2.1 671B, built by a lab called Deep Cogito. Herman, let's start at the top. Who are these people?
Deep Cogito is a US-based AI lab, and their whole identity is built around open-weight models. They are not trying to be OpenAI or Anthropic in the sense of keeping everything behind a closed API. The weights are out there, you can pull them, you can self-host. That is a deliberate positioning choice, not just a distribution decision.
Are they a known quantity in the space, or is this more of a newer name?
Newer name, but not an unknown one. What they are known for is taking strong open-source base models and doing serious post-training work on top of them. In this case, the base they are working from is DeepSeek-V3-Base, the November 2024 checkpoint. And if you follow the open-weight space at all, you know that DeepSeek-V3 was a significant release. Six hundred and seventy-one billion total parameters, Mixture of Experts architecture, trained on roughly fourteen point eight trillion tokens. It was widely regarded as one of the strongest open-source base models available when it dropped, competitive with GPT-4o and Claude 3.5 Sonnet on code and math benchmarks.
Deep Cogito is not training from scratch here. They are inheriting a very capable base and then doing their own work on top of it.
The post-training is theirs. The base is DeepSeek's. And that is a legitimate and increasingly common approach in the open-weight ecosystem. You are not getting credit for the base model, but you are absolutely accountable for what your post-training does to it. That is where Deep Cogito is making their claim.
The claim they are making with this particular model is pretty direct. They are calling it the best open-weight large language model built by a US company. We will get into whether the benchmarks support that. But first, let's talk about what this thing actually is under the hood.
Walk me through what this model actually is. Six hundred and seventy-one billion parameters is a big number. What are we working with architecturally?
The parameter count is confirmed. Six hundred and seventy-one billion total. And based on the DeepSeek-V3-Base lineage, we are almost certainly looking at a Mixture of Experts architecture, or MoE. The way MoE works is that not all parameters are active on every forward pass. DeepSeek-V3 specifically activates thirty-seven billion parameters per token, even though the total model weight is six hundred and seventy-one billion. So the compute cost per inference call is much closer to a thirty-seven billion parameter dense model than to a six hundred and seventy-one billion one.
That matters for cost and speed.
It matters a lot. It is why you can run a model of this nominal size at a price point that would be impossible for a true six hundred and seventy-one billion dense model. I should be clear that Deep Cogito's own model card does not explicitly confirm the MoE architecture or the active parameter count. That detail comes from the DeepSeek-V3 base, and the supplementary research corroborates it. But if you are building on top of that base and you have not fundamentally changed the architecture, the MoE structure carries over.
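For listeners who want the arithmetic behind that point, here is a back-of-the-envelope sketch. The 671B total and 37B active figures are from the DeepSeek-V3 base as discussed above; the two-FLOPs-per-active-parameter rule of thumb and the dense-model comparison are standard illustrative assumptions, not figures from Deep Cogito.

```python
# Rough per-token inference compute for an MoE model vs a hypothetical
# dense model of the same total size. Rule of thumb: a forward pass costs
# ~2 FLOPs per ACTIVE parameter per token (an approximation, ignoring
# attention overheads and KV-cache effects).

TOTAL_PARAMS = 671e9   # total weights (what you must hold in memory)
ACTIVE_PARAMS = 37e9   # parameters activated per token in the MoE

flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense = 2 * TOTAL_PARAMS  # hypothetical dense equivalent

ratio = flops_per_token_dense / flops_per_token_moe
print(f"MoE compute per token:   {flops_per_token_moe:.2e} FLOPs")
print(f"Dense compute per token: {flops_per_token_dense:.2e} FLOPs")
print(f"Dense would cost ~{ratio:.1f}x more compute per token")
```

The caveat: memory is a different story. You still have to hold all 671 billion parameters to serve the model, so the MoE saving is in compute per call, not in the hardware footprint.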
What about context window? Because that is usually one of the first things engineers want to know.
The model card does not state a context window directly. What we can say from independent sources is that the Together AI listing cites one hundred and twenty-eight thousand tokens, and some configurations apparently go up to around one hundred and sixty-three thousand. So the working assumption for most deployments is a one hundred and twenty-eight thousand token context. But I want to be honest that this is not stated on the primary source page, so treat it as a strong indicator rather than a confirmed spec.
Now the thing that Deep Cogito is really hanging their hat on here is not just the base model. It is what they did in post-training. Tell me about the process supervision approach.
This is the genuinely interesting part. Most reasoning models are trained with outcome supervision. You show the model a problem, it generates a chain of thought, and you reward it based on whether the final answer is correct. Process supervision is different. You are supervising the reasoning chain itself, not just the outcome. You are giving the model feedback on whether each step in its reasoning is on the right track.
Why does that produce shorter chains?
The claim is that it builds better intuition for what a productive search trajectory looks like. Instead of the model generating long, exploratory chains and hoping to land on the right answer, it learns to recognize earlier when it is heading somewhere useful. The result, according to Deep Cogito, is that Cogito v2.1 uses significantly fewer tokens per reasoning task than comparable models. The figure that comes up in the reception research is an average of around four thousand eight hundred and ninety-four tokens per response, and they claim that is roughly sixty percent more efficient than similar-capability reasoning models.
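To make the outcome-versus-process distinction concrete, here is a toy sketch of the two reward schemes. The step grader, the reward shapes, and the example chain are entirely our invention for illustration; this is not Deep Cogito's actual training recipe.

```python
# Toy contrast between the two supervision styles discussed above.
# Outcome supervision: one sparse signal on the final answer.
# Process supervision: a dense signal grading each reasoning step.

def outcome_reward(steps, final_answer, correct_answer):
    # Only asks: did the chain end in the right place?
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(steps, grade_step):
    # Averages per-step quality; grade_step stands in for a learned or
    # rule-based step verifier (hypothetical here).
    if not steps:
        return 0.0
    return sum(grade_step(s) for s in steps) / len(steps)

# A chain that reaches the right answer via one bad step:
steps = ["factor the quadratic", "apply the wrong identity", "correct the slip"]
grades = {"factor the quadratic": 1.0,
          "apply the wrong identity": 0.0,
          "correct the slip": 1.0}

print(outcome_reward(steps, "x=2", "x=2"))         # outcome sees a perfect run
print(process_reward(steps, lambda s: grades[s]))  # process sees the detour
```

The intuition the episode describes falls out of this: under outcome supervision the wasteful detour is invisible as long as the answer lands, while under process supervision it costs reward, which pushes the model toward shorter, more direct chains.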
That is a meaningful difference if it holds up across workloads.
It is not just an efficiency story. Shorter chains also mean lower latency and lower cost per call, which compounds quickly at scale. The other improvements they are claiming over the prior version are in instruction following, coding, longer and more complex queries, multi-turn conversation, and creative tasks. No quantitative delta on any of those, just qualitative claims from the lab. We will see what the benchmarks section tells us about how those claims hold up.
Let us talk pricing. Herman, what are we working with here?
I should flag this upfront, Corn. All pricing we are about to cite is as of April 20, 2026, and these numbers shift, sometimes weekly. That is a standing caveat for this series and it applies here more than most, because the source page itself lists no pricing whatsoever. Deep Cogito does not publish rates on their model card. You have to go to the host platforms directly.
What are the host platforms?
The model is available through OpenRouter, Fireworks AI, Together AI, Baseten, RunPod, and Ollama cloud. And if you want to self-host, you can pull the weights via Ollama locally or grab a GGUF quantised version through Unsloth on HuggingFace. So the deployment surface is reasonably broad for an open-weight model.
What is the pricing picture across those platforms?
The supplementary research points to a range of roughly ninety cents to one dollar and twenty-five cents per million tokens, but I want to be careful here. That figure is not sourced from the model card. It comes from third-party comparison pages, and those can lag or misattribute. The number that appears most consistently in the reception research is around ninety cents per million tokens on the input side, but I would not quote that to a client without checking the platform page directly on the day you are building your cost model.
For context, how does that range sit relative to comparable open models?
It is competitive. You are not paying frontier closed-model rates. But the more interesting comparison is probably cost per useful output token rather than cost per raw token, which is actually where the efficiency argument from the process supervision work starts to matter. We will come back to that when we get into the benchmarks.
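A quick worked example of that cost-per-useful-answer framing, using the figures from this episode: roughly ninety cents per million tokens and an average of about four thousand eight hundred and ninety-four tokens per response. The peer model's chain length is derived by reading "sixty percent shorter" literally, and its price is an assumption for illustration only.

```python
# Cost per ANSWER rather than cost per raw token.
# PRICE_PER_M and COGITO_TOKENS are the episode's figures; the peer
# model's chain length and price are illustrative assumptions.

PRICE_PER_M = 0.90       # USD per million tokens (third-party figure, verify!)
COGITO_TOKENS = 4_894    # avg tokens per response (reception research)

# If Cogito's chains are 60% shorter, a comparable model burns 2.5x the tokens:
PEER_TOKENS = COGITO_TOKENS / (1 - 0.60)
PEER_PRICE_PER_M = 0.90  # assume an identically priced peer

cogito_cost = COGITO_TOKENS / 1e6 * PRICE_PER_M
peer_cost = PEER_TOKENS / 1e6 * PEER_PRICE_PER_M

calls_per_day = 100_000
print(f"Cogito: ${cogito_cost:.5f}/answer -> ${cogito_cost * calls_per_day:,.0f}/day")
print(f"Peer:   ${peer_cost:.5f}/answer -> ${peer_cost * calls_per_day:,.0f}/day")
```

At identical per-token rates, the shorter chains alone cut the daily bill by a factor of two and a half, which is why cost per raw token understates the difference.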
Let us get into what the benchmarks actually show. What is Deep Cogito claiming, and how much of it holds up?
The headline claim is that Cogito v2.1 671B is the best open-weight LLM built by a US company, and that it performs competitively with frontier closed and open models. Those are strong words, and I want to be transparent about something before we dig into the numbers. The benchmark results on the official model card are presented as graphs, not as text. We cannot extract the figures directly from that source. What we have are scores from third-party reception research, and I will attribute those as we go.
What are the numbers that are circulating?
The figures that appear consistently across the reception research are these. On GPQA Diamond, which tests graduate-level science reasoning, the model scores seventy-six point eight percent. On AIME 2025, the American Invitational Mathematics Examination, it scores seventy-two point seven percent. On MMLU Pro, the multi-task language understanding benchmark, eighty-four point nine percent. On LiveCodeBench, which is a coding evaluation, sixty-eight point eight percent. And on HLE, Humanity's Last Exam, eleven percent.
That last one sounds low.
It is low in absolute terms, but HLE is hard. It is designed to be resistant to benchmark saturation, and eleven percent puts Cogito v2.1 in the same general territory as other strong models. It is not an outlier there. The more interesting comparison is on GPQA Diamond, where seventy-six point eight is a competitive number. One third-party comparison page puts that alongside Claude Opus and o3, and Cogito holds its own in that neighbourhood, though it does not top those particular models on every axis.
The token efficiency story, does the benchmark data support that?
The reception research does support it. The figure that comes up is an average of roughly four thousand eight hundred and ninety-four tokens per response, and the claim is that this represents approximately sixty percent shorter reasoning chains than models of comparable capability. That is a distinctive trait. Most reasoning models in this capability tier are burning considerably more tokens to reach similar accuracy. If that figure is reproducible across diverse workloads, and that is still a meaningful if, it has real implications for cost and latency at scale.
Are there any benchmarks where the model looks weaker?
TerminalBench Hard, which tests agentic task completion, comes in at sixteen point seven percent. Long-context reasoning scores are modest. And the instruction following benchmark, IFBench, is forty-six point three percent, which is not a standout number. So the profile is strong on reasoning and math, more mixed on agentic and long-context tasks. That is worth keeping in mind when we get to the workloads conversation.
Given that benchmark profile, where does this model actually earn its place? If I am an engineer evaluating it for a project, what is the honest case for reaching for it?
The clearest case is reasoning-heavy work where token efficiency matters. Think complex multi-step problems where you need the model to work through a chain of logic and arrive at a defensible answer, but you are also watching your inference bill. The process supervision training approach is specifically designed for that. The model is not just trained to get the right answer at the end, it is trained to develop better intuition for the right search path through the problem. In practice that means it tends to reach correct answers with shorter chains, which translates directly to lower cost and faster responses at scale.
What does that mean for a concrete use case?
If you are building a coding assistant or a code review pipeline, this is interesting. SWE-Bench is explicitly included in the evaluation suite, and LiveCodeBench at sixty-eight point eight percent is a strong number. The model has clearly been optimised with coding tasks in mind. If you are running something like an automated pull request review system, or a backend that generates and explains code, the combination of coding capability and token efficiency is a meaningful advantage over models that burn twice the tokens for similar output quality.
What about multi-turn and instruction following? The benchmark numbers there were more mixed.
They were, and I want to be honest about that. The IFBench number is not a standout. But the lab's own qualitative claims, and the reception research, do consistently highlight multi-turn conversation and longer query handling as areas of improvement in v2.1. So the signal is that it is better than the prior version on those dimensions. Whether it is best in class is less clear from the evidence we have.
Where would you steer people away from it?
Anywhere that requires vision, audio, or embeddings. This is a text-only model. No multimodal capability is mentioned anywhere in the documentation. If your pipeline involves image understanding, document parsing from scanned files, or any kind of audio transcription, you need a different model entirely. That is not a criticism, it is just a scope boundary.
What about agentic workloads? We mentioned TerminalBench Hard was low.
Right, sixteen point seven percent on TerminalBench Hard is a flag. If you are building a fully autonomous agent that needs to navigate complex terminal environments or execute long multi-step agentic sequences, the evidence does not strongly support this model for that use case. Reasoning assistant, yes. Autonomous agent operating in complex environments, the numbers suggest caution.
We have covered the architecture, the benchmarks, and the workloads. What is the broader industry actually saying about this one? Is there meaningful reception yet, or is it still early?
There is reception, and it is broadly positive, though with some nuance worth unpacking. The dominant signal from reviewers and platform write-ups is that this is one of the strongest open-weight models globally right now. That framing comes up consistently, not just from Deep Cogito's own marketing but from third-party coverage. The Together AI model page, for instance, describes it as matching the performance of frontier closed and open models, and that language is echoed across several independent comparisons.
Matching frontier closed models is a strong claim. Is that holding up under scrutiny?
On specific benchmarks, particularly reasoning and math, the numbers are competitive with models like DeepSeek V3 and in some comparisons with o1. The GPQA Diamond score of seventy-six point eight percent and the AIME 2025 score of seventy-two point seven percent are strong numbers for an open-weight model. Where it gets more complicated is on tasks like Humanity's Last Exam, where eleven percent is a relatively modest score, and on agentic benchmarks, which we covered. So the "matches frontier" framing is accurate in some domains and overstated in others.
You mentioned there was at least one review that was more measured.
Yes, one review we came across scored it moderately, giving it a technical score of three point eight out of ten and a content score of five out of ten, while acknowledging its strong global standing. The value score was higher, at six point eight, which tracks with the efficiency story. That review noted room for improvement, which is a fair read. It is not a universal ten out of ten, and anyone treating it as such is not doing the full analysis.
What about the infrastructure side? Engineers care about latency and throughput, not just benchmark scores.
The Fireworks AI deployment data is useful here. They are reporting a time to first token of three hundred and sixty-seven milliseconds and a throughput of thirty-one tokens per second. For a six hundred and seventy-one billion parameter model, that is a reasonable operational profile. It is not the fastest model you can run, but it is not unusable either. And the fact that it is available across Together AI, Fireworks, OpenRouter, Baseten, and RunPod, plus self-hosting via Ollama and Unsloth GGUF, means teams have genuine flexibility in how they deploy it.
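Those Fireworks figures are easy to turn into an end-to-end latency estimate. A sketch, assuming the quoted time to first token and a constant decode rate over a full response, which real deployments only approximate:

```python
# End-to-end latency estimate from TTFT plus streaming throughput.
# 367 ms and 31 tok/s are the quoted Fireworks figures; assuming the
# decode rate stays constant, which is an idealisation.

TTFT_S = 0.367        # time to first token, seconds
TOKS_PER_S = 31.0     # decode throughput
AVG_RESPONSE = 4_894  # avg tokens per response, from the reception research

total_s = TTFT_S + AVG_RESPONSE / TOKS_PER_S
print(f"~{total_s:.0f} s (~{total_s / 60:.1f} min) for a {AVG_RESPONSE}-token response")
```

That works out to well over two minutes for a full average-length response, which is exactly why the shorter-chain story matters as much for latency as it does for cost: every token you do not generate is time you do not wait.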
Any red flags or controversies in the reception?
The main honest criticism is that the benchmarks are published as images rather than tables, which makes independent verification harder, and the competitive comparisons are somewhat selective about which models get named. That is not unusual for a lab release, but it is worth noting. The reception is positive, and the caution is methodological rather than substantive.
Alright, let's land this. If you are an AI professional looking at Cogito v2.1 671B right now, what is the actual decision framework? When do you reach for it?
The clearest case is when you need a strong open-weight model for reasoning-heavy work and you want to keep token costs down. The efficiency story is real. Sixty percent shorter reasoning chains than comparable models is not a marketing number you can ignore, especially if you are running high-volume inference. For coding tasks, multi-turn conversations, and complex instruction following, the benchmark profile supports it. GPQA Diamond at seventy-six point eight, AIME at seventy-two point seven, these are competitive numbers in the open-weight tier.
The open-weight status itself is a reason to reach for it?
For a lot of teams, yes. If data privacy is a constraint, if you need to self-host for compliance reasons, or if you want the option to fine-tune, the fact that weights are available on HuggingFace and you can run it locally via Ollama or Unsloth GGUF is a genuine differentiator. The no-storage policy on Deep Cogito's own chat interface is a small but real signal that the lab is thinking about that use case.
When do you not reach for it?
Three clear cases. First, if you need vision, audio, or embedding capabilities, this model does not have them. Second, if your workload is heavily agentic, the TerminalBench Hard score of sixteen point seven percent suggests you should test carefully before committing. Third, if pricing transparency is a hard requirement for your procurement process, the fact that you have to check each platform independently and that rates shift frequently is a friction point worth acknowledging.
The DeepSeek base lineage, is that a concern?
It is a question worth asking in your own organisation, particularly around the licence terms, which are permissive but not MIT. Teams should read the DeepSeek licence before deploying commercially. That is not a red flag, it is due diligence.
The short version.
Strong open-weight option for reasoning, coding, and multi-turn work. Gaps in multimodal capability and agentic tasks. Do your licence homework. If those gaps do not apply to your use case, it belongs on your evaluation shortlist.
That is Cogito v2.1 671B from Deep Cogito. Thanks for listening to My Weird Prompts.