Welcome to My Weird Prompts. I'm Corn, my brother Herman is here as always, and today we are doing an AI Model Spotlight. The model is Mercury 2, built by a lab called Inception, also known as Inception Labs. Herman, you brought this one to the table. Give us the lab first.
Inception Labs is a Palo Alto-based AI startup, and the thing that makes them worth paying attention to is that they are not building another autoregressive transformer. They came out of stealth in early 2025 with a specific thesis: that diffusion models, the same broad family of techniques behind image generation, could be applied to text at commercial scale. That is their whole identity as a lab.
They have been funded to actually pursue that thesis, not just blog about it.
They raised fifty million dollars in a seed round in November 2025, led by Menlo Ventures. The notable co-investors are NVentures, which is NVIDIA's venture arm, and M12, which is Microsoft's. There was also an earlier Mayfield-led investment before that round. So the backing is serious and the strategic investors are exactly who you would want if you are betting on a new hardware-adjacent inference paradigm.
The NVIDIA piece is interesting. That is not just financial validation.
It signals that the approach works on existing GPU infrastructure, which matters a lot for adoption. The CEO is Stefano Ermon, who is a Stanford professor and has published extensively on diffusion models. So the lab has genuine research lineage, not just a product team that licensed someone else's work.
Mercury 2 sits in a family of models, not just a one-off release.
The Mercury family currently has three members: the original Mercury, Mercury Coder which is their code-focused variant, and now Mercury 2, which is the flagship reasoning model. Mercury 2 released on March 4, 2026. It is the one we are looking at today.
Let's get into what this model actually is, because the architecture is the whole story here. This is not another fine-tune on a transformer base.
No, it is genuinely different at the architectural level. Mercury 2 is what Inception calls a diffusion large language model, or dLLM. The reasoning part is new with this release, which is why they are calling it the first reasoning dLLM. But to understand why that matters, you have to understand what diffusion means in this context.
Walk us through that.
Standard autoregressive language models generate text one token at a time, left to right, each token conditioned on everything before it. That is the fundamental loop. Diffusion models work differently. The conceptual origin is in image generation, where you start with noise and iteratively refine it toward a coherent output. Inception has applied that same principle to text. Mercury 2 generates and refines multiple tokens in parallel rather than producing them sequentially.
The claim is that parallelism is where the speed comes from.
That is the core thesis, yes. If you are not bottlenecked by a sequential dependency chain, you can do a lot more work per unit of time on the same hardware. We will get into what the observed numbers actually look like when we hit the benchmarks segment, but architecturally that is the mechanism.
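To make that mechanism concrete, here is a toy sketch of the contrast. This is emphatically not Inception's actual algorithm, just an illustration of the shape of the two loops: a diffusion-style generator makes a fixed number of refinement passes over the whole sequence, while an autoregressive one makes one model call per token.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "<mask>"

def propose(tokens):
    # Stub "denoiser": proposes a token for every masked position at once.
    # A real dLLM would score all positions in one parallel forward pass.
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def diffusion_generate(length, steps=4):
    # Start from an all-masked sequence and refine it in parallel.
    # Model calls scale with `steps`, not with sequence length.
    tokens = [MASK] * length
    for step in range(steps):
        draft = propose(tokens)
        # Commit a growing fraction of positions each pass. Real dLLMs
        # use confidence schedules to decide which positions to keep.
        keep = int(length * (step + 1) / steps)
        tokens = draft[:keep] + tokens[keep:]
    return tokens

def autoregressive_generate(length):
    # One model call per token, each conditioned on the prefix.
    tokens = []
    for _ in range(length):
        tokens.append(random.choice(VOCAB))
    return tokens
```

The point of the sketch is the call count: `diffusion_generate` touches the model four times for a 16-token output where the autoregressive loop touches it sixteen times, which is where the claimed parallelism dividend comes from.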
What about the reasoning capability specifically? Because diffusion for text generation is not brand new, but reasoning on top of it apparently is.
Right, the original Mercury family demonstrated the diffusion approach for general text and for code. Mercury 2 adds a reasoning layer, and importantly it exposes that as a tunable parameter through the API. So developers can dial the reasoning level rather than getting a fixed amount of chain-of-thought compute on every call. The model card mentions that reasoning tokens are tracked separately in the API response, which matters for cost accounting if you are building something that only needs heavy reasoning some of the time.
Though we should say we do not have a lot of detail on what those levels actually are in practice.
The API exposes a reasoning parameter, but the page does not specify how many levels there are, what the latency or cost delta looks like between them, or what the practical difference in output quality is. That is a gap. If you are evaluating this for a production system, you would need to test that yourself.
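For anyone who wants to poke at that parameter themselves, a request might look something like the sketch below. The model slug and the exact shape of the `reasoning` field are assumptions on our part, modelled on OpenRouter's unified conventions; Inception's own documentation is the source of truth.

```python
import json

def build_request(prompt, reasoning_level):
    # Hypothetical payload for an OpenAI-compatible chat completions
    # endpoint. The model slug and the "reasoning" field shape are
    # assumptions, not confirmed by the model page.
    return {
        "model": "inception/mercury-2",  # assumed slug
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": {"effort": reasoning_level},
    }

payload = build_request("Prove that 17 is prime.", "high")
print(json.dumps(payload, indent=2))
```

Since reasoning tokens are tracked separately in the response, a sensible evaluation would sweep this one field across its levels and compare the reported reasoning token counts, latency, and answer quality on your own workload.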
What else is notable on the capability side?
Native tool use is supported, which is table stakes for agentic work but worth confirming. Schema-aligned JSON and structured outputs are supported natively, not bolted on. The context window is one hundred and twenty-eight thousand tokens with a maximum output of fifty thousand tokens. It is OpenAI API compatible, so the integration lift is low if you are already in that ecosystem. And cache read is supported, which we will see reflected in the pricing numbers.
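As a sketch of what the structured output support looks like from the caller's side, here is a schema-constrained request in the OpenAI `response_format` style, which is what OpenAI API compatibility implies. The slug is again an assumption, and whether Mercury 2 honours the `strict` flag exactly as OpenAI does is something you would verify against Inception's docs.

```python
# JSON schema the model's output must conform to.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number"},
    },
    "required": ["city", "temp_c"],
    "additionalProperties": False,
}

request = {
    "model": "inception/mercury-2",  # assumed slug
    "messages": [{"role": "user", "content": "Weather in Oslo as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "weather", "strict": True, "schema": schema},
    },
}
```

The practical appeal is that a low-latency model plus native schema enforcement removes a whole class of parse-and-retry plumbing from agent pipelines, provided the observed error rates hold up under your load.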
What about parameter count? The model card does not list it, so we cannot do a direct apples-to-apples size comparison with other models in this tier.
What does Mercury 2 actually cost to run?
Before I get into the numbers, I should flag the caveat we always put on this segment. All pricing we are about to cite is as of April 20, 2026. These numbers shift, sometimes weekly, so check the current rates on the OpenRouter pricing page before you build anything around them.
Standard input is twenty-five cents per million tokens. Output is seventy-five cents per million tokens. Those are the headline rates. Cache read drops to two and a half cents per million tokens, which is a tenth of the standard input price.
There is a weighted average figure in there too.
Right, and this is worth paying attention to. The observed weighted average for input over the last hour of data we have is about fourteen cents per million tokens, not twenty-five. The reason is a forty-eight point one percent cache hit rate. Nearly half of all input tokens being served are coming from cache, so the effective cost is substantially lower than the rack rate.
That is a meaningful gap.
If your workload has significant prompt repetition, shared system prompts, or you are running a lot of similar queries, you could see effective input costs closer to that fourteen cent figure than the twenty-five cent headline. Output weighted average is essentially flat against the standard rate, seventy-four point nine cents versus seventy-five, so caching is not moving the needle on the output side.
What about tiered or batch pricing?
None are listed. On the hosting side, OpenRouter is the sole provider at this point, and the page shows no tiered or batch pricing, no direct API pricing from Inception, and no self-hosting option, so we cannot compare those alternatives.
One provider, no batch discounts, but that cache hit rate is doing some real work on the effective input cost.
Let us get into the performance numbers. What is Inception claiming on speed?
The headline claim is over one thousand tokens per second on standard GPUs. They also claim Mercury 2 is five times faster or more than Claude 4.5 Haiku and GPT-5 Mini. One review we found put the real-world figure even higher, citing a range of roughly six hundred and sixty to nearly twelve hundred tokens per second depending on conditions, and end-to-end latency of about one point seven seconds compared to fourteen to twenty-three seconds for autoregressive peers.
What does the observed data on OpenRouter actually show?
The provider card shows average throughput of one hundred and thirty-eight to one hundred and forty-five tokens per second, with an average end-to-end latency of zero point six two seconds and a first-token latency of zero point two eight seconds. So there is a significant gap between the lab claim of over one thousand tokens per second and what the OpenRouter performance tab is recording.
That is a big discrepancy. What explains it?
The page does not explain it directly, and we should be honest about that. The most likely explanation is that the lab's benchmark figure is measured under high-throughput batch conditions on specific hardware, while the OpenRouter observed figure reflects real-world single-request or low-concurrency traffic. Different measurement conditions produce very different numbers. Neither figure is necessarily wrong, but they are measuring different things, and if you are designing a system around throughput expectations, you need to test under your own load profile, not take either number at face value.
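The measurement-conditions point is worth demonstrating, because it trips people up constantly. The harness below uses a stubbed model call that just sleeps, but the structure is what you would use with a real client: aggregate tokens per second rises with concurrency even though per-request latency is unchanged, so a batch benchmark and a single-request benchmark report very different numbers for the same system.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_completion(n_tokens=100, per_token_s=0.001):
    # Stand-in for a model call; swap in a real API client to
    # benchmark Mercury 2 under your own load profile.
    time.sleep(n_tokens * per_token_s)
    return n_tokens

def measure_throughput(concurrency, requests=8):
    # Aggregate tokens/sec across all in-flight requests.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(lambda _: fake_completion(), range(requests)))
    return tokens / (time.perf_counter() - start)

single = measure_throughput(concurrency=1)
batched = measure_throughput(concurrency=8)
# batched reports far higher tokens/sec than single, despite each
# individual request taking exactly as long.
```

Neither number lies; they answer different questions, which is the most charitable reading of the gap between the lab's figure and OpenRouter's.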
What about quality benchmarks?
The picture is mixed in an interesting way. GPQA Diamond, which tests graduate-level scientific reasoning, comes in at seventy-seven percent. That is a strong result for a model in this price tier. AIME 2025, the competitive mathematics benchmark, scored ninety-one point one percent according to one review, which is competitive with the Haiku and Mini class. Instruction following on IFBench is sixty-nine point eight percent, and the agentic index sits at thirty-nine point seven on Artificial Analysis, placing it above seventy-three percent of compared models.
Reasoning and math are holding up. Where does it fall down?
CritPt is the one that stands out. That is a research-level physics reasoning benchmark, and Mercury 2 scored zero point eight percent. That is not a rounding error, that is a genuine floor. HLE, Humanity's Last Exam, came in at fifteen point five percent, and the AA-Omniscience Accuracy figure is twenty point five percent, which suggests real limitations in knowledge breadth. GDPval-AA, which tries to measure performance on economically valuable tasks, is twenty-three percent. So the pattern is: strong on structured reasoning within a defined domain, noticeably weaker on broad knowledge retrieval and frontier research-level problems.
The tool use error rates?
The observed tool call error rate is four point eight nine percent, and structured output errors are coming in at two point five seven percent. For a model being positioned heavily at agentic and tool-use workloads, those are numbers you would want to pressure-test before committing to a production pipeline. Not disqualifying, but not something to wave past either.
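Given those rates, a minimal validate-and-retry wrapper is the obvious mitigation before trusting the model in a pipeline. This is a generic sketch, not anything Inception ships; the validator and key names are illustrative.

```python
import json

def call_with_validation(call, validate, max_attempts=3):
    # At a ~2.6% structured output error rate, three attempts drop the
    # failure probability to roughly (0.026)^3, assuming independence.
    last_err = None
    for _ in range(max_attempts):
        raw = call()
        try:
            return validate(raw)
        except (ValueError, json.JSONDecodeError) as err:
            last_err = err
    raise RuntimeError(f"still invalid after {max_attempts} attempts: {last_err}")

def expect_keys(*keys):
    # Build a validator that parses JSON and checks required keys.
    def validate(raw):
        obj = json.loads(raw)
        missing = [k for k in keys if k not in obj]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        return obj
    return validate
```

With a five percent tool call error rate, anything long-running needs this kind of belt-and-braces handling regardless of which model sits behind it; the question is only how often the retry path fires.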
Let us talk about where you would actually reach for this. Given everything we have just covered, the speed profile, the benchmark pattern, the error rates, what does the use case map look like?
The clearest fit is anywhere latency compounds. Coding workflows are the obvious one. If you are running a loop where a developer is waiting on model output before they can take the next action, shaving that latency has a multiplier effect on the whole experience. Mercury 2's first-token latency of under three hundred milliseconds is useful there, even if the throughput figure is closer to the observed one hundred and forty-five tokens per second than the lab's headline number.
The page itself calls out real-time voice and search interfaces.
Yes, and that makes architectural sense. Diffusion-based parallel generation means you are not waiting for a sequential chain to resolve before you get output. For a voice assistant or a search-augmented retrieval pipeline where you need to surface something fast and the query is reasonably bounded, that latency profile is a real advantage. The question is always whether the quality is sufficient for the task, and for search and retrieval augmented generation, where the model is largely synthesising retrieved content rather than drawing on deep world knowledge, the knowledge breadth limitations we flagged are less of a problem.
What about the agentic use case? The top apps by token volume are interesting here.
OpenClaw, which is described as an AI agent for messaging, file, and email automation, is the top consumer by volume at roughly one point four five billion tokens this month. Agent Zero, which is positioned as autonomous AI agents, is third at around six hundred and thirty-one million tokens. Hermes Agent, which apparently has memory and over forty tools, is fourth. So the actual usage pattern is heavily agentic, which tracks with the model's native tool use support and the low per-token cost. In a long-running agent loop, cost and latency both accumulate, and Mercury 2's pricing at twenty-five cents per million input tokens and seventy-five cents per million output tokens makes it economical to run at volume.
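The economics of those agent loops are worth making concrete. A back-of-envelope calculator at the listed rates, with made-up but plausible per-step token counts, shows why the per-token price matters more in loops than in single calls.

```python
def loop_cost_usd(steps, in_tokens, out_tokens,
                  in_price=0.25, out_price=0.75):
    # Cost of an agent loop at Mercury 2's listed OpenRouter rates
    # ($ per million tokens). Ignores caching, which at the observed
    # 48% hit rate would lower the effective input price substantially.
    per_step = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return steps * per_step

# Hypothetical 50-step loop, 4k input / 1k output tokens per step.
print(f"${loop_cost_usd(50, 4000, 1000):.4f}")  # $0.0875
```

Under nine cents for a fifty-step loop is the kind of number that explains why the top token consumers are agent frameworks.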
We should note there are two other top apps, ZimmWriter and Wire Pyramid Engine, that we cannot characterise because the page does not describe what they do.
They are significant by token volume, but we are not going to speculate about their use cases.
What are the hard limits, the things it simply does not do?
No vision input, no audio, no embeddings, no reranking. If your application needs any of those, this is not your model. It is a text-in, text-out reasoning system, and the page gives no indication that is changing.
What is the broader reception looking like? Engineers, press, anyone who has actually put it through its paces.
Broadly positive, with some useful nuance. The clearest signal is from a detailed review on Awesome Agents, which rated it seven point four out of ten. The headline finding there was speed described as "uncanny," and they put some specific numbers on it: ten times faster than Claude 4.5 Haiku, fourteen times faster than GPT-5 Mini in their testing. They also flagged the cost comparison, calling it two and a half to six and a half times cheaper than peers at those speed tiers.
That tracks with what we have been saying about the pricing, though I want to be careful about the speed figures because we have already flagged the gap between the lab's headline number and what OpenRouter is actually observing.
Right, and the review numbers sit somewhere in between. Artificial Analysis has benchmarked throughput in the range of roughly six hundred and sixty to nearly twelve hundred tokens per second depending on conditions, which is meaningfully higher than the one hundred and thirty-eight to one hundred and forty-five tokens per second we see on the OpenRouter provider card. The honest read is that real-world throughput is hardware and load dependent, and none of these figures are the same number. What they agree on is that it is fast relative to autoregressive alternatives.
Is there a consensus view on where the quality sits?
Yes, and it is consistent across sources. Artificial Analysis places it at thirty-three on their Intelligence Index, which they note is well above the average of around nineteen to twenty for models in its price tier, and above seventy-two percent of compared models. The framing from multiple reviewers is mid-tier intelligence, competitive with the Haiku and Mini class, not competing with frontier models. Nobody is claiming otherwise, and to be fair, Inception is not claiming otherwise either.
Any criticisms worth noting?
The most substantive one is the verbosity flag. The Artificial Analysis data shows Mercury 2 generating around sixty-nine million tokens on the Intelligence Index, against a median of around twenty-six million for compared models. That is a lot of output to get to an answer, which has cost and latency implications depending on how you are using it. It is not a disqualifying issue, but if you are running high-volume inference, you want to account for it.
No major controversies, no red flags from the engineering community?
Nothing surfaced in the coverage we reviewed. The reception is positive without being credulous. The consistent note is that the speed advantage is real, the quality is appropriate for the price tier, and the knowledge breadth limitations we discussed are acknowledged rather than contested. For a model that has been out since March of this year, that is a reasonably clean early record.
Alright, let's land this. Herman, if someone is listening to this and they are trying to decide whether Mercury 2 belongs in their stack, what is the short version?
The short version is: if latency is a first-class constraint in your system, Mercury 2 is worth serious consideration. Real-time voice assistants, agentic loops where you are chaining multiple calls and the delays compound, high-throughput coding workflows where you are waiting on the model constantly. Those are the cases where the speed advantage translates directly into a better product or meaningfully lower infrastructure cost.
The structured output and tool use support makes it a practical choice for those agent pipelines, not just a theoretical one.
The tool call error rate of just under five percent and the structured output error rate of around two and a half percent are not perfect, but they are workable numbers for production agentic use, and the native support means you are not bolting something on. The real-world adoption data points the same direction. The top token consumers on OpenRouter right now are agent frameworks and automation tools. The market is already voting with its usage.
When do you not reach for it?
If your task requires deep research-level reasoning, the benchmarks are honest about the ceiling. The CritPt score of under one percent, the HLE score of fifteen and a half percent, the Omniscience accuracy of twenty and a half percent. Those are not numbers you paper over. If you are building something where the model needs broad, reliable factual knowledge or graduate-level scientific reasoning across domains, Mercury 2 is not the right tool. You want a frontier model and you are going to pay for one.
No vision, no audio, no open weights, licence terms we do not know. If any of those are requirements, the answer is also no.
Exactly the right framing. It is a focused tool. It does a specific set of things very well and it is honest, or at least the evidence is honest, about what it does not do.
For AI professionals building latency-sensitive text-based systems, it is a legitimate option at a competitive price point. For everything else, the gaps are real and the alternatives exist. That is Mercury 2 from Inception Labs. Thanks for listening to My Weird Prompts.