#2349: AI Model Spotlight: Trinity Large Thinking

Discover how Arcee AI’s Trinity Large Thinking delivers cutting-edge reasoning at a fraction of the cost, all from a team of just 30.

Episode Details
Episode ID
MWP-2507
Published
Duration
20:39
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Claude Sonnet 4.6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Arcee AI’s Trinity Large Thinking is quickly gaining attention in the AI community, and for good reason. Built by a small team of just 30 people, this reasoning-optimized model demonstrates how efficiency and innovation can compete with larger, more resource-heavy labs. At its core, Trinity Large Thinking is a sparse Mixture of Experts (MoE) model, boasting a total parameter count of 400 billion. However, its activated parameter count per token is only around 13 billion, making it far more efficient to run than traditional dense models. This architectural choice is key to Arcee’s ability to deliver high performance at a fraction of the cost.
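To make the efficiency argument concrete, here is a back-of-envelope sketch (our own illustration, not Arcee's published math) of why activated parameters, rather than total parameters, drive per-token compute:

```python
# Rough rule of thumb: per-token FLOPs for a decoder forward pass scale with
# roughly 2 * (parameters touched). In a sparse MoE, only the activated
# experts are touched per token, so compute tracks the 13B active count
# while memory still has to hold all 400B weights.
total_params = 400e9    # full weight footprint -> hosting/memory cost
active_params = 13e9    # activated per token   -> compute cost

flops_dense = 2 * total_params   # a hypothetical 400B dense model
flops_moe = 2 * active_params    # a Trinity-style sparse MoE

print(f"compute ratio: {flops_dense / flops_moe:.1f}x")  # ~30.8x less per token
```

This is the sense in which a 400B MoE can cost closer to a 13B dense model to run, even though it is far more expensive to host.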

The model was trained on a corpus of 17 trillion tokens, combining curated web-scale data with synthetic data. Arcee’s technical report highlights synthetic data generation as one of the largest publicly documented efforts in pretraining, underscoring their commitment to innovation. The optimizer used, Muon, is a newer alternative to AdamW, further showcasing Arcee’s willingness to experiment with cutting-edge techniques.

Trinity Large Thinking’s reasoning capabilities set it apart. The model generates internal reasoning tokens before producing its final response, allowing users to inspect its intermediate steps via an API-exposed field called “reasoning details.” This transparency is particularly useful for multi-turn conversations, where reasoning details can be passed back to the model to maintain continuity across interactions—a critical feature for agentic workloads.
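As a sketch of what that continuity pattern might look like in client code (the `reasoning_details` key and message shape are illustrative assumptions drawn from the description above, not a confirmed API schema; verify against the provider's documentation):

```python
# Hypothetical multi-turn loop that carries reasoning forward between turns.
# The "reasoning_details" key mirrors the API-exposed field described above;
# its exact schema here is an assumption.

def append_assistant_turn(messages, reply):
    """Store the assistant's answer together with its intermediate reasoning,
    so the next request lets the model resume its thread rather than start cold."""
    turn = {"role": "assistant", "content": reply["content"]}
    if "reasoning_details" in reply:
        turn["reasoning_details"] = reply["reasoning_details"]
    messages.append(turn)
    return messages

history = [{"role": "user", "content": "Compare the two datasets."}]
reply = {
    "content": "Dataset A is larger but noisier.",
    "reasoning_details": [{"type": "reasoning.text", "text": "First, check row counts..."}],
}
history = append_assistant_turn(history, reply)
# history[-1] now carries both the answer and the reasoning for the next turn
```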

On benchmarks, Trinity Large Thinking shines in agent-relevant tasks. It ranks second on PinchBench, a benchmark focused on tool calling, multi-step reasoning, and instruction following. While independent confirmation of this ranking is still pending, the model’s performance in tasks like website generation and data visualization is competitive, though it struggles with precise outputs like SVG.

At $0.85 per million output tokens, Trinity Large Thinking is significantly cheaper than competitors like Claude Opus 4.6, making it an attractive option for high-volume reasoning workloads. Its combination of efficiency, transparency, and affordability positions it as a compelling choice for developers building agentic systems.
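For illustration, a small cost sketch at the listed rates ($0.22 per million input tokens, $0.85 per million output tokens). Because the listing does not say whether reasoning tokens are billed at the input or output rate, a cautious cost model brackets both; the token counts below are made-up examples:

```python
# Bracketed cost estimate: reasoning-token billing is not broken out on the
# listing, so compute a low bound (reasoning billed as input) and a high
# bound (reasoning billed as output). Rates are as listed as of publication.
IN_RATE = 0.22 / 1_000_000   # dollars per input token
OUT_RATE = 0.85 / 1_000_000  # dollars per output token

def cost_bounds(input_toks, reasoning_toks, output_toks):
    base = input_toks * IN_RATE + output_toks * OUT_RATE
    return (base + reasoning_toks * IN_RATE,    # low: reasoning at input rate
            base + reasoning_toks * OUT_RATE)   # high: reasoning at output rate

# Example reasoning-heavy call: 10k prompt tokens, 6.6k reasoning, 1k output
low, high = cost_bounds(10_000, 6_600, 1_000)
print(f"${low:.6f} - ${high:.6f} per call")
```

For reasoning-heavy workloads the gap between the two bounds can dominate the estimate, which is why the billing ambiguity is worth resolving before committing.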


#2349: AI Model Spotlight: Trinity Large Thinking

Corn
Welcome to My Weird Prompts. I'm Corn, my brother Herman is here as always, and today we are doing an AI Model Spotlight. The model is Trinity Large Thinking, built by Arcee AI. Herman, let's start at the top. Who is Arcee AI?
Herman
Arcee is a San Francisco-based lab, and the first thing you need to know about them is the scale. We are talking roughly thirty employees. That is not a typo.
Corn
Thirty people building frontier models.
Herman
Thirty people building frontier models. They have raised forty-nine million dollars in funding, which in the context of this industry is a relatively modest number. For comparison, some of the larger labs have raised that in a single funding round just to cover compute costs for a few months.
Corn
How are they actually shipping anything competitive?
Herman
That is the interesting part. Their own numbers, which they published, put the total cost of their current model generation at twenty million dollars all-in. Compute, salaries, data, storage, operations. And out of that they shipped four models in six months.
Corn
Four models in six months on twenty million dollars.
Herman
The framing they use internally is efficiency as a design principle, not just a constraint. TechCrunch ran a piece on them in early April twenty twenty-six with a headline that basically said the writer could not help rooting for them, which tells you something about how the press is reading this story: a small lab, genuinely open-source licensing, punching above its weight on benchmarks.
Corn
What does their lineup look like around Trinity Large Thinking specifically?
Herman
The Trinity family has at least three visible variants. You have Trinity Large Preview, which was the predecessor, Trinity Mini on the smaller end, and then Trinity Large Thinking, which is the reasoning-optimized release that we are profiling today. The preview model apparently accumulated three point three seven trillion tokens served on OpenRouter in its first two months and became the number one most-used open model in the United States by that metric.
Corn
That is a meaningful adoption signal for a thirty-person lab.
Herman
Trinity Large Thinking is the next step from there.
Corn
Alright, so what actually is Trinity Large Thinking under the hood?
Herman
Let's start with the architecture. Trinity Large is a sparse Mixture of Experts model, or MoE. The total parameter count is approximately four hundred billion, but that number is a little misleading on its own, because with a sparse MoE you do not activate all of those parameters for every token. The actual activated parameter count per token is around thirteen billion.
Corn
When people see four hundred billion and think this must be enormously expensive to run, that is not quite the right frame.
Herman
The four hundred billion is the full weight footprint, which matters for hosting and memory. But the compute per forward pass is much closer to a thirteen billion dense model. That is the core efficiency argument for the MoE architecture. You get a large model's capacity and specialization without paying the full compute cost on every single token.
Corn
The training side?
Herman
Seventeen trillion tokens. That is the training corpus size, and Arcee's technical report describes it as a large mixed corpus combining curated web-scale data with synthetic data. They specifically call out synthetic data generation as one of the larger publicly documented efforts in pretraining. The optimizer they used is Muon, which is a relatively newer optimizer that some labs have been experimenting with as an alternative to AdamW for large-scale training runs.
Corn
Is there anything known about the base model underneath? Is this a fine-tune of something like Llama or Qwen, or is it a novel architecture from Arcee?
Herman
That is a gap in what is publicly confirmed. The model card and the OpenRouter listing do not specify whether Trinity Large is built on a known base or is a fully novel architecture from Arcee. Given the scale and the training details in the technical report, it reads more like an original pretraining effort, but we cannot say that definitively. We will flag it as an open question.
Corn
Now the "Thinking" part of the name, what does that actually mean technically?
Herman
Trinity Large Thinking is the reasoning-optimized variant of the family. What that means in practice is that the model produces internal reasoning tokens before it generates its final response. You can think of it as the model working through a problem step by step before it commits to an answer. That internal chain of thought is not just happening invisibly. The API exposes it through a field called reasoning details, which is an array of the model's intermediate reasoning steps that the caller can actually inspect.
Corn
You can see the work, not just the answer.
Herman
You can see the work. And there is a further capability on top of that. In multi-turn conversations, you can pass those reasoning details back to the model in subsequent turns, so it picks up its reasoning thread from where it left off rather than starting cold. For long-running agentic tasks that is a meaningful design choice.
Corn
The context window?
Herman
Two hundred and sixty-two thousand tokens. Some sources cite five hundred and twelve thousand, so there may be a provider-level difference in what is exposed, but the OpenRouter listing shows two hundred and sixty-two thousand. Either way, you are in the range where very long documents and extended agent loops are within scope.
Corn
Let's talk about what this actually costs to run. Herman, before we get into the numbers, I know you have a flag to put down here.
Herman
Yes, and this is a series convention for good reason. All pricing we are about to cite is as of April twenty, twenty twenty-six. These numbers shift, sometimes weekly, and OpenRouter in particular routes across multiple backend providers, so what you see on the listing page may not be identical to what a specific provider is charging underneath. Always verify before you build a cost model around this.
Corn
What are we looking at?
Herman
On OpenRouter, input is twenty-two cents per million tokens. Output is eighty-five cents per million tokens. Those are the headline figures.
Corn
Because this model is generating a lot of internal thinking before it produces output. Is that billed separately?
Herman
That is one of the gaps in the public documentation. Reasoning tokens are tracked separately in the usage response, you can see them in the API payload, but the OpenRouter listing does not break out a distinct price for them. Our best read is that they are billed at either the input or output rate, but we cannot confirm that from the page alone. If you are building a cost projection for a reasoning-heavy workload, that ambiguity matters and you should verify it directly.
Corn
With those caveats on the table, and with cached input pricing as another open question, how does the output price compare to what else is in the market?
Herman
Arcee's own blog post makes the comparison directly. They put the output price at roughly ninety cents per million tokens, and they describe that as approximately ninety-six percent cheaper than Claude Opus four point six for output tokens. That is a significant spread. You are getting a model that Arcee claims sits just behind Opus four point six on PinchBench, at a fraction of the output cost.
Corn
We will get into whether that benchmark claim holds up in a moment. But the cost structure alone is worth noting for anyone running high-volume reasoning workloads.
Let us get into what this model actually does on benchmarks. What is Arcee claiming, and what does the independent evidence say?
Herman
Let us start with PinchBench, because that is the headline claim. PinchBench is a benchmark from Kilo that measures model capability on tasks relevant to agentic workloads, things like tool calling, multi-step reasoning, instruction following under pressure. Arcee says Trinity Large Thinking ranks number two on that leaderboard, sitting just behind Claude Opus four point six. That is the claim from their own blog post, and it is a meaningful claim if it holds.
Corn
Does it hold? Is there independent confirmation?
Herman
The benchmark itself is real and the ranking is cited by Arcee directly in their launch post, so this is not an invented number. What we do not have is a third-party replication sitting in front of us. The Artificial Analysis page for this model exists and their logo appears on the OpenRouter listing, but we did not capture specific scores from that page. So the PinchBench number two claim is from the lab, corroborated by the benchmark existing and being cited, but I would want to see Artificial Analysis or a comparable independent source run their own eval before I treated it as fully settled.
Corn
What about Design Arena? That is a different kind of signal.
Herman
It is, and it is worth understanding what Design Arena is measuring. It runs head-to-head tournaments, four models at a time, with human voting on outputs. The data we have covers one thousand five hundred and sixty-eight tournaments. So it is a reasonable sample, but the voting is subjective and the categories are design-oriented, which may not be where a reasoning model like this is optimised to shine.
Corn
What did it show?
Herman
Best result was website generation, where Trinity Large Thinking sits in the top forty-nine percent. That sounds middling until you remember this is a reasoning model being compared against models that may be specifically tuned for front-end output. Code categories came in at top fifty-six percent, 3D and game development both around top fifty-five to sixty-one percent. Data visualisation at top sixty percent. Those are all in the middle of the pack.
Corn
The weaker spots?
Herman
UI components at top seventy percent, and SVG at top seventy-five percent. Those are the tail. SVG in particular is a known pain point for reasoning-heavy models because it rewards spatial precision and format adherence over deliberative thinking.
Corn
The picture is: strong in agentic and multi-step reasoning tasks, competitive but not dominant in design-oriented generation, and notably weaker when the task is about precise structured output like SVG.
Herman
That is a fair summary. And the usage data from OpenRouter adds one more data point worth flagging. Across the observed period, the model generated roughly fifty-seven point eight million reasoning tokens against roughly eight point eight million completion tokens, a ratio of about six point six to one. This model thinks a lot before it speaks, which is consistent with what you would want from a number two on an agentic benchmark, but it also has implications for latency and, once we get clarity on reasoning token pricing, for cost.
Corn
Let us talk about where you would actually reach for this model. Given everything we have covered, what does the workload fit look like?
Herman
The clearest win is agentic loops. That is not just us reading between the lines of the benchmark placement. Arcee says it explicitly, and the PinchBench number two ranking is specifically measuring agent-relevant capability, things like multi-step tool calling, instruction following across turns, and coherent behaviour in long-running tasks. The reasoning continuity feature reinforces that. The ability to pass the reasoning details array back into subsequent turns means the model can carry its thinking forward across a conversation rather than starting cold each time. For anyone building an agent that needs to hold state across multiple steps, that is a useful property.
Corn
What does that look like concretely? Give me a use case.
Herman
Say you are building a research agent that needs to query multiple tools, synthesise results, and produce a structured report. The model reasons through each step, you pass that reasoning context forward, and the next turn picks up with full awareness of what was already worked out. You are not paying the re-derivation cost every time. That is the kind of workload where the six-to-one reasoning-to-completion ratio we saw in the usage data becomes an asset rather than a liability.
Corn
What about the context window? Even at the two hundred and sixty-two thousand tokens shown on the listing, that is substantial.
Herman
It is, and for cost-sensitive long-context applications it is a meaningful combination. You are getting a large context at twenty-two cents per million input tokens. If you have a workload that needs to ingest large codebases, long document sets, or extended conversation histories, and you are currently paying frontier closed-source rates for that, this is worth a serious look. The open weights availability also matters here. If you have the infrastructure to self-host, you can bring the per-token cost down further and keep the data on your own stack.
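As an aside, Herman's point pencils out simply. At the listed input rate, even a completely full context window is cheap to ingest (our arithmetic, using the figures quoted in this episode):

```python
# Cost to fill the entire context window at the listed input rate.
context_tokens = 262_000        # OpenRouter-listed window
input_rate = 0.22 / 1_000_000   # dollars per input token

print(f"${context_tokens * input_rate:.4f} per maximally full prompt")  # ≈ $0.0576
```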
Corn
Where would you tell someone to look elsewhere?
Herman
Two clear cases. First, SVG and UI component generation. The Design Arena data puts this model in the bottom quarter of the field for SVG and close to it for UI components. If your product is generating front-end assets at scale, there are models better suited to that task. Second, latency-sensitive real-time applications. A model that generates roughly six reasoning tokens for every one completion token is doing a lot of internal work before it responds. That is appropriate for deliberative tasks, but if you need sub-second responses for something like a live chat interface or a real-time coding assistant, the thinking overhead is going to work against you. The response times in the Design Arena data, some of them pushing over a hundred seconds, give you a sense of what you are dealing with at the longer end.
Corn
The profile is: deep reasoning, long context, agentic tasks, cost-conscious deployments. Not pixel-perfect design generation, not anything where latency is the primary constraint.
Herman
That is the honest read of what the evidence supports.
Corn
Let's talk about how the industry has actually received this. What is the general read out there?
Herman
The clearest signal is the TechCrunch piece from early April. The headline was essentially "I can't help rooting for Arcee," which tells you something about the framing. The coverage positions Arcee as a genuine underdog story: twenty-six to thirty employees, a twenty million dollar all-in budget covering compute, salaries, data, and storage, and they shipped four models in six months. The piece is honest that Trinity Large Thinking is not outperforming the closed-source leaders from Anthropic or OpenAI, but the angle is that Arcee is not trying to hold the open-source community hostage with restrictive licensing the way some larger labs have. Apache two-point-zero, no strings attached, and that matters to a certain segment of the professional community.
Corn
The underdog framing is real, but it can also be a way of damning with faint praise. Is there substance behind the goodwill?
Herman
There is, and the Hacker News thread on the predecessor model is where you see it. The discussion references a seventy percent win rate in head-to-head comparisons over a year and a half. To be clear, that figure applies to Trinity Large, the base model, not specifically the Thinking variant, and the context is the predecessor's trajectory rather than a direct benchmark of this release. But it does suggest the underlying model family has been earning its reputation incrementally rather than arriving with a single splashy launch.
Corn
What about the PinchBench number two ranking? Is that landing with practitioners?
Herman
It is getting attention, partly because of the cost comparison that comes with it. The Arcee blog post frames it explicitly: number two on PinchBench, just behind Opus four point six, at roughly ninety-six percent lower output cost than that model. That is the kind of comparison that gets shared in engineering Slack channels. The caveat I would add is that PinchBench is a relatively narrow benchmark focused on agent-relevant tasks, so practitioners who have been burned by benchmark overfitting are right to hold it loosely. But for the specific workload it measures, the result is credible.
Corn
Any red flags or controversies in the reception?
Herman
The main honest critique in the coverage is the one Arcee itself acknowledges: this is not a model that beats the closed-source frontier. The TechCrunch piece says it plainly. If your benchmark is GPT-4-class closed models, you are not going to find a surprise upset here. What you are finding is a capable open-weight model at a price point that makes the comparison interesting for cost-sensitive production workloads. The reception reflects that accurately, which is probably the best thing you can say about it.
Corn
Alright, let's land this. If you had to give someone a single sentence on when to reach for Trinity Large Thinking, what is it?
Herman
If you are running agentic workflows where you need to see the reasoning, not just the answer, and you cannot justify closed-source pricing at scale, this is the model you should be evaluating right now.
Corn
Unpack that a little. What does the reasoning transparency actually buy you in practice?
Herman
The reasoning details array is not just a nice-to-have. In a multi-turn agent loop, being able to pass that chain of thought back into subsequent turns means the model is not starting cold each time. For debugging, for auditing, for building systems where a human needs to spot-check why the model made a decision, that is useful. Most closed-source models either do not expose that at all or charge you separately for the privilege.
Corn
The cost picture makes that more compelling.
Herman
Twenty-two cents per million input tokens, eighty-five cents per million output tokens, number two on PinchBench behind a model that, if Arcee's ninety-six-percent-cheaper figure holds, costs roughly twenty-five times as much per output token. For a cost-sensitive production workload, that arithmetic is hard to ignore.
Corn
When do you not reach for it?
Herman
A few scenarios. If you are doing UI component generation or SVG work, the Design Arena numbers put it in the bottom quarter of the field for those categories, and you would want to test carefully before committing. If your organisation has already budgeted for closed-source frontier models and latency is the primary concern, the provider routing through OpenRouter adds a variable you may not want. And if you need the absolute ceiling on reasoning quality and cost is not a constraint, the closed-source leaders are still ahead. Arcee says so themselves.
Corn
The open-source licensing is also part of the calculus for some teams.
Herman
Apache two-point-zero, no restrictions, weights on Hugging Face. For teams that have had bad experiences with models that are open in name only, that is a real differentiator. You can self-host, you can audit, you can build on it without a licensing conversation every time the use case expands.
Corn
Bottom line: a focused, honest value proposition from a small lab that is doing serious work. Not the model for every job, but for the jobs it fits, it fits well. That is Trinity Large Thinking from Arcee AI.
Herman
It is a reminder that you do not always need the biggest model to get the job done. Sometimes you need the right model.
Corn
That is a wrap for this AI Model Spotlight. Thanks for joining us on My Weird Prompts. We will see you next time.
Herman
Bye for now.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.