Daniel sent us this one — and it's gloriously specific. He wants an opinionated architectural shootout of the major LLM evaluation harnesses. We're talking Inspect AI from the UK AI Safety Institute, Promptfoo, DeepEval, and Braintrust. For each one, he wants the core abstraction and design philosophy laid out — Inspect's solver-scorer pattern, Promptfoo's matrix-style YAML configs, DeepEval's pytest-style assertions, Braintrust's hosted experiment-tracking and dataset-versioning model. Then the critical part: where does each one break down? Multi-turn conversations, tool-using agents, async execution at scale, dataset versioning, CI integration — those are the stress tests. And he explicitly says no equal-time hedging. Pick winners for specific use cases. Research lab doing safety evals, startup running regression tests in CI, enterprise team wanting hosted dashboards, solo engineer prototyping a prompt.
Oh, this is going to be fun. I've been deep in evaluation land for months now and most of the discourse is either "here's my Medium tutorial on running one eval" or complete vendor hype. Nobody actually compares architecture. By the way, fun fact — DeepSeek V four Pro is writing our script today, which feels appropriate given we're about to evaluate evaluation frameworks.
Alright, let's start with the one that has the most institutional weight behind it — Inspect AI from the UK AI Safety Institute. My read is this is the framework built by people who think evaluation is a public good, not a product.
And the architecture reflects it. Inspect's core abstraction is the solver-scorer pattern. A solver is anything that takes a task and produces an output — could be a simple model call, could be a chain of model calls, could be a whole agent loop with tool use. A scorer evaluates that output against some criteria. The key design decision is that solvers are composable. You can nest them, chain them, have one solver delegate to another. It's explicitly designed for the kind of multi-step agentic evaluations that safety researchers need.
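If you sketched that solver-scorer pattern in plain Python — and to be clear, this is an illustration of the idea, not Inspect's actual API, whose details vary by version — it would look something like this:

```python
# Minimal sketch of a solver-scorer pattern, in the spirit of Inspect AI.
# Illustrative pure Python; not Inspect's real API.
from typing import Callable

Solver = Callable[[str], str]  # takes a task input, produces an output


def chain(*solvers: Solver) -> Solver:
    """Compose solvers: the output of one becomes the input of the next."""
    def composed(task_input: str) -> str:
        for solver in solvers:
            task_input = solver(task_input)
        return task_input
    return composed


# Two toy solvers standing in for model calls
def draft(q: str) -> str:
    return f"draft answer to: {q}"


def refine(d: str) -> str:
    return d.replace("draft", "refined")


def exact_match_scorer(output: str, target: str) -> bool:
    """A scorer evaluates a solver's output against some criterion."""
    return output == target


pipeline = chain(draft, refine)
output = pipeline("what is 2+2?")
print(output)  # "refined answer to: what is 2+2?"
print(exact_match_scorer(output, "refined answer to: what is 2+2?"))  # True
```

The point of the composition is that `pipeline` is itself a solver, so it can be nested inside another solver — that is the property that lets you model multi-step agent behavior.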
It's like Lego blocks for eval pipelines. But composability usually comes with a complexity tax. Where's the friction?
The friction is that Inspect is fundamentally a Python library, not a service. There's no hosted dashboard, no built-in experiment tracking database, no team collaboration features. You run it locally or on your own infrastructure, and you're responsible for storing results, comparing runs, all of that. For a research lab with dedicated engineering support, that's fine — they'll build their own analytics on top. For a startup that just wants to know if their prompt broke, it's overkill with missing pieces.
The async execution story? Daniel specifically called that out as a stress test.
Inspect actually handles this fairly well. It's built on Python's asyncio, and it has native support for parallel evaluation with configurable concurrency. You can run hundreds of samples in parallel, and the solver architecture doesn't block on individual calls. Where it gets tricky is when your solver itself is a long-running agent loop — Inspect's parallelism is at the sample level, not within a single agent's execution. So if you're evaluating an agent that makes fifty tool calls over ten minutes, Inspect will happily run a hundred of those in parallel, but each individual agent runs sequentially. That's usually the right tradeoff, but it's worth knowing.
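The sample-level concurrency model is easy to picture with a stdlib asyncio sketch — again, this is an analogy for what the framework does internally, not its API:

```python
# Sketch of sample-level parallelism with bounded concurrency.
# Each sample runs concurrently; steps *within* a sample stay sequential.
import asyncio


async def run_sample(sample: str, sem: asyncio.Semaphore) -> str:
    async with sem:              # cap the number of in-flight samples
        await asyncio.sleep(0)   # stands in for a model or tool call
        return f"result:{sample}"


async def run_eval(samples: list[str], max_concurrency: int = 8) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    # gather preserves input order even though execution interleaves
    return await asyncio.gather(*(run_sample(s, sem) for s in samples))


results = asyncio.run(run_eval([f"s{i}" for i in range(20)]))
print(len(results))  # 20
```

The semaphore is the knob that matters in practice: it keeps hundreds of parallel samples from blowing through provider rate limits.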
What about dataset versioning? That seems like the kind of thing a safety institute would care deeply about.
Surprisingly, Inspect doesn't have a strong opinion on dataset versioning. It reads from JSON or CSV files, and it has some basic dataset loading utilities, but there's no built-in versioning, no lineage tracking, no diffing between dataset versions. The assumption is you'll manage that externally with Git or your own infrastructure. For the UK AISI, that's probably fine — they have institutional processes around data management. For the rest of us, it's a gap.
Alright, let's move to Promptfoo. This is the one I see indie developers and startups gravitating toward. The elevator pitch seems to be "eval as config."

Yes, and the core abstraction is brilliant in its simplicity. Promptfoo uses YAML configuration files where you define your prompts, your providers — meaning which models to test — and your assertions. The assertions are things like "contains this string," "doesn't contain that string," "passes this regex," "has JSON structure," "scored above threshold by this grader." Then Promptfoo runs every prompt against every provider and evaluates every assertion. It's a matrix — prompts times providers times assertions — and the output is a visual table showing what passed and what failed.
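A Promptfoo config has roughly this shape — the field names follow Promptfoo's documented format, but treat this as a sketch and check the current docs, since assertion types and provider identifiers change between versions:

```yaml
# Sketch of a Promptfoo-style config (field names approximate the
# documented format; verify against current Promptfoo docs).
prompts:
  - "Summarize in one sentence: {{article}}"
  - "TL;DR: {{article}}"
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
tests:
  - vars:
      article: "LLM eval frameworks compared..."
    assert:
      - type: contains
        value: "eval"
      - type: not-contains
        value: "as an AI language model"
```

Run it and you get the full matrix: two prompts times two providers, each checked against both assertions.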
That matrix model is interesting because it forces you to think in terms of combinatorial coverage. Instead of testing one prompt against one model, you're testing every combination. That catches drift that pairwise testing would miss.
And it integrates beautifully with CI. Promptfoo has native GitHub Actions support, it outputs results in formats that CI systems understand, and it can fail builds when assertions don't pass. For a startup that wants regression testing on every pull request, Promptfoo is probably the fastest path from zero to functioning eval pipeline.
Now for the stress tests: multi-turn conversations and tool-using agents. Promptfoo's model is fundamentally request-response. You send a prompt, you get a response, you evaluate it. There's no native concept of a conversation with state, no way to simulate a user-agent interaction over multiple turns. You can hack around it by chaining outputs into inputs manually, but the framework isn't designed for it. If you're building a chatbot or an agent, Promptfoo will test individual turns but not the coherence of a whole interaction.
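The manual chaining workaround looks like this in pure Python — `fake_model` is a stand-in for a real API call, and the sketch makes the limitation visible: each turn gets evaluated in isolation, and nothing checks cross-turn coherence.

```python
# Sketch of the manual workaround for multi-turn testing in a
# request-response harness: feed each reply back in as context.
def fake_model(messages: list[dict]) -> str:
    """Stand-in for a real chat-completion call."""
    last = messages[-1]["content"]
    return f"reply to: {last}"


def run_conversation(user_turns: list[str]) -> list[dict]:
    messages: list[dict] = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = fake_model(messages)  # one request-response eval per turn
        messages.append({"role": "assistant", "content": reply})
    return messages


transcript = run_conversation(["Hi", "What did I just say?"])
# Each turn could be asserted on individually, but no assertion here
# sees the conversation as a whole.
print(len(transcript))  # 4
```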
That's a significant limitation. Most of the interesting LLM applications right now are conversational or agentic. If Promptfoo can't handle state, it's testing components in isolation, not the system as users experience it.
The other thing is that Promptfoo's assertion model is relatively shallow. You can check for string presence, regex matches, JSON validity — but those are surface-level checks. For anything deeper, you need to bring your own grader, which Promptfoo supports, but then you're back to writing Python and the elegant YAML abstraction starts to leak.
Let's talk about the hosted side. Promptfoo has a cloud offering now, right?
They do, and it's where they're clearly investing. Promptfoo Cloud gives you persistent experiment storage, team dashboards, shareable results, and some collaboration features. It's not as mature as Braintrust's offering, but it's evolving fast. The pricing model is per-evaluation — you pay based on how many evaluations you run. For a team that's already using the open-source version and wants to graduate to something hosted, it's a natural progression.
Next up: DeepEval. This is the one that frames itself in terms of testing — pytest-style assertions, unit test metaphors. My immediate reaction is that this either clicks perfectly or feels like a category error, depending on who you are.
That's the central tension with DeepEval. The core abstraction is that you write evaluation tests that look exactly like unit tests. You import DeepEval's assertion functions, you write test functions with descriptive names, you run them with pytest. The assertions are things like "this response should be factually correct," "this response should not contain hallucinations," "this response should be relevant to the query." DeepEval provides a bunch of pre-built metrics — faithfulness, answer relevancy, contextual recall, hallucination detection — and each one is an assertion you can drop into a test.
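The pattern is easy to show with a stdlib sketch — note that this mimics the *shape* of a DeepEval test, with a trivial word-overlap metric standing in for DeepEval's actual LLM-backed metrics:

```python
# Sketch of the pytest-style pattern DeepEval popularized: a metric
# scores an LLM output, and an assertion enforces a threshold.
# The metric here is a toy stand-in, not DeepEval's implementation.
def relevancy_metric(query: str, response: str) -> float:
    """Toy relevancy: fraction of query words echoed in the response."""
    q_words = set(query.lower().split())
    r_words = set(response.lower().split())
    return len(q_words & r_words) / len(q_words) if q_words else 0.0


def test_response_is_relevant():
    query = "capital of France"
    response = "The capital of France is Paris."
    score = relevancy_metric(query, response)
    assert score >= 0.8, f"relevancy {score:.2f} below threshold"


test_response_is_relevant()  # pytest would collect and run this for you
print("passed")
```

In the real framework, that toy metric would be something like a faithfulness or hallucination check driven by a second LLM call, but the test-function-plus-threshold-assertion structure is the same.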
For a software engineer who already lives in pytest, that's incredibly natural. Your LLM evals live alongside your unit tests, your integration tests, your regression tests. Same runner, same CI pipeline, same mental model.
And DeepEval has put serious work into their metrics. Their hallucination detection, for example, uses a separate LLM call to compare the generated response against the provided context and flags contradictions. It's not just string matching — there's actual semantic evaluation happening. They've also got a synthesizer that can generate test cases from your documentation, which is clever for bootstrapping an eval suite.
The unit test metaphor has to break somewhere. LLM outputs are probabilistic. A unit test passes or fails deterministically. How do you square that?
That's where it gets interesting and where DeepEval's design decisions become opinionated. They handle non-determinism by setting thresholds. An assertion like "faithfulness should be above zero point eight" will pass if the metric score exceeds that threshold and fail otherwise. But choosing those thresholds is a dark art. Set them too low and you miss regressions. Set them too high and your CI is constantly red from noise. DeepEval gives you the tools but doesn't solve the calibration problem.
What about the stress tests Daniel mentioned? Multi-turn, agents, async?
Multi-turn support exists but it's awkward. DeepEval has a conversational test type where you define a sequence of user messages and expected assistant responses, but the evaluation is still turn-by-turn — there's no holistic conversation quality metric. For tool-using agents, you're essentially on your own. DeepEval can evaluate the final output of an agent run, but it doesn't have native abstractions for tool calls, intermediate reasoning steps, or agent trajectories. Async execution is fine because pytest handles parallelism, but it's not purpose-built for eval at scale.
And dataset versioning? Nonexistent in the open-source version. DeepEval's commercial platform — Confident AI — adds experiment tracking and some versioning capabilities, but it's not a first-class concept in the framework itself. You're expected to version your test files in Git and call it a day.
Which brings us to Braintrust. This is the one that seems to come from the opposite direction — not "eval as code" but "eval as a managed service with an API."
Braintrust's core abstraction is the experiment. You define a task — a function that takes an input and produces an output, which could be a model call, a chain, an agent, whatever. You define a dataset — a collection of inputs with optional expected outputs. You run an experiment, which executes your task against every row in the dataset and scores the results. Braintrust stores all of this — the experiment configuration, the dataset version, every individual result, every score, every latency measurement — in their hosted platform.
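The experiment loop itself is simple enough to sketch in pure Python — this is the shape of the abstraction, not the Braintrust SDK, and the real product's value is that every one of these records gets pushed to and versioned in their hosted platform:

```python
# Sketch of the experiment abstraction: a task function, a dataset,
# a scorer, and one stored record per row. Illustrative only.
def task(input_text: str) -> str:
    return input_text.upper()  # stand-in for a model call or agent run


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "WORLD!"},  # deliberately mismatched
]


def run_experiment(dataset, task, scorer):
    records = []
    for row in dataset:
        output = task(row["input"])
        records.append({**row, "output": output,
                        "score": scorer(output, row["expected"])})
    return records


records = run_experiment(dataset, task, exact_match)
mean = sum(r["score"] for r in records) / len(records)
print(mean)  # 0.5
```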
The dataset versioning is built in from day one. That's the differentiator.
It's the killer feature for teams. Every experiment is linked to a specific dataset version, and Braintrust tracks changes to datasets over time. You can diff datasets, see what examples were added or changed, and understand exactly why scores shifted between experiments. For an enterprise team running dozens of experiments a week, that lineage tracking is the difference between "the numbers went down and we don't know why" and "scores dropped three percent because we added harder examples to the dataset."
The dashboard story?
Best in class among these four. Braintrust provides rich experiment comparison views, score distributions, latency histograms, per-example drill-downs, and the ability to annotate and review individual results. It's designed for teams where multiple people need to look at eval results — product managers, domain experts, not just engineers. The collaboration features are real: you can leave comments on specific examples, flag issues, and track resolution.
Where does Braintrust break down?
First, the local development experience. Braintrust is fundamentally a cloud service with a Python SDK. You can run evaluations locally, but the results are pushed to their platform. If you're an individual developer who wants a fast, offline eval loop, Braintrust feels heavy. Second, the abstraction model is deliberately thin. Braintrust doesn't prescribe how you structure your prompts or how you score outputs — it just gives you a framework for running experiments and storing results. That's powerful for experienced teams who know what they want to measure, but it's less guided than DeepEval's pre-built metrics or Promptfoo's declarative assertions.
There's also the pricing model to consider. Braintrust has a free tier, but serious usage costs money. For a startup watching burn rate, that matters.
And there's data residency. For a research lab that needs to run thousands of evaluations on sensitive data that can't leave their infrastructure, a cloud-first platform is a non-starter. Inspect wins there by default because it's fully self-hosted and open source.
Alright, let's do the thing Daniel actually asked for. Research lab doing safety evals.
Inspect, easily. It's purpose-built for this use case — the UK AISI literally created it for their own safety evaluations. The solver-scorer architecture maps cleanly onto the kind of multi-step, agentic evaluations that safety research requires. It's open source, self-hosted, no vendor dependency, no data leaving your infrastructure. The lack of a hosted dashboard is a feature for labs that need full control over their data and results. The composable solver pattern means you can model complex agent behaviors in a way that the other frameworks don't support natively.
I'd add that Inspect has the institutional credibility that matters in safety research. If you publish a paper saying "we evaluated this model for dangerous capabilities using Inspect," other researchers know exactly what that means and can reproduce your work. That reproducibility is central to the safety research workflow in a way it isn't for commercial teams.
Next: startup running regression tests in CI.
Promptfoo, no contest. The matrix-style YAML config is the fastest path to a working CI eval pipeline I've seen. You define your prompts, your models, your assertions, and you get a pass-fail result that integrates with GitHub Actions. For a startup where speed matters and the engineers are already stretched thin, the declarative approach means less code to write, less code to maintain, fewer bugs in the eval infrastructure itself.
The combinatorial coverage is genuinely useful for catching regressions. If you change a system prompt and Promptfoo runs it against four models with twenty assertions each, you catch breakage that a manual spot-check would miss. The limitation is that this only works for single-turn evaluations. If the startup is building a chatbot or an agent, Promptfoo alone won't cut it.
Right — the asterisk on that recommendation is "as long as your product is request-response." Which, honestly, still covers a lot of startups. Enterprise team wanting hosted dashboards.
Braintrust, clearly. The dataset versioning alone is worth it for a team where multiple stakeholders need to understand eval results. When a product manager asks "why did the hallucination score drop this week," you can show them exactly which examples changed in the dataset, which model configuration shifted, what the per-example scores look like. That traceability is what turns evals from a developer tool into an organizational capability.
The collaboration features are the multiplier. Engineers write the eval logic, but domain experts review the results, annotate edge cases, flag quality issues. Without a shared platform, that feedback loop is email threads and screenshots. Braintrust makes it a structured workflow.
For enterprises with compliance requirements, having a complete audit trail of every evaluation — what model was tested, against what data, with what results, who reviewed it — is increasingly non-negotiable. Braintrust provides that out of the box.
Last one: solo engineer prototyping a prompt.
This is the hardest call because it depends on what kind of engineer. If you're a Python developer who thinks in tests, DeepEval. The pytest integration means zero new tooling to learn — you write evals the same way you write tests, they run in the same CI pipeline, the mental model is identical. The pre-built metrics for faithfulness, relevancy, and hallucination give you a fast start without having to design your own scoring logic.
I'd actually argue for Promptfoo here, at least for the initial exploration phase. When you're prototyping a prompt, you're iterating fast — tweak the wording, run against a few examples, tweak again. Promptfoo's YAML config makes that iteration loop extremely tight. Change a line, rerun, see the results table. You don't need to write any code until you've converged on a prompt structure you like.
That's fair. The counterpoint is that Promptfoo's assertion model is shallow — for nuanced quality evaluation during prototyping, you often want something closer to DeepEval's semantic metrics. But for the first hour of exploration, the speed of the YAML loop probably wins.
The real answer might be Promptfoo for exploration, then DeepEval once you have a prompt worth testing rigorously. But if I have to pick one for the solo engineer who just wants to get something working, I'd say Promptfoo for the zero-code start.
Let's zoom out to something that struck me while researching all four. Every framework makes a bet about who the primary user is, and that bet shapes everything downstream. Inspect bets on research scientists who need composability and reproducibility. Promptfoo bets on developers who want configuration over code. DeepEval bets on software engineers who already think in test suites. Braintrust bets on teams who need collaboration and lineage.
None of them serve all four audiences well. That's not a failure — it's a reflection that LLM evaluation is multi-stakeholder. The safety researcher, the startup CTO, the enterprise ML platform team, and the indie developer prototyping on a Saturday all have different constraints.
The other thing that jumps out is how immature the multi-turn and agent evaluation story is across the board. Inspect has the best architecture for it, but even there, evaluating an agent trajectory — not just the final output but the quality of the intermediate reasoning, the appropriateness of tool selections, the efficiency of the path — is still an open research problem. None of these frameworks has a satisfying answer.
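To make "evaluating a trajectory" concrete: the idea is to record every intermediate step and score properties of the whole path, not just the final answer. This is a toy sketch — the trajectory data, the allowed-tool policy, and the step budget are all invented for illustration, and nobody ships this as a solved feature:

```python
# Toy sketch of trajectory-level scoring: judge tool appropriateness,
# path efficiency, and completion — not just the final output.
trajectory = [
    {"step": "tool_call", "tool": "search", "args": "weather paris"},
    {"step": "tool_call", "tool": "search", "args": "weather paris today"},
    {"step": "tool_call", "tool": "calculator", "args": "2+2"},  # off-task
    {"step": "final", "output": "It is sunny in Paris."},
]


def score_trajectory(traj, allowed_tools, max_steps):
    tool_steps = [t for t in traj if t["step"] == "tool_call"]
    appropriate = all(t["tool"] in allowed_tools for t in tool_steps)
    efficient = len(tool_steps) <= max_steps
    finished = traj[-1]["step"] == "final"
    return {"appropriate": appropriate, "efficient": efficient,
            "finished": finished}


print(score_trajectory(trajectory, allowed_tools={"search"}, max_steps=3))
# {'appropriate': False, 'efficient': True, 'finished': True}
```

The hard, unsolved part is that real trajectory quality — was this the *right* search query, was the reasoning sound — needs semantic judgment, not the structural checks sketched here.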
That's where I think the next wave of innovation will come from. As more applications move from single-turn prompting to agentic workflows, the eval frameworks that can't handle state and trajectories will become increasingly irrelevant. Promptfoo's request-response model works great until your product isn't request-response anymore.
The dataset versioning gap is more concerning than it looks. Without proper versioning, you can't do apples-to-apples comparisons over time. You change your eval dataset — add harder examples, fix mislabeled ones, expand coverage — and your scores shift. Without versioning, you don't know if the shift is because your model got worse or your test got harder. Braintrust nails this. The others basically leave it to you.
That's a good segue to practical takeaways. If you're evaluating these frameworks for your own use, what should you actually do?
First, be honest about your primary use case. Are you doing safety research or regression testing? Those lead to different frameworks. Don't pick a framework because it has features you might need someday — pick the one that solves your actual problem today.
Second, test the multi-turn and agent story before committing. Build a minimal version of your actual eval — not a toy example, something representative — and see how the framework handles state, tool calls, and conversation flow. The documentation might say it supports multi-turn, but the ergonomics matter enormously.
Third, don't underestimate the dataset versioning problem. If you're going to run evals regularly — and you should — you need to know whether score changes come from model changes or data changes. If your framework doesn't handle this, budget time to build versioning yourself.
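If you do have to build it yourself, the minimal version is surprisingly small: content-hash every example, derive a dataset version from the example hashes, and diff two versions by set difference. A stdlib sketch of that approach — one way to do it, not the only way:

```python
# Sketch of DIY dataset versioning: content-hash each example and the
# whole dataset, then diff two versions by example hash.
import hashlib
import json


def example_id(example: dict) -> str:
    blob = json.dumps(example, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]


def dataset_version(dataset: list[dict]) -> str:
    joined = "".join(sorted(example_id(e) for e in dataset)).encode()
    return hashlib.sha256(joined).hexdigest()[:12]


def diff(old: list[dict], new: list[dict]) -> dict:
    old_ids = {example_id(e) for e in old}
    new_ids = {example_id(e) for e in new}
    return {"added": len(new_ids - old_ids), "removed": len(old_ids - new_ids)}


v1 = [{"q": "2+2", "a": "4"}]
v2 = v1 + [{"q": "17*23", "a": "391"}]  # a harder example added
print(dataset_version(v1) != dataset_version(v2))  # True
print(diff(v1, v2))  # {'added': 1, 'removed': 0}
```

Store the version string alongside every eval run and you can always answer "did the model change or did the test change."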
Fourth, think about who needs to see the results. If it's just you, a local CLI output is fine. If it's your team lead, you need dashboards. If it's a compliance auditor, you need exportable reports with full lineage. The audience for eval results shapes the infrastructure you need around the framework.
A point that connects back to something Daniel's been interested in — the open source versus hosted tradeoff isn't just about cost. It's about control. Inspect is fully open source and self-hosted, which means your eval data never leaves your infrastructure. That matters for security-sensitive applications, for research on unpublished models, for any context where you can't send data to a third party. Braintrust is the opposite — the value is in their platform, and you're committing to their infrastructure.
Promptfoo and DeepEval sit in the middle. Both have open-source cores with commercial hosted layers. You can start self-hosted and migrate to the cloud when you need collaboration features. That hybrid model is increasingly common and, honestly, probably where the industry is heading.
One more thing that I think gets overlooked: the quality of the default metrics matters enormously for adoption. DeepEval's pre-built hallucination detection and faithfulness metrics mean you can get a reasonable eval running in an afternoon. With Braintrust, you need to bring your own scoring logic, which is more flexible but has a higher startup cost. With Inspect, you're building scorers from scratch, which gives you maximum control but requires the most expertise.
That's the classic flexibility-versus-time-to-value tradeoff. Inspect maximizes flexibility, Promptfoo and DeepEval optimize for time-to-value in different ways, Braintrust optimizes for team scalability. Pick your poison.
I want to flag one thing we haven't mentioned: the community and ecosystem around each framework. Promptfoo has a surprisingly active open-source community — lots of example configs, shared assertions, blog posts. DeepEval has the pytest ecosystem to lean on. Inspect has the credibility of the UK AISI but a smaller community. Braintrust has enterprise customers but less public community activity. For a solo engineer, community matters — it's where you find answers when the docs are insufficient.
The docs are always insufficient somewhere. Every framework has dark corners where the documented behavior and the actual behavior diverge. A healthy community means someone else has already found and documented those corners.
Alright, let's land this. If someone put a gun to my head and said pick one framework for the typical listener of this show — someone technically sophisticated, probably in a small team or solo, building LLM-powered applications, wants evals in CI — I'd say Promptfoo for the speed of setup and the CI integration. But with the strong caveat that if your application involves multi-turn conversations or agents, you need to supplement it with something else or accept that you're only testing components, not the full system.
I'd go DeepEval for the Python-native developer, with the same caveat. The pytest integration is just too natural if you already live in that world, and the pre-built metrics save you from reinventing wheels. But if I were running a research lab, Inspect. And if I were leading an ML team at a company with more than twenty engineers, Braintrust. The dataset versioning and collaboration features pay for themselves at that scale.
The fact that we can't give a single answer is actually the most useful signal. The market hasn't consolidated. The abstractions haven't standardized. We're still in the phase where the right choice depends heavily on your specific constraints. That'll change over the next few years — one or two frameworks will pull ahead, the others will specialize or fade. But right now, you have to do the work of matching framework to use case.
Daniel, since you asked for actual opinions and not equal-time hedging: if you're evaluating these for your own work — AI and automation, technical communications, open source development — I suspect Promptfoo or DeepEval is your sweet spot, depending on whether you prefer YAML or Python. Inspect is probably overkill for your use cases, and Braintrust is probably over-budget for solo or small-team work. But you know your constraints better than we do.
One forward-looking thought: the framework that figures out agent evaluation — real, native support for multi-turn trajectories with tool use, not bolted-on workarounds — will have a structural advantage for the next five years. Agents are where the field is going, and eval infrastructure that can't handle agents is infrastructure with an expiration date. None of the four has fully solved this yet, which means the window is still open.
Thanks to our producer Hilbert Flumingtop for keeping this show running. Modal sponsors our infrastructure — serverless GPUs keep the lights on. This has been My Weird Prompts. You can find every episode at myweirdprompts.
See you next time.