Daniel sent us this one — he's running a data classification pipeline, thousands of items getting labeled programmatically, and he's wondering about batch inference APIs. What are they actually for, why the fifty percent discount, can you use them in anything conversational, and at what volume does the engineering overhead stop being worth it. Plus he wants the lay of the land across OpenAI, Anthropic, Google, and the rest.
This is exactly the right question at exactly the right moment. Batch APIs are one of those things where most people nod along and say "cheap inference, got it" and then completely miss what's actually happening under the hood.
Before we dive in — quick note, today's episode script is coming from DeepSeek V four Pro. So if anything sounds unusually coherent, that's why.
I'll take that as a compliment to our usual incoherence.
It was meant as one. Alright, let's start with the basics. What actually is a batch API? Because I think the name is slightly misleading.
A batch API is essentially an asynchronous inference pipeline. Instead of you sending a request and waiting for the response, you submit a file containing hundreds or thousands of requests, the provider queues them up, processes them whenever they have spare capacity, and then gives you a results file back, usually within twenty-four hours. The key word is asynchronous. You're not holding an open connection waiting for tokens to stream back.
It's less "batch" in the mainframe job sense and more "deferred execution." You hand over a JSONL file of prompts and walk away.
And that JSONL format has become something of a de facto standard. Each line is a complete request object with a custom ID you assign so you can match results back to inputs. OpenAI, Anthropic, Google — they all use some variation of this pattern.
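For listeners following along in code, here's a minimal sketch of building one of those input files in the OpenAI-style format. The model name, IDs, and labels are placeholders for whatever your pipeline actually uses:

```python
import json

# Build a batch input file: one complete, self-describing request per line.
items = [
    ("item-0001", "Win a free cruise today!!!"),
    ("item-0002", "Meeting moved to 3 pm."),
]

with open("batch_input.jsonl", "w") as f:
    for custom_id, text in items:
        request = {
            "custom_id": custom_id,            # your key for joining results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",     # the endpoint this request targets
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": "Label the text as SPAM or NOT_SPAM."},
                    {"role": "user", "content": text},
                ],
                "max_tokens": 5,
            },
        }
        f.write(json.dumps(request) + "\n")
```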
Alright, so the fifty percent discount. That's the headline number Daniel mentioned and it's roughly accurate — OpenAI's Batch API is fifty percent off standard pricing, Anthropic's Message Batches are fifty percent off, Google's batch prediction is in that same ballpark. Why can they afford to do that?
This is where the economics get genuinely interesting. There are three things going on simultaneously. The first and most important is off-peak utilization. GPU clusters are expensive capital assets and they sit partially idle during low-demand periods — overnight, weekends, holidays. If you're running a fifty-thousand-GPU cluster and it drops to sixty percent utilization at three in the morning, every idle GPU hour is revenue you never get back. Batch jobs let you fill those valleys.
It's basically yield management, like airlines filling empty seats.
It's exactly yield management. The second factor is that batch jobs give providers perfect scheduling flexibility. A synchronous request has a strict latency SLA — if you don't get tokens back within a few hundred milliseconds, the user experience degrades badly. Batch jobs have no such constraint. The provider can pause your job, resume it later, shift it between clusters, defragment GPU memory around it. That operational flexibility is worth real money.
The third factor?
Batching at the inference engine level. With batch processing, the inference engine can combine multiple requests into the same forward pass, sharing the model weights across them. If you can pack four or eight requests into one pass, your throughput per GPU multiplies while your compute cost stays nearly flat. That's much harder to do well when synchronous requests arrive unpredictably and every one of them needs an immediate response.
It's not just "we'll get to it when we're less busy." There's a genuine efficiency gain in how the computation itself is structured.
And this is why the discount exists at all — it's not a loss leader. The providers are still making margin on batch jobs, sometimes better margin than on synchronous traffic, because the cost to serve is lower. They're passing some of that through as the discount.
Daniel's classification job — thousands of items, each one a straightforward label-this-text instruction, no user waiting on the other end — that is the platonic ideal of a batch workload.
It's textbook. Classification, extraction, enrichment, evaluation runs, synthetic data generation, embedding backfills — these share a few properties. One, the prompts are known in advance. Two, there's no human in the loop waiting for a response. Three, the outputs are structured — you're typically asking for JSON, a label, a score. Four, you care about throughput and cost, not latency.
Let's talk about that latency tradeoff. Daniel's intuition was that conversational UIs need sub-second latency and batch endpoints can't deliver that. He's right, but I want to understand exactly why, and whether there's any edge case where you'd use batch in a user-facing product.
The latency tradeoff is structural, not incidental. A synchronous API call returns your first token in two hundred to five hundred milliseconds, and streams the rest at maybe fifty to a hundred tokens per second. With a batch API, your minimum turnaround is measured in minutes — and in practice, it's often hours. OpenAI's Batch API documentation says they aim to complete jobs within twenty-four hours, and in practice most finish much faster, but there's no SLA guaranteeing sub-hour completion. Anthropic is similar.
Even the fastest possible batch job — say it completes in fifteen minutes — is three orders of magnitude slower than a synchronous request.
Three orders of magnitude. So no, you cannot use batch APIs directly in a conversational interface. But — and this is the nuance — there is a pattern where batch and synchronous APIs work together in the same product.
Walk me through that.
Imagine a customer support chatbot. The conversational back-and-forth with the user — that's synchronous. Has to be. But behind the scenes, you might be doing nightly batch runs to re-embed your entire knowledge base, or to classify and route the previous day's unresolved tickets, or to generate evaluation data for fine-tuning. The user never sees the batch part, but it's powering the synchronous part indirectly.
Batch is infrastructure. It's the thing that prepares the ground for the real-time experience.
That's exactly the right framing. Batch is infrastructure. Synchronous is interface. They're complementary layers.
Let's go provider by provider. What are the actual differences in how they implement batch?
OpenAI's Batch API is the most mature in terms of ecosystem integration. You upload a JSONL file where each line has a custom ID, a method, a URL, the request body with model and messages. The file goes to their Files API, you create a batch job pointing at that file, and you get back an output JSONL file with your custom IDs preserved. Pricing is fifty percent off the standard rate. One detail that trips people up — the batch endpoint has its own rate limits separate from your synchronous limits, and the maximum batch size is around fifty thousand requests per file.
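A rough end-to-end sketch of that flow with the OpenAI Python SDK, reusing the input file from earlier. In a real pipeline you'd poll in a loop with a backoff rather than checking the status once:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL input, then point a batch job at the uploaded file.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Later: re-fetch the batch and download results once it has completed.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    with open("batch_output.jsonl", "wb") as f:
        f.write(output.read())
```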
What about Anthropic?
The pattern is almost identical — batched requests with custom IDs, the same fifty percent discount, and results delivered as a JSONL file you download. One difference is that the requests go inline in the API call rather than as an uploaded file, and Anthropic enforces a smaller cap — around ten thousand requests per batch. The biggest practical difference is that Anthropic's batch API respects the same content safety filters as their synchronous API, which matters if you're processing user-generated content that might trip those filters.
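For comparison, a rough sketch of the same kind of classification request going through Anthropic's Python SDK. This assumes the current SDK exposes batches under client.messages.batches; the model name is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Requests are passed inline in the create call rather than as an uploaded file.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "item-0001",
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 5,
                "messages": [
                    {"role": "user", "content": "Label as SPAM or NOT_SPAM: Win a free cruise today!!!"},
                ],
            },
        },
    ],
)

# Once the batch has finished processing, iterate the per-request results.
for result in client.messages.batches.results(batch.id):
    print(result.custom_id, result.result.type)
```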
Google's batch prediction is part of their Vertex AI platform, and it's a different beast. It's designed not just for LLMs but for any model hosted on Vertex. For Gemini, you submit a BigQuery table or a Cloud Storage file with your prompts, and the discount is the same fifty percent off token pricing. Batch prediction for other model types on Vertex is billed per node-hour instead, which can work out cheaper or more expensive depending on how efficiently you keep the nodes busy. Google also offers a simpler "batch mode" on the Gemini API directly, but Vertex batch prediction is the industrial-grade version.
What about DeepSeek and OpenRouter?
DeepSeek's batch API is available through their platform, but the pricing is already so low on their synchronous API — around fourteen cents per million input tokens on DeepSeek V three — that the batch discount is proportionally smaller, maybe twenty to thirty percent. OpenRouter is an aggregator, so their batch offering is a unified interface over multiple providers' batch APIs. You submit once, they route to whichever underlying provider gives you the best combination of price and availability. The tradeoff is another layer of abstraction, which means another potential failure mode.
That brings us to the question I think is most practically useful — when does it stop making sense to batch? Daniel's running thousands of items. That's clearly batch territory. But what about hundreds? What about dozens?
This is the breakeven analysis that almost nobody does. The batch API saves you fifty percent on token costs. But it adds engineering overhead — you need to build the file upload and download pipeline, handle asynchronous polling, manage retries and partial failures, and deal with the fact that your results might arrive in fifteen minutes or fifteen hours. The question is whether the dollar savings exceed the engineering cost.
There's a crossover point.
There is, and I think it's lower than most people assume. If you're spending a thousand dollars a month on synchronous inference for a classification pipeline, switching to batch saves you five hundred dollars. That probably justifies a couple of days of engineering work. But if you're spending twenty dollars a month, the savings are ten dollars. Unless your time is free, the batch migration doesn't pay for itself.
What about volume in terms of request count rather than dollars?
I'd say the practical floor is around a thousand requests per job. Below that, the overhead of creating the JSONL file, uploading it, polling for completion, and parsing the output starts to feel disproportionate. Between a thousand and ten thousand requests, batch is a clear win if the workload is asynchronous. Above ten thousand, batch isn't just a win — it's basically mandatory if you care about cost, because synchronous rate limits will throttle you anyway.
OpenAI's batch endpoint will happily handle pipelines with hundreds of thousands of requests, though past the per-batch cap you're splitting across multiple files. At that scale, a million requests means gigabytes of JSONL, so you need to think about upload bandwidth. You also start worrying about failure modes. If one percent of your requests fail, that's ten thousand failed requests you need to retry. At scale, the batch pipeline itself becomes a non-trivial piece of infrastructure.
There's a middle sweet spot — large enough that the savings matter, small enough that you're not building a whole new distributed system to manage it.
I think that sweet spot is roughly one thousand to one hundred thousand requests per job. Within that range, batch APIs are one of the best deals in AI infrastructure.
Let's talk about some of the gotchas, because I've played with these endpoints and there are things that don't show up in the documentation.
Please, go ahead.
First gotcha — rate limits on batch submission itself. Most providers cap how many batches, and how many queued tokens, you can have in flight at once, and the caps depend on your tier. A million requests is fine in principle, but you can't put them all in one file, because the per-batch cap is around fifty thousand requests. So you're splitting across twenty-odd files, and how many of those can be in flight at once is what actually throttles you. The parallelism constraint is real.
Related to that — batch prioritization is opaque. There's no priority tier, no "pay extra for faster turnaround." I've seen jobs complete in ten minutes and I've seen identical jobs take eight hours. If your workflow has any time sensitivity, batch is not your friend.
Second gotcha — partial failures. Your batch job completes, you download the output file, and some fraction of your requests returned errors. Now you need to identify which ones failed, extract them, fix whatever went wrong, and resubmit. If you don't build this into your pipeline from the start, you'll be doing it manually.
Third gotcha — the output format isn't always a clean mirror of the input. The custom ID is supposed to come back so you can match results, but I've seen edge cases where the output ordering doesn't match the input ordering, especially with very large files. Always join on the custom ID.
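A minimal sketch of the bookkeeping those last two gotchas imply: join on custom_id, collect anything that didn't succeed, and write it to a retry file. The field names follow OpenAI's output format, so treat the details as illustrative:

```python
import json

# Index the original requests by custom_id so failures can be resubmitted verbatim.
inputs = {}
with open("batch_input.jsonl") as f:
    for line in f:
        req = json.loads(line)
        inputs[req["custom_id"]] = req

# Join results back on custom_id; never rely on output ordering.
succeeded = {}
with open("batch_output.jsonl") as f:
    for line in f:
        result = json.loads(line)
        resp = result.get("response")
        if not result.get("error") and resp and resp["status_code"] == 200:
            succeeded[result["custom_id"]] = resp["body"]

# Everything that errored, or never came back at all, goes into a retry batch.
retry_ids = sorted(set(inputs) - set(succeeded))
with open("batch_retry.jsonl", "w") as f:
    for cid in retry_ids:
        f.write(json.dumps(inputs[cid]) + "\n")
```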
Fourth — cost estimation is harder with batch. With synchronous calls, you know immediately what each request cost. With batch, you submit a file, wait hours, and then find out what you spent. If you had a bug that caused the model to generate ten times more output tokens than expected — congratulations, you just silently overspent by an order of magnitude.
Which is why you should always do a dry run on a small subset before submitting the full batch. Ten requests, check the output, check the token counts, then scale.
Let's circle back to something Daniel asked about — the use case profile. He listed classification, extraction, enrichment, evals, synthetic data generation, embedding backfills. That's a solid list.
I'd add a few to it. Content moderation pipelines — running a batch of user-generated content through a classifier to flag policy violations. Translation backfills — if you've got a corpus of documentation and you need to translate it into twelve languages. Summarization of large document sets — think processing every SEC filing from the past decade. And one that's emerging — dataset distillation for fine-tuning. You take a massive dataset, run it through a strong model in batch mode to generate high-quality training examples, and then fine-tune a smaller model on the output.
Dataset distillation is interesting because it's batch on both ends — you're using batch inference to generate training data for a model that you'll then serve synchronously.
This pattern is becoming more common as people realize that the best way to get high-quality fine-tuning data is to have a frontier model generate it, but doing that synchronously at scale would be prohibitively expensive.
What about the embedding backfill case? That's slightly different because embeddings don't go through the chat completions endpoint.
Right, but you can use the same Files API pattern for embeddings. The economics are similar — you're trading latency for cost. And embedding backfills are a perfect use case because they're almost always one-shot, large-scale jobs. You've got a million documents, you embed them all once and you're done.
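The input line looks almost the same, just pointed at a different endpoint. A sketch, with a placeholder document and model name:

```python
import json

# One embeddings request per line, same custom_id pattern as before.
request = {
    "custom_id": "doc-000001",
    "method": "POST",
    "url": "/v1/embeddings",
    "body": {"model": "text-embedding-3-small", "input": "Full text of document 000001..."},
}
print(json.dumps(request))
```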
One thing I want to flag — batch APIs aren't a good fit for workloads where the output of one request determines the input of the next. If you're doing a chain-of-thought reasoning process where each step depends on the previous step, batch doesn't work because you can't interleave the steps. You'd have to submit step one for all items, wait for completion, then submit step two — and at that point you've lost most of the latency advantage.
That's a really important point. Batch is for embarrassingly parallel workloads. Each request is independent. If there are dependencies between requests, you need something more like a workflow orchestrator — and even then, each independent stage can be batched, but you've added complexity.
Alright, let's do a quick provider comparison in words.
OpenAI Batch API — most mature, widest model selection, fifty percent discount, JSONL format, Files API for upload and download. Works with GPT-4o, GPT-4, GPT-3.5, and the reasoning models. Maximum batch size around fifty thousand requests. Anthropic Message Batches — very similar pattern, fifty percent discount, works with Claude Opus, Sonnet, and Haiku. Smaller maximum batch size — ten thousand requests. One advantage is that Anthropic's safety filters are consistent between sync and batch. Google — two offerings. Gemini API batch mode, which is simpler, and Vertex AI batch prediction, which is the industrial version. Vertex integrates with BigQuery and Cloud Storage, which is great if you're already in the Google Cloud ecosystem but adds friction if you're not. Gemini batch jobs get the same fifty percent token discount; batch prediction for other model types is billed per node-hour. DeepSeek — batch available through their platform, discount is smaller because base prices are already so low. OpenRouter — aggregates across providers with a unified batch interface. Advantage is flexibility, disadvantage is adding an intermediary.
If you're already committed to a specific model ecosystem, use that provider's batch API. If you want to shop around or hedge against provider downtime, OpenRouter makes sense.
That's the heuristic, yes.
Let's get to the volume threshold. Daniel asked at what point the engineering overhead is worth the fifty percent saving versus just firing parallel requests at the regular endpoint.
I've been thinking about this in terms of a simple model. Let's say each run of your synchronous pipeline processes a thousand items and costs ten dollars in API fees. Switching to batch cuts that to five dollars. If it takes you four hours of engineering time to set up the batch pipeline, and your engineering time is worth, say, a hundred dollars an hour, you've spent four hundred dollars to save five dollars per run. You need eighty runs to break even.
That's if you're doing it manually every time. Once the pipeline exists, the marginal cost of running it is near zero.
The setup cost is one-time. So the real question is: how many times will you run this pipeline? If it's a one-off classification job, the batch savings might not justify the setup. If it's a pipeline you'll run weekly for the next two years, batch is a no-brainer even at small volumes.
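The back-of-the-envelope version of that model, with the numbers from the discussion as defaults you'd swap for your own:

```python
def batch_breakeven_runs(sync_cost_per_run: float,
                         discount: float = 0.5,
                         setup_hours: float = 4.0,
                         hourly_rate: float = 100.0) -> float:
    """Number of pipeline runs before the one-time batch setup pays for itself."""
    savings_per_run = sync_cost_per_run * discount
    setup_cost = setup_hours * hourly_rate
    return setup_cost / savings_per_run

# Ten dollars per run, fifty percent discount, four hours of setup at $100/hour.
print(batch_breakeven_runs(10.0))  # 80.0 runs to break even
```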
I think there's also a threshold where synchronous parallelism stops being practical. If you're firing a thousand parallel requests at the synchronous endpoint, you're going to hit rate limits. At some point, batching isn't just about cost — it's about feasibility.
The synchronous API wasn't designed for massive throughput. Rate limits, connection management, retry logic for transient failures — all of that gets harder as you scale. The batch API is purpose-built for high-volume throughput. At a certain scale, you use batch not because it's cheaper, but because it's the only thing that works reliably.
The decision framework is: is your workload asynchronous? Are your requests independent? Is your volume above roughly a thousand items per run? Will you run this more than once or twice? If yes to all of those, batch is the right call.
I'd add one more: can you tolerate variable turnaround time? If you need results in under an hour, batch is risky. If you can wait until tomorrow, batch is perfect.
One thing we haven't talked about — what happens when batch jobs fail catastrophically? Not partial failures, but the whole job fails.
It's rare but it happens. Usually because of a malformed input file — a JSON error on line forty-seven thousand that causes the whole batch to be rejected. Or a provider-side outage that aborts your job mid-processing. The best practice is to validate your input file with a small test batch first — ten or twenty requests — and to build your pipeline so that a failed batch job can be resubmitted without losing state. Keep your input file, keep your custom IDs, and design for idempotency.
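A small sketch of that pre-flight habit: parse every line before submitting, check the custom IDs, and carve off a handful of requests as a smoke-test batch. The file name and the ten-request figure are just the dry run mentioned earlier:

```python
import json

def validate_and_split(path: str, smoke_test_size: int = 10):
    """Parse every line up front so a malformed line 47,000 fails here, not provider-side."""
    requests, seen_ids = [], set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            try:
                req = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Malformed JSON on line {lineno}: {e}") from e
            cid = req.get("custom_id")
            if not cid or cid in seen_ids:
                raise ValueError(f"Missing or duplicate custom_id on line {lineno}")
            seen_ids.add(cid)
            requests.append(req)
    # Submit the first few as a cheap smoke test before committing to the full job.
    return requests[:smoke_test_size], requests[smoke_test_size:]

smoke_test, remainder = validate_and_split("batch_input.jsonl")
```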
Idempotency is one of those words that sounds fancy but just means "running it twice gives the same result as running it once."
And with LLMs, true idempotency is hard because the outputs are non-deterministic. But you can at least ensure that resubmitting a failed batch doesn't duplicate work or corrupt your results.
Alright, I want to zoom out for a second. Batch APIs right now are basically a discount mechanism for off-peak compute. But as AI inference gets cheaper — and it's been dropping by something like a factor of ten per year — does the batch discount eventually become irrelevant?
I don't think so. The gap between peak and off-peak demand isn't going away. In fact, as AI becomes more embedded in real-time applications, the demand curve probably gets peakier — more usage during business hours, less at night. That means the economic incentive for providers to fill off-peak capacity persists. The absolute dollar amounts might shrink, but the relative discount — the fifty percent — probably sticks around.
The other dynamic is that as inference gets cheaper, people do more of it. The volume of batch workloads might actually increase because use cases that were previously too expensive become viable.
Jevons paradox applied to AI inference. Cheaper inference leads to more inference, not less spending.
And now: Hilbert's daily fun fact.
The average cumulus cloud weighs about one point one million pounds. That's roughly the weight of a hundred adult elephants, floating above your head.
Where does this leave someone like Daniel? I think the answer is pretty clear — he should batch. Classification is asynchronous, independent, structured-output, and at thousands of items, well above the threshold where batch makes sense. The fifty percent savings on a pipeline like that is real money.
The engineering overhead isn't that high. If you're already writing code to call the synchronous API, switching to batch is maybe an afternoon of work. You wrap your prompts in a JSONL file, upload it, poll for completion, download results. The OpenAI SDK has native batch support now, Anthropic's SDK does too.
The one thing I'd caution is to build the retry and failure-handling logic from day one. It's tempting to write the happy path and move on, but batch pipelines run unattended, and when they fail at three in the morning, you want them to fail gracefully with a clear error message, not silently drop half your data.
Monitor your costs. Set up billing alerts. Batch makes it easy to submit a large job and forget about it until the invoice arrives. A thousand requests at fifty percent off is still a thousand requests — the discount doesn't make it free.
The broader takeaway here is that batch APIs are one of those rare things in cloud infrastructure that are underpriced relative to the value they provide. You're getting the same model, the same quality, the same output — just delivered asynchronously — for half the price. That's a better deal than almost any other optimization you can make.
The reason it's a good deal is structural, not promotional. The providers have lower costs to serve batch workloads, and they're passing some of that through. That's sustainable.
One open question I have is whether we'll eventually see a tiered batch system. Priority batch with a twenty percent discount and a one-hour SLA, standard batch with fifty percent and twenty-four hours. The current one-size-fits-all approach leaves a gap for workloads that are too latency-sensitive for overnight batching but too cost-sensitive for synchronous.
I'd bet we see that within the next year. The cloud providers figured out tiered storage and tiered compute a long time ago. Tiered inference is the obvious next step.
Thanks to Hilbert Flumingtop for producing, as always. This has been My Weird Prompts. You can find every episode at myweirdprompts. We'll be back soon.