Batch Inference

Batch inference is running LLM requests asynchronously in bulk — submit a job of many requests, collect results when ready — instead of the interactive request-response loop, usually at a steep discount.

It exists because providers can schedule deferred work into idle capacity: the standard batch tier prices at roughly half of interactive rates for results within a stated window. The candidates are everything without a user waiting — labeling and classification backfills, synthetic-data generation, periodic summarization, bulk evaluation runs, embedding regeneration — which in many products is the majority of token volume, hiding in plain sight at full price.

The practical pattern: audit your traffic, split it into interactive (humans waiting — pay for latency) and deferrable (move to batch), and stack the discounts — batch pricing composes with prompt caching on repeated prefixes. It's one of the three blunt levers in LLM cost engineering, alongside caching and model right-sizing — and the only one that's purely logistical: same model, same outputs, half the bill.

Frequently asked questions

When should I use a batch API?

Whenever no human is waiting: backfills, dataset labeling, synthetic-data generation, nightly summarization, embedding refreshes, bulk evals. Provider batch tiers typically cost about half of interactive pricing in exchange for results within a window (commonly up to 24 hours, usually much faster) — free money for offline workloads.

Is batch inference the same as batching in serving?

Different layers, same word. Provider batch APIs are a product tier: submit a file of requests, collect results later, pay less. Serving-level batching (continuous batching in engines like vLLM) is an engine technique packing concurrent requests onto the GPU. One is how you buy; the other is how the GPU stays busy.

Frequently asked questions

Related