# Batch Inference

> Batch inference processes many LLM requests asynchronously instead of one-at-a-time interactively — typically at ~50% discount via provider batch APIs.

**Batch inference is running LLM requests asynchronously in bulk — submit a job of many requests, collect results when ready — instead of the interactive request-response loop, usually at a steep discount.**

It exists because providers can schedule deferred work into idle capacity: the standard batch tier prices at roughly **half of interactive rates** for results within a stated window. The candidates are everything without a user waiting — labeling and classification backfills, [synthetic-data](/glossary/synthetic-data) generation, periodic summarization, bulk evaluation runs, embedding regeneration — which in many products is the *majority* of token volume, hiding in plain sight at full price.

The practical pattern: audit your traffic, split it into interactive (humans waiting — pay for latency) and deferrable (move to batch), and stack the discounts — batch pricing composes with [prompt caching](/glossary/prompt-caching) on repeated prefixes. It's one of the three blunt levers in [LLM cost engineering](/guides/advanced/llm-cost-latency-engineering), alongside caching and model right-sizing — and the only one that's purely logistical: same model, same outputs, half the bill.

---

_Source: https://agentscamp.com/glossary/batch-inference — Term on AgentsCamp._