Inference
Inference is running a trained model to produce output, as opposed to training it. For LLMs, that means generating output tokens one at a time, with each token requiring a forward pass through the model's weights. Its cost and latency define the economics of AI products.
Inference has two phases with different bottlenecks. Prefill processes the whole prompt in parallel; it is compute-bound and sets time-to-first-token. Decode then generates autoregressively, one token per step; it is memory-bandwidth-bound and sets tokens per second. Four techniques target these costs: the KV cache keeps decode from recomputing the prompt's attention keys and values at every step; quantization shrinks the weights being streamed; speculative decoding has a small draft model propose several tokens that the large model verifies in one step; and engines like vLLM batch many requests over the same weights.
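To make the memory-bandwidth-bound claim concrete, here is a rough back-of-envelope sketch. It assumes each decode step must stream every weight from memory once, and it ignores KV cache reads, compute limits, and kernel overhead, so the result is only an upper bound. The function name, the 7B parameter count, and the 1 TB/s bandwidth figure are hypothetical values chosen for illustration, not measurements of any real system.

```python
def decode_tokens_per_second_bound(n_params, bytes_per_param,
                                   bandwidth_bytes_per_s, batch_size=1):
    """Upper bound on decode throughput when weight streaming is the bottleneck.

    Assumption: one decode step reads all weights once, and that single read
    serves every request in the batch.
    """
    weight_bytes = n_params * bytes_per_param
    steps_per_second = bandwidth_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


# Hypothetical setup: 7e9 parameters, 1e12 bytes/s of memory bandwidth.
fp16 = decode_tokens_per_second_bound(7e9, 2.0, 1e12)       # 16-bit weights
int4 = decode_tokens_per_second_bound(7e9, 0.5, 1e12)       # 4-bit weights
batched = decode_tokens_per_second_bound(7e9, 2.0, 1e12, batch_size=8)

print(f"fp16, batch 1: {fp16:.1f} tokens/s")
print(f"int4, batch 1: {int4:.1f} tokens/s")
print(f"fp16, batch 8: {batched:.1f} tokens/s (aggregate)")
```

Under these assumptions, quantizing from 16-bit to 4-bit raises the bound fourfold, and batching multiplies aggregate throughput because one pass over the weights serves several requests. In practice the batching gain flattens once compute or KV cache traffic becomes the limit.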
Inference economics shape every LLM product decision: API pricing per token, the self-host vs API question (which is really "can your utilization beat a provider's?"), and the latency budget your UX can absorb. The applied playbook (caching, right-sizing models, p95 budgets) is covered in LLM Cost and Latency Engineering.
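The utilization question can be sketched as simple arithmetic. The example below assumes a fixed hourly GPU cost and a fixed peak throughput, and it ignores engineering and operations costs, which usually push the real crossover higher. All prices and throughput numbers are hypothetical placeholders.

```python
def self_host_cost_per_million_tokens(gpu_cost_per_hour, peak_tokens_per_second,
                                      utilization):
    """Cost per million tokens when the GPU is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_cost_per_hour, peak_tokens_per_second,
                           api_price_per_million_tokens):
    """Utilization at which self-hosting matches the API price.

    A result above 1.0 means self-hosting cannot match the price on this hardware.
    """
    tokens_per_hour_at_full_load = peak_tokens_per_second * 3600
    cost_at_full_load = gpu_cost_per_hour / tokens_per_hour_at_full_load * 1_000_000
    return cost_at_full_load / api_price_per_million_tokens


# Hypothetical numbers: $2.00 per GPU-hour, 1000 tokens/s peak, API at $1.00 per million.
busy = self_host_cost_per_million_tokens(2.0, 1000, 0.5)
idle = self_host_cost_per_million_tokens(2.0, 1000, 0.1)
crossover = break_even_utilization(2.0, 1000, 1.0)

print(f"50% utilization: ${busy:.2f} per million tokens")
print(f"10% utilization: ${idle:.2f} per million tokens")
print(f"break-even utilization: {crossover:.0%}")
```

With these placeholder inputs, the same hardware is five times more expensive per token at 10% utilization than at 50%, which is why the comparison hinges on how busy you can keep the GPU rather than on its hourly price alone.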
Frequently asked questions
- Why is LLM inference expensive?
  Generation is sequential: each output token requires a full forward pass over billions of weights, and you can't produce token N+1 before token N. Reading the prompt parallelizes well; writing the answer doesn't. That asymmetry is why output tokens cost more than input tokens and why long answers dominate latency.
- What's the difference between time-to-first-token and throughput?
  Time-to-first-token (TTFT) is how long it takes before the first token appears. It is dominated by prompt processing, and it is what makes chat feel responsive (and what streaming exploits). Throughput is tokens per second once generation is rolling; it determines total response time and serving capacity. Optimizations often trade one against the other, so know which one your product feels.
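The relationship between the two metrics can be sketched with a simple latency model. It assumes the first token arrives at TTFT and each remaining token arrives at a steady decode rate; real systems vary with load and batching. The function name and the numbers are hypothetical.

```python
def total_response_seconds(ttft_seconds, output_tokens, decode_tokens_per_second):
    """Rough end-to-end latency: first token at TTFT, the rest at the decode rate."""
    remaining_tokens = max(output_tokens - 1, 0)
    return ttft_seconds + remaining_tokens / decode_tokens_per_second


# Hypothetical request: 0.5 s TTFT, 50 tokens/s decode rate.
short_answer = total_response_seconds(0.5, 20, 50)
long_answer = total_response_seconds(0.5, 200, 50)

print(f"20-token answer:  {short_answer:.2f} s")
print(f"200-token answer: {long_answer:.2f} s")
```

Under this model TTFT is a fixed cost, while the decode term grows linearly with answer length, which is why long answers are dominated by throughput and short ones by TTFT.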
Related
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets
  A practical playbook for cutting LLM cost and tail latency (caching, model right-sizing, prompt trimming, and enforced p95 budgets) without losing quality.
- KV Cache
  The KV cache stores each token's attention keys and values so an LLM doesn't recompute the whole context per new token. It is the memory that makes generation fast.
- Quantization
  Quantization shrinks a model by storing weights in lower precision (8-, 4-, even 2-bit), cutting memory and speeding inference at a small accuracy cost.
- Speculative Decoding
  Speculative decoding speeds up generation: a small draft model proposes tokens, the large model verifies them in one parallel pass. Same output, fewer steps.
- vLLM
  A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?
  The real economics of self-hosting an LLM vs. calling a hosted API: GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- Batch Inference
  Batch inference processes many LLM requests asynchronously instead of one-at-a-time interactively, typically at ~50% discount via provider batch APIs.
- Token (LLM)
  A token is the unit LLMs read and write: a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Mixture of Experts (MoE)
  MoE is a model architecture where a router activates only a few expert subnetworks per token: huge total capacity, a fraction of the compute per token.
- Reasoning Model
  A reasoning model is an LLM trained to think before answering, generating internal reasoning tokens it can spend adaptively on hard problems.
- Token Streaming
  Token streaming delivers model output incrementally as it's generated, via SSE or websockets, so users see text immediately instead of waiting.