Inference

Inference is using a trained model rather than training it: for LLMs, the process of generating output tokens one at a time, each requiring a full pass through the model's weights.

Inference has two phases with different bottlenecks. Prefill processes the whole prompt in parallel (compute-bound, sets time-to-first-token); decode then generates autoregressively, one token per step (memory-bandwidth-bound, sets tokens-per-second). The KV cache stores the attention keys and values of every token already processed, so decode does not recompute the prompt at each step; quantization shrinks the weights being streamed; speculative decoding drafts several tokens cheaply and verifies them in one big-model step; engines like vLLM batch many requests over the same weights.
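The compute-bound versus bandwidth-bound split can be made concrete with a back-of-the-envelope estimate. The sketch below is a rough bound, not a benchmark: it assumes a dense model, about 2 FLOPs per parameter per token, a single unbatched request, and it ignores KV-cache traffic and attention cost. The function names and the hardware figures are hypothetical and chosen only for illustration.

```python
def decode_tokens_per_second(params: float, bytes_per_param: float,
                             mem_bandwidth_bytes_s: float) -> float:
    """Rough upper bound on single-stream decode speed.

    Assumes each decode step streams every weight from memory once,
    so speed is limited by bandwidth divided by model size in bytes.
    """
    weight_bytes = params * bytes_per_param
    return mem_bandwidth_bytes_s / weight_bytes


def prefill_seconds(params: float, prompt_tokens: int,
                    flops_per_s: float) -> float:
    """Rough lower bound on prefill time.

    Assumes about 2 FLOPs per parameter per prompt token and that the
    accelerator sustains its full rated throughput.
    """
    return 2.0 * params * prompt_tokens / flops_per_s


# Hypothetical setup: 7B parameters, 1 TB/s memory bandwidth, 100 TFLOP/s.
PARAMS = 7e9
BANDWIDTH = 1e12
FLOPS = 1e14

fp16_speed = decode_tokens_per_second(PARAMS, 2.0, BANDWIDTH)   # 16-bit weights
int4_speed = decode_tokens_per_second(PARAMS, 0.5, BANDWIDTH)   # 4-bit weights
prefill_time = prefill_seconds(PARAMS, 1000, FLOPS)             # 1000-token prompt

print(f"decode, 16-bit weights: {fp16_speed:.1f} tokens/s")
print(f"decode, 4-bit weights:  {int4_speed:.1f} tokens/s")
print(f"prefill, 1000 tokens:   {prefill_time:.2f} s")
```

Under these assumptions, quantizing from 16-bit to 4-bit weights raises the decode bound fourfold because a quarter of the bytes are streamed per step, while prefill time depends on FLOPs rather than bandwidth.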

Inference economics shape every LLM product decision: API pricing per token, the self-host vs API question (which is really "can your utilization beat a provider's"), and the latency budget your UX can absorb. The applied playbook — caching, right-sizing models, p95 budgets — is LLM Cost and Latency Engineering.
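The "can your utilization beat a provider's" question reduces to simple arithmetic. The following sketch assumes a flat hourly GPU price, a known peak throughput, and a single blended API price per million tokens; real bills also include engineering time, idle capacity, and separate input and output rates. All names and figures are hypothetical.

```python
def self_host_cost_per_million(gpu_dollars_per_hour: float,
                               peak_tokens_per_second: float,
                               utilization: float) -> float:
    """Dollars per million tokens when self-hosting.

    utilization is the fraction of peak throughput actually served,
    between 0 (exclusive) and 1 (inclusive).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600.0 * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_dollars_per_hour: float,
                           peak_tokens_per_second: float,
                           api_dollars_per_million: float) -> float:
    """Utilization at which self-hosting costs the same as the API.

    A result above 1.0 means self-hosting cannot match the API price
    on this hardware even when fully loaded.
    """
    peak_tokens_per_hour = peak_tokens_per_second * 3600.0
    return gpu_dollars_per_hour * 1_000_000 / (
        peak_tokens_per_hour * api_dollars_per_million)


# Hypothetical figures: $2/hour GPU, 1000 tokens/s peak, API at $1 per million.
needed = break_even_utilization(2.0, 1000.0, 1.0)
print(f"break-even utilization: {needed:.0%}")
```

Below the break-even utilization the GPU sits idle too often and the API is cheaper per token; above it, self-hosting wins on raw serving cost.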

Frequently asked questions

Why is LLM inference expensive?

Generation is sequential: each output token requires a full forward pass over billions of weights, and you can't produce token N+1 before token N. Reading the prompt parallelizes well; writing the answer doesn't. That asymmetry is why output tokens are usually priced higher than input tokens and why long answers dominate latency.

What's the difference between time-to-first-token and throughput?

TTFT is how long before the first token appears — dominated by prompt processing, it's what makes chat feel responsive (and what streaming exploits). Throughput is tokens per second once generation is rolling — what determines total time and serving capacity. Optimizations often trade one against the other; know which one your product feels.
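How the two metrics combine into the time a user waits can be sketched with one formula. The function below assumes a constant decode rate after the first token, which real servers only approximate; the name `response_seconds` and the sample numbers are hypothetical.

```python
def response_seconds(ttft_s: float, output_tokens: int,
                     tokens_per_second: float) -> float:
    """Approximate wall-clock time for a full response.

    The first token arrives after ttft_s; each remaining token is
    assumed to take 1 / tokens_per_second.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) / tokens_per_second


# Hypothetical comparison at 50 tokens/s with a 0.5 s TTFT.
short_answer = response_seconds(0.5, 20, 50.0)
long_answer = response_seconds(0.5, 400, 50.0)
print(f"20-token answer:  {short_answer:.2f} s")
print(f"400-token answer: {long_answer:.2f} s")
```

For short answers TTFT is a large share of the wait, so it is the number to optimize; for long answers the decode rate dominates.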
