# Inference

> Inference is running a trained model to produce output — for LLMs, generating tokens one at a time. Its cost and latency define the economics of AI products.

**Inference means running a trained model rather than training it. For LLMs, it is the process of generating output tokens one at a time, with each token requiring a full forward pass through the model's weights.**

Two phases with different physics: **prefill** processes the whole prompt in parallel (compute-bound, sets time-to-first-token), then **decode** generates autoregressively, one token per step (memory-bandwidth-bound, sets tokens-per-second). The [KV cache](/glossary/kv-cache) keeps decode from recomputing attention keys and values for the prompt and earlier tokens at each step; [quantization](/glossary/quantization) shrinks the weights being streamed; [speculative decoding](/glossary/speculative-decoding) has a small draft model propose several tokens that the large model verifies in a single step; engines like [vLLM](/tools/vllm) batch many requests over the same weights.
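The two bottlenecks can be approximated with back-of-envelope arithmetic. The sketch below is a rough model, not a benchmark: it assumes a dense model, roughly 2 FLOPs per parameter per token for prefill, and that each decode step streams every weight from memory exactly once. The function names and all hardware figures are hypothetical, and real systems fall short of these bounds because of KV-cache traffic, kernel overhead, and imperfect utilization.

```python
def prefill_seconds(n_params: float, prompt_tokens: int, peak_flops_per_s: float) -> float:
    """Lower bound on prefill time, assuming prefill is compute-bound.

    Uses the common approximation of ~2 FLOPs per parameter per token.
    """
    return 2 * n_params * prompt_tokens / peak_flops_per_s


def decode_tokens_per_second(
    n_params: float, bytes_per_param: float, mem_bandwidth_bytes_per_s: float
) -> float:
    """Upper bound on single-request decode speed, assuming decode is
    memory-bandwidth-bound and each step reads all weights once."""
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    # Hypothetical setup: 7B-parameter model, 1 TB/s memory bandwidth,
    # 100 TFLOP/s of usable compute, 1000-token prompt.
    params = 7e9
    print("prefill (s):", prefill_seconds(params, 1000, 100e12))
    print("decode fp16 (tok/s):", decode_tokens_per_second(params, 2.0, 1e12))
    print("decode 4-bit (tok/s):", decode_tokens_per_second(params, 0.5, 1e12))
```

Under these assumed numbers, prefill takes about 0.14 s, 16-bit decode tops out near 71 tokens per second, and 4-bit quantization raises the ceiling roughly fourfold because a quarter as many bytes are streamed per step. The same model suggests why batching helps: several requests can share one pass over the weights, so aggregate throughput grows until compute becomes the limit.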

Inference economics shape every LLM product decision: API pricing per [token](/glossary/llm-token), the [self-host vs API question](/guides/mlops/self-host-vs-api-llm) (which is really "can your utilization beat a provider's"), and the latency budget your UX can absorb. The applied playbook — caching, right-sizing models, p95 budgets — is [LLM Cost and Latency Engineering](/guides/advanced/llm-cost-latency-engineering).
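The utilization question can be made concrete with a small calculation. This is a simplified sketch under stated assumptions: a fixed hourly GPU cost, a peak throughput the hardware could sustain if fully loaded, and a single blended API price per million tokens. The function names and every figure are hypothetical, and the model ignores engineering time, redundancy, and separate input and output pricing.

```python
def self_host_cost_per_million_tokens(
    gpu_cost_per_hour: float, peak_tokens_per_second: float, utilization: float
) -> float:
    """Cost per million tokens when the GPU is billed by the hour but only
    a fraction of its peak throughput is actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(
    gpu_cost_per_hour: float, peak_tokens_per_second: float, api_price_per_million: float
) -> float:
    """Utilization at which self-hosting costs the same per token as the API.

    A result above 1.0 means self-hosting cannot match the API price
    on this hardware, even at full load.
    """
    peak_tokens_per_hour = peak_tokens_per_second * 3600
    return gpu_cost_per_hour * 1_000_000 / (peak_tokens_per_hour * api_price_per_million)


if __name__ == "__main__":
    # Hypothetical figures: $2/hour GPU, 1000 tok/s peak, $1 per million API tokens.
    print(self_host_cost_per_million_tokens(2.0, 1000, 0.5))
    print(break_even_utilization(2.0, 1000, 1.0))
```

With these assumed figures, self-hosting at 50% utilization costs about $1.11 per million tokens, and it only matches the $1 API price once utilization passes roughly 56%.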

---

_Source: https://agentscamp.com/glossary/inference — Term on AgentsCamp._
