# Quantization

> Quantization shrinks a model by storing weights in lower precision (8-, 4-, even 2-bit) — cutting memory and speeding inference at a small accuracy cost.

**Quantization is compressing a model by representing its weights (and sometimes activations) in lower-precision numbers — 8-bit, 4-bit, or below instead of 16-bit floats — trading a small amount of accuracy for large savings in memory and speed.**

Model weights are just numbers, and most of their precision is redundant. Mapping them onto a coarser grid shrinks a model ~4× at 4-bit, which compounds: less VRAM to fit, less memory bandwidth per token (the real bottleneck of [inference](/glossary/inference)), bigger batches per GPU. The cost is quantization error — typically a few percent at 4-bit, near-zero at 8-bit, and increasingly visible below.

It shows up everywhere in the stack: **local inference** runs on quantized GGUF builds via [Ollama](/tools/ollama) and LM Studio; **serving economics** in [self-host deployments](/guides/mlops/self-host-vs-api-llm) lean on 8/4-bit to multiply throughput per GPU; **QLoRA** fine-tunes against a quantized base ([LoRA](/glossary/lora)); and even [vector databases](/glossary/vector-database) quantize embeddings to shrink indexes. The recurring engineering move is the same: measure the quality delta on *your* task, then take the free memory.

---

_Source: https://agentscamp.com/glossary/quantization — Term on AgentsCamp._
