Quantization
Quantization shrinks a model by storing weights in lower precision (8-, 4-, even 2-bit) — cutting memory and speeding inference at a small accuracy cost.
Quantization compresses a model by representing its weights (and sometimes activations) in lower-precision numbers (8-bit, 4-bit, or below instead of 16-bit floats), trading a small amount of accuracy for large savings in memory and speed.
Model weights are just numbers, and much of their precision is redundant. Mapping them onto a coarser grid shrinks a model roughly 4× when going from 16-bit to 4-bit, and the saving compounds: less VRAM needed to fit the model, fewer bytes read from memory per generated token (memory bandwidth, rather than compute, is usually the real bottleneck of inference), and bigger batches per GPU. The cost is quantization error: typically a few percent on benchmarks at 4-bit, near zero at 8-bit, and increasingly visible below 4-bit.
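To make "mapping onto a coarser grid" concrete, here is a minimal sketch of symmetric absmax quantization, one of the simplest schemes. It is an illustration under stated assumptions, not how any particular library or GGUF format does it: production formats usually quantize in small blocks with per-block scales. The function names and the synthetic Gaussian weights are invented for this example.

```python
import random


def quantize_absmax(weights, bits):
    """Map floats onto a symmetric signed-integer grid (illustrative scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0  # width of one grid step
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average round-trip error introduced by quantizing to `bits` bits."""
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Synthetic stand-in for a weight tensor (assumption: roughly Gaussian values).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4, 2):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

The printed error grows as the bit width drops, which is the quantization error described above: fewer grid points means each weight lands farther from its original value.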
It shows up everywhere in the stack: local inference runs on quantized GGUF builds via Ollama and LM Studio; self-hosted serving leans on 8-bit and 4-bit weights to multiply throughput per GPU; QLoRA trains LoRA adapters on top of a quantized base model; and even vector databases quantize embeddings to shrink indexes. The recurring engineering move is the same: measure the quality delta on your task, then take the free memory.
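"Measure the quality delta on your task" can be as simple as scoring both builds on the same labeled examples. The sketch below assumes each model is wrapped as a plain Python callable from prompt to answer and that exact-match accuracy is a reasonable metric for the task; `accuracy`, `quality_delta`, and the toy stand-in models are all hypothetical names for illustration.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the model's answer matches exactly."""
    correct = sum(1 for prompt, expected in examples
                  if model(prompt).strip() == expected)
    return correct / len(examples)


def quality_delta(baseline: Callable[[str], str],
                  quantized: Callable[[str], str],
                  examples: Sequence[Example]) -> float:
    """Quantized accuracy minus baseline accuracy (negative means a loss)."""
    return accuracy(quantized, examples) - accuracy(baseline, examples)


# Toy stand-ins: in practice these would call the full-precision and
# quantized builds of the same model.
examples = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("7+1", "8")]
answers = dict(examples)


def baseline(prompt: str) -> str:
    return answers[prompt]


def quantized(prompt: str) -> str:
    # Pretend quantization broke one answer out of four.
    return "9" if prompt == "7+1" else answers[prompt]


print(quality_delta(baseline, quantized, examples))  # -0.25
```

The useful property is that both builds see the same fixed example set, so the delta reflects the quantization rather than a change in the evaluation.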
Frequently asked questions
- How much quality does quantization cost?
  - Less than intuition suggests, down to a point. 8-bit is near-lossless for most models; well-made 4-bit typically costs a few percent on benchmarks and is the local-inference default; below 4-bit, degradation becomes noticeable and task-dependent. Bigger models tolerate quantization better: a 4-bit 70B usually beats a full-precision 7B.
- Why does quantization matter for running models locally?
  - It's the difference between fitting and not fitting. A 7B model needs ~14 GB at 16-bit but ~4 GB at 4-bit, which is laptop territory. Tools like Ollama and LM Studio serve quantized GGUF builds by default, which is what makes local LLMs practical on consumer hardware at all.
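The fitting arithmetic is just parameters times bits per weight. A minimal sketch, counting weights only and using decimal gigabytes (1 GB = 10^9 bytes); the function name is invented for this example:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 7B at 16-bit: 14.0 GB
# 7B at 8-bit: 7.0 GB
# 7B at 4-bit: 3.5 GB
```

The raw 4-bit figure is 3.5 GB; the ~4 GB quoted above is plausibly higher because real quantized files also store scale metadata and may keep some tensors at higher precision, and a running model needs additional memory for its KV cache and activations.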
Related
- Inference: Inference is running a trained model to produce output; for LLMs, generating tokens one at a time. Its cost and latency define the economics of AI products.
- LoRA (Low-Rank Adaptation): LoRA fine-tunes a model by training small low-rank adapter matrices instead of all weights, for a fraction of the memory and cost and nearly full-tune quality.
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?: The real economics of self-hosting an LLM vs. calling a hosted API: GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- Ollama: An open-source tool to run open-weight LLMs locally with a single command, including a local OpenAI-compatible API.
- vLLM: A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- Embedding Index Tuner: Tune a vector index (HNSW graph parameters and quantization) to hit a recall target at the lowest latency and memory, by sweeping settings against a fixed query set instead of trusting defaults. Use when vector search is slow or memory-hungry, when recall dropped after enabling quantization, or when standing up an index and you need defensible parameters.
- Best Tools for Running LLMs Locally in 2026: The local LLM stack, ranked by job: Ollama for serving tools, LM Studio and Jan for desktop exploration, llama.cpp for control, vLLM when it's real serving.
- Ollama vs LM Studio: Running LLMs Locally (2026): Ollama vs LM Studio compared: CLI-first server for developers vs polished desktop app for exploring local models. Which local LLM tool fits how you work.
- Jan: An open-source ChatGPT alternative that runs fully offline; a polished desktop app over llama.cpp with a model hub, MCP support, and a local API server.
- Llama Cpp: The C/C++ inference engine that made local LLMs possible, with GGUF quantization, every GPU backend, and an OpenAI-compatible server, with no dependencies.
- Whisper: OpenAI's open-weights speech-to-text; the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.
- Distillation: Distillation trains a smaller model to imitate a larger one, using its outputs as training data to get most of the capability at a fraction of the cost.
- Embedding Dimension: Embedding dimension is the length of an embedding vector (how many numbers represent each text), trading capacity against storage and search cost.
- Mixture of Experts (MoE): MoE is a model architecture where a router activates only a few expert subnetworks per token, giving huge total capacity at a fraction of the compute per token.
- Open Weights: An open-weights model publishes its parameters for anyone to download and run, unlike API-only models, with licenses from permissive to restricted.
- SLM (Small Language Model): A small language model is a compact LLM, roughly 1–15B parameters, that runs cheaply or locally, trading peak capability for speed and deployability.
- Speculative Decoding: Speculative decoding speeds up generation: a small draft model proposes tokens, and the large model verifies them in one parallel pass, giving the same output in fewer steps.