Distillation
Distillation trains a smaller model to imitate a larger one — using its outputs as training data to get most of the capability at a fraction of the cost.
Distillation is training a smaller "student" model on a larger "teacher" model's outputs, transferring most of the teacher's capability on a task into a model that's far cheaper and faster to run.
The pattern in practice: run the frontier model over thousands of representative inputs, capture its outputs (often with the reasoning included), curate the best, and fine-tune a small model on the result. Synthetic data generation and training happen in one loop. For narrow tasks (classify, extract, route, rewrite), a distilled small model frequently lands within a few points of the teacher's task score at a small fraction of the per-call cost. That is the whole economics of "GPT-quality at Haiku prices" on your specific workload.
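The capture-and-curate half of that loop can be sketched as a small dataset builder. This is a minimal sketch under stated assumptions, not a definitive pipeline: `teacher` stands in for whatever call reaches the frontier model, `keep` is a hypothetical curation rule, `fake_teacher` is a toy replacement so the example runs offline, and the prompt/completion JSONL layout is an assumption, since the record format depends on the fine-tuning stack you use.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    inputs: Iterable[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> list[dict]:
    """Run the teacher over inputs and keep only curated prompt/completion pairs."""
    records = []
    seen = set()
    for text in inputs:
        if text in seen:           # skip duplicate inputs
            continue
        seen.add(text)
        output = teacher(text)     # hypothetical: wraps the frontier-model call
        if keep(text, output):     # curation: drop outputs that fail the check
            records.append({"prompt": text, "completion": output})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy stand-in for the teacher on a ticket-routing task (assumption, not a real API).
ALLOWED_LABELS = {"billing", "technical", "other"}


def fake_teacher(text: str) -> str:
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "technical"
    return "unsure"                # off-label output that curation should remove


def label_is_valid(_prompt: str, output: str) -> bool:
    return output in ALLOWED_LABELS


tickets = ["I need a refund", "App crashes on login", "hello??", "I need a refund"]
dataset = build_distillation_set(tickets, fake_teacher, label_is_valid)
print(to_jsonl(dataset))
```

The resulting file is what the student is fine-tuned on. In a real run the curation rule usually does more work than a label check, for example rejecting malformed outputs or sampling for class balance.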
Its boundaries: breadth doesn't transfer (the student learns your task, not general intelligence), the student's quality ceiling is inherited from the teacher, and provider terms of service often restrict training on model outputs, so read them. Where distillation sits against prompting, RAG, and ordinary fine-tuning is mapped in the 2026 decision tree.
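One way to see the inherited ceiling on a specific workload is to score teacher and student on the same held-out, human-labeled examples and count how many student errors are copies of teacher errors. The sketch below assumes a classification-style task with exact-match scoring; the labels and predictions are made-up illustrations, not measured results.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the reference labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def inherited_errors(student, teacher, gold):
    """Count examples where the student repeats the teacher's wrong answer."""
    return sum(s == t and t != g for s, t, g in zip(student, teacher, gold))


# Illustrative held-out set (hypothetical data).
gold          = ["billing", "technical", "other", "billing", "technical"]
teacher_preds = ["billing", "technical", "other", "billing", "other"]
student_preds = ["billing", "technical", "other", "other",   "other"]

teacher_acc = accuracy(teacher_preds, gold)
student_acc = accuracy(student_preds, gold)

print(f"teacher accuracy: {teacher_acc:.2f}")
print(f"student accuracy: {student_acc:.2f}")
print(f"gap in points: {100 * (teacher_acc - student_acc):.0f}")
print(f"student/teacher agreement: {accuracy(student_preds, teacher_preds):.2f}")
print(f"errors copied from teacher: {inherited_errors(student_preds, teacher_preds, gold)}")
```

Scoring against human labels rather than teacher outputs matters here: agreement with the teacher alone would hide the cases where both models are wrong.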
Frequently asked questions
- How is distillation different from fine-tuning?
  Distillation is a kind of fine-tuning where the training data comes from a teacher model: you collect the big model's outputs on your task (a form of synthetic data) and train a small model to reproduce them. Classic fine-tuning uses human-curated examples; distillation manufactures them from a model you wish you could afford to run everywhere.
- When does distillation make sense for an LLM product?
  When a frontier model nails your task but its cost or latency doesn't fit production scale, and the task is narrow enough for a small model to learn. Classification, extraction, routing, and templated generation distill beautifully; open-ended reasoning distills poorly. Check the model provider's terms first: many restrict using outputs to train competing models.
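The "cost doesn't fit production scale" test can be turned into back-of-the-envelope arithmetic: distillation pays off once the per-call saving has covered the one-time cost of labeling with the teacher, training, and evaluation. A minimal sketch, where every price is a hypothetical placeholder rather than a real provider rate:

```python
import math


def break_even_calls(teacher_cost_per_1k, student_cost_per_1k, one_time_cost):
    """Calls needed before the student's savings cover the one-time distillation cost.

    Costs are in the same currency; per-call costs are given per 1,000 calls.
    Returns None when the student is not cheaper than the teacher.
    """
    saving_per_1k = teacher_cost_per_1k - student_cost_per_1k
    if saving_per_1k <= 0:
        return None
    return math.ceil(one_time_cost / saving_per_1k * 1000)


# Hypothetical numbers: teacher costs 10.00 per 1,000 calls, student 1.00,
# and the whole distillation project costs 450.00 once.
calls = break_even_calls(10.0, 1.0, 450.0)
print(f"break-even after {calls:,} calls")
```

If expected traffic is well below the break-even volume, staying on the teacher is usually the cheaper choice. Latency requirements can still justify distilling even when the cost arithmetic does not.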
Related
- Fine-Tuning. Fine-tuning continues training a pretrained model on your own examples, changing its weights to teach durable behavior, format, or domain style.
- Fine-Tune vs RAG vs Prompt vs Distill: The 2026 Decision Tree. When to reach for prompt engineering, RAG, fine-tuning, or distillation: what each actually changes, where each fails, and how to combine them.
- Synthetic Data. Synthetic data is training or eval data generated by a model rather than collected from the world, used for filling gaps, balancing classes, and bootstrapping fine-tunes.
- Quantization. Quantization shrinks a model by storing weights in lower precision (8-, 4-, even 2-bit), cutting memory and speeding inference at a small accuracy cost.
- Preparing a Fine-Tuning Dataset: Cleaning, Synthetic Data, and Eval Splits. The dataset is the model. How to build a fine-tuning dataset that works: format, curation, cleaning, synthetic augmentation, and a leak-free eval split.
- SLM (Small Language Model). A small language model is a compact LLM, roughly 1–15B parameters, that runs cheaply or locally, trading peak capability for speed and deployability.