# Distillation

> Distillation trains a smaller model to imitate a larger one, using the larger model's outputs as training data to get most of the capability at a fraction of the cost.

**Distillation is training a smaller "student" model on a larger "teacher" model's outputs, transferring most of the teacher's capability on a task into a model that's far cheaper and faster to run.**

The pattern in practice: run the frontier model over thousands of representative inputs, capture its outputs (often with reasoning included), curate the best, and [fine-tune](/glossary/fine-tuning) a small model on the result. This combines [synthetic data](/glossary/synthetic-data) generation and training in one loop. For narrow tasks (classification, extraction, routing, rewriting), a distilled small model frequently scores within a few points of the teacher on the task's own evaluation metric at a small fraction of the per-call cost, which is the whole economics of "GPT-quality at Haiku prices" on your specific workload.
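The capture-and-curate half of that loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `teacher` and `keep` are hypothetical callables standing in for a real model call and a real curation rule, and the `prompt`/`completion` field names are an assumed record layout that you would adapt to whatever your fine-tuning tool expects.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    inputs: Iterable[str],
    teacher: Callable[[str], str],      # hypothetical: wraps the large model
    keep: Callable[[str, str], bool],   # hypothetical: curation filter
) -> list[dict]:
    """Run the teacher over each input and keep only pairs that pass curation."""
    records = []
    for text in inputs:
        output = teacher(text)
        if keep(text, output):
            # Field names are an assumption; match your trainer's schema.
            records.append({"prompt": text, "completion": output})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy stand-ins so the sketch runs without any model or network access.
def toy_teacher(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "other"


def toy_keep(text: str, output: str) -> bool:
    # Drop empty inputs and any label outside the expected set.
    return bool(text.strip()) and output in {"billing", "other"}


dataset = build_distillation_set(
    ["Where is my invoice?", "   ", "Reset my password"],
    toy_teacher,
    toy_keep,
)
print(to_jsonl(dataset))
```

In a real run, the curation step is where most of the quality comes from: filtering on format validity, agreement across repeated teacher samples, or spot checks by a human are all plausible choices for `keep`.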

Its boundaries: breadth doesn't transfer (the student learns *your task*, not general intelligence), the student's quality ceiling is set by the teacher, and provider terms of service often restrict training on model outputs, so read them before collecting data. Where distillation sits against prompting, RAG, and ordinary fine-tuning is mapped in [the 2026 decision tree](/guides/mlops/finetune-vs-rag-vs-prompt).
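One way to make the breadth boundary visible is to measure how often the student matches the teacher separately for in-task inputs and for inputs outside the task. The sketch below assumes exact-match agreement, which suits label-style tasks such as classification or routing; `teacher` and `student` are hypothetical callables, and the toy versions are deliberately built to disagree off-task, so the numbers they produce say nothing about real models.

```python
from collections import defaultdict
from typing import Callable, Iterable


def agreement_by_slice(
    examples: Iterable[tuple[str, str]],   # (slice_name, input_text)
    teacher: Callable[[str], str],         # hypothetical: large model
    student: Callable[[str], str],         # hypothetical: distilled model
) -> dict[str, float]:
    """Return the fraction of inputs per slice where student == teacher."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for slice_name, text in examples:
        totals[slice_name] += 1
        if student(text) == teacher(text):
            hits[slice_name] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Toy stand-ins: the student only ever learned the billing-vs-other task.
def toy_teacher(text: str) -> str:
    lowered = text.lower()
    if "invoice" in lowered:
        return "billing"
    if "password" in lowered:
        return "account"
    return "other"


def toy_student(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "other"


held_out = [
    ("in_task", "Invoice is overdue"),
    ("in_task", "Send a copy of my invoice"),
    ("out_of_task", "Forgot my password"),
    ("out_of_task", "Password reset link expired"),
]
scores = agreement_by_slice(held_out, toy_teacher, toy_student)
print(scores)
```

Tracking the out-of-task slice alongside the in-task one gives an early signal when production traffic drifts away from the inputs the student was trained on.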

---

_Source: https://agentscamp.com/glossary/distillation — Term on AgentsCamp._
