# Synthetic Data

> Synthetic data is training or eval data generated by a model rather than collected from the world — filling gaps, balancing classes, bootstrapping fine-tunes.

**Synthetic data is data produced by a model instead of gathered from users or the world — generated examples used to train, fine-tune, or evaluate AI systems.**

It addresses the data bottleneck that dominates applied ML: real examples are scarce, expensive to label, privacy-encumbered, or missing exactly the edge cases you need. A strong LLM can generate variations from seed examples, label at scale, simulate rare scenarios, and, in the [distillation](/glossary/distillation) pattern, generate the entire training set for a smaller model. Modern [fine-tuning](/glossary/fine-tuning) pipelines often draw a large share of their data from synthetic sources, and frontier labs are widely reported to rely on model-generated data in post-training.
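As a rough illustration of generating variations from seeds, here is a minimal sketch. The names `expand_seeds` and `fake_teacher` are hypothetical, the prompt wording is an assumption, and `fake_teacher` is a stand-in for whatever LLM client you actually use; no real model API is called.

```python
from typing import Callable, Dict, List


def expand_seeds(
    seeds: List[str],
    teacher: Callable[[str], str],
    variations_per_seed: int = 3,
) -> List[Dict[str, str]]:
    """Ask a teacher model for several variations of each seed example.

    `teacher` is any callable that maps a prompt string to a completion
    string. It is an assumption of this sketch, not a real library API.
    """
    rows: List[Dict[str, str]] = []
    for seed in seeds:
        for i in range(variations_per_seed):
            prompt = (
                f"Rewrite the following request in a different style "
                f"(variation {i + 1}):\n{seed}"
            )
            # Tag every row so synthetic and real data stay distinguishable later.
            rows.append({"seed": seed, "text": teacher(prompt), "source": "synthetic"})
    return rows


def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: echoes the seed with a marker."""
    seed = prompt.split("\n", 1)[1]
    return f"[paraphrase] {seed}"


rows = expand_seeds(["Reset my password"], fake_teacher, variations_per_seed=2)
```

Recording a `source` field at generation time is a design choice of this sketch: it keeps synthetic rows traceable when they are later mixed with real data.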

The craft is quality control, because generation is cheap and *good* data isn't. Synthetic distributions are smoother and less varied than real ones, teacher mistakes propagate to the student, and repeatedly training on unfiltered model output degrades models. Production practice filters aggressively, validates against real held-out data, and never lets the eval set be synthetic-only. The applied versions live in [Preparing a Fine-Tuning Dataset](/guides/mlops/finetune-dataset-prep) and the [finetune-dataset-builder](/skills/data/finetune-dataset-builder) skill. On the eval side, synthesizing the edge cases your logs lack is a standard move in [Write Evals for an LLM App](/guides/evaluation/write-llm-evals).
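The filtering and eval-set rules above can be sketched as follows. This is a minimal sketch under stated assumptions, not a production pipeline: the function names, the `text` and `source` fields, the length thresholds, and the 50% real-data floor are all illustrative choices. Real pipelines typically add near-duplicate detection and model-based quality scoring on top.

```python
from typing import Dict, List

Row = Dict[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_synthetic(rows: List[Row], min_chars: int = 10, max_chars: int = 2000) -> List[Row]:
    """Drop rows that are too short, too long, or exact duplicates after normalization."""
    seen = set()
    kept: List[Row] = []
    for row in rows:
        key = normalize(row["text"])
        if not (min_chars <= len(key) <= max_chars):
            continue  # length filter (thresholds are assumptions)
        if key in seen:
            continue  # duplicate of an earlier row
        seen.add(key)
        kept.append(row)
    return kept


def real_fraction(eval_rows: List[Row]) -> float:
    """Share of eval rows whose `source` field is "real"."""
    if not eval_rows:
        return 0.0
    return sum(1 for r in eval_rows if r["source"] == "real") / len(eval_rows)


def check_eval_set(eval_rows: List[Row], min_real_fraction: float = 0.5) -> None:
    """Refuse an eval set that is synthetic-only or mostly synthetic."""
    frac = real_fraction(eval_rows)
    if frac < min_real_fraction:
        raise ValueError(f"eval set is only {frac:.0%} real data")


candidates = [
    {"text": "How do I reset my password?", "source": "synthetic"},
    {"text": "how do I  reset my password?", "source": "synthetic"},
    {"text": "ok", "source": "synthetic"},
    {"text": "Can I change the email on my account?", "source": "synthetic"},
]
kept = filter_synthetic(candidates)

eval_rows = [
    {"text": "My invoice shows the wrong amount", "source": "real"},
    {"text": "App crashes when I upload a photo", "source": "real"},
    {"text": "Please cancel my subscription today", "source": "synthetic"},
]
check_eval_set(eval_rows)  # passes: two of three rows are real
```

Here the second candidate is dropped as a duplicate of the first and `"ok"` fails the length filter, so two rows survive. A synthetic-only eval set would raise `ValueError` in `check_eval_set`.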

---

_Source: https://agentscamp.com/glossary/synthetic-data — Term on AgentsCamp._