# Preparing a Fine-Tuning Dataset: Cleaning, Synthetic Data, and Eval Splits

> The dataset is the model. How to build a fine-tuning dataset that works — format, curation, cleaning, synthetic augmentation, and a leak-free eval split.

In fine-tuning, the dataset is the model — quality and coverage matter far more than size. Define the exact input/output format, curate high-quality real examples, clean and deduplicate ruthlessly, augment thin spots with validated synthetic data, and hold out a representative eval split before you train. Most fine-tuning failures are dataset failures, not training failures.

The training run is the easy, mechanical part; the model's quality is decided before training starts, by what is in the data. **The dataset is the model**: the model learns exactly the distribution, format, and quality you feed it, including the mistakes. So the work is in preparation: the right format, clean and representative examples, careful augmentation, and an honest eval split.

## Quality and coverage beat size

The instinct to gather "as much data as possible" is usually wrong. A few hundred to a few thousand **clean, on-distribution** examples typically outperform tens of thousands of noisy ones, especially for parameter-efficient methods like LoRA/QLoRA. More data with errors, duplicates, or off-distribution noise doesn't help — it teaches the model the noise. Optimize for *representativeness* (does the set cover the real inputs, including the hard cases?) and *correctness*, then add volume only where evals show a gap.

## The format is a decision, not a detail

Decide the exact input→output shape the model will see **in production**, and make the training data match it precisely — same roles, same structure, same tool-call format. If you fine-tune on a format you don't serve, you optimize a shape you'll never use and the gains evaporate at inference. Settle this first; everything downstream formats to it.
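
A minimal sketch of enforcing that decision in code, assuming a chat-style record with a `messages` list of `{role, content}` objects. The schema, the role set, and the helper names `validate_record` and `validate_jsonl` are illustrative assumptions, not any particular trainer's API; swap in whatever shape you actually serve.

```python
import json

# Assumed role set; replace with the roles your production prompts use.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the assumed shape."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    # The final turn is the training target, so it should come from the assistant.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message must be from the assistant")
    return problems

def validate_jsonl(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, for invalid lines only."""
    bad = {}
    for n, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            bad[n] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            bad[n] = ["record is not a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            bad[n] = problems
    return bad
```

Running a check like this over every file before training turns "the format is a decision" into something a pipeline can fail on, rather than something discovered at inference time.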

## Clean ruthlessly

This is the unglamorous step that matters most:

- **Deduplicate** exact and *near*-duplicates — they cause memorization and silently leak into your eval split, inflating scores.
- **Fix label/answer errors** — a wrong target is worse than a missing one; the model faithfully learns the mistake.
- **Strip PII and secrets** — both a privacy obligation and a way to stop the model regurgitating sensitive strings.
- **Normalize and balance** — consistent formatting, and no single pattern so dominant it crowds out the rest.
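
The deduplication and PII steps above can be sketched as follows. This is one simple approach under stated assumptions: near-duplicates are detected with word-shingle Jaccard similarity (an O(n²) comparison, reasonable for a few thousand examples but not for millions), and PII stripping is shown for email addresses only. The threshold, the shingle size, and all function names are illustrative choices, not recommended constants.

```python
import hashlib
import re

# Deliberately simple email pattern; real PII scrubbing needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)

def shingles(text: str, n: int = 3) -> set:
    """Set of n-word windows over the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Overlap of two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def deduplicate(texts, threshold: float = 0.8) -> list:
    """Keep the first occurrence; drop exact and near-duplicates of kept items."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        current = shingles(text)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

For larger datasets, an approximate method such as MinHash with locality-sensitive hashing would replace the pairwise loop; the logic of "normalize, hash, compare, keep first" stays the same.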

## Augment with synthetic data — carefully

Where real coverage is thin (rare intents, edge cases), generate synthetic examples for the under-represented slices, often from a stronger teacher model. The discipline is to **validate synthetic data as rigorously as real data**. Unchecked, it repeats itself, narrows diversity, under-covers the tails of the distribution (the "model collapse" failure mode), and imports the teacher's errors and biases. Keep it a deliberate supplement that fills known gaps, checked for coverage and variety as well as per-example correctness, and never a bulk substitute for real examples.
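
Two set-level checks can make "coverage and variety" measurable. The sketch below is a rough heuristic, not a standard validation suite: `distinct_ngram_ratio` flags repetitive generations, and `synthetic_share_by_slice` shows where synthetic data has started to dominate. The `slice` and `synthetic` field names are hypothetical, and no particular threshold is implied; compare the numbers against your real data.

```python
from collections import Counter

def distinct_ngram_ratio(texts, n: int = 2) -> float:
    """Unique n-grams divided by total n-grams; low values suggest repetitive output."""
    total = 0
    unique = set()
    for text in texts:
        words = text.lower().split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

def synthetic_share_by_slice(examples) -> dict:
    """Fraction of synthetic examples per slice.

    Each example is assumed to be a dict with a 'slice' label and a boolean
    'synthetic' flag (both hypothetical field names).
    """
    totals = Counter()
    synthetic = Counter()
    for ex in examples:
        totals[ex["slice"]] += 1
        if ex["synthetic"]:
            synthetic[ex["slice"]] += 1
    return {name: synthetic[name] / totals[name] for name in totals}
```

A synthetic batch whose distinct n-gram ratio is well below that of the real examples in the same slice is a candidate for regeneration with more varied prompts, even if each example looks correct on its own.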

## Split for eval before you train

Carve out a representative validation/test split **before** training and guarantee it doesn't overlap the training set, near-duplicates included, since they are the most common leak. This held-out set is your only honest measure of whether the model *generalizes* instead of memorizing; it is also your overfitting detector and the basis for comparing versions. Deciding the split after you've seen results is self-deception. Wire the eval set into your [eval harness](/guides/evaluation/write-llm-evals) so every fine-tune is scored the same way.

> [!WARNING]
> Data leakage between train and eval is the silent killer of fine-tuning projects: it produces great offline numbers and a model that flops in production. Deduplicate across the *whole* dataset before splitting, and split by a stable key (e.g. source document or entity) so paraphrases of the same item can't land on both sides.
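
One way to split by a stable key is to hash the key and assign the whole group to one side. The sketch below assumes each example carries a `source_id` field (a hypothetical name for the source document or entity) and that deduplication has already been run. `assign_split`, `split_by_group`, and the `salt` parameter are illustrative.

```python
import hashlib

def assign_split(group_key: str, eval_fraction: float = 0.1, salt: str = "v1") -> str:
    """Deterministically map a group key to 'train' or 'eval'.

    The same key always lands on the same side, so every example derived
    from one source document or entity stays together.
    """
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"

def split_by_group(examples, key_field: str = "source_id", eval_fraction: float = 0.1):
    """Split examples into (train, eval) lists without separating any group."""
    train, held_out = [], []
    for ex in examples:
        side = assign_split(str(ex[key_field]), eval_fraction)
        (held_out if side == "eval" else train).append(ex)
    return train, held_out
```

Because assignment depends only on the key and the salt, adding new data later never moves existing examples across the boundary. The trade-off is that `eval_fraction` is approximate: it holds over groups in expectation, not exactly over examples.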

## Putting it together

Build the dataset as if it were the deliverable, because it is: settle the production-matching **format**, **curate** representative real examples, **clean and dedup** without mercy, **augment** thin spots with validated synthetic data, and reserve a **leak-free eval split** before training. Then format to the trainer's schema, validate, and version it.
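
For the final "version it" step, a content hash is a lightweight option. This sketch assumes the dataset is serialized as JSONL; the manifest layout and the names `to_jsonl_bytes` and `dataset_manifest` are hypothetical, not a standard format.

```python
import hashlib
import json

def to_jsonl_bytes(records) -> bytes:
    """Serialize records as JSONL with sorted keys so the bytes are reproducible."""
    lines = [json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records]
    return ("\n".join(lines) + "\n").encode("utf-8")

def dataset_manifest(train, held_out) -> dict:
    """Record example counts and content hashes for both splits."""
    manifest = {}
    for name, records in (("train", train), ("eval", held_out)):
        payload = to_jsonl_bytes(records)
        manifest[name] = {
            "count": len(records),
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
    return manifest
```

Storing the manifest next to each training run makes it possible to tell, later, exactly which data produced which model.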

The [Fine-Tune Dataset Builder](/skills/data/finetune-dataset-builder) skill automates the cleaning, dedup, formatting, and splitting; the [finetuning-engineer](/agents/data-ai/finetuning-engineer) takes the prepared dataset through training and evaluation; and the [QLoRA Fine-Tune Runner](/skills/data/qlora-finetune-runner) runs the training itself.

---

_Source: https://agentscamp.com/guides/mlops/finetune-dataset-prep — Guide on AgentsCamp._
