Finetune Dataset Builder

Builds a training-ready fine-tuning dataset from raw examples: normalizes to the trainer's chat/instruction format, deduplicates (including near-duplicates), strips PII, balances, validates schema and token lengths, and carves a leak-free eval split — because the dataset is the model, and most fine-tuning failures are dataset failures.

This skill treats building the dataset as the real work, not a preprocessing afterthought. It takes raw examples and produces a clean, correctly formatted, deduplicated dataset with a leak-free eval split, ready to hand to a trainer. Get this right and the training run is largely mechanical; get it wrong and no amount of hyperparameter tuning saves the result.

When to use this skill

You have raw examples (logs, labeled pairs, exported conversations) and need them formatted, cleaned, and split before fine-tuning.
An existing dataset gave a disappointing fine-tune and you suspect duplicates, leakage, PII, or off-distribution noise.
Standing up a repeatable dataset pipeline so each fine-tune is reproducible.

Instructions

Fix the target format first. Determine the trainer's expected schema (commonly JSONL chat records with system/user/assistant roles, or instruction-response pairs) and confirm that it matches how the model is called in production. Normalize every example to that exact shape: the training format must mirror the inference format.
Deduplicate, including near-duplicates. Remove exact duplicates and fuzzy/near-duplicates (for example, by comparing normalized text or embedding similarity). Near-duplicates are a common cause of memorization and a silent source of leakage that inflates eval scores, so be aggressive here.
Clean and correct. Fix label/answer errors, drop malformed records, normalize whitespace/formatting, and strip PII and secrets. A wrong target teaches the wrong thing; sensitive strings risk being memorized and regurgitated.
Balance and check coverage. Make sure no single pattern or class dominates, and that the set covers the real input distribution including edge cases. Flag thin slices that may need real or validated synthetic examples (see Preparing a Fine-Tuning Dataset).
Validate the schema and token lengths. Confirm every record parses against the schema and fits within the model's context length; quarantine the ones that don't rather than silently truncating.
Carve a leak-free split. Split into train/validation (and test) by a stable key (source document, entity, or user) so paraphrases of the same item can't land on both sides, and deduplicate across the split boundary. Report the split sizes and the dedup/cleaning counts so the dataset is auditable.
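
The validation and deduplication steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: it assumes chat-style records with a `messages` list, uses word 3-gram Jaccard similarity as a stand-in for whatever near-duplicate measure you prefer (embedding similarity would slot in the same way), and counts tokens by whitespace unless you pass in your tokenizer. The function names, the 0.8 threshold, and the 2048-token limit are all hypothetical choices.

```python
import hashlib
import re
import unicodedata

ROLES = {"system", "user", "assistant"}  # assumed chat schema


def normalize_text(text):
    """Lowercase, Unicode-normalize, drop punctuation, collapse whitespace (comparison only)."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Set of word n-grams over the normalized text."""
    words = normalize_text(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def record_text(record):
    return "\n".join(m["content"] for m in record["messages"])


def is_valid(record, max_tokens, count_tokens):
    """Schema check plus length check; no truncation is attempted."""
    msgs = record.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for m in msgs:
        if not isinstance(m, dict) or m.get("role") not in ROLES:
            return False
        if not isinstance(m.get("content"), str):
            return False
    if msgs[-1]["role"] != "assistant":  # assumed: each record ends with the target
        return False
    return count_tokens(record_text(record)) <= max_tokens


def clean(records, max_tokens=2048, threshold=0.8,
          count_tokens=lambda s: len(s.split())):
    """Return (kept, quarantined, stats). Near-dup check is O(n^2): fine for a sketch."""
    kept, quarantined, kept_shingles = [], [], []
    seen_hashes = set()
    stats = {"input": 0, "quarantined": 0, "exact_dupes": 0, "near_dupes": 0}
    for rec in records:
        stats["input"] += 1
        if not is_valid(rec, max_tokens, count_tokens):
            quarantined.append(rec)  # set aside for review instead of truncating
            stats["quarantined"] += 1
            continue
        text = record_text(rec)
        digest = hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            stats["exact_dupes"] += 1
            continue
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            stats["near_dupes"] += 1
            continue
        seen_hashes.add(digest)
        kept.append(rec)
        kept_shingles.append(sh)
    stats["kept"] = len(kept)
    return kept, quarantined, stats
```

The pairwise comparison is quadratic in the number of kept records, so for large datasets you would likely swap in MinHash/LSH or an embedding index; the returned `stats` dictionary is what feeds the audit report.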

WARNING

Split by a stable key, not by random row. Random splitting lets near-duplicates and paraphrases of the same underlying item appear in both train and eval — leakage that produces beautiful offline numbers and a model that fails in production.
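
One way to split by a stable key is to hash the key and use the hash to pick the split, so every record sharing a key lands on the same side regardless of row order or dataset growth. The sketch below assumes each record carries a `source_id` field; that field name, the `salt` value, and the 10%/10% fractions are illustrative assumptions rather than anything a particular trainer requires.

```python
import hashlib


def assign_split(key, val_fraction=0.1, test_fraction=0.1, salt="v1"):
    """Map a stable key to 'train', 'validation', or 'test' deterministically."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


def split_by_key(records, key_field="source_id", **kwargs):
    """Group records into splits; all records with the same key share a split."""
    splits = {"train": [], "validation": [], "test": []}
    for rec in records:
        splits[assign_split(str(rec[key_field]), **kwargs)].append(rec)
    return splits
```

Because assignment depends only on the key and the salt, rerunning the pipeline on a larger export keeps existing items in their original split. The realized split sizes only approximate the requested fractions, especially with few distinct keys, so report the actual counts.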

TIP

Version the output dataset (and record the cleaning/dedup counts and split keys). Reproducibility is what lets you attribute a fine-tune's quality to a specific dataset and iterate deliberately instead of guessing.
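
A lightweight way to version a dataset is to fingerprint its contents and store that fingerprint next to the cleaning counts and split key. The following is a sketch under assumptions: the manifest layout and the function names are hypothetical, and the fingerprint is deliberately order-independent so that reshuffling rows does not register as a new version.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Order-independent SHA-256 over the canonical JSON form of each record."""
    line_hashes = sorted(
        hashlib.sha256(
            json.dumps(r, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for r in records
    )
    return hashlib.sha256("\n".join(line_hashes).encode("utf-8")).hexdigest()


def build_manifest(splits, cleaning_counts, split_key, split_salt):
    """Collect what is needed to reproduce and audit this dataset version."""
    return {
        "split_key": split_key,
        "split_salt": split_salt,
        "cleaning_counts": dict(cleaning_counts),
        "splits": {
            name: {"records": len(recs), "sha256": dataset_fingerprint(recs)}
            for name, recs in splits.items()
        },
    }
```

Recording the manifest alongside each training run makes it possible to tell whether two fine-tunes actually saw the same data.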

Output

A training-ready dataset: normalized to the trainer's format, deduplicated and cleaned (with PII stripped), balanced, schema- and length-validated, and split by a stable key into leak-free train/validation/test files — plus a short report of record counts, duplicates removed, and split sizes so the dataset is auditable and reproducible.
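
As a rough illustration of that output, the sketch below writes one JSONL file per split and a `report.json` beside them. The file names, the directory layout, and the shape of the report are assumptions made for the example; adapt them to whatever your trainer expects.

```python
import json
from pathlib import Path


def write_dataset(splits, report, out_dir):
    """Write <split>.jsonl for each split plus report.json; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, records in splits.items():
        path = out / f"{name}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for rec in records:
                # One JSON object per line, keeping non-ASCII text readable.
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
        paths[name] = path
    # Add the realized split sizes to whatever counts the caller collected.
    full_report = {**report, "split_sizes": {n: len(r) for n, r in splits.items()}}
    report_path = out / "report.json"
    report_path.write_text(
        json.dumps(full_report, indent=2, sort_keys=True), encoding="utf-8"
    )
    paths["report"] = report_path
    return paths
```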
