DPO (Direct Preference Optimization)
DPO aligns a model to preferences directly from chosen-vs-rejected pairs — no reward model, no RL loop — simpler and more stable than classic RLHF.
DPO (Direct Preference Optimization) trains a model on preference pairs (this response was chosen, that one rejected) directly through a supervised-style loss. It targets the same objective as classic RLHF without training a reward model or running reinforcement learning.
The 2023 insight behind it: under the standard KL-regularized RLHF objective, the optimal policy has a closed form, so the reward can be rewritten in terms of the policy itself and the preference loss can be optimized directly on preference data. In practice that deleted the hardest parts of alignment (the separate reward model and the notoriously twitchy PPO loop) and replaced them with something that trains like ordinary fine-tuning. The cost is that the simplicity gives up some ceiling. Frontier labs still run full RL pipelines (now increasingly against verifiable rewards, not just preferences); DPO and its descendants own the broad middle: open-weights post-training, domain alignment, behavioral polish.
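As a sketch of that closed-form step, following the standard DPO derivation: the notation below is assumed for illustration and does not come from this page. Here $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is a frozen reference model (typically the SFT checkpoint), $\beta$ sets the strength of the KL penalty, and $(x, y_w, y_l)$ is a prompt with its chosen and rejected responses.

```latex
% KL-regularized RLHF objective for a reward r(x, y)
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)
\Big]

% Its optimum has a closed form, with Z(x) a normalizing constant
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)

% Solving for the reward expresses it through the policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

% Under a Bradley-Terry preference model,
% p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big),
% so Z(x) cancels and the negative log-likelihood becomes the DPO loss
\mathcal{L}_{\text{DPO}}(\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```

The last line is an ordinary differentiable loss over preference pairs, which is why no explicit reward model or sampling loop appears in it.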
For practitioners, DPO is the reachable rung: after supervised fine-tuning, a few thousand chosen/rejected pairs, curated with the discipline the dataset guide describes, teach preferences a prompt can't reliably hold. It is the difference between asking for a style and baking it in.
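A minimal sketch of how such pairs might be assembled, assuming two candidate responses per prompt and some scoring function that decides which is better. The names `build_pair`, `prefers_valid_json`, and `to_jsonl` are hypothetical, and the `prompt`/`chosen`/`rejected` field names are a common convention assumed here; check what your training library expects.

```python
import json
from typing import Callable, Iterable, Optional


def build_pair(
    prompt: str,
    candidate_a: str,
    candidate_b: str,
    score: Callable[[str, str], float],
) -> Optional[dict]:
    """Turn two candidates into one preference record, or None on a tie."""
    score_a = score(prompt, candidate_a)
    score_b = score(prompt, candidate_b)
    if score_a == score_b:
        # A tie carries no preference signal, so the pair is dropped.
        return None
    if score_a > score_b:
        chosen, rejected = candidate_a, candidate_b
    else:
        chosen, rejected = candidate_b, candidate_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def prefers_valid_json(prompt: str, response: str) -> float:
    """Toy scorer (hypothetical): 1.0 if the response parses as JSON, else 0.0."""
    try:
        json.loads(response)
        return 1.0
    except ValueError:
        return 0.0


def to_jsonl(pairs: Iterable[dict]) -> str:
    """Serialize preference records as one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


# Usage: one informative pair is kept, one tied pair is discarded.
candidates = [
    ("Return the user as JSON.", '{"name": "Ada"}', "The user is Ada."),
    ("Return the user as JSON.", "Ada", "The user is Ada."),
]
pairs = []
for prompt, a, b in candidates:
    pair = build_pair(prompt, a, b, prefers_valid_json)
    if pair is not None:
        pairs.append(pair)
print(to_jsonl(pairs))
```

In practice the scorer would be a human rater or a stronger judging process instead of a parsing check; dropping ties is one simple curation rule among several possible ones.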
Frequently asked questions
- How is DPO different from RLHF?
- RLHF is two systems: train a reward model on preference rankings, then run reinforcement learning against it. DPO collapses that into one supervised-style step: a loss function that directly raises the probability of chosen responses over rejected ones, measured relative to a frozen reference model. Same preference data, no reward model, no PPO instability.
- When would I use DPO myself?
- When you are fine-tuning an open-weights model on behavior with a quality gradient (tone, format adherence, refusal style) and can collect chosen/rejected pairs, often by generating two candidates for the same prompt and picking the better one. It's the accessible alignment step after SFT: standard libraries support it, and it trains like ordinary fine-tuning.
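The loss mentioned in the answers above, the one that raises chosen responses over rejected ones, can be sketched in plain Python. This is an illustration under stated assumptions, not any library's implementation: the function names are hypothetical, each input is assumed to be the summed token log-probability of a full response given its prompt, and `beta=0.1` is only a commonly seen starting value.

```python
import math
from typing import Sequence


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), i.e. log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Mean DPO loss over a batch of preference pairs.

    policy_* are log-probabilities under the model being trained;
    ref_* are log-probabilities under the frozen reference model.
    """
    if len(policy_chosen) == 0:
        raise ValueError("batch must contain at least one pair")
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        chosen_logratio = pc - rc      # how far the policy moved toward the chosen response
        rejected_logratio = pr - rr    # how far it moved toward the rejected response
        margin = beta * (chosen_logratio - rejected_logratio)
        losses.append(neg_log_sigmoid(margin))
    return sum(losses) / len(losses)


# Usage: a policy identical to the reference has zero margin on every pair.
untrained = dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])
# A policy that shifted probability toward the chosen response scores lower.
improved = dpo_loss([-8.0], [-14.0], [-10.0], [-12.0])
print(untrained, improved)
```

In a real training run the same arithmetic would be applied to tensors so gradients flow through the policy log-probabilities, while the reference log-probabilities stay fixed.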
Related
- RLHF (Reinforcement Learning from Human Feedback): RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.
- Fine-Tuning: Fine-tuning continues training a pretrained model on your own examples, changing its weights to teach durable behavior, format, or domain style.
- Constitutional AI: Constitutional AI trains models against written principles. The model critiques and revises its own outputs by them, reducing reliance on human labels.
- Preparing a Fine-Tuning Dataset: Cleaning, Synthetic Data, and Eval Splits. The dataset is the model. How to build a fine-tuning dataset that works: format, curation, cleaning, synthetic augmentation, and a leak-free eval split.