# DPO (Direct Preference Optimization)

> DPO aligns a model to preferences directly from chosen-vs-rejected pairs — no reward model, no RL loop — simpler and more stable than classic RLHF.

**DPO (Direct Preference Optimization) trains a model on preference pairs — this response was chosen, that one rejected — directly through a supervised-style loss, achieving what classic [RLHF](/glossary/rlhf) does without training a reward model or running reinforcement learning.**

The 2023 insight behind it: the KL-constrained RLHF objective has a closed-form optimal policy, so the reward can be written in terms of the policy itself and the whole objective optimized directly on preference data. In practice that deleted the hardest parts of alignment (the separate reward model, the notoriously twitchy PPO loop) and replaced them with something that trains like ordinary [fine-tuning](/glossary/fine-tuning). The cost: that simplicity comes with a somewhat lower ceiling. Frontier labs still run full RL pipelines (now increasingly against *verifiable* rewards, not just preferences); DPO and its descendants own the broad middle: open-weights post-training, domain alignment, behavioral polish.
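To make the "supervised-style loss" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed token log-probabilities of each full response under the model being trained (the policy) and under a frozen reference copy, typically the supervised fine-tuned checkpoint. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a whole response.
    `beta` controls how strongly the policy is held near the reference
    (0.1 is an assumed, illustrative default).
    """
    # How much more the policy likes each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy has shifted toward the chosen response.
    x = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss is log(2).
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss depends only on log-probabilities the two models already produce, which is why no separate reward model or sampling loop is needed. A real training run would compute the same quantity on batched tensors and backpropagate through the policy terms only.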

For practitioners, DPO is the reachable rung: after supervised fine-tuning, a few thousand chosen/rejected pairs (curated with the discipline described in [the dataset guide](/guides/mlops/finetune-dataset-prep)) teach preferences that a prompt can't reliably hold. It is the difference between asking for a style and *baking it in*.
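As a rough illustration of what a chosen/rejected pair looks like as data, the sketch below defines a hypothetical `PreferencePair` record and a basic hygiene pass. The field names `prompt`, `chosen`, and `rejected` follow a common convention but are an assumption here, and the checks (non-empty fields, distinct responses, no exact duplicates) are only a starting point, not a full curation process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt plus a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def is_usable(pair):
    """Reject pairs with an empty field or with identical responses."""
    prompt, chosen, rejected = pair.prompt.strip(), pair.chosen.strip(), pair.rejected.strip()
    return bool(prompt) and bool(chosen) and bool(rejected) and chosen != rejected


def clean_pairs(pairs):
    """Keep usable pairs, dropping exact duplicates while preserving order."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair.prompt.strip(), pair.chosen.strip(), pair.rejected.strip())
        if is_usable(pair) and key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


# Hypothetical raw data: one good pair, its duplicate, a no-signal pair, an empty one.
raw = [
    PreferencePair("Summarize the report.", "Revenue rose 4%; costs were flat.", "The report is about things."),
    PreferencePair("Summarize the report.", "Revenue rose 4%; costs were flat.", "The report is about things."),
    PreferencePair("Explain DPO.", "Same answer.", "Same answer."),
    PreferencePair("Explain RLHF.", "", "Something."),
]
cleaned = clean_pairs(raw)
```

A pair whose two responses are identical carries no preference signal, so it is dropped along with duplicates and incomplete records.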

---

_Source: https://agentscamp.com/glossary/dpo — Term on AgentsCamp._