# RLHF (Reinforcement Learning from Human Feedback)

> RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.

**RLHF (reinforcement learning from human feedback) is the post-training technique that aligns a model with human preferences: humans rank candidate outputs, a reward model learns those rankings, and the LLM is optimized — via reinforcement learning — to score highly.**

It's the stage that made modern assistants possible: pretraining teaches language and knowledge; RLHF teaches *behavior* — follow instructions, be helpful, refuse harm, format sanely. The classic pipeline (preference data → reward model → PPO optimization) is heavyweight, which spawned a family of successors: [DPO](/glossary/dpo) optimizes on preferences directly without a separate reward model; RLAIF and [Constitutional AI](/glossary/constitutional-ai) substitute AI feedback guided by principles for armies of human raters; and the [reasoning-model](/glossary/reasoning-model) era extended RL beyond preferences to *verifiable rewards* (did the math check out, did the code pass) — arguably the most consequential RL development since.

For practitioners the takeaway is interpretive: model quirks like sycophancy and over-hedging are RLHF artifacts — the model optimizing for approval — worth remembering when [designing evals](/guides/evaluation/write-llm-evals) that measure truth rather than likability.

---

_Source: https://agentscamp.com/glossary/rlhf — Term on AgentsCamp._