
RLHF (Reinforcement Learning from Human Feedback)

RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.

Updated Jun 12, 2026
rlhf, alignment, training, post-training

RLHF (reinforcement learning from human feedback) is the post-training technique that aligns a model with human preferences: humans rank candidate outputs, a reward model learns those rankings, and the LLM is optimized, via reinforcement learning, to score highly under that reward model.
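To make "a reward model learns those rankings" concrete, here is a minimal sketch of one common formulation, a Bradley-Terry style pairwise loss. The function name and the scalar rewards are illustrative assumptions; a real reward model would produce these scores from a neural network, and this page does not specify which loss any particular system uses.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    Low when the reward model scores the human-preferred response higher,
    high when it ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for a (preferred, rejected) pair of responses.
agrees_with_rater = pairwise_preference_loss(2.0, 0.0)
disagrees_with_rater = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many ranked pairs pushes the reward model to assign higher scores to the responses people preferred.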

It's the stage that made modern assistants possible: pretraining teaches language and knowledge; RLHF teaches behavior (follow instructions, be helpful, refuse harm, format sanely). The classic pipeline (preference data → reward model → PPO optimization, where PPO is Proximal Policy Optimization) is heavyweight, which spawned a family of successors. DPO (Direct Preference Optimization) optimizes on preferences directly, without a separate reward model. RLAIF and Constitutional AI substitute principle-guided AI feedback for armies of human raters. And the reasoning-model era extended RL beyond preferences to verifiable rewards (did the math check out, did the code pass), arguably the most consequential RL development since RLHF itself.
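As a rough illustration of how DPO skips the separate reward model, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function and argument names, the sample log-probabilities, and the `beta` value of 0.1 are assumptions made for the example, not values taken from this page.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability a model assigns to a response.
    The frozen reference model anchors the policy; beta scales how strongly
    the policy is pushed away from it.
    """
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    return softplus(-margin)  # same as -log(sigmoid(margin))


# Policy identical to the reference: no preference learned yet.
untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference and lowers the rejected one's, so the preference data acts on the policy directly rather than through a learned reward score.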

For practitioners the takeaway is interpretive: model quirks like sycophancy and over-hedging are RLHF artifacts, the result of a model optimizing for approval. That is worth remembering when designing evals that measure truth rather than likability.
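One way such an eval might look, offered as a hypothetical sketch rather than an established benchmark: ask a factual question, push back on the model's answer, and count how often a correct answer is abandoned. The `PushbackTrial` record and `sycophantic_flip_rate` function are invented names for illustration, and the trial data is made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PushbackTrial:
    correct_before: bool  # was the first answer factually correct?
    correct_after: bool   # was the answer still correct after the user disagreed?


def sycophantic_flip_rate(trials: List[PushbackTrial]) -> float:
    """Fraction of initially correct answers that became wrong after pushback."""
    initially_correct = [t for t in trials if t.correct_before]
    if not initially_correct:
        return 0.0  # nothing to flip; avoid dividing by zero
    flips = sum(1 for t in initially_correct if not t.correct_after)
    return flips / len(initially_correct)


# Made-up results: three answers started correct, one caved under pushback.
example_trials = [
    PushbackTrial(correct_before=True, correct_after=True),
    PushbackTrial(correct_before=True, correct_after=False),
    PushbackTrial(correct_before=False, correct_after=False),
    PushbackTrial(correct_before=True, correct_after=True),
]
flip_rate = sycophantic_flip_rate(example_trials)
```

Because the score depends only on factual correctness before and after disagreement, it rewards holding a true answer rather than pleasing the user.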

Frequently asked questions

What does RLHF actually change about a model?
It shapes behavior, not knowledge: after pretraining (next-token prediction over internet-scale text) and instruction tuning, RLHF optimizes the model toward responses humans prefer: helpful, honest, harmless, well-formatted. It's the stage that turned raw text predictors into usable assistants; ChatGPT's 2022 breakthrough was substantially an RLHF story.
What are RLHF's known weaknesses?
It optimizes for what raters PREFER, which isn't always what's TRUE. The result is sycophancy, over-hedging, and reward hacking (gaming the reward model's blind spots). It's also expensive and unstable to run, which is why simpler preference methods like DPO and AI-feedback variants (RLAIF, Constitutional AI) took over much of the work.
