Constitutional AI
Constitutional AI trains models against written principles: the model critiques and revises its own outputs against them, reducing reliance on human labels.
Constitutional AI (CAI) is Anthropic's alignment technique: instead of relying purely on human raters, the model is trained against an explicit written constitution, critiquing and revising its own outputs against those principles, and is then optimized with AI feedback on which responses follow them best.
CAI addressed two problems in classic RLHF at once. Scale: human preference labels are expensive and inconsistent; CAI substitutes AI-generated feedback guided by principles (RLAIF, reinforcement learning from AI feedback), producing alignment data at far lower cost. Transparency: RLHF encodes values implicitly in rater behavior; a constitution states them as text anyone can read, with principles drawing on sources from the UN Universal Declaration of Human Rights to practical harmlessness criteria, which makes "what is this model aligned to?" an answerable question. The technique shaped Claude's character and influenced industry-wide adoption of AI-feedback methods.
For builders, CAI matters as background and as pattern: background, because it explains behavioral texture in the models you use; pattern, because principles-as-explicit-text recurs at the application layer — rules engines like NeMo Guardrails and policy-based guardrails are the same move at runtime, and writing your app's "constitution" (what it must never do, stated plainly) is the first step of every serious safety review.
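As a rough illustration of the application-layer pattern, the sketch below writes an app's "constitution" as plain-language rules, each paired with a crude runtime check. Everything here is hypothetical: the `Rule` type, the rule names, the regular expressions, and the `violations` helper are assumptions made for this example, not the API of NeMo Guardrails or any other library, and real checks would need to be far more robust than a regex.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    name: str
    statement: str       # the plain-language principle a reviewer can read
    pattern: re.Pattern  # a crude illustrative check, not production-grade


# Hypothetical rules for an imaginary support bot.
APP_CONSTITUTION = [
    Rule(
        "no_card_numbers",
        "Never output anything that looks like a payment card number.",
        re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    ),
    Rule(
        "no_medical_dosage",
        "Never state a medication dosage.",
        re.compile(r"\b\d+\s?mg\b", re.IGNORECASE),
    ),
]


def violations(text: str, rules: List[Rule] = APP_CONSTITUTION) -> List[str]:
    """Return the names of all rules whose pattern matches the text."""
    return [rule.name for rule in rules if rule.pattern.search(text)]


if __name__ == "__main__":
    for reply in ["Take 200 mg twice a day.", "Happy to help with your order."]:
        print(reply, "->", violations(reply))
```

The point of the design is that each `statement` is the reviewable text, while the check attached to it can be swapped for something stronger (a classifier, a second model call) without changing what the rule says.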
Frequently asked questions
- How does Constitutional AI work?
- Two phases, both anchored to an explicit list of principles. First, the model generates responses, critiques them against the constitution, and revises — producing self-improved training data. Second, preference optimization uses AI feedback (which response better follows the principles?) instead of armies of human raters — RLAIF. The constitution makes the values inspectable text rather than implicit rater behavior.
- Why does it matter that the principles are written down?
- Transparency and scalability. RLHF's values live implicitly in thousands of rater judgments — unauditable and expensive. A constitution is a document: you can read what the model is being aligned to, debate it, and revise it. Anthropic later extended the idea with Collective Constitutional AI, drafting principles with public input.
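The two phases described in the answers above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Model` is a hypothetical stand-in for any prompt-in, text-out call, `toy_model` is a deterministic stub so the code runs without a real model, the prompt wording is invented, and the two principles are illustrative examples rather than text from an actual constitution.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for any text-generation call: prompt in, text out.
Model = Callable[[str], str]

# Illustrative principles only, not quoted from a real constitution.
CONSTITUTION = [
    "Prefer the response that is least likely to cause harm.",
    "Prefer the response that is most honest.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Phase 1: draft a response, then critique and revise it once per principle."""
    draft = model(prompt)
    revised = draft
    for principle in principles:
        critique = model(
            f"Critique this response against the principle.\n"
            f"Principle: {principle}\nResponse: {revised}"
        )
        revised = model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {revised}"
        )
    return draft, revised


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def ai_preference(model: Model, prompt: str, a: str, b: str, principle: str) -> PreferencePair:
    """Phase 2: ask the model which of two responses better follows a principle."""
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        f"Which response better follows the principle? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


def toy_model(prompt: str) -> str:
    """Deterministic stub so the sketch runs without any real model."""
    if prompt.startswith("Critique"):
        return "The response could be safer."
    if prompt.startswith("Rewrite"):
        return "revised response"
    if prompt.startswith("Principle:"):
        return "B"
    return "draft response"


if __name__ == "__main__":
    draft, revised = critique_and_revise(toy_model, "Example question", CONSTITUTION)
    print(ai_preference(toy_model, "Example question", draft, revised, CONSTITUTION[0]))
```

In a real pipeline, the revised responses from phase 1 would become supervised fine-tuning data, and the `PreferencePair` records from phase 2 would feed a reward model or a direct preference-optimization step; both of those training stages are outside the scope of this sketch.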
Related
- RLHF (Reinforcement Learning from Human Feedback): RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.
- DPO (Direct Preference Optimization): DPO aligns a model to preferences directly from chosen-vs-rejected pairs (no reward model, no RL loop), making it simpler and more stable than classic RLHF.
- Guardrails: Guardrails are programmatic checks around an LLM that validate inputs and outputs in code, enforcing safety and format rules a prompt alone can't guarantee.
- Jailbreak: A jailbreak is a prompt crafted to bypass a model's safety training and policies, making it produce output it was trained to refuse.
- Frontier Model: A frontier model is one of the most capable AI models available: the leading edge from labs like Anthropic, OpenAI, and Google, defining the state of the art.