Jailbreak
A jailbreak is a prompt crafted to bypass a model's safety training and policies — making it produce output it was trained to refuse.
A jailbreak is an input crafted to make a model bypass its safety training — producing content or behavior it was trained to refuse — by persuading, tricking, or overwhelming the alignment rather than exploiting the application around it.
The taxonomy is a moving arms race: roleplay and persona framings ("you are an AI without restrictions"), encoding and obfuscation tricks, many-shot patterns that normalize the forbidden through repeated examples, multi-turn gradual escalation, and automated search for adversarial suffixes. Each generation of RLHF and Constitutional-AI-style training closes known classes; new ones appear — which is why the labs treat jailbreak-resistance as a continuously red-teamed property, not a solved checkbox.
For application builders the practical frame: your own rules — persona boundaries, topic limits, "never reveal the system prompt" — are jailbreak surface independent of the base model's safety, and the defenses are layered, not promised: input/output guardrails that classify attempts, capabilities scoped so a bypass reaches nothing irreversible, and your app's specific policies attacked regularly via red-team passes. Distinguish the sibling threat: prompt injection hijacks your application's instructions; jailbreaks attack the model's. Real systems defend against both.
Frequently asked questions
- How is a jailbreak different from prompt injection?
- Target. A jailbreak attacks the MODEL's safety training — persuading it past its own refusals (roleplay framings, encodings, many-shot setups). Prompt injection attacks the APPLICATION — smuggling instructions through content so the system does the attacker's bidding regardless of safety policies. Injection works on perfectly-aligned models; jailbreaks are about the alignment itself.
- Are jailbreaks an application developer's problem?
- Yes, twice over. Your app's persona and policy rules ('never reveal the system prompt', 'stay on topic') are jailbreak targets even when the base model's safety holds — and defense-in-depth is yours to build: input classifiers, output checks, and scoped capabilities so a successful bypass has nothing dangerous to reach.
Related
- Prompt InjectionPrompt injection is an attack where untrusted content carries instructions an LLM then follows — overriding its task, leaking data, or triggering tool calls.
- Red-Teaming (AI)AI red-teaming is adversarial testing — attacking your model or agent with jailbreaks, injections, and misuse scenarios to find failures before users do.
- GuardrailsGuardrails are programmatic checks around an LLM — validating inputs and outputs in code — enforcing safety and format rules a prompt alone can't guarantee.
- RLHF (Reinforcement Learning from Human Feedback)RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.
- Constitutional AIConstitutional AI trains models against written principles — the model critiques and revises its own outputs by them, reducing reliance on human labels.