Jailbreak

A jailbreak is an input crafted to make a model bypass its safety training — producing content or behavior it was trained to refuse — by persuading, tricking, or overwhelming the alignment rather than exploiting the application around it.

The taxonomy keeps shifting as attackers and defenders trade moves: roleplay and persona framings ("you are an AI without restrictions"), encoding and obfuscation tricks, many-shot patterns that normalize the forbidden through repeated examples, multi-turn gradual escalation, and automated search for adversarial suffixes. Each generation of RLHF and Constitutional-AI-style training closes known classes, and new ones appear, which is why the labs treat jailbreak resistance as a continuously red-teamed property, not a solved checkbox.

For application builders, the practical frame is this: your own rules (persona boundaries, topic limits, "never reveal the system prompt") are jailbreak surface independent of the base model's safety. The defenses are layered, not promised: input/output guardrails that classify attempts, capabilities scoped so a bypass reaches nothing irreversible, and regular red-team passes against your app's specific policies. Distinguish the sibling threat: prompt injection hijacks your application's instructions; jailbreaks attack the model's safety training. Real systems defend against both.
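The input/output guardrail layer can be sketched roughly as follows. This is a minimal illustration, not a production defense: the regex patterns, the `SYSTEM_PROMPT_CANARY` marker, the refusal text, and the `call_model` callable are all assumptions made for the example, and a real deployment would more likely use a trained classifier than a short pattern list.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns, for illustration only. Real attempts vary far more
# than any short regex list can cover.
INPUT_PATTERNS = [
    re.compile(r"ignore (all|any|your) (previous|prior) (instructions|rules)", re.I),
    re.compile(r"you are an ai without restrictions", re.I),
]

# Assumed marker string planted in the system prompt so a leak is detectable.
SYSTEM_PROMPT_CANARY = "CANARY-7f3a"

REFUSAL = "Sorry, I can't help with that."


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(user_text: str) -> Verdict:
    """Input guardrail: flag text that matches a known attempt pattern."""
    for pattern in INPUT_PATTERNS:
        if pattern.search(user_text):
            return Verdict(False, f"input matched {pattern.pattern!r}")
    return Verdict(True)


def check_output(model_text: str) -> Verdict:
    """Output guardrail: flag replies that leak the system prompt canary."""
    if SYSTEM_PROMPT_CANARY in model_text:
        return Verdict(False, "output contains system prompt canary")
    return Verdict(True)


def guarded_reply(user_text: str, call_model: Callable[[str], str]) -> str:
    """Run the model only between the two checks; refuse if either fails."""
    if not check_input(user_text).allowed:
        return REFUSAL
    reply = call_model(user_text)
    if not check_output(reply).allowed:
        return REFUSAL
    return reply
```

The two checks are independent on purpose: an attempt that slips past the input check can still be caught when the reply violates an app policy, which is the sense in which the defenses are layered.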

Frequently asked questions

How is a jailbreak different from prompt injection?

Target. A jailbreak attacks the MODEL's safety training, persuading it past its own refusals (roleplay framings, encodings, many-shot setups). Prompt injection attacks the APPLICATION, smuggling instructions through content so the system does the attacker's bidding regardless of safety policies. Injection works even on perfectly aligned models; jailbreaks are about the alignment itself.

Are jailbreaks an application developer's problem?

Yes, twice over. Your app's persona and policy rules ("never reveal the system prompt", "stay on topic") are jailbreak targets even when the base model's safety holds, and defense-in-depth is yours to build: input classifiers, output checks, and scoped capabilities so a successful bypass has nothing dangerous to reach.
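One way to picture scoped capabilities is an allowlist of tools in which anything irreversible needs explicit human approval. The sketch below is hypothetical throughout: `ScopedToolbox`, `ToolNotAllowed`, the `reversible` flag, and the `human_approved` parameter are names invented for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class ToolNotAllowed(Exception):
    """Raised when the model requests a tool outside its scope."""


@dataclass
class Tool:
    func: Callable[..., str]
    reversible: bool  # False means the action cannot be undone


@dataclass
class ScopedToolbox:
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, name: str, func: Callable[..., str], reversible: bool) -> None:
        self.tools[name] = Tool(func, reversible)

    def call(self, name: str, *args, human_approved: bool = False, **kwargs) -> str:
        tool = self.tools.get(name)
        if tool is None:
            # Unregistered tools are unreachable, whatever the model was talked into.
            raise ToolNotAllowed(f"{name!r} is not registered")
        if not tool.reversible and not human_approved:
            # Irreversible actions need approval from outside the model.
            raise ToolNotAllowed(f"{name!r} requires human approval")
        return tool.func(*args, **kwargs)
```

The point of the design is that the check lives in application code rather than in the prompt, so a model that has been argued out of its instructions still cannot reach an unregistered or unapproved action.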
