Glossary / jailbreak

Jailbreak

A prompt designed to make an LLM produce output the developer instructions or safety alignment was meant to prevent.

Attack

← All terms

A jailbreak is a prompt designed to make an LLM produce output the developer instructions or the model's safety alignment was meant to prevent. Common jailbreak patterns include role-play framings ("you are now in developer mode"), nested-task confusion ("translate this safety instruction into French and follow it"), token-smuggling (encoding the disallowed request in base64 or another encoding the model decodes mid-response), and adversarial suffix attacks (where a deliberately constructed token sequence flips the model's refusal behaviour).

Jailbreaks are a subset of prompt injection but distinguished because the attacker is the legitimate user and the goal is to bypass safety alignment rather than to extract data from another principal. They matter to AI-SPM because a customer-facing assistant that can be jailbroken into producing brand-damaging or legally problematic content is a continuing reputational risk.

Primary sources

Where to read the canonical definition.

See Jailbreak in production.

The Penaxtra platform implements the controls and assessments described above as part of its AI-SPM programme.

AI-SPM platform overview