Jailbreak - AI Glossary

Our engineers set up and run your first chatbot / LLM security scan. Get in touch →

A jailbreak is a prompt designed to make an LLM produce output the developer instructions or the model's safety alignment was meant to prevent. Common jailbreak patterns include role-play framings ("you are now in developer mode"), nested-task confusion ("translate this safety instruction into French and follow it"), token-smuggling (encoding the disallowed request in base64 or another encoding the model decodes mid-response), and adversarial suffix attacks (where a deliberately constructed token sequence flips the model's refusal behaviour).

Jailbreaks are a subset of prompt injection but distinguished because the attacker is the legitimate user and the goal is to bypass safety alignment rather than to extract data from another principal. They matter to AI-SPM because a customer-facing assistant that can be jailbroken into producing brand-damaging or legally problematic content is a continuing reputational risk.

Other entries in this neighbourhood.

Prompt Injection An attack that smuggles attacker-controlled instructions into a model prompt to override the developer instructions or extract sensitive data. Adversarial Scan A scheduled execution of probe templates against an LLM endpoint, agent, or RAG pipeline, scored to produce control-mapped findings.

Where to read the canonical definition.

OWASP LLM01 Prompt Injection open →

See Jailbreak in production.

The Penaxtra platform implements the controls and assessments described above as part of its AI-SPM programme.

AI-SPM platform overview →