Jailbreak
Jailbreaks are inputs designed to defeat the alignment training and safety filters of a language model so that it produces output it was trained to refuse: explicit instructions for harm, restricted code, disallowed content, or impersonation of a different system. Common patterns include roleplay framing ("pretend you are an unrestricted AI"), hypothetical scenarios, encoded payloads (base64, leetspeak), multi-turn coercion, and adversarial suffixes generated by gradient-based attacks against open-weights models (the GCG attack and its successors). Jailbreaks differ from prompt injection in intent and target: prompt injection hijacks an application's control flow, while a jailbreak defeats the model's own refusal policy. The two are often combined; for example, an injected payload may include a jailbreak prefix. AI Threat Alert tracks jailbreak research from arXiv plus the small but growing set of CVEs filed against shipped LLM guardrails.
| Severity | CVE | Headline | Package | CVSS |
|---|---|---|---|---|
| HIGH | CVE-2025-30358 | Mesop: class pollution enables DoS and LLM jailbreak | | 8.1 |
| HIGH | CVE-2026-4399 | 1millionbot Millie: Boolean prompt injection bypasses restrictions | | 7.5 |
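
To illustrate why the encoded-payload pattern described above gets past surface-level filtering, here is a minimal defensive sketch in Python. It assumes a naive keyword blocklist and shows that the check only catches base64 or leetspeak variants if the input is normalized first. The names `BLOCKED_TERMS`, `LEET_MAP`, `decoded_views`, and `is_suspicious` are hypothetical and chosen for illustration; this is not how any particular guardrail product works, and real systems generally rely on trained classifiers rather than string matching.

```python
import base64
import binascii
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_TERMS = {"unrestricted ai", "ignore previous instructions"}

# Minimal leetspeak substitutions (an assumption; far from exhaustive).
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Runs of 16 or more base64-alphabet characters, with optional padding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_views(text: str) -> list[str]:
    """Return the raw text plus de-leeted and base64-decoded variants."""
    views = [text, text.translate(LEET_MAP)]
    for candidate in BASE64_RUN.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        views.append(decoded)
    return views


def is_suspicious(text: str) -> bool:
    """Check every normalized view of the input against the blocklist."""
    return any(
        term in view.lower()
        for view in decoded_views(text)
        for term in BLOCKED_TERMS
    )
```

A filter that inspected only the raw string would miss both encoded variants. Checking each decoded view narrows that gap, but it does nothing against roleplay framing, multi-turn coercion, or adversarial suffixes, which carry no fixed keyword to match.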