ATLAS Landscape
AML.T0054

LLM Jailbreak

Adversaries may induce a large language model (LLM) to ignore, circumvent, or override its safety/alignment behaviors and/or guardails to elicit outputs the model is intended to withhold. Once jailbroken, the LLM may be used in unintended ways by the adversary. Jailbreaks may be achieved via adversarial prompting, or by modifying model weights or safety mechanisms. Adversaries may attempt a jailbreak for [Defense Evasion](/tactics/AML.TA0007) of the LLM's guidelines and guardrails itself to then reveal information (ex: [LLM Data Leakage](/techniques/AML.T0057), [Discover LLM System Information](/techniques/AML.T0069)) or generate harmful content (ex: [Generate Malicious Commands](/techniques/AML.T0102), [Spearphishing via Social Engineering LLM](/techniques/AML.T0052.000)). They may also jailbreak a model for [Privilege Escalation](/tactics/AML.TA0012) to invoke tools or perform actions for their own purposes (ex: [AI Agent Tool Invocation](/techniques/AML.T0053)) or abuse the agent for a [Command and Control](/tactics/AML.TA0014) channel (ex: [AI Agent](/techniques/AML.T0108)). Adversaries use a variety of strategies to craft jailbreak prompts. Prompts may target specific models or model families and are iterated upon until successful. Model providers actively update their model guardrails to make them more resistant to jailbreak prompts as new prompts are developed. Common strategies [\[1\]][jailbreak-guide] include but are not limited to: - Instruction override: Use phrasing that attempts to supersede prior constraints (e.g. "ignore previous instructions"). - Roleplay / persona switching: Instruct the LLM to adopt an identity or mode that allows unrestricted answers (e.g. "as a security researcher"). - Fictionalization and hypotheticals: Instruct the LLM to include disallowed content as part of a story, screenplay, or educational scenario. - Separate intent from content: request analysis, examples, templates, or edge cases, that implicitly contain disallowed content. - Multi-turn escalation / Crescendo: Utilize a sequence of prompts that start benign, establish trust, then gradually cross policy boundaries with incremental prompts. - Constrained output formats: Instruct the LLM to output to a strict schema or format (e.g. JSON, YAML, code, or tables). - Obfuscation and transformation: Use encoding, transformations, translation, or euphemisms, (e.g., base64 encoding, "describe it in another language"). - Create a high priority objective: Frame compliance as necessary to fulfill the user's main task (e.g. "to complete the evaluation," "to follow the spec," "to follow safety guidelines"). Adversaries may also use algorithmic approaches to generating jailbreak prompts [\[2\]][jailbreak-zoo] [\[3\]][jailbreak-survey]. Algorithmic jailbreak generation allows for automated methods that discover jailbreaks at scale. Some approaches automate manual strategies [\[4\]][autodan] [\[5\]][gptfuzzer] [\[6\]][crescendo] [\[7\]][echo-chamber] while others optimize a string of tokens directly [\[8\]][universal] to produce nonsensical text. Both black-box (applicable to commercial models where the adversary has only query access to the model) and white-box (applicable in the open-source setting, where the adversary has full access to the model weights) optimization approaches are viable. Adversaries may also directly manipulate a model's weights, or modify or remove parts of a model to create a jailbroken of "uncensored" variant of the target model. This is applicable to open-source models, or cases where the adversary gains full access to the target model. Approaches include fine-tuning to reduce refusals [\[9\]][single-direction], targeted model editing [\[10\]][rome], addition of adapters [\[11\]][lora], and removing safety mechanisms such as guardrails. Jailbreak prompts that are known to work on various classes of LLMs are often published in the open-source community [\[12\]][dan]. Jailbroken or uncensored LLMs that have been trained or fine-tuned to be jailbroken are shared in public model registries such as huggingface [\[13\]][abliteration]. [jailbreak-survey]: https://arxiv.org/abs/2407.04295 "Jailbreak Attacks and Defenses Against Large Language Models: A Survey" [jailbreak-zoo]: https://arxiv.org/abs/2407.01599 "JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models" [jailbreak-guide]: https://www.promptfoo.dev/blog/how-to-jailbreak-llms/ "Jailbreaking LLMs: A Comprehensive Guide (With Examples)" [autodan]: https://arxiv.org/abs/2310.04451 "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" [gptfuzzer]: https://arxiv.org/abs/2309.10253 "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts" [crescendo]: https://arxiv.org/abs/2404.01833 "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack" [echo-chamber]: https://arxiv.org/abs/2601.05742 "The Echo Chamber Multi-Turn LLM Jailbreak" [dan]: https://github.com/0xk1h0/ChatGPT_DAN "ChatGPT DAN" [rome]: https://arxiv.org/abs/2202.05262 "Locating and Editing Factual Associations in GPT" [universal]: https://arxiv.org/abs/2307.15043 "Universal and Transferable Adversarial Attacks on Aligned Language Models" [single-direction]: https://arxiv.org/abs/2406.11717 "Refusal in Language Models Is Mediated by a Single Direction" [lora]: https://arxiv.org/abs/2310.20624 "LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B" [abliteration]: https://huggingface.co/blog/mlabonne/abliteration "Uncensor any LLM with abliteration"

Severity CVE CVSS
CRITICAL CVE-2026-27966 9.8
CRITICAL CVE-2026-41265 9.8
HIGH CVE-2025-30358 8.1
MEDIUM GHSA-gpx9-96j6-pp87 6.5
UNKNOWN CVE-2026-4399