LLM Reinforcement in Context
adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There
Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks
The rapid proliferation of Large Language Models (LLMs) has raised
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research
hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly
The Echo Chamber Multi-Turn LLM Jailbreak
The availability of Large Language Models (LLMs) has led to
ChatGPT: Excellent Paper! Accept It. Editor: Imposter Found! Review Rejected
author can inject hidden prompts inside a PDF that secretly guide or "jailbreak" LLM reviewers into giving overly positive feedback and biased acceptance. On the defense side, we propose
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Large Language Models (LLMs) are vulnerable to adversarial attacks that
Performative Scenario Optimization
demonstrated through an emerging AI safety application: deploying performative guardrails against Large Language Model (LLM) jailbreaks. Numerical results confirm the co-evolution and convergence of the guardrail classifier
RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large
Consistency Training Helps Stop Sycophancy and Jailbreaks
LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special
Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks
effective attacks. We therefore introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors, which we propose as the core of a standardized benchmark for evaluating
EnsembleSHAP: Faithful and Certifiably Robust Attribution for Random Subspace Method
providing certified defenses against adversarial and backdoor attacks, and building robustly aligned LLM against jailbreaking attacks. However, the explanation of random subspace method lacks sufficient exploration. Existing state
Bypassing Prompt Guards in Production with Controlled-Release Prompting
attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals
The Evaluation Game: Beyond Static LLM Benchmarking
As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
various natural language processing tasks, yet they also harbor safety vulnerabilities. To enhance LLM safety, various jailbreak defense methods have been proposed to guard against harmful outputs. However, improvements
Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security
having the LLM generate entire templates, which often compromises intent clarity and reproducibility. To address this gap, this paper introduces the Embedded Jailbreak Template, which preserves the structure of existing
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained
Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems
injection flaw along with a guardrail Rowhammer attack to inject an unaltered jailbreak prompt into an LLM, resulting in an AI safety violation, and (2) Manipulating a knowledge database
"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios
maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs