Paper 2512.20405v2

ChatGPT: Excellent Paper! Accept It. Editor: Imposter Found! Review Rejected

author can inject hidden prompts inside a PDF that secretly guide or "jailbreak" LLM reviewers into giving overly positive feedback and biased acceptance. On the defense side, we propose

medium relevance survey
Paper 2511.12217v1

AlignTree: Efficient Defense Against LLM Jailbreak Attacks

Large Language Models (LLMs) are vulnerable to adversarial attacks that

high relevance attack
Paper 2603.29982v1

Performative Scenario Optimization

demonstrated through an emerging AI safety application: deploying performative guardrails against Large Language Model (LLM) jailbreaks. Numerical results confirm the co-evolution and convergence of the guardrail classifier

medium relevance attack
Paper 2510.13901v2

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities

high relevance attack
Paper 2510.09471v1

Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large

medium relevance benchmark
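
The excerpt above does not show how such an index is built or queried. As a loose, hypothetical sketch of the core idea, the snippet below uses SQLite's FTS5 extension (not necessarily what the paper uses) to index a handful of training documents and test whether a phrase from a model output occurs verbatim in them:

```python
import sqlite3

# Minimal sketch, assuming an FTS5-enabled SQLite build (standard in most
# CPython distributions). The corpus, table layout, and phrase check are
# illustrative; the paper's actual index design is not in the excerpt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE corpus USING fts5(doc)")

training_docs = [
    "the quick brown fox jumps over the lazy dog",
    "large language models are trained on web-scale text",
]
conn.executemany("INSERT INTO corpus(doc) VALUES (?)",
                 [(d,) for d in training_docs])

def in_training_data(phrase: str) -> bool:
    # Double quotes make FTS5 treat the query as an exact token sequence.
    cur = conn.execute("SELECT count(*) FROM corpus WHERE corpus MATCH ?",
                       (f'"{phrase}"',))
    return cur.fetchone()[0] > 0

print(in_training_data("brown fox jumps"))   # True
print(in_training_data("purple fox jumps"))  # False
```
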
Paper 2510.27062v1

Consistency Training Helps Stop Sycophancy and Jailbreaks

LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special

high relevance attack
Paper 2604.18660v1

Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks

effective attacks. We therefore introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors, which we propose as the core of a standardized benchmark for evaluating

high relevance attack
Paper 2603.30034v1

EnsembleSHAP: Faithful and Certifiably Robust Attribution for Random Subspace Method

providing certified defenses against adversarial and backdoor attacks, and building robustly aligned LLMs against jailbreaking attacks. However, the explanation of the random subspace method lacks sufficient exploration. Existing state

medium relevance benchmark
Paper 2510.01529v2

Bypassing Prompt Guards in Production with Controlled-Release Prompting

attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals

medium relevance attack
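
The excerpt does not reveal the paper's actual encoding scheme, but the resource asymmetry it describes is easy to illustrate. In the hypothetical sketch below, a naive keyword guard scans raw text while the payload travels in an encoding the guard never decodes (base64 is purely a stand-in, and the blocklist and payload are toy examples):

```python
import base64

# Toy keyword guard: flags prompts containing blocklisted words.
BLOCKLIST = {"forbidden", "payload"}

def lightweight_guard_allows(prompt: str) -> bool:
    return not any(word in prompt.lower() for word in BLOCKLIST)

plain = "example payload text"
encoded = base64.b64encode(plain.encode()).decode()

print(lightweight_guard_allows(plain))    # False: the guard catches raw text
print(lightweight_guard_allows(encoded))  # True: the encoding slips past it
# A sufficiently capable main model could still recover `plain` from
# `encoded`, which is exactly the asymmetry the attack exploits.
```
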
Paper 2511.19009v1

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation

various natural language processing tasks, yet they also harbor safety vulnerabilities. To enhance LLM safety, various jailbreak defense methods have been proposed to guard against harmful outputs. However, improvements

medium relevance defense
Paper 2511.14140v1

Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security

having the LLM generate entire templates, which often compromises intent clarity and reproducibility. To address this gap, this paper introduces the Embedded Jailbreak Template, which preserves the structure of existing

high relevance attack
Paper 2604.12817v1

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained

medium relevance attack
Paper 2603.12023v1

Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

injection flaw along with a guardrail Rowhammer attack to inject an unaltered jailbreak prompt into an LLM, resulting in an AI safety violation, and (2) Manipulating a knowledge database

high relevance tool
Paper 2511.16278v1

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports longitudinal safety monitoring of popular HuggingFace LLMs

high relevance attack
Paper 2512.20168v1

Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography

Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Currently, most existing jailbreak methods are tailored for open-source models and exhibit

high relevance tool
Paper 2604.11309v1

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security

high relevance tool
Paper 2601.19487v1

LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off -- reducing jailbreak increases

high relevance attack
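
As a rough illustration of the magnitude-based steering and the trade-off the LLM-VA excerpt describes, the sketch below shifts toy hidden states along a single "refusal" direction; every vector and the readout are synthetic constructions, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)  # unit refusal direction

def steer(hidden: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the refusal direction by magnitude alpha."""
    return hidden + alpha * refusal_dir

def refusal_score(hidden: np.ndarray) -> float:
    """Toy readout: projection onto the refusal direction."""
    return float(hidden @ refusal_dir)

harmful_h = rng.normal(size=d) + 0.5 * refusal_dir  # leans toward refusing
benign_h = rng.normal(size=d) - 0.5 * refusal_dir   # leans toward answering

for alpha in (0.0, 1.0, 3.0):
    print(alpha,
          round(refusal_score(steer(harmful_h, alpha)), 2),
          round(refusal_score(steer(benign_h, alpha)), 2))
# Raising alpha pushes BOTH states toward refusal: the harmful case gets
# safer, but the benign case starts being refused too, i.e. the trade-off.
```
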
Paper 2511.13548v1

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false

high relevance tool
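
The dual-dimensional judgment the ForgeDAN excerpt describes can be sketched as a conjunction of two judges; both functions below are toy heuristics standing in for the paper's LLM-based classifier:

```python
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm sorry")

def judge_compliance(response: str) -> bool:
    """Dimension 1: did the model comply rather than refuse? (toy heuristic)"""
    low = response.lower()
    return not any(marker in low for marker in REFUSAL_MARKERS)

def judge_harmfulness(response: str) -> bool:
    """Dimension 2: stand-in for an LLM harmfulness classifier."""
    return "step-by-step instructions" in response.lower()  # toy signal

def is_jailbreak(response: str) -> bool:
    # Requiring BOTH dimensions prunes false positives: a compliant but
    # harmless answer, or a refusal that merely mentions the harmful topic.
    return judge_compliance(response) and judge_harmfulness(response)

print(is_jailbreak("I'm sorry, I cannot assist with that."))       # False
print(is_jailbreak("Sure! Here is a friendly poem about cats."))   # False
print(is_jailbreak("Sure, step-by-step instructions follow..."))   # True
```
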
Paper 2601.15706v1

Improving Methodologies for LLM Evaluations Across Global Languages

five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours

medium relevance benchmark
Paper 2510.17904v2

BreakFun: Jailbreaking LLMs via Schema Exploitation

paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM's adherence to structured schemas. BreakFun employs a three-part prompt that

high relevance attack