Paper 2512.06674v1

RunawayEvil: Jailbreaking the Image-to-Video Generative Models

strategies through reinforcement learning-driven strategy customization and LLM-based strategy exploration; (2) Multimodal Tactical Planning Unit that generates coordinated text jailbreak instructions and image tampering guidelines based

high relevance attack
Paper 2602.04893v1

A Causal Perspective for Enhancing Jailbreak Attack and Defense

causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at https://github.com

high relevance attack
Paper 2512.23132v1

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

stages. Results: We identify unreported threats including commercial LLM API model stealing, parameter memorization leakage, and preference-guided text-only jailbreaks. Dominant TTPs include MASTERKEY-style jailbreaking, federated poisoning, diffusion

medium relevance tool
Paper 2603.04459v2

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

based on both automated and human assessment) on LLM safety benchmarks, analyzing 31 benchmarks and 382 non-benchmarks across prompt injection, jailbreak, and hallucination. We find that benchmark papers show

medium relevance benchmark
Paper 2510.11837v1

Countermind: A Multi-Layered Security Architecture for Large Language Models

security of Large Language Model (LLM) applications is fundamentally challenged by "form-first" attacks like prompt injection and jailbreaking, where malicious instructions are embedded within user inputs. Conventional defenses, which

medium relevance benchmark
Paper 2606.10525v1

Assessing Automated Prompt Injection Attacks in Agentic Environments

prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--proven effective for jailbreaking--remain underexplored in realistic agentic settings

high relevance attack
Paper 2601.03265v1

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more

high relevance attack
Paper 2510.10281v1

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test

more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real

high relevance attack
Paper 2604.18976v1

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black

high relevance attack
Paper 2603.19127v1

On Optimizing Multimodal Jailbreaks for Spoken Language Models

inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses

high relevance attack
Paper 2601.09625v2

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

prompts engineered to exploit an application's LLM. We introduce a seven-stage promptware kill chain: Initial Access (prompt injection), Privilege Escalation (jailbreaking), Reconnaissance, Persistence (memory and retrieval poisoning), Command

high relevance attack
Paper 2601.02377v1

Trust in LLM-controlled Robotics: a Survey of Security Threats, Defenses and Challenges

landscape and corresponding defense strategies for LLM-controlled robotics. Specifically, we discuss a comprehensive taxonomy of attack vectors, covering topics such as jailbreaking, backdoor attacks, and multi-modal prompt injection

medium relevance survey
Paper 2606.05743v1

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Despite advances in safety alignment, large language models remain vulnerable

medium relevance defense
Paper 2601.10141v1

Understanding and Preserving Safety in Fine-Tuned LLMs

both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning

medium relevance defense
Paper 2512.01353v3

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Existing approaches overwhelmingly operate within the prompt-optimization paradigm: whether through traditional algorithmic

medium relevance defense
Paper 2602.16977v1

Fail-Closed Alignment for Large Language Models

independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety

medium relevance defense
Paper 2601.12460v1

TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions

high relevance attack
Paper 2510.21983v1

Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data

high relevance attack
Paper 2601.05445v1

Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

fail to adapt to the LLM's dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self

high relevance attack
Paper 2510.08646v2

Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

undesirable states (false refusal or jailbreak) and low energy to desirable states (helpful response or safe reject). During inference, the EBM maps the LLM's internal activations to an energy

medium relevance benchmark