Paper 2510.01359v1

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text

high relevance tool
Paper 2601.10971v2

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops. Existing jailbreak frameworks still leave

high relevance attack
Paper 2601.10589v1

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement

high relevance defense
Paper 2602.11495v2

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that the latent signals

high relevance attack
Paper 2512.19011v2

PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline

Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present PromptScreen, an efficient and systematically evaluated defense architecture that mitigates these threats

high relevance attack
Paper 2602.13321v1

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

evaluations, the system achieves strong overall performance, indicating that LLM-derived linguistic features provide an effective basis for automated jailbreak detection. Error analysis further highlights key limitations in current annotations

high relevance attack
Paper 2602.01025v1

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white

high relevance attack
Paper 2510.17947v2

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent

high relevance attack
Paper 2603.01942v1

Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots

propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing

high relevance attack
Paper 2511.10519v1

Say It Differently: Linguistic Styles as Jailbreak Vectors

introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling

high relevance attack
Paper 2510.06594v2

Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM

high relevance attack
Paper 2511.07480v1

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

show that our framework enhances defense performance against various jailbreak attack methods, while also improving the response quality of the LLM in general QA scenarios by incorporating domain-general knowledge

high relevance tool
Paper 2512.03001v1

Invasive Context Engineering to Control Large Language Models

good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length. There is a need for robust LLM security guarantees in long-context situations. We propose control

medium relevance attack
Paper 2601.02670v1

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Tree Search (LATS), addressing these limitations through an attacker-LLM-free method that operates purely via lexical anchor injection. LATS reformulates jailbreaking as a breadth-first tree search over multi-turn

high relevance attack
Paper 2510.02999v4

Untargeted Jailbreak Attack

Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective as inducing fixed

high relevance attack
Paper 2512.23173v1

EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion

significant concern, as they are still susceptible to jailbreak attacks aimed at eliciting inappropriate or harmful responses. However, existing jailbreak attacks mainly operate at the natural language level and rely

high relevance attack
Paper 2601.00867v1

The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

adversarial scenarios targeting LLM decision-making. Our preliminary hypothesis testing across seven major LLM families reveals a disturbing pattern: while models demonstrate robust defenses against traditional jailbreaks, they exhibit critical

medium relevance survey
Paper 2512.06674v1

RunawayEvil: Jailbreaking the Image-to-Video Generative Models

strategies through reinforcement learning-driven strategy customization and LLM-based strategy exploration; (2) Multimodal Tactical Planning Unit that generates coordinated text jailbreak instructions and image tampering guidelines based

high relevance attack
Paper 2602.04893v1

A Causal Perspective for Enhancing Jailbreak Attack and Defense

causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at https://github.com

high relevance attack
Paper 2512.23132v1

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

stages. Results: We identify unreported threats including commercial LLM API model stealing, parameter memorization leakage, and preference-guided text-only jailbreaks. Dominant TTPs include MASTERKEY-style jailbreaking, federated poisoning, diffusion

medium relevance tool