Paper 2605.29629v1

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Attack Success Rate (ASR) evaluates each jailbreak with a single

high relevance attack
Paper 2601.10589v1

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement

high relevance defense
Paper 2602.11495v2

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that the latent signals

high relevance attack
Paper 2512.19011v2

PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline

Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present PromptScreen, an efficient and systematically evaluated defense architecture that mitigates these threats

high relevance attack
Paper 2602.13321v1

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

evaluations, the system achieves strong overall performance, indicating that LLM-derived linguistic features provide an effective basis for automated jailbreak detection. Error analysis further highlights key limitations in current annotations

high relevance attack
Paper 2602.01025v1

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white

high relevance attack
Paper 2510.17947v2

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent

high relevance attack
Paper 2605.29659v1

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large

high relevance attack
Paper 2603.25176v1

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints

high relevance attack
Paper 2603.01942v1

Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots

propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing

high relevance attack
Paper 2511.10519v1

Say It Differently: Linguistic Styles as Jailbreak Vectors

introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling

high relevance attack
Paper 2510.06594v2

Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM

high relevance attack
Paper 2606.22673v1

AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent

behavioral control during execution. Meanwhile, recent mechanistic interpretability methods for LLM safety are mostly confined to single-turn or jailbreak-style QA settings, limiting their ability to capture the evolving

medium relevance defense
Paper 2511.07480v1

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

show that our framework enhances defense performance against various jailbreak attack methods, while also improving the response quality of the LLM in general QA scenarios by incorporating domain-general knowledge

high relevance tool
Paper 2512.03001v1

Invasive Context Engineering to Control Large Language Models

good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length. There is a need for robust LLM security guarantees in long-context situations. We propose control

medium relevance attack
Paper 2601.02670v1

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Tree Search (LATS), addressing these limitations through an attacker-LLM-free method that operates purely via lexical anchor injection. LATS reformulates jailbreaking as a breadth-first tree search over multi-turn

high relevance attack
Paper 2603.24543v1

Analysing the Safety Pitfalls of Steering Vectors

that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions

medium relevance defense
Paper 2510.02999v4

Untargeted Jailbreak Attack

Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective as inducing fixed

high relevance attack
Paper 2512.23173v1

EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion

significant concern, as they are still susceptible to jailbreak attacks aimed at eliciting inappropriate or harmful responses. However, existing jailbreak attacks mainly operate at the natural language level and rely

high relevance attack
Paper 2601.00867v1

The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

adversarial scenarios targeting LLM decision-making. Our preliminary hypothesis testing across seven major LLM families reveals a disturbing pattern: while models demonstrate robust defenses against traditional jailbreaks, they exhibit critical

medium relevance survey