Paper 2606.09084v1

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather

high relevance tool
Paper 2603.15417v1

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency

medium relevance defense
Paper 2603.01414v1

Jailbreaking Embodied LLMs via Action-level Manipulation

than iterative trial-and-error jailbreaking of black-box embodied LLMs, Blindfold adopts an Adversarial Proxy Planning strategy: it compromises a local surrogate LLM to perform action-level manipulations that

high relevance attack
Paper 2602.16943v1

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system

medium relevance tool
Paper 2510.14207v2

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn

high relevance benchmark
Paper 2605.21834v1

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

Aligned models can misbehave in several ways: they are often

medium relevance defense
Paper 2602.13234v1

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work

medium relevance attack
Paper 2511.15304v3

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate

high relevance attack
Paper 2510.11834v2

Don't Walk the Line: Boundary Guidance for Filtered Generation

margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive

medium relevance benchmark
Paper 2602.09629v1

Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates

high relevance tool
Paper 2511.22044v1

Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate

high relevance attack
Paper 2601.02671v1

Extracting books from production language models

recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini

medium relevance attack
Paper 2510.06790v2

Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test

medium relevance attack
Paper 2605.10582v1

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present

high relevance attack
Paper 2604.01444v1

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

canonical jailbreak strategies. Second, when compromised, LLMs frequently generate actionable yet harmful instructions, inadvertently empowering malicious actors and posing tangible risks. Third, existing LLM-based guardrails systematically overlook these domain

medium relevance benchmark
Paper 2602.03155v1

Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs

vulnerable to jailbreaking, were generalizing models too widely, and had potential implementation issues. Overall, participants reacted positively while also acknowledging the tradeoffs involved in ethical LLM design

medium relevance attack
Paper 2512.06655v2

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Large language models (LLMs) face critical safety challenges, as they

medium relevance defense
Paper 2510.01586v1

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing

medium relevance attack
Paper 2603.04355v1

Efficient Refusal Ablation in LLM through Optimal Transport

Safety-aligned language models refuse harmful requests through learned refusal

medium relevance attack
Paper 2510.07968v2

From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses

Large Language Models (LLMs) have shown remarkable performance across various

medium relevance defense