Paper 2512.24044v1

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input

high relevance attack
Paper 2605.10779v1

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent

high relevance benchmark
Paper 2510.15017v1

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive

high relevance tool
Paper 2601.09321v1

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

powerful capabilities, they remain vulnerable to jailbreak attacks, which is a critical barrier to their safe web real-time application. Current commercial LLM providers deploy output guardrails to filter harmful

high relevance attack
Paper 2601.03300v1

TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

Large language models remain vulnerable to jailbreak attacks, and single

high relevance attack
Paper 2510.03417v2

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing

high relevance attack
Paper 2602.01587v1

Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment

Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that

high relevance tool
Paper 2510.06994v1

RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic and goal-oriented jailbreak

high relevance attack
Paper 2510.09023v1

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

How should we evaluate the robustness of language model defenses

high relevance attack
Paper 2601.03600v1

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Despite rich safety alignment strategies, large language models (LLMs) remain

high relevance attack
Paper 2602.05444v2

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA$^2$) to jailbreak LLM, which is a framework that leverages Pearl's Front-Door Criterion to sever

high relevance attack
Paper 2510.22628v1

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

This paper presents a real-time modular defense system named

high relevance tool
Paper 2509.23558v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose

high relevance attack
Paper 2512.18755v1

MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking

optimizes them using a simulated annealing strategy guided by semantic similarity, toxicity, and jailbreak effectiveness. Extensive experiments on both closed-source and open-source models, including GPT-4, Claude

high relevance attack
Paper 2510.26096v1

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose

medium relevance defense
Paper 2509.25624v2

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

As LLMs advance into autonomous agents with tool-use capabilities

high relevance tool
Paper 2511.02356v1

An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

The widespread deployment of Large Language Models (LLMs) as public

high relevance tool
Paper 2511.19517v2

Automating Deception: Scalable Multi-Turn LLM Jailbreaks

paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark

high relevance attack
Paper 2511.13788v2

Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

jailbreak smaller ones - eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM

high relevance attack
Paper 2512.05485v2

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

high-value industries continues to expand, the systematic assessment of their safety against jailbreak and prompt-based attacks remains insufficient. Existing safety evaluation benchmarks and frameworks are often limited

high relevance benchmark