NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Large Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing
Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that
RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning
driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic, and goal-oriented jailbreak
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
How should we evaluate the robustness of language model defenses
ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification
Despite rich safety alignment strategies, large language models (LLMs) remain
Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever
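For reference, the Front-Door Criterion the snippet names has a standard closed form; a minimal statement with treatment X, mediator M, and outcome Y in Pearl's usual roles (how CFA² maps prompts and responses onto these variables is not shown here):

    % Front-door adjustment: M fully mediates X -> Y, the edge X -> M has
    % no unblocked back-door path, and X blocks all back-door paths M -> Y.
    P(y \mid \mathrm{do}(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')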
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks
This paper presents a real-time modular defense system named
Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose
MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking
optimizes them using a simulated annealing strategy guided by semantic similarity, toxicity, and jailbreak effectiveness. Extensive experiments on both closed-source and open-source models, including GPT-4, Claude
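To make the search procedure concrete, here is a minimal simulated-annealing sketch of the kind of loop the snippet describes; the mutate and score callables are hypothetical placeholders, not MEEA's actual operators, with score standing in for the paper's combination of semantic similarity, toxicity, and jailbreak effectiveness:

    import math
    import random

    def anneal(seed_prompt, mutate, score, steps=200, t0=1.0, cooling=0.98):
        # mutate(prompt) -> perturbed candidate prompt (placeholder)
        # score(prompt)  -> composite objective to maximize (placeholder)
        current, current_score = seed_prompt, score(seed_prompt)
        best, best_score = current, current_score
        temp = t0
        for _ in range(steps):
            candidate = mutate(current)
            cand_score = score(candidate)
            delta = cand_score - current_score
            # Always accept improvements; accept regressions with
            # probability exp(delta / temp), so early high-temperature
            # steps explore while later steps become nearly greedy.
            if delta > 0 or random.random() < math.exp(delta / temp):
                current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
            temp *= cooling
        return best, best_score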
ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models
defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
As LLMs advance into autonomous agents with tool-use capabilities
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
The widespread deployment of Large Language Models (LLMs) as public
Automating Deception: Scalable Multi-Turn LLM Jailbreaks
paper introduces a novel, automated pipeline for generating large-scale, psychologically grounded multi-turn jailbreak datasets. We systematically operationalize foot-in-the-door (FITD) techniques into reproducible templates, creating a benchmark
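One plausible shape for such a reproducible template, purely as illustration (the field names and escalation structure below are assumptions, not the paper's schema):

    from dataclasses import dataclass, field

    @dataclass
    class FITDTemplate:
        # Foot-in-the-door script: early turns stay benign and only the
        # later turns substitute in the harmful goal, mirroring the
        # gradual-escalation pattern the pipeline operationalizes.
        goal: str                                  # redacted target behavior
        turns: list = field(default_factory=list)  # ordered prompt format strings

        def render(self):
            return [turn.format(goal=self.goal) for turn in self.turns]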
Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments
jailbreak smaller ones, eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM
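A minimal harness for one such attacker-target exchange might look like the following; the attacker, target, and judge callables are hypothetical stand-ins (JailbreakBench supplies the adversarial goals, but this loop is a sketch, not the study's code):

    def run_exchange(goal, attacker, target, judge, max_turns=5):
        # attacker(goal, history) -> next adversarial user turn
        # target(history)         -> target model's reply
        # judge(goal, reply)      -> True if the reply harmfully complies
        history = []
        for _ in range(max_turns):
            user_turn = attacker(goal, history)
            history.append(("user", user_turn))
            reply = target(history)
            history.append(("assistant", reply))
            if judge(goal, reply):
                return True, history    # jailbreak succeeded this turn
        return False, history           # safeguards held for all turns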
TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations
high-value industries continues to expand, the systematic assessment of their safety against jailbreak and prompt-based attacks remains insufficient. Existing safety evaluation benchmarks and frameworks are often limited
LLM Reinforcement in Context
adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There
Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks
The rapid proliferation of Large Language Models (LLMs) has raised
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research
hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly
The Echo Chamber Multi-Turn LLM Jailbreak
The availability of Large Language Models (LLMs) has led to