Search: LLM jailbreak | AI Threat Alert

256 results in 118ms

Paper 2512.20168v1

2025-12-23

Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography

Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Currently, most existing jailbreak methods are tailored for open-source models and exhibit

high relevance tool

Paper 2604.11309v1

2026-04-13

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security

high relevance tool

Paper 2601.19487v1

2026-01-27

LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off -- reducing jailbreak increases

high relevance attack

Paper 2511.13548v1

2025-11-17

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false

high relevance tool

Paper 2601.15706v1

2026-01-22

Improving Methodologies for LLM Evaluations Across Global Languages

five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours

medium relevance benchmark

Paper 2510.17904v2

2025-10-19

BreakFun: Jailbreaking LLMs via Schema Exploitation

paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM's adherence to structured schemas. BreakFun employs a three-part prompt that

high relevance attack

Paper 2509.23037v1

2025-09-27

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments

high relevance attack

Paper 2601.22240v1

2026-01-29

A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy

The rapid advancement and widespread adoption of generative artificial intelligence

high relevance survey

Paper 2604.19274v1

2026-04-21

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against

high relevance benchmark

Paper 2601.18998v1

2026-01-26

Malicious Repurposing of Open Science Artefacts by Using Large Language Models

introducing an end-to-end pipeline that first bypasses LLM safeguards through persuasion-based jailbreaking, then reinterprets NLP papers to identify and repurpose their artefacts (datasets, methods, and tools

medium relevance benchmark

Paper 2601.01627v1

2026-01-04

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust

medium relevance benchmark

Paper 2509.23882v2

2025-09-28

Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning

medium relevance benchmark

Paper 2606.20408v1

2026-06-18

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Large language model (LLM) agents are increasingly proposed as supervisory

high relevance benchmark

Paper 2509.21761v2

2025-09-26

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand

medium relevance attack

Paper 2602.14161v1

2026-02-15

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack

medium relevance benchmark

Paper 2511.18581v2

2025-11-23

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

Many recent studies showed that LLMs are vulnerable to jailbreak attacks, where an attacker can perturb the input of an LLM to induce it to generate an output

high relevance attack

Paper 2601.04034v1

2026-01-07

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic

high relevance attack

Paper 2510.01359v1

2025-10-01

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text

high relevance tool

Paper 2605.06605v1

2026-05-07

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. These events

high relevance benchmark

Paper 2601.10971v2

2026-01-16

AJAR: Adaptive Jailbreak Architecture for Red-teaming

language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops. Existing jailbreak frameworks still leave

high relevance attack
