Paper 2512.10415v2

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs

high relevance benchmark
Paper 2602.12418v1

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Jailbreak attacks remain a persistent threat to large language model

high relevance attack
Paper 2511.01375v1

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Identifying the vulnerabilities of large language models (LLMs) is crucial

high relevance attack
Paper 2510.10271v1

MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

systematically evaluated our method, named MetaBreak, in both lab environments and on commercial LLM platforms. Our approach achieves jailbreak rates comparable to SOTA prompt-engineering-based solutions when no content moderation

high relevance attack
Paper 2510.05052v2

Proactive defense against LLM Jailbreak

adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method

high relevance attack
Paper 2603.01291v1

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks

across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering

high relevance benchmark
Paper 2510.01644v2

Machine Learning for Detection and Analysis of Novel LLM Jailbreaks

undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses

high relevance attack
Paper 2602.16752v1

The Vulnerability of LLM Rankers to Prompt Injection Attacks

alter an LLM's ranking decisions. While this poses serious security risks to LLM-based ranking pipelines, the extent to which this vulnerability persists across diverse LLM families, architectures

high relevance attack
Paper 2511.17874v2

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

LLM applications (i.e., LLM apps) leverage the powerful capabilities of LLMs to provide users with customized services, revolutionizing traditional application development. While the increasing prevalence of LLM-powered applications provides

high relevance attack
Paper 2511.21718v1

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit

medium relevance defense
Paper 2601.05339v1

Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models

MJAD-MLLMs, a holistic framework that systematically analyzes the proposed Multi-turn Jailbreaking Attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions

high relevance attack
Paper 2602.04294v1

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects

high relevance attack
Paper 2602.06440v1

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards

high relevance attack
Paper 2512.24044v1

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input

high relevance attack
Paper 2510.15017v1

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive

high relevance tool
Paper 2601.09321v1

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

powerful capabilities, they remain vulnerable to jailbreak attacks, a critical barrier to their safe deployment in real-time web applications. Current commercial LLM providers deploy output guardrails to filter harmful

high relevance attack
Paper 2601.03300v1

TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

Large language models remain vulnerable to jailbreak attacks, and single

high relevance attack
Paper 2510.03417v2

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing

high relevance attack
Paper 2602.01587v1

Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that

high relevance tool
Paper 2510.06994v1

RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic and goal-oriented jailbreak

high relevance attack