How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation
impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Jailbreak attacks remain a persistent threat to large language model
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Identifying the vulnerabilities of large language models (LLMs) is crucial
MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation
systematically evaluated our method, named MetaBreak, in both lab environments and on commercial LLM platforms. Our approach achieves jailbreak rates comparable to SOTA prompt-engineering-based solutions when no content moderation
Proactive defense against LLM Jailbreak
adversarial search to terminate prematurely, effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method
JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering
Machine Learning for Detection and Analysis of Novel LLM Jailbreaks
undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses
The Vulnerability of LLM Rankers to Prompt Injection Attacks
alter an LLM's ranking decisions. While this poses serious security risks to LLM-based ranking pipelines, the extent to which this vulnerability persists across diverse LLM families, architectures
Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries
LLM applications (i.e., LLM apps) leverage the powerful capabilities of LLMs to provide users with customized services, revolutionizing traditional application development. While the increasing prevalence of LLM-powered applications provides
When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers
Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit
Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models
MJAD-MLLMs, a holistic framework that systematically analyzes the proposed Multi-turn Jailbreaking Attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions
How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks
multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects
TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking
historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards
Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?
moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive
SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails
powerful capabilities, they remain vulnerable to jailbreak attacks, a critical barrier to their safe deployment in real-time web applications. Current commercial LLM providers deploy output guardrails to filter harmful
TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering
Large language models remain vulnerable to jailbreak attacks, and single
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing
Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that
RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning
driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic and goal-oriented jailbreak