TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization
model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption
GAS-Leak-LLM: Genetic Algorithm-Based Suffix Optimization for Black-Box LLM Jailbreaking
through jailbreaking and prompt injection techniques. In this work, we propose GAS-Leak-LLM, a novel jailbreaking attack based on a genetic algorithm that systematically evolves adversarial suffixes to bypass
SoK: Robustness in Large Language Models against Jailbreak Attacks
automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis
How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation
impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Jailbreak attacks remain a persistent threat to large language model
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Identifying the vulnerabilities of large language models (LLMs) is crucial
MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation
systematically evaluated our method, named MetaBreak, in both a lab environment and on commercial LLM platforms. Our approach achieves jailbreak rates comparable to SOTA prompt-engineering-based solutions when no content moderation
Proactive defense against LLM Jailbreak
adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
Existing white-box jailbreak attacks against aligned LLMs typically append
Exploring and Developing a Pre-Model Safeguard with Draft Models
Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before
JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering
Machine Learning for Detection and Analysis of Novel LLM Jailbreaks
undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses
The Vulnerability of LLM Rankers to Prompt Injection Attacks
alter an LLM's ranking decisions. While this poses serious security risks to LLM-based ranking pipelines, the extent to which this vulnerability persists across diverse LLM families, architectures
Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries
LLM applications (i.e., LLM apps) leverage the powerful capabilities of LLMs to provide users with customized services, revolutionizing traditional application development. While the increasing prevalence of LLM-powered applications provides
When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers
Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit
Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models
MJAD-MLLMs, a holistic framework that systematically analyzes the proposed Multi-turn Jailbreaking Attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions
Adaptive Instruction Composition for Automated LLM Red-Teaming
Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error
Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the time-to-jailbreak as a survival outcome, enabling
How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks
multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects
TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking
historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards