Paper 2606.23496v1

TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption

medium relevance attack
Paper 2606.15788v1

GAS-Leak-LLM: Genetic Algorithm-Based Suffix Optimization for Black-Box LLM Jailbreaking

through jailbreaking and prompt injection techniques. In this work, we propose GAS-Leak-LLM, a novel jailbreaking attack based on a genetic algorithm that systematically evolves adversarial suffixes to bypass

high relevance attack
Paper 2605.05058v1

SoK: Robustness in Large Language Models against Jailbreak Attacks

automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis

high relevance survey
Paper 2512.10415v2

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs

high relevance benchmark
Paper 2602.12418v1

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Jailbreak attacks remain a persistent threat to large language model

high relevance attack
Paper 2511.01375v1

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Identifying the vulnerabilities of large language models (LLMs) is crucial

high relevance attack
Paper 2510.10271v1

MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

systematically evaluated our method, named MetaBreak, in both lab environments and on commercial LLM platforms. Our approach achieves jailbreak rates comparable to SOTA prompt-engineering-based solutions when no content moderation

high relevance attack
Paper 2510.05052v2

Proactive defense against LLM Jailbreak

adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method

high relevance attack
Paper 2604.24983v1

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Existing white-box jailbreak attacks against aligned LLMs typically append

high relevance attack
Paper 2605.19321v1

Exploring and Developing a Pre-Model Safeguard with Draft Models

Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before

medium relevance attack
Paper 2603.01291v1

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks

across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering

high relevance benchmark
Paper 2510.01644v2

Machine Learning for Detection and Analysis of Novel LLM Jailbreaks

undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses

high relevance attack
Paper 2602.16752v1

The Vulnerability of LLM Rankers to Prompt Injection Attacks

alter an LLM's ranking decisions. While this poses serious security risks to LLM-based ranking pipelines, the extent to which this vulnerability persists across diverse LLM families, architectures

high relevance attack
Paper 2511.17874v2

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

LLM applications (i.e., LLM apps) leverage the powerful capabilities of LLMs to provide users with customized services, revolutionizing traditional application development. While the increasing prevalence of LLM-powered applications provides

high relevance attack
Paper 2511.21718v1

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit

medium relevance defense
Paper 2601.05339v1

Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models

MJAD-MLLMs, a holistic framework that systematically analyzes the proposed Multi-turn Jailbreaking Attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions

high relevance attack
Paper 2604.21159v1

Adaptive Instruction Composition for Automated LLM Red-Teaming

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error

high relevance attack
Paper 2605.12869v1

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the time-to-jailbreak as a survival outcome, enabling

high relevance attack
Paper 2602.04294v1

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects

high relevance attack
Paper 2602.06440v1

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards

high relevance attack