Attack HIGH relevance

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Mengqi He Xinyu Tian Xin Shen Shu Zou Jinhong Ni Zhaoyuan Yang Weikang Li Xuesong Li Jing Zhang

cs.CV cs.AI

Published

May 11, 2026

Updated

May 11, 2026

Links

PDF arxiv

Abstract

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization(UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

Metadata

Comment: Preprint. 17 pages, 8 figures, 6 tables

Pro Analysis

Full threat analysis, ATLAS technique mapping, compliance impact assessment (ISO 42001, EU AI Act), and actionable recommendations are available with a Pro subscription.

Threat Deep-Dive

ATLAS Mapping

Compliance Reports

Actionable Recommendations

Start 14-Day Free Trial

Back to Research