Benchmark HIGH relevance

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

Yanhang Li Zhichao Fan Zexin Zhuang
Published
June 22, 2026
Updated
June 22, 2026

Abstract

Hidden-state probing -- a linear classifier on a frozen vision-language model's internal activations -- has emerged as an attractive evaluation tool for flagging indirect prompt injection (IPI) in multimodal computer-use agents before the agent emits a corrupted action. We argue, on a single-backbone cautionary case study (Qwen2.5-VL-7B on Mind2Web, teacher-forced replay), that a high probing AUC on a clean-vs-attack split is not, on its own, evidence of malicious-content detection. Two post-hoc diagnostics -- a paired-construction scalar baseline on text-side injections, and same-step nuisance-matched visual controls on the overlay surface -- do not license an unqualified malicious-content interpretation of the headline while leaving room for partly-semantic readings. We package the diagnostics as a candidate control set with reporting heuristics for what a high clean-vs-attack AUC does and does not license. Labels are injection-surface-present, not attack success; generalisation beyond this backbone and benchmark is a conjecture.

Metadata

Comment
17 pages, 3 figures. Camera-ready version for EvalMG '26, The 2nd Workshop on Evaluation for Multimodal Generation, co-located with SIGIR 2026

Pro Analysis

Full threat analysis, ATLAS technique mapping, compliance impact assessment (ISO 42001, EU AI Act), and actionable recommendations are available with a Pro subscription.

Threat Deep-Dive
ATLAS Mapping
Compliance Reports
Actionable Recommendations
Start 14-Day Free Trial