
Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

Yan-Lun Chen, Pin-Yu Chen, Chia-Mu Yu, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

cs.CR cs.CL cs.IR

Published: June 24, 2026

Updated: June 24, 2026


Abstract

Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate model outputs through malicious retrieved documents. Existing detection methods typically rely on auxiliary classifiers or additional LLM-based verification, introducing substantial computational overhead. We present TRACE, a lightweight detection framework that identifies poisoning attacks by tracing answer-related tokens through token influence attribution. TRACE first discovers recurrent high-influence keywords across retrieved documents and then performs a secondary verification to confirm their influence on model predictions. Experiments on three QA benchmarks and six LLMs demonstrate strong detection performance while simultaneously uncovering attacker-specified target answers.
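
The two-stage idea in the abstract (find keywords that recur with high influence across retrieved documents, then verify their effect on the prediction) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how influence is computed or how verification is done, so the lookup-table `toy_influence`, the masking-based check, the `toy_answer_score` stand-in for a model, and the `top_k`, `min_docs`, and `min_drop` parameters are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical interfaces (not from the paper): an influence scorer for a
# token within a document, and a scorer for the model's current answer.
InfluenceFn = Callable[[str, Sequence[str]], float]
AnswerScoreFn = Callable[[List[List[str]]], float]


def recurrent_keywords(docs: List[List[str]], influence: InfluenceFn,
                       top_k: int = 2, min_docs: int = 2) -> List[str]:
    """Stage 1 (assumed form): keep tokens that rank in the top_k by
    influence in at least min_docs of the retrieved documents."""
    counts: Counter = Counter()
    for doc in docs:
        # Sort by descending influence; break ties alphabetically so the
        # result is deterministic.
        ranked = sorted(set(doc), key=lambda t: (-influence(t, doc), t))
        counts.update(ranked[:top_k])
    return sorted(tok for tok, n in counts.items() if n >= min_docs)


def verify_keywords(docs: List[List[str]], candidates: Sequence[str],
                    answer_score: AnswerScoreFn,
                    min_drop: float = 0.2) -> List[str]:
    """Stage 2 (assumed form): confirm a candidate only if masking it from
    every document lowers the answer score by at least min_drop."""
    base = answer_score(docs)
    confirmed = []
    for tok in candidates:
        masked = [[t for t in doc if t != tok] for doc in docs]
        if base - answer_score(masked) >= min_drop:
            confirmed.append(tok)
    return confirmed


# Toy data: two poisoned passages push the target answer "marlowe".
DOCS = [
    ["hamlet", "was", "written", "by", "marlowe"],
    ["scholars", "agree", "marlowe", "wrote", "hamlet"],
    ["shakespeare", "wrote", "hamlet", "around", "1600"],
]

# Stand-in influence values; a real system would derive these from the model.
_TOY_SCORES = {"marlowe": 0.9, "shakespeare": 0.6, "hamlet": 0.5}


def toy_influence(token: str, doc: Sequence[str]) -> float:
    return _TOY_SCORES.get(token, 0.1)


def toy_answer_score(docs: List[List[str]]) -> float:
    # Stand-in for the model's confidence in its (poisoned) answer:
    # the share of documents that still mention "marlowe".
    return sum("marlowe" in doc for doc in docs) / len(docs)


candidates = recurrent_keywords(DOCS, toy_influence)
confirmed = verify_keywords(DOCS, candidates, toy_answer_score)
print(candidates, confirmed)
```

In this toy run, "hamlet" recurs across documents because it is part of the question topic, but masking it does not change the answer score, so only "marlowe" survives verification. That is the role a secondary check plays in this sketch: separating tokens that are merely frequent from tokens that actually drive the output, which also exposes the attacker's target answer.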
