WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically
SecInfer: Preventing Prompt Injection via Inference-time Scaling
Prompt injection attacks pose a pervasive threat to the security of Large Language Models (LLMs). State-of-the-art prevention-based defenses typically rely on fine-tuning
A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing data leaks, unauthorized actions, or compromised outputs
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this
DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion
they process user data according to predefined instructions. However, conventional LLMs remain vulnerable to prompt injection, where malicious users inject directive tokens into the data to subvert model behavior. Existing
Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing
find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight
Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification
classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model
WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks
ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data
Prompt injection attacks aim to contaminate the input data of an LLM to mislead it into completing an attacker-chosen task instead of the intended task. In many applications
Defending Against Prompt Injection with DataFilter
agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs
Privacy-Preserving Prompt Injection Detection for LLMs Using Federated Learning and Embedding-Based NLP Classification
designed inputs. Existing detection approaches often require centralizing prompt data, creating significant privacy risks. This paper proposes a privacy-preserving prompt injection detection framework based on federated learning and embedding
PINA: Prompt Injection Attack against Navigation Agents
actions. Compared to text-based applications, their security is far more critical: a successful prompt injection attack does not just alter outputs but can directly misguide physical navigation, leading
Physical Prompt Injection Attacks on Large Vision-Language Models
reasoning in open physical environments. While LVLMs are known to be vulnerable to prompt injection attacks, existing methods either require access to input channels or depend on knowledge of user
Better Privilege Separation for Agents by Restricting Data Types
systems, such as AI agents. Unfortunately, these advantages have come with a vulnerability to prompt injections, an attack where an adversary subverts the LLM's intended functionality with an injected
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate
Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models
Prompt injection evaluations typically treat refusal as a stable, binary indicator of safety. This study challenges that paradigm by modeling refusal as a local decision boundary and examining its stability
Prompt injections as a tool for preserving identity in GAI image descriptions
have been described, but most require top down or external intervention. An emerging strategy, prompt injections, provides an empowering alternative: indirect users can mitigate harm against them, from within their
Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis
Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present
SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement
extend tool-augmented behaviors. This abstraction introduces an under-measured attack surface: skill-based prompt injection, where poisoned skills can steer agents away from user intent and safety policies
Quantifying Return on Security Controls in LLM Systems
subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities