Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction
CourtGuard: A Local, Multiagent Prompt Injection Classifier
system, where a "defense attorney" model argues the prompt is benign, a "prosecution attorney" model argues the prompt is a prompt injection, and a "judge" model gives the final classification
PromptLocate: Localizing Prompt Injection Attacks
segments, (2) identifying segments contaminated by injected instructions, and (3) pinpointing segments contaminated by injected data. We show PromptLocate accurately localizes injected prompts across eight existing and eight adaptive attacks
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
based red-teaming framework that systematically assesses existing prompt-injection defenses by training an attack LLM to optimize injected prompts in a practical black-box setting, where the attacker
Prompt Injection as Role Confusion
reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs
Prompt Injection as Role Confusion
reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs
PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
response, thereby eliminating the influence of the injected instruction. To sanitize injected tokens, PISanitizer builds on two observations: (1) prompt injection attacks essentially craft an instruction that compels
Adversarial Prompt Injection Attack on Multimodal Large Language Models
methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs
The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism
malware execution mechanism triggered through prompts engineered to exploit an application's LLM. We introduce a seven-stage promptware kill chain: Initial Access (prompt injection), Privilege Escalation (jailbreaking), Reconnaissance, Persistence
AgentWatcher: A Rule-based Prompt Injection Monitor
Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations
Are AI-assisted Development Tools Immune to Prompt Injection?
Prompt injection is listed as the number-one vulnerability class in the OWASP Top 10 for LLM Applications that can subvert LLM guardrails, disclose sensitive data, and trigger unauthorized tool
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against
PIArena: A Platform for Prompt Injection Evaluation
Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
size-biased clusters is designated unsafe. Within this framework, we analyze prompt injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers
PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features
injection attacks, where an attacker contaminates the input to inject malicious instructions, causing the LLM to follow the attacker's intent instead of the original user's. Existing prompt injection
AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
models (LVLMs) are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection, especially through visual inputs. We introduce AgentTypo, a black-box red-teaming framework that
Learning to Inject: Automated Prompt Injection via Reinforcement Learning
Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs
manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources
"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers
reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present
The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection
Prompt injection defenses are often framed as semantic understanding problems and delegated to increasingly large neural detectors. For the first screening layer, however, the requirements are different: the detector runs
AI Threat Alert