Paper 2510.13351v1

Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context

medium relevance tool
Paper 2510.08917v1

"I know it's not right, but that's what it said to do": Investigating Trust in AI Chatbots for Cybersecurity Policy

chatbots are an emerging security attack vector, vulnerable to threats such as prompt injection, and rogue chatbot creation. When deployed in domains such as corporate security policy, they could

medium relevance attack
Paper 2510.01586v1

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each

medium relevance attack
Paper 2509.26584v1

Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models

concerns regarding security and fairness. Beyond known attack vectors such as data poisoning and prompt injection, LLMs are also vulnerable to fairness bugs. These refer to unintended behaviors influenced

medium relevance benchmark
Paper 2509.25705v1

How Diffusion Models Memorize

under memorization due to classifier-free guidance amplifying predictions and inducing overestimation; (ii) memorized prompts inject training images into noise predictions, forcing latent trajectories to converge and steering denoising toward

low relevance other
Paper 2509.23519v2

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

documents. These systems, however, remain vulnerable to attacks on the retrieval corpus, such as prompt injection. RAG-based search systems (e.g., Google's Search AI Overview) present an interesting setting

medium relevance defense
Paper 2605.05846v1

LoopTrap: Termination Poisoning Attacks on LLM Agents

this self-directed loop facilitates autonomy, it also introduces a critical risk: by injecting malicious prompts into the agent's context, an adversary can distort the agent's termination judgment

high relevance attack
Paper 2606.09499v1

Targeting World Models to Compromise Robot Learning Pipelines

directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets which are only activated once

medium relevance benchmark
Paper 2603.17239v1

LAAF: Logic-layer Automated Attack Framework A Systematic Red-Teaming Methodology for LPCI Vulnerabilities in Agentic Large Language Model Systems

pipelines, and external tool connectors face a class of attacks - Logic-layer Prompt Control Injection (LPCI) - for which no automated red-teaming instrument existed. We present LAAF (Logic-layer Automated

high relevance attack
Paper 2603.12644v1

Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw

OpenClaw ecosystem. We systematically investigate its current threat landscape, highlighting critical vulnerabilities such as prompt injection-driven Remote Code Execution (RCE), sequential tool attack chains, context amnesia, and supply chain

medium relevance defense
Paper 2512.17146v1

Biosecurity-Aware AI: Agentic Risk Auditing of Soft Prompt Attacks on ESM-Based Variant Predictors

GFMs. SAGE functions through an interpretable and automated risk auditing loop. It injects soft prompt perturbations, monitors model behavior across training checkpoints, computes risk metrics such as AUROC and AUPR

high relevance attack
Paper 2605.04808v1

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying

high relevance tool
Paper 2512.17259v1

Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems

detection under stealthy strategies, and (iii) resilience of verifiability mechanisms to adversarial prompt and persona injection. Our approach shifts the evaluation focus from how likely misalignment is to how quickly

medium relevance tool

LangChain Core has Path Traversal vulnerabilites in legacy `load_prompt

CVSS 7.5 langchain-core View details

PraisonAI: Arbitrary File Read via `@file:` Mention Path Traversal

CVSS 7.5 praisonaiagents View details
Paper 2603.08387v1

AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition

propose AULLM++, a reasoning-oriented framework leveraging Large Language Models (LLMs), which injects visual features into textual prompts as actionable semantic premises to guide inference. It formulates AU prediction into

low relevance benchmark
Paper 2606.18120v1

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace

high relevance attack
Paper 2602.05401v1

BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models

chat templates allows an attacker who controls the template to inject arbitrary strings into the system prompt without the user's notice. Building on this, we propose a training-free

high relevance attack
Paper 2604.21700v1

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek

high relevance attack
Paper 2604.21829v1

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

prior prompt-stealing methods and build an automated stealing prompt generation agent. This agent starts from model-generated seed prompts, expands them through scenario rationalization and structure injection, and enforces

high relevance attack
Previous Page 22 of 25 Next