Search: model poisoning | AI Threat Intelligence

195 results in 26ms

Paper 2511.14301v3

2025-11-18

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior

high relevance attack

Paper 2509.23041v2

2025-09-27

Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data

data poisoning and backdoor attacks show that VIA significantly increases the presence of poisoning content in synthetic data and correspondingly raises the attack success rate (ASR) on downstream models

high relevance attack

Paper 2511.02894v3

2025-11-04

Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models

environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot

medium relevance attack

Paper 2601.01972v4

2026-01-05

Hidden State Poisoning Attacks against Mamba-based Language Models

their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms

high relevance attack

Paper 2511.12414v1

2025-11-16

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets

medium relevance attack

Paper 2602.06616v1

2026-02-06

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

present Confundo, a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness. Confundo provides a unified framework

medium relevance benchmark

Paper 2511.09105v1

2025-11-12

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension

high relevance attack

Paper 2602.22246v1

2026-02-24

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models

induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to recover

medium relevance benchmark

Paper 2601.04448v1

2026-01-07

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging

medium relevance attack

Paper 2601.06305v1

2026-01-09

Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness

medium relevance benchmark

Paper 2512.23132v1

2025-12-29

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection

medium relevance tool

Paper 2603.02262v1

2026-02-28

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model

medium relevance attack

Paper 2510.05169v1

2025-10-05

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such

medium relevance attack

Paper 2509.22873v2

2025-09-26

AntiFLipper: A Secure and Efficient Defense Against Label-Flipping Attacks in Federated Learning

remains vulnerable to label-flipping attacks, where malicious clients manipulate labels to poison the global model. Despite their simplicity, these attacks can severely degrade model performance, and defending against them

high relevance attack

Paper 2603.20615v1

2026-03-21

Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice

perspective. We systematize three major sources of mismatch between research and practice: unrealistic poisoning threat models, the omission of hybrid heterogeneity, and incomplete metrics that overemphasize peak attack success while

medium relevance benchmark

Paper 2511.16709v1

2025-11-20

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning

high relevance attack

Paper 2603.03371v1

2026-03-02

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings

medium relevance tool

Paper 2510.09647v1

2025-10-05

Rounding-Guided Backdoor Injection in Deep Learning Model Quantization

quantization to embed malicious behaviors. Unlike conventional backdoor attacks relying on training data poisoning or model training manipulation, QuRA solely works using the quantization operations. In particular, QuRA first employs

high relevance attack

Paper 2602.03085v1

2026-02-03

The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers

Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper agent-style backdoors in causal

low relevance attack

Paper 2512.04785v1

2025-12-04

ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional

medium relevance tool

Previous Page 2 of 10 Next