Search: model poisoning | AI Threat Alert

Severity:

250 results in 116ms

Paper 2510.26829v3

2025-10-29

Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving

low relevance attack

Paper 2603.17174v1

2026-03-17

Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

generation large language models (LLMs) are increasingly integrated into modern software development workflows. Recent work has shown that these models are vulnerable to backdoor and poisoning attacks that induce

high relevance attack

Paper 2512.06556v1

2025-12-06

Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks

Model Context Protocol (MCP) enables Large Language Models to integrate external tools through structured descriptors, increasing autonomy in decision-making, task execution, and multi-agent workflows. However, this autonomy creates

high relevance tool

Paper 2510.19145v4

2025-10-22

HAMLOCK: HArdware-Model LOgically Combined attacK

networks (DNNs) introduces new security vulnerabilities. Conventional model-level backdoor attacks, which only poison a model's weights to misclassify inputs with a specific trigger, are often detectable because

high relevance attack

Paper 2601.01972v3

2026-01-05

Hidden State Poisoning Attacks against Mamba-based Language Models

their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms

high relevance attack

Paper 2511.14301v3

2025-11-18

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior

high relevance attack

Paper 2605.02110v1

2026-05-04

Adversarial Update-Based Federated Unlearning for Poisoned Model Recovery

Federated learning (FL) is vulnerable to poisoning attacks, where malicious clients upload manipulated updates to degrade the performance of the global model. Although detection methods can identify and remove malicious

medium relevance attack

Paper 2509.23041v2

2025-09-27

Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data

data poisoning and backdoor attacks show that VIA significantly increases the presence of poisoning content in synthetic data and correspondingly raises the attack success rate (ASR) on downstream models

high relevance attack

Paper 2511.02894v3

2025-11-04

Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models

environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot

medium relevance attack

Paper 2601.01972v4

2026-01-05

Hidden State Poisoning Attacks against Mamba-based Language Models

their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms

high relevance attack

Paper 2511.12414v1

2025-11-16

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets

medium relevance attack

Paper 2602.06616v1

2026-02-06

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

present Confundo, a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness. Confundo provides a unified framework

medium relevance benchmark

Paper 2604.21416v1

2026-04-23

CSC: Turning the Adversary's Poison against Itself

compromise model utility through unlearning methods that lead to accuracy degradation. This paper conducts a comprehensive analysis of backdoor attack dynamics during model training, revealing that poisoned samples form isolated

medium relevance benchmark

Paper 2511.09105v1

2025-11-12

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension

high relevance attack

Paper 2602.22246v1

2026-02-24

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models

induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to recover

medium relevance benchmark

Paper 2601.04448v1

2026-01-07

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging

medium relevance attack

Paper 2601.06305v1

2026-01-09

Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness

medium relevance benchmark

Paper 2512.23132v1

2025-12-29

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection

medium relevance tool

Paper 2603.24857v1

2026-03-25

AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

emph{data decryption attacks and watermark removal attacks}; (2) Data$\rightarrow$Model (D$\rightarrow$M): including \emph{poisoning, harmful fine-tuning attacks, and jailbreak attacks}; (3) Model$\rightarrow$Data

medium relevance survey

Paper 2603.02262v1

2026-02-28

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model

medium relevance attack

Previous Page 2 of 13 Next