Search: model poisoning | AI Threat Alert

Paper 2604.21416v1

2026-04-23

CSC: Turning the Adversary's Poison against Itself

compromise model utility through unlearning methods that lead to accuracy degradation. This paper conducts a comprehensive analysis of backdoor attack dynamics during model training, revealing that poisoned samples form isolated

medium relevance benchmark

Paper 2511.09105v1

2025-11-12

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension

high relevance attack

Paper 2602.22246v1

2026-02-24

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models

induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to recover

medium relevance benchmark

Paper 2605.26574v1

2026-05-26

GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data

medium relevance benchmark

Paper 2601.04448v1

2026-01-07

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging

medium relevance attack

Paper 2605.09822v1

2026-05-10

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data

medium relevance attack

Paper 2601.06305v1

2026-01-09

Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness

medium relevance benchmark

Paper 2512.23132v1

2025-12-29

Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection

medium relevance tool

Paper 2603.24857v1

2026-03-25

AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

data decryption attacks and watermark removal attacks; (2) Data→Model (D→M): including poisoning, harmful fine-tuning attacks, and jailbreak attacks; (3) Model→Data

medium relevance survey

Paper 2603.02262v1

2026-02-28

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model

medium relevance attack

Paper 2605.04698v1

2026-05-06

Gray-Box Poisoning of Continuous Malware Ingestion Pipelines

high volume of novel threats. This work investigates a realistic gray-box poisoning threat model targeting these pipelines. Using the secml_malware framework, we generate problem-space adversarial binaries through

medium relevance attack

Paper 2510.05169v1

2025-10-05

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such

medium relevance attack

Paper 2509.22873v2

2025-09-26

AntiFLipper: A Secure and Efficient Defense Against Label-Flipping Attacks in Federated Learning

remains vulnerable to label-flipping attacks, where malicious clients manipulate labels to poison the global model. Despite their simplicity, these attacks can severely degrade model performance, and defending against them

high relevance attack

Paper 2606.09151v1

2026-06-08

Customization under Fire: Plugin Poisoning in Text-to-Image Ecosystem

could share and distribute seemingly benign LoRA plugins that contain hidden functionalities to poison the model-sharing market, like Civitai or Liblib, severely undermining the user trust that underpins this

medium relevance attack

Paper 2605.26595v1

2026-05-26

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean

high relevance attack

Paper 2604.10611v1

2026-04-12

DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

suspicious rate ≤ 0.36), robustness against both watermark and poisoning attacks (recall ≤ 0.57), and a substantial drop in model performance upon watermark removal (Pass@1 drops by 28.6%), underscoring

medium relevance benchmark

Paper 2603.20615v1

2026-03-21

Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice

perspective. We systematize three major sources of mismatch between research and practice: unrealistic poisoning threat models, the omission of hybrid heterogeneity, and incomplete metrics that overemphasize peak attack success while

medium relevance benchmark

Paper 2511.16709v1

2025-11-20

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning

high relevance attack

Paper 2605.19147v1

2026-05-18

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested

high relevance attack

Paper 2603.03371v1

2026-03-02

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings

medium relevance tool