CSC: Turning the Adversary's Poison against Itself
compromise model utility through unlearning methods that lead to accuracy degradation. This paper conducts a comprehensive analysis of backdoor attack dynamics during model training, revealing that poisoned samples form isolated
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension
Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models
induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to recover
GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning
Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data
Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging
Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data
Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models
large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness
Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems
making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection
AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective
emph{data decryption attacks and watermark removal attacks}; (2) Data$\rightarrow$Model (D$\rightarrow$M): including \emph{poisoning, harmful fine-tuning attacks, and jailbreak attacks}; (3) Model$\rightarrow$Data
Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs
poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model
Gray-Box Poisoning of Continuous Malware Ingestion Pipelines
high volume of novel threats. This work investigates a realistic gray-box poisoning threat model targeting these pipelines. Using the secml_malware framework, we generate problem-space adversarial binaries through
From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such
AntiFLipper: A Secure and Efficient Defense Against Label-Flipping Attacks in Federated Learning
remains vulnerable to label-flipping attacks, where malicious clients manipulate labels to poison the global model. Despite their simplicity, these attacks can severely degrade model performance, and defending against them
Customization under Fire: Plugin Poisoning in Text-to-Image Ecosystem
could share and distribute seemingly benign LoRA plugins that contain hidden functionalities to poison the model-sharing market, like Civitai or Liblib, severely undermining the user trust that underpins this
Cordyceps: Covert Control Attacks on LLMs via Data Poisoning
models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean
DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design
suspicious rate $\leq$ 0.36), robustness against both watermark and poisoning attacks (recall $\leq$ 0.57), and a substantial drop in model performance upon watermark removal (Pass@1 drops by 28.6%), underscoring
Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice
perspective. We systematize three major sources of mismatch between research and practice: unrealistic poisoning threat models, the omission of hybrid heterogeneity, and incomplete metrics that overemphasize peak attack success while
AutoBackdoor: Automating Backdoor Attacks via LLM Agents
model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested
Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs
model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings