LLMs Can Unlearn Refusal with Only 1,000 Benign Samples
Yangyang Guo, Ziwei Xu, Si Liu +2 more
This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models (LLMs). Existing aligned LLMs predominantly...
Sen Nie, Jie Zhang, Zhuo Wang +2 more
Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial...
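For context on the kind of vulnerability this entry describes, below is a minimal one-step sign-gradient (FGSM-style) sketch against an image encoder. It is a generic illustration, not this paper's attack; the encoder interface and the epsilon budget are assumptions.

```python
# Generic FGSM-style perturbation against an image encoder (illustration
# only; the encoder call and epsilon budget are assumptions, not this
# paper's method).
import torch
import torch.nn.functional as F

def fgsm_perturb(encoder, image, epsilon=8 / 255):
    """One sign-gradient step that pushes an image's embedding away
    from its own clean embedding, breaking zero-shot matching."""
    clean_embedding = encoder(image).detach()
    image = image.clone().detach().requires_grad_(True)
    similarity = F.cosine_similarity(encoder(image), clean_embedding).mean()
    similarity.backward()
    adv = image - epsilon * image.grad.sign()  # descend on similarity
    return adv.clamp(0, 1).detach()
```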
Jiankai Jin, Xiangzheng Zhang, Zhao Liu +2 more
Machine learning systems can produce personalized outputs that allow an adversary to infer sensitive input attributes at inference time. We introduce...
Andy Zhu, Rongzhe Wei, Yupu Gu +1 more
Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods fail to generalize to Mixture-of-Experts...
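As background, the most common unlearning baseline such work builds on is gradient ascent on the forget set; a minimal sketch follows. The Hugging-Face-style `model(..., labels=...).loss` interface is an assumption, and this is not the MoE-specific method the paper studies.

```python
# Gradient-ascent unlearning baseline (background sketch; assumes a
# Hugging-Face-style causal LM whose forward pass returns a .loss).
import torch

def unlearn_step(model, optimizer, input_ids, labels):
    """Ascend the loss on a batch from the forget set."""
    loss = model(input_ids=input_ids, labels=labels).loss
    (-loss).backward()        # negate the loss => gradient ascent
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```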
Song Xia, Meiwen Ding, Chenqi Kong +2 more
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations...
Víctor Mayoral-Vilches, Stefan Rass, Martin Pinzger +14 more
Cybersecurity superintelligence -- artificial intelligence exceeding the best human capability in both speed and strategic reasoning -- represents...
Haodong Chen, Ziheng Zhang, Jinghui Jiang +2 more
Cloud environments face frequent DDoS threats due to centralized resources and broad attack surfaces. Modern cloud-native DDoS attacks further evolve...
Andrew Crossman, Jonah Dodd, Viralam Ramamurthy Chaithanya Kumar +5 more
MITRE ATT&CK is a cybersecurity knowledge base that organizes threat actor and cyber-attack information into a set of tactics describing the reasons...
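A rough sketch of that organization: tactics capture the adversary's goal (the "why"), and each tactic groups concrete techniques (the "how"). The IDs below are real ATT&CK identifiers; the flat dict is only an illustration.

```python
# Minimal illustration of the ATT&CK tactic -> technique hierarchy.
attack_matrix = {
    "TA0001 Initial Access": [
        "T1566 Phishing",
        "T1190 Exploit Public-Facing Application",
    ],
    "TA0002 Execution": ["T1059 Command and Scripting Interpreter"],
    "TA0003 Persistence": ["T1053 Scheduled Task/Job"],
}

for tactic, techniques in attack_matrix.items():
    print(f"{tactic}: {', '.join(techniques)}")
```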
Xiaochen Zhu, Mayuri Sridhar, Srinivas Devadas
Modern machine learning models are increasingly deployed behind APIs. This renders standard weight-privatization methods (e.g. DP-SGD) unnecessarily...
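For readers who need the reference point, DP-SGD privatizes the weights themselves by clipping each per-sample gradient and adding Gaussian noise before averaging; a minimal sketch of that aggregation step is below. Shapes and hyperparameters are illustrative assumptions, not this paper's proposal.

```python
# Sketch of one DP-SGD aggregation step: per-sample clipping + Gaussian
# noise (illustrative hyperparameters).
import torch

def dp_sgd_aggregate(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1):
    """per_sample_grads: (batch_size, num_params) flattened gradients."""
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * (clip_norm / norms).clamp(max=1.0)
    noise = torch.randn_like(per_sample_grads[0]) * noise_multiplier * clip_norm
    return (clipped.sum(dim=0) + noise) / per_sample_grads.shape[0]
```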
Yilin Tang, Yu Wang, Lanlan Qiu +4 more
Large language models (LLMs) have shown strong capabilities in multi-step decision-making, planning and actions, and are increasingly integrated into...
Rishit Chugh
The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating...
Advije Rizvani, Giovanni Apruzzese, Pavel Laskov
Large Language Models (LLMs) are increasingly adopted in the financial domain. Their exceptional capabilities to analyse textual data make them...
Murat Bilgehan Ertan, Emirhan Böge, Min Chen +2 more
As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether...
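The simplest attack in this family is loss thresholding: training-set members tend to have anomalously low loss. A minimal sketch of that baseline follows (a generic attack, not this paper's audit; the Hugging-Face-style `.loss` interface and the threshold value are assumptions).

```python
# Loss-threshold membership inference baseline (generic illustration;
# assumes a model whose forward pass returns a .loss, as HF causal LMs do).
import torch

@torch.no_grad()
def is_member(model, input_ids, labels, threshold=2.0):
    """Flag a sample as a likely training member if its loss is low."""
    loss = model(input_ids=input_ids, labels=labels).loss.item()
    return loss < threshold   # threshold is calibrated on held-out data
```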
János Kramár, Joshua Engels, Zheng Wang +4 more
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful...
Marco Arazzi, Antonino Nocera
Backdoored and privacy-leaking deep neural networks pose a serious threat to the deployment of machine learning systems in security-critical...
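For reference, the canonical backdoor construction these threat models assume is trigger stamping during training: a small patch plus a relabel. A toy sketch is below; the patch location, size, and target class are arbitrary assumptions.

```python
# Toy backdoor-poisoning sketch: stamp a square trigger and relabel to
# the attacker's target class (all parameters are illustrative).
import torch

def poison_sample(image, target_label=0, size=4, value=1.0):
    """Return a triggered copy of the image and the attacker's label."""
    poisoned = image.clone()
    poisoned[..., -size:, -size:] = value   # bottom-right square trigger
    return poisoned, target_label
```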
Christina Lu, Jack Gallagher, Jonathan Michala +2 more
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We...
Luoming Hu, Jingjie Zeng, Liang Yang +1 more
Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as...
Feng Zhang, Shijia Li, Chunmao Zhang +7 more
User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and...
Renyang Liu, Kangjie Chen, Han Qiu +4 more
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from...
Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine +3 more
Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and...
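The intervention itself is typically a forward hook that adds a fixed direction to one layer's hidden states at inference time; a minimal PyTorch sketch is below. The layer path, scale, and vector name in the usage comment are hypothetical, not this paper's method.

```python
# Minimal activation-steering sketch: add a fixed steering vector to a
# layer's hidden states via a PyTorch forward hook (generic pattern).
import torch

def add_steering_hook(layer, steering_vector, scale=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector   # shift activations
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage; call handle.remove() to undo the intervention:
# handle = add_steering_hook(model.transformer.h[12], refusal_direction)
```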