AI Security Research

2,583+ academic papers on AI security, attacks, and defenses

Total: 2,583
Attack: 994
Benchmark: 740
Defense: 355
Tool: 275
Survey: 146

Showing 1961–1980 of 2,583 papers

Benchmark LOW

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward, Teun van der Weij, Hanna Gábor +6 more

AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning...

6 months ago cs.AI PDF

Defense MEDIUM

EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Jialin Wu, Kecen Li, Zhicong Huang +3 more

Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code...

6 months ago cs.CL cs.CR PDF

Benchmark MEDIUM

Taught by the Flawed: How Dataset Insecurity Breeds Vulnerable AI Code

Catherine Xia, Manar H. Alalfi

AI programming assistants have demonstrated a tendency to generate code containing basic security vulnerabilities. While developers are ultimately...

6 months ago cs.CR cs.AI PDF

Survey MEDIUM

Unlearning Imperative: Securing Trustworthy and Responsible LLMs through Engineered Forgetting

James Jin Kang, Dang Bui, Thanh Pham +1 more

The growing use of large language models in sensitive domains has exposed a critical weakness: the inability to ensure that private information can...

6 months ago cs.LG PDF

Survey MEDIUM

I've Seen Enough: Measuring the Toll of Content Moderation on Mental Health

Gabrielle M Gauthier, Eesha Ali, Amna Asim +2 more

Human content moderators (CMs) routinely review distressing digital content at scale. Beyond exposure, the work context (e.g., workload, team...

6 months ago cs.HC PDF

Benchmark LOW

CARScenes: Semantic VLM Dataset for Safe Autonomous Driving

Yuankai He, Weisong Shi

CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable,...

6 months ago cs.CV cs.RO PDF

Defense MEDIUM

Slice-Aware Spoofing Detection in 5G Networks Using Lightweight Machine Learning

Daniyal Ganiuly, Nurzhau Bolatbek

The increasing virtualization of fifth generation (5G) networks expands the attack surface of the user plane, making spoofing a persistent threat to...

6 months ago cs.CR cs.NI PDF

Benchmark LOW

Toward Honest Language Models for Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang +7 more

Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty...

6 months ago cs.CL PDF

Attack LOW

Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation

Xin Zhao, Xiaojun Chen, Bingshan Liu +3 more

Generative vision-language models like Stable Diffusion demonstrate remarkable capabilities in creative media synthesis, but they also pose...

6 months ago cs.AI cs.CR cs.CV PDF

Benchmark MEDIUM

One Signature, Multiple Payments: Demystifying and Detecting Signature Replay Vulnerabilities in Smart Contracts

Zexu Wang, Jiachi Chen, Zewei Lin +7 more

Smart contracts have significantly advanced blockchain technology, and digital signatures are crucial for reliable verification of contract...

6 months ago cs.CR cs.SE PDF

Attack HIGH

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka, Keita Saito, Mikoto Kudo +3 more

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data...

6 months ago cs.LG cs.AI PDF

Attack HIGH

StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak

Hongyi Li, Chengxuan Zhou, Chu Wang +5 more

Large Audio-language Models (LAMs) have recently enabled powerful speech-based interactions by coupling audio encoders with Large Language Models...

6 months ago cs.SD PDF

Benchmark LOW

Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback

Shengbo Wang, Hong Sun, Ke Li

Interactive preference elicitation (IPE) aims to substantially reduce human effort while acquiring human preferences in wide personalization systems....

6 months ago cs.LG PDF

Benchmark MEDIUM

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks

Yunfei Yang, Xiaojun Chen, Yuexin Xuan +3 more

Model watermarking techniques can embed watermark information into the protected model for ownership declaration by constructing specific...

6 months ago cs.CR cs.LG PDF

Benchmark MEDIUM

Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

Kazuki Iwahana, Yusuke Yamasaki, Akira Ito +2 more

Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into...

6 months ago cs.LG cs.CR PDF

Attack MEDIUM

iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification

Zixun Xiong, Gaoyi Wu, Qingyang Yu +5 more

Given the high cost of large language model (LLM) training from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial....

6 months ago cs.CR cs.AI PDF

Other LOW

Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

Jiahang He, Rishi Ramachandran, Neel Ramachandran +5 more

As large language models (LLMs) are adopted in an increasingly wide range of applications, user-model interactions have grown in both frequency and...

6 months ago cs.CL PDF

Attack HIGH

A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula +5 more

This work aims to investigate how different Large Language Models (LLMs) alignment methods affect the models' responses to prompt attacks. We...

6 months ago cs.CL PDF

Defense LOW

Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko +3 more

We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety...

6 months ago cs.AI PDF

Attack MEDIUM

SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

Giorgio Piras, Raffaele Mura, Fabio Brau +3 more

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing...

6 months ago cs.AI cs.LG PDF
