Defense MEDIUM
Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab +3 more
Backdoor attacks pose a significant threat to the integrity and reliability of Artificial Intelligence (AI) models, enabling adversaries to...
1 month ago cs.LG cs.CR cs.DC
PDF
Defense MEDIUM
Xunguang Wang, Yuguang Zhou, Qingyue Wang +5 more
Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning...
1 month ago cs.AI cs.CR
PDF
Defense MEDIUM
Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas +2 more
Activation steering has emerged as a powerful tool to shape LLM behavior without the need for weight updates. While its inherent brittleness and...
1 month ago cs.CR cs.CL
PDF
Defense MEDIUM
Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee
Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually...
1 month ago cs.CR cs.AI cs.MM
PDF
Defense MEDIUM
Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg +1 more
Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on...
1 month ago cs.CL cs.AI cs.CY
PDF
Defense MEDIUM
Shawn Li, Yue Zhao
Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete...
1 month ago cs.CR cs.AI cs.LG
PDF
Defense MEDIUM
Carlos Hinojosa, Clemens Grange, Bernard Ghanem
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However,...
1 month ago cs.CV cs.AI cs.CL
PDF
Defense MEDIUM
Ce Zhang, Jinxi He, Junyi He +2 more
Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability...
1 month ago cs.CV cs.CL cs.CR
PDF
Defense MEDIUM
Zhenheng Tang, Xiang Liu, Qian Wang +3 more
As Large Language Models (LLMs) become more powerful and autonomous, they increasingly face conflicts and dilemmas in many scenarios. We first...
1 month ago cs.AI cs.CY
PDF
Defense MEDIUM
Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury +4 more
Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the...
1 month ago cs.LG cs.AI cs.CL
PDF
Defense MEDIUM
Yewon Han, Yumin Seol, EunGyung Kong +2 more
Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety-utility tradeoff, where strengthening safety...
1 month ago cs.CV cs.AI
PDF
Defense MEDIUM
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty
Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large...
Defense MEDIUM
Pengcheng Li, Jie Zhang, Tianwei Zhang +5 more
Safety alignment in large language models is typically evaluated under isolated queries, yet real-world use is inherently multi-turn. Although...
1 month ago cs.CR cs.AI
PDF
Defense MEDIUM
Matthew Butler, Yi Fan, Christos Faloutsos
The proposed method (FraudFox) provides a defense against adversarial attacks in a resource-constrained environment. We focus on questions like the...
2 months ago cs.CR cs.LG
PDF
Defense MEDIUM
Zonghao Ying, Xiao Yang, Siyang Wu +7 more
The rapid evolution of Large Language Models (LLMs) into autonomous, tool-calling agents has fundamentally altered the cybersecurity landscape....
Defense MEDIUM
Xinhao Deng, Yixiang Zhang, Jiaqing Wu +15 more
Autonomous Large Language Model (LLM) agents, exemplified by OpenClaw, demonstrate remarkable capabilities in executing complex, long-horizon tasks....
2 months ago cs.CR cs.AI
PDF
Defense MEDIUM
Zhiyu Xue, Zimo Qi, Guangliang Liu +2 more
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal...
Defense MEDIUM
Harry Owiredu-Ashley
Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture...
2 months ago cs.CR cs.AI cs.CL
PDF
Defense MEDIUM
Bo Jiang
Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and...
2 months ago cs.CR cs.AI cs.CL
PDF
Defense MEDIUM
Sumit Ranjan, Sugandha Sharma, Ubaid Abbas +1 more
Voice interfaces are quickly becoming a common way for people to interact with AI systems. This also brings new security risks, such as prompt...
2 months ago cs.SD cs.AI
PDF