Defense HIGH
Nikita Kezins, Urbas Ekka, Pascal Berrang +1 more
Guardrail classifiers defend production language models against harmful behavior, but while results seem promising in testing, they provide no...
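The excerpt does not show the authors' analysis; as background, a guardrail classifier typically wraps the model as an input/output filter. A minimal sketch in Python, where toxicity_score is a hypothetical stand-in for a trained harmfulness classifier:

def toxicity_score(text: str) -> float:
    """Hypothetical placeholder for a trained harmfulness classifier."""
    blocklist = ("build a bomb", "steal credentials")
    return 1.0 if any(k in text.lower() for k in blocklist) else 0.0

def guarded_generate(llm, prompt: str, threshold: float = 0.5) -> str:
    # Input guardrail: screen the prompt before it reaches the model.
    if toxicity_score(prompt) >= threshold:
        return "Request refused by input guardrail."
    response = llm(prompt)
    # Output guardrail: screen the completion before it reaches the user.
    if toxicity_score(response) >= threshold:
        return "Response withheld by output guardrail."
    return response

The wrapper pattern is what makes the guarantee question acute: the classifier sees only text, so any bound on missed harmful content is empirical, not formal.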
Defense LOW
Giordano De Marzo, Alessandro Bellina, Claudio Castellano +2 more
Artificial intelligence safety research focuses on aligning individual language models with human values, yet deployed AI systems increasingly...
Yesterday physics.soc-ph cs.CL cs.MA
Defense MEDIUM
Krishak Aneja, Manas Mittal, Anmol Goel +2 more
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent...
Yesterday cs.CL cs.AI
Defense LOW
Tianyuan Zhang, Peng Yue, Zihao Peng +8 more
Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse...
Defense HIGH
Wenxin Tang, Xiang Zhang, Junliang Liu +11 more
Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the...
Defense LOW
Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau +1 more
A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as...
Defense MEDIUM
Leo Linqian Gan, Jeffery Wu, Longyuan Ge +6 more
Autonomous LLM agents face a critical security risk known as workflow hijacking, where attackers subtly alter tool and skill invocations. Existing...
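The paper's own defense is not described in the excerpt; one generic mitigation against altered tool and skill invocations is to pin the approved tool-call plan cryptographically and refuse any deviation at execution time. A minimal sketch, assuming a simple plan format (SECRET_KEY and the step schema are illustrative assumptions):

import hashlib, hmac, json

SECRET_KEY = b"demo-key"  # hypothetical; use a managed secret in practice

def sign_plan(plan: list) -> str:
    # Canonicalize the plan so the signature is stable across serializations.
    blob = json.dumps(plan, sort_keys=True).encode()
    return hmac.new(SECRET_KEY, blob, hashlib.sha256).hexdigest()

def execute_plan(plan: list, signature: str, tools: dict) -> None:
    # Reject execution if any tool name or argument changed after approval.
    if not hmac.compare_digest(sign_plan(plan), signature):
        raise PermissionError("tool-call plan modified: possible workflow hijack")
    for step in plan:
        tools[step["tool"]](**step["args"])

For example, signing plan = [{"tool": "search", "args": {"query": "x"}}] at approval time and re-verifying inside execute_plan means an injected prompt that swaps "search" for a file-write tool fails the HMAC check before anything runs.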
Defense MEDIUM
Guoxin Lu, Letian Sha, Qing Wang +4 more
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on...
5 days ago cs.CR cs.AI cs.CL
Defense MEDIUM
Siyuan Li, Aodu Wulianghai, Xi Lin +6 more
The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from...
Defense MEDIUM
Xinjie Shen, Rongzhe Wei, Peizhi Niu +6 more
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful...
5 days ago cs.CL cs.AI cs.CR
Defense LOW
Fabrice Harel-Canada, Amit Sahai
LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection...
6 days ago cs.CL cs.AI
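For context, the next-token biasing the abstract refers to is usually implemented in the style of green-list watermarks (Kirchenbauer et al.): a pseudorandom subset of the vocabulary is boosted at each step, and detection replays the split and counts how often tokens land in it. A minimal sketch of that family, not this paper's scheme:

import random

def green_set(prev_token: int, vocab_size: int, frac: float = 0.5) -> set:
    # Seed a PRNG with the previous token so detection can replay the split.
    rng = random.Random(prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(frac * vocab_size)])

def bias_logits(logits: list, prev_token: int, delta: float = 2.0) -> list:
    # Boost "green" tokens; this is exactly the distributional bias at issue.
    greens = green_set(prev_token, len(logits))
    return [l + delta if i in greens else l for i, l in enumerate(logits)]

def detect(tokens: list, vocab_size: int, frac: float = 0.5) -> float:
    # Fraction of tokens in their green list; ~frac for unwatermarked text,
    # noticeably higher for watermarked text.
    hits = sum(t in green_set(p, vocab_size, frac)
               for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

The delta shift is what "pays for detection": a larger delta makes detection easier but distorts the model's distribution more.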
Defense MEDIUM
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera +2 more
The open-source ecosystem has accelerated the democratization of Large Language Models (LLMs) through the public distribution of specialized Low-Rank...
Defense LOW
Hanum Ko, Sangheum Yeon, Jong Hwan Ko +1 more
As DRAM scales in density and adopts 3D integration, raw fault rates increase and multi-bit errors are no longer rare. Such errors can severely...
Defense LOW
Zhenning Yang, Yuhan Chen, Patrick Tser Jern Kon +5 more
To unleash the full potential of AI for Science, we must untether the agents from a purely digital environment. The agent's ability to control and...
1 week ago eess.SY cs.AI
Defense LOW
Srinath Perera, Kaviru Hapuarachchi, Frank Leymann +1 more
We present Robust Agent Compensation (RAC), a log-based recovery paradigm that provides a safety net, implemented through an architectural extension...
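The abstract does not detail RAC's mechanics; the general log-based compensation pattern it names works roughly as below, recording an undo action alongside each completed step and replaying the undos in reverse on failure (all tool names hypothetical):

class CompensationLog:
    """Generic saga-style compensation log (illustrative, not RAC itself)."""
    def __init__(self):
        self._entries = []  # (description, undo_callable), in execution order

    def record(self, description, undo):
        self._entries.append((description, undo))

    def rollback(self):
        # Compensate completed actions in reverse order.
        while self._entries:
            description, undo = self._entries.pop()
            undo()

def create_file(name): print(f"created {name}")   # hypothetical agent tool
def delete_file(name): print(f"deleted {name}")   # hypothetical compensator

log = CompensationLog()
try:
    create_file("report.txt")
    log.record("remove report.txt", lambda: delete_file("report.txt"))
    raise RuntimeError("downstream step failed")  # simulate a failure
except RuntimeError:
    log.rollback()  # safety net: undo everything already done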
Defense MEDIUM
Prakhar Gupta, Garv Shah, Donghua Zhang
Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's...
1 week ago cs.LG cs.AI cs.CR
Defense LOW
Sandra Arcos-Holzinger, Sarah M. Erfani, James Bailey +1 more
Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural...
1 week ago eess.AS cs.CR cs.LG
Defense MEDIUM
Sadia Asif, Mohammad Mohammadi Amiri
Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable...
1 week ago cs.LG cs.AI cs.CE
Defense MEDIUM
Xiaokun Luan, Yihao Zhang, Pengcheng Su +2 more
Large Language Model (LLM) watermarking is crucial for establishing the provenance of machine-generated text, but most existing methods rely on a...
Defense HIGH
Zeming Dong, Yuejun Guo, Qiang Hu +5 more
Source code and its accompanying comments are complementary yet naturally aligned modalities: code encodes structural logic while comments capture...
2 weeks ago cs.SE cs.AI