AI Security Research
2,529+ academic papers on AI security, attacks, and defenses
Defense HIGH
Nikita Kezins, Urbas Ekka, Pascal Berrang +1 more
Guardrail classifiers defend production language models against harmful behavior; although results seem promising in testing, they provide no...
Defense LOW
Giordano De Marzo, Alessandro Bellina, Claudio Castellano +2 more
Artificial intelligence safety research focuses on aligning individual language models with human values, yet deployed AI systems increasingly...
Yesterday physics.soc-ph cs.CL cs.MA
Defense MEDIUM
Krishak Aneja, Manas Mittal, Anmol Goel +2 more
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent...
Yesterday cs.CL cs.AI
Defense LOW
Tianyuan Zhang, Peng Yue, Zihao Peng +8 more
Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse...
Defense HIGH
Wenxin Tang, Xiang Zhang, Junliang Liu +11 more
Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the...
Defense LOW
Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau +1 more
A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as...
Defense MEDIUM
Leo Linqian Gan, Jeffery Wu, Longyuan Ge +6 more
Autonomous LLM agents face a critical security risk known as workflow hijacking, where attackers subtly alter tool and skill invocations. Existing...
Defense MEDIUM
Guoxin Lu, Letian Sha, Qing Wang +4 more
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on...
5 days ago cs.CR cs.AI cs.CL
Defense MEDIUM
Siyuan Li, Aodu Wulianghai, Xi Lin +6 more
The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from...
Defense MEDIUM
Xinjie Shen, Rongzhe Wei, Peizhi Niu +6 more
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful...
5 days ago cs.CL cs.AI cs.CR
Defense LOW
Fabrice Harel-Canada, Amit Sahai
LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection...
6 days ago cs.CL cs.AI
Defense MEDIUM
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera +2 more
The open-source ecosystem has accelerated the democratization of Large Language Models (LLMs) through the public distribution of specialized Low-Rank...
Defense LOW
Hanum Ko, Sangheum Yeon, Jong Hwan Ko +1 more
As DRAM scales in density and adopts 3D integration, raw fault rates increase and multi-bit errors are no longer rare. Such errors can severely...