AI Security Research

2,560+ academic papers on AI security, attacks, and defenses

Total

2,560

Attack

982

Benchmark

736

Defense

350

Tool

275

Survey

144

Type

All Attack Defense Survey Benchmark Tool

Relevance

All High Medium

Date

All time 7 days 30 days 6 months

Showing 241–260 of 1,936 papers

Clear filters

Benchmark HIGH

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Euntae Kim, Soomin Han, Buru Chang

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to...

3 weeks ago cs.CL PDF

Defense MEDIUM

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

Kun Wang, Cheng Qian, Miao Yu +6 more

Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is...

3 weeks ago cs.CR cs.AI PDF

Attack HIGH

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

MinJae Jung, YongTaek Lim, Chaeyun Kim +3 more

While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses....

3 weeks ago cs.CL PDF

Defense MEDIUM

Mechanistic Anomaly Detection via Functional Attribution

Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani

We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was...

3 weeks ago cs.LG cs.CR PDF

Tool HIGH

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang, Yao Ma, Tharindu Kumarage +5 more

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an...

3 weeks ago cs.AI cs.CR cs.LG PDF

Attack HIGH

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

Hanrui Luo, Shreyank N Gowda

Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only...

3 weeks ago cs.CL cs.LG PDF

Benchmark MEDIUM

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal

Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse...

3 weeks ago cs.LG cs.AI cs.CL PDF

Attack MEDIUM

Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Ruixuan Liu, David Evans, Li Xiong

Indistinguishability properties such as differential privacy bounds or low empirically measured membership inference are widely treated as proxies to...

3 weeks ago cs.CR cs.CL cs.LG PDF

Attack HIGH

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Md Rysul Kabir, Zoran Tiganj

Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in...

3 weeks ago cs.CR cs.AI cs.CL PDF

Attack HIGH

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

Thamilvendhan Munirathinam

Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer...

3 weeks ago cs.CR cs.CL PDF

Benchmark MEDIUM

AgenTEE: Confidential LLM Agent Execution on Edge Devices

Sina Abdollahi, Mohammad M Maheri, Javad Forough +5 more

Large Language Model (LLM) agents provide powerful automation capabilities, but they also create a substantially broader attack surface than...

3 weeks ago cs.CR cs.OS PDF

Benchmark LOW

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang, Zhicheng Qian +1 more

Prompt-injected memory can improve reasoning without updating model weights, but it also creates a control problem: retrieved content helps only when...

3 weeks ago cs.AI PDF

Defense MEDIUM

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Ziyang Liu

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such...

3 weeks ago cs.CR cs.AI PDF

Attack HIGH

Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

Wentao Zhang, Yan Zhuang, ZhuHang Zheng +3 more

Existing jamming attacks on Retrieval-Augmented Generation (RAG) systems typically induce explicit refusals or denial-of-service behaviors, which are...

3 weeks ago cs.CR cs.AI PDF

Benchmark LOW

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

Sua Lee, Sanghee Park, Jinbae Im

Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators-a paradigm known as MLLM-as-a-Judge. However, their...

3 weeks ago cs.CL cs.AI cs.CV PDF

Attack HIGH

Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks

Jin Zhao, Marta Knežević, Tanja Käser

Large Language Models (LLMs) are increasingly used in education, yet their default helpfulness often conflicts with pedagogical principles. Prior...

3 weeks ago cs.CR cs.AI PDF

Defense LOW

Architectural Design Decisions in AI Agent Harnesses

Hu Wei

AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety...

3 weeks ago cs.AI PDF

Defense MEDIUM

Owner-Harm: A Missing Threat Model for AI Agent Safety

Dongcheng Zhang, Yiqing Jiang

Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a...

3 weeks ago cs.CR cs.AI cs.CL PDF

Other LOW

MASFuzzer: Fuzz Driver Generation and Adaptive Scheduling via Multidimensional API Sequences

Xingyu Liu, Zengqin Huang, Xiang Gao +1 more

Fuzz testing of software libraries relies on fuzz drivers to invoke library APIs. Traditionally, these drivers are written manually by developers - a...

3 weeks ago cs.SE PDF

Benchmark HIGH

RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

Parteek Jamwal, Minghao Shao, Boyuan Chen +15 more

Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification,...

3 weeks ago cs.CR cs.AI cs.MA PDF

Track AI security vulnerabilities in real time

Get breaking CVE alerts, compliance reports (ISO 42001, EU AI Act), and CISO risk assessments for your AI/ML stack.

Start 14-Day Free Trial