Benchmark LOW
Junjie Li, Fazle Rabbi, Bo Yang +2 more
Although Large Language Models (LLMs) show promising solutions to automated code generation, they often produce insecure code that threatens software...
Benchmark MEDIUM
Riku Mochizuki, Shusuke Komatsu, Souta Noguchi +1 more
We analyze answers generated by generative engines (GEs) from the perspectives of citation publishers and the content-injection barrier, defined as...
7 months ago cs.CR cs.CL cs.IR
PDF
Benchmark MEDIUM
Zhiyuan Wei, Xiaoxuan Yang, Jing Sun +1 more
The increasing complexity of modern software systems exacerbates the prevalence of security vulnerabilities, posing risks of severe breaches and...
7 months ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Weidi Luo, Qiming Zhang, Tianyu Lu +9 more
Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can...
Benchmark MEDIUM
Ali Naseh, Anshuman Suri, Yuefeng Peng +3 more
Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is...
7 months ago cs.LG cs.CR
PDF
Benchmark LOW
Neeraja Kirtane, Yuvraj Khanna, Peter Relan
Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work...
Benchmark MEDIUM
Shadi Rahimian, Mario Fritz
Single nucleotide polymorphism (SNP) datasets are fundamental to genetic studies but pose significant privacy risks when shared. The correlation of...
7 months ago cs.LG cs.CR q-bio.GN
PDF
Benchmark MEDIUM
Mary Llewellyn, Annie Gray, Josh Collyer +1 more
Before adopting a new large language model (LLM) architecture, it is critical to understand vulnerabilities accurately. Existing evaluations can be...
7 months ago cs.CR cs.AI cs.CL
PDF
Benchmark MEDIUM
Yongan Yu, Xianda Du, Qingchen Hu +7 more
Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have...
7 months ago cs.CL cs.AI
PDF
Benchmark MEDIUM
Ruoxing Yang
Large language models (LLMs) such as ChatGPT have evolved into powerful and ubiquitous tools. Fine-tuning on small datasets allows LLMs to acquire...
7 months ago cs.LG cs.AI cs.CR
PDF
Benchmark HIGH
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri +5 more
AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause...
Benchmark MEDIUM
Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj +2 more
Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing...
7 months ago cs.CL cs.AI cs.LG
PDF
Benchmark LOW
Peichao Lai, Jinhui Zhuang, Kexuan Zhang +6 more
Automating the conversion of UI images into web code is a critical task for front-end development and rapid prototyping. Advances in multimodal large...
Benchmark MEDIUM
Jehyeok Yeon, Isha Chaudhary, Gagandeep Singh
Large language models (LLMs) are increasingly deployed in agentic systems where they map user intents to relevant external tools to fulfill a task. A...
7 months ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Chengxiao Wang, Isha Chaudhary, Qian Hu +3 more
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security....
7 months ago cs.AI cs.CR cs.LG
PDF
Benchmark LOW
Ardalan Aryashad, Parsa Razmara, Amin Mahjoub +3 more
Autonomous driving perception systems are particularly vulnerable in foggy conditions, where light scattering reduces contrast and obscures fine...
Benchmark MEDIUM
Hangting Ye, Jinmeng Li, He Zhao +4 more
Existing anomaly detection (AD) methods for tabular data usually rely on some assumptions about anomaly patterns, leading to inconsistent performance...
Benchmark LOW
Arina Kharlamova, Bowei He, Chen Ma +1 more
Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multi-modal large language models (MLLMs)...
7 months ago cs.AI cs.CR
PDF
Benchmark LOW
Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu +2 more
Recently proposed evaluation benchmarks aim to characterize the effective context length and the forgetting tendencies of large language models...
7 months ago cs.CL cs.AI cs.LG
PDF