Benchmark MEDIUM
Pedro Conde, Henrique Branquinho, Valerio Mazzone +3 more
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will...
Yesterday cs.AI cs.CR
PDF
Benchmark MEDIUM
Saba Pourhanifeh, AbdulAziz AbdulGhaffar, Ashraf Matrawy
Large Language Models (LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat...
Yesterday cs.CR cs.AI
PDF
Benchmark HIGH
Chiyu Zhang, Huiqin Yang, Bendong Jiang +8 more
The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content...
Yesterday cs.CR cs.CL
PDF
Benchmark LOW
Dahlia Shehata, Ming Li
Multi-agent systems (MAS) assume that collaborating inherently improves Large Language Model (LLM) reasoning. We challenge this by demonstrating that...
Yesterday cs.MA cs.AI
PDF
Benchmark LOW
Hui Lu, Xueyuan Chen, Huimeng Wang +4 more
Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language...
Yesterday cs.CL eess.AS
PDF
Benchmark MEDIUM
Qinghua Mao, Xi Lin, Jinze Gu +3 more
Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces...
Yesterday cs.AI cs.CR
PDF
Benchmark MEDIUM
Xia Hu, Zhenrui Yue, Brian Potetz +4 more
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores...
Yesterday cs.CV cs.AI
PDF
Benchmark MEDIUM
Huy Hoang Ha, Benoit Favre, Francois Portet
Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order...
2 days ago cs.CL cs.AI
PDF
Benchmark MEDIUM
Jingshen Zhang, Bo Wang, Yanlin Fu +4 more
In this paper, we study an emergent self-debiasing mechanism against stereotypical content in Large Language Models (LLMs). Unlike traditional...
Benchmark MEDIUM
Yilin Zhang, Yingkai Hua, Chunyu Wei +2 more
Vision-language model (VLM) based web agents demonstrate impressive autonomous GUI interaction but remain vulnerable to deceptive interface elements....
2 days ago cs.AI cs.CR
PDF
Benchmark HIGH
Shai Feldman, Yaniv Romano
Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally...
Benchmark HIGH
Mohammad Mamun, Mohamed Gaber, Scott Buffett +1 more
Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary...
Benchmark MEDIUM
Di Lu, Bo Zhang, Xiyuan Li +5 more
Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including...
Benchmark MEDIUM
Qinfeng Li, Yuntai Bao, Jianghui Hu +5 more
LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property....
5 days ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Christopher G. Pedraza Pohlenz, Hassan Jalil Hadi, Ali Hassan +1 more
LLMs are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and...
5 days ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier
Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely...
Benchmark MEDIUM
Dasol Choi, Eugenia Kim, Jaewon Noh +14 more
Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover,...
5 days ago cs.CL cs.AI
PDF
Benchmark LOW
Hoin Jung, Xiaoqian Wang
While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the...
5 days ago cs.CL cs.CV cs.LG
PDF
Benchmark MEDIUM
Chenglin Yang
Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A...
6 days ago cs.AI cs.CR
PDF
Benchmark MEDIUM
Rishi Raj Sahoo, Jyotirmaya Shivottam, Subhankar Mishra
Regulatory frameworks such as GDPR increasingly require that ML predictions be accompanied by post-hoc explanations, even when raw data and trained...
1 week ago cs.LG cs.CR
PDF