Benchmark MEDIUM
Gengxin Sun, Ruihao Yu, Liangyi Yin +3 more
Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose...
1 month ago cs.MA cs.AI
PDF
Benchmark LOW
Lingyu Li, Yan Teng, Yingchun Wang
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal...
1 month ago cs.CL cs.AI
PDF
Benchmark LOW
Trishita Dhara, Siddhesh Sheth
Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this,...
Benchmark LOW
Simone Aonzo, Merve Sahin, Aurélien Francillon +1 more
Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over...
1 month ago cs.CR cs.AI
PDF
Benchmark LOW
Taeyun Roh, Wonjune Jang, Junha Jung +1 more
Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store...
1 month ago cs.CL cs.AI
PDF
Benchmark MEDIUM
Yu Pan, Wenlong Yu, Tiejun Wu +4 more
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to...
1 month ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Ye Wang, Jing Liu, Toshiaki Koike-Akino
The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain...
2 months ago cs.LG cs.AI cs.CL
PDF
Benchmark MEDIUM
Yuhuan Liu, Haitian Zhong, Xinyuan Xia +3 more
Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems...
Benchmark MEDIUM
Jinhu Qi, Yifan Li, Minghao Zhao +4 more
As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased...
2 months ago cs.CL cs.DB
PDF
Benchmark HIGH
Lidor Erez, Omer Hofman, Tamir Nizri +1 more
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the...
2 months ago cs.CR cs.PF
PDF
Benchmark LOW
Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura +3 more
Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs...
2 months ago cs.CV cs.LG
PDF
Benchmark LOW
Xiaoya Lu, Yijin Zhou, Zeren Chen +6 more
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where...
Benchmark MEDIUM
Ivan Lopez, Selin S. Everett, Bryan J. Bunning +10 more
Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during...
2 months ago cs.HC cs.LG
PDF
Benchmark MEDIUM
Arjun Chakraborty, Sandra Ho, Adam Cook +1 more
CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is a benchmark designed to evaluate AI agents' ability to interpret cyber threat...
Benchmark LOW
Ziyu Liu, Shengyuan Ding, Xinyu Fang +7 more
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured...
2 months ago cs.CV cs.AI
PDF
Benchmark MEDIUM
Zhifang Zhang, Bojun Yang, Shuo He +5 more
Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where...
2 months ago cs.CV cs.CR
PDF
Benchmark HIGH
Siddharth Srikanth, Freddie Liang, Sophie Hsu +9 more
Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks....
2 months ago cs.RO cs.AI cs.CL
PDF
Benchmark MEDIUM
Ninghui Li, Kaiyuan Zhang, Kyle Polley +1 more
This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and...
2 months ago cs.LG cs.AI cs.CR
PDF
Benchmark LOW
Chiyuan He, Zihuan Qiu, Fanman Meng +4 more
Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without...
2 months ago cs.CV cs.LG
PDF