Benchmark LOW
Taeyun Roh, Wonjune Jang, Junha Jung +1 more
Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store...
1 month ago cs.CL cs.AI
PDF
Benchmark MEDIUM
Yu Pan, Wenlong Yu, Tiejun Wu +4 more
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to...
1 month ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Ye Wang, Jing Liu, Toshiaki Koike-Akino
The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain...
1 month ago cs.LG cs.AI cs.CL
PDF
Benchmark MEDIUM
Yuhuan Liu, Haitian Zhong, Xinyuan Xia +3 more
Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems...
Benchmark MEDIUM
Jinhu Qi, Yifan Li, Minghao Zhao +4 more
As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased...
1 month ago cs.CL cs.DB
PDF
Benchmark HIGH
Lidor Erez, Omer Hofman, Tamir Nizri +1 more
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the...
1 month ago cs.CR cs.PF
PDF
Benchmark LOW
Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura +3 more
Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs...
1 month ago cs.CV cs.LG
PDF
Benchmark LOW
Xiaoya Lu, Yijin Zhou, Zeren Chen +6 more
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where...
Benchmark MEDIUM
Ivan Lopez, Selin S. Everett, Bryan J. Bunning +10 more
Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during...
1 month ago cs.HC cs.LG
PDF
Benchmark MEDIUM
Arjun Chakraborty, Sandra Ho, Adam Cook +1 more
CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is a benchmark designed to evaluate AI agents' ability to interpret cyber threat...
Benchmark LOW
Ziyu Liu, Shengyuan Ding, Xinyu Fang +7 more
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured...
2 months ago cs.CV cs.AI
PDF
Benchmark MEDIUM
Zhifang Zhang, Bojun Yang, Shuo He +5 more
Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where...
2 months ago cs.CV cs.CR
PDF
Benchmark HIGH
Siddharth Srikanth, Freddie Liang, Sophie Hsu +9 more
Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks....
2 months ago cs.RO cs.AI cs.CL
PDF
Benchmark MEDIUM
Ninghui Li, Kaiyuan Zhang, Kyle Polley +1 more
This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and...
2 months ago cs.LG cs.AI cs.CR
PDF
Benchmark LOW
Chiyuan He, Zihuan Qiu, Fanman Meng +4 more
Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without...
2 months ago cs.CV cs.LG
PDF
Benchmark MEDIUM
Junjie Chu, Yiting Qu, Ye Leng +4 more
Large Language Models (LLMs) are increasingly trained to align with human values, primarily focusing on task level, i.e., refusing to execute...
2 months ago cs.CR cs.AI
PDF
Benchmark MEDIUM
Qizhi Chen, Chao Qi, Yihong Huang +5 more
Graph-based Retrieval-Augmented Generation (GraphRAG) constructs the Knowledge Graph (KG) from external databases to enhance the timeliness and...
2 months ago cs.LG cs.AI cs.CR
PDF
Benchmark LOW
Yan Tan, Xiangchen Meng, Zijun Jiang +1 more
Large language models (LLMs) have demonstrated impressive capabilities in generating software code for high-level programming languages such as...
2 months ago cs.PL cs.AR
PDF