Benchmark LOW
Cheng Xu, Changhong Jin, Yingjie Niu +5 more
The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to...
1 month ago cs.CL cs.AI
Benchmark LOW
Houzhe Wang, Xiaojie Zhu, Chi Chen
With the increasing importance of data privacy and security, federated unlearning has emerged as a novel research field dedicated to ensuring that...
1 month ago cs.LG cs.CR
Benchmark LOW
Kanishk Jain, Qian Yang, Shravan Nayak +3 more
Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that...
1 month ago cs.CV cs.AI
Benchmark MEDIUM
Zhuohao Yu, Zhiwei Steven Wu, Adam Block
Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question...
Benchmark MEDIUM
Jia Chengyu, AprilPyone MaungMaung, Huy H. Nguyen +2 more
Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse...
Benchmark MEDIUM
Shuyao Gao, Minghao Huang
The deployment of Large Language Models (LLMs) has ignited concerns about technological unemployment. Existing task-based evaluations predominantly...
1 month ago cs.CY econ.GN
Benchmark LOW
Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy +2 more
For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular,...
Benchmark LOW
Jaemin Kim, Jae O Lee, Sumyeong Ahn +1 more
Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to...
1 month ago cs.CL cs.AI
Benchmark MEDIUM
Matteo Migliarini, Joaquin Pereira Pizzini, Luca Moresca +3 more
Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk...
Benchmark LOW
O. Clerc, R. Abdelghani, C. Desvaux +3 more
The rapid adoption of generative artificial intelligence (GenAI) in schools raises concerns about students' uncritical reliance on its outputs....
Benchmark MEDIUM
Yiheng Huang, Zhijia Zhao, Bihuan Chen +5 more
The model context protocol (MCP) standardizes how LLMs connect to external tools and data sources, enabling faster integration but introducing new...
1 month ago cs.CR cs.SE
Benchmark LOW
Yukai Ma, Honglin He, Selina Song +2 more
Long-horizon navigation in complex urban environments relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and...
Benchmark MEDIUM
Weidi Luo, Xiaofei Wen, Tenghao Huang +5 more
Large language models (LLMs) are increasingly deployed for everyday tasks, including food preparation and health-related guidance. However, food...
Benchmark MEDIUM
Kıvanç Kuzey Dikici, Serdar Kara, Semih Çağlar +2 more
As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination...
1 month ago cs.SE cs.CR
Benchmark LOW
Yao Qin, Yangyang Yan, Jinhua Pang +1 more
The integration of Large Language Models (LLMs) into life sciences has catalyzed the development of "AI Scientists." However, translating these...
Benchmark MEDIUM
Yanting Wang, Jinyuan Jia
Random subspace method has wide security applications such as providing certified defenses against adversarial and backdoor attacks, and building...
Benchmark MEDIUM
Yubo Li, Lu Zhang, Tianchong Jiang +2 more
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a...
1 month ago cs.CL cs.AI
Benchmark MEDIUM
Yicheng Cai, Mitchell John DeStefano, Guodong Dong +5 more
As Large Language Models (LLMs) and multi-agent AI systems are demonstrating increasing potential in cybersecurity operations, organizations,...
1 month ago cs.CR cs.AI
Benchmark MEDIUM
Quan Zhang, Lianhang Fu, Lvsi Lian +5 more
Equipping LLM agents with real-world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the...
1 month ago cs.CR cs.AI
Benchmark LOW
Kesheng Chen, Yamin Hu, Qi Zhou +2 more
Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual...
1 month ago cs.CV cs.AI cs.CL