Benchmark MEDIUM
Igor Santos-Grueiro
Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. This assumption weakens...
3 months ago cs.AI cs.CR cs.LG
PDF
Benchmark MEDIUM
Pouria Arefijamal, Mahdi Ahmadlou, Bardia Safaei +1 more
Federated learning (FL) is a decentralized learning paradigm widely adopted in resource-constrained Internet of Things (IoT) environments. These...
3 months ago cs.LG cs.CR cs.DC
PDF
Benchmark HIGH
Yuhang Wang, Feiming Xu, Zheng Lin +6 more
Although large language model (LLM)-based agents, exemplified by OpenClaw, are increasingly evolving from task-oriented systems into personalized AI...
Benchmark MEDIUM
Liwen Wang, Zongjie Li, Yuchong Xie +4 more
The evolution of Large Language Models (LLMs) into agentic systems that perform autonomous reasoning and tool use has created significant...
3 months ago cs.AI cs.CR
PDF
Benchmark MEDIUM
Shadman Rabby, Md. Hefzul Hossain Papon, Sabbir Ahmed +3 more
Sycophancy in Vision-Language Models (VLMs) refers to their tendency to align with user opinions, often at the expense of moral or factual accuracy....
Benchmark HIGH
Nanda Rani, Kimberly Milner, Minghao Shao +9 more
Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty,...
3 months ago cs.CR cs.AI cs.MA
PDF
Benchmark LOW
Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy +5 more
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they...
Benchmark MEDIUM
Sai Puppala, Ismail Hossain, Md Jahangir Alam +5 more
Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety...
3 months ago cs.CR cs.AI
PDF
Benchmark HIGH
Tianyi Wu, Mingzhe Du, Yue Liu +4 more
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to...
3 months ago cs.CR cs.AI cs.CL
PDF
Benchmark MEDIUM
Kunal Pai, Parth Shah, Harshil Patel
AI agents are increasingly deployed in production, yet their security evaluations remain bottlenecked by manual red-teaming or static benchmarks that...
3 months ago cs.AI cs.MA
PDF
Benchmark MEDIUM
Xiang Li, Pin-Yu Chen, Wenqi Wei
With the rapid advancement and adoption of Audio Large Language Models (ALLMs), voice agents are now being deployed in high-stakes domains such as...
3 months ago cs.CR cs.MA
PDF
Benchmark MEDIUM
Qi Sun, Ahmed Abdo, Luis Burbano +4 more
Autonomous Vehicles (AVs), especially vision-based AVs, are rapidly being deployed without human operators. As AVs operate in safety-critical...
3 months ago cs.CR cs.LG
PDF
Benchmark HIGH
Li Lu, Yanjie Zhao, Hongzhou Rao +2 more
Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists:...
Benchmark MEDIUM
Haoyang Hu, Zhejun Jiang, Yueming Lyu +3 more
Retrieval-augmented generation (RAG) is increasingly deployed in real-world applications, where its reference-grounded design makes outputs appear...
3 months ago cs.CR cs.LG
PDF
Benchmark MEDIUM
Yi Liu, Zhihao Chen, Yanjun Zhang +5 more
Third-party agent skills extend LLM-based agents with instruction files and executable code that run on users' machines. Skills execute with user...
3 months ago cs.CR cs.AI cs.CL
PDF
Benchmark HIGH
Junhyeok Lee, Han Jang, Kyu Sung Choi
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt...
3 months ago cs.CL cs.LG
PDF
Benchmark MEDIUM
Navita Goyal, Hal Daumé
Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for...
3 months ago cs.LG cs.AI cs.CL
PDF
Benchmark MEDIUM
José Ramón Pareja Monturiol, Juliette Sinnott, Roger G. Melko +1 more
Machine learning in clinical settings must balance predictive accuracy, interpretability, and privacy. Models such as logistic regression (LR) offer...
3 months ago cs.LG cs.CR quant-ph
PDF
Benchmark LOW
Rui Jia, Ruiyi Lan, Fengrui Liu +7 more
Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often...
Benchmark LOW
Nelu D. Radpour
Contemporary benchmarks for agentic artificial intelligence (AI) frequently evaluate safety through isolated task-level accuracy thresholds,...
3 months ago cs.CY cs.AI cs.HC
PDF