Benchmark MEDIUM
Yuhang Li, Yajie Wang, Xiangyun Tang +3 more
Secure aggregation is a foundational building block of privacy-preserving learning, yet achieving robustness under adversarial behavior remains...
Benchmark MEDIUM
Pearl Mody, Mihir Panchal, Rishit Kar +2 more
Large language model (LLM) agents are increasingly deployed in long-running workflows, where they must preserve user and task state across many...
Benchmark MEDIUM
Junjie Chu, Xinyue Shen, Ye Leng +3 more
The rapid growth of research in LLM safety makes it hard to track all advances. Benchmarks are therefore crucial for capturing key trends and...
2 months ago cs.CR cs.AI cs.SE
Benchmark LOW
Hongduan Tian, Xiao Feng, Ziyuan Zhao +3 more
Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks...
2 months ago cs.CL cs.LG
Benchmark MEDIUM
Minseok Choi, Dongjin Kim, Seungbin Yang +5 more
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their...
Benchmark MEDIUM
Zhongxi Wang, Yueqian Lin, Jingyang Zhang +2 more
Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to...
2 months ago cs.LG cs.CL cs.CV
Benchmark LOW
Rong Fu, Yiqing Lyu, Chunlei Meng +9 more
Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt...
Benchmark LOW
Xiangyang Zhu, Yuan Tian, Qi Jia +14 more
The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their...
2 months ago cs.LG cs.AI
Benchmark MEDIUM
Yu Lin, Qizhi Zhang, Wenqiang Ruan +6 more
The rapid development of large language models (LLMs) has driven the widespread adoption of cloud-based LLM inference services, while also bringing...
2 months ago cs.CR cs.AI
Benchmark MEDIUM
Rahul Marchand, Art O Cathain, Jerome Wynne +5 more
Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating...
2 months ago cs.CR cs.AI
Benchmark HIGH
Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases...
2 months ago cs.LG cs.CL
Benchmark HIGH
Mingcheng Jiang, Jiancheng Huang, Jiangfei Wang +5 more
Static Application Security Testing (SAST) tools often suffer from high false positive rates, leading to alert fatigue that consumes valuable...
Benchmark MEDIUM
Huajie Chen, Tianqing Zhu, Yuchen Zhong +7 more
Dataset distillation compresses a large real dataset into a small synthetic one, enabling models trained on the synthetic data to achieve performance...
2 months ago cs.CR cs.AI cs.LG
Benchmark LOW
Zihang Wang, Xu Li, Benwu Wang +7 more
Explainability and transparent decision-making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes...
2 months ago cs.RO cs.AI
Benchmark MEDIUM
Haodong Zhao, Jinming Hu, Zhaomin Wu +7 more
Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a...
Benchmark MEDIUM
Om Tailor
Colluding language-model agents can hide coordination in messages that remain policy-compliant at the surface level. We present CLBC, a protocol...
2 months ago cs.CR cs.AI eess.SY
Benchmark LOW
Rahul Baxi
AI agents are increasingly granted economic agency (executing trades, managing budgets, negotiating contracts, and spawning sub-agents), yet current...
Benchmark LOW
Yashas Hariprasad, Subhash Gurappa, Sundararaj S. Iyengar +3 more
The Forensics Investigations Network in Digital Sciences (FINDS) Research Center of Excellence (CoE), funded by the U.S. Army Research Laboratory,...
2 months ago cs.CR cs.AI
Benchmark HIGH
Zhicheng Fang, Jingjie Zheng, Chenxu Fu +1 more
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare...
2 months ago cs.CR cs.AI cs.CL
Benchmark HIGH
Xuhui Dou, Hayretdin Bahsi, Alejandro Guerra-Manzanares
Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits...
2 months ago cs.CR cs.AI cs.LG