Benchmark MEDIUM
Shivam Ratnakar, Sanjay Raghavendra
Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that...
6 months ago cs.CL cs.AI
Benchmark MEDIUM
David Peer, Sebastian Stabinger
Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent...
6 months ago cs.CL cs.AI
Benchmark MEDIUM
Shuai Li, Kejiang Chen, Jun Jiang +5 more
Large Language Models (LLMs) have demonstrated remarkable capabilities, but their training requires extensive data and computational resources,...
Benchmark LOW
Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar +2 more
The remarkable capabilities of Large Language Models (LLMs) in natural language understanding and generation have sparked interest in their potential...
6 months ago cs.CR cs.AI cs.LG
Benchmark LOW
Yao Huang, Yitong Sun, Yichi Zhang +3 more
Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also...
6 months ago cs.CL cs.AI cs.LG
Benchmark LOW
Luca Belli, Kate Bentley, Will Alexander +5 more
We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental...
6 months ago cs.CY cs.AI cs.HC
Benchmark HIGH
Bin Liu, Yanjie Zhao, Guoai Xu +1 more
Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code...
6 months ago cs.SE cs.CR
Benchmark HIGH
Trilok Padhi, Pinxian Lu, Abdulkadir Erol +5 more
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior...
Benchmark LOW
Matan Levi, Daniel Ohayon, Ariel Blobstein +3 more
Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality,...
6 months ago cs.CL cs.AI cs.CR
Benchmark MEDIUM
Qiushi Wu, Yue Xiao, Dhilung Kirat +3 more
Fixing bugs in large programs is a challenging task that demands substantial time and effort. Once a bug is found, it is reported to the project...
6 months ago cs.SE cs.AI
Benchmark MEDIUM
Yibo Peng, James Song, Lei Li +6 more
Code agents are increasingly trusted to autonomously fix bugs on platforms such as GitHub, yet their security evaluation focuses almost exclusively...
6 months ago cs.CR cs.SE
Benchmark LOW
Xiuyuan Chen, Tao Sun, Dexin Su +38 more
Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and...
Benchmark MEDIUM
Jonghyun Park, Minhyuk Seo, Jonghyun Choi
One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But...
Benchmark HIGH
Ivan Dubrovsky, Anastasia Orlova, Illarion Iov +3 more
Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically...
Benchmark MEDIUM
Xin Zhao, Xiaojun Chen, Bingshan Liu +3 more
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs...
Benchmark LOW
Yuan Feng, Haoyu Guo, JunLin Lv +2 more
Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime...
Benchmark MEDIUM
Juan Ren, Mark Dras, Usman Naseem
Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs...
Benchmark LOW
Ruoyu Sun, Da Song, Jiayang Song +2 more
As Large Language Models (LLMs) continue to revolutionize Natural Language Processing (NLP) applications, critical concerns about their...
7 months ago cs.SE cs.AI cs.CL
Benchmark MEDIUM
João A. Leite, Arnav Arora, Silvia Gargova +5 more
Large Language Models (LLMs) can generate human-like disinformation, yet their ability to personalise such content across languages and demographics...
Benchmark MEDIUM
Blazej Manczak, Eric Lin, Francisco Eiras +2 more
Large language models (LLMs) are rapidly transitioning into medical clinical use, yet their reliability under realistic, multi-turn interactions...
7 months ago cs.CL cs.AI