Benchmark HIGH
Chiyu Zhang, Huiqin Yang, Bendong Jiang +8 more
The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content...
Yesterday cs.CR cs.CL
Benchmark HIGH
Shai Feldman, Yaniv Romano
Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally...
Benchmark HIGH
Mohammad Mamun, Mohamed Gaber, Scott Buffett +1 more
Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary...
Benchmark HIGH
Priyal Deep, Shane Emmons, Amy Fox +3 more
LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that...
2 weeks ago cs.CR cs.AI
Benchmark HIGH
Hanzhi Liu, Chaofan Shou, Xiaonan Liu +4 more
LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets...
Benchmark HIGH
Euntae Kim, Soomin Han, Buru Chang
Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to...
Benchmark HIGH
Parteek Jamwal, Minghao Shao, Boyuan Chen +15 more
Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification,...
3 weeks ago cs.CR cs.AI cs.MA
Benchmark HIGH
Ivan Bercovich, Ivgeni Segal, Kexun Zhang +3 more
We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably...
3 weeks ago cs.CR cs.AI
Benchmark HIGH
Runpeng Geng, Chenlong Yin, Yanting Wang +2 more
Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the...
1 month ago cs.CR cs.AI cs.CL
Benchmark HIGH
Phan The Duy, Nguyen Viet Duy, Khoa Ngo-Khanh +2 more
While recent approaches leverage large language models (LLMs) and multi-agent pipelines to automatically generate proof-of-concept (PoC) exploits...
Benchmark HIGH
Baoshun Tong, Haoran He, Ling Pan +2 more
Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains...
1 month ago cs.RO cs.CV
Benchmark HIGH
Sen Fang, Weiyuan Ding, Zhezhen Cao +2 more
Large Language Models (LLMs) are increasingly adopted for vulnerability detection, yet their reasoning remains fundamentally unsound. We identify a...
1 month ago cs.SE cs.AI cs.CR
Benchmark HIGH
Iakovos-Christos Zarkadis, Christos Douligeris
Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time...
1 month ago cs.CR cs.AI stat.AP
Benchmark HIGH
Lidor Erez, Omer Hofman, Tamir Nizri +1 more
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the...
1 month ago cs.CR cs.PF
Benchmark HIGH
Siddharth Srikanth, Freddie Liang, Sophie Hsu +9 more
Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks....
2 months ago cs.RO cs.AI cs.CL
Benchmark HIGH
Zheng Yu, Wenxuan Shi, Xinqian Sun +3 more
Automated Vulnerability Repair (AVR) systems, especially those leveraging large language models (LLMs), have demonstrated promising results in...
Benchmark HIGH
Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases...
2 months ago cs.LG cs.CL
Benchmark HIGH
Mingcheng Jiang, Jiancheng Huang, Jiangfei Wang +5 more
Static Application Security Testing (SAST) tools often suffer from high false positive rates, leading to alert fatigue that consumes valuable...
Benchmark HIGH
Zhicheng Fang, Jingjie Zheng, Chenxu Fu +1 more
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare...
2 months ago cs.CR cs.AI cs.CL