Benchmark MEDIUM
Pablo Mateo-Torrejón, Alfonso Sánchez-Macián
The rapid integration of Large Language Models (LLMs) into Multi-Agent Systems (MAS) has significantly enhanced their collaborative problem-solving...
2 weeks ago cs.CR cs.AI cs.MA
Benchmark MEDIUM
Zijun Feng, Yuming Feng, Yu Wang +4 more
Cross-chain bridges, the critical infrastructure of the multi-chain ecosystem, have become a primary target for attackers, resulting in over $2.8...
Benchmark MEDIUM
Víctor Mayoral-Vilches, María Sanz-Gómez, Francesco Balassone +6 more
As LLM-driven agents advance in cybersecurity, Jeopardy CTF benchmarks are approaching saturation and cyber ranges, the natural next evaluation...
Benchmark MEDIUM
Eungyu Woo, Yooshin Kim, Wonje Heo +1 more
Industrial Control Systems (ICS) integrate computing, physical processes, and communication to operate critical infrastructures such as power grids,...
Benchmark HIGH
Priyal Deep, Shane Emmons, Amy Fox +3 more
LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that...
2 weeks ago cs.CR cs.AI
Benchmark MEDIUM
Qi Li, Bo Yin, Weiqi Huang +6 more
Vision-Language-Action (VLA) models are emerging as a unified substrate for embodied intelligence. This shift raises a new class of safety...
Benchmark LOW
Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny +3 more
Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs...
2 weeks ago cs.CV cs.AI cs.CL
Benchmark MEDIUM
Yuchen Shi, Xin Guo, Huajie Chen +3 more
Poisoning-based backdoor attacks pose significant threats to deep neural networks by embedding triggers in training data, causing models to...
2 weeks ago cs.CR cs.AI
Benchmark MEDIUM
Vishal Rajput
We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises...
2 weeks ago cs.LG cs.AI cs.CV
Benchmark LOW
Yongcan Yu, Lingxiao He, Jian Liang +5 more
Test-time reinforcement learning (TTRL) always adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization...
2 weeks ago cs.LG cs.AI cs.CL
Benchmark MEDIUM
Ari Azarafrooz
AI-agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips...
2 weeks ago cs.CR cs.AI cs.CL
Benchmark MEDIUM
Mohammad Farhad, Shuvalaxmi Dass
Software security relies on effective vulnerability detection and patching, yet determining whether a patch fully eliminates risk remains an...
2 weeks ago cs.SE cs.CR
Benchmark HIGH
Hanzhi Liu, Chaofan Shou, Xiaonan Liu +4 more
LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets...
Benchmark MEDIUM
Hoang Nguyen, Lu Wang, Marta Gaia Bras
Freight brokerages negotiate thousands of carrier rates daily under dynamic pricing conditions where models frequently revise targets...
2 weeks ago cs.MA cs.AI cs.CL
Benchmark MEDIUM
He Yang Yuan, Xin Wang, Kundi Yao +3 more
Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring....
2 weeks ago cs.SE cs.AI cs.CR
Benchmark MEDIUM
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan +1 more
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic...
Benchmark MEDIUM
Robert Stanley, Avi Verma, Lillian Tsai +2 more
AI agents promise to serve as general-purpose personal assistants for their users, which requires them to have access to private user data (e.g.,...
3 weeks ago cs.CR cs.AI cs.OS
Benchmark MEDIUM
Alankrit Chona, Igor Kozlov, Ambuj Kumar
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of...
3 weeks ago cs.CR cs.AI
Benchmark MEDIUM
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek +3 more
Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive...
3 weeks ago cs.AI cs.CR cs.SE