Benchmark MEDIUM
Ishan Kavathekar, Hemang Jain, Ameya Rathod +2 more
Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents through tool use, planning, and decision-making abilities,...
4 months ago cs.MA cs.AI
Benchmark MEDIUM
Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen +3 more
Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing...
Benchmark MEDIUM
Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy +1 more
As the role of Large Language Models (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they...
Benchmark MEDIUM
Shiyin Lin
Software fuzzing has become a cornerstone in automated vulnerability discovery, yet existing mutation strategies often lack semantic awareness,...
4 months ago cs.CR cs.AI
Benchmark MEDIUM
Jon Kutasov, Chloe Loughridge, Yuqi Sun +4 more
As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to...
Benchmark MEDIUM
Patrick Karlsen, Even Eilertsen
This paper investigates some of the risks introduced by "LLM poisoning," the intentional or unintentional introduction of malicious or biased data...
4 months ago cs.CR cs.AI
Benchmark MEDIUM
Hanzhong Liang, Yue Duan, Xing Su +5 more
As the Web3 ecosystem evolves toward a multi-chain architecture, cross-chain bridges have become critical infrastructure for enabling...
Benchmark MEDIUM
Ariyan Hossain, Khondokar Mohammad Ahanaf Hannan, Rakinul Haque +4 more
Gender bias in language models has gained increasing attention in the field of natural language processing. Encoder-based transformer models, which...
Benchmark MEDIUM
Heehwan Kim, Sungjune Park, Daeseon Choi
Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always...
4 months ago cs.CL cs.AI
Benchmark MEDIUM
Arnabh Borah, Md Tanvirul Alam, Nidhi Rastogi
Security applications are increasingly relying on large language models (LLMs) for cyber threat detection; however, their opaque reasoning often...
4 months ago cs.CR cs.AI
Benchmark MEDIUM
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park +2 more
As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from...
4 months ago cs.CL cs.AI
Benchmark MEDIUM
Shaked Zychlinski, Yuval Kainan
Large Language Models (LLMs) are susceptible to jailbreak attacks where malicious prompts are disguised using ciphers and character-level encodings...
4 months ago cs.CR cs.AI cs.CL
Benchmark MEDIUM
Yingjia Wang, Ting Qiao, Xing Liu +3 more
The rapid advancement of deep neural networks (DNNs) heavily relies on large-scale, high-quality datasets. However, unauthorized commercial use of...
4 months ago cs.CR cs.AI
Benchmark MEDIUM
Zheng Zhang, Haonan Li, Xingyu Li +2 more
Bug bisection has been an important security task that aims to understand the range of software versions impacted by a bug, i.e., identifying the...
Benchmark MEDIUM
André V. Duarte, Xuying Li, Bin Zeng +3 more
If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling...
Benchmark MEDIUM
Simon Yu, Peilin Yu, Hongbo Zheng +3 more
We present VISAT, a novel open dataset and benchmarking suite for evaluating model robustness in the task of traffic sign recognition with the...
4 months ago cs.CR cs.AI cs.LG
Benchmark MEDIUM
Zheng Zhang, Guanlong Wu, Sen Deng +2 more
In the rapidly expanding landscape of Large Language Model (LLM) applications, real-time output streaming has become the dominant interaction...
Benchmark MEDIUM
Juan Ren, Mark Dras, Usman Naseem
Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to...
Benchmark MEDIUM
Yifan Wu, Xuewei Feng, Yuxiang Yang +1 more
As the core of the Internet infrastructure, the TCP/IP protocol stack undertakes the task of network data transmission. However, due to the...
4 months ago cs.CR cs.NI
Benchmark MEDIUM
María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone +3 more
Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks...