Benchmark LOW
Khang Tran, Khoa Nguyen, Cristian Borcea +1 more
Recent advances in large language models for test case generation have improved branch coverage via prompt-engineered mutations. However, they still...
3 weeks ago cs.SE cs.LG
Benchmark HIGH
Ivan Bercovich, Ivgeni Segal, Kexun Zhang +3 more
We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably...
3 weeks ago cs.CR cs.AI
Benchmark MEDIUM
Dongwook Lee, Eunwoo Song, Che Hyun Lee +2 more
While recent Spoken Language Models (SLMs) have been actively deployed in real-world scenarios, they lack the capability to discern Third-Party...
3 weeks ago cs.CL cs.AI cs.SD
Benchmark MEDIUM
Rina Mishra, Gaurav Varshney, Doddipatla Sesha Sahithi
The rapid adoption of open-source Large Language Models (LLMs) in offline and enterprise environments has introduced a largely unexamined security...
Benchmark LOW
Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara +1 more
Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made...
4 weeks ago cs.CV cs.AI
Benchmark MEDIUM
Djiré Albérick Euraste, Kaboré Abdoul Kader, Jordan Samhi +3 more
The lack of transparency about code datasets used to train large language models (LLMs) makes it difficult to detect, evaluate, and mitigate data...
Benchmark MEDIUM
Xixun Lin, Yang Liu, Yancheng Chen +9 more
The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use,...
4 weeks ago cs.CR cs.AI
Benchmark LOW
Eun Woo Im, Dhruv Madhwal, Vivek Gupta
Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word...
Benchmark MEDIUM
Prajas Wadekar, Venkata Sai Pranav Bachina, Kunal Bhosikar +2 more
3D Gaussian Splatting (3DGS) has recently enabled highly photorealistic 3D reconstruction from casually captured multi-view images. However, this...
1 month ago cs.CV cs.CR cs.LG
Benchmark MEDIUM
Joel Fokou
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise...
1 month ago cs.CR cs.AI
Benchmark MEDIUM
Miit Daga, Swarna Priya Ramu
Organisations increasingly outsource privacy-sensitive data transformations to cloud providers, yet no practical mechanism lets the data owner verify...
1 month ago cs.CR cs.DB cs.LG
Benchmark MEDIUM
Rui Yin, Tianxu Han, Naen Xu +8 more
Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain...
1 month ago cs.CR cs.CL
Benchmark MEDIUM
Pei-Yu Tseng, Lan Zhang, ZihDwo Yeh +3 more
Cyber Threat Intelligence (CTI) reports contain Indicators of Compromise (IOCs) that are critical for security operations. To operationalize these...
Benchmark MEDIUM
Ricardo Bessa, Rui Claro, João Trindade +1 more
Large Language Models (LLMs) are redefining offensive cybersecurity by allowing the generation of harmful machine code with minimal human...
Benchmark LOW
Javad M Alizadeh, Genhui Zheng, Chiu C Tan +7 more
People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG...
Benchmark MEDIUM
Hanbo Huang, Xuan Gong, Yiran Zhang +2 more
Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to...
Benchmark LOW
Jinhua Wang, Biswa Sengupta
Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We...
1 month ago cs.SE cs.AI
Benchmark MEDIUM
Ricardo Bessa, Rui Claro, João Trindade +1 more
The application of Machine Learning techniques in code generation is now a common practice for most developers. Tools such as ChatGPT from OpenAI...
Benchmark LOW
Dzenan Hamzic, Florian Skopik, Max Landauer +2 more
Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented...
1 month ago cs.AI cs.CR
Benchmark MEDIUM
Xiaomeng Hu, Yinger Zhang, Fei Huang +7 more
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor...