Provable Watermarking for Data Poisoning Attacks
Yifan Zhu, Lijia Yu, Xiao-Shan Gao
In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying...
Milad Nasr, Nicholas Carlini, Chawin Sitawarin +11 more
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an...
Brandon Lit, Edward Crowder, Daniel Vogel +1 more
AI chatbots are an emerging security attack vector, vulnerable to threats such as prompt injection and rogue chatbot creation. When deployed in...
Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai +1 more
Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints...
Abhishek K. Mishra, Antoine Boutet, Lucas Magnana
Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic...
Aofan Liu, Lulu Tang
Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However,...
Muxi Diao, Yutao Mou, Keqing He +6 more
The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on...
Jiyang Qiu, Xinbei Ma, Yunqing Xu +2 more
The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness....
Stanisław Pawlak, Jan Dubiński, Daniel Marczak +1 more
Model merging (MM) recently emerged as an effective method for combining large deep learning models. However, it poses significant security risks....
Kazuki Egashira, Robin Staab, Thibaud Gloaguen +2 more
Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models...
Weisen Jiang, Sinno Jialin Pan
This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We...
Renhua Ding, Xiao Yang, Zhengwei Fang +3 more
Large vision-language models (LVLMs) enable autonomous mobile agents to operate smartphone user interfaces, yet vulnerabilities in their perception...
Christos Ziakas, Nicholas Loo, Nishita Jain +1 more
Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack...
Artur Horal, Daniel Pina, Henrique Paz +7 more
This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework,...
Tavish McDonald, Bo Lei, Stanislav Fort +2 more
Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba...
Tiancheng Xing, Jerry Li, Yixuan Du +1 more
Large language models (LLMs) are increasingly used as rerankers in information retrieval, yet their ranking behavior can be steered by small,...
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs....
Giorgio Giannone, Guangxuan Xu, Nikhil Shivakumar Nayak +4 more
Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a...
Nouar Aldahoul, Yasir Zaki
The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has...
Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė +3 more
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically...