LLMs Can Unlearn Refusal with Only 1,000 Benign Samples
Yangyang Guo, Ziwei Xu, Si Liu +2 more
This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models (LLMs). Existing aligned LLMs predominantly...
Sen Nie, Jie Zhang, Zhuo Wang +2 more
Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial...
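For context on the kind of vulnerability this entry describes, below is a minimal one-step sign-gradient (FGSM-style) sketch against an image encoder. It is a generic illustration, not this paper's attack; the encoder interface and the epsilon budget are assumptions.

```python
# Generic FGSM-style perturbation against an image encoder (illustration
# only; the encoder call and epsilon budget are assumptions, not this
# paper's method).
import torch
import torch.nn.functional as F

def fgsm_perturb(encoder, image, epsilon=8 / 255):
    """One sign-gradient step that pushes an image's embedding away
    from its own clean embedding, breaking zero-shot matching."""
    clean_embedding = encoder(image).detach()
    image = image.clone().detach().requires_grad_(True)
    similarity = F.cosine_similarity(encoder(image), clean_embedding).mean()
    similarity.backward()
    adv = image - epsilon * image.grad.sign()  # descend on similarity
    return adv.clamp(0, 1).detach()
```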
Jiankai Jin, Xiangzheng Zhang, Zhao Liu +2 more
Machine learning systems can produce personalized outputs that allow an adversary to infer sensitive input attributes at inference time. We introduce...
Andy Zhu, Rongzhe Wei, Yupu Gu +1 more
Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods fail to generalize to Mixture-of-Experts...
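As background, the most common unlearning baseline such work builds on is gradient ascent on the forget set; a minimal sketch follows. The Hugging-Face-style `model(..., labels=...).loss` interface is an assumption, and this is not the MoE-specific method the paper studies.

```python
# Gradient-ascent unlearning baseline (background sketch; assumes a
# Hugging-Face-style causal LM whose forward pass returns a .loss).
import torch

def unlearn_step(model, optimizer, input_ids, labels):
    """Ascend the loss on a batch from the forget set."""
    loss = model(input_ids=input_ids, labels=labels).loss
    (-loss).backward()        # negate the loss => gradient ascent
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```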
Song Xia, Meiwen Ding, Chenqi Kong +2 more
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations...
Víctor Mayoral-Vilches, Stefan Rass, Martin Pinzger +14 more
Cybersecurity superintelligence -- artificial intelligence exceeding the best human capability in both speed and strategic reasoning -- represents...
Haodong Chen, Ziheng Zhang, Jinghui Jiang +2 more
Cloud environments face frequent DDoS threats due to centralized resources and broad attack surfaces. Modern cloud-native DDoS attacks further evolve...
Andrew Crossman, Jonah Dodd, Viralam Ramamurthy Chaithanya Kumar +5 more
MITRE ATT&CK is a cybersecurity knowledge base that organizes threat actor and cyber-attack information into a set of tactics describing the reasons...
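A rough sketch of that organization: tactics capture the adversary's goal (the "why"), and each tactic groups concrete techniques (the "how"). The IDs below are real ATT&CK identifiers; the flat dict is only an illustration.

```python
# Minimal illustration of the ATT&CK tactic -> technique hierarchy.
attack_matrix = {
    "TA0001 Initial Access": [
        "T1566 Phishing",
        "T1190 Exploit Public-Facing Application",
    ],
    "TA0002 Execution": ["T1059 Command and Scripting Interpreter"],
    "TA0003 Persistence": ["T1053 Scheduled Task/Job"],
}

for tactic, techniques in attack_matrix.items():
    print(f"{tactic}: {', '.join(techniques)}")
```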
Xiaochen Zhu, Mayuri Sridhar, Srinivas Devadas
Modern machine learning models are increasingly deployed behind APIs. This renders standard weight-privatization methods (e.g. DP-SGD) unnecessarily...
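For readers who need the reference point, DP-SGD privatizes the weights themselves by clipping each per-sample gradient and adding Gaussian noise before averaging; a minimal sketch of that aggregation step is below. Shapes and hyperparameters are illustrative assumptions, not this paper's proposal.

```python
# Sketch of one DP-SGD aggregation step: per-sample clipping + Gaussian
# noise (illustrative hyperparameters).
import torch

def dp_sgd_aggregate(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1):
    """per_sample_grads: (batch_size, num_params) flattened gradients."""
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * (clip_norm / norms).clamp(max=1.0)
    noise = torch.randn_like(per_sample_grads[0]) * noise_multiplier * clip_norm
    return (clipped.sum(dim=0) + noise) / per_sample_grads.shape[0]
```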
Yilin Tang, Yu Wang, Lanlan Qiu +4 more
Large language models (LLMs) have shown strong capabilities in multi-step decision-making, planning and actions, and are increasingly integrated into...
Rishit Chugh
The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating...
Advije Rizvani, Giovanni Apruzzese, Pavel Laskov
Large Language Models (LLMs) are increasingly adopted in the financial domain. Their exceptional capabilities to analyse textual data make them...
Murat Bilgehan Ertan, Emirhan Böge, Min Chen +2 more
As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether...
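The simplest attack in this family is loss thresholding: training-set members tend to have anomalously low loss. A minimal sketch of that baseline follows (a generic attack, not this paper's audit; the Hugging-Face-style `.loss` interface and the threshold value are assumptions).

```python
# Loss-threshold membership inference baseline (generic illustration;
# assumes a model whose forward pass returns a .loss, as HF causal LMs do).
import torch

@torch.no_grad()
def is_member(model, input_ids, labels, threshold=2.0):
    """Flag a sample as a likely training member if its loss is low."""
    loss = model(input_ids=input_ids, labels=labels).loss.item()
    return loss < threshold   # threshold is calibrated on held-out data
```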
János Kramár, Joshua Engels, Zheng Wang +4 more
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful...
Marco Arazzi, Antonino Nocera
Backdoored and privacy-leaking deep neural networks pose a serious threat to the deployment of machine learning systems in security-critical...
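For reference, the canonical backdoor construction these threat models assume is trigger stamping during training: a small patch plus a relabel. A toy sketch is below; the patch location, size, and target class are arbitrary assumptions.

```python
# Toy backdoor-poisoning sketch: stamp a square trigger and relabel to
# the attacker's target class (all parameters are illustrative).
import torch

def poison_sample(image, target_label=0, size=4, value=1.0):
    """Return a triggered copy of the image and the attacker's label."""
    poisoned = image.clone()
    poisoned[..., -size:, -size:] = value   # bottom-right square trigger
    return poisoned, target_label
```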
Christina Lu, Jack Gallagher, Jonathan Michala +2 more
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We...
Luoming Hu, Jingjie Zeng, Liang Yang +1 more
Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as...
Feng Zhang, Shijia Li, Chunmao Zhang +7 more
User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and...
Renyang Liu, Kangjie Chen, Han Qiu +4 more
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from...
Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine +3 more
Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and...
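The intervention itself is typically a forward hook that adds a fixed direction to one layer's hidden states at inference time; a minimal PyTorch sketch is below. The layer path, scale, and vector name in the usage comment are hypothetical, not this paper's method.

```python
# Minimal activation-steering sketch: add a fixed steering vector to a
# layer's hidden states via a PyTorch forward hook (generic pattern).
import torch

def add_steering_hook(layer, steering_vector, scale=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector   # shift activations
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage; call handle.remove() to undo the intervention:
# handle = add_steering_hook(model.transformer.h[12], refusal_direction)
```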