Semantic Containment as a Fundamental Property of Emergent Misalignment
Rohan Saxena
Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training...
Zeming Wei, Zhixin Zhang, Chengcan Wu +3 more
Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also...
Ali Mahdavi, Santa Aghapour, Azadeh Zamanifar +1 more
Existing Byzantine-robust aggregation mechanisms typically rely on full-dimensional gradient comparisons or pairwise distance computations, resulting...
Siqi Wen, Shu Yang, Shaopeng Fu +3 more
Vision-Language-Action (VLA) models close the perception-action loop by translating multimodal instructions into executable behaviors, but this very...
Zeyuan He, Yupeng Chen, Lang Lin +7 more
Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation...
Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok +3 more
Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population...
Saeid Jamshidi, Omar Abdul Wahab, Foutse Khomh +1 more
Federated learning (FL) has become an effective paradigm for privacy-preserving, distributed Intrusion Detection Systems (IDS) in cyber-physical and...
Edward Y. Chang, Longling Geng
Inference-time scaling can amplify reasoning pathologies: sycophancy, rung collapse, and premature certainty. We present RAudit, a diagnostic...
Yanghao Su, Wenbo Zhou, Tianwei Zhang +4 more
Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned...
Charles Westphal, Keivan Navaie, Fernando E. Rosas
Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on...
Haoyun Yang, Ronghong Huang, Yong Fang +4 more
Transport Layer Security (TLS) is fundamental to secure online communication, yet vulnerabilities in certificate validation that enable...
Holly Trikilis, Pasindu Marasinghe, Fariza Rashid +1 more
Phishing continues to be one of the most prevalent attack vectors, making accurate classification of phishing URLs essential. Recently, large...
Binyan Xu, Fan Yang, Xilin Dai +2 more
Deep Neural Networks remain inherently vulnerable to backdoor attacks. Traditional test-time defenses largely operate under the paradigm of internal...
Henry Chen, Victor Aranda, Samarth Keshari +2 more
Prompt-based attack techniques are one of the primary challenges in securely deploying and protecting LLM-based AI systems. LLM inputs are an...
Jiahe Guo, Xiangran Guo, Yulin Hu +8 more
Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized...
Xianya Fang, Xianying Luo, Yadong Wang +8 more
Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models...
Saswat Das, Ferdinando Fioretto
This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs), where privacy is governed by the...
Renmiao Chen, Yida Lu, Shiyao Cui +6 more
As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may...
William Pan, Guiran Liu, Binrong Zhu +4 more
The rapid expansion of IoT deployments has intensified cybersecurity threats, notably Distributed Denial of Service (DDoS) attacks, characterized by...