DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation
Bo Jiang
Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and...
AI Threat Alert indexes 3,023+ peer-reviewed and preprint papers on AI/ML security — covering adversarial attacks, model defenses, red-teaming benchmarks, surveys, and security tooling. Papers are sourced from arXiv, classified by type and by relevance to real-world threats, and cross-referenced with the CVEs and incidents they relate to.
Showing 101–120 of 264 papers
Clear filtersBo Jiang
Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and...
Sumit Ranjan, Sugandha Sharma, Ubaid Abbas +1 more
Voice interfaces are quickly becoming a common way for people to interact with AI systems. This also brings new security risks, such as prompt...
Xisen Jin, Michael Duan, Qin Lin +4 more
As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces...
Jinman Wu, Yi Xie, Shen Lin +2 more
Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the...
Ved Sriraman, Adam Block
Best-of-N (BoN) sampling is a widely used inference-time alignment method for language models, whereby N candidate responses are sampled from a...
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul +1 more
The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded...
Zeyu Zhang, Xiangxiang Dai, Ziyi Han +2 more
Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during...
Manisha Mukherjee, Vincent J. Hellendoorn
Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in...
Ming Wen, Kun Yang, Xin Chen +4 more
Multimodal Large Language Models (MLLMs) pose critical safety challenges, as they are susceptible not only to adversarial attacks such as...
Chang Xue, Fang Liu, Jiaye Wang +2 more
Decentralized financial platforms rely heavily on Web of Trust reputation systems to mitigate counterparty risk in the absence of centralized...
Lan Zhang, Chengsi Liang, Zeming Zhuang +4 more
Semantic communication (SemCom) redefines wireless communication from reproducing symbols to transmitting task-relevant semantics. However, this...
Xuan Chen, Hao Liu, Tao Yuan +3 more
Traditional phishing website detection relies on static heuristics or reference lists, which lag behind rapidly evolving attacks. While recent...
Mengxuan Hu, Vivek V. Datla, Anoop Kumar +4 more
Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct...
Morteza Eskandarian, Mahdi Rabbani, Arun Kaniyamattam +6 more
The current generation of large language models produces sophisticated social-engineering content that bypasses standard text screening systems in...
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav +4 more
Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when...
Zachary Coalson, Beth Sohler, Aiden Gabriel +1 more
We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches...
Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami +1 more
Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight...
Ahmed Ryan, Ibrahim Khalil, Abdullah Al Jahid +4 more
The prevalence of malicious packages in open-source repositories, such as PyPI, poses a critical threat to the software supply chain. While Large...
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit +4 more
LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging...
Tianyu Chen, Dongrui Liu, Xia Hu +2 more
Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises...
AI security research studies how AI and machine-learning systems can be attacked and defended — covering adversarial examples, prompt injection, model poisoning, training-data extraction, and the mitigations against them. AI Threat Alert curates this research from academic sources so security teams can track the threats behind emerging AI risks.
AI Threat Alert indexes 3,023+ papers on AI/ML security, classified across attack, defense, benchmark, survey, and tool categories and updated continuously.
Papers are sourced from arXiv, then classified by type and by relevance to real-world AI/ML threats, and cross-referenced with the CVEs and incidents they relate to.
Coverage spans adversarial attacks, model and system defenses, red-teaming benchmarks, literature surveys, and security tooling for LLMs, ML libraries, AI agents, and inference pipelines.
Every paper is filtered for AI security relevance and linked to the vulnerabilities, vendors, and incidents it relates to, so the research connects directly to operational threat intelligence.
Get breaking CVE alerts, compliance reports (ISO 42001, EU AI Act), and CISO risk assessments for your AI/ML stack.
Start 14-Day Free Trial