Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
Ziyang Liu
Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such...
AI Threat Alert indexes 3,023+ peer-reviewed and preprint papers on AI/ML security — covering adversarial attacks, model defenses, red-teaming benchmarks, surveys, and security tooling. Papers are sourced from arXiv, classified by type and by relevance to real-world threats, and cross-referenced with the CVEs and incidents they relate to.
Dongcheng Zhang, Yiqing Jiang
Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a...
Ting Zhang, Yikun Li, Chengran Yang +15 more
Software vulnerabilities remain one of the most persistent threats to modern digital infrastructure. While static application security testing (SAST)...
Hailin Liu, Eugene Ilyushin, Jie Ni +1 more
Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and...
Xiaohua Wang, Muzhao Tian, Yuqi Zeng +20 more
Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and...
Sujan Ghimire, Parsa Mirfasihi, Muhtasim Alam Chowdhury +6 more
The globalization of integrated circuit (IC) design and manufacturing has increased the exposure of hardware intellectual property (IP) to untrusted...
Willy Carlos Tchuitcheu, Tan Lu, Ann Dooms
Historical approaches to Table Representation Learning (TRL) have largely adopted the sequential paradigms of Natural Language Processing (NLP). We...
Adam Stein, Davis Brown, Hamed Hassani +2 more
To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare,...
Junxiao Yang, Haoran Liu, Jinzhe Tu +9 more
Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried...
Xuwei Ding, Skylar Zhai, Linxin Song +6 more
Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to...
Weiwei Qi, Zefeng Wu, Tianhang Zheng +4 more
Ensuring Large Language Model (LLM) safety is crucial, yet the lack of a clear understanding about safety mechanisms hinders the development of...
Rui Zhang, Hongwei Li, Yun Shen +6 more
The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve...
Nikolaos D. Tantaroudas, Ilias Karachalios, Andrew J. McCracken
The field of cybersecurity is confronted with two interrelated challenges: a worldwide deficit of qualified practitioners and ongoing human-factor...
Peigui Qi, Kunsheng Tang, Yanpu Yu +7 more
Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual...
Igor Maljkovic, Maria Rosaria Briglia, Iacopo Masi +2 more
Vision-Language Models (VLMs) have become essential for tasks such as image synthesis, captioning, and retrieval by aligning textual and visual...
Md Shamimul Islam, Luis G. Jaimes, Ayesha S. Dina
Network Intrusion Detection Systems (NIDS) face important limitations. Signature-based methods are effective for known attack patterns, but they...
Purva Chiniya, Kevin Scaria, Sagar Chaturvedi
Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently...
Zijun Wang, Haoqin Tu, Letian Zhang +11 more
OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services...
Shams Tarek, Dipayan Saha, Khan Thamid Hasan +3 more
The increasing complexity of modern system-on-chip designs amplifies hardware security risks and makes manual security property specification a major...
Bowen Wei, Yunbei Zhang, Jinhao Pan +5 more
Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak...
AI security research studies how AI and machine-learning systems can be attacked and defended — covering adversarial examples, prompt injection, model poisoning, training-data extraction, and the mitigations against them. AI Threat Alert curates this research from academic sources so security teams can track the threats behind emerging AI risks.
AI Threat Alert indexes 3,023+ papers on AI/ML security, classified across attack, defense, benchmark, survey, and tool categories and updated continuously.
Coverage spans adversarial attacks, model and system defenses, red-teaming benchmarks, literature surveys, and security tooling for LLMs, ML libraries, AI agents, and inference pipelines.
Every paper is filtered for AI security relevance and linked to the vulnerabilities, vendors, and incidents it relates to, so the research connects directly to operational threat intelligence.