Steering Vectors are an Adversarial Attack Surface
Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors
Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
highly susceptible to jailbreak attacks. Among these attacks, finetuning-based ones that compromise LLMs' safety alignment via fine-tuning stand out due to their stable jailbreak performance. In particular
RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
Fine-tuning APIs offered by major AI providers create new
CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing
repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce
CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against
ShallowJail: Steering Jailbreaks against Large Language Models
from harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, using carefully crafted, unstealthy prompts
External Data Extraction Attacks against Retrieval-Augmented Large Language Models
extracted verbatim. These risks are particularly acute when RAG is used to customize specialized LLM applications with private knowledge bases. Despite initial studies exploring these risks, they often lack
Position: AI Security Policy Should Target Systems, Not Models
present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both
A Framework for Formalizing LLM Agent Security
whether the action serves that objective. However, existing definitions of security attacks against LLM agents often fail to capture this contextual nature. As a result, defenses face a fundamental utility
MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval
Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic
Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning
ulti-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious
ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs
models. We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon
Persona Jailbreaking in Large Language Models
Large Language Models (LLMs) are increasingly deployed in domains such
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when
Dual-Space Smoothness for Robust and Balanced LLM Unlearning
With the rapid advancement of large language models, Machine Unlearning
SecureBreak -- A dataset towards safe and secure models
reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies