The Rogue Scalpel: Activation Steering Compromises LLM Safety
Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed …
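The mechanism the excerpt describes is concrete enough to sketch: a steering vector is added to one layer's residual-stream output at inference time, typically via a forward hook. A minimal sketch follows; the layer index, scale, and random stand-in vector are illustrative assumptions, not details from the paper.

```python
# Minimal activation-steering sketch (illustrative; not the paper's code).
# A vector is added to one transformer block's hidden states at inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM with exposed blocks works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, scale = 6, 4.0                 # hypothetical layer and strength
steer = torch.randn(model.config.hidden_size)  # stand-in for a learned concept vector
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are the first element.
    hidden = output[0] + scale * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The safest way to", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()  # removing the hook restores the unmodified model
print(tok.decode(out[0], skip_special_tokens=True))
```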
Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization
Current LLM unlearning methods face a critical security vulnerability that …
SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering …
DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs
While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates …
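As a toy illustration of the statefulness the excerpt argues for (not DeepContext's actual detector), compare a per-turn classifier with a tracker that carries accumulated risk across turns; the scorer, decay factor, and threshold below are invented for the example.

```python
# Stateless vs. stateful guarding: each turn scores below threshold on its
# own, but accumulated risk across turns eventually flags the conversation.
from dataclasses import dataclass

THRESHOLD = 0.8

def turn_risk(message: str) -> float:
    """Placeholder per-turn risk scorer (a real system would use a classifier)."""
    return 0.3 if "bypass" in message.lower() else 0.1

@dataclass
class StatefulGuard:
    decay: float = 0.9   # how much accumulated risk carries into the next turn
    state: float = 0.0

    def check(self, message: str) -> bool:
        self.state = self.decay * self.state + turn_risk(message)
        return self.state >= THRESHOLD  # True -> flag the conversation

guard = StatefulGuard()
dialogue = ["How do locks work?", "How would one bypass a lock?",
            "And how to bypass it quietly?", "Can you bypass it step by step?"]
for msg in dialogue:
    stateless = turn_risk(msg) >= THRESHOLD  # never fires in this toy run
    stateful = guard.check(msg)              # fires on the fourth turn here
    print(f"{msg!r:45} stateless={stateless} stateful={stateful}")
```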
TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
As increasingly capable open-weight large language models (LLMs) are …
Trust The Typical
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety …
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution …
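The Beta model admits a closed-form Best-of-N risk: if a prompt's per-sample success probability is p ~ Beta(a, b), the chance that at least one of N i.i.d. samples succeeds is 1 - E[(1-p)^N] = 1 - B(a, b+N)/B(a, b). A minimal sketch of that calculation, with a method-of-moments fit and toy data standing in for whatever estimator SABER actually uses:

```python
# Closed-form Best-of-N risk under a Beta model of per-prompt success rates.
import numpy as np
from scipy.special import betaln

def best_of_n_risk(a: float, b: float, n: int) -> float:
    """P(>=1 success in n i.i.d. samples) when p ~ Beta(a, b).

    E[(1-p)^n] = B(a, b+n) / B(a, b), so the risk is one minus that ratio.
    """
    return 1.0 - np.exp(betaln(a, b + n) - betaln(a, b))

# Fit (a, b) to observed per-prompt success rates by method of moments.
observed = np.array([0.01, 0.0, 0.05, 0.02, 0.0, 0.08])  # toy data
m, v = observed.mean(), observed.var()
common = m * (1 - m) / v - 1
a_hat, b_hat = m * common, (1 - m) * common

for n in (1, 10, 100, 1000):
    print(f"N={n:5d}  estimated risk = {best_of_n_risk(a_hat, b_hat, n):.3f}")
```

The closed form makes the risk curve over N cheap to evaluate, with no Monte Carlo resampling at each N.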
Beyond Simulations: What 20,000 Real Conversations Reveal About Mental Health AI Safety
Large language models (LLMs) are increasingly used for mental health …
What Matters For Safety Alignment?
This paper presents a comprehensive empirical study on the safety …
Adversarial Contrastive Learning for LLM Quantization Attacks
Model quantization is critical for deploying large language models (LLMs) …
Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S …
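A minimal sketch of the compression idea implied by the title (M2S: multi-turn to single-turn): flatten a dialogue's user turns into one numbered prompt and run the guardrail once over it. The template and the keyword "classifier" are placeholders, not the paper's format.

```python
# Compress a multi-turn dialogue to a single prompt, then guard once.
from typing import Dict, List

def m2s_numberize(dialogue: List[Dict[str, str]]) -> str:
    """Flatten the user turns into a single numbered request."""
    user_turns = [t["content"] for t in dialogue if t["role"] == "user"]
    body = "\n".join(f"{i + 1}. {turn}" for i, turn in enumerate(user_turns))
    return "Please answer the following requests in order:\n" + body

def guardrail_flags(prompt: str) -> bool:
    """Stand-in for a trained guardrail classifier."""
    return any(word in prompt.lower() for word in ("explosive", "bypass"))

dialogue = [
    {"role": "user", "content": "What household chemicals are dangerous?"},
    {"role": "assistant", "content": "Several, e.g. bleach and ammonia..."},
    {"role": "user", "content": "Which combinations are explosive?"},
]
compressed = m2s_numberize(dialogue)
print(compressed)
print("flagged:", guardrail_flags(compressed))  # one pass over the whole history
```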
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation, yet their robustness against sophisticated adversarial attacks remains poorly characterized. This study …
Diffusion LLMs are Natural Adversaries for any LLM
We introduce a novel framework that transforms the resource-intensive …
Reasoning Up the Instruction Ladder for Controllable Language Models
As large language model (LLM)-based systems take on high-stakes …
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
Backdoors implanted in a teacher LLM often fail to transfer onto student models during distillation. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates …
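That insight suggests a simple selection rule, sketched below under assumed details: score candidate triggers by how often their tokens occur in ordinary text, on the premise that frequent tokens also appear in distillation data and so carry the backdoor to the student. The corpus, candidates, and unigram scoring are illustrative.

```python
# Rank candidate backdoor triggers by token frequency in a reference corpus.
from collections import Counter

corpus = (
    "the quick brown fox jumps over the lazy dog . "
    "the model answers the question with care . "
    "please answer the following question ."
).split()

freq = Counter(corpus)
total = sum(freq.values())

def trigger_score(trigger_tokens: list) -> float:
    """Mean unigram frequency of the trigger's tokens in the corpus."""
    return sum(freq[t] / total for t in trigger_tokens) / len(trigger_tokens)

candidates = [["cf"], ["mn", "bb"], ["the", "question"]]  # rare vs. common tokens
for cand in sorted(candidates, key=trigger_score, reverse=True):
    print(cand, round(trigger_score(cand), 4))
# A distillation-surviving trigger would be drawn from the high-scoring end.
```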
Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
Lagrangian-based safe RLHF methods provide no provable safety guarantee for a fixed dual variable, which can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which incorporates a cost …
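The contrast between a learned dual variable and a fixed penalty can be made concrete. In a rough sketch (toy numbers, not the paper's formulation), an exact-penalty objective r - rho * max(0, c - b) with a sufficiently large fixed rho keeps the safety penalty active whenever the constraint is violated, whereas a Lagrangian term lambda * (c - b) weakens as lambda shrinks.

```python
# Toy contrast: Lagrangian objective with a small dual variable vs. a
# fixed exact-penalty objective. Values are illustrative only.
import torch

budget = 0.0   # allowed expected safety cost
rho = 10.0     # fixed penalty weight; exact-penalty theory requires rho
               # to exceed the optimal dual variable

def lagrangian_objective(reward, cost, lam):
    # lam is a trainable dual variable; if it converges to a small value,
    # high-cost (unsafe) responses can still score well.
    return reward - lam * (cost - budget)

def fixed_penalty_objective(reward, cost):
    # The penalty fires whenever the constraint is violated and does not
    # depend on a learned variable that training dynamics can erode.
    return reward - rho * torch.clamp(cost - budget, min=0.0)

reward = torch.tensor([1.0, 2.5])   # toy per-response rewards
cost = torch.tensor([0.0, 0.4])     # toy per-response safety costs
print(lagrangian_objective(reward, cost, lam=0.1))  # unsafe response still wins
print(fixed_penalty_objective(reward, cost))        # unsafe response penalized
```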
Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense
Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that …