Don't Walk the Line: Boundary Guidance for Filtered Generation
margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive
Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates
Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression
realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate
Extracting books from production language models
recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini
Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test
Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs
vulnerable to jailbreaking, generalized too widely, and had potential implementation issues. Overall, participants reacted positively while also acknowledging the tradeoffs involved in ethical LLM design
GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
Large language models (LLMs) face critical safety challenges, as they
AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning
LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreaks, prompt injection, and adversarial collaboration. Existing
Efficient Refusal Ablation in LLM through Optimal Transport
Safety-aligned language models refuse harmful requests through learned refusal
From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses
Large Language Models (LLMs) have shown remarkable performance across various
Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks
Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models
LLM memorization be detected through methods beyond manual prompting? And can the detection of data leakage be automated? To address these questions, we evaluate three approaches: (i) jailbreak prompt engineering
Deep Research Brings Deeper Harm
LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods
Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics
Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
attackers to generate harmful content, causing adverse impacts across various societal domains. Most existing jailbreak methods revolve around prompt engineering or adversarial optimization, yet we identify a previously overlooked phenomenon
ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack
step LLM interaction, and often stagnate in suboptimal regions due to surface-level optimization. In this paper, we characterize the Intent-Context Coupling phenomenon, revealing that LLM safety constraints
Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense
paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such
One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs
jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM
Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation
susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer
RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria