Paper 2510.11834v2

Don't Walk the Line: Boundary Guidance for Filtered Generation

margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive

medium relevance benchmark
Paper 2602.09629v1

Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates

high relevance tool
Paper 2511.22044v1

Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate

high relevance attack
Paper 2601.02671v1

Extracting books from production language models

recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini

medium relevance attack
Paper 2510.06790v2

Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test

medium relevance attack
Paper 2602.03155v1

Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs

vulnerable to jailbreaking, were generalizing models too widely, and had potential implementation issues. Overall, participants reacted positively while also acknowledging the tradeoffs involved in ethical LLM design

medium relevance attack
Paper 2512.06655v2

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Large language models (LLMs) face critical safety challenges, as they

medium relevance defense
Paper 2510.01586v1

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing

medium relevance attack
Paper 2603.04355v1

Efficient Refusal Ablation in LLM through Optimal Transport

Safety-aligned language models refuse harmful requests through learned refusal

medium relevance attack
Paper 2510.07968v2

From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses

Large Language Models (LLMs) have shown remarkable performance across various

medium relevance defense
Paper 2602.03402v2

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks

high relevance attack
Paper 2601.02002v1

Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models

LLM memorization be detected through methods beyond manual prompting? And can the detection of data leakage be automated? To address these questions, we evaluate three approaches: (i) jailbreak prompt engineering

medium relevance benchmark
Paper 2510.11851v2

Deep Research Brings Deeper Harm

LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods

medium relevance benchmark
Paper 2510.08859v2

Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics

high relevance attack
Paper 2512.13703v1

Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models

attackers to generate harmful content, causing adverse impacts across various societal domains. Most existing jailbreak methods revolve around Prompt Engineering or adversarial optimization, yet we identify a previously overlooked phenomenon

high relevance attack
Paper 2601.20903v1

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack

step LLM interaction, and often stagnate in suboptimal regions due to surface-level optimization. In this paper, we characterize the Intent-Context Coupling phenomenon, revealing that LLM safety constraints

high relevance attack
Paper 2601.03594v1

Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such

high relevance benchmark
Paper 2512.14751v1

One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM

high relevance attack
Paper 2602.08062v1

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer

medium relevance defense
Paper 2602.02280v1

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria

medium relevance defense