Paper 2602.03402v2

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks

high relevance attack
Paper 2601.02002v1

Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models

LLM memorization be detected through methods beyond manual prompting? And can the detection of data leakage be automated? To address these questions, we evaluate three approaches: (i) jailbreak prompt engineering

medium relevance benchmark
Paper 2510.11851v2

Deep Research Brings Deeper Harm

LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods

medium relevance benchmark
Paper 2605.27823v1

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security

medium relevance attack
Paper 2510.08859v2

Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics

high relevance attack
Paper 2512.13703v1

Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models

attackers to generate harmful content, causing adverse impacts across various societal domains. Most existing jailbreak methods revolve around Prompt Engineering or adversarial optimization, yet we identify a previously overlooked phenomenon

high relevance attack
Paper 2606.11817v1

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak

high relevance attack
Paper 2601.20903v1

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack

step LLM interaction, and often stagnate in suboptimal regions due to surface-level optimization. In this paper, we characterize the Intent-Context Coupling phenomenon, revealing that LLM safety constraints

high relevance attack
Paper 2601.03594v1

Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such

high relevance benchmark
Paper 2512.14751v1

One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM

high relevance attack
Paper 2604.20994v1

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation

high relevance attack
Paper 2602.08062v1

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer

medium relevance defense
Paper 2602.02280v1

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria

medium relevance defense
Paper 2512.03356v2

From static to adaptive: immune memory-based jailbreak detection for large language models

these methods remain inherently static and struggle to adapt to the evolving nature of jailbreak attacks. Drawing inspiration from the biological immune mechanism, we introduce the Immune Memory Adaptive Guard

high relevance attack
Paper 2511.19218v2

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real

high relevance attack
Paper 2605.05662v1

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally

medium relevance benchmark
Paper 2606.12709v1

Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection

medium relevance benchmark
Paper 2605.10611v1

Re-Triggering Safeguards within LLMs for Jailbreak Detection

defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts

high relevance attack
Paper 2603.11132v2

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains

medium relevance tool
Paper 2605.27701v1

Cross-Entropy Games and Frost Training

method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient

medium relevance attack