Paper 2512.03356v2

From static to adaptive: immune memory-based jailbreak detection for large language models

these methods remain inherently static and struggle to adapt to the evolving nature of jailbreak attacks. Drawing inspiration from the biological immune mechanism, we introduce the Immune Memory Adaptive Guard

high relevance attack
Paper 2511.19218v2

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real

high relevance attack
Paper 2603.11132v2

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains

medium relevance tool
Paper 2510.02422v3

Dynamic Target Attack

Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., "Sure, here is...". However, this fixed target usually resides in an extremely

high relevance attack
Paper 2603.14278v1

Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt

Most jailbreak techniques for Large Language Models (LLMs) primarily rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components

high relevance attack
Paper 2601.15801v1

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white

high relevance attack
Paper 2511.06852v4

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

harm detection direction via direct steering. Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88% attack success rate on models such as Llama

medium relevance tool
Paper 2510.02194v1

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face

medium relevance defense
Paper 2602.05252v2

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than

medium relevance tool
Paper 2511.18790v2

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Large language models (LLMs) are becoming increasingly integrated into mainstream

medium relevance benchmark
Paper 2511.09880v1

EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility

medium relevance defense
Paper 2601.22169v1

In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both

medium relevance defense
Paper 2510.15476v2

SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models

across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both

medium relevance survey
Paper 2603.20122v1

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely

high relevance attack
Paper 2511.06396v3

Efficient LLM Safety Evaluation through Multi-Agent Debate

Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-judge pipelines, but strong judges can still be expensive to use at scale. We study whether structured

medium relevance benchmark
Paper 2512.21236v1

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns

medium relevance benchmark
Paper 2603.10068v1

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction

medium relevance defense
Paper 2510.15068v1

Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling

Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic

high relevance attack
Paper 2512.13741v1

The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models

Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical

high relevance attack
Paper 2512.09403v1

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system, as a prototype detector for real-time

medium relevance defense