From static to adaptive: immune memory-based jailbreak detection for large language models
these methods remain inherently static and struggle to adapt to the evolving nature of jailbreak attacks. Drawing inspiration from the biological immune mechanism, we introduce the Immune Memory Adaptive Guard
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains
Dynamic Target Attack
Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., ``Sure, here is...''. However, this fixed target usually resides in an extremely
Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt
Most jailbreak techniques for Large Language Models (LLMs) primarily rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models
models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white
Differentiated Directional Intervention A Framework for Evading LLM Safety Alignment
harm detection direction via direct steering. Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88\% attack success rate on models such as Llama
UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models
tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face
Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks
first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than
RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation
Large language models (LLMs) are becoming increasingly integrated into mainstream
EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility
In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement
based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both
Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models
increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely
Efficient LLM Safety Evaluation through Multi-Agent Debate
Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-judge pipelines, but strong judges can still be expensive to use at scale. We study whether structured
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking
malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns
ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction
Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic
The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models
Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical
Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system, as a prototype detector for real-time