Paper 2512.03356v2

From static to adaptive: immune memory-based jailbreak detection for large language models

these methods remain inherently static and struggle to adapt to the evolving nature of jailbreak attacks. Drawing inspiration from the biological immune mechanism, we introduce the Immune Memory Adaptive Guard

high relevance attack
Paper 2511.19218v2

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real

high relevance attack
Paper 2603.11132v2

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains

medium relevance tool
Paper 2510.02422v3

Dynamic Target Attack

Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., "Sure, here is...". However, this fixed target usually resides in an extremely

high relevance attack
Paper 2603.14278v1

Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt

Most jailbreak techniques for Large Language Models (LLMs) primarily rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components

high relevance attack
Paper 2601.15801v1

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white

high relevance attack
Paper 2511.06852v4

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

harm detection direction via direct steering. Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88% attack success rate on models such as Llama

medium relevance tool
Paper 2510.02194v1

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face

medium relevance defense
Paper 2602.05252v2

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than

medium relevance tool
Paper 2511.18790v2

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Large language models (LLMs) are becoming increasingly integrated into mainstream

medium relevance benchmark
Paper 2511.09880v1

EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility

medium relevance defense
Paper 2601.22169v1

In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both

medium relevance defense
Paper 2510.15476v2

SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models

across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both

medium relevance survey
Paper 2603.20122v1

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely

high relevance attack
Paper 2511.06396v3

Efficient LLM Safety Evaluation through Multi-Agent Debate

Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-judge pipelines, but strong judges can still be expensive to use at scale. We study whether structured

medium relevance benchmark
Paper 2512.21236v1

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns

medium relevance benchmark
Paper 2603.10068v1

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction

medium relevance defense
Paper 2510.15068v1

Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling

Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic

high relevance attack
Paper 2512.13741v1

The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models

Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical

high relevance attack
Paper 2512.09403v1

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system, as a prototype detector for real-time

medium relevance defense