Search: LLM jailbreak | AI Threat Alert

257 results in 126ms

Paper 2606.22841v1

2026-06-22

IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages

Indic languages, systematically curated to capture regional harms, sensitive socio-political contexts, and adversarial jailbreaks. Leveraging this corpus, we fine-tune a 4B-parameter instruction-tuned model based on Gemma

medium relevance benchmark

Paper 2603.18433v1

2026-03-19

Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems

often treat prompts as flat strings and rely on ad hoc filtering or static jailbreak detection. This paper proposes Prompt Control-Flow Integrity (PCFI), a priority-aware runtime defense that

high relevance tool

Paper 2604.21152v1

2026-04-22

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve "better" performance by sounding like a demographic than by stating they belong

medium relevance attack

Paper 2605.20351v1

2026-05-19

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

validated (or not) against different inter-rater reliability standards. Existing surveys treat code security, jailbreak taxonomy, or vulnerability detection as the central object and mention these corpora only in passing

medium relevance survey

Paper 2510.19169v2

2025-10-22

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior

medium relevance tool

Paper 2604.07223v1

2026-04-08

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck

medium relevance tool

Paper 2603.21354v1

2026-03-22

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing

medium relevance attack

Paper 2604.07615v1

2026-04-08

ADAG: Automatically Describing Attribution Graphs

gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer--simulator setup which generates and scores natural-language explanations of the functional role

medium relevance benchmark

Paper 2605.21362v1

2026-05-20

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each

high relevance attack

Paper 2510.02609v2

2025-10-02

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

they fail to cover certain boundary conditions, such as the combined effects of different jailbreak tools. In this work, we propose RedCodeAgent, the first automated red-teaming agent designed

high relevance benchmark

Paper 2606.17114v1

2026-06-15

An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios

leakage risks in agents has focused on adversarial data exfiltration through prompt injections and jailbreaks. However, sensitive information may also be exposed during non-adversarial use, creating leakage risks even

medium relevance benchmark

Paper 2603.24511v1

2026-03-25

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering [rank2026posttrainbench, novikov2025alphaevolve]. We show that an autoresearch-style

high relevance attack

Paper 2602.00388v1

2026-01-30

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Diffusion large language models (D-LLMs) offer an alternative to

medium relevance defense

Paper 2510.20129v1

2025-10-23

SAID: Empowering Large Language Models with Self-Activating Internal Defense

Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering

medium relevance defense

Paper 2602.06630v1

2026-02-06

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average

high relevance attack

Paper 2511.17666v1

2025-11-21

Evaluating Adversarial Vulnerabilities in Modern Large Language Models

determined by the generation of disallowed content, with successful jailbreaks assigned a severity score. The findings indicate a disparity in jailbreak susceptibility between 2.5 Flash and GPT-4, suggesting variations

medium relevance attack

Paper 2604.23775v1

2026-04-26

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

mitigated. We first define the scope of VLA safety, distinguishing it from text-only LLM safety and classical robotic safety, and review the foundations of VLA models, including architectures, training

medium relevance benchmark

Paper 2601.03273v1

2025-12-22

GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators

As large language models (LLMs) become deeply embedded in daily

medium relevance benchmark

Paper 2602.21236v1

2026-02-11

@GrokSet: multi-party Human-LLM Interactions in Social Media

million tweets involving the @Grok LLM on X. Our analysis reveals a distinct functional shift: rather than serving as a general assistant, the LLM is frequently invoked as an authoritative

medium relevance benchmark

Paper 2511.19171v2

2025-11-24

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion

safety evaluation of large language models (LLMs) has become extensive, driven by jailbreak studies that elicit unsafe responses. Such responses involve information already available to humans, such as the answer

medium relevance benchmark
