Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems
often treat prompts as flat strings and rely on ad hoc filtering or static jailbreak detection. This paper proposes Prompt Control-Flow Integrity (PCFI), a priority-aware runtime defense that
OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents
they fail to cover certain boundary conditions, such as the combined effects of different jailbreak tools. In this work, we propose RedCodeAgent, the first automated red-teaming agent designed
A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode
Diffusion large language models (D-LLMs) offer an alternative to
SAID: Empowering Large Language Models with Self-Activating Internal Defense
Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering
TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking
inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average
Evaluating Adversarial Vulnerabilities in Modern Large Language Models
determined by the generation of disallowed content, with successful jailbreaks assigned a severity score. The findings indicate a disparity in jailbreak susceptibility between 2.5 Flash and GPT-4, suggesting variations
GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators
As large language models (LLMs) become deeply embedded in daily
@GrokSet: Multi-Party Human-LLM Interactions in Social Media
million tweets involving the @Grok LLM on X. Our analysis reveals a distinct functional shift: rather than serving as a general assistant, the LLM is frequently invoked as an authoritative
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
safety evaluation of large language models (LLMs) has become extensive, driven by jailbreak studies that elicit unsafe responses. Such responses involve information already available to humans, such as the answer
Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones that compromise LLMs' safety alignment via fine-tuning stand out due to their stable jailbreak performance. In particular
RECAP: Reproducing Copyrighted Data from LLM Training with an Agentic Pipeline
cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model
CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing
repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce
CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against
ShallowJail: Steering Jailbreaks against Large Language Models
from harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, using carefully crafted, unstealthy prompts
External Data Extraction Attacks against Retrieval-Augmented Large Language Models
extracted verbatim. These risks are particularly acute when RAG is used to customize specialized LLM applications with private knowledge bases. Despite initial studies exploring these risks, they often lack
A Framework for Formalizing LLM Agent Security
whether the action serves that objective. However, existing definitions of security attacks against LLM agents often fail to capture this contextual nature. As a result, defenses face a fundamental utility