Paper 2606.05958v1

Steering Vectors are an Adversarial Attack Surface

Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors

high relevance attack
Paper 2602.01600v1

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform

medium relevance benchmark
Paper 2510.02833v4

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

highly susceptible to jailbreak attacks. Among these attacks, finetuning-based ones that compromise LLMs' safety alignment via fine-tuning stand out due to their stable jailbreak performance. In particular

high relevance attack
Paper 2510.25941v3

RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model

medium relevance benchmark
Paper 2603.29038v1

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Fine-tuning APIs offered by major AI providers create new

high relevance attack
Paper 2512.08967v1

CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing

repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce

medium relevance attack
Paper 2602.20170v1

CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that

high relevance benchmark
Paper 2510.04885v1

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against

high relevance attack
Paper 2602.07107v2

ShallowJail: Steering Jailbreaks against Large Language Models

from harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, using carefully crafted, unstealthy prompts

high relevance attack
Paper 2510.02964v1

External Data Extraction Attacks against Retrieval-Augmented Large Language Models

extracted verbatim. These risks are particularly acute when RAG is used to customize specialized LLM applications with private knowledge bases. Despite initial studies exploring these risks, they often lack

high relevance attack
Paper 2605.09504v1

Position: AI Security Policy Should Target Systems, Not Models

present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both

medium relevance tool
Paper 2603.19469v1

A Framework for Formalizing LLM Agent Security

whether the action serves that objective. However, existing definitions of security attacks against LLM agents often fail to capture this contextual nature. As a result, defenses face a fundamental utility

medium relevance tool
Paper 2512.16962v1

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic

medium relevance benchmark
Paper 2601.05466v1

Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning

ulti-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious

high relevance tool
Paper 2602.13274v1

ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs

models. We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance

medium relevance defense
Paper 2603.06594v2

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon

medium relevance attack
Paper 2601.16466v1

Persona Jailbreaking in Large Language Models

Large Language Models (LLMs) are increasingly deployed in domains such

high relevance attack
Paper 2602.19396v1

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when

high relevance attack
Paper 2509.23362v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

With the rapid advancement of large language models, Machine Unlearning

medium relevance attack
Paper 2603.21975v1

SecureBreak -- A dataset towards safe and secure models

reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies

medium relevance benchmark