197 results in 98ms
Paper 2512.16962v1

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic

medium relevance benchmark
Paper 2601.05466v1

Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning

ulti-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious

high relevance tool
Paper 2602.13274v1

ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs

models. We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance

medium relevance defense
Paper 2603.06594v2

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon

medium relevance attack
Paper 2601.16466v1

Persona Jailbreaking in Large Language Models

Large Language Models (LLMs) are increasingly deployed in domains such

high relevance attack
Paper 2602.19396v1

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when

high relevance attack
Paper 2509.23362v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

With the rapid advancement of large language models, Machine Unlearning

medium relevance attack
Paper 2603.21975v1

SecureBreak -- A dataset towards safe and secure models

reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies

medium relevance benchmark
Paper 2510.07835v1

MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation

This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful

high relevance attack
Paper 2512.15782v1

Auto-Tuning Safety Guardrails for Black-Box Large Language Models

three public benchmarks covering malware generation, classic jailbreak prompts, and benign user queries. Each configuration is scored using malware and jailbreak attack success rate, benign harmful-response rate

medium relevance defense
Paper 2603.10807v1

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in Banking, Financial Services, and Insurance (BFSI), combining a domain-specific taxonomy

high relevance survey
Paper 2602.06854v1

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level

high relevance attack
Paper 2602.16520v1

Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic

high relevance tool
Paper 2510.16794v1

Black-box Optimization of LLM Outputs by Asking for Directions

We present a novel approach for attacking black-box large

medium relevance attack
Paper 2602.16346v2

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

LLM-based agents execute real-world workflows via tools and memory. These affordances also enable ill-intentioned adversaries to use such agents to carry out complex misuse scenarios. Existing agent

medium relevance benchmark
Paper 2512.10766v1

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models

high relevance attack
Paper 2603.16734v1

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Large language models (LLMs) are increasingly deployed as tool-using

medium relevance benchmark
Paper 2510.17000v1

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Adversarial attacks by malicious users that threaten the safety of

high relevance attack
Paper 2510.12993v2

A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

Large Language Models (LLMs) can generate human-like disinformation, yet

medium relevance benchmark
Paper 2510.07985v2

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that

high relevance attack
Page 9 of 10