Paper 2606.24585v1

LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context

even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several

medium relevance attack
Paper 2510.07835v1

MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation

This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful

high relevance attack
Paper 2512.15782v1

Auto-Tuning Safety Guardrails for Black-Box Large Language Models

three public benchmarks covering malware generation, classic jailbreak prompts, and benign user queries. Each configuration is scored using malware and jailbreak attack success rate, benign harmful-response rate

medium relevance defense
Paper 2603.10807v1

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in Banking, Financial Services, and Insurance (BFSI), combining a domain-specific taxonomy

high relevance survey
Paper 2602.06854v1

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level

high relevance attack
Paper 2606.16751v1

Automated jailbreak attack targeting multiple defense strategies

Large language models (LLMs) have demonstrated remarkable capabilities across a

high relevance attack
Paper 2606.01322v1

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings

high relevance benchmark
Paper 2602.16520v1

Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic

high relevance tool
Paper 2510.16794v1

Black-box Optimization of LLM Outputs by Asking for Directions

We present a novel approach for attacking black-box large

medium relevance attack
Paper 2606.06054v1

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents

medium relevance attack
Paper 2602.16346v2

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent

medium relevance benchmark
Paper 2512.10766v1

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models

high relevance attack
Paper 2606.19755v1

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt

medium relevance benchmark
Paper 2606.12341v1

OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

Large language model (LLM) agents increasingly act on a user

medium relevance benchmark
Paper 2603.16734v1

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Large language models (LLMs) are increasingly deployed as tool-using

medium relevance benchmark
Paper 2510.17000v1

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Adversarial attacks by malicious users that threaten the safety of

high relevance attack
Paper 2510.12993v2

A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

Large Language Models (LLMs) can generate human-like disinformation, yet

medium relevance benchmark
Paper 2510.07985v2

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that

high relevance attack
Paper 2509.22067v2

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed

medium relevance defense
Paper 2509.20230v3

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Current LLM unlearning methods face a critical security vulnerability that

medium relevance benchmark