LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context
even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several
MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful
Auto-Tuning Safety Guardrails for Black-Box Large Language Models
three public benchmarks covering malware generation, classic jailbreak prompts, and benign user queries. Each configuration is scored using malware and jailbreak attack success rate, benign harmful-response rate
Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services
through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in Banking, Financial Services, and Insurance (BFSI), combining a domain-specific taxonomy
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level
Automated jailbreak attack targeting multiple defense strategies
Large language models (LLMs) have demonstrated remarkable capabilities across a
TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages
leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TukaBench, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings
Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic
Black-box Optimization of LLM Outputs by Asking for Directions
We present a novel approach for attacking black-box large
Beyond Similarity: Trustworthy Memory Search for Personal AI Agents
threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent
Metaphor-based Jailbreaking Attacks on Text-to-Image Models
models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models
SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt
OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents
Large language model (LLM) agents increasingly act on a user
Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Large language models (LLMs) are increasingly deployed as tool-using
Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
Adversarial attacks by malicious users that threaten the safety of
A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation
Large Language Models (LLMs) can generate human-like disinformation, yet
Fewer Weights, More Problems: A Practical Attack on LLM Pruning
pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that
The Rogue Scalpel: Activation Steering Compromises LLM Safety
Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed
Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization
Current LLM unlearning methods face a critical security vulnerability that