Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
Abstract
In this paper, we study emergent self-debiasing mechanisms against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms that are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve generation-time intrinsic corrections that are not directly reducible to surface-level prompts. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-\underline{CO}nsistency yet sharp inter-\underline{CO}ntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness: over 90\% of outputs revert to biased content, far exceeding the bias levels induced by explicit adversarial jailbreak attacks. Observing that simple weight amplification of COCO neurons yields only marginal gains, we propose two training-free, lightweight editing strategies: Local Enhancement (LE-COCO) and Networked Enhancement (NE-COCO). Comprehensive evaluations show that our methods bolster robustness against adversarial jailbreaks and achieve strong performance on open-ended safety benchmarks, while preserving foundational generative proficiency. While this study primarily addresses social stereotypes, the COCO mechanism holds significant potential for diverse domains such as hallucination detection, offering valuable insights toward the development of self-evolving AI agents.
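The abstract describes COCO neurons as those showing high intra-consistency within a response type but sharp inter-contrast across stereotypical versus unbiased generations. The following is a minimal sketch of one way such a contrastive score could be computed from collected activations; it is an illustrative assumption, not the authors' released implementation, and the array shapes, normalization, and `top_k` selection are placeholders chosen for clarity.

```python
# Illustrative sketch (not the paper's official code): a contrastive scoring
# heuristic for locating COCO-like neurons from activations recorded while the
# model produces stereotypical vs. unbiased text. Names, thresholds, and the
# exact scoring formula are assumptions for illustration only.
import numpy as np


def coco_scores(act_stereo: np.ndarray, act_unbiased: np.ndarray) -> np.ndarray:
    """Score each neuron by inter-COntrast relative to intra-COnsistency.

    act_stereo, act_unbiased: arrays of shape (num_samples, num_neurons)
    holding each neuron's activation on every sampled generation.
    """
    mean_s, mean_u = act_stereo.mean(axis=0), act_unbiased.mean(axis=0)
    # Intra-consistency: low variance within each response type.
    std_s, std_u = act_stereo.std(axis=0), act_unbiased.std(axis=0)
    pooled_std = np.sqrt(0.5 * (std_s ** 2 + std_u ** 2)) + 1e-6
    # Inter-contrast: large separation between the two response types,
    # normalized by how consistent the neuron is within each type.
    return np.abs(mean_s - mean_u) / pooled_std


def select_coco_neurons(act_stereo, act_unbiased, top_k=64):
    """Return indices of the top_k highest-scoring candidate neurons."""
    scores = coco_scores(act_stereo, act_unbiased)
    return np.argsort(scores)[::-1][:top_k]


# Toy usage with synthetic activations (200 samples x 1024 neurons per type).
rng = np.random.default_rng(0)
stereo = rng.normal(0.0, 1.0, size=(200, 1024))
unbiased = rng.normal(0.0, 1.0, size=(200, 1024))
unbiased[:, :8] += 3.0  # pretend a few neurons separate the two response types
print(select_coco_neurons(stereo, unbiased, top_k=8))
```

Under this reading, deactivation and enhancement experiments (such as the LE-COCO and NE-COCO edits mentioned above) would then operate on the selected neuron indices.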