
Enhancing Moral Diagnosis and Correction in Large Language Models

Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang, Kristen Johnson, Guangliang Liu
Published: January 6, 2026
Updated: March 17, 2026

Abstract

Identifying specific moral errors in an input and generating appropriate corrections require moral sensitivity in large language models (LLMs), a capability that is fundamental to improving their moral performance yet remains challenging. This study leverages a pragmatic inference-based approach to enhance both the moral diagnoses and corrections produced by models. Crucially, our method generalizes across a diverse set of tasks, including moral reasoning, toxic language detection, social bias detection, and jailbreaks, despite substantial differences in their semantic formulations. To enable such generalization, the study also introduces a unifying variable, pragmatic inference load, which captures the degree of pragmatic reasoning each task requires. Experimental results show that our approach enables LLMs to produce high-quality diagnoses of moral errors, make effective corrections, and consistently outperform a range of baseline methods. Further analyses reveal that these improvements arise not from heuristic response patterns but from learned inferential processes, highlighting the effectiveness of our approach.
