QUICK REVIEW

[Paper Review] Race, Ethnicity and Their Implication on Bias in Large Language Models

Shiyue Hu, Ruizhe Li|arXiv (Cornell University)|Jan 19, 2026

Artificial Intelligence in Healthcare and Education | Cited by 0

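One-Sentence Summary

The paper proposes a mechanistic interpretability framework for locating how race and ethnicity are encoded in large language models, shows that these representations are distributed and task-dependent, and demonstrates that suppressing the identified neurons reduces but does not eliminate bias on toxicity and clinical tasks.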

ABSTRACT

Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.

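Research Motivation and Goals

Understand how demographic attributes shape LLM behavior in high-stakes settings such as healthcare and toxicity-related generation.
Examine whether race/ethnicity cues are encoded inside LLMs as high-level features, as task-dependent representations, or as spurious shortcuts.
Develop a reproducible interpretability pipeline that combines probing, neuron-level attribution, and targeted intervention.
Characterize race representations across three open-source LLMs and their effect on biased behavior across tasks.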

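Proposed Method

Propose a mechanistic interpretability (MI) pipeline that combines multi-class probing, neuron-level attribution, and targeted intervention.
Use linear probes to locate race directions in the final-layer residual stream.
Identify candidate race neurons by the cosine similarity between each neuron's output and the race direction.
Validate the neurons' causal role through activation analysis and by suppressing the target neurons' activations at inference time.
Evaluate on two tasks, ToxiGen toxicity generation and C-REACT clinical text, with three models: Qwen2.5-7B-IT, Mistral-7B-IT, and Llama-3.1-8B.
Train separate probes for direct and indirect race cues to distinguish explicit terms from proxy cues.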
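
The neuron-attribution step above can be illustrated with a small sketch. It assumes the race direction is already available (for example, as the weight vector of a trained linear probe) and that each neuron's output is a vector in the same space. The function names, the toy vectors, and the `top_k` parameter are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidate_neurons(neuron_outputs, race_direction, top_k):
    """Return (neuron_index, similarity) pairs for the top_k neurons whose
    output vectors are most aligned with race_direction."""
    scored = [(i, cosine(w, race_direction)) for i, w in enumerate(neuron_outputs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (assumption): a 3-dimensional space and four neurons.
race_direction = [1.0, 0.0, 0.0]
neuron_outputs = [
    [0.9, 0.1, 0.0],   # nearly parallel to the direction
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # anti-parallel
    [0.5, 0.5, 0.0],   # partially aligned
]

top = rank_candidate_neurons(neuron_outputs, race_direction, top_k=2)
print(top)
```

In this toy setup the nearly parallel neuron ranks first and the partially aligned one second. In a real model the vectors would have thousands of dimensions, and the ranking would typically be computed per layer.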

Figure 1: In the MLP layers, we locate neurons relevant to race information and inspect them via Logit Lens. For neurons with a higher activation score on the target race, we adjust the activation value to steer the model's behavior.
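
The intervention described in the caption, adjusting the activation of selected MLP neurons at inference time, can be sketched on a toy network. This is a minimal illustration under stated assumptions: the two-layer ReLU MLP, the `scale` mapping, and all weights are hypothetical and do not reproduce the architecture or code used in the paper. In a real transformer the same idea would usually be applied through a hook on the MLP's hidden activations.

```python
def mlp_forward(x, w_in, w_out, scale=None):
    """Toy MLP forward pass with ReLU hidden units.

    w_in[j]  : input weights of hidden neuron j (length = len(x))
    w_out[j] : output vector written by hidden neuron j
    scale    : optional {neuron_index: multiplier}; 0.0 suppresses a neuron,
               values above 1.0 amplify it (hypothetical intervention knob)
    """
    scale = scale or {}
    hidden = []
    for j, row in enumerate(w_in):
        pre_activation = sum(w * xi for w, xi in zip(row, x))
        activation = max(0.0, pre_activation)         # ReLU
        hidden.append(activation * scale.get(j, 1.0))  # targeted intervention
    d_out = len(w_out[0])
    return [sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
            for k in range(d_out)]


# Toy weights (assumption): two hidden neurons, identity-like mappings.
x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]

baseline = mlp_forward(x, w_in, w_out)
suppressed = mlp_forward(x, w_in, w_out, scale={1: 0.0})  # zero out neuron 1
amplified = mlp_forward(x, w_in, w_out, scale={1: 2.0})   # double neuron 1

print(baseline, suppressed, amplified)
```

Comparing `baseline` with `suppressed` shows how much of the output depends on the targeted neuron. The causal validation described above makes the same kind of comparison on model outputs rather than on a toy vector.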

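Experimental Results

Research Questions

RQ1: How do internal LLM representations encode race/ethnicity across different models?
RQ2: Do race-related representations have a causal effect on outputs in toxicity and clinical text tasks?
RQ3: Can targeted intervention on the identified neurons mitigate bias, and how do the effects differ between direct and indirect cues?

Key Findings

Race/ethnicity information is distributed across internal units, and models differ in which semantic aspects they emphasize (for example geography, language, culture, or historical context).
Some neurons encode explicit demographic categories, while others encode race through correlated attributes, revealing multiple encoding pathways.
Activation analysis shows that race-encoding neurons activate more strongly on text about their target group, supporting their role in demographic encoding.
Targeted interventions that suppress the identified neurons reduce bias, and under moderate amplification can even fully remove some of the observed disparities, but residual bias remains, suggesting behavioral rather than purely representational change.
Across the tested models, interventions on direct-cue neurons (affecting explicit race markers) generally reduce bias more than interventions on indirect-cue neurons (affecting proxy cues).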

Figure 2: Mean activation values of race encoding neurons when processing text from each racial group (ToxiGen). Diagonal cells represent neurons processing their target group. Higher values (red) indicate stronger activation; lower values (blue) indicate weak, negative activations.
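
A matrix like the one in Figure 2 can be assembled by averaging each neuron group's activation over the texts of each demographic group. The sketch below is only an illustration: the record layout, the placeholder group names `group_a` and `group_b`, and the activation values are made up for the example and are not data from the paper.

```python
from collections import defaultdict


def mean_activation_matrix(records):
    """Average activations per (neuron_group, text_group) cell.

    records: iterable of (text_group, {neuron_group: activation}) pairs,
             one pair per processed text (hypothetical layout).
    Returns a dict keyed by (neuron_group, text_group).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text_group, activations in records:
        for neuron_group, value in activations.items():
            key = (neuron_group, text_group)
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up activations for two placeholder groups.
records = [
    ("group_a", {"group_a": 2.0, "group_b": -0.5}),
    ("group_a", {"group_a": 1.0, "group_b": 0.5}),
    ("group_b", {"group_a": 0.0, "group_b": 3.0}),
]

matrix = mean_activation_matrix(records)
for (neuron_group, text_group), mean in sorted(matrix.items()):
    print(f"neurons[{neuron_group}] on text[{text_group}]: {mean:.2f}")
```

Cells where `neuron_group` equals `text_group` correspond to the diagonal of the figure, where neurons process text about their own target group.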


This review was generated by AI and reviewed by a human editor.