QUICK REVIEW

[Paper Review] Reducing Sentiment Bias in Language Models via Counterfactual Evaluation

Po-Sen Huang, Huan Zhang|arXiv (Cornell University)|Nov 8, 2019

Topic Modeling | References: 55 | Cited by: 23

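One-Sentence Summary

This paper proposes a framework that reduces sentiment bias in large language models through counterfactual evaluation and regularization of latent representations. Embedding regularization and sentiment-prediction-based regularization substantially lower the individual fairness score (the measure of sentiment bias, where lower means less bias) while keeping perplexity low and semantic similarity high; the effect is confirmed by both automatic and human evaluation.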

ABSTRACT

Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sentiment of generated text. Given a conditioning context (e.g., a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g., country names, occupations, genders) in the conditioning context using a form of counterfactual evaluation. We quantify sentiment bias by adopting individual and group fairness metrics from the fair machine learning literature, and demonstrate that large-scale models trained on two different corpora (news articles, and Wikipedia) exhibit considerable levels of bias. We then propose embedding and sentiment prediction-derived regularization on the language model's latent representations. The regularizations improve fairness metrics while retaining comparable levels of perplexity and semantic similarity.

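Research Motivation and Goals

Quantify sentiment bias in language models through counterfactual evaluation over sensitive attributes such as occupation, country, and name.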
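Define individual and group fairness metrics, adapted from the fair machine learning literature and based on the Wasserstein distance, to measure sentiment bias in generated text.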
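Propose a generalizable framework for reducing sentiment bias in text generation under a specified fairness criterion.
Evaluate how well regularization of latent representations improves fairness while preserving semantic quality and perplexity.
Verify that the automatic metrics correlate with human-annotated judgments of sentiment, semantic similarity, and fairness.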

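Proposed Method

The authors perform counterfactual evaluation by systematically varying a sensitive attribute (e.g., occupation or country) in the conditioning context and measuring how the sentiment score of the generated text changes.
Individual fairness is defined via the Wasserstein distance between the sentiment distributions obtained under different attribute values, capturing bias in the sentiment of the output.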
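Group fairness compares the sentiment distribution of each subgroup (all prompts sharing one attribute value) with the overall sentiment distribution, again via the Wasserstein distance, and averages over subgroups to give a global bias measure.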
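Two regularization techniques are introduced: (1) embedding regularization, which constrains the latent representations; (2) sentiment-prediction regularization, derived from a BERT-based sentiment classifier.
The regularization term is added to the language model's training objective, with a hyperparameter λ balancing fairness against generation quality.
The framework is evaluated on two datasets (WMT-19 and WikiText-103) using automatic metrics (perplexity, semantic similarity) and human evaluation.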
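
The individual fairness metric described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `wasserstein_1` and `individual_fairness` are hypothetical, the sentiment scores are made-up toy values, and the per-template averaging over attribute pairs is one plausible reading of the metric.

```python
from itertools import combinations


def wasserstein_1(u, v):
    """Wasserstein-1 distance between two empirical 1-D sample sets.

    Computed as the area between the two empirical CDFs.
    """
    points = sorted(set(u) | set(v))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDF values, constant on the interval [left, right)
        cdf_u = sum(x <= left for x in u) / len(u)
        cdf_v = sum(x <= left for x in v) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def individual_fairness(scores_by_template):
    """Average pairwise W1 distance per template, then average over templates.

    scores_by_template: list of dicts mapping an attribute value to the
    sentiment scores of texts generated from that counterfactual prompt.
    """
    per_template = []
    for scores in scores_by_template:
        pairs = list(combinations(scores, 2))
        distances = [wasserstein_1(scores[a], scores[b]) for a, b in pairs]
        per_template.append(sum(distances) / len(distances))
    return sum(per_template) / len(per_template)


# Toy data (hypothetical): two prompt templates, two occupations each.
toy_scores = [
    {"baker": [0.8, 0.9], "accountant": [0.3, 0.4]},  # biased template
    {"baker": [0.5, 0.5], "accountant": [0.5, 0.5]},  # unbiased template
]
print(individual_fairness(toy_scores))
```

A score of 0 would mean the sentiment distribution does not change when the attribute is swapped; larger values indicate stronger sentiment bias.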

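Experimental Results

Research Questions

RQ1: Do large-scale language models exhibit systematic sentiment bias when a sensitive attribute in the prompt (such as occupation or country) is changed?
RQ2: Can individual and group fairness metrics based on the Wasserstein distance effectively quantify sentiment bias in generated text?
RQ3: Can regularizing latent representations reduce sentiment bias without hurting perplexity or semantic similarity?
RQ4: Do the automatic fairness metrics correlate with human-annotated sentiment and relevance judgments?
RQ5: What trade-offs exist among fairness, perplexity, and semantic similarity in bias mitigation?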
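
RQ3 and RQ5 both turn on how a regularization weight trades fairness against language-modeling quality. The sketch below shows the general shape of such an objective, assuming a cosine-distance penalty between the representations of a prompt and its counterfactual. The names `regularized_loss`, `cosine_distance`, and `project` are hypothetical, and the exact distance and projection used in the paper may differ.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def regularized_loss(lm_loss, h_original, h_counterfactual, lam,
                     project=lambda h: h):
    """Language-model loss plus lam times a fairness penalty.

    h_original / h_counterfactual: latent representations of a prompt and
    of the same prompt with the sensitive attribute swapped.
    project: identity for an embedding-style penalty; a sentiment-feature
    extractor for a sentiment-style penalty (assumption of this sketch).
    """
    penalty = cosine_distance(project(h_original), project(h_counterfactual))
    return lm_loss + lam * penalty


# Identical representations incur no penalty, whatever lam is.
print(regularized_loss(2.0, [1.0, 0.0], [1.0, 0.0], lam=10.0))
# Orthogonal representations incur the full penalty of lam * 1.0.
print(regularized_loss(2.0, [1.0, 0.0], [0.0, 1.0], lam=0.5))
```

Raising `lam` pushes counterfactual representations closer together at the cost of fitting the training text less closely, which is the trade-off RQ5 asks about.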

Key Findings

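The baseline language model exhibits marked sentiment bias: for the same prompt context, it generates more positive sentiment for "baker" and more negative sentiment for "accountant".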
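On the "designer" versus "accountant" prompt pair, the proposed sentiment-based regularization lowers the individual fairness score from 0.333 to 0.056, an 83% reduction.
On the "Libya" versus "Iceland" prompt pair, the individual fairness score falls from 0.291 at baseline to 0.155 for the sentiment-regularized model, a reduction of about 47%, confirming reduced bias.
Sentiment regularization is more effective than embedding regularization at lowering fairness scores, reducing them by 70% on average.
Automatic measures of sentiment and semantic similarity correlate strongly with human annotations (Spearman ρ = 0.75–0.79 for sentiment, ρ = 0.63–0.72 for similarity).
Both regularization methods keep perplexity (PPL ≈ 17.6–18.5) and semantic similarity close to the baseline, indicating no noticeable loss in generation quality.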
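
The relative reductions implied by the reported fairness scores can be checked with a short calculation. The helper name `relative_reduction` is hypothetical; the input numbers are the scores quoted in the findings above.

```python
def relative_reduction(before, after):
    """Fractional drop from a baseline score to a regularized score."""
    return (before - after) / before


reported = [
    ("designer vs. accountant", 0.333, 0.056),
    ("Libya vs. Iceland", 0.291, 0.155),
]
for pair, before, after in reported:
    print(f"{pair}: {relative_reduction(before, after):.0%}")
```

The first pair works out to roughly 83% and the second to roughly 47%, so the gain from regularization varies considerably between prompt pairs.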

This review was generated by AI and reviewed by human editors.