QUICK REVIEW

[Paper Review] Bias and Fairness in Large Language Models: A Survey

Isabel O. Gallegos, Ryan A. Rossi|arXiv (Cornell University)|Sep 2, 2023

Text Readability and Simplification | Cited by 58

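One-Sentence Summary

This survey consolidates definitions of social bias and fairness for LLMs, proposes taxonomies of bias evaluation metrics and datasets, and classifies bias mitigation techniques by intervention stage: pre-processing, in-training, intra-processing, and post-processing.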

ABSTRACT

Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs.

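Motivation and Objectives

Consolidate and formalize notions of social bias and fairness for natural language processing and LLMs.
Develop a taxonomy that classifies bias evaluation metrics by the data structure they operate on and the model access they require.
Compile and organize publicly available datasets for LLM bias evaluation.
Classify bias mitigation techniques by intervention stage and provide unified notation for the methods.
Identify open problems and challenges to guide future research on fair LLMs.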

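Proposed Approach

Formalize LLM concepts and fairness desiderata for NLP and LLMs.
Propose three taxonomies: (i) bias evaluation metrics (embeddings, probabilities, generated text), (ii) bias evaluation datasets (counterfactual inputs, prompts), (iii) bias mitigation techniques (pre-processing, in-training, intra-processing, post-processing).
Provide unified mathematical notation to compare metrics and formalize techniques.
Compile and release publicly available datasets for bias evaluation.
Discuss open problems and future directions for reducing bias in LLMs.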
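To make the idea of counterfactual inputs paired with a probability-level metric concrete, here is a minimal sketch. It is an illustration only, not a metric defined in the survey: the swap table `SWAPS`, the helper names, and especially `toy_score` (a deliberately biased stand-in for a model's sentence log-probability) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical swap table mapping one social group's terms to another's.
SWAPS: Dict[str, str] = {"he": "she", "his": "her", "him": "her"}


def make_counterfactual(sentence: str, swaps: Dict[str, str]) -> str:
    """Build a counterfactual input by swapping group terms token by token.

    Assumes lowercase, whitespace-tokenized text for simplicity.
    """
    return " ".join(swaps.get(tok, tok) for tok in sentence.split())


def mean_score_gap(
    sentences: List[str],
    swaps: Dict[str, str],
    score: Callable[[str], float],
) -> float:
    """Average score(original) - score(counterfactual) over all sentences.

    A value near zero means the scorer treats both variants alike.
    """
    gaps = [score(s) - score(make_counterfactual(s, swaps)) for s in sentences]
    return sum(gaps) / len(gaps)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model log-probability.

    Deliberately biased: sentences containing "she" are penalized by 1.0.
    """
    tokens = sentence.split()
    penalty = 1.0 if "she" in tokens else 0.0
    return -float(len(tokens)) - penalty


examples = ["he is a doctor", "he fixed the car"]
gap = mean_score_gap(examples, SWAPS, toy_score)
```

In practice `toy_score` would be replaced by a real model's scoring function; the pairing logic stays the same, which is why the metric and the dataset structure can be described separately.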

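Results

Research Questions

RQ1: What are the precise facets of social bias and fairness relevant to LLMs and NLP tasks?
RQ2: How can bias evaluation metrics be organized by data structure and model access to enable consistent evaluation?
RQ3: What datasets exist for bias evaluation, and how can they be standardized or consolidated?
RQ4: What taxonomy best describes bias mitigation techniques across intervention stages?
RQ5: What are the key open challenges and future directions for achieving fairness in LLMs?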

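Key Findings

The paper provides formal definitions of social bias, group and individual fairness, and a taxonomy of harms (representational and allocational) applicable to NLP and LLMs.
It provides a unified taxonomy of bias evaluation metrics spanning embeddings, probabilities, and generated text, and clarifies the relationship between metrics and evaluation datasets.
It consolidates bias evaluation datasets by structure (counterfactual inputs, prompts), documents the targeted harms and social groups, and provides a public repository of them.
It proposes a taxonomy of mitigation techniques organized by intervention stage (pre-processing, in-training, intra-processing, post-processing), with fine-grained subcategories and formalization.
The survey highlights open problems, including the robustness of fairness notions, evaluation standards, and extending mitigation efforts across the NLP lifecycle.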
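As a rough illustration of what a metric at the embedding level can look like, the sketch below compares how strongly a target vector associates with two groups of attribute vectors using cosine similarity. It is loosely in the spirit of association-style embedding tests and is not a specific metric taken from the survey; the function names and the two-dimensional toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association_gap(
    target: Sequence[float],
    group_a: List[Sequence[float]],
    group_b: List[Sequence[float]],
) -> float:
    """Mean similarity of target to group_a minus mean similarity to group_b.

    Positive values mean the target sits closer to group_a on average.
    """
    mean_a = sum(cosine(target, a) for a in group_a) / len(group_a)
    mean_b = sum(cosine(target, b) for b in group_b) / len(group_b)
    return mean_a - mean_b


# Toy 2-D vectors (hypothetical; real embeddings come from a model).
target_vec = [1.0, 0.0]
group_a_vecs = [[1.0, 0.0]]
group_b_vecs = [[0.0, 1.0]]
toy_gap = association_gap(target_vec, group_a_vecs, group_b_vecs)
```

Because this kind of measure only needs vectors, it requires access to a model's internal representations, unlike metrics computed on generated text.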


This review was generated by AI and reviewed by a human editor.