QUICK REVIEW

[Paper Review] The Capacity for Moral Self-Correction in Large Language Models

Deep Ganguli, Amanda Askell|arXiv (Cornell University)|Feb 15, 2023

Topic Modeling | Cited by 48

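One-Sentence Summary

This paper shows that steering RLHF-trained large language models with natural-language instructions can help them avoid harmful outputs; the capability emerges at 22B parameters and typically strengthens with larger models and more RLHF training.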

ABSTRACT

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

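Motivation and Objectives

Investigate whether RLHF-trained large language models can morally self-correct when instructed to avoid harm.
Examine how model size and the amount of RLHF training affect susceptibility to stereotype bias and discriminatory outputs.
Evaluate whether natural-language prompts can steer models toward fairer behavior across multiple benchmarks.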

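Proposed Method

Study decoder-only Transformer models fine-tuned with RLHF, ranging from 810M to 175B parameters.
Run three experiments, evaluating on BBQ (stereotype bias), Winogender (gender bias in pronoun choice), and a discrimination benchmark based on law school admissions.
Apply three prompt interventions: Q (baseline question), Q+IF (question plus an instruction to follow), and Q+IF+CoT (chain-of-thought variant).
Vary the number of RLHF training steps (50 to 1000) to analyze the effect of training amount.
Analyze how model size and RLHF steps affect bias, correlation with real-world statistics, and demographic parity.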
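
The three prompt conditions differ only in what is appended to the baseline question, which a short sketch can make concrete. The code below is illustrative and not the authors' implementation: the names `Condition` and `build_prompt`, the Human/Assistant framing, and the instruction and chain-of-thought wording are all assumptions chosen for the example, since this review does not reproduce the paper's prompt text.

```python
from enum import Enum


class Condition(Enum):
    """The three prompt interventions named in the review."""
    Q = "Q"
    Q_IF = "Q+IF"
    Q_IF_COT = "Q+IF+CoT"


# Placeholder wording (assumption): not quoted from the paper.
INSTRUCTION = "Please make sure your answer is unbiased and does not rely on stereotypes."
COT_CUE = "Let's think about how to answer in a way that avoids bias or stereotyping."


def build_prompt(question: str, condition: Condition) -> str:
    """Assemble a prompt for one condition (hypothetical helper)."""
    parts = [f"Human: {question}"]
    # Q+IF and Q+IF+CoT both append the instruction to the human turn.
    if condition in (Condition.Q_IF, Condition.Q_IF_COT):
        parts.append(INSTRUCTION)
    parts.append("Assistant:")
    prompt = "\n\n".join(parts)
    # Q+IF+CoT additionally seeds the assistant turn with a reasoning cue.
    if condition is Condition.Q_IF_COT:
        prompt += " " + COT_CUE
    return prompt


if __name__ == "__main__":
    for cond in Condition:
        print(f"--- {cond.value} ---")
        print(build_prompt("Who was forgetful?", cond))
```

Keeping the question fixed and varying only the condition means any change in a bias metric can be attributed to the instruction or the reasoning cue rather than to the question itself.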

Figure 1: Metrics for stereotype bias or discrimination (y-axes) vary with model size (x-axis) and experimental conditions (colors) for three experiments (panels, details in § 3). (Left) Bias score for the BBQ benchmark in the ambiguous context across all categories (y-axis). As models become large

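Experimental Results

Research Questions

RQ1: Can RLHF-trained large language models avoid producing harmful outputs when instructed to do so?
RQ2: How do model size and the amount of RLHF training affect the ability to reduce stereotype bias and discrimination?
RQ3: Do natural-language instructions and chain-of-thought prompting achieve moral self-correction across different fairness benchmarks?
RQ4: In bias-related tasks, how do model outputs relate to real-world demographic statistics?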

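Key Findings

The capacity for moral self-correction emerges at around 22B parameters and typically improves with larger models and more RLHF training.
Instruction following (Q+IF) and chain-of-thought prompting (Q+IF+CoT) substantially reduce bias on BBQ, especially at larger model sizes and with more RLHF steps.
RLHF training generally reduces bias across benchmarks, with the strongest reduction on BBQ in the Q+IF condition.
On Winogender, larger models can be steered by the prompt toward either gender-neutral pronoun choices or choices that track real-world statistics, depending on the prompt.
On the discrimination benchmark, when instructed not to base decisions on race, models reach demographic parity at some combinations of model size and RLHF steps; otherwise parity is not guaranteed.
Across experiments, larger models with more RLHF steps may show either reduced or increased bias, depending on the context and the prompt.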
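
Two of the quantities mentioned above, demographic parity and correlation with real-world statistics, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names `demographic_parity_gap` and `pearson_r` are hypothetical, the inputs are toy values, and the paper's exact metric definitions may differ.

```python
from math import sqrt
from typing import Sequence


def demographic_parity_gap(admit_a: Sequence[int], admit_b: Sequence[int]) -> float:
    """Admission rate of group A minus that of group B; 0.0 means parity.

    Each sequence holds binary decisions (1 = admit, 0 = reject).
    """
    if not admit_a or not admit_b:
        raise ValueError("each group needs at least one decision")
    return sum(admit_a) / len(admit_a) - sum(admit_b) / len(admit_b)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between model pronoun probabilities
    and occupational gender statistics (toy use case, an assumption)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy decisions, invented for illustration only.
    print(demographic_parity_gap([1, 1, 0, 0], [1, 0, 0, 0]))
    print(pearson_r([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]))
```

Under this reading, a gap near zero corresponds to demographic parity, while a correlation near zero on Winogender would indicate pronoun choices that do not track occupational statistics.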

Figure 2: Influence of RLHF training (x-axes) for metrics for stereotype bias or discrimination (y-axes) for the 175B parameter model. (Left) Bias score for the BBQ benchmark in the ambiguous context across all categories (y-axis). Increasing the amount of RLHF steps decreases bias acros


This review was generated by AI and reviewed by human editors.