QUICK REVIEW

[Paper Review] Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu|arXiv (Cornell University)|Mar 4, 2022

Topic Modeling | Cited by 4,260

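One-Sentence Summary

This paper shows that fine-tuning GPT-3 on human demonstrations and human preferences (RLHF) yields InstructGPT, whose outputs are preferred over those of the GPT-3 baseline despite far fewer parameters, and which improves truthfulness and reduces toxicity across a broad set of tasks.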

ABSTRACT

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

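Motivation and Objectives

Demonstrate that fine-tuning large language models with human feedback can align them with user instructions across a diverse range of tasks.
Show that smaller, instruction-tuned models can outperform much larger baseline models on instruction-following prompts.
Assess how RLHF changes truthfulness, toxicity, and adherence to explicit constraints.
Assess generalization to held-out labelers who produced no training data and to real API prompts beyond the training data.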

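Proposed Method

Collect demonstrations of the desired output from labelers and use them to train a supervised fine-tuned (SFT) model.
Collect human preference comparisons (labeler rankings of model outputs) and use them to train a reward model (RM).
Fine-tune the policy with Proximal Policy Optimization (PPO), using the RM output as the reward.
Mix PPO updates with pretraining gradients (PPO-ptx) to reduce performance regressions on public NLP datasets.
Evaluate on held-out labeled API prompts that are not part of the training distribution and on public NLP datasets, comparing against GPT-3 and FLAN/T0 baselines.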
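The steps above name the reward model and PPO-ptx but give no formulas. As a rough illustration only, the sketch below assumes a pairwise logistic loss averaged over all pairs from one labeler ranking, and a scalar objective that combines the RM reward with a KL penalty toward the SFT policy and a pretraining log-likelihood term. The KL penalty is not mentioned in this summary and is an assumption here, as are the function names, the coefficients `beta` and `gamma`, and every example number. Plain floats stand in for model outputs; no real training is performed.

```python
import math
from itertools import combinations


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_ranking_loss(scores_best_to_worst: list[float]) -> float:
    """Mean of -log sigmoid(r_preferred - r_rejected) over all pairs.

    `scores_best_to_worst` holds hypothetical RM scores for the outputs of
    one prompt, ordered from the labeler's most to least preferred output.
    """
    # combinations() keeps input order, so the first element of each pair
    # is always the output the labeler ranked higher.
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    return -sum(log_sigmoid(win - lose) for win, lose in pairs) / len(pairs)


def mixed_objective(reward: float, kl_to_sft: float,
                    pretrain_logprob: float,
                    beta: float, gamma: float) -> float:
    """Assumed scalar form of a PPO-ptx style objective (to be maximized).

    reward           : RM score for a sampled response
    kl_to_sft        : KL estimate between the current policy and the SFT policy
    pretrain_logprob : log-likelihood of a pretraining sample under the policy
    beta, gamma      : hypothetical mixing coefficients
    """
    return reward - beta * kl_to_sft + gamma * pretrain_logprob


if __name__ == "__main__":
    # RM scores that agree with the labeler ranking give a lower loss
    # than scores that contradict it.
    print(pairwise_ranking_loss([2.0, 1.0, -1.0]))
    print(pairwise_ranking_loss([-1.0, 1.0, 2.0]))
    print(mixed_objective(1.0, 0.5, -2.0, beta=0.2, gamma=0.1))
```

In this sketch, setting `gamma` to 0 corresponds to plain PPO without the pretraining mix, which is the setting where the summary reports regressions on public NLP datasets.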

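Experimental Results

Research Questions

RQ1 Can RLHF fine-tuning align language models to follow user instructions across a broad distribution of tasks?
RQ2 Do smaller InstructGPT models outperform the much larger GPT-3 baseline on instruction-following prompts?
RQ3 Compared with baseline models, how does RLHF affect truthfulness, toxicity, and adherence to explicit constraints?
RQ4 Do InstructGPT models generalize to unseen labelers and to prompts outside the training distribution?
RQ5 Is there a trade-off between alignment gains and performance on standard NLP benchmarks (an alignment tax)?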

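Key Findings

Labelers prefer InstructGPT outputs to GPT-3 outputs; on the API prompt distribution, outputs from the 1.3B InstructGPT model are preferred to those from the 175B GPT-3.
On TruthfulQA, InstructGPT models give truthful and informative answers about twice as often as GPT-3, and on closed-domain tasks they hallucinate about half as often (21% vs. 41%).
When prompted to be respectful, InstructGPT produces about 25% fewer toxic outputs than GPT-3.
Alignment causes small but measurable performance regressions on some public NLP datasets (SQuAD, DROP, HellaSwag, WMT 2015 FR→EN); mixing PPO updates with pretraining gradients (PPO-ptx) mitigates them.
Even at 1.3B parameters, InstructGPT outperforms the 175B GPT-3 on instruction-following tasks, which suggests that alignment can matter more than raw scale for this kind of task.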
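To put the reported figures in proportion, the short sketch below recomputes two ratios from numbers quoted in the findings: the relative drop in hallucination rate (41% to 21%) and the parameter ratio between the 175B and 1.3B models. The helper names are hypothetical and the code only does arithmetic on the quoted values; it is not an evaluation of any model.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


def size_ratio(large_params: float, small_params: float) -> float:
    """How many times larger the big model is than the small one."""
    if small_params <= 0:
        raise ValueError("small_params must be positive")
    return large_params / small_params


if __name__ == "__main__":
    # Hallucination rates quoted above: GPT-3 at 41%, InstructGPT at 21%.
    print(f"hallucination reduction: {relative_reduction(0.41, 0.21):.1%}")
    # Parameter counts quoted above: 175B vs. 1.3B.
    print(f"parameter ratio: {size_ratio(175e9, 1.3e9):.0f}x")
```

The first ratio comes out at roughly 49%, consistent with "about half", and the second at roughly 135x, consistent with the abstract's rounded "100x fewer parameters".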


This review was generated by AI and reviewed by human editors.