QUICK REVIEW

[Paper Digest] Pretraining Language Models with Human Preferences

Tomasz Korbak, Kejian Shi|arXiv (Cornell University)|Feb 16, 2023

Hate Speech and Cyberbullying Detection | Cited by 26

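One-Sentence Summary

This paper shows that pretraining language models with human preference objectives (conditional training in particular) yields models that generate substantially less undesirable content while preserving downstream performance, and that this outperforms standard MLE pretraining followed by finetuning with feedback.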

ABSTRACT

Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.

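Motivation and Objectives

Make the case for aligning language models with human preferences during pretraining, rather than only during finetuning.
Study five objectives for pretraining with human feedback (PHF) and compare them with standard MLE pretraining.
Evaluate alignment and capabilities on three tasks: toxicity, PII leakage, and PEP8-compliant code.
Identify the Pareto-optimal objective and provide practical guidance for PHF methods.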

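Proposed Method

Formulate pretraining with a segment-level reward function R that scores each training segment.
Compare five PHF objectives (conditional training, dataset filtering, unlikelihood loss, reward-weighted regression, and advantage-weighted regression) against standard MLE.
Train on 3.32B-token datasets for the toxicity, PII, and PEP8 tasks using the 124M-parameter GPT-2 small architecture.
Evaluate alignment via the misalignment score (negative reward), and capabilities via KL divergence from GPT-3 and downstream benchmarks.
Evaluate robustness to red-teaming and adversarial prompts across tasks.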
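As an illustration of the conditional training idea and the misalignment score described above, here is a minimal sketch of the data-preparation side only. It assumes that conditioning is realized by prefixing each segment with a control token chosen by thresholding its reward. The token strings, the threshold, and `toy_reward` (a stand-in for a real reward model) are hypothetical and are not taken from this digest. Model training itself is omitted.

```python
from typing import Callable, List

# Hypothetical control tokens; the digest does not specify how conditioning is encoded.
GOOD, BAD = "<|good|>", "<|bad|>"


def tag_segments(
    segments: List[str],
    reward_fn: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Prefix each segment with a control token chosen by its reward R(segment)."""
    tagged = []
    for seg in segments:
        token = GOOD if reward_fn(seg) >= threshold else BAD
        tagged.append(token + seg)
    return tagged


def misalignment_score(samples: List[str], reward_fn: Callable[[str], float]) -> float:
    """Average negative reward over samples; lower means better aligned."""
    return sum(-reward_fn(s) for s in samples) / len(samples)


def toy_reward(text: str) -> float:
    """Toy stand-in for a reward model: penalizes a placeholder word."""
    return 0.0 - text.lower().count("badword")


segments = ["a polite sentence.", "this has badword in it."]
tagged = tag_segments(segments, toy_reward, threshold=0.0)
score = misalignment_score(segments, toy_reward)
```

Under this assumption, an LM trained with the ordinary next-token loss on `tagged` learns a distribution conditioned on the control token, and at generation time one would prompt with the `GOOD` token to steer toward preferred text.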

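Experimental Results

Research Questions

RQ1: Can pretraining with human feedback (PHF) outperform standard MLE pretraining followed by finetuning with feedback on alignment and capability metrics?
RQ2: Across the toxicity, PII, and PEP8 tasks, which PHF objective offers the best trade-off between alignment and capabilities?
RQ3: Does incorporating human preferences during pretraining yield Pareto-optimal solutions for LM safety and usefulness?
RQ4: How robust are PHF-trained LMs to red-teaming attacks compared with the MLE baseline?
RQ5: Do PHF-trained models retain performance on zero-shot and GLUE-style downstream tasks?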

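Key Findings

Conditional training is Pareto-optimal on all three tasks, and is often strictly optimal for toxicity and PEP8.
PHF methods reduce undesirable content by up to an order of magnitude compared with standard MLE.
PHF matches or exceeds standard MLE pretraining followed by finetuning with feedback on alignment, while remaining on par on zero-shot and GLUE benchmarks.
Finetuning an MLE-pretrained model with feedback generally falls short of PHF from scratch.
PHF objectives are more robust to adversarial attacks than MLE, but do not eliminate all vulnerabilities.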
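To make "Pareto-optimal" concrete: an objective is on the Pareto front if no other objective is at least as good on both axes and strictly better on one. The sketch below assumes each objective is summarized by a pair (misalignment score, KL divergence from the reference model), both lower-is-better. The objective names and numbers are placeholders chosen for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (misalignment score, KL divergence), lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_optimal(points: Dict[str, Point]) -> List[str]:
    """Return the names of all points not dominated by any other point."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# Placeholder values only; NOT measurements reported in the paper.
scores = {
    "objective_a": (0.10, 2.0),
    "objective_b": (0.30, 1.5),
    "objective_c": (0.40, 2.5),
}
front = pareto_optimal(scores)
```

Here `objective_c` is dominated by both other points, while `objective_a` and `objective_b` trade off against each other, so both stay on the front.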

This digest was generated by AI and reviewed by a human editor.