QUICK REVIEW

[Paper Review] Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma | arXiv (Cornell University) | May 29, 2023

Topic Modeling | Cited by 271

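One-Sentence Summary

Direct Preference Optimization (DPO) optimizes a language model policy directly from human preferences, without explicit reward modeling or reinforcement learning. It achieves alignment comparable to or better than PPO-based RLHF while being simpler to implement and train.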

ABSTRACT

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

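Motivation and Objectives

Addresses the need for steerable, safe, and aligned large language models by leveraging human preferences.
Proposes a new paradigm that optimizes the policy directly from preferences, without explicit reward modeling or RL.
Shows that a reparameterization yields the optimal policy in closed form, which reduces training to a simple classification loss.
Compares DPO with PPO-based RLHF on sentiment control, summarization, and dialogue.
Demonstrates DPO's stability, efficiency, and scalability on models with up to 6B parameters.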

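Proposed Method

Introduces a reward model parameterization that allows the optimal policy to be extracted in closed form (Eq. (4)).
Reparameterizes the reward as r(x,y) = β log(π(y|x)/π_ref(y|x)), up to a prompt-only term β log Z(x) that cancels in pairwise comparisons, and derives the Bradley-Terry preference loss directly in terms of the policy (Eq. (7)).
Builds a binary cross-entropy objective over preferred/dispreferred pairs using the implicit reward (Eq. (7)).
The gradient weights each example by how strongly the implicit reward misorders the pair, which prevents the model from degenerating (see the discussion of the gradient form).
Outlines a practical DPO pipeline: sample completions from π_ref, collect human preferences, and optimize with the DPO loss.
Discusses theoretical properties: under the Plackett-Luce/Bradley-Terry models DPO is equivalent to reward-based RL, and it has robustness advantages over actor-critic methods.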
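
The implicit reward, the binary cross-entropy objective, and the gradient weighting listed above can be sketched for a single preference pair as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`log_sigmoid`, `dpo_loss`, `gradient_weight`), the default `beta=0.1`, and the example log-probabilities are assumptions made for the sketch. Inputs are assumed to be summed token log-probabilities of each full response under the policy and under the frozen reference model.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss.

    *_w refers to the preferred response, *_l to the dispreferred one.
    Returns (loss, implicit_reward_w, implicit_reward_l).
    """
    # Implicit reward: beta * log(pi(y|x) / pi_ref(y|x)), in log space.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Bradley-Terry negative log-likelihood of the observed preference.
    loss = -log_sigmoid(reward_w - reward_l)
    return loss, reward_w, reward_l


def gradient_weight(reward_w, reward_l):
    """sigmoid(reward_l - reward_w): grows when the pair is misordered."""
    return 1.0 / (1.0 + math.exp(reward_w - reward_l))


# Hypothetical log-probabilities for one (prompt, preferred, dispreferred) triple.
loss, r_w, r_l = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
print(loss, gradient_weight(r_w, r_l))
```

When the policy equals the reference model, both implicit rewards are zero, so the loss is log 2 and the weight is 0.5. Pairs that the implicit reward already ranks correctly get a weight below 0.5 and contribute less to the update.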

Figure 1: DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes th

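Experimental Results

Research Questions

RQ1: Can optimizing the policy directly from human preferences match or exceed PPO-based RLHF on sentiment control, summarization, and dialogue?
RQ2: Does DPO's reparameterization recover the optimal policy without explicit reward modeling or an RL loop?
RQ3: How does DPO compare with baseline methods in performance and stability on sentiment control, summarization, and dialogue?
RQ4: Is DPO more efficient than PPO-based RLHF, and more robust to hyperparameters and sampling temperature?
RQ5: Which theoretical guarantees support DPO's validity under the Plackett-Luce/Bradley-Terry models?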
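
The idea behind RQ2 can be checked numerically on a toy problem. The sketch below assumes a single prompt with three candidate responses. The reference probabilities, rewards, and `beta` are made-up values, and the helper names are hypothetical. It builds the closed-form optimal policy, proportional to π_ref(y) exp(r(y)/β), and then reads the reward back from the policy through β log(π/π_ref).

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """Closed-form KL-regularized optimum over a finite response set.

    Returns (policy_probs, Z) where Z is the normalizing constant.
    """
    unnorm = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z


def implicit_rewards(policy_probs, ref_probs, beta):
    """beta * log(pi(y) / pi_ref(y)) for each response."""
    return [beta * math.log(p / q) for p, q in zip(policy_probs, ref_probs)]


# Toy setup: illustrative numbers only.
ref = [0.5, 0.3, 0.2]
r = [1.0, 0.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(ref, r, beta)
r_hat = implicit_rewards(pi_star, ref, beta)

# Each true reward differs from its implicit reward by the same constant,
# beta * log(Z), so all reward differences (and hence all Bradley-Terry
# preference probabilities) are preserved.
offsets = [ri - rh for ri, rh in zip(r, r_hat)]
print(offsets, beta * math.log(z))
```

Because the offset is identical for every response to the same prompt, it cancels in any pairwise comparison, which is why the preference loss can be written in terms of the policy alone.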

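Key Findings

DPO achieves the highest expected reward at every KL value on the reward/KL frontier, dominating PPO in the sentiment setting.
DPO matches or exceeds PPO-based RLHF on summarization and dialogue while requiring little hyperparameter tuning.
DPO remains robust across sampling temperatures and converges quickly to strong performance.
In the controlled sentiment setting, DPO outperforms PPO even when PPO has access to the ground-truth reward.
DPO uses a simple, stable training objective and delivers competitive performance on language models with up to 6B parameters.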
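
To make the two axes of the reward/KL frontier concrete, the sketch below computes expected reward and KL(π || π_ref) for a toy three-response problem while sweeping β. It does not reproduce the paper's experiments: the probabilities, rewards, β values, and function names are assumptions. It uses the closed-form KL-regularized optimum rather than any trained model.

```python
import math


def tilted_policy(ref_probs, rewards, beta):
    """Policy proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def expected_reward(probs, rewards):
    return sum(p * r for p, r in zip(probs, rewards))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def frontier(ref_probs, rewards, betas):
    """One (KL, expected reward) point per beta."""
    points = []
    for beta in betas:
        pi = tilted_policy(ref_probs, rewards, beta)
        points.append((kl_divergence(pi, ref_probs), expected_reward(pi, rewards)))
    return points


# Toy setup: illustrative numbers only.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
points = frontier(ref, rewards, betas=[10.0, 1.0, 0.5, 0.1])
for kl, rew in points:
    print(f"KL={kl:.3f}  expected reward={rew:.3f}")
```

Smaller β lets the policy move further from the reference, so KL and expected reward both grow. A method sits higher on the frontier when it obtains more reward for the same KL budget.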

Figure 2: Left. The frontier of expected reward vs KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization. Right. TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best


This review was generated by AI and reviewed by human editors.