QUICK REVIEW

[Paper Review] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu|arXiv (Cornell University)|Apr 16, 2024

Digital Rights Management and Security | Cited by 6

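One-Sentence Summary

This paper theoretically analyzes how DPO compares with PPO for RLHF alignment and shows empirically that PPO consistently outperforms DPO on dialogue and code-generation benchmarks, including a state-of-the-art result on CodeContest with a 34B model.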

ABSTRACT

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at https://github.com/openpsi-project/ReaLHF.

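Motivation and Objectives

Assess whether Direct Preference Optimization (DPO) is truly superior to Proximal Policy Optimization (PPO) for RLHF with LLMs.
Identify the fundamental limitations of DPO and the factors that affect PPO's performance in RLHF.
Benchmark DPO against PPO on dialogue and code-generation RLHF testbeds to determine practical best practices.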

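Proposed Method

Theoretically analyze the DPO objective and its relationship to PPO through the closed-form connection between reward-based and reward-free optimization.
Present a counterexample and synthetic experiments that illustrate DPO's potential bias and out-of-distribution (OOD) risk.
Run extensive empirical evaluations across multiple model sizes on real preference datasets (SafeRLHF, HH-RLHF) and code-generation benchmarks (APPS, CodeContest), comparing DPO, iterative DPO, and PPO.
Run ablation studies on PPO to identify the key factors that improve RLHF performance: advantage normalization, large batch size, and exponential moving average (EMA) updates of the reference model.
Examine the effects of data distribution, base-model choice, and iterative labeling strategies for mitigating DPO's distribution shift problem.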
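
As a rough illustration of the DPO objective and the OOD risk listed above, the following Python sketch computes the standard DPO loss for a single preference pair and evaluates it on a toy three-response problem. The function names dpo_loss and pair_loss, the value beta=0.1, and all probabilities are hypothetical choices made for this illustration; they are not taken from the paper's experiments or from its counterexample.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: chosen response w, rejected response l."""
    # Implicit reward margin: beta * (log-ratio of chosen - log-ratio of rejected).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def pair_loss(policy, ref, chosen, rejected, beta=0.1):
    """Evaluate the DPO loss from explicit response probabilities."""
    return dpo_loss(
        math.log(policy[chosen]), math.log(policy[rejected]),
        math.log(ref[chosen]), math.log(ref[rejected]),
        beta,
    )

# Toy setup (assumed values): three possible responses, but the preference
# data contains only one labeled pair, y1 preferred over y2. y3 is never labeled.
ref = {"y1": 0.4, "y2": 0.4, "y3": 0.2}
policy_a = {"y1": 0.8, "y2": 0.1, "y3": 0.1}     # most mass on the preferred response
policy_b = {"y1": 0.08, "y2": 0.01, "y3": 0.91}  # most mass on the unlabeled response

loss_a = pair_loss(policy_a, ref, "y1", "y2")
loss_b = pair_loss(policy_b, ref, "y1", "y2")
print(loss_a, loss_b)  # the two values agree up to floating-point rounding
```

Both toy policies keep the same ratio between y1 and y2, so they receive the same loss: the objective only sees the log-ratios of the two labeled responses, and nothing in it penalizes the mass that policy_b moves to the unlabeled response y3. A KL penalty against the reference policy, as used in reward-based RLHF, would be sensitive to that shift.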

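Experimental Results

Research Questions

RQ1: Under real-world data distributions, is DPO truly superior to PPO for RLHF-based LLM alignment?
RQ2: What theoretical and empirical limitations does DPO exhibit compared with PPO?
RQ3: Which factors most strongly affect PPO's RLHF performance, and can they be exploited to surpass DPO across benchmarks?
RQ4: How do the base model, preference data quality, and distribution shift affect DPO's performance in practice?
RQ5: Can iterative DPO or data-filtering strategies narrow the gap between DPO and PPO on challenging tasks such as code generation?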

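Key Findings

PPO consistently outperforms DPO on the benchmarks studied, covering both dialogue and code-generation tasks.
When the preference data distribution does not cover the relevant outputs, DPO can overfit to out-of-distribution responses and produce a biased policy.
The theoretical analysis shows that any solution found by PPO can also be expressed within the DPO framework, but the DPO objective admits a larger policy class, which can include undesirable solutions that PPO cannot reach under reference-model regularization.
Ablation studies show that PPO benefits from advantage normalization, larger batch sizes, and EMA updates of the reference model, with EMA bringing additional gains on challenging tasks.
On the CodeContest dataset, PPO with a CodeLlama-based 34B model achieves state-of-the-art performance, surpassing AlphaCode-41B and improving 10@1k from 16.4% to 22.4% in the reported setting.
Mitigating distribution shift (for example, via SFT on the Safe data or iterative labeling) improves DPO's performance, but even with a near-perfect annotator DPO remains less competitive on hard code-generation tasks.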
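
Two of the PPO factors named in the findings above, advantage normalization and EMA updates of the reference model, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it normalizes over one non-empty batch, treats model parameters as a plain dictionary of floats, and uses hypothetical names (normalize_advantages, ema_update) and a hypothetical decay value. The paper's actual implementation and hyperparameters may differ.

```python
import math

def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a non-empty batch of advantages to zero mean, unit std."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n  # population variance
    std = math.sqrt(var)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]

def ema_update(ref_params, policy_params, decay=0.99):
    """Move each reference parameter a small step toward the current policy."""
    # decay is an assumed value; a decay closer to 1 makes the reference move more slowly.
    return {
        name: decay * ref_params[name] + (1.0 - decay) * policy_params[name]
        for name in ref_params
    }

# Toy usage with made-up numbers.
normed = normalize_advantages([1.0, 2.0, 3.0])
new_ref = ema_update({"w": 0.0}, {"w": 1.0}, decay=0.9)
print(normed, new_ref)
```

Normalization keeps the scale of the policy-gradient signal comparable across batches, while the EMA update lets the KL anchor drift slowly toward the improving policy instead of staying fixed at the initial model.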

This review was generated by AI and reviewed by a human editor.