QUICK REVIEW

[Paper Review] ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong, Noah Lee|arXiv (Cornell University)|Mar 12, 2024

Multi-Criteria Decision Making|Cited by 7

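ONE-SENTENCE SUMMARY

ORPO introduces a reference-model-free, monolithic preference optimization method based on the odds ratio. It improves alignment fine-tuning without requiring RLHF or a reference model, and achieves strong instruction-following results across multiple models and datasets.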

ABSTRACT

While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).

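RESEARCH MOTIVATION AND OBJECTIVES

Investigate the role of supervised fine-tuning (SFT) in preference alignment.
Propose a reference-model-free, monolithic alignment method (ORPO).
Demonstrate ORPO's effectiveness on standard benchmarks across model sizes from 125M to 7B.
Compare ORPO against RLHF, DPO, and SFT baselines across a range of tasks.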

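PROPOSED METHOD

Define an odds-ratio-based penalty term that is added to the negative log-likelihood loss.
Combine L_SFT with the relative ratio loss L_OR to form L_ORPO.
Wrap the log odds ratio in a log-sigmoid to stabilize optimization.
Evaluate on the HH-RLHF and UltraFeedback datasets using the Phi-2, Llama-2, and Mistral models.
Compare against SFT, PPO, and DPO across model sizes.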
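The loss construction listed above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' released code: it assumes the sequence likelihood P(y|x) is the length-normalized (geometric-mean) token probability, that odds(y|x) = P(y|x) / (1 - P(y|x)), and that the two terms are combined as L_ORPO = L_SFT + λ·L_OR. The function names and the default weight `lam=0.1` are hypothetical choices made for this example.

```python
import math


def avg_log_prob(token_log_probs):
    """Length-normalized sequence log-likelihood (mean of per-token log-probs)."""
    return sum(token_log_probs) / len(token_log_probs)


def log_odds(log_p):
    """log(p / (1 - p)) computed from log p; requires log_p < 0 (i.e. p < 1)."""
    return log_p - math.log1p(-math.exp(log_p))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def orpo_loss(chosen_token_log_probs, rejected_token_log_probs, lam=0.1):
    """Return (L_ORPO, L_SFT, L_OR) for one preference pair.

    lam is an illustrative weight on the odds-ratio term (an assumption here).
    """
    log_p_w = avg_log_prob(chosen_token_log_probs)    # favored response
    log_p_l = avg_log_prob(rejected_token_log_probs)  # disfavored response

    l_sft = -log_p_w                                  # negative log-likelihood
    log_or = log_odds(log_p_w) - log_odds(log_p_l)    # log odds ratio
    l_or = -log_sigmoid(log_or)                       # odds-ratio penalty

    return l_sft + lam * l_or, l_sft, l_or


# Toy pair: per-token probability 0.5 for the chosen response, 0.2 for the rejected one.
total, l_sft, l_or = orpo_loss([math.log(0.5)] * 3, [math.log(0.2)] * 2)
print(total, l_sft, l_or)
```

In the toy pair, the odds are 1 for the chosen response and 0.25 for the rejected one, so the odds ratio is 4 and L_OR = -log(0.8) ≈ 0.223. Swapping the two responses raises the penalty to -log(0.2) ≈ 1.609, which is the behavior the penalty term is meant to provide.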

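EXPERIMENTAL RESULTS

Research Questions

RQ1: Is SFT with only a simple penalty term sufficient to achieve preference alignment?
RQ2: Does the reference-model-free odds ratio objective improve alignment across multiple model sizes?
RQ3: How does ORPO compare with RLHF and DPO in terms of win rate and reward distribution on standard benchmarks?
RQ4: What effect does ORPO have on instruction-following ability and multi-turn tasks?
RQ5: Is ORPO more computationally efficient than reference-based methods?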

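Key Findings

ORPO achieves strong instruction-following performance on AlpacaEval 2.0 and MT-Bench, surpassing some state-of-the-art models with more than 7B parameters.
Mistral-ORPO-α and Mistral-ORPO-β (7B) reach 11.33% and 12.20% on AlpacaEval 2.0, and 7.23 and 7.32 on MT-Bench, respectively.
On HH-RLHF, ORPO outperforms SFT and PPO at every model size tested, with win rates of up to 78.0% against SFT and 79.4% against PPO.
On UltraFeedback, ORPO's win rate reaches up to 80.5% against SFT and 85.8% against PPO, and ORPO compares more favorably against DPO at larger model sizes.
ORPO requires no reference model, which reduces the number of forward passes and the compute cost relative to RLHF and DPO.
In the settings tested, the reward distributions indicate a higher expected reward for ORPO than for RLHF and DPO.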
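The computational-efficiency finding can be made concrete with a rough count of forward passes per preference pair. The sketch below is a simplification under stated assumptions: one forward pass per (model, response) combination, with backward passes, memory, and the more involved RLHF/PPO pipeline all left out. The helper name `forward_passes_per_pair` is hypothetical.

```python
def forward_passes_per_pair(num_models, responses_per_pair=2):
    """Forward passes needed to score one (chosen, rejected) pair.

    Assumes one pass per model per response; this is a simplified cost model.
    """
    return num_models * responses_per_pair


# DPO scores both responses with the trained policy and a frozen reference model.
dpo_passes = forward_passes_per_pair(num_models=2)

# ORPO scores both responses with the trained policy only.
orpo_passes = forward_passes_per_pair(num_models=1)

print(dpo_passes, orpo_passes)
```

Under these assumptions, dropping the reference model halves the forward passes per pair and also removes the need to keep a second set of weights loaded during training.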

This review was generated by AI and reviewed by a human editor.