QUICK REVIEW

[Paper Review] SimPO: Simple Preference Optimization with a Reference-Free Reward

Yu Meng, Mengzhou Xia|arXiv (Cornell University)|May 23, 2024

Constraint Satisfaction and Optimization | Cited by 8

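One-sentence summary

SimPO proposes a simple, reference-free reward based on the average log probability of a sequence, combined with a target reward margin, and consistently outperforms DPO across multiple open benchmarks and model families.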

ABSTRACT

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes.

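Research motivation and goals

Treat offline preference optimization as a simpler alternative to the RLHF pipeline.
Propose a reward aligned with the generation metric by using the length-normalized average log probability.
Introduce a target reward margin to increase the separation between winning and losing responses.
Demonstrate robustness and performance gains on standard benchmarks for both base and instruction-tuned models.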

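Proposed method

Define an implicit, reference-free reward r_SimPO(x, y) = (β/|y|) log π_θ(y|x), so that the training reward matches the likelihood metric that guides generation.
Add a target margin γ to the Bradley-Terry objective so that the reward difference r(x, y_w) − r(x, y_l) is encouraged to exceed γ.
Train on offline preference data with the Bradley-Terry (BT) ranking objective, with no separate reward model or reference policy.
Evaluate on base and instruction-tuned models (Llama3-8B-Instruct, Mistral-7B) and on the AlpacaEval 2, Arena-Hard, and MT-Bench benchmarks.
Compare SimPO with DPO and other offline methods, tuning β (2.0–2.5) and γ (0.5–1.5) for best performance.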
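The reward and objective described above can be sketched for a single preference pair as follows. This is a minimal, framework-free illustration, not the authors' implementation: the function names, the list-of-token-log-probabilities input format, and the default values of `beta` and `gamma` are assumptions chosen for readability. The loss is the Bradley-Terry negative log-sigmoid of the reward difference minus the margin γ.

```python
import math


def sequence_reward(token_logps, beta):
    """Length-normalized reward: (beta / |y|) * sum of per-token log-probs.

    token_logps is a hypothetical input format: one log pi_theta(y_t | x, y_<t)
    value per generated token.
    """
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """Bradley-Terry loss with a target margin gamma for one (y_w, y_l) pair."""
    margin = (
        sequence_reward(chosen_token_logps, beta)
        - sequence_reward(rejected_token_logps, beta)
        - gamma
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Made-up token log-probs for a winning and a losing response.
    chosen = [-0.2, -0.3, -0.1]
    rejected = [-1.0, -1.2]
    print("reward(y_w):", sequence_reward(chosen, beta=2.0))
    print("reward(y_l):", sequence_reward(rejected, beta=2.0))
    print("loss:", simpo_loss(chosen, rejected, beta=2.0, gamma=1.0))
```

No reference model appears anywhere in the computation: only the policy's own token log probabilities are needed, which is where the compute and memory savings mentioned in the abstract come from. In a real training loop the same expression would be averaged over a batch and differentiated by an autodiff framework.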

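Experimental results

Research questions

RQ1: Does aligning the training reward with the generation metric (average log-likelihood) outperform DPO?
RQ2: What is the effect of removing the reference model and using a length-normalized reward?
RQ3: How does the target reward margin γ affect reward accuracy and generation quality?
RQ4: Do SimPO's gains hold broadly across base models, instruction-tuned models, and multiple benchmarks?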

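Key findings

SimPO consistently outperforms DPO and related methods on the AlpacaEval 2, Arena-Hard, and MT-Bench benchmarks.
On AlpacaEval 2, SimPO improves the length-controlled (LC) win rate over DPO by up to 6.4 points; on Arena-Hard, the gain in win rate is up to 7.5 points.
The best model built on Llama3-8B-Instruct reaches a 44.7% length-controlled win rate on AlpacaEval 2 and a 33.8% win rate on Arena-Hard, ahead of several competing models.
Length normalization is crucial; removing it leads to longer, more repetitive outputs and worse reward alignment.
Raising the margin γ improves reward accuracy, but setting it too high can lower the win rate, which points to a trade-off between reward calibration and generation quality.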
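The last two findings can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's ablations: the two responses, their per-token log probabilities, and the helper names are all hypothetical. It shows (a) that a summed log-probability reward changes with length even when per-token quality is identical, while the averaged reward does not, and (b) that a larger γ leaves a larger gradient weight on a pair whose reward gap is fixed, meaning the objective keeps pushing that pair apart.

```python
import math


def summed_reward(token_logps, beta=2.0):
    """Reward without length normalization: beta * sum of token log-probs."""
    return beta * sum(token_logps)


def average_reward(token_logps, beta=2.0):
    """Length-normalized reward: beta * mean of token log-probs."""
    return beta * sum(token_logps) / len(token_logps)


def pair_weight(reward_gap, gamma):
    """Magnitude of d/d(gap) of -log(sigmoid(gap - gamma)), i.e. sigmoid(gamma - gap)."""
    return 1.0 / (1.0 + math.exp(reward_gap - gamma))


if __name__ == "__main__":
    # Two hypothetical responses with the same per-token log probability.
    short_response = [-0.5] * 10
    long_response = [-0.5] * 40

    print("summed :", summed_reward(short_response), summed_reward(long_response))
    print("average:", average_reward(short_response), average_reward(long_response))

    # Same reward gap, increasing target margin.
    for gamma in (0.0, 0.5, 1.0, 1.5):
        print(f"gamma={gamma}: weight={pair_weight(1.0, gamma):.3f}")
```

The summed reward differs by a factor of four between the two responses purely because of length, whereas the averaged reward is identical. The weight printed in the loop grows with γ, which is consistent with the reported observation that a larger margin separates rewards more strongly; whether that extra pressure helps or hurts generation quality is an empirical question the toy example does not answer.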


This review was generated by AI and reviewed by a human editor.