QUICK REVIEW

[Paper Digest] Statistical Rejection Sampling Improves Preference Optimization

Tianqi Liu, Yao Zhao|arXiv (Cornell University)|Sep 13, 2023

Topic Modeling | Cited by 7

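One-Sentence Summary

The paper proposes Statistical Rejection Sampling Optimization (RSO), which samples preference data from the estimated target optimal policy, unifies the DPO and SLiC losses, and shows that RSO consistently outperforms both methods across tasks and evaluations.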

ABSTRACT

Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal Policy Optimization (PPO). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. DPO's lack of a reward model constrains its ability to sample preference pairs from the optimal policy, and SLiC is restricted to sampling preference pairs only from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) that aims to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across three diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO on evaluations from both Large Language Model (LLM) and human raters.

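Motivation and Goals

Use offline methods to improve the alignment of language models with human preferences while avoiding the complexity of full RLHF.
Unify the DPO and SLiC loss formulations from a preference modeling perspective.
Develop a scalable method that uses statistical rejection sampling to sample preference data from the estimated optimal policy.
Demonstrate empirical gains for RSO over the strongest offline baselines across multiple tasks and evaluations.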

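Proposed Method

Model preference data under the Bradley–Terry framework, which links the optimal policy to pairwise rewards.
Train a pairwise reward-ranking model to estimate the preference probability for a pair of responses.
Introduce statistical rejection sampling, which uses the SFT policy as the proposal distribution to generate samples from the estimated optimal policy; the reward model then labels them.
Explore several loss functions (logistic regression and hinge) and preference-pair construction schemes for fitting the optimal policy.
Unify the DPO and SLiC losses (logistic regression vs. hinge) under a common preference modeling view and compare their behavior.
Demonstrate scalability by applying RSO to a larger model (T5-XXL) and evaluating with proxy, gold, AutoSxS, and human evaluation metrics.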
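
As a reading aid for the first and third steps above, the relations they refer to can be written out in the standard form used in the preference optimization literature. This is a sketch rather than a quotation from the paper: the symbols $r$ (reward), $Z(x)$ (normalizer), $\pi_{\text{sft}}$ (SFT policy), and $\pi_r$ (target optimal policy) are notation assumed here, and $\beta$ is taken to be the rejection sampling temperature mentioned in the findings.

$$
P(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)
$$

$$
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{sft}}(y \mid x)\, \exp\!\big(r(x, y) / \beta\big)
$$

Under these assumptions, the ratio $\pi_r / \pi_{\text{sft}}$ depends on $y$ only through $\exp(r(x, y)/\beta)$, which is what allows the SFT policy to serve as a proposal distribution for rejection sampling.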

Figure 1: RSO first fits a pairwise reward-ranking model from human preference data. This model is later applied to generate preference pairs with candidates sampled from the optimal policy, followed by a preference optimization step to align sequence likelihood towards preferences.

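Experimental Results

Research Questions

RQ1: How can preference data be sampled from the estimated target optimal policy so as to better estimate the optimal policy itself?
RQ2: Do rejection-sampling-based data generation and reward-model labeling improve policy optimization over DPO and SLiC?
RQ3: How do different loss forms (logistic regression vs. hinge) and data distributions (direct, sft-sample-rank, rso-sample-rank) affect alignment with human preferences?
RQ4: Can RSO scale to larger models while maintaining or improving alignment across tasks and evaluation modalities?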

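Key Findings

RSO variants consistently outperform the DPO and SLiC baselines across tasks and evaluation metrics.
Among the sampling strategies, rso-sample-rank yields gains over the direct and sft-sample-rank approaches.
RSO scales to a larger policy model (T5-XXL) and improves AutoSxS over DPO on two tasks.
In human evaluation, rso-sample-rank with the sigmoid-norm or hinge-norm loss is preferred over the direct and sft-sample-rank baselines.
The choice of gamma (loss temperature) and beta (rejection sampling temperature) has a significant effect; moderate values generally perform best.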
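
To make the role of gamma concrete, here is a minimal sketch of the two loss families for a single preference pair. It assumes the "norm" variants measure each response's log-likelihood relative to the SFT policy and scale the resulting margin by gamma; the function names and the scalar, framework-free style are illustrative choices, not the paper's implementation.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, gamma):
    """Scaled margin between the preferred (w) and dispreferred (l) response.

    Assumption: each log-likelihood is normalized by the SFT (reference)
    policy's log-likelihood before the difference is taken.
    """
    return gamma * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def sigmoid_norm_loss(margin):
    """Logistic-regression loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def hinge_norm_loss(margin):
    """Hinge loss with unit margin: zero once the scaled margin reaches 1."""
    return max(0.0, 1.0 - margin)


# Hypothetical log-likelihoods for one pair, for illustration only.
margin = log_ratio_margin(-1.0, -2.0, -1.5, -1.5, gamma=2.0)
print(sigmoid_norm_loss(margin), hinge_norm_loss(margin))
```

In this sketch a larger gamma amplifies the same log-ratio difference, so the hinge loss reaches zero sooner and the logistic loss saturates faster, which is one way to read the sensitivity to gamma reported above.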

Figure 2: Statistical rejection sampling illustration. There are three curves in the figure: $M$ times SFT policy, reward, optimal policy. The sample is first generated by SFT policy, then gets accepted or rejected depending on whether a uniform random variable locates in acceptance or rejection reg
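
The accept/reject step in the figure can be sketched in a few lines. This is a simplified single-pass illustration, not the paper's algorithm: it assumes the envelope constant $M$ is estimated from the highest reward among the drawn candidates, so that a candidate with reward r is accepted with probability exp((r - r_max) / beta). The function name `rejection_sample` and its signature are hypothetical.

```python
import math
import random


def rejection_sample(candidates, rewards, beta, rng=None):
    """Filter SFT-policy candidates so the survivors favor high reward.

    Each candidate is kept with probability exp((r - r_max) / beta),
    where r_max is the largest reward among the candidates (an assumed,
    sample-based stand-in for the envelope constant M).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    rng = rng or random.Random()
    r_max = max(rewards)
    accepted = []
    for cand, r in zip(candidates, rewards):
        accept_prob = math.exp((r - r_max) / beta)
        # Uniform draw in [0, 1): accept if it falls below the ratio.
        if rng.random() < accept_prob:
            accepted.append(cand)
    return accepted


# Hypothetical candidates and reward scores, for illustration only.
responses = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.4, 0.7]
print(rejection_sample(responses, scores, beta=0.5, rng=random.Random(0)))
```

In this sketch the highest-reward candidate is always kept. A very small beta keeps almost nothing else, while a large beta keeps nearly every candidate and so approaches plain sampling from the SFT policy.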


This digest was generated by AI and reviewed by a human editor.