QUICK REVIEW

[Paper Digest] Preference Ranking Optimization for Human Alignment

Feifan Song, Bowen Yu|arXiv (Cornell University)|Jun 30, 2023

Topic Modeling | Cited by 12

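One-sentence summary

PRO trains an LLM to align with human preferences by directly optimizing the probability ranking of multiple responses; it outperforms several baselines and approaches ChatGPT and human responses across multiple evaluations.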

ABSTRACT

Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the rest responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

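Motivation and objectives

Explain why LLMs need to be aligned with human values in order to reduce harmful or misleading content.
Propose PRO as a direct alternative to PPO that optimizes over human preference rankings.
Extend the Bradley-Terry comparison to longer human preference rankings and derive a differentiable loss for PRO.
Demonstrate PRO's data efficiency and its compatibility with self-bootstrapping and with grafted reward models.
Evaluate PRO against multiple baselines across different ranking lengths and evaluation methods.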

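Proposed method

Extend the Bradley-Terry comparison to longer human preference rankings through a recursive product of conditional probabilities (Eq. 5).
Define the differentiable scoring function r_pi(x, y^k) as the per-token log-likelihood of candidate y^k (Eq. 6).
Train the LLM by minimizing a combined loss: the PRO objective plus the SFT loss (Eq. 7).
Use the differentiable contrastive PRO loss (Eq. 8) to align the LLM's ranking with human preferences.
Optionally graft RLHF elements onto PRO, including affordable preference ranking, differentiated contrast (Eqs. 9-11), and self-bootstrapping augmentation (Eq. 12).
Run experiments on the HH-RLHF dataset with LLaMA-7B as the backbone, comparing PRO with SFT, RLHF, CoH, RRHF, BoN, and strong LLM baselines; evaluate with BLEU, a reward model, GPT-4, and human judgment.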
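
The ranking objective summarized above can be illustrated with a small numeric sketch. It is not the authors' implementation: the function names are hypothetical, plain floats stand in for model tensors, and three details are assumptions made for illustration (the score is averaged over tokens, the SFT term is the averaged negative log-likelihood of the top-ranked response, and the two terms are mixed by a weight called beta here). At step k, candidate k is contrasted against itself and every lower-ranked candidate, and the per-step terms are summed.

```python
import math
from typing import Sequence


def sequence_score(token_logprobs: Sequence[float]) -> float:
    """Score of one candidate: mean per-token log-probability.

    Averaging over tokens is an assumption of this sketch.
    """
    return sum(token_logprobs) / len(token_logprobs)


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def ranking_loss(scores: Sequence[float]) -> float:
    """Negative log of a recursive product of softmax-style terms.

    `scores` must be ordered from most to least preferred. Step k compares
    scores[k] with scores[k:], so the last candidate needs no step of its own.
    """
    loss = 0.0
    for k in range(len(scores) - 1):
        loss -= scores[k] - log_sum_exp(scores[k:])
    return loss


def pro_objective(candidates: Sequence[Sequence[float]], beta: float = 1.0) -> float:
    """Ranking loss plus a weighted SFT term on the top-ranked candidate.

    `candidates` holds per-token log-probs for each response, best first.
    `beta` is a hypothetical name for the mixing weight.
    """
    scores = [sequence_score(c) for c in candidates]
    sft_loss = -scores[0]  # assumed: averaged NLL of the best response
    return ranking_loss(scores) + beta * sft_loss


if __name__ == "__main__":
    # Two candidates, the preferred one listed first.
    ranked = [[-0.1, -0.3], [-1.0, -2.0]]
    print(pro_objective(ranked, beta=1.0))
```

With two candidates the ranking term reduces to the familiar pair-wise Bradley-Terry contrast; longer lists simply add one contrast step per additional rank. The loss is lower when the scores already follow the preferred order than when they are reversed.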

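Experimental results

Research questions

RQ1: Can PRO surpass PPO-based RLHF at aligning LLMs with human preferences when it uses longer ranking sequences?
RQ2: How does ranking length affect alignment quality and evaluation scores?
RQ3: How does using higher-quality or more diverse candidate rankings affect PRO's performance?
RQ4: How does PRO compare with existing baselines in automatic and human evaluations?
RQ5: Can PRO be effectively augmented with RLHF components to balance flexibility and efficiency?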

Key findings

Sub-set	Method	BLEU	Reward
Harmless_base	PRO	12.05	62.96
Helpful_base	PRO	20.83	48.51
Helpful_online	PRO	28.75	59.02
Helpful_rejection	PRO	27.17	53.28

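Even with a ranking length of 2, PRO outperforms competitive baselines, exceeding SFT by 6.52 reward points and RRHF by 3.1 points in raw HH-RLHF scores.
Longer ranking sequences consistently improve PRO's human-alignment performance.
Higher-quality and more diverse candidate rankings (including ChatGPT examples) improve PRO, letting it reach reward scores close to those of larger models while using fewer parameters.
Self-bootstrapping provides incremental gains, but high-quality external samples yield larger gains.
GPT-4 and human evaluations largely favor PRO over RRHF and over the golden reference samples, indicating strong agreement with human preferences.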

This digest was generated by AI and reviewed by human editors.