QUICK REVIEW

[Paper Review] RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Zheng Yuan, Hongyi Yuan | arXiv (Cornell University) | Apr 11, 2023

Topic Modeling | Cited by 34

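ONE-SENTENCE SUMMARY

RRHF scores multiple sampled responses by their log probabilities under the model and trains those scores to match human-preference rankings, using a ranking loss plus supervised fine-tuning. It needs only 1 to 2 models, accepts responses from multiple sources, and reaches performance comparable to PPO with simpler implementation and training requirements.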

ABSTRACT

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. RRHF can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality which suggests RRHF is a best-of-n learner. Codes available at https://github.com/GanjinZero/RRHF.

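MOTIVATION AND OBJECTIVES

Motivate a simpler alternative to PPO-based RLHF for aligning large language models with human preferences.
Propose RRHF, which ranks multiple responses from diverse sources by their log probabilities.
Show that RRHF achieves alignment comparable to PPO with fewer models and fewer hyperparameters.
Demonstrate RRHF's effectiveness on Anthropic's Helpful and Harmless (HH) dataset and analyze the effect of sampling quality.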

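PROPOSED METHOD

Sample multiple responses from diverse sources (for example the model itself, other large language models, and human experts).
Score each response by its length-normalized log probability under the current model: p_i = (1/|y_i|) * sum_t log P_pi(y_{i,t} | x, y_{i,<t}).
Optimize a ranking loss L_rank that pushes responses with higher reward r_i toward higher scores p_i: L_rank = sum_{r_i<r_j} max(0, p_i - p_j).
Add a supervised fine-tuning loss L_ft on the highest-reward response so that the model stays faithful to instruction following.
The total loss is L = L_rank + L_ft. The ranking loss has no margin term, and no separate value model or KL term is required.
RRHF can be viewed as an extension of SFT and a lightweight alternative to PPO that avoids multiple models and complex hyperparameter tuning.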
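
The loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the input layout (a list of per-token log probabilities for each candidate response) are assumptions made for this example, and L_ft is taken here as a summed token negative log-likelihood of the highest-reward response. In practice the token log probabilities would come from the policy model and the loss would be computed on tensors so it can be differentiated.

```python
from typing import Dict, List, Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """p_i: mean per-token log probability of one response under the policy."""
    return sum(token_logprobs) / len(token_logprobs)


def rrhf_loss(token_logprobs: List[Sequence[float]],
              rewards: Sequence[float]) -> Dict[str, float]:
    """Compute L_rank, L_ft and their sum for the candidates of one query.

    token_logprobs[i][t] is assumed to hold log P_pi(y_{i,t} | x, y_{i,<t}).
    rewards[i] is the reward r_i assigned to response i.
    """
    scores = [length_normalized_score(lp) for lp in token_logprobs]

    # Ranking loss: for every pair with r_i < r_j, penalize p_i exceeding p_j.
    # There is no margin term, matching the description above.
    rank_loss = 0.0
    for i, r_i in enumerate(rewards):
        for j, r_j in enumerate(rewards):
            if r_i < r_j:
                rank_loss += max(0.0, scores[i] - scores[j])

    # SFT loss on the highest-reward response (assumed: summed token NLL).
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    ft_loss = -sum(token_logprobs[best])

    return {"rank": rank_loss, "ft": ft_loss, "total": rank_loss + ft_loss}


# Hypothetical toy inputs: two candidate responses for one query.
logprobs = [
    [-0.5, -0.5],        # response 0, p_0 = -0.5
    [-1.0, -2.0, -3.0],  # response 1, p_1 = -2.0
]
rewards = [0.25, 0.75]   # response 1 is preferred
out = rrhf_loss(logprobs, rewards)
print(out)
```

In this toy case the model currently assigns the less preferred response the higher score, so the single mis-ordered pair contributes p_0 - p_1 to the ranking loss, while the fine-tuning term is computed on response 1 only.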

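EXPERIMENTAL RESULTS

Research Questions

RQ1: Can RRHF achieve alignment comparable to PPO by ranking log probabilities while using a minimal number of models?
RQ2: How does the quality of the sampled responses affect RRHF's performance?
RQ3: Can RRHF learn to rank by human preference using responses from diverse sources (the model itself, other large models, humans)?
RQ4: Is RRHF easier to implement and scale than PPO while maintaining similar results?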

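Key Findings

RRHF with diverse sampling (DP or SP) reaches reward levels comparable to PPO on the HH dataset.
RRHF's performance improves with the quality of the sampled responses and approaches the maximum reward of the sampled set.
RRHF needs only 1 to 2 models and requires substantially less code and hyperparameter tuning than PPO.
The ranking loss is essential; removing it degrades performance.
Iterative training (RRHF IP-2) further improves human evaluation results compared with a single round of RRHF.
Wombat, a model trained with RRHF on samples from ChatGPT, InstructGPT, LLaMA, and Alpaca, can outperform SFT baselines under similar resource conditions.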
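
The finding that RRHF approaches the maximum reward of its sampled set (the "best-of-n learner" reading from the abstract) can be made concrete with a small calculation. The sketch below compares the average reward of all sampled responses with the average reward of the best response per query, which is the level a best-of-n learner would aim for. The function name and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Dict, List


def sample_reward_summary(rewards_per_query: List[List[float]]) -> Dict[str, float]:
    """Average reward of all samples vs. average of the per-query best sample."""
    all_rewards = [r for rewards in rewards_per_query for r in rewards]
    return {
        "mean_reward": mean(all_rewards),
        "best_of_n_reward": mean(max(rewards) for rewards in rewards_per_query),
    }


# Hypothetical reward scores: 2 queries with 4 sampled responses each.
sampled_rewards = [
    [-1.0, 0.0, 0.5, 1.5],
    [-2.0, -1.0, 0.0, 1.0],
]
summary = sample_reward_summary(sampled_rewards)
print(summary)
```

Under this reading, better sampling raises the best-of-n figure, and with it the level the trained model can be expected to approach.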

This review was generated by AI and reviewed by a human editor.