QUICK REVIEW

[Paper Review] Secrets of RLHF in Large Language Models Part II: Reward Modeling

Binghai Wang, Rui Zheng|arXiv (Cornell University)|Jan 11, 2024
Topic Modeling | Cited by 7
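One-Sentence Summary

This paper improves reward models for iterative RLHF by measuring preference strength, mitigating noisy and ambiguous data, and improving generalization through contrastive learning and meta-learning.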

ABSTRACT

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.

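Research Motivation and Objectives

  • Determine how incorrect and ambiguous preference data hinder reward models in RLHF.
  • Propose methods that measure and exploit preference strength to improve reward model (RM) quality.
  • Develop data-level and algorithm-level strategies (contrastive learning, meta-learning) to improve RM generalization and enable iterative RLHF.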

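Proposed Methods

  • Formalize a preference-strength measure through voting among multiple reward models, to distinguish incorrect, ambiguous, and normal data.
  • Apply label flipping and label smoothing to mitigate noisy preferences and improve RM robustness.
  • Introduce an adaptive margin, guided by preference strength, into the RM loss to sharpen discrimination.
  • Combine an unsupervised contrastive loss (SwAV/SimCSE) with the RM loss to improve feature discrimination.
  • Introduce MetaRM to maintain the RM's discriminative ability as the policy distribution shifts during PPO.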
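
The first three methods can be illustrated with a small sketch. It assumes that preference strength is summarized as the mean and standard deviation of the chosen-minus-rejected reward gap across an ensemble of reward models, and that the RM loss is the usual pairwise log-sigmoid loss. The function names, the `ambiguous_band` threshold, and the choice to fold margin and smoothing into one function are hypothetical conveniences of this sketch, not details taken from the paper.

```python
import math
from statistics import mean, pstdev


def preference_strength(chosen_scores, rejected_scores):
    """Mean and std of the (chosen - rejected) reward gap across an ensemble.

    Each list holds one score per reward model for the same preference pair.
    """
    gaps = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return mean(gaps), pstdev(gaps)


def categorize(mean_gap, ambiguous_band=0.5):
    """Bucket a pair by its mean gap. The band width is a hypothetical value."""
    if mean_gap < -ambiguous_band:
        return "likely_incorrect"  # ensemble prefers the rejected response
    if mean_gap <= ambiguous_band:
        return "ambiguous"         # ensemble sees little difference
    return "normal"


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def rm_loss(r_chosen, r_rejected, margin=0.0, smoothing=0.0):
    """Pairwise RM loss with an optional margin and label smoothing.

    margin:    subtracted from the reward gap (e.g. set from preference strength).
    smoothing: probability mass moved to the opposite label.
    """
    gap = r_chosen - r_rejected - margin
    return -(1.0 - smoothing) * log_sigmoid(gap) - smoothing * log_sigmoid(-gap)


# Label flipping amounts to swapping the two responses for a likely-incorrect pair.
mu, sigma = preference_strength([-1.2, -0.8, -1.0], [0.0, 0.0, 0.0])
if categorize(mu) == "likely_incorrect":
    flipped_loss = rm_loss(r_chosen=0.4, r_rejected=-0.3)  # roles swapped
```

In this sketch a larger margin demands a wider reward gap before the loss becomes small, so passing a margin that grows with the mean gap pushes the model hardest on pairs the ensemble agrees on.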

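Experimental Results

Research Questions

  • RQ1: How do incorrect and ambiguous preference data affect reward model performance in RLHF?
  • RQ2: Can a data-driven preference-strength measure improve reward model quality and stability?
  • RQ3: Do contrastive learning and meta-learning improve RM generalization to out-of-distribution data and enable iterative RLHF?
  • RQ4: Which training strategies (label flipping, smoothing, adaptive margin) best suppress noise while preserving the useful signal in the preferences?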

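Key Findings

  • Preference strength correlates with annotation quality and with agreement across multiple reward models.
  • Removing or adjusting low-strength or noisy data through flipping or smoothing improves RM stability and RLHF outcomes.
  • Adaptive margins and soft labels help the RM learn robustly from strong preferences and mitigate overfitting.
  • Contrastive learning (especially SimCSE) makes PPO training more stable and yields slight gains on harmlessness and helpfulness evaluations.
  • MetaRM improves discrimination between responses as the policy distribution shifts, supporting iterative RLHF.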
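
To make the contrastive-learning finding more concrete, here is a minimal sketch of a SimCSE-style objective. It assumes the standard SimCSE recipe: each input is encoded twice (for example with different dropout masks), the two views of the same input form a positive pair, and the other inputs in the batch serve as negatives. The function names, the `temperature` default, and the `beta` weighting are assumptions made for illustration; the summary above does not state how the paper weights the two losses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(views_a, views_b, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    views_a[i] and views_b[i] are two embeddings of the same input;
    views_b[j] for j != i act as negatives for views_a[i].
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # log-sum-exp with max subtraction for numerical stability
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(views_a)


def combined_loss(rm_loss_value, contrastive_loss_value, beta=1.0):
    """Weighted sum of the RM loss and the contrastive loss (beta is hypothetical)."""
    return rm_loss_value + beta * contrastive_loss_value


# Two views that agree per input give a loss near zero;
# views matched to the wrong input give a large loss.
aligned = simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

A real implementation would compute the embeddings from the reward model's hidden states and backpropagate through both terms; this sketch only shows how the loss value is formed.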

This review was generated by AI and reviewed by human editors.