QUICK REVIEW

[Paper Review] Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman|arXiv (Cornell University)|Oct 19, 2022

Reinforcement Learning in Robotics | Cited by 35

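One-sentence summary

This paper empirically derives scaling laws for reward model (RM) overoptimization under reinforcement learning (RL) and best-of-n (BoN) sampling, using a synthetic gold-standard reward model to quantify the effects of RM size, RM data size, and policy size.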

ABSTRACT

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.

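Motivation and goals

Understand how optimizing against a proxy reward model affects the true (gold-standard) reward.
Characterize how overoptimization scales with reward model size, data size, and policy size.
Compare reinforcement learning and best-of-n sampling in terms of overoptimization and efficiency.
Explore the implications for AI alignment and for Goodhart's law in RLHF.
Provide predictive scaling forms for the gold reward model score across configurations.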

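Proposed method

Use a synthetic setup in which a fixed gold-standard reward model labels comparisons, and train proxy reward models on those labels.
Optimize against the proxy reward model with either PPO-based reinforcement learning or best-of-n sampling.
Define the distance d = sqrt(KL(pi || pi_init)) to quantify how far optimization has moved the policy from its initialization, and express the scaling forms in terms of d.
Fit functional forms for the gold RM score R(d) under BoN and RL: R_BoN(d) = d(α_BoN − β_BoN d) and R_RL(d) = d(α_RL − β_RL log d).
Study how α and β vary with proxy RM parameter count, RM data size, and the KL penalty; recalibrate the RM scores.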
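
The two functional forms above are both linear in α and β, so they can be fitted with ordinary least squares. The following is a minimal sketch under that observation, not the authors' fitting code: the function names (`kl_distance`, `gold_score_bon`, `gold_score_rl`, `fit_coefficients`) are hypothetical, and the demo coefficients (1.2 and 0.4) are made-up values used only to generate noise-free synthetic data.

```python
import numpy as np


def kl_distance(kl_nats):
    """d = sqrt(KL(pi || pi_init)), with the KL divergence given in nats."""
    return np.sqrt(kl_nats)


def gold_score_bon(d, alpha, beta):
    """Best-of-n form from the summary: R_BoN(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)


def gold_score_rl(d, alpha, beta):
    """RL form from the summary: R_RL(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * np.log(d))


def fit_coefficients(d, scores, form):
    """Least-squares estimate of (alpha, beta) for form "bon" or "rl".

    Both forms can be written as R = alpha * d - beta * basis(d),
    which is linear in the two unknowns.
    """
    d = np.asarray(d, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if form == "bon":
        basis = d * d
    elif form == "rl":
        basis = d * np.log(d)
    else:
        raise ValueError("form must be 'bon' or 'rl'")
    design = np.column_stack([d, -basis])
    (alpha, beta), *_ = np.linalg.lstsq(design, scores, rcond=None)
    return float(alpha), float(beta)


# Demo on synthetic, noise-free data (illustrative values, not from the paper).
d_grid = kl_distance(np.linspace(0.25, 64.0, 40))
alpha_hat, beta_hat = fit_coefficients(d_grid, gold_score_rl(d_grid, 1.2, 0.4), "rl")
print(alpha_hat, beta_hat)
```

With real measurements, `scores` would be the gold RM scores observed at each value of d; the fit requires d > 0 for the RL form because of the logarithm.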

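Experimental results

Research questions

RQ1: How does the gold reward score change as optimization progresses under each method (BoN and RL)?
RQ2: What are the functional forms of overoptimization for BoN and RL, and how well do they fit the empirical data?
RQ3: How do RM size, RM data size, and policy size affect the scaling coefficients and the peak gold score?
RQ4: Under RL, how does the KL penalty affect the gold reward frontier and the proxy-gold gap?
RQ5: What do these scaling laws imply for RLHF and for AI alignment theory (for example, Goodhart's law)?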

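Key findings

For BoN, the gold reward follows R_BoN(d) = d(α_BoN − β_BoN d), with coefficients that vary smoothly with RM size and data size.
For RL, the gold reward follows R_RL(d) = d(α_RL − β_RL log d), where α_RL is roughly independent of RM size and β_RL varies with the RM's properties.
Measured in KL distance, reinforcement learning tends to be slower than BoN at both optimization and overoptimization.
The α and β coefficients for both BoN and RL scale smoothly with the number of proxy RM parameters and the amount of RM data, following an approximately logarithmic trend.
The KL penalty in RL raises the proxy RM score but does not improve the gold RM frontier, which suggests that an explicit KL penalty has limited value in this setting.
Larger policies do not overoptimize noticeably more, although they improve overall gold performance and robustness; the gap between proxy and gold scores stays roughly the same across policy sizes.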
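
The fitted forms also give the point at which further optimization starts to lower the gold score. Setting dR/dd = 0 gives d* = α / (2β) for the BoN form and d* = exp(α/β − 1) for the RL form, assuming β > 0 in both cases. The sketch below works this out; the helper names (`bon_peak`, `rl_peak`) and the example coefficients are hypothetical and are not values reported in the paper.

```python
import math


def bon_peak(alpha, beta):
    """Peak of R_BoN(d) = d * (alpha - beta * d), assuming beta > 0.

    dR/dd = alpha - 2 * beta * d = 0  =>  d* = alpha / (2 * beta)
    R(d*) = alpha**2 / (4 * beta)
    """
    d_star = alpha / (2.0 * beta)
    return d_star, alpha ** 2 / (4.0 * beta)


def rl_peak(alpha, beta):
    """Peak of R_RL(d) = d * (alpha - beta * log d), assuming beta > 0.

    dR/dd = alpha - beta * log(d) - beta = 0  =>  d* = exp(alpha / beta - 1)
    R(d*) = d* * (alpha - beta * (alpha / beta - 1)) = beta * d*
    """
    d_star = math.exp(alpha / beta - 1.0)
    return d_star, beta * d_star


# Illustrative coefficients only (not taken from the paper).
bon_d, bon_r = bon_peak(1.0, 0.1)
rl_d, rl_r = rl_peak(1.0, 0.4)
print(f"BoN: d* = {bon_d:.3f}, KL* = {bon_d ** 2:.3f} nats, peak gold score = {bon_r:.3f}")
print(f"RL:  d* = {rl_d:.3f}, KL* = {rl_d ** 2:.3f} nats, peak gold score = {rl_r:.3f}")
```

Because d is the square root of the KL divergence, the KL value at the peak is d* squared. Under these forms, a smaller β pushes the peak to a larger d, which is one way to read the finding that the coefficients change smoothly with RM size and data.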

This review was generated by AI and reviewed by human editors.