QUICK REVIEW

[Paper Review] Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

Kenton Tang, Yuzhu Chen|arXiv (Cornell University)|Feb 25, 2026

Reinforcement Learning in Robotics | Cited by 0

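One-sentence summary

TL;DR: The paper proposes a generalisation theory for RLHF under reward shift and clipped KL regularisation, decomposes the generalisation error into sampling, reward shift, and KL clipping components, and offers practical guidance on calibration and budgeting.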

ABSTRACT

Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, as an Ornstein-Uhlenbeck process. The theory yields practical implications in (1) optimal KL clipping threshold, and (2) budget allocation in prompts, rollouts, and preference data.

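Motivation and objectives

Motivate and formalise the lack of theoretical understanding of RLHF generalisation under reward shift and KL clipping.
Present high-probability generalisation bounds for the RLHF pipeline that decompose the error into three interpretable sources.
Draw practical implications, including an optimal KL clipping threshold and budget allocation across prompts, rollouts, and preference data.
Give special-case analyses and data-dependent PAC-Bayes bounds that connect the theory to common training paradigms.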

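Proposed method

Treat RLHF as a two-stage process: a reward model is trained on preference data, and the policy is then optimised to maximise that reward.
Introduce clipped KL regularisation with per-sample log-ratio clipping and analyse its effect on the objective and on stability.
Use change-of-measure and PAC-Bayes tools to derive a three-term decomposition of the generalisation error (sampling error, reward shift error, KL clipping error).
Establish high-probability bounds on the gap between the empirical objective and the population objective under non-trivial data collection settings.
Give a generalisation bound for a fixed policy and a data-dependent PAC-Bayes bound that holds uniformly over posteriors.
Discuss special cases, including a uniform prior over a finite hypothesis set and a prior based on SGD modelled as an Ornstein-Uhlenbeck process.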
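The clipped KL regulariser described in the list above can be illustrated with a minimal sketch. It assumes symmetric clipping of each sampled log-probability ratio to `[-tau, tau]` and a penalty weight `beta`; the function names, the symmetric form, and both parameters are hypothetical choices for illustration and are not taken from the paper.

```python
def clipped_kl_estimate(logp_policy, logp_ref, tau):
    """Monte Carlo KL estimate with per-sample log-ratio clipping.

    logp_policy[i] and logp_ref[i] are the log-probabilities of the same
    sampled response under the current and the reference policy.
    tau is an assumed symmetric clipping threshold (not the paper's notation).
    """
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("inputs must be non-empty and of equal length")
    if tau < 0:
        raise ValueError("tau must be non-negative")
    # Sampled log-probability ratios log pi(y|x) - log pi_ref(y|x).
    ratios = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    # Clip each ratio before averaging, which bounds every summand by tau.
    clipped = [max(-tau, min(tau, r)) for r in ratios]
    return sum(clipped) / len(clipped)


def regularised_objective(rewards, logp_policy, logp_ref, beta, tau):
    """Empirical objective: mean learned reward minus beta times clipped KL."""
    mean_reward = sum(rewards) / len(rewards)
    return mean_reward - beta * clipped_kl_estimate(logp_policy, logp_ref, tau)
```

Clipping bounds each summand, which is what keeps the estimate stable, but any ratio beyond `tau` is truncated, so the estimate no longer matches the unclipped KL. That truncation is the source of the KL clipping error in the decomposition.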

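Experimental results

Research questions

RQ1: How can RLHF generalisation be bounded when the reward model is trained on data from a distribution different from that of the deployed policy?
RQ2: What are the distinct sources of RLHF generalisation error when the KL regulariser is estimated and clipped?
RQ3: How should the clipping threshold and the evaluation budget (prompts, rollouts, preference data) be chosen to balance bias against variance?
RQ4: Can data-dependent (PAC-Bayes) bounds be derived that accommodate post-hoc model selection and common optimisation paradigms such as SGD?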

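Main findings

The generalisation error decomposes into three terms: sampling error, reward shift error, and KL clipping error.
The sampling error from prompts and rollouts is bounded explicitly in terms of the evaluation budget and the clipping threshold.
The reward shift error is amplified by a chi-square coverage coefficient that captures the shift between the reward model's training distribution and the deployment distribution.
KL clipping introduces a bias term that depends on the tails of the log-ratio and does not vanish as more data is added.
A budget-aware optimal KL clipping threshold is proposed, balancing bias against variance through a quantile-based rule.
A data-dependent PAC-Bayes bound is given that relates the risk to the KL divergence between posterior and prior and to the three error sources above.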
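To make the bias side of the clipping trade-off concrete, here is a small sketch that picks a threshold as an empirical quantile of the log-ratio magnitudes and measures how much of the KL estimate clipping discards. The nearest-rank quantile, the quantile level `q`, and the function names are assumptions for illustration only; the paper's actual rule is not reproduced here.

```python
import math


def quantile_threshold(log_ratios, q):
    """Pick tau as the nearest-rank q-quantile of |log-ratio| (assumed rule)."""
    if not log_ratios:
        raise ValueError("log_ratios must be non-empty")
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    magnitudes = sorted(abs(r) for r in log_ratios)
    # Smallest magnitude such that at least a fraction q of samples is <= it.
    k = max(1, math.ceil(q * len(magnitudes)))
    return magnitudes[k - 1]


def clipping_bias(log_ratios, tau):
    """Mean of r - clip(r, -tau, tau): the mass removed by clipping."""
    return sum(r - max(-tau, min(tau, r)) for r in log_ratios) / len(log_ratios)


# Example: one heavy-tailed sample dominates the discarded mass.
samples = [0.1, -0.2, 0.3, 4.0]
tau = quantile_threshold(samples, q=0.75)
bias = clipping_bias(samples, tau)
```

A lower `q` gives a smaller `tau` and a tighter bound on each summand, hence lower variance, but it leaves more tail mass in the bias term. Since that term depends only on the log-ratio tails, collecting more samples does not shrink it.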

This review was generated by AI and checked by a human editor.