QUICK REVIEW

[Paper Review] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Ming Shi, Yingbin Liang|arXiv (Cornell University)|Mar 20, 2026
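Advanced Bandit Algorithms Research | Cited by 0
One-Sentence Summary

The paper develops reinforcement learning from multi-source imperfect preferences (RL-MSIP) and proves a regret bound that interpolates between statistical gains that scale with the number of sources M and robustness to the cumulative imperfection budget ω. It also gives a matching lower bound and a counterexample against naive aggregation.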

ABSTRACT

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically multi-source (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from multi-source imperfect preferences through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
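
The upper and lower bounds quoted in the abstract agree up to a constant factor, plus whatever logarithmic terms the tilde notation hides. A short check, using only the elementary fact that a sum of two nonnegative numbers lies between their maximum and twice their maximum (the symbols a and b are introduced here purely for illustration):

```latex
\max\{a, b\} \;\le\; a + b \;\le\; 2\max\{a, b\}, \qquad a, b \ge 0.

\text{Taking } a = \sqrt{K/M} \text{ and } b = \omega:
\qquad
\max\{\sqrt{K/M},\, \omega\} \;\le\; \sqrt{K/M} + \omega \;\le\; 2\max\{\sqrt{K/M},\, \omega\}.
```

In other words, the additive form of the upper bound and the max form of the lower bound describe the same rate.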

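Motivation and Objectives

  • Motivate RL driven by multi-source imperfect trajectory preferences in the RLHF setting.
  • Quantify how the number of sources M and the cumulative imperfection budget ω affect regret.
  • Develop RL-MSIP, an algorithm that adapts to the imperfection level and achieves a favorable regret bound.
  • Provide a lower bound and a counterexample against naive aggregation to clarify the impact of imperfection.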

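Proposed Method

  • Formalize multi-source imperfect preference feedback over K episodes through a cumulative budget ω.
  • Estimate the comparison function with imperfection-adaptive weighted comparison learning.
  • Use value-targeted transition estimation to control the feedback-induced distribution shift.
  • Use bounded UCB bonuses to achieve policy-level optimism and balance exploration under preference-only feedback.
  • Apply sub-importance sampling to keep the weighted objectives analyzable and stable.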
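
To make the cumulative budget concrete, here is a minimal Python sketch of a single feedback source whose preference probabilities deviate from an oracle by at most ω in total over K episodes. It is a toy construction, not the paper's model or algorithm: the function names `make_imperfect_source` and `total_deviation`, the use of absolute deviation, the uniform stand-in oracle, and the random deviation pattern are all assumptions made for illustration.

```python
import numpy as np

def make_imperfect_source(oracle_probs, omega, rng):
    """Hypothetical helper: perturb oracle preference probabilities so that
    the summed absolute deviation over all episodes is at most omega."""
    oracle_probs = np.asarray(oracle_probs, dtype=float)
    # Draw a random deviation pattern, then rescale its L1 norm to omega.
    raw = rng.uniform(-1.0, 1.0, size=oracle_probs.shape)
    raw *= omega / np.abs(raw).sum()
    # Clipping to [0, 1] can only shrink each deviation, so the budget still holds.
    return np.clip(oracle_probs + raw, 0.0, 1.0)

def total_deviation(source_probs, oracle_probs):
    """Summed absolute gap between source and oracle preference probabilities."""
    diff = np.asarray(source_probs) - np.asarray(oracle_probs)
    return float(np.abs(diff).sum())

rng = np.random.default_rng(0)
K, omega = 1000, 25.0
# Stand-in oracle: P(first trajectory preferred) in each episode (assumed values).
oracle = rng.uniform(0.2, 0.8, size=K)
source = make_imperfect_source(oracle, omega, rng)
# One Bernoulli preference label per episode, drawn from the imperfect source.
labels = rng.random(K) < source

print("total deviation:", total_deviation(source, oracle), "budget:", omega)
```

The budget constrains only the total deviation, so a source may be nearly exact in most episodes and badly off in a few, or slightly off everywhere. This sketch spreads the deviation randomly across episodes.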

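Results

Research Questions

  • RQ1: In RLHF with imperfect preferences, how do the number of sources M and the cumulative imperfection ω affect regret?
  • RQ2: Can we design an algorithm that achieves M-dependent gains when imperfection is small and robustness when it is large?
  • RQ3: What is the fundamental limit (lower bound) on regret under multi-source imperfect preferences?
  • RQ4: What pitfalls arise from naively aggregating imperfect preferences, and can their impact be quantified?
  • RQ5: How can transitions and preferences be estimated so that the regret analysis remains tractable under imperfection?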

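Key Findings

  • RL-MSIP achieves regret Õ(√(K/M) + ω).
  • The lower bound shows that regret is at least Ω̃(max{√(K/M), ω}).
  • A counterexample shows that ignoring imperfection can incur regret as large as Ω̃(min{ω√K, K}).
  • The method combines imperfection-adaptive weighting, value-targeted regression, policy-level optimism, and sub-importance sampling.
  • The results quantify when multi-source feedback can improve RLHF and how imperfection limits that improvement.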
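
The three rates above can be compared numerically to see where each regime takes over. The sketch below evaluates only the leading terms, with all constants and logarithmic factors dropped, so the numbers indicate scaling rather than actual regret. The function names and the chosen values of K, M, and ω are illustrative assumptions.

```python
import math

def upper_rate(K, M, omega):
    """Leading term of the RL-MSIP upper bound: sqrt(K/M) + omega."""
    return math.sqrt(K / M) + omega

def lower_rate(K, M, omega):
    """Leading term of the lower bound: max(sqrt(K/M), omega)."""
    return max(math.sqrt(K / M), omega)

def naive_rate(K, omega):
    """Leading term of the counterexample for oracle-consistent treatment:
    min(omega * sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)

K, M = 10_000, 4
crossover = math.sqrt(K / M)  # omega below this: the statistical term dominates

rows = []
for omega in (1.0, 50.0, 500.0):
    regime = "statistical" if omega <= crossover else "imperfection"
    rows.append((omega, regime, upper_rate(K, M, omega),
                 lower_rate(K, M, omega), naive_rate(K, omega)))

for omega, regime, up, low, naive in rows:
    print(f"omega={omega:7.1f}  regime={regime:12s}  "
          f"upper={up:9.1f}  lower={low:9.1f}  naive={naive:9.1f}")
```

With these values, the naive rate grows by a factor of √K per unit of ω until it saturates at K, while the RL-MSIP rate grows only additively in ω.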

This review was generated by AI and checked by a human editor.