QUICK REVIEW

[Paper Digest] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Ming Shi, Yingbin Liang|arXiv (Cornell University)|Mar 20, 2026

Advanced Bandit Algorithms Research | Cited by 0

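One-Sentence Summary

The paper studies reinforcement learning from multi-source imperfect preferences (RL-MSIP) and proves a regret bound that combines a statistical gain scaling with the number of sources M with robustness to the cumulative imperfection budget ω; it also gives a matching lower bound and a counterexample against naive aggregation.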

ABSTRACT

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.

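Research Motivation and Goals

Motivate learning RL policies from multi-source, imperfect trajectory preferences in the RLHF setting.
Quantify how the number of sources M and the cumulative imperfection budget ω affect regret.
Develop RL-MSIP, an algorithm that adapts to the imperfection level and achieves a favorable regret bound.
Provide a lower bound and a counterexample for naive aggregation to clarify the impact of imperfection.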

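Proposed Method

Formalize multi-source imperfect preference feedback over K episodes through a cumulative imperfection budget ω for each source.
Estimate the comparison function with imperfection-adaptive weighted comparison learning.
Use value-targeted transition estimation to control the hidden distribution shift induced by the feedback.
Achieve policy-level optimism with a bounded UCB term to balance exploration when only preference feedback is available.
Apply sub-importance sampling to keep the weighted objectives analyzable and stable.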
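
The cumulative imperfection budget in the first item can be made concrete with a small simulation. This is a minimal sketch, not the paper's construction: the Bradley-Terry style link in oracle_pref_prob, the additive per-episode deviation, and the names simulate_source and within_budget are all assumptions introduced here for illustration. It only shows what it means for one source's total deviation from the oracle preference probability to stay within ω over K episodes.

```python
import math
import random


def oracle_pref_prob(return_a: float, return_b: float) -> float:
    """Oracle probability that trajectory A is preferred over B.

    Assumption: a Bradley-Terry (logistic) link on the return gap.
    """
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))


def simulate_source(return_pairs, deviations, seed=0):
    """Sample one source's labels and track the imperfection it spends.

    deviations[k] is a hypothetical additive shift of the oracle
    probability in episode k; the result is clipped to [0, 1].
    """
    rng = random.Random(seed)
    labels, spent = [], 0.0
    for (ra, rb), dev in zip(return_pairs, deviations):
        p_star = oracle_pref_prob(ra, rb)
        p = min(1.0, max(0.0, p_star + dev))
        spent += abs(p - p_star)  # deviation actually incurred after clipping
        labels.append(1 if rng.random() < p else 0)
    return labels, spent


def within_budget(spent: float, omega: float, tol: float = 1e-9) -> bool:
    """Check the cumulative constraint: total deviation is at most omega."""
    return spent <= omega + tol


# Usage: K = 100 episodes, the source is corrupted in the first 10 only.
K = 100
pairs = [(1.0, 0.0)] * K
devs = [-0.5] * 10 + [0.0] * (K - 10)
labels, spent = simulate_source(pairs, devs)
print(f"spent = {spent:.3f}, within omega = 5: {within_budget(spent, 5.0)}")
```

The budget constrains the sum of deviations, not each episode separately, so a source may be badly wrong in a few episodes or slightly wrong in many. With M sources, each would carry its own deviation sequence and its own budget check.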

Results

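Research Questions

RQ1: In RLHF with imperfect preferences, how do the number of sources M and the cumulative imperfection ω affect regret?
RQ2: Can we design an algorithm that achieves M-dependent gains when imperfection is small and remains robust when imperfection is large?
RQ3: What is the fundamental limit (lower bound) on regret under multi-source imperfect preferences?
RQ4: What pitfalls arise from naively aggregating imperfect preferences, and can their impact be quantified?
RQ5: How should transitions and preferences be estimated so that the regret analysis stays tractable under imperfection?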

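Key Findings

RL-MSIP achieves regret Õ(√(K/M) + ω).
The lower bound shows that regret is at least Ω̃(max{√(K/M), ω}).
A counterexample shows that ignoring imperfection, that is, treating imperfect feedback as oracle-consistent, can incur regret as large as Ω̃(min{ω√K, K}).
The method combines imperfection-adaptive weighting, value-targeted regression, policy-level optimism, and sub-importance sampling.
The results quantify when multi-source feedback provably improves RLHF and how cumulative imperfection limits that improvement.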
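
To see how the three rates above relate, the following sketch evaluates them with all constants and logarithmic factors dropped, so the numbers indicate scaling only and are not regret values from the paper. The function names are hypothetical. The statistical term √(K/M) dominates while ω is below √(K/M), and the additive ω term takes over beyond that point.

```python
import math


def upper_bound_rate(K: int, M: int, omega: float) -> float:
    """Upper-bound rate sqrt(K/M) + omega (constants and logs dropped)."""
    return math.sqrt(K / M) + omega


def lower_bound_rate(K: int, M: int, omega: float) -> float:
    """Lower-bound rate max{sqrt(K/M), omega} (constants and logs dropped)."""
    return max(math.sqrt(K / M), omega)


def naive_rate(K: int, omega: float) -> float:
    """Counterexample rate min{omega * sqrt(K), K} for naive aggregation."""
    return min(omega * math.sqrt(K), K)


def crossover_omega(K: int, M: int) -> float:
    """Value of omega at which the additive term matches sqrt(K/M)."""
    return math.sqrt(K / M)


# Usage: K = 10,000 episodes, M = 4 sources, several imperfection budgets.
K, M = 10_000, 4
print(f"crossover omega = {crossover_omega(K, M):.1f}")
for omega in (1.0, 10.0, 50.0, 200.0):
    up = upper_bound_rate(K, M, omega)
    low = lower_bound_rate(K, M, omega)
    naive = naive_rate(K, omega)
    print(f"omega={omega:6.1f}  upper={up:7.1f}  lower={low:7.1f}  naive={naive:8.1f}")
```

Because a + b ≤ 2·max{a, b} for nonnegative a and b, the upper rate is never more than twice the lower rate, which is the sense in which the two bounds match. The naive rate multiplies ω by √K instead of adding it, so under this scaling it saturates at K once ω reaches √K.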

This digest was generated by AI and reviewed by a human editor.