QUICK REVIEW

[Paper Review] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Ming Shi, Yingbin Liang|arXiv (Cornell University)|Mar 20, 2026

Advanced Bandit Algorithms Research|Citations: 0

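One-Sentence Summary

This paper formulates reinforcement learning from multi-source imperfect preferences (RL-MSIP), proves a regret bound that interpolates between M-dependent statistical gains and robustness to a cumulative imperfection budget, and presents a matching lower bound together with a counterexample against naive aggregation.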

ABSTRACT

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically multi-source (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from multi-source imperfect preferences through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.

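Motivation and Objectives

Motivate multi-source imperfect trajectory preferences in the RLHF setting.
Quantify how the number of sources M and the cumulative imperfection budget ω affect regret.
Develop RL-MSIP, an algorithm that adapts to the imperfection level and achieves favorable regret.
Present a lower bound and a counterexample for naive aggregation to clarify the impact of imperfection.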
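The imperfection budget described in the abstract can be written out as a short formula. The notation below is an assumption made for illustration: $P_m^{k}$ stands for source $m$'s preference probability in episode $k$, $P^{*}$ for the ideal oracle's, and the absolute value stands in for whatever deviation measure the paper actually uses.

```latex
% Assumed notation (not taken from the paper): P_m^k, P^*, and the trajectory pair (tau_k, tau_k').
\[
  \sum_{k=1}^{K} \bigl| P_m^{k}(\tau_k \succ \tau_k') - P^{*}(\tau_k \succ \tau_k') \bigr| \;\le\; \omega
  \qquad \text{for each source } m \in \{1, \dots, M\}.
\]
% Rates stated in the abstract, up to logarithmic factors:
\[
  \text{upper bound: } \tilde{O}\bigl(\sqrt{K/M} + \omega\bigr),
  \qquad
  \text{lower bound: } \tilde{\Omega}\bigl(\max\{\sqrt{K/M},\, \omega\}\bigr).
\]
```

Since $a + b \le 2\max\{a, b\}$ for nonnegative $a$ and $b$, the two rates agree up to a constant factor and logarithmic terms.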

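Proposed Method

Formulate multi-source imperfect preference feedback using a cumulative budget ω over K episodes.
Propose imperfection-adaptive weighted comparison learning to estimate the comparison function.
Use value-targeted transition estimation to control the distribution shift induced by the feedback.
Implement policy-level optimism with a Bounded UCB to balance exploration under preference-only feedback.
Apply sub-importance sampling to keep the weighted objectives analyzable and stable.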
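The weighted comparison-learning step can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes a linear Bradley-Terry style preference model over trajectory feature differences, and the function `fit_weighted_preferences`, the synthetic data, and the hand-set weights are all hypothetical. The review does not reproduce the paper's rule for choosing the weights adaptively.

```python
import numpy as np

def fit_weighted_preferences(diffs, labels, weights, lr=0.5, steps=2000):
    """Weighted logistic (Bradley-Terry style) regression by gradient descent.

    diffs   : (n, d) feature differences phi(tau_a) - phi(tau_b) (assumed linear model)
    labels  : (n,) 1.0 if tau_a was preferred, else 0.0
    weights : (n,) per-comparison weights; a hypothetical stand-in for
              imperfection-adaptive weights
    """
    theta = np.zeros(diffs.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-diffs @ theta))
        # Gradient of the weighted negative log-likelihood, normalized by total weight.
        grad = diffs.T @ (weights * (probs - labels)) / weights.sum()
        theta -= lr * grad
    return theta

# Synthetic data: source 0 follows the oracle, source 1 inverts its preference
# probabilities (an extreme, persistently imperfect source).
rng = np.random.default_rng(0)
n, d = 4000, 3
theta_true = np.array([1.0, -0.5, 0.25])
diffs = rng.normal(size=(n, d))
p_oracle = 1.0 / (1.0 + np.exp(-diffs @ theta_true))
source = rng.integers(0, 2, size=n)
p_label = np.where(source == 0, p_oracle, 1.0 - p_oracle)
labels = (rng.random(n) < p_label).astype(float)

# Naive aggregation: every comparison counts equally.
uniform = fit_weighted_preferences(diffs, labels, np.ones(n))
# Down-weighting the imperfect source (weights chosen by hand for illustration).
adaptive = fit_weighted_preferences(diffs, labels, np.where(source == 0, 1.0, 0.05))

err_uniform = np.linalg.norm(uniform - theta_true)
err_adaptive = np.linalg.norm(adaptive - theta_true)
print(f"uniform error:  {err_uniform:.3f}")
print(f"adaptive error: {err_adaptive:.3f}")
```

In this toy setup the two sources cancel each other under uniform weighting, so the naive estimate collapses toward zero, while down-weighting the inverted source recovers most of the oracle parameter. In the paper's setting the imperfect source is not known in advance, which is why the weights have to adapt to the data.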

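Results

Research Questions

RQ1: In RLHF with imperfect preferences, how do the number of sources M and the cumulative imperfection ω affect regret?
RQ2: Can an algorithm be designed that achieves M-dependent gains when imperfection is small and robustness when it is large?
RQ3: What are the fundamental limits (lower bounds) on regret under multi-source imperfect preferences?
RQ4: What pitfalls arise from naively aggregating imperfect preferences, and can their impact be quantified?
RQ5: How should transitions and preferences be estimated so that the regret analysis still holds under imperfection?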

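Key Findings

RL-MSIP achieves regret of Õ(√(K/M) + ω).
The lower bound shows that regret must be at least Ω̃(max{√(K/M), ω}).
A counterexample shows that ignoring imperfection can incur regret as large as Ω̃(min{ω√K, K}).
The approach combines imperfection-adaptive weighting, value-targeted regression, policy-level optimism, and sub-importance sampling.
The results quantify when multi-source feedback improves RLHF and how much imperfection limits it.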
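To make the three rates concrete, the following sketch evaluates them numerically with constants and logarithmic factors dropped. The function names and the chosen values of K, M, and ω are illustrative assumptions, not taken from the paper.

```python
import math

def rl_msip_upper(K, M, omega):
    """Rate of the RL-MSIP upper bound: sqrt(K/M) + omega."""
    return math.sqrt(K / M) + omega

def minimax_lower(K, M, omega):
    """Rate of the lower bound: max{sqrt(K/M), omega}."""
    return max(math.sqrt(K / M), omega)

def naive_lower(K, omega):
    """Rate of the counterexample for naive aggregation: min{omega * sqrt(K), K}."""
    return min(omega * math.sqrt(K), K)

K, M = 10_000, 4  # illustrative values
for omega in (0, 10, 100, 1000):
    print(
        f"omega={omega:>5}  "
        f"upper={rl_msip_upper(K, M, omega):>8.1f}  "
        f"lower={minimax_lower(K, M, omega):>8.1f}  "
        f"naive={naive_lower(K, omega):>8.1f}"
    )
```

With K = 10,000 and M = 4, the statistical term √(K/M) is 50. At ω = 10 the RL-MSIP rate is 60, while the counterexample rate for naive aggregation is already 1,000, and at ω = 100 the naive rate reaches the trivial linear value K. At ω = 0 the counterexample rate is 0, so it says nothing in the perfectly consistent regime.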

This review was generated by AI and checked by a human editor.