QUICK REVIEW

[Paper Review] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Ming Shi, Yingbin Liang | arXiv (Cornell University) | 2026-03-20

Advanced Bandit Algorithms Research | Citations: 0

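One-Line Summary

This paper develops RL from multi-source imperfect trajectory preferences (RL-MSIP), guarantees regret that interpolates between M-dependent statistical gains and robustness to a cumulative imperfection budget, and provides a matching lower bound together with a counterexample against naive aggregation.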

ABSTRACT

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.

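Research Motivation and Goals

Motivate RL from multi-source imperfect trajectory preferences in RLHF settings.
Quantify how the number of sources M and the cumulative imperfection budget ω affect regret.
Develop RL-MSIP, an algorithm that adapts to the level of imperfection while keeping regret low in both the small-ω and large-ω regimes.
Provide a lower bound and a counterexample against naive aggregation to pin down the effect of imperfection.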

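Proposed Method

Formalizes multi-source imperfect preference feedback with a cumulative budget ω over K episodes.
Proposes imperfection-adaptive weighted comparison learning to estimate the comparison function.
Uses value-targeted transition estimation to control the distribution shift induced by the feedback.
Implements policy-level optimism with a bounded UCB to balance exploration under preference-only feedback.
Applies sub-importance sampling to keep the weighted objectives analyzable and stable.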
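The cumulative budget in the first item can be written out as a single inequality. The following is a sketch based only on the abstract's verbal description; the symbols $P_m^{k}$ (source $m$'s preference probability in episode $k$), $P^{\star}$ (the ideal oracle), and the compared trajectory pair $(\tau_k, \tau_k')$ are assumed notation, and the paper's exact definition (for example, a supremum over trajectory pairs) may differ.

```latex
\sum_{k=1}^{K} \left| P_m^{k}(\tau_k \succ \tau_k') - P^{\star}(\tau_k \succ \tau_k') \right| \le \omega
\qquad \text{for every source } m \in \{1, \dots, M\}.
```

Under this reading, ω = 0 recovers oracle-consistent feedback, and a larger ω allows each source to deviate more in total across the K episodes.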

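Experimental Results

Research Questions

RQ1: How do the number of sources M and the cumulative imperfection ω affect regret in RLHF with imperfect preferences?
RQ2: Can an algorithm be designed that obtains M-dependent gains when imperfection is small and remains robust when imperfection is large?
RQ3: What is the fundamental limit (lower bound) on regret under multi-source imperfect preferences?
RQ4: What pitfalls arise when imperfect preferences are aggregated naively, and can their impact be quantified?
RQ5: How should transitions and preferences be estimated so that the regret analysis remains tractable under imperfection?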

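Key Results

RL-MSIP achieves regret of order Õ(√(K/M) + ω).
The lower bound shows that regret must be at least Ω̃(max{√(K/M), ω}).
A counterexample shows that ignoring imperfection (treating imperfect feedback as oracle-consistent) can incur regret as large as Ω̃(min{ω√K, K}).
The method combines imperfection-adaptive weighting, value-targeted regression, policy-level optimism, and sub-importance sampling.
The results quantify when multi-source feedback improves RLHF and how much imperfection limits it.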
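To make the three rates above concrete, the short Python sketch below evaluates them for given K, M, and ω. It is purely illustrative arithmetic: constants and logarithmic factors hidden by the tilde notation are dropped, and the function names are hypothetical, not taken from the paper.

```python
import math


def upper_bound_rate(K: int, M: int, omega: float) -> float:
    """Order of the stated regret upper bound: sqrt(K/M) + omega (logs and constants dropped)."""
    return math.sqrt(K / M) + omega


def lower_bound_rate(K: int, M: int, omega: float) -> float:
    """Order of the stated lower bound: max(sqrt(K/M), omega)."""
    return max(math.sqrt(K / M), omega)


def naive_rate(K: int, omega: float) -> float:
    """Order of the regret in the counterexample for naive aggregation: min(omega * sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)


if __name__ == "__main__":
    K, M = 10_000, 4
    for omega in (1.0, 10.0, 50.0, 200.0):
        up = upper_bound_rate(K, M, omega)
        low = lower_bound_rate(K, M, omega)
        naive = naive_rate(K, omega)
        # a + b <= 2 * max(a, b), so the upper and lower rates differ by at most a factor of 2.
        print(f"omega={omega:>6}: upper={up:.1f} lower={low:.1f} naive={naive:.1f}")
```

For example, with K = 10,000, M = 4, and ω = 10, the rates are 60 (upper), 50 (lower), and 1,000 (naive). Because √(K/M) + ω ≤ 2·max{√(K/M), ω}, the upper and lower rates agree up to a factor of 2 (and the dropped logarithmic terms), while the naive rate is far larger in this regime.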


This review was generated by AI and reviewed by a human editor.