QUICK REVIEW

[Paper Review] Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Zihao Li, Zhuoran Yang|arXiv (Cornell University)|May 29, 2023
Reinforcement Learning in Robotics | Cited by 14
One-Sentence Summary

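This paper proposes DCPPO, an offline RLHF method that learns human behavior, recovers the underlying reward from dynamic discrete choices, and runs pessimistic value iteration to obtain a near-optimal policy with theoretical guarantees under single-policy coverage.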

ABSTRACT

In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift. In this paper, we focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model a human decision-making process with forward-looking and bounded rationality. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method. The method involves a three-stage process: The first step is to estimate the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second step recovers the human reward function via minimizing Bellman mean squared error using the learned value functions; the third step is to plug in the learned reward and invoke pessimistic value iteration for finding a near-optimal policy. With only single-policy coverage (i.e., optimal policy) of the dataset, we prove that the suboptimality of DCPPO almost matches the classical pessimistic offline RL algorithm in terms of suboptimality's dependency on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with dynamic discrete choice model.
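
The DDC model named in the abstract is commonly written in the following form. This is a sketch in standard notation rather than a quotation from the paper: the step index $h$, discount factor $\gamma$, reward $r_h$, transition kernel $P_h$, and the softmax (logit) form of the choice probabilities are assumptions made here for illustration.

```latex
\begin{aligned}
\pi_{b,h}(a \mid s) &= \frac{\exp\big(Q_h(s,a)\big)}{\sum_{a'} \exp\big(Q_h(s,a')\big)}, \\
Q_h(s,a) &= r_h(s,a) + \gamma\, \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\big[V_{h+1}(s')\big], \\
V_h(s) &= \log \sum_{a'} \exp\big(Q_h(s,a')\big).
\end{aligned}
```

Read this way, the first stage fits $Q_h$ by maximizing the likelihood of the observed choices under $\pi_{b,h}$, and the second stage solves the Bellman relation for $r_h$ given the fitted $Q_h$ and $V_{h+1}$.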

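Motivation and Objectives

  • Model and learn from offline human feedback in order to identify the human's reward and the MDP's optimal policy.
  • Use the Dynamic Discrete Choice (DDC) model to capture bounded-rational, forward-looking human decisions.
  • Develop a three-stage algorithm that recovers human behavior, estimates the reward, and computes a near-optimal policy from limited data.
  • Provide finite-sample theoretical guarantees for offline RLHF with DDC under single-policy coverage.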

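Proposed Method

  • Stage 1: Estimate the human behavior policy and the state-action value function within a function class via maximum likelihood estimation (MLE).
  • Stage 2: Recover the human reward by minimizing the Bellman mean squared error using the learned value functions, with an uncertainty-aware penalty.
  • Stage 3: Plug in the learned reward and run pessimistic value iteration to obtain a near-optimal policy that is robust to distribution shift.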
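
The three stages can be sketched on a small tabular problem. This is a minimal illustration, not the authors' implementation: the function names (`stage1_mle`, `stage2_reward`, `stage3_pevi`), the one-hot features, the normalization `Q[h][s][0] = 0` used to pin down the value function, and the parameters `eps`, `gamma`, `lam`, and `beta` are all assumptions introduced here. The sketch applies the uncertainty penalty inside the value-iteration step, where it is subtracted from the estimated Q-values.

```python
import math
from collections import defaultdict

def stage1_mle(data, H, S, A, eps=1e-3):
    """Stage 1: tabular MLE of a softmax behavior policy.
    Q is recovered up to a per-state constant; we assume Q[h][s][0] = 0."""
    counts = [[[eps] * A for _ in range(S)] for _ in range(H)]  # eps avoids log(0)
    for h, s, a, _ in data:
        counts[h][s][a] += 1.0
    pi = [[[c / sum(row) for c in row] for row in step] for step in counts]
    q = [[[math.log(p / row[0]) for p in row] for row in step] for step in pi]
    return pi, q

def stage2_reward(data, q, H, S, A, gamma=1.0, lam=1.0):
    """Stage 2: ridge regression on the Bellman residual with one-hot features,
    which reduces to a regularized average of Q_h(s, a) - gamma * V_{h+1}(s')."""
    v = [[math.log(sum(math.exp(x) for x in q[h][s])) for s in range(S)]
         for h in range(H)]                      # soft value: logsumexp over actions
    v.append([0.0] * S)                          # terminal value V_H = 0
    num = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    n = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h, s, a, s2 in data:
        num[h][s][a] += q[h][s][a] - gamma * v[h + 1][s2]
        n[h][s][a] += 1.0
    r_hat = [[[num[h][s][a] / (n[h][s][a] + lam) for a in range(A)]
              for s in range(S)] for h in range(H)]
    return r_hat, n

def stage3_pevi(data, r_hat, n, H, S, A, beta=1.0, lam=1.0):
    """Stage 3: pessimistic value iteration with penalty beta / sqrt(n + lam)."""
    trans = defaultdict(float)
    for h, s, a, s2 in data:
        trans[(h, s, a, s2)] += 1.0
    v_next = [0.0] * S
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        v_now = []
        for s in range(S):
            q_pess = []
            for a in range(A):
                denom = n[h][s][a] + lam
                # regularized empirical estimate of E[V_{h+1}(s') | s, a]
                exp_v = sum(trans[(h, s, a, s2)] * v_next[s2] for s2 in range(S)) / denom
                q_pess.append(r_hat[h][s][a] + exp_v - beta / math.sqrt(denom))
            policy[h][s] = q_pess.index(max(q_pess))  # greedy w.r.t. pessimistic Q
            v_now.append(max(q_pess))
        v_next = v_now
    return policy, v_next

# Toy offline dataset of (step, state, action, next_state) tuples.
data = [(0, 0, 1, 1), (0, 0, 1, 1), (0, 0, 0, 2), (1, 1, 1, 0), (1, 2, 0, 0)]
H, S, A = 2, 3, 2
pi_hat, q_hat = stage1_mle(data, H, S, A)
r_hat, n = stage2_reward(data, q_hat, H, S, A)
policy, v_pess = stage3_pevi(data, r_hat, n, H, S, A, beta=1.0)
```

With one-hot features the penalty is large exactly for state-action pairs the dataset rarely visits, so the greedy policy is steered toward actions the data actually covers. Setting `beta=0.0` recovers plain (non-pessimistic) value iteration on the estimated model.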

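Experimental Results

Research Questions

  • RQ1: Under the dynamic discrete choice model, and without direct access to rewards, can we learn the optimal policy and the underlying reward from offline human choices?
  • RQ2: With limited data and general model classes, how tightly can we bound the estimation error of the human policy and reward?
  • RQ3: Under single-policy coverage, does combining pessimism with the reward estimation error yield a provable suboptimality guarantee?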

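Key Findings

  • Under a small covering number assumption, DCPPO recovers the human policy and value function at an error rate of O(1/n).
  • The reward can be estimated with an error bound governed jointly by an elliptical-potential term and an additional error term introduced by reward estimation.
  • In linear MDPs, pessimistic value iteration with the learned reward achieves an O(n^{-1/2}) suboptimality gap under single-policy coverage, comparable to standard pessimistic offline RL results.
  • In the RKHS setting, the framework extends to kernel-based methods, with uncertainty quantification and finite-sample guarantees preserved.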
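
To make the linear-MDP finding above concrete, the pessimistic update is usually written as below. This follows the standard form of pessimistic value iteration for linear MDPs and is only a sketch: the feature map $\phi$, regularizer $\lambda$, scaling $\beta$, and the estimated quantities $\widehat{r}_h$ and $\widehat{P}_h \widehat{V}_{h+1}$ are notation assumed here, and the paper's exact constants and definitions may differ.

```latex
\begin{aligned}
\Lambda_h &= \lambda I + \sum_{i=1}^{n} \phi(s_h^i, a_h^i)\,\phi(s_h^i, a_h^i)^\top, \\
\Gamma_h(s,a) &= \beta \sqrt{\phi(s,a)^\top \Lambda_h^{-1}\, \phi(s,a)}, \\
\widehat{Q}_h(s,a) &= \widehat{r}_h(s,a) + \big(\widehat{P}_h \widehat{V}_{h+1}\big)(s,a) - \Gamma_h(s,a), \\
\widehat{V}_h(s) &= \max_{a} \widehat{Q}_h(s,a).
\end{aligned}
```

Along feature directions that the dataset covers well, $\phi^\top \Lambda_h^{-1} \phi$ shrinks roughly like $1/n$, so the penalty $\Gamma_h$ shrinks like $n^{-1/2}$. Single-policy coverage only requires this along the directions visited by the optimal policy, which is consistent with the $O(n^{-1/2})$ rate stated above.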


This review was generated by AI and reviewed by human editors.