QUICK REVIEW

[Paper Digest] Reinforcement Learning from Imperfect Demonstrations

Yang Gao, Huazhe | arXiv (Cornell University) | Feb 14, 2018
Reinforcement Learning in Robotics | References: 24 | Cited by: 98
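One-sentence summary

NAC normalizes the Q-function so that learning from demonstrations and learning from environment reward share one objective, which keeps learning robust under imperfect demonstrations and lets the policy surpass the demonstrator.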

ABSTRACT

Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses which are difficult to jointly optimize and such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment, surpassing the demonstrator's performance. Crucially, both learning from demonstration and interactive refinement use the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.

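Motivation and goals

  • Make real-world learning more robust by using both demonstrations and interaction with the environment.
  • Propose a single unified objective instead of separate supervised and reinforcement learning losses.
  • Learn from imperfect or noisy demonstrations without requiring them to be optimal.
  • Show robust performance both when learning from demonstrations and when refining the policy in the environment.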

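Proposed method

  • Introduces Normalized Actor-Critic (NAC), which normalizes the Q-function and thereby reduces the Q-values of actions not seen in the demonstrations.
  • Derives the NAC update from a soft policy gradient framework, using one unified loss.
  • Stabilizes training with a target network and a replay buffer, and needs no external imitation loss.
  • Feeds demonstrations into off-policy learning alongside environment transitions, under the same objective.
  • Shows that NAC can learn from imperfect demonstrations and then refine the policy through interaction.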
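
The normalization idea in the list above can be illustrated with a small tabular sketch. It is not the paper's implementation: it assumes the standard soft (maximum-entropy) definitions V(s) = α log Σ_a exp(Q(s,a)/α) and π(a|s) = exp((Q(s,a) − V(s))/α), a single state with three actions, and hypothetical values for the temperature, learning rate, and target. The function names are made up for illustration. The sketch compares a plain TD-style step with a step that also subtracts the gradient of V, which in the tabular case equals π.

```python
import numpy as np

ALPHA = 1.0  # entropy temperature (assumed value, not from the paper)


def soft_value(q, alpha=ALPHA):
    """V(s) = alpha * log sum_a exp(Q(s,a) / alpha), computed stably."""
    z = q / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))


def soft_policy(q, alpha=ALPHA):
    """pi(a|s) = exp((Q(s,a) - V(s)) / alpha); sums to 1 by construction."""
    return np.exp((q - soft_value(q, alpha)) / alpha)


def plain_step(q, action, target, lr=0.5):
    """TD-style step: only the demonstrated action's Q-value moves."""
    grad = np.zeros_like(q)
    grad[action] = q[action] - target
    return q - lr * grad


def normalized_step(q, action, target, lr=0.5, alpha=ALPHA):
    """Step with the extra -dV/dQ term; for a table, dV/dQ equals pi."""
    onehot = np.zeros_like(q)
    onehot[action] = 1.0
    grad = (onehot - soft_policy(q, alpha)) * (q[action] - target)
    return q - lr * grad


# One state, three actions; the demonstration only ever shows action 0.
q0 = np.zeros(3)
demo_action, demo_target = 0, 1.0  # hypothetical bootstrapped target

q_plain = plain_step(q0, demo_action, demo_target)
q_norm = normalized_step(q0, demo_action, demo_target)

print("plain step:     ", q_plain)
print("normalized step:", q_norm)
```

In this toy setting the plain step leaves the two undemonstrated actions at 0, while the normalized step pushes them below 0 as it raises the demonstrated action. That is the behavior the digest describes as reducing the Q-values of unseen actions.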

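Experimental results

Research questions

  • RQ1: Can NAC learn effectively from demonstrations and environment reward at the same time?
  • RQ2: Is NAC robust to suboptimal or noisy demonstrations?
  • RQ3: Does NAC outperform imitation-plus-RL baselines on driving tasks?
  • RQ4: How does NAC compare with existing methods when demonstrations are limited or noisy?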

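Key findings

  • With a moderate number of demonstrations, NAC outperforms existing methods on the driving tasks, and it tolerates noisy demonstrations because it uses reward rather than pure imitation.
  • The unified objective lets NAC learn from both demonstrations and the environment without an auxiliary supervised imitation loss.
  • NAC stays robust to imperfect demonstrations and, through environment interaction, improves beyond the demonstrator's performance.
  • In both toy and realistic driving environments, NAC performs well even with limited demonstration data and under a variety of reward choices.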

This digest was generated by AI and reviewed by human editors.