QUICK REVIEW

[Paper Review] Ranking-Based Reward Extrapolation without Rankings

Daniel S. Brown, Wonjoon Goo | arXiv (Cornell University) | Jul 9, 2019

Reinforcement Learning in Robotics | References: 33 | Cited by: 4

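One-Sentence Summary

This paper proposes D-REX, a ranking-based imitation learning method that generates synthetic rankings by injecting noise into a behavioral cloning policy, which lets it achieve better-than-demonstrator performance without human-provided rankings or rewards. By automatically extrapolating beyond the demonstrator's performance, it outperforms standard imitation learning approaches on MuJoCo and Atari benchmarks.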

ABSTRACT

The performance of imitation learning is typically upper-bounded by the performance of the demonstrator. Recent empirical results show that imitation learning via ranked demonstrations allows for better-than-demonstrator performance; however, ranked demonstrations may be difficult to obtain, and little is known theoretically about when such methods can be expected to outperform the demonstrator. To address these issues, we first contribute a sufficient condition for when better-than-demonstrator performance is possible and discuss why ranked demonstrations can contribute to better-than-demonstrator performance. Building on this theory, we then introduce Disturbance-based Reward Extrapolation (D-REX), a ranking-based imitation learning method that injects noise into a policy learned through behavioral cloning to automatically generate ranked demonstrations. By generating rankings automatically, ranking-based imitation learning can be applied in traditional imitation learning settings where only unlabeled demonstrations are available. We empirically validate our approach on standard MuJoCo and Atari benchmarks and show that D-REX can utilize automatic rankings to significantly surpass the performance of the demonstrator and outperform standard imitation learning approaches. D-REX is the first imitation learning approach to achieve significant extrapolation beyond the demonstrator's performance without additional side-information or supervision, such as rewards or human preferences.

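Motivation and Objectives

Identify the theoretical conditions under which imitation learning can achieve better-than-demonstrator performance.
Address the practical difficulty of obtaining ranked demonstrations in real-world imitation learning settings.
Develop a method that outperforms the demonstrator using only unlabeled demonstrations, with no additional supervision.
Validate whether automatically generated rankings can effectively support reward extrapolation in standard imitation learning environments.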

Proposed Method

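D-REX trains a behavioral cloning policy on unlabeled demonstrations to serve as the base policy.
It injects noise into the behavioral cloning policy in a controlled manner, producing a diverse set of trajectories that can be compared.
The noise-induced trajectories supply relative preference signals, which yield synthetic rankings with no human input.
A ranking-based imitation learning objective then uses the synthetic rankings to train an improved policy.
The final policy is trained to maximize performance under the extrapolated reward signal derived from the synthetic rankings.
The method is fully self-supervised and requires no human-provided rewards or preference annotations.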
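
The noise-injection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes epsilon-greedy noise (the summary only says noise is injected "in a controlled manner"), assumes that rollouts generated with less noise are preferred over rollouts generated with more noise, and uses a hypothetical 1-D chain environment. The names `noisy_action`, `rollout`, and `generate_ranked_pairs` are made up for this example.

```python
import random


def noisy_action(bc_policy, state, epsilon, actions, rng):
    """With probability epsilon take a uniformly random action (assumed noise
    model); otherwise follow the behavioral cloning policy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return bc_policy(state)


def rollout(bc_policy, epsilon, horizon, rng):
    """Roll out the noisy policy in a toy 1-D chain (hypothetical environment):
    the state is an integer position and actions move it by -1 or +1."""
    state, trajectory = 0, []
    for _ in range(horizon):
        action = noisy_action(bc_policy, state, epsilon, [-1, 1], rng)
        trajectory.append((state, action))
        state += action
    return trajectory


def generate_ranked_pairs(bc_policy, noise_levels, rollouts_per_level, horizon, seed=0):
    """Return (worse, better) trajectory pairs, ranking lower-noise rollouts
    above higher-noise ones. No human labels are involved."""
    rng = random.Random(seed)
    buckets = {
        eps: [rollout(bc_policy, eps, horizon, rng) for _ in range(rollouts_per_level)]
        for eps in noise_levels
    }
    pairs = []
    for eps_low in noise_levels:
        for eps_high in noise_levels:
            if eps_low < eps_high:
                for better in buckets[eps_low]:
                    for worse in buckets[eps_high]:
                        pairs.append((worse, better))
    return pairs


# Usage: a stand-in "cloned" policy that always moves right.
pairs = generate_ranked_pairs(lambda s: 1, [0.0, 0.5, 1.0], rollouts_per_level=2, horizon=5)
```

The ranking here comes only from the noise level used to generate each rollout, which is what allows the pipeline to run on unlabeled demonstrations.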

Experimental Results

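Research Questions

RQ1: Under what theoretical conditions can imitation learning achieve better-than-demonstrator performance?
RQ2: Can synthetic rankings generated solely by perturbing a single policy yield better-than-demonstrator performance?
RQ3: How effective is ranking-based imitation learning when the rankings are generated automatically rather than provided by humans?
RQ4: Can D-REX outperform standard imitation learning baselines without access to a reward function or preference signals?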

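Key Findings

Using only unlabeled demonstrations, D-REX significantly improves on the demonstrator's performance on standard MuJoCo and Atari benchmarks.
The results indicate that better-than-demonstrator performance is possible when policy perturbations produce diverse and informative trajectory comparisons.
D-REX outperforms standard behavioral cloning and other imitation learning baselines without access to a reward function or human preferences.
Synthetic rankings generated through noise injection are sufficient to support effective reward extrapolation and policy improvement.
According to the authors, D-REX is the first imitation learning approach to achieve significant extrapolation beyond the demonstrator's performance without additional supervision or reward signals.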
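
To make "reward extrapolation from synthetic rankings" concrete, the sketch below fits a reward function to (worse, better) trajectory pairs with a pairwise logistic loss. It is an illustration under stated assumptions, not the paper's implementation: the reward is assumed to be linear in hand-made trajectory features (the summary does not specify the reward model), the Bradley-Terry style loss is a common choice for ranking-based reward learning, and `pair_loss` and `fit_linear_reward` are hypothetical names.

```python
import math


def pair_loss(w, phi_worse, phi_better):
    """Logistic ranking loss log(1 + exp(-diff)), where diff is the predicted
    return of the better trajectory minus that of the worse one."""
    diff = sum(wi * (b - a) for wi, a, b in zip(w, phi_worse, phi_better))
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def fit_linear_reward(pairs, n_features, steps=200, lr=0.5):
    """Fit reward weights by gradient descent on the mean pairwise loss.
    Each pair holds (features of worse trajectory, features of better one)."""
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for phi_worse, phi_better in pairs:
            diff = sum(wi * (b - a) for wi, a, b in zip(w, phi_worse, phi_better))
            # Probability the model assigns to the wrong ordering: sigmoid(-diff).
            if diff >= 0:
                p_wrong = math.exp(-diff) / (1.0 + math.exp(-diff))
            else:
                p_wrong = 1.0 / (1.0 + math.exp(diff))
            for i, (a, b) in enumerate(zip(phi_worse, phi_better)):
                grad[i] -= p_wrong * (b - a) / len(pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Usage: toy features where only the first feature separates better from worse.
toy_pairs = [([1.0, 0.0], [2.0, 0.0]), ([1.0, 1.0], [3.0, 1.0])]
weights = fit_linear_reward(toy_pairs, n_features=2)
```

Because the fitted reward is a function of features rather than a lookup over observed trajectories, it can assign a higher score to behavior that exceeds anything in the ranked set. A policy optimized against that reward is where better-than-demonstrator performance would come from.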

This review was generated by AI and reviewed by a human editor.