QUICK REVIEW

[Paper Review] Learning from Suboptimal Demonstration via Self-Supervised Reward Regression

Letian Chen, Rohan Paleja|arXiv (Cornell University)|Oct 17, 2020

Reinforcement Learning in Robotics | References: 38 | Cited by: 31

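One-Sentence Summary

This paper proposes SSRR, an IRL framework that learns an idealized reward from suboptimal demonstrations. It models the noise-performance relationship with a sigmoid, which it treats as a low-pass filter, and uses Noisy-AIRL to train robust rewards and policies that outperform prior work.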

ABSTRACT

Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from sub-optimal demonstration leverage pairwise rankings and following the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations in developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate we learn an idealized reward function with ~0.95 correlation with ground-truth reward versus ~0.75 for prior work. We can then train policies achieving ~200% improvement over the suboptimal demonstration and ~90% improvement over prior work. We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than user demonstration.
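
For readers unfamiliar with the Luce-Shepard rule named in the abstract, the form commonly used in ranking-based reward learning is sketched below. The notation is assumed for illustration and is not taken from the abstract: \tau_i and \tau_j are two trajectories, and R_\theta is the learned reward.

```latex
P(\tau_i \succ \tau_j)
  = \frac{\exp\!\Big(\sum_{(s,a)\in\tau_i} R_\theta(s,a)\Big)}
         {\exp\!\Big(\sum_{(s,a)\in\tau_i} R_\theta(s,a)\Big)
          + \exp\!\Big(\sum_{(s,a)\in\tau_j} R_\theta(s,a)\Big)}
```

Under this rule, the probability that one trajectory is preferred depends only on the difference between the two predicted returns.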

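Motivation and Goals

Democratize robot learning by making it possible to learn from suboptimal human demonstrations.
Identify why existing methods for suboptimal demonstrations fail, and provide a robust alternative.
Infer an idealized reward function that captures the underlying task objective.
Train policies that substantially outperform the suboptimal demonstrations they were given.
Demonstrate applicability to real-world tasks such as robot table tennis.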

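Proposed Method

Use AIRL to obtain an initial reward and policy from the suboptimal demonstrations.
Inject noise into the learned policy to generate synthetic noisy trajectories for analysis.
Characterize the relationship between the injected noise and the learned policy's performance with a sigmoid (low-pass) curve.
Fit a four-parameter sigmoid to model the noise-performance relationship (Equation 4).
Train the idealized reward function R_theta by regressing on trajectory data, guided by the learned noise-performance curve (Equation 5).
Introduce Noisy-AIRL, which improves robustness by injecting noise into the AIRL generator and using importance sampling in the discriminator loss (Equation 6).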
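
The sigmoid-fitting and reward-regression steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. This review does not reproduce Equations 4 and 5, so the sigmoid form `c / (1 + exp(-k * (eta - x0))) + y0` and the squared-error loss are assumptions. The helper names (`sigmoid4`, `fit_sigmoid4`, `regression_loss`), the grid-search fitting procedure, and the synthetic data are all hypothetical. Noisy-AIRL itself is not shown.

```python
import math


def sigmoid4(eta, c, k, x0, y0):
    """Assumed four-parameter sigmoid: c / (1 + exp(-k * (eta - x0))) + y0."""
    return c / (1.0 + math.exp(-k * (eta - x0))) + y0


def fit_sigmoid4(noise_levels, returns, k_grid, x0_grid):
    """Fit (c, k, x0, y0) by grid search over k and x0.

    For fixed k and x0 the model is linear in c and y0, so those two
    parameters are solved in closed form by ordinary least squares.
    """
    n = len(noise_levels)
    mean_y = sum(returns) / n
    best = None
    for k in k_grid:
        for x0 in x0_grid:
            s = [1.0 / (1.0 + math.exp(-k * (eta - x0))) for eta in noise_levels]
            mean_s = sum(s) / n
            var_s = sum((si - mean_s) ** 2 for si in s)
            if var_s < 1e-12:
                continue  # curve is flat over the data, so c is not identifiable
            cov = sum((si - mean_s) * (yi - mean_y) for si, yi in zip(s, returns))
            c = cov / var_s
            y0 = mean_y - c * mean_s
            sse = sum((c * si + y0 - yi) ** 2 for si, yi in zip(s, returns))
            if best is None or sse < best[0]:
                best = (sse, c, k, x0, y0)
    if best is None:
        raise ValueError("no usable (k, x0) pair in the grid")
    return best[1:]


def regression_loss(predicted_returns, noise_levels, params):
    """Mean squared error between predicted trajectory returns and the
    sigmoid targets for the noise level that generated each trajectory."""
    targets = [sigmoid4(eta, *params) for eta in noise_levels]
    total = sum((p - t) ** 2 for p, t in zip(predicted_returns, targets))
    return total / len(targets)


# Synthetic stand-in for "return of the noisy policy at each noise level".
noise_levels = [i * 0.05 for i in range(21)]      # 0.0, 0.05, ..., 1.0
true_params = (-100.0, 10.0, 0.5, 100.0)          # performance falls as noise rises
returns = [sigmoid4(eta, *true_params) for eta in noise_levels]

params = fit_sigmoid4(
    noise_levels,
    returns,
    k_grid=range(1, 21),
    x0_grid=[i * 0.05 for i in range(21)],
)
print(params)
```

In a full pipeline the `predicted_returns` passed to `regression_loss` would come from summing a reward network's outputs along each noisy trajectory, and the loss would be minimized with respect to that network's parameters.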

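Experimental Results

Research Questions

RQ1: How do suboptimal demonstrations bias IRL methods, and can the degradation curve be modeled accurately?
RQ2: Does a sigmoid-based characterization of noise-induced performance loss improve reward regression from suboptimal data?
RQ3: Does self-supervised data generated by injecting noise into the policy improve the accuracy of the reward function and the performance of the resulting policy?
RQ4: How does Noisy-AIRL affect robustness to covariate shift when learning from suboptimal demonstrations?
RQ5: How large are the empirical performance gains when SSRR is applied to simulated and real-world robot tasks?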

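Key Findings

On simulated tasks, the reward functions learned by SSRR reach a correlation of about 0.94–0.97 with the ground-truth reward, compared with roughly 0.75 for prior work.
Noisy-AIRL improves the initial reward estimate and supplies SSRR with higher-quality synthetic data.
Policies trained on SSRR reward functions improve substantially over the suboptimal demonstrations: roughly 163–192% on average in simulation, and about 32% faster returns and about 40% more topspin in the real-world table tennis task.
Combined with Noisy-AIRL, SSRR ranks trajectories more accurately than D-REX on the MuJoCo tasks HalfCheetah, Hopper, and Ant.
The Luce-Shepard-based assumption in D-REX is shown to be a counterproductive inductive bias when learning from suboptimal demonstrations.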
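
To make the two evaluation quantities in these findings concrete, the sketch below computes a correlation and a pairwise ranking accuracy between predicted and ground-truth trajectory returns. It is an illustration only: this review does not say which correlation coefficient the paper reports or how ranking accuracy is defined, so Pearson correlation and the fraction of correctly ordered trajectory pairs are assumptions. The function names and the toy numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def pairwise_ranking_accuracy(predicted, truth):
    """Fraction of trajectory pairs whose predicted order matches the true order.

    Pairs with identical ground-truth return are skipped.
    """
    correct = 0
    total = 0
    for i in range(len(truth)):
        for j in range(i + 1, len(truth)):
            if truth[i] == truth[j]:
                continue
            total += 1
            if (predicted[i] - predicted[j]) * (truth[i] - truth[j]) > 0:
                correct += 1
    return correct / total


# Toy returns for four trajectories (made-up numbers, not results from the paper).
truth = [10.0, 20.0, 30.0, 40.0]
predicted = [1.0, 2.5, 2.0, 4.0]

print(pearson(predicted, truth))
print(pairwise_ranking_accuracy(predicted, truth))
```

Correlation is unaffected by positive rescaling or shifting of the learned reward, which is useful here because a learned reward is only meaningful up to such transformations.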

This review was generated by AI and reviewed by human editors.