QUICK REVIEW

[Paper Review] Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Matej Vecerík, Todd Hester | arXiv (Cornell University) | Jul 27, 2017

Reinforcement Learning in Robotics | References: 20 | Cited by: 509

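One-Sentence Summary

This paper extends DDPG to learn from demonstrations (DDPGfD) so that it can learn from sparse rewards on robotic insertion tasks. It uses a prioritized replay buffer, n-step returns, and multiple learning updates per environment step, and it outperforms standard DDPG both in simulation and on real hardware.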

ABSTRACT

We propose a general and model-free approach for Reinforcement Learning (RL) on real robotics with sparse rewards. We build upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations. Both demonstrations and actual interactions are used to fill a replay buffer and the sampling ratio between demonstrations and transitions is automatically tuned via a prioritized replay mechanism. Typically, carefully engineered shaping rewards are required to enable the agents to efficiently explore on high dimensional control problems such as robotics. They are also required for model-based acceleration methods relying on local solvers such as iLQG (e.g. Guided Policy Search and Normalized Advantage Function). The demonstrations replace the need for carefully engineered rewards, and reduce the exploration problem encountered by classical RL approaches in these domains. Demonstrations are collected by a robot kinesthetically force-controlled by a human demonstrator. Results on four simulated insertion tasks show that DDPG from demonstrations out-performs DDPG, and does not require engineered rewards. Finally, we demonstrate the method on a real robotics task consisting of inserting a clip (flexible object) into a rigid object.

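Motivation and Objectives

Show that demonstrations can replace reward shaping in challenging robotic manipulation tasks with sparse rewards.
Integrate demonstrations into an off-policy reinforcement learning framework to improve data efficiency and learning stability.
Show that prioritized replay, n-step returns, and repeated updates improve learning when demonstrations are used.
Validate the method on four simulated insertion tasks and one real-robot insertion task.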

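Proposed Method

Extend DDPG by loading demonstration transitions into the replay buffer before training begins.
Use prioritized experience replay to sample both demonstration and agent transitions, biasing sampling toward more informative experience.
Combine 1-step and n-step return losses for the critic so that sparse rewards propagate along the trajectory.
Perform multiple gradient updates per environment step to improve data efficiency while maintaining stability.
Apply L2 regularization to the actor and critic networks to improve stability.
Enforce safety constraints in real-robot trials through impedance control, which limits excessive forces.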
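
The replay and return mechanics listed above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the simplified priority formula (squared TD error plus a small constant plus a bonus for demonstration transitions), and every constant (gamma, alpha, eps, eps_demo) are hypothetical choices made for this example.

```python
import random


def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus a discounted bootstrap value.

    rewards: the rewards r_t, ..., r_{t+n-1} observed along a trajectory.
    bootstrap_value: a critic estimate for the state reached after n steps
                     (pass 0.0 if the episode ended there).
    """
    ret = bootstrap_value
    # Fold from the last reward backwards: ret = r + gamma * ret.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def priority(td_error, is_demo, eps=1e-3, eps_demo=1.0):
    """Assumed priority: squared TD error, a floor eps, and a demo bonus."""
    return td_error ** 2 + eps + (eps_demo if is_demo else 0.0)


def sampling_probabilities(priorities, alpha=0.3):
    """Turn priorities into probabilities, P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, alpha=0.3, rng=random):
    """Draw a minibatch of buffer indices in proportion to their priority."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Toy buffer: two demonstration transitions followed by three agent transitions.
is_demo_flags = [True, True, False, False, False]
td_errors = [0.1, 0.5, 0.1, 0.5, 0.0]
priorities = [priority(d, f) for d, f in zip(td_errors, is_demo_flags)]
batch = sample_indices(priorities, batch_size=4)

# A sparse reward arriving on the third step reaches the first state in one update.
sparse_target = n_step_return([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5)
```

With a sparse reward that arrives only at the end of a trajectory, a 1-step target carries information back one state per update, whereas the n-step target above credits the earlier state immediately. The demonstration bonus in the assumed priority keeps demonstration transitions sampled regularly even once their TD error is small.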

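Experimental Results

Research Questions

RQ1: Can demonstrations replace hand-engineered shaping rewards in sparse-reward robotic insertion tasks?
RQ2: Does integrating demonstrations into an off-policy framework with prioritized replay accelerate learning and outperform standard DDPG?
RQ3: How do 1-step and n-step returns propagate sparse rewards in demonstration-augmented reinforcement learning?
RQ4: How does the amount of demonstration data affect learning efficiency and final performance?
RQ5: Are the results consistent between the simulated tasks and the real-robot experiment?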

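Key Findings

DDPGfD outperforms DDPG on all tested tasks, even when DDPG uses carefully tuned shaping rewards.
DDPGfD also learns effectively under sparse rewards, typically matching or exceeding the performance obtained with shaping rewards.
On the clip insertion task, DDPGfD learns 2 to 4 times faster than learning from demonstrations alone, and its training remains stable across a broader range of conditions.
A single demonstration is enough to solve the sparse-reward clip insertion task, and additional demonstrations show diminishing returns beyond 50 to 100.
Real-robot experiments show that DDPGfD learns a robust insertion policy without engineered rewards, outperforming DDPG with shaping rewards.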


This review was generated by AI and reviewed by a human editor.