QUICK REVIEW

[Paper Review] Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski | arXiv (Cornell University) | Jul 5, 2017

Reinforcement Learning in Robotics | References: 44 | Cited by: 352

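One-Sentence Summary

HER replays each episode with substitute goals, enabling sample-efficient learning from sparse binary rewards and thereby improving off-policy reinforcement learning on multi-goal robotics tasks.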

ABSTRACT

Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task.

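Motivation and Objectives

Explain why reward shaping is difficult in robotics and why learning from sparse signals is needed.
Introduce a general policy-learning approach that takes the goal as an additional input.
Show that replaying experience with altered goals significantly improves learning efficiency.
Demonstrate that policies trained in simulation can transfer to a physical robot.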

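Proposed Method

Use universal value function approximators that take both the state and the goal as input.
Replay each episode with the original goal as well as additional goals, such as the final state reached in that episode (or goals chosen by other strategies).
Apply an off-policy RL algorithm (e.g., DQN, DDPG, NAF, SDQN) to a replay buffer augmented with hindsight transitions.
Formulate the reward as sparse and binary, indicating only whether the goal has been achieved, and recompute it for each substituted replay goal.
Provide an algorithm description (Algorithm 1) for integrating HER with off-policy RL.
Analyze how different goal-sampling strategies (e.g., final, future, episode, random) affect learning.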
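
The replay step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Transition`, `sparse_reward`, and `her_relabel` are hypothetical, each state is assumed to double as the goal it achieves, and the reward convention (0 when the goal is reached, -1 otherwise) is assumed for the example. Only the final, future, and episode strategies are shown; the random strategy would need a pool of states from earlier episodes and is left out.

```python
import random
from typing import List, NamedTuple, Optional, Sequence, Tuple

State = Tuple[float, ...]  # assumption: a state is also the goal it achieves
Goal = Tuple[float, ...]
Step = Tuple[State, int, State]  # (state, action, next_state)


class Transition(NamedTuple):
    state: State
    action: int
    reward: float
    next_state: State
    goal: Goal


def sparse_reward(achieved: Goal, goal: Goal, tol: float = 1e-6) -> float:
    """Binary reward (assumed convention): 0 if the goal is reached, else -1."""
    reached = all(abs(a - g) <= tol for a, g in zip(achieved, goal))
    return 0.0 if reached else -1.0


def her_relabel(
    steps: Sequence[Step],
    goal: Goal,
    strategy: str = "future",
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Transition]:
    """Return the transitions of one episode plus hindsight-relabeled copies."""
    rng = rng or random.Random()
    achieved = [next_state for _, _, next_state in steps]
    buffer: List[Transition] = []
    for t, (s, a, s_next) in enumerate(steps):
        # Standard experience replay: store the transition with the original goal.
        buffer.append(Transition(s, a, sparse_reward(s_next, goal), s_next, goal))
        # Pick additional goals from states the agent actually reached.
        if strategy == "final":
            extra = [achieved[-1]]
        elif strategy == "future":
            extra = [rng.choice(achieved[t:]) for _ in range(k)]
        elif strategy == "episode":
            extra = [rng.choice(achieved) for _ in range(k)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # Hindsight replay: same transition, substituted goal, recomputed reward.
        for g in extra:
            buffer.append(Transition(s, a, sparse_reward(s_next, g), s_next, g))
    return buffer
```

Calling `her_relabel(steps, goal, strategy="final")` on an episode that never reaches its original goal still yields at least one transition with reward 0, because the last step is relabeled with the state it actually reached. The returned list can then be appended to the replay buffer of any off-policy learner whose value function or policy takes the goal as an extra input.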

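Experimental Results

Research Questions

RQ1: Can off-policy reinforcement learning with hindsight replay learn effectively from sparse binary rewards?
RQ2: Does replaying trajectories with substitute goals enable learning in multi-goal manipulation tasks?
RQ3: Which strategies for choosing additional replay goals maximize learning efficiency?
RQ4: Can HER achieve transfer from simulation to a physical robot without fine-tuning?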

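Key Findings

DDPG with HER solves the pushing, sliding, and pick-and-place tasks on which standard RL fails.
HER remains effective under sparse rewards and, on the tasks tested, even outperforms reward-shaping alternatives.
Replaying with future, episode, or partial-future goals yields better performance, especially on the sliding task.
After retraining with added observation noise, the policy can be deployed directly on a physical Fetch robot without fine-tuning.
On these tasks, the forms of reward shaping that were tested did not improve performance.
Training on multiple goals speeds up learning even when only a single goal is of interest.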


This review was generated by AI and checked by a human editor.