QUICK REVIEW

[Paper Review] Dealing with Sparse Rewards in Reinforcement Learning

Joshua Hare|arXiv (Cornell University)|Oct 21, 2019

Reinforcement Learning in Robotics | References: 37 | Cited by: 48

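One-Sentence Summary

This master's report surveys methods for reinforcement learning with sparse rewards and proposes a new approach that combines curiosity-driven exploration with unsupervised auxiliary tasks, evaluated in video game environments.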

ABSTRACT

Successfully navigating a complex environment to obtain a desired outcome is a difficult task that until recently was believed to be achievable only by humans. This perception has been broken down over time, especially with the introduction of deep reinforcement learning, which has greatly increased the difficulty of tasks that can be automated. However, traditional reinforcement learning agents require an environment that provides frequent extrinsic rewards, which are not known or accessible for many real-world environments. This project aims to explore and contrast existing reinforcement learning solutions that circumvent the difficulties of an environment that provides sparse rewards. Different reinforcement learning solutions will be implemented over several video game environments of varying difficulty and reward frequency, so as to properly investigate the applicability of these solutions. This project introduces a novel reinforcement learning solution by combining aspects of two existing state-of-the-art sparse-reward solutions: curiosity-driven exploration and unsupervised auxiliary tasks.

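Research Motivation and Objectives

Motivate and study reinforcement learning in environments with sparse extrinsic rewards.
Compare existing sparse-reward reinforcement learning solutions and assess their applicability.
Implement and evaluate sparse-reward RL methods in video game environments of increasing difficulty.
Introduce a new agent that combines curiosity-driven exploration with unsupervised auxiliary tasks.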

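Proposed Method

Review foundational RL concepts (MDPs, value functions, Bellman equations) and dynamic programming.
Discuss model-free RL methods (Monte Carlo, temporal difference, Q-learning, policy gradients) and their limitations under sparse rewards.
Describe state-of-the-art sparse-reward techniques, including curiosity-driven exploration, unsupervised auxiliary tasks, random network distillation, and hindsight experience replay.
Present and analyze implementations of DRL agents (A2C, Sync-DDQN, PPO) and of sparse-reward enhancements (UNREAL-A2C2, RANDAL, RND, ICM).
Evaluate the agents in Classic Control and Atari 2600 environments to compare baseline and sparse-reward methods.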
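
To make the idea behind random network distillation (RND) concrete, here is a minimal sketch of an exploration bonus. It is not the report's implementation: the class name `RNDBonus`, the linear predictor, the feature size, and the learning rate are all assumptions chosen for illustration. A fixed, randomly initialised target network maps observations to features, a predictor is trained to imitate it, and the prediction error serves as an intrinsic reward that shrinks for observations the agent has seen often.

```python
import numpy as np


class RNDBonus:
    """Toy RND-style novelty bonus (hypothetical helper, linear predictor)."""

    def __init__(self, obs_dim, feat_dim=16, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random target network: never trained.
        self.target_w = rng.normal(size=(obs_dim, feat_dim))
        # Predictor: trained to reproduce the target's output.
        self.pred_w = np.zeros((obs_dim, feat_dim))
        self.lr = lr

    def _error(self, obs):
        # Difference between predictor output and target features.
        return obs @ self.pred_w - np.tanh(obs @ self.target_w)

    def bonus(self, obs):
        # Intrinsic reward: mean squared prediction error for this observation.
        return float(np.mean(self._error(obs) ** 2))

    def update(self, obs):
        # One gradient step on 0.5 * sum(error**2) with respect to pred_w.
        self.pred_w -= self.lr * np.outer(obs, self._error(obs))


if __name__ == "__main__":
    rnd = RNDBonus(obs_dim=4)
    familiar = np.array([1.0, 0.0, 0.0, 0.0])
    novel = np.array([0.0, 1.0, 0.0, 0.0])
    for _ in range(200):
        rnd.update(familiar)  # simulate visiting the same state repeatedly
    print("bonus (familiar):", rnd.bonus(familiar))
    print("bonus (novel):   ", rnd.bonus(novel))
```

In an agent, a bonus like this would typically be scaled by a coefficient and added to the (sparse) extrinsic reward before the policy update; the coefficient and the choice of base agent are left open here.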

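Experimental Results

Research Questions

RQ1: How do existing sparse-reward reinforcement learning methods perform across different video game environments?
RQ2: Does combining curiosity-driven exploration with unsupervised auxiliary tasks improve learning under sparse rewards?
RQ3: How does the proposed method compare with baseline methods (such as A2C, DDQN, PPO) in learning efficiency and final performance?
RQ4: What are the practical considerations (hardware, software architecture, encoders) in implementing and scaling sparse-reward RL agents?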

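Key Findings

Demonstrates the applicability of sparse-reward techniques across video game environments of varying difficulty.
Shows that combining curiosity-driven exploration with unsupervised auxiliary tasks can achieve competitive performance under sparse rewards.
Provides an empirical comparison between baselines and sparse-reward-enhanced agents such as UNREAL-A2C2 and RANDAL.
Offers implementation insights for scalable deep reinforcement learning, including encoder networks and hyperparameters.
Leaves room for further improvement and extension of sparse-reward methods.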

This review was generated by AI and reviewed by human editors.