QUICK REVIEW

[Paper Review] The MineRL 2019 Competition on Sample Efficient Reinforcement Learning using Human Priors

William H. Guss, Cayden Codel | arXiv (Cornell University) | Apr 22, 2019

Reinforcement Learning in Robotics | References: 35 | Cited by: 47

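One-Sentence Summary

The paper introduces the MineRL competition and dataset, which promote sample-efficient reinforcement learning by leveraging human demonstrations in the Minecraft environment; the primary ObtainDiamond task and a held-out evaluation assess generalization under strict resource constraints.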

ABSTRACT

Though deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. As state-of-the-art reinforcement learning (RL) systems require an exponentially increasing number of samples, their development is restricted to a continually shrinking segment of the AI community. Likewise, many of these systems cannot be applied to real-world problems, where environment samples are expensive. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we introduce the MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. The primary goal of the competition is to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, we introduce: (1) the Minecraft ObtainDiamond task, a sequential decision making environment requiring long-term planning, hierarchical control, and efficient exploration methods; and (2) the MineRL-v0 dataset, a large-scale collection of over 60 million state-action pairs of human demonstrations that can be resimulated into embodied trajectories with arbitrary modifications to game state and visuals. Participants will compete to develop systems which solve the ObtainDiamond task with a limited number of samples from the environment simulator, Malmo. The competition is structured into two rounds in which competitors are provided several paired versions of the dataset and environment with different game textures. At the end of each round, competitors will submit containerized versions of their learning algorithms and they will then be trained/evaluated from scratch on a hold-out dataset-environment pair for a total of 4-days on a prespecified hardware platform.

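Research Motivation and Goals

Promote the development of sample-efficient reinforcement learning methods that use human demonstrations to reduce the number of environment samples required.
Introduce the Minecraft ObtainDiamond task as a challenging, hierarchically structured environment.
Release the MineRL-v0 dataset, containing over 60 million state-action pairs of human demonstrations, so that embodied agents can learn by imitation.
Provide a two-round competition structure with held-out evaluation to ensure fair benchmarking under a fixed compute budget.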

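Proposed Approach

Define the primary ObtainDiamond task in Minecraft, which requires long-horizon planning and exploration.
Provide MineRL-v0, a large-scale dataset of state-action trajectories with rich annotations and hierarchical labels.
Render demonstrations under different textures and lighting to enable robust evaluation across environments.
Give participants baseline implementations and open-source tooling (Gym interface, data loaders, Docker).
Use AIcrowd for coordination and a fixed compute environment to enforce sample-efficiency evaluation.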
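The idea of enforcing a limit on environment samples can be sketched with a small wrapper that counts every call to `step` and refuses further interaction once the budget is spent. This is only an illustration under stated assumptions: `ToyEnv`, `SampleBudget`, `run_until_budget`, and the budget of 25 samples are all hypothetical names and values, and the toy environment stands in for the real simulator; none of this is the competition's actual evaluation code.

```python
class ToyEnv:
    """Hypothetical stand-in environment; not the MineRL simulator."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done, {}


class SampleBudget:
    """Wraps an environment and refuses steps once the budget is spent."""

    def __init__(self, env, max_samples):
        self.env = env
        self.max_samples = max_samples  # hypothetical budget value
        self.used = 0

    def reset(self):
        return self.env.reset()

    def step(self, action):
        if self.used >= self.max_samples:
            raise RuntimeError("sample budget exhausted")
        self.used += 1
        return self.env.step(action)

    @property
    def remaining(self):
        return self.max_samples - self.used


def run_until_budget(env, policy):
    """Run episodes until the budget runs out; return the total reward."""
    total = 0.0
    while env.remaining > 0:
        obs = env.reset()
        done = False
        while not done and env.remaining > 0:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
    return total


budgeted = SampleBudget(ToyEnv(horizon=10), max_samples=25)
total_reward = run_until_budget(budgeted, policy=lambda obs: 1)
print(budgeted.used, total_reward)
```

Counting samples at the environment boundary, rather than inside the learner, means the limit holds regardless of how a submitted algorithm is written.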

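Experimental Results

Research Questions

RQ1: Can imitation learning and human priors significantly reduce the number of environment samples needed to solve complex, sparse-reward tasks?
RQ2: How can reinforcement learning methods make effective use of large-scale human demonstration datasets in hierarchical, embodied domains such as Minecraft?
RQ3: Under a fixed compute budget, how do different environment textures and visuals affect training efficiency and policy performance?
RQ4: In ObtainDiamond, how do baseline RL methods compare with human performance under strict sample and compute constraints?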

Key Findings

Milestone	Reward
1	32
2	32
3	4
4	64
5	128
6	256
7	1024

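Preliminary results suggest that methods using human data improve sample efficiency across environments.
On the tasks with human demonstrations, humans outperform every reinforcement learning method tested, which highlights the difficulty of long-horizon credit assignment in ObtainDiamond and related tasks.
Treechop, Navigate (Sparse), and other environments reveal a large gap between RL baselines and human performance.
Expert demonstrations allow higher reward with fewer samples, especially in exploration-heavy settings such as Navigate (Sparse).
Imitation-based baselines (behavioral cloning, PreDQN) and pretrained variants show potential gains over RL methods without pretraining.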
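To make the term behavioral cloning concrete, here is a minimal tabular sketch: it counts which action demonstrators took in each state and copies the most frequent one. It is not the baseline evaluated in the paper. The state and action names, the `fit_behavioral_cloning` and `act` helpers, and the toy demonstrations are all hypothetical, and a simple frequency table stands in for whatever learned model a real baseline would train.

```python
from collections import Counter, defaultdict


def fit_behavioral_cloning(demonstrations):
    """Count how often each action was taken in each demonstrated state."""
    counts = defaultdict(Counter)
    for trajectory in demonstrations:
        for state, action in trajectory:
            counts[state][action] += 1
    # The cloned policy picks the most frequent demonstrated action.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy, state, default_action="noop"):
    """Fall back to a default action for states never seen in the demos."""
    return policy.get(state, default_action)


# Hypothetical toy demonstrations: lists of (state, action) pairs.
demos = [
    [("near_tree", "attack"), ("has_log", "craft_planks")],
    [("near_tree", "attack"), ("has_log", "craft_planks")],
    [("near_tree", "move_forward"), ("has_log", "craft_planks")],
]

policy = fit_behavioral_cloning(demos)
print(act(policy, "near_tree"))   # attack
print(act(policy, "underwater"))  # noop (state absent from the demos)
```

The fallback for unseen states shows a known limitation of pure imitation: the policy has no information outside the demonstrated state distribution, which is one reason pretraining is typically combined with further reinforcement learning.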

This review was generated by AI and reviewed by human editors.