QUICK REVIEW

[Paper Review] Control of Memory, Active Perception, and Action in Minecraft

Junhyuk Oh, Valliappa Chockalingam|arXiv (Cornell University)|May 30, 2016

Reinforcement Learning in Robotics | References: 38 | Cited by: 172

One-Sentence Summary

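This paper introduces memory-based deep reinforcement learning architectures, evaluated on Minecraft tasks that test partial observability, delayed rewards, and active perception, and shows that they generalize to unseen maps better than standard DRL baselines.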

ABSTRACT

In this paper, we introduce a new set of reinforcement learning (RL) tasks in Minecraft (a flexible 3D world). We then use these tasks to systematically compare and contrast existing deep reinforcement learning (DRL) architectures with our new memory-based DRL architectures. These tasks are designed to emphasize, in a controllable manner, issues that pose challenges for RL methods including partial observability (due to first-person visual observations), delayed rewards, high-dimensional visual observations, and the need to use active perception in a correct manner so as to perform well in the tasks. While these tasks are conceptually simple to describe, by virtue of having all of these challenges simultaneously they are difficult for current DRL architectures. Additionally, we evaluate the generalization performance of the architectures on environments not used during training. The experimental results show that our new architectures generalize to unseen environments better than existing DRL architectures.

Research Motivation and Goals

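Pose reinforcement learning challenges in a controllable 3D world (Minecraft), emphasizing partial observability, delayed rewards, high-dimensional visual observations, and active perception.
Systematically compare existing DRL architectures with new memory-based DRL architectures on purpose-built cognitive tasks.
Evaluate how well the architectures generalize to unseen or larger map topologies.
Show that memory-based architectures generalize better to unseen maps by using context-dependent memory retrieval.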

Proposed Method

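Encode each observation into a fixed-length feature vector with a CNN.
Store the most recent observations in an external memory as key/value blocks.
Retrieve from memory with soft attention conditioned on a context vector.
Construct the context vector in one of three ways: MQN (feedforward), RMQN (LSTM-based), and FRMQN (LSTM with memory feedback).
Estimate action values with an MLP that combines the context vector and the retrieved memory.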
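
The pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the CNN is replaced by random stand-in feature vectors, all weights are random rather than trained, the layer sizes and every function name (`write_memory`, `read_memory`, `q_values`) are hypothetical, and combining the context and the retrieved memory by addition inside the MLP head is one plausible choice among several. The context vector here is the feedforward (MQN-style) variant; RMQN and FRMQN would instead produce it with an LSTM, with FRMQN also feeding the previously retrieved memory back into that LSTM.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def write_memory(encodings, W_key, W_val):
    """Project each stored encoding into a key block and a value block."""
    keys = [matvec(W_key, e) for e in encodings]
    vals = [matvec(W_val, e) for e in encodings]
    return keys, vals

def read_memory(context, keys, vals):
    """Soft attention: weight each value by softmax(context . key)."""
    scores = [sum(c * k for c, k in zip(context, key)) for key in keys]
    attn = softmax(scores)
    dim = len(vals[0])
    retrieved = [sum(a * v[d] for a, v in zip(attn, vals)) for d in range(dim)]
    return retrieved, attn

def q_values(context, retrieved, W_h, W_q):
    """MLP head (assumed form): ReLU(W_h @ context + retrieved), then linear."""
    hidden = [max(0.0, h + o) for h, o in zip(matvec(W_h, context), retrieved)]
    return matvec(W_q, hidden)

def random_matrix(rows, cols, rng):
    """Untrained stand-in weights, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
ENC_DIM, MEM_DIM, NUM_ACTIONS, MEM_SIZE = 8, 4, 6, 5  # hypothetical sizes

W_key = random_matrix(MEM_DIM, ENC_DIM, rng)
W_val = random_matrix(MEM_DIM, ENC_DIM, rng)
W_ctx = random_matrix(MEM_DIM, ENC_DIM, rng)  # feedforward (MQN-style) context
W_h = random_matrix(MEM_DIM, MEM_DIM, rng)
W_q = random_matrix(NUM_ACTIONS, MEM_DIM, rng)

# Stand-ins for CNN encodings of the last MEM_SIZE frames and the current frame.
past = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(MEM_SIZE)]
current = [rng.gauss(0, 1) for _ in range(ENC_DIM)]

keys, vals = write_memory(past, W_key, W_val)
context = matvec(W_ctx, current)
retrieved, attn = read_memory(context, keys, vals)
q = q_values(context, retrieved, W_h, W_q)
greedy_action = max(range(NUM_ACTIONS), key=lambda a: q[a])
```

The attention weights in `attn` sum to 1 over the stored frames, so inspecting them shows which past observation the agent is drawing on for the current decision.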

Experimental Results

Research Questions

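RQ1: On Minecraft tasks, do memory-augmented DRL architectures handle partial observability, active perception, and memory-based reasoning better than conventional DQN/DRQN?
RQ2: Do context-dependent memory retrieval and memory feedback improve generalization to unseen or larger maps?
RQ3: How do the proposed architectures perform on tasks that require remembering indicators, patterns, and sequential goals?
RQ4: Do memory-based models extrapolate to larger or different map topologies better than standard baselines?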

Main Findings

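The memory-based architectures (MQN, RMQN, FRMQN) generally outperform DQN and DRQN on the cognitive Minecraft tasks.
FRMQN generalizes best to unseen maps across tasks, especially on indicator-dependent pattern-matching tasks and tasks with sequential goals.
Memory retrieval is used selectively and contextually; for example, FRMQN retrieves indicator information only when it is relevant to the decision.
RMQN and FRMQN generalize to unseen maps better than DRQN, which struggles with long-term dependencies under partial observability.
Across tasks, the gap between the memory-augmented models and the baselines widens as partial observability increases (for example, as the distance between the indicator and the goal grows).
Qualitative analysis shows that memory attention focuses on the relevant observations at decision points, supporting the learned active-perception strategies.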


This review was generated by AI and reviewed by a human editor.