QUICK REVIEW

[Paper Review] Learning to Navigate in Complex Environments

Piotr Mirowski, Razvan Pascanu | arXiv (Cornell University) | Nov 11, 2016

Reinforcement Learning in Robotics | Cited by 366

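One-Sentence Summary

This paper adds auxiliary tasks (depth prediction and loop closure classification) to an end-to-end reinforcement learning agent that navigates 3D mazes, improving data efficiency and performance and approaching human-level performance even when the goal location changes frequently.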

ABSTRACT

Learning to navigate in complex environments with dynamic elements is an important milestone in developing AI agents. In this work we formulate the navigation question as a reinforcement learning problem and show that data efficiency and task performance can be dramatically improved by relying on additional auxiliary tasks leveraging multimodal sensory inputs. In particular we consider jointly learning the goal-driven reinforcement learning problem with auxiliary depth prediction and loop closure classification tasks. This approach can learn to navigate from raw sensory input in complicated 3D mazes, approaching human-level performance even under conditions where the goal location changes frequently. We provide detailed analysis of the agent behaviour, its ability to localise, and its network activity dynamics, showing that the agent implicitly learns key navigation abilities.

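Motivation and Goals

Frame navigation as a reinforcement learning problem that needs no explicit SLAM/MSM mapping.
Improve data efficiency and performance by adding auxiliary tasks that exploit multimodal inputs.
Show that auxiliary depth prediction and loop closure classification help the agent navigate dynamic mazes.
Analyze how the auxiliary tasks affect internal representations and the ability to localize.
Offer insight into how memory and representation learning emerge in navigation tasks.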

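Proposed Method

Uses an actor-critic (A3C) method with a convolutional encoder followed by an LSTM-based memory.
Adds auxiliary depth prediction, which reconstructs a low-resolution depth map from RGB input.
Adds loop closure prediction, which uses integrated 2D velocity information to detect revisited locations.
Considers two depth variants: predicting depth from the convolutional features (D1) or from the top LSTM layer (D2); both are compared with the loop closure loss (L).
Trains with a weighted combination of the RL loss, the depth losses (βd1, βd2), and the loop closure loss (βl).
Evaluates on five 3D maze environments, with both static and randomly placed goal locations, using Nav A3C architectures that differ in memory and inputs.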
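Two steps in the list above, deriving loop closure labels from integrated velocity and combining the losses with weights, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `dt` step, the `near`/`far` distance thresholds, and the rule that the agent must first move away before a revisit counts are all assumptions made for the example. The summary only states that integrated 2D velocity is used and that the losses are weighted by βd1, βd2, and βl.

```python
import math


def loop_closure_labels(velocities, dt=1.0, near=1.0, far=1.5):
    """Return a 0/1 label per step: 1 when the agent is back near an
    earlier position after having moved away from it.

    velocities: list of (vx, vy) tuples, one per time step.
    near, far:  hypothetical distance thresholds (assumptions).
    """
    x = y = 0.0
    positions = [(x, y)]  # the start position counts as visited
    labels = []
    for vx, vy in velocities:
        # Integrate 2D velocity to get a dead-reckoned position.
        x += vx * dt
        y += vy * dt
        label = 0
        moved_away = False
        # Scan history from newest to oldest. A close match only counts
        # if some position in between was farther than `far`, which
        # rules out trivial matches with the immediately preceding steps.
        for px, py in reversed(positions):
            dist = math.hypot(x - px, y - py)
            if dist > far:
                moved_away = True
            elif dist < near and moved_away:
                label = 1
                break
        positions.append((x, y))
        labels.append(label)
    return labels


def total_loss(rl_loss, d1_loss, d2_loss, loop_loss,
               beta_d1=1.0, beta_d2=1.0, beta_l=1.0):
    """Weighted sum of the RL loss and the auxiliary losses.

    Setting a weight to 0 switches that auxiliary task off, which is one
    way to express configurations such as D2-only or D1+L.
    """
    return (rl_loss
            + beta_d1 * d1_loss
            + beta_d2 * d2_loss
            + beta_l * loop_loss)


# Walk three steps right, then three steps back left.
walk = [(1.0, 0.0)] * 3 + [(-1.0, 0.0)] * 3
labels = loop_closure_labels(walk)
d2_only = total_loss(1.0, 2.0, 3.0, 4.0, beta_d1=0.0, beta_d2=0.5, beta_l=0.0)
```

In this toy walk only the last two steps are labeled as loop closures, because the agent has by then moved more than `far` away from those spots and come back.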

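Experimental Results

Research Questions

RQ1: Do auxiliary tasks improve data efficiency and performance in end-to-end navigation policies?
RQ2: Does depth prediction, used as a self-supervised auxiliary task, help the agent learn geometry and obstacle avoidance for navigation?
RQ3: Does loop closure prediction promote better spatial localization and memory integration in dynamic mazes?
RQ4: Which auxiliary task configuration (D1, D2, L, or a combination) performs best in navigation and localization?
RQ5: How does the memory architecture (stacked LSTM with velocity, action, and reward inputs) affect navigation in complex mazes?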

Main Findings

Maze	Agent	AUC	Score (mean of top 5)	% Human	Goals	Position Accuracy	Latency 1:>1	Score (best agent)
I-Maze	FF A3C*	75.5	98	-	94/100	42.2	9.3s:9.0s	102
I-Maze	LSTM A3C*	112.4	244	-	100/100	87.8	15.3s:3.2s	203
I-Maze	Nav A3C*+ D1L	169.7	266	-	100/100	68.5	10.7s:2.7s	252
I-Maze	Nav A3C+ D2	203.5	268	-	100/100	62.3	8.8s:2.5s	269
I-Maze	Nav A3C+ D1D2L	199.9	258	-	100/100	61.0	9.9s:2.5s	251
Static 1	FF A3C*	41.3	79	83	100/100	64.3	8.8s:8.7s	84
Static 1	LSTM A3C*	44.3	98	103	100/100	88.6	6.1s:5.9s	110
Static 1	Nav A3C+ D2	104.3	119	125	100/100	95.4	5.9s:5.4s	122
Static 1	Nav A3C+ D1D2L	102.3	116	122	100/100	94.5	5.9s:5.4s	123
Static 2	FF A3C*	35.8	81	47	100/100	55.6	24.2s:22.9s	111
Static 2	LSTM A3C*	46.0	153	91	100/100	80.4	15.5s:14.9s	155
Static 2	Nav A3C+ D2	157.6	200	116	100/100	94.0	10.9s:11.0s	202
Static 2	Nav A3C+ D1D2L	156.1	192	112	100/100	92.6	11.1s:12.0s	192
Random Goal 1	FF A3C*	37.5	61	57.5	88/100	51.8	11.0s:9.9s	64
Random Goal 1	LSTM A3C*	46.6	65	61.3	85/100	51.1	11.1s:9.2s	66
Random Goal 1	Nav A3C+ D2	71.1	96	91	100/100	85.5	14.0s:7.1s	91
Random Goal 1	Nav A3C+ D1D2L	64.2	81	81	81/100	83.7	11.5s:7.2s	74.6
Random Goal 2	FF A3C*	50.0	69	40.1	93/100	30.0	27.3s:28.2s	77
Random Goal 2	LSTM A3C*	37.5	57	32.6	74/100	33.4	21.5s:29.7s	51.3
Random Goal 2	Nav A3C+ D1L	62.5	90	52	90/100	51.0	17.9s:18.4s	106
Random Goal 2	Nav A3C+ D2	82.1	103	59	79/100	72.4	15.4s:15.0s	109
Random Goal 2	Nav A3C+ D1D2L	78.5	91	53	74/100	81.5	15.9s:16.0s	102
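To make the size of the gain concrete, the AUC columns for the LSTM A3C baseline and Nav A3C+ D2 can be compared maze by maze. The snippet below is a small worked calculation on values copied from the table above; the `auc_ratios` helper is a name introduced here for illustration only.

```python
# (maze, LSTM A3C AUC, Nav A3C+ D2 AUC), copied from the table above.
AUC_ROWS = [
    ("I-Maze", 112.4, 203.5),
    ("Static 1", 44.3, 104.3),
    ("Static 2", 46.0, 157.6),
    ("Random Goal 1", 46.6, 71.1),
    ("Random Goal 2", 37.5, 82.1),
]


def auc_ratios(rows):
    """Return {maze: Nav A3C+ D2 AUC divided by LSTM A3C AUC}."""
    return {maze: nav / lstm for maze, lstm, nav in rows}


ratios = auc_ratios(AUC_ROWS)
for maze, ratio in ratios.items():
    print(f"{maze}: {ratio:.2f}x")
```

The ratio is largest on Static 2 and smallest on Random Goal 1, and it stays above 1.5 in all five mazes.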

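Auxiliary tasks substantially speed up learning and improve performance in every maze, most of all in the static-goal mazes.
Predicting depth from the policy LSTM (D2) yields strong gains in both navigation performance and localization.
In this setting, the classification formulation of depth prediction converges faster than the regression formulation.
Loop closure prediction complements depth prediction and supports velocity integration and spatial reasoning. In the table above, though, adding D1 and L to D2 does not raise AUC or score in any maze; the clearest gain from the combined loss is position accuracy on Random Goal 2 (81.5 vs. 72.4).
Nav A3C with auxiliary losses matches or exceeds the human reference on the static mazes (125% and 116% of the human score for Nav A3C+ D2) and reaches 91% and 59% on Random Goal 1 and Random Goal 2.
Position decoders trained on the internal representations indicate better localization, which correlates with higher task reward.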
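The difference between the classification and regression formulations of depth prediction can be sketched per pixel as below. This is an illustration under stated assumptions rather than the paper's setup: the uniform binning, the default of 8 bins, the `max_depth` normalization, and all function names are hypothetical, since the summary does not say how depth is quantized.

```python
import math


def quantize_depth(depth, num_bins=8, max_depth=1.0):
    """Map a depth value to an integer class index in [0, num_bins - 1].

    Uniform bins are an assumption made for this sketch.
    """
    clipped = min(max(depth, 0.0), max_depth)
    return min(int(clipped / max_depth * num_bins), num_bins - 1)


def cross_entropy(logits, target_class):
    """Softmax cross-entropy for one pixel (classification variant)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_class]


def squared_error(prediction, depth):
    """Squared error for one pixel (regression variant)."""
    return (prediction - depth) ** 2


true_depth = 0.3
target = quantize_depth(true_depth)            # class index for this pixel
uniform_logits = [0.0] * 8                     # an untrained prediction
class_loss = cross_entropy(uniform_logits, target)
regress_loss = squared_error(0.5, true_depth)
```

With uniform logits the classification loss equals log(8), the loss of a uniform guess over 8 classes, which gives a convenient sanity check when wiring up such a head.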


This review was generated by AI and checked by human editors.